✨Fast-SAM3D: 3Dfy Anything in Images but Faster
📝 Summary:
Fast-SAM3D addresses slow 3D reconstruction by dynamically adapting computation to varying complexity. It uses heterogeneity-aware mechanisms to achieve up to 2.67x faster inference with negligible quality loss, setting a new efficiency standard.
🔹 Publication Date: Published on Feb 5
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.05293
• PDF: https://arxiv.org/pdf/2602.05293
• Github: https://github.com/wlfeng0509/Fast-SAM3D
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#3DReconstruction #ComputerVision #DeepLearning #AI #Efficiency
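As a rough illustration of the idea (not the paper's actual mechanism, which is heterogeneity-aware and learned), one way to adapt computation to input complexity is a simple heuristic that budgets refinement steps from a cheap complexity proxy; `adaptive_steps` and the gradient-based proxy below are hypothetical:

```python
import numpy as np

def adaptive_steps(image: np.ndarray, min_steps: int = 2, max_steps: int = 8) -> int:
    """Hypothetical heuristic: use mean gradient magnitude as a proxy for
    geometric complexity and scale the number of refinement iterations
    between min_steps (flat regions) and max_steps (busy scenes)."""
    gy, gx = np.gradient(image.astype(np.float64))
    complexity = np.hypot(gx, gy).mean() / 255.0  # roughly in [0, 1]
    frac = min(complexity * 4.0, 1.0)             # saturate for very busy scenes
    return int(round(min_steps + frac * (max_steps - min_steps)))
```

A flat image gets the minimum budget, a textured one gets more; the paper's contribution is making this kind of allocation principled inside the 3D pipeline rather than a fixed heuristic.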
✨PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks
📝 Summary:
PlanViz is a new benchmark that evaluates unified multimodal models on image generation and editing for computer-use planning tasks. It covers route-planning, work-diagramming, and web & UI display sub-tasks, using a task-adaptive PlanScore to assess correctness, visual quality, and efficiency.
🔹 Publication Date: Published on Feb 6
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.06663
• PDF: https://arxiv.org/pdf/2602.06663
• Project Page: https://github.com/lijunxian111/PlanViz
• Github: https://github.com/lijunxian111/PlanViz/releases/tag/v1
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MultimodalAI #ImageGeneration #ImageEditing #ComputerVision #Benchmarking
✨Rethinking Global Text Conditioning in Diffusion Transformers
📝 Summary:
The conventional pooled text-conditioning embedding in diffusion transformers offers little benefit on its own. However, when used as training-free guidance for controllable generation, it significantly improves performance across text-to-image, video, and image-editing tasks.
🔹 Publication Date: Published on Feb 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.09268
• PDF: https://arxiv.org/pdf/2602.09268
• Github: https://github.com/quickjkee/modulation-guidance
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#DiffusionModels #GenerativeAI #AIResearch #ComputerVision #MachineLearning
✨Thinking with Drafting: Optical Decompression via Logical Reconstruction
📝 Summary:
Current AI struggles with precise visual reasoning. The authors propose Thinking with Drafting (TwD), a DSL-based approach that decompresses visual tokens into logical structures. This generates verifiable visual proofs, turning visual generation into a logical verifier for robust reasoning.
🔹 Publication Date: Published on Feb 12
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.11731
• PDF: https://arxiv.org/pdf/2602.11731
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #VisualReasoning #ComputerVision #Logic #RobustAI
✨MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning
📝 Summary:
MetaphorStar is an end-to-end visual reinforcement learning framework that addresses AI's difficulty in understanding image metaphors. It introduces a new dataset, RL method, and benchmark. MetaphorStar achieves state-of-the-art performance, outperforming many MLLMs and improving general visual reasoning.
🔹 Publication Date: Published on Feb 11
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.10575
• PDF: https://arxiv.org/pdf/2602.10575
• Project Page: https://metaphorstar.github.io/
• Github: https://github.com/MING-ZCH/MetaphorStar
🔹 Models citing this paper:
• https://huggingface.co/MING-ZCH/MetaphorStar-32B
• https://huggingface.co/MING-ZCH/MetaphorStar-3B
• https://huggingface.co/MING-ZCH/MetaphorStar-7B
✨ Datasets citing this paper:
• https://huggingface.co/datasets/MING-ZCH/TFQ-Bench-Lite
• https://huggingface.co/datasets/MING-ZCH/TFQ-Bench-Full
• https://huggingface.co/datasets/MING-ZCH/TFQ-Data-Full
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #ReinforcementLearning #ComputerVision #ImageMetaphor #VisualReasoning
✨ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images
📝 Summary:
ExStrucTiny is a new benchmark dataset for structured information extraction from document images. It addresses limitations of existing datasets by covering diverse document types and flexible schemas. This aims to improve generalist models for structured information extraction.
🔹 Publication Date: Published on Feb 12
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.12203
• PDF: https://arxiv.org/pdf/2602.12203
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#InformationExtraction #DocumentAI #MachineLearning #Dataset #ComputerVision
✨Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions
📝 Summary:
Researchers created ASID-1M, a dataset of structured, quality-verified audiovisual instructions, and ASID-Captioner, a model trained on it. This improves fine-grained caption quality, reduces hallucinations, and achieves SOTA results.
🔹 Publication Date: Published on Feb 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.13013
• PDF: https://arxiv.org/pdf/2602.13013
• Github: https://github.com/ASID-Caption/ASID-Caption
🔹 Models citing this paper:
• https://huggingface.co/AudioVisual-Caption/ASID-Captioner-3B
• https://huggingface.co/AudioVisual-Caption/ASID-Captioner-7B
✨ Datasets citing this paper:
• https://huggingface.co/datasets/AudioVisual-Caption/ASID-1M
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MLLM #VideoAI #DeepLearning #ComputerVision #NLP
✨Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
📝 Summary:
MLLMs struggle with fine-grained perception due to the latency of iterative zooming. Region-to-Image Distillation internalizes zooming into a single forward pass by training a model on region-grounded data, significantly improving fine-grained perception without tool calls.
🔹 Publication Date: Published on Feb 12
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.11858
• PDF: https://arxiv.org/pdf/2602.11858
• Github: https://github.com/inclusionAI/Zooming-without-Zooming
🔹 Models citing this paper:
• https://huggingface.co/inclusionAI/ZwZ-8B
• https://huggingface.co/inclusionAI/ZwZ-4B
• https://huggingface.co/inclusionAI/ZwZ-7B
✨ Datasets citing this paper:
• https://huggingface.co/datasets/inclusionAI/ZwZ-RL-VQA
• https://huggingface.co/datasets/inclusionAI/ZoomBench
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MultimodalAI #ComputerVision #FineGrainedPerception #DeepLearning #ModelDistillation
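The distillation idea above can be sketched with a standard soft-label objective: a teacher that answered from a zoomed-in region crop supervises a student that only sees the full image. This is a minimal sketch of generic knowledge distillation, not the paper's exact loss; the function name and temperature choice are illustrative:

```python
import numpy as np

def distill_region_to_image(student_logits: np.ndarray,
                            teacher_logits: np.ndarray,
                            temperature: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened logits. The teacher's
    logits come from a zoomed-in region crop, the student's from the full
    image; minimizing this teaches the student to 'zoom' implicitly."""
    def softmax(x: np.ndarray) -> np.ndarray:
        z = (x - x.max()) / temperature  # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()
    p_t, p_s = softmax(teacher_logits), softmax(student_logits)
    return float((p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum())
```

The loss is zero when the student already matches the region-informed teacher, so training pressure concentrates on examples where zooming actually changed the answer.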
✨OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence
📝 Summary:
OneVision-Encoder improves visual understanding by aligning architectures with video-compression principles. It uses codec-aligned sparsity to focus computation on high-entropy regions, significantly boosting efficiency and accuracy, and it outperforms strong vision backbones across various benchmarks.
🔹 Publication Date: Published on Feb 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.08683
• PDF: https://arxiv.org/pdf/2602.08683
• Project Page: https://www.lmms-lab.com/onevision-encoder/index.html
• Github: https://github.com/EvolvingLMMs-Lab/OneVision-Encoder/blob/main/docs/data_card.md
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MultimodalAI #ComputerVision #DeepLearning #Sparsity #AIResearch
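To make "focus on high-entropy regions" concrete, here is a minimal sketch of entropy-scored patch selection, assuming flattened pixel patches as input; the scoring and the 32-bin histogram are illustrative choices, not the paper's actual codec-aligned mechanism:

```python
import numpy as np

def select_high_entropy_patches(patches: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """patches: (N, P) flattened pixel patches with values in [0, 255].
    Scores each patch by the Shannon entropy of its intensity histogram
    and returns sorted indices of the top keep_ratio fraction."""
    k = max(1, int(round(len(patches) * keep_ratio)))
    scores = []
    for p in patches:
        hist, _ = np.histogram(p, bins=32, range=(0, 256))
        prob = hist / max(hist.sum(), 1)
        prob = prob[prob > 0]                       # drop empty bins
        scores.append(-(prob * np.log2(prob)).sum())  # Shannon entropy in bits
    order = np.argsort(scores)[::-1]                # highest entropy first
    return np.sort(order[:k])
```

Flat, predictable patches score near zero entropy and get dropped, which is the same intuition video codecs use when they spend bits on detailed regions.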
✨Learning Image-based Tree Crown Segmentation from Enhanced Lidar-based Pseudo-labels
📝 Summary:
This study trains deep learning models to segment individual tree crowns from aerial imagery. It uses enhanced pseudo-labels derived from ALS data, improved by SAM 2, to eliminate manual annotation. This method produces superior, domain-specific segmentation models.
🔹 Publication Date: Published on Feb 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.13022
• PDF: https://arxiv.org/pdf/2602.13022
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#DeepLearning #ImageSegmentation #RemoteSensing #Forestry #ComputerVision
✨SemanticMoments: Training-Free Motion Similarity via Third Moment Features
📝 Summary:
Existing video models struggle with semantic motion because their representations are often biased toward appearance. SemanticMoments addresses this with a training-free method that applies temporal statistics to semantic features, consistently outperforming other approaches on motion-centric video understanding.
🔹 Publication Date: Published on Feb 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.09146
• PDF: https://arxiv.org/pdf/2602.09146
• Project Page: https://x.com/HubermanSaar/status/2023485404280672498?s=20
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#SemanticMoments #VideoUnderstanding #ComputerVision #MachineLearning #MotionAnalysis
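A minimal sketch of the third-moment idea: compute the standardized third temporal moment (skewness) of per-frame semantic features and compare clips by cosine similarity of these descriptors. The descriptor and similarity functions below are assumptions about the general recipe, with features from any frozen encoder:

```python
import numpy as np

def third_moment_descriptor(feats: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """feats: (T, D) per-frame semantic features from a frozen encoder.
    Returns a (D,) standardized third temporal moment (skewness) per channel,
    which is invariant to the per-channel mean and scale of appearance."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0) + eps
    z = (feats - mu) / sigma
    return (z ** 3).mean(axis=0)

def motion_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between the motion descriptors of two clips."""
    da, db = third_moment_descriptor(a), third_moment_descriptor(b)
    return float(da @ db / (np.linalg.norm(da) * np.linalg.norm(db) + 1e-8))
```

Because standardization removes the first two moments, static appearance cancels out and only the shape of the temporal trajectory survives, which is why a third-moment statistic can isolate motion without any training.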
✨DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories
📝 Summary:
DeepImageSearch introduces an agentic image-retrieval paradigm that enables multi-step reasoning over visual histories, moving beyond isolated semantic matching. It uses contextual cues for autonomous exploration. The DISBench benchmark shows that current models struggle with these tasks.
🔹 Publication Date: Published on Feb 11
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.10809
• PDF: https://arxiv.org/pdf/2602.10809
• Github: https://github.com/RUC-NLPIR/DeepImageSearch
✨ Spaces citing this paper:
• https://huggingface.co/spaces/RUC-NLPIR/DISBench-Leaderboard
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#ImageRetrieval #AgenticAI #MultimodalAI #ComputerVision #AIResearch
✨StereoAdapter-2: Globally Structure-Consistent Underwater Stereo Depth Estimation
📝 Summary:
StereoAdapter-2 improves underwater stereo depth estimation by replacing ConvGRU with a ConvSS2D operator for efficient, long-range disparity propagation. It also introduces UW-StereoDepth-80K, a new large-scale synthetic dataset, and achieves state-of-the-art zero-shot performance.
🔹 Publication Date: Published on Feb 18
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.16915
• PDF: https://arxiv.org/pdf/2602.16915
• Project Page: https://aigeeksgroup.github.io/StereoAdapter-2
• Github: https://aigeeksgroup.github.io/StereoAdapter-2
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#UnderwaterAI #ComputerVision #DeepLearning #StereoVision #Dataset