ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
Fast-SAM3D: 3Dfy Anything in Images but Faster

📝 Summary:
Fast-SAM3D addresses slow 3D reconstruction by dynamically adapting computation to varying complexity. It uses heterogeneity-aware mechanisms to achieve up to 2.67x faster inference with negligible quality loss, setting a new efficiency standard.

🔹 Publication Date: Published on Feb 5

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.05293
• PDF: https://arxiv.org/pdf/2602.05293
• Github: https://github.com/wlfeng0509/Fast-SAM3D

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#3DReconstruction #ComputerVision #DeepLearning #AI #Efficiency
PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

📝 Summary:
PlanViz is a new benchmark evaluating unified multimodal models for image generation and editing in computer-use planning tasks. It features route planning, work diagramming, and web&UI displaying sub-tasks, using a task-adaptive PlanScore to assess correctness, visual quality, and efficiency.
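As a minimal sketch of how a task-adaptive score over correctness, visual quality, and efficiency could be aggregated (the function name `plan_score` and the default weights are illustrative assumptions, not the paper's actual PlanScore definition):

```python
def plan_score(correctness, visual_quality, efficiency, weights=(0.5, 0.3, 0.2)):
    # Hypothetical aggregation: a weighted sum of the three criteria.
    # "Task-adaptive" would mean choosing different weights per sub-task
    # (route planning vs. work diagramming vs. web&UI displaying).
    wc, wv, we = weights
    return wc * correctness + wv * visual_quality + we * efficiency
```

Different sub-tasks would supply different weight tuples, e.g. efficiency-heavy weights for UI tasks.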

🔹 Publication Date: Published on Feb 6

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.06663
• PDF: https://arxiv.org/pdf/2602.06663
• Project Page: https://github.com/lijunxian111/PlanViz
• Github: https://github.com/lijunxian111/PlanViz/releases/tag/v1

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#MultimodalAI #ImageGeneration #ImageEditing #ComputerVision #Benchmarking
Rethinking Global Text Conditioning in Diffusion Transformers

📝 Summary:
The pooled text embedding conventionally used for global conditioning in diffusion transformers offers little benefit on its own. However, when repurposed as training-free guidance for controllable generation, it significantly improves performance across text-to-image, video, and image-editing tasks.
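A rough sketch of what training-free guidance from a pooled embedding could look like: run the denoiser with and without the pooled embedding active, then extrapolate between the two predictions, classifier-free-guidance style. The function name, the extrapolation form, and `scale` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def modulation_guidance(eps_base, eps_pooled, scale=2.0):
    # eps_base: denoiser prediction with the pooled embedding disabled.
    # eps_pooled: prediction with the pooled embedding enabled.
    # Extrapolate away from the base prediction; scale=1.0 recovers
    # the pooled prediction, scale>1.0 amplifies its effect.
    return eps_base + scale * (eps_pooled - eps_base)
```

Because both predictions come from the same frozen model, no retraining is needed, which is what makes such guidance "training-free".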

🔹 Publication Date: Published on Feb 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.09268
• PDF: https://arxiv.org/pdf/2602.09268
• Github: https://github.com/quickjkee/modulation-guidance

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#DiffusionModels #GenerativeAI #AIResearch #ComputerVision #MachineLearning
Thinking with Drafting: Optical Decompression via Logical Reconstruction

📝 Summary:
Current AI struggles with precise visual reasoning. The authors propose Thinking with Drafting (TwD), a DSL-based approach that decompresses visual tokens into logical structures. This generates verifiable visual proofs, turning visual generation into a logical verifier for robust reasoning.

🔹 Publication Date: Published on Feb 12

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.11731
• PDF: https://arxiv.org/pdf/2602.11731

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#AI #VisualReasoning #ComputerVision #Logic #RobustAI
MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

📝 Summary:
MetaphorStar is an end-to-end visual reinforcement learning framework that addresses AI's difficulty in understanding image metaphors. It introduces a new dataset, RL method, and benchmark. MetaphorStar achieves state-of-the-art performance, outperforming many MLLMs and improving general visual reasoning.

🔹 Publication Date: Published on Feb 11

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.10575
• PDF: https://arxiv.org/pdf/2602.10575
• Project Page: https://metaphorstar.github.io/
• Github: https://github.com/MING-ZCH/MetaphorStar

🔹 Models citing this paper:
https://huggingface.co/MING-ZCH/MetaphorStar-32B
https://huggingface.co/MING-ZCH/MetaphorStar-3B
https://huggingface.co/MING-ZCH/MetaphorStar-7B

Datasets citing this paper:
https://huggingface.co/datasets/MING-ZCH/TFQ-Bench-Lite
https://huggingface.co/datasets/MING-ZCH/TFQ-Bench-Full
https://huggingface.co/datasets/MING-ZCH/TFQ-Data-Full

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#AI #ReinforcementLearning #ComputerVision #ImageMetaphor #VisualReasoning
ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images

📝 Summary:
ExStrucTiny is a new benchmark dataset for structured information extraction from document images. It addresses limitations of existing datasets by covering diverse document types and flexible schemas. This aims to improve generalist models for structured information extraction.

🔹 Publication Date: Published on Feb 12

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.12203
• PDF: https://arxiv.org/pdf/2602.12203

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#InformationExtraction #DocumentAI #MachineLearning #Dataset #ComputerVision
Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

📝 Summary:
Researchers created ASID-1M, a dataset of structured, quality-verified audiovisual instructions, and ASID-Captioner, a model trained on it. This improves fine-grained caption quality, reduces hallucinations, and achieves SOTA results.

🔹 Publication Date: Published on Feb 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.13013
• PDF: https://arxiv.org/pdf/2602.13013
• Github: https://github.com/ASID-Caption/ASID-Caption

🔹 Models citing this paper:
https://huggingface.co/AudioVisual-Caption/ASID-Captioner-3B
https://huggingface.co/AudioVisual-Caption/ASID-Captioner-7B

Datasets citing this paper:
https://huggingface.co/datasets/AudioVisual-Caption/ASID-1M

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#MLLM #VideoAI #DeepLearning #ComputerVision #NLP
Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

📝 Summary:
MLLMs struggle with fine-grained perception due to the latency of iterative zooming. Region-to-Image Distillation internalizes zooming into a single forward pass by training a model on region-grounded data. This significantly improves fine-grained perception without tool calls, achieving leading performance.
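A minimal sketch of the distillation idea: match the single-pass student's output distribution to that of a teacher that was allowed to crop and zoom into the grounded region. The function name, the temperature `T`, and the KL objective are standard distillation assumptions, not necessarily the paper's exact loss.

```python
import numpy as np

def softmax(x, T=1.0):
    # Temperature-scaled, numerically stable softmax over the last axis.
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / T)
    return z / z.sum(axis=-1, keepdims=True)

def region_distillation_loss(student_logits, teacher_logits, T=2.0):
    # teacher_logits: from a model that actually zooms into the region.
    # student_logits: from a single forward pass on the full image.
    # KL(teacher || student), scaled by T^2 as in standard distillation.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return float(np.mean(kl) * T * T)
```

At inference the teacher (and its zooming tool calls) disappears entirely; only the distilled student runs.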

🔹 Publication Date: Published on Feb 12

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.11858
• PDF: https://arxiv.org/pdf/2602.11858
• Github: https://github.com/inclusionAI/Zooming-without-Zooming

🔹 Models citing this paper:
https://huggingface.co/inclusionAI/ZwZ-8B
https://huggingface.co/inclusionAI/ZwZ-4B
https://huggingface.co/inclusionAI/ZwZ-7B

Datasets citing this paper:
https://huggingface.co/datasets/inclusionAI/ZwZ-RL-VQA
https://huggingface.co/datasets/inclusionAI/ZoomBench

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#MultimodalAI #ComputerVision #FineGrainedPerception #DeepLearning #ModelDistillation
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

📝 Summary:
OneVision-Encoder improves visual understanding by aligning architectures with video compression principles. It uses codec-aligned sparsity to focus on high-entropy regions, significantly boosting efficiency and accuracy. This method outperforms strong vision backbones across various benchmarks.
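A sketch of the codec-style intuition: video codecs spend bits on inter-frame residuals, so one cheap proxy for "high-entropy" patches is residual energy between consecutive frames. The selection rule below (function name, energy metric, `keep_ratio`) is an illustrative assumption, not the paper's actual mechanism.

```python
import numpy as np

def select_high_entropy_patches(prev_frame, frame, patch=16, keep_ratio=0.25):
    # Rank non-overlapping patches by inter-frame residual energy
    # (squared difference), keep only the top fraction, and return
    # the flat indices of the kept patches (row-major order).
    h, w = frame.shape
    ph, pw = h // patch, w // patch
    resid = (frame - prev_frame) ** 2
    energy = resid[:ph * patch, :pw * patch].reshape(ph, patch, pw, patch).sum(axis=(1, 3))
    flat = energy.ravel()
    k = max(1, int(keep_ratio * flat.size))
    return np.argsort(flat)[-k:]
```

Only the kept patches would then be tokenized and fed to the encoder, which is where the efficiency gain comes from.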

🔹 Publication Date: Published on Feb 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.08683
• PDF: https://arxiv.org/pdf/2602.08683
• Project Page: https://www.lmms-lab.com/onevision-encoder/index.html
• Github: https://github.com/EvolvingLMMs-Lab/OneVision-Encoder/blob/main/docs/data_card.md

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#MultimodalAI #ComputerVision #DeepLearning #Sparsity #AIResearch
Learning Image-based Tree Crown Segmentation from Enhanced Lidar-based Pseudo-labels

📝 Summary:
This study trains deep learning models to segment individual tree crowns from aerial imagery. It uses enhanced pseudo-labels derived from ALS data, improved by SAM 2, to eliminate manual annotation. This method produces superior, domain-specific segmentation models.

🔹 Publication Date: Published on Feb 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.13022
• PDF: https://arxiv.org/pdf/2602.13022

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#DeepLearning #ImageSegmentation #RemoteSensing #Forestry #ComputerVision
SemanticMoments: Training-Free Motion Similarity via Third Moment Features

📝 Summary:
Existing video models struggle with semantic motion because their representations are often biased toward appearance. SemanticMoments addresses this with a training-free method that computes temporal statistics on semantic features, consistently outperforming other approaches for motion-centric video understanding.
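One way to read "third moment features" concretely: compute the standardized third central moment (skewness) of each semantic feature dimension over time, which captures how features evolve rather than what they look like. The descriptor and similarity below are a training-free sketch under that assumption, not the paper's verified pipeline.

```python
import numpy as np

def third_moment_descriptor(frame_feats):
    # frame_feats: (T, D) array of per-frame semantic features.
    # Standardize each dimension over time, then take the mean cube:
    # the per-dimension temporal skewness, a motion-sensitive statistic.
    mu = frame_feats.mean(axis=0)
    sigma = frame_feats.std(axis=0) + 1e-8
    z = (frame_feats - mu) / sigma
    return (z ** 3).mean(axis=0)

def motion_similarity(feats_a, feats_b):
    # Cosine similarity between the two clips' skewness descriptors.
    da = third_moment_descriptor(feats_a)
    db = third_moment_descriptor(feats_b)
    return float(da @ db / (np.linalg.norm(da) * np.linalg.norm(db) + 1e-8))
```

Because standardization removes per-dimension mean and scale, static appearance offsets cancel out and only the temporal profile contributes.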

🔹 Publication Date: Published on Feb 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.09146
• PDF: https://arxiv.org/pdf/2602.09146
• Project Page: https://x.com/HubermanSaar/status/2023485404280672498?s=20

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#SemanticMoments #VideoUnderstanding #ComputerVision #MachineLearning #MotionAnalysis
DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

📝 Summary:
DeepImageSearch introduces an agentic image retrieval paradigm that enables multi-step reasoning over visual histories, moving beyond isolated semantic matching. It uses contextual cues for autonomous exploration. The DISBench benchmark shows that current models struggle, indicating that agentic reasoning in this setting remains an open challenge.
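A toy sketch of the agentic loop idea: instead of scoring each image against the query once, iteratively re-score candidates in the context of the surviving set and prune, so later steps can exploit contextual cues. All names and the halving schedule here are illustrative assumptions.

```python
def agentic_retrieve(query, history, score, max_steps=3):
    # history: candidate images from the visual history.
    # score(query, img, candidates): context-aware relevance score;
    # it may use the current candidate set, unlike one-shot matching.
    candidates = list(history)
    for _ in range(max_steps):
        candidates.sort(key=lambda img: score(query, img, candidates), reverse=True)
        candidates = candidates[: max(1, len(candidates) // 2)]  # prune the bottom half
        if len(candidates) == 1:
            break
    return candidates[0]
```

One-shot semantic matching is the special case `max_steps=1` with a context-free `score`.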

🔹 Publication Date: Published on Feb 11

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.10809
• PDF: https://arxiv.org/pdf/2602.10809
• Github: https://github.com/RUC-NLPIR/DeepImageSearch

Spaces citing this paper:
https://huggingface.co/spaces/RUC-NLPIR/DISBench-Leaderboard

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ImageRetrieval #AgenticAI #MultimodalAI #ComputerVision #AIResearch
StereoAdapter-2: Globally Structure-Consistent Underwater Stereo Depth Estimation

📝 Summary:
StereoAdapter-2 improves underwater stereo depth estimation by replacing ConvGRU with a ConvSS2D operator for efficient, long-range disparity propagation. It also introduces UW-StereoDepth-80K, a new large-scale synthetic dataset. This approach achieves state-of-the-art zero-shot performance on underwater benchmarks.

🔹 Publication Date: Published on Feb 18

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.16915
• PDF: https://arxiv.org/pdf/2602.16915
• Project Page: https://aigeeksgroup.github.io/StereoAdapter-2

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#UnderwaterAI #ComputerVision #DeepLearning #StereoVision #Dataset