ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation

📝 Summary:
This paper analyzes the gap between SAM2 and SAM3: SAM2 relies on spatial prompts such as points and boxes for geometric segmentation, whereas SAM3 is a concept-driven multimodal model built on a unified vision-language architecture, making it a new class of foundation model for segmentation.

🔹 Publication Date: Published on Dec 4

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.06032
• PDF: https://arxiv.org/pdf/2512.06032
• Github: https://github.com/Applied-AI-Research-Lab/The-SAM2-to-SAM3-Gap-in-the-Segment-Anything-Model-Family
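
💻 Code sketch (illustrative): a minimal Python sketch of the prompting gap the paper describes. Both classes are hypothetical stand-ins, not the official Segment Anything APIs: the SAM2-style model needs one spatial prompt per instance, while the SAM3-style model takes a text concept and returns every matching instance.

```python
# Hypothetical interfaces illustrating the prompting gap; neither class
# mirrors the official Segment Anything APIs.
from dataclasses import dataclass

@dataclass
class Mask:
    label: str
    pixels: list  # placeholder for a binary mask

class SAM2Like:
    def segment(self, image, point_xy):
        # Geometric prompt: one (x, y) click yields one mask, no semantics.
        return Mask(label="unknown", pixels=[point_xy])

class SAM3Like:
    def segment_concept(self, image, concept: str):
        # Concept prompt: a noun phrase yields every matching instance.
        return [Mask(label=concept, pixels=[])]

image = object()  # stand-in for pixel data
one_mask = SAM2Like().segment(image, point_xy=(120, 64))
all_cats = SAM3Like().segment_concept(image, concept="cat")
print(one_mask.label, len(all_cats))
```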

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ImageSegmentation #FoundationModels #ComputerVision #MultimodalAI #AIResearch
Thinking with Images via Self-Calling Agent

📝 Summary:
sCoT is a novel visual reasoning paradigm that reformulates interleaved multimodal CoT as a language-only CoT with self-calling subagents. It improves reasoning performance and efficiency by avoiding explicit multimodal interleaving and using group-relative policy optimization.

🔹 Publication Date: Published on Dec 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.08511
• PDF: https://arxiv.org/pdf/2512.08511
• Github: https://github.com/YWenxi/think-with-images-through-self-calling
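
💻 Code sketch (illustrative): a toy version of the self-calling pattern as described in the summary. The main agent reasons in text only and delegates visual queries to a subagent instead of interleaving image tokens into its own chain; `main_agent` and `vision_subagent` are illustrative names, not the paper's code.

```python
# Illustrative self-calling CoT loop; the subagent stands in for a
# vision-capable copy of the same model invoked as a tool.
def vision_subagent(image, query: str) -> str:
    # In the real system this would be a model call; stubbed here.
    return f"observation for '{query}'"

def main_agent(image, question: str) -> str:
    chain = [f"Question: {question}"]
    # Language-only reasoning decides *when* to look, not *what* it sees.
    for sub_query in ("what objects are present?", "where is the answer region?"):
        chain.append(vision_subagent(image, sub_query))
    chain.append("Final answer derived from the observations above.")
    return "\n".join(chain)

print(main_agent(image=None, question="What is on the table?"))
```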

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#VisualReasoning #MultimodalAI #LLMs #AIagents #AIResearch
DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry

📝 Summary:
DentalGPT is a specialized dental multimodal LLM. It improves fine-grained visual understanding and reasoning using a large dataset and reinforcement learning. DentalGPT achieves superior performance in dental disease classification and VQA.

🔹 Publication Date: Published on Dec 12

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.11558
• PDF: https://arxiv.org/pdf/2512.11558
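
💻 Code sketch (illustrative): the summary credits reinforcement learning for the reasoning gains. A common recipe for this (assumed here, not confirmed as DentalGPT's exact setup) is a verifiable reward on classification answers:

```python
# Hedged sketch of an accuracy-based reward for RL fine-tuning on
# dental disease classification; the reward shape is an assumption.
def classification_reward(model_answer: str, gold_label: str) -> float:
    # 1.0 for an exact label match, small credit for a non-empty answer.
    if model_answer.strip().lower() == gold_label.strip().lower():
        return 1.0
    return 0.1 if model_answer.strip() else 0.0

rollouts = ["caries", "gingivitis", ""]
rewards = [classification_reward(a, "caries") for a in rollouts]
print(rewards)  # [1.0, 0.1, 0.0] -> fed to the policy-gradient update
```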

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#DentalGPT #DentistryAI #LLM #MultimodalAI #HealthcareTech
Agent S: An Open Agentic Framework that Uses Computers Like a Human

📝 Summary:
Agent S is an open agentic framework enabling autonomous GUI interaction to automate complex tasks. It employs experience-augmented hierarchical planning and an Agent-Computer Interface with MLLMs for enhanced reasoning. Agent S achieves state-of-the-art performance on OSWorld and demonstrates broad generalizability across operating systems.

🔹 Publication Date: Published on Oct 10, 2024

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2410.08164
• PDF: https://arxiv.org/pdf/2410.08164
• Hugging Face Collection: https://huggingface.co/collections/ranpox/awesome-computer-use-agents
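
💻 Code sketch (illustrative): a toy rendering of experience-augmented hierarchical planning, with a manager decomposing the task and a worker acting through an Agent-Computer Interface. All classes are stand-ins for MLLM-backed components, not the Agent S code.

```python
# Toy hierarchical planning loop; each class stands in for an
# MLLM-backed component of the real framework.
class ExperienceStore:
    def retrieve(self, task: str) -> list[str]:
        return [f"past trace relevant to: {task}"]

class Manager:
    def plan(self, task: str, experience: list[str]) -> list[str]:
        # The real system would query an MLLM with task + retrieved traces.
        return ["open settings", "search for display", "set resolution"]

class Worker:
    def act(self, subtask: str) -> str:
        # The ACI would translate this into concrete click/type events.
        return f"executed: {subtask}"

task = "change the screen resolution"
store, manager, worker = ExperienceStore(), Manager(), Worker()
for subtask in manager.plan(task, store.retrieve(task)):
    print(worker.act(subtask))
```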

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#AgenticAI #MultimodalAI #HumanComputerInteraction #Automation #AIResearch
MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

📝 Summary:
MeViS is a multi-modal dataset for referring motion expression video segmentation, addressing the need to segment and track objects based on their motion descriptions. It provides text and audio annotations for complex videos, enabling research into motion-guided video understanding.

🔹 Publication Date: Published on Dec 11

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.10945
• PDF: https://arxiv.org/pdf/2512.10945
• Project Page: https://henghuiding.com/MeViS/
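
💻 Code sketch (illustrative): a hypothetical shape for a referring-motion annotation record; the actual MeViS schema may differ. It only illustrates the task's inputs (video plus a motion expression in text or audio) and outputs (per-frame masks).

```python
# Hypothetical annotation record; field names are assumptions, not the
# dataset's real schema.
record = {
    "video": "videos/0001.mp4",
    "expression": "the bird that takes off from the branch",
    "modalities": {"text": True, "audio": "audio/0001_expr.wav"},
    "target_object_ids": [3],          # objects to segment and track
    "frame_masks": {"0": [], "1": []}, # per-frame binary masks (elided)
}

# A model for this task maps (video, expression) -> per-frame masks.
def referring_segmenter(video_path: str, expression: str) -> dict:
    return {}  # stub: per-frame masks for the referred object

preds = referring_segmenter(record["video"], record["expression"])
print(type(preds))
```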

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#VideoSegmentation #MultiModalAI #ComputerVision #Dataset #MotionUnderstanding
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

📝 Summary:
MMRB2 is a new benchmark for multimodal reward models, evaluating them on interleaved image and text tasks using 4,000 expert-annotated preferences. It shows top models like Gemini 3 Pro achieve 75-80% accuracy, still below human performance, highlighting areas for improvement in these models.

🔹 Publication Date: Published on Dec 18

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16899
• PDF: https://arxiv.org/pdf/2512.16899
• Github: https://github.com/facebookresearch/MMRB2/tree/main
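
💻 Code sketch (illustrative): how pairwise preference accuracy on a benchmark like this is typically computed. The reward model scores both responses and must rank the expert-preferred one higher; `score` is a stub, not an MMRB2 API.

```python
# Pairwise preference accuracy: the reward model is correct when it
# scores the expert-chosen response above the rejected one.
def score(prompt: str, response: str) -> float:
    return float(len(response))  # stub reward model

pairs = [
    {"prompt": "describe the image", "chosen": "a detailed caption",
     "rejected": "caption"},
]
correct = sum(
    score(p["prompt"], p["chosen"]) > score(p["prompt"], p["rejected"])
    for p in pairs
)
print(f"accuracy: {correct / len(pairs):.2%}")
```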

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#MultimodalAI #RewardModels #AIbenchmark #MachineLearning #AIResearch
A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

📝 Summary:
This paper introduces LongShOTBench, a diagnostic benchmark for long-form multimodal video understanding with open-ended questions and agentic tool use. It also presents LongShOTAgent, an agentic system for video analysis. Results show state-of-the-art models struggle significantly, highlighting substantial room for improvement in long-video reasoning and tool use.

🔹 Publication Date: Published on Dec 18

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16978
• PDF: https://arxiv.org/pdf/2512.16978
• Project Page: https://mbzuai-oryx.github.io/LongShOT/
• Github: https://github.com/mbzuai-oryx/longshot

🔹 Datasets citing this paper:
https://huggingface.co/datasets/MBZUAI/longshot-bench
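
💻 Code sketch (illustrative): if the dataset follows the standard Hub layout, it should load with the `datasets` library; whether a config name or specific split is required depends on the dataset card.

```python
# Loading the benchmark from the Hugging Face Hub; split and column
# names below are whatever the dataset card actually defines.
from datasets import load_dataset

ds = load_dataset("MBZUAI/longshot-bench")   # returns a DatasetDict
split = next(iter(ds.values()))              # first available split
print(split.column_names, len(split))
```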

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#VideoAI #MultimodalAI #AgenticAI #AIbenchmark #AIResearch
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

📝 Summary:
CASA enhances cross-attention for vision-language models by adding local text-to-text interaction. This approach substantially reduces the performance gap with costly token insertion methods on detailed visual tasks. CASA maintains efficiency and scalability for long-context multimodal applications.

🔹 Publication Date: Published on Dec 22

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.19535
• PDF: https://arxiv.org/pdf/2512.19535
• Project Page: https://kyutai.org/casa
• Github: https://github.com/kyutai-labs/casa

🔹 Models citing this paper:
https://huggingface.co/kyutai/CASA-Helium1-VL-2B

🔹 Spaces citing this paper:
https://huggingface.co/spaces/kyutai/casa-samples
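
💻 Code sketch (illustrative): our reading of the summary, not the exact CASA layer: text queries attend over image keys and neighboring text keys in a single attention op, adding the text-to-text interaction that plain cross-attention lacks.

```python
# Interpretation of the idea: text queries attend over image keys and
# local text keys in one attention op. Not the paper's exact layer.
import torch

def casa_like_attention(text_q, text_kv, image_kv):
    # text_q, text_kv: (B, Lt, D); image_kv: (B, Li, D)
    keys = torch.cat([image_kv, text_kv], dim=1)           # (B, Li+Lt, D)
    scores = text_q @ keys.transpose(1, 2) / text_q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ keys            # values = keys for brevity

B, Lt, Li, D = 2, 8, 16, 32
out = casa_like_attention(torch.randn(B, Lt, D),
                          torch.randn(B, Lt, D),
                          torch.randn(B, Li, D))
print(out.shape)  # torch.Size([2, 8, 32])
```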

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#VisionLanguage #MultimodalAI #AttentionMechanisms #EfficientAI #DeepLearning
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

📝 Summary:
T2AV-Compass introduces a unified benchmark for text-to-audio-video generation evaluation. It features 500 diverse prompts and a dual-level framework. Evaluations reveal current T2AV models struggle significantly with realism and cross-modal consistency.

🔹 Publication Date: Published on Dec 24

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.21094
• PDF: https://arxiv.org/pdf/2512.21094
• Project Page: https://nju-link.github.io/T2AV-Compass/
• Github: https://github.com/NJU-LINK/T2AV-Compass/
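
💻 Code sketch (illustrative): a guess at what a dual-level evaluation looks like, combining per-modality quality with a cross-modal consistency term. The scorers and weights below are assumptions, not the benchmark's actual formulas.

```python
# Hedged sketch of a dual-level T2AV evaluation; the stub scorers and
# the 50/50 weighting are assumptions for illustration only.
def evaluate_t2av(video, audio, prompt,
                  video_score=lambda v, p: 0.8,   # stub perceptual scorer
                  audio_score=lambda a, p: 0.7,
                  av_sync_score=lambda v, a: 0.5):
    instance = 0.5 * video_score(video, prompt) + 0.5 * audio_score(audio, prompt)
    cross_modal = av_sync_score(video, audio)
    return {"instance": instance, "cross_modal": cross_modal,
            "overall": 0.5 * (instance + cross_modal)}

print(evaluate_t2av(video=None, audio=None, prompt="rain on a tin roof"))
```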

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#TextToAudioVideo #MultimodalAI #AIEvaluation #GenerativeAI #AIResearch
VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos

📝 Summary:
VideoRAG introduces the first RAG framework for long videos, using a dual-channel architecture to integrate textual knowledge grounding and multi-modal context encoding. This enables unlimited-length video processing and significantly outperforms existing methods.

🔹 Publication Date: Published on Feb 3

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2502.01549
• PDF: https://arxiv.org/pdf/2502.01549
• Github: https://github.com/hkuds/videorag
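
💻 Code sketch (illustrative): a toy dual-channel retrieval step in the spirit of the summary: one channel matches a textual knowledge index (e.g., transcripts), the other matches clip-level descriptors, and both feed the generator prompt. Illustrative only, not the VideoRAG code.

```python
# Toy dual-channel retrieval; real systems would use embedding search
# over transcripts and multimodal clip encodings, not substring match.
def retrieve_text(query: str, text_index: dict) -> list[str]:
    return [v for k, v in text_index.items() if query in k]

def retrieve_clips(query: str, clip_index: dict) -> list[str]:
    return [c for c in clip_index if query in clip_index[c]]

text_index = {"storm scene": "transcript: thunder rolls in..."}
clip_index = {"clip_042.mp4": "visual tags: storm scene, night"}

query = "storm scene"
context = retrieve_text(query, text_index) + retrieve_clips(query, clip_index)
prompt = f"Answer using context: {context}\nQ: what happens in the storm?"
print(prompt)
```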

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#VideoRAG #RAG #LongVideo #AI #MultimodalAI