ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

📝 Summary:
Vision-R1 is a reasoning MLLM enhancing multimodal reasoning via Reinforcement Learning. It leverages a large, AI-generated multimodal CoT dataset and new training strategies to refine reasoning. This achieves high accuracy on multimodal math benchmarks.

🔹 Publication Date: Published on Mar 9, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2503.06749
• PDF: https://arxiv.org/pdf/2503.06749
• Github: https://github.com/Osilly/Vision-R1

Datasets citing this paper:
https://huggingface.co/datasets/Yuting6/ttrl
https://huggingface.co/datasets/LoadingBFX/GeoQA-PLUS-aug-train-Vision-R1-cot-rewrite
https://huggingface.co/datasets/LoadingBFX/GeoQA-train-Vision-R1-cot-rewrite

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#MLLM #ReinforcementLearning #AIReasoning #ChainOfThought #ArtificialIntelligence
LMEB: Long-horizon Memory Embedding Benchmark

📝 Summary:
LMEB is a new benchmark for evaluating the long-horizon memory retrieval abilities of embedding models, which traditional benchmarks do not cover. It assesses complex memory types and reveals that performance in standard passage retrieval does not generalize to these challenging scenarios.

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12572
• PDF: https://arxiv.org/pdf/2603.12572

Datasets citing this paper:
https://huggingface.co/datasets/KaLM-Embedding/LMEB

#EmbeddingModels #MemoryRetrieval #Benchmarks #MachineLearning #AIResearch
VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

📝 Summary:
VQQA is a multi-agent framework that uses VLM critiques as semantic gradients for efficient, black-box video generation optimization via natural language. It resolves visual artifacts, significantly improving video quality for text-to-video and image-to-video tasks.

🔹 Publication Date: Published on Mar 12

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12310
• PDF: https://arxiv.org/pdf/2603.12310
• Project Page: https://yiwen-song.github.io/vqqa/

#VideoGeneration #AIAgents #VisionLanguageModels #GenerativeAI #MachineLearning
daVinci-Env: Open SWE Environment Synthesis at Scale

📝 Summary:
OpenSWE is the largest open framework for training software engineering agents, featuring 45,320 executable Python environments. It achieves state-of-the-art performance on SWE-bench Verified and shows substantial out-of-domain reasoning improvements.

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13023
• PDF: https://arxiv.org/pdf/2603.13023
• Github: https://github.com/GAIR-NLP/OpenSWE

#SoftwareEngineering #AIagents #MachineLearning #OpenSWE #DeepLearning
From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space

📝 Summary:
Multi-View GRPO enhances text-to-image flow model alignment by expanding condition space for richer reward mapping and improved sample relationship exploration.

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12648
• PDF: https://arxiv.org/pdf/2603.12648
• Project Page: https://bujiazi.github.io/mvgrpo.github.io/

#AI #DataScience #MachineLearning #HuggingFace #Research
Spend Less, Reason Better: Budget-Aware Value Tree Search for LLM Agents

📝 Summary:
The Budget-Aware Value Tree (BAVT) optimizes LLM agent reasoning by dynamically balancing exploration and exploitation based on remaining compute. It uses budget-conditioned node selection and a residual value predictor for efficient search, outperforming brute-force methods with 4x fewer resources.
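
As a rough illustration of what budget-conditioned node selection could look like, the sketch below scales a UCB-style exploration bonus by the fraction of compute still available. The scoring rule, the `select_node` name, and the node fields are assumptions made for this example, not the formula from the paper, and the residual value predictor is not modeled.

```python
import math

def select_node(nodes, remaining_budget, total_budget, c=1.0):
    """Pick the node with the best budget-weighted score (hypothetical rule).

    nodes: list of dicts with 'value' (estimated value) and 'visits' (int >= 0).
    The exploration bonus shrinks as the budget runs out, so the search
    explores early and exploits late.
    """
    # Fraction of compute left, clamped to [0, 1].
    frac = max(0.0, min(1.0, remaining_budget / total_budget))
    total_visits = sum(n["visits"] for n in nodes) + 1

    def score(n):
        # UCB-style bonus: larger for rarely visited nodes.
        bonus = math.sqrt(math.log(total_visits + 1) / (n["visits"] + 1))
        return n["value"] + c * frac * bonus

    return max(nodes, key=score)

nodes = [
    {"name": "a", "value": 0.6, "visits": 10},
    {"name": "b", "value": 0.5, "visits": 0},
]
early = select_node(nodes, remaining_budget=100, total_budget=100)
late = select_node(nodes, remaining_budget=0, total_budget=100)
```

With a full budget the unvisited node `b` wins on its exploration bonus; with no budget left the bonus vanishes and the higher-value node `a` is chosen.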

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12634
• PDF: https://arxiv.org/pdf/2603.12634

#LLMAgents #AIResearch #Optimization #EfficientAI #ValueTreeSearch
Visual-ERM: Reward Modeling for Visual Equivalence

📝 Summary:
Visual-ERM is a multimodal generative reward model providing fine-grained visual feedback for vision-to-code tasks. It significantly improves reinforcement learning performance for chart, table, and SVG parsing, demonstrating that fine-grained visual supervision is essential.

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13224
• PDF: https://arxiv.org/pdf/2603.13224
• Github: https://github.com/InternLM/Visual-ERM

#ReinforcementLearning #ComputerVision #GenerativeAI #AI #DataScience
SimRecon: SimReady Compositional Scene Reconstruction from Real Videos

📝 Summary:
SimRecon reconstructs cluttered scenes from real videos using a Perception-Generation-Simulation pipeline. It employs Active Viewpoint Optimization for visual fidelity and a Scene Graph Synthesizer for physical plausibility. This enables superior compositional scene representations for simulation...

🔹 Publication Date: Published on Mar 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.02133
• PDF: https://arxiv.org/pdf/2603.02133
• Project Page: https://xiac20.github.io/SimRecon/
• Github: https://github.com/xiac20/SimRecon

#SceneReconstruction #ComputerVision #AI #Simulation #3DReconstruction
LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

📝 Summary:
LookaheadKV enhances KV cache eviction in LLMs by accurately predicting future importance scores. It uses parameter-efficient modules, avoiding costly draft generation while maintaining high accuracy. This lightweight method significantly reduces eviction overhead and speeds up inference.
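
The eviction step itself can be pictured as a top-k selection over per-token importance scores. The sketch below assumes the scores are already available as a plain list; how LookaheadKV predicts them is not reproduced here, and `evict_kv` is a hypothetical helper name.

```python
def evict_kv(scores, keep):
    """Return the sorted indices of cache entries to retain.

    scores: one importance score per cached token (assumed to be given).
    keep:   number of entries the cache budget allows.
    """
    if keep >= len(scores):
        return list(range(len(scores)))
    # Rank positions by score, highest first, and keep the top `keep`.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the retained entries stay in sequence.
    return sorted(ranked[:keep])

kept = evict_kv([0.1, 0.9, 0.3, 0.7], keep=2)
```

Here `kept` holds positions 1 and 3, the two highest-scoring tokens, in their original order.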

🔹 Publication Date: Published on Mar 11

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.10899
• PDF: https://arxiv.org/pdf/2603.10899
• Github: https://github.com/SamsungLabs/LookaheadKV

#LLM #KVCache #ModelOptimization #DeepLearning #AI
Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

📝 Summary:
Cheers is a unified multimodal model that decouples visual details from semantic representations for efficient joint optimization of understanding and generation. It employs a vision tokenizer, LLM-based Transformer, and cascaded flow matching. Cheers achieves state-of-the-art performance with 4x...

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12793
• PDF: https://arxiv.org/pdf/2603.12793
• Project Page: https://huggingface.co/ai9stars/Cheers
• Github: https://github.com/AI9Stars/Cheers

🔹 Models citing this paper:
https://huggingface.co/ai9stars/Cheers

#MultimodalAI #LLM #ComputerVision #GenerativeAI #AIResearch
OmniForcing: Unleashing Real-time Joint Audio-Visual Generation

📝 Summary:
OmniForcing transforms slow bidirectional audio-visual diffusion models into fast, real-time streaming generators. It tackles training instability and synchronization by using asymmetric alignment, a global prefix, and an audio sink token. This enables high-fidelity, synchronized generation at 25...

🔹 Publication Date: Published on Mar 12

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.11647
• PDF: https://arxiv.org/pdf/2603.11647
• Project Page: https://omniforcing.com/
• Github: https://github.com/OmniForcing/OmniForcing

#GenerativeAI #AudioVisual #RealtimeAI #DiffusionModels #DeepLearning
HybridStitch: Pixel and Timestep Level Model Stitching for Diffusion Acceleration

📝 Summary:
HybridStitch accelerates text-to-image generation by intelligently combining large and small diffusion models. It uses the large model for complex image regions and the smaller model for simpler parts, even within a single denoising step. This approach speeds up generation by 1.83x on Stable Diff...
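
One way to picture pixel-level stitching within a single denoising step is a per-pixel mask that picks between the two models' outputs. The sketch below is only an illustration on nested lists: the `stitch` helper and the boolean mask are assumptions, and how HybridStitch decides which regions are complex is not shown.

```python
def stitch(large_out, small_out, mask):
    """Combine two same-shaped 2D outputs pixel by pixel.

    mask[i][j] is True where the region is treated as complex, so the
    large model's value is used there; elsewhere the small model's is.
    """
    return [
        [l if m else s for l, s, m in zip(l_row, s_row, m_row)]
        for l_row, s_row, m_row in zip(large_out, small_out, mask)
    ]

large = [[1.0, 1.0], [1.0, 1.0]]
small = [[0.0, 0.0], [0.0, 0.0]]
mask = [[True, False], [False, True]]
stitched = stitch(large, small, mask)
```

In a real pipeline the speedup would presumably come from running the large model only on the masked region; this sketch shows only the final combination.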

🔹 Publication Date: Published on Mar 8

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.07815
• PDF: https://arxiv.org/pdf/2603.07815

#AI #DataScience #MachineLearning #HuggingFace #Research
MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

📝 Summary:
MM-CondChain is a benchmark that evaluates multimodal large language models on deep compositional visual reasoning through multi-layer conditional workflows with mechanically verifiable conditions.

🔹 Publication Date: Published on Mar 12

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12266
• PDF: https://arxiv.org/pdf/2603.12266
• Project Page: https://accio-lab.github.io/MM-CondChain
• Github: https://accio-lab.github.io/MM-CondChain

#AI #DataScience #MachineLearning #HuggingFace #Research
Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation

📝 Summary:
A self-evolving framework for open-world embodied agents that couples execution diagnosis with knowledge distillation to improve long-horizon task performance through structured experience organizatio...

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13131
• PDF: https://arxiv.org/pdf/2603.13131
• Github: https://github.com/xzw-ustc/Steve-Evolving

#AI #DataScience #MachineLearning #HuggingFace #Research
V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration

📝 Summary:
Video generative models can be adapted for image restoration tasks with minimal training data by treating restoration as a progressive generative process.

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13089
• PDF: https://arxiv.org/pdf/2603.13089
• Project Page: https://zhengsh123.github.io/V-Bridge/
• Github: https://github.com/Zhengsh123/V-Bridge

🔹 Models citing this paper:
https://huggingface.co/desimfj/V-Bridge

#AI #DataScience #MachineLearning #HuggingFace #Research
Multimodal OCR: Parse Anything from Documents

📝 Summary:
MOCR is a multimodal OCR approach that jointly parses text and graphics into unified representations, enabling structured document reconstruction and supporting end-to-end training with semantic relat...

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13032
• PDF: https://arxiv.org/pdf/2603.13032

#AI #DataScience #MachineLearning #HuggingFace #Research
Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol

📝 Summary:
A novel detection framework called UCIP uses quantum statistical mechanics-inspired methods to distinguish between autonomous agents with genuine continuation objectives versus those pursuing continua...

🔹 Publication Date: Published on Mar 11

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.11382
• PDF: https://arxiv.org/pdf/2603.11382
• Project Page: https://lab.christopheraltman.com/
• Github: https://github.com/christopher-altman/persistence-signal-detector

#AI #DataScience #MachineLearning #HuggingFace #Research
Can Vision-Language Models Solve the Shell Game?

📝 Summary:
Vision-Language Models struggle with tracking identical visual entities, performing poorly on the VET-Bench testbed. Researchers propose Spatiotemporal Grounded Chain-of-Thought (SGCoT) to generate object trajectories as intermediate states. This method achieves over 90% accuracy, showing VLMs can ...

🔹 Publication Date: Published on Mar 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.08436
• PDF: https://arxiv.org/pdf/2603.08436
• Project Page: https://vetbench.github.io/
• Github: https://github.com/liutiedong/shellgame

🔹 Models citing this paper:
https://huggingface.co/tiedong/Molmo2-SGCoT

Datasets citing this paper:
https://huggingface.co/datasets/tiedong/vetbench
https://huggingface.co/datasets/tiedong/Molmo2-SGCoT

#AI #DataScience #MachineLearning #HuggingFace #Research
HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

📝 Summary:
HomeSafe-Bench presents a benchmark for vision-language models to detect unsafe actions by embodied agents in household settings. It also introduces HD-Guard, a hierarchical dual-brain architecture balancing real-time safety monitoring with detection accuracy.

🔹 Publication Date: Published on Mar 12

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.11975
• PDF: https://arxiv.org/pdf/2603.11975
• Project Page: https://pujiayue.github.io/homesafe-bench.github.io/
• Github: https://github.com/pujiayue/HomeSafe-Bench

Spaces citing this paper:
https://huggingface.co/spaces/pujiayue/HomeSafe-Bench-Leaderboard

#VisionLanguageModels #EmbodiedAI #AISafety #Robotics #Benchmark
NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

📝 Summary:
NanoVDR improves visual document retrieval by distilling a large VLM teacher into a small 70M text-only query encoder. This decouples document indexing from query processing, achieving 50x lower latency and 32x fewer parameters with nearly identical quality.
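
To make the distillation idea concrete: if the student is trained so its query embeddings land where the teacher's do, documents indexed once with the teacher can be searched with the small encoder alone. The sketch below uses a cosine alignment loss as a stand-in objective; this loss and the function names are assumptions for illustration, since the summary does not state the training objective.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(student_embs, teacher_embs):
    """Mean (1 - cosine) over paired student and teacher query embeddings."""
    pairs = list(zip(student_embs, teacher_embs))
    return sum(1.0 - cosine(s, t) for s, t in pairs) / len(pairs)

# Toy embeddings: the first pair is aligned, the second is orthogonal.
loss = alignment_loss([[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 3.0]])
```

The loss is 0 when every student embedding points in the teacher's direction and grows as they diverge; the toy batch above averages one aligned and one orthogonal pair.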

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12824
• PDF: https://arxiv.org/pdf/2603.12824
• Project Page: https://huggingface.co/nanovdr

🔹 Models citing this paper:
https://huggingface.co/nanovdr/NanoVDR-L
https://huggingface.co/nanovdr/NanoVDR-S-Multi
https://huggingface.co/nanovdr/NanoVDR-S

Spaces citing this paper:
https://huggingface.co/spaces/nanovdr/NanoVDR-Demo

#VisualDocumentRetrieval #ModelDistillation #VLM #InformationRetrieval #DeepLearning
Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

📝 Summary:
This paper presents a novel text-motion retrieval method. It maps joint-angle motion features into Vision Transformer-compatible pseudo-images and uses an enhanced late interaction mechanism. This achieves superior performance and offers interpretable fine-grained text-motion alignments.
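
For readers unfamiliar with late interaction, the standard MaxSim score (as used in ColBERT-style retrievers) matches each query token to its best-scoring patch and sums the results. The sketch below shows that baseline form on toy vectors; the paper's enhanced variant is not reproduced, and the embedding values are made up.

```python
def maxsim(text_tokens, motion_patches):
    """Late-interaction score: sum over text tokens of the best patch match."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # For each text token, keep only its highest similarity to any patch.
    return sum(max(dot(t, p) for p in motion_patches) for t in text_tokens)

text_tokens = [[1.0, 0.0], [0.0, 1.0]]
motion_patches = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim(text_tokens, motion_patches)
```

Because each token contributes the patch it matched best, the per-token argmax can be read off as a fine-grained alignment, which is the kind of interpretability the summary refers to.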

🔹 Publication Date: Published on Mar 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.09930
• PDF: https://arxiv.org/pdf/2603.09930

#MotionRetrieval #DeepLearning #ComputerVision #AIResearch #NLP