✨Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
📝 Summary:
Vision-R1 is an MLLM that strengthens multimodal reasoning through reinforcement learning. It leverages a large, AI-generated multimodal CoT dataset and new training strategies to refine reasoning, achieving high accuracy on multimodal math benchmarks.
🔹 Publication Date: Published on Mar 9, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2503.06749
• PDF: https://arxiv.org/pdf/2503.06749
• Github: https://github.com/Osilly/Vision-R1
✨ Datasets citing this paper:
• https://huggingface.co/datasets/Yuting6/ttrl
• https://huggingface.co/datasets/LoadingBFX/GeoQA-PLUS-aug-train-Vision-R1-cot-rewrite
• https://huggingface.co/datasets/LoadingBFX/GeoQA-train-Vision-R1-cot-rewrite
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MLLM #ReinforcementLearning #AIReasoning #ChainOfThought #ArtificialIntelligence
✨LMEB: Long-horizon Memory Embedding Benchmark
📝 Summary:
LMEB is a new benchmark for evaluating embedding models' long-horizon memory retrieval abilities, a gap in traditional benchmarks. It assesses complex memory types and reveals that performance in standard passage retrieval does not generalize to these challenging scenarios.
🔹 Publication Date: Published on Mar 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12572
• PDF: https://arxiv.org/pdf/2603.12572
✨ Datasets citing this paper:
• https://huggingface.co/datasets/KaLM-Embedding/LMEB
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#EmbeddingModels #MemoryRetrieval #Benchmarks #MachineLearning #AIResearch
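As a generic illustration of how retrieval benchmarks like this are scored (not code from the paper; the function and toy data are hypothetical), embedding retrieval is typically evaluated with Recall@k over cosine similarities:

```python
import numpy as np

def recall_at_k(query_embs, doc_embs, relevant_idx, k=5):
    """Fraction of queries whose relevant document appears in the top-k
    cosine-similarity results. Embeddings are assumed L2-normalized."""
    sims = query_embs @ doc_embs.T            # (n_queries, n_docs)
    topk = np.argsort(-sims, axis=1)[:, :k]   # top-k doc indices per query
    hits = [relevant_idx[i] in topk[i] for i in range(len(relevant_idx))]
    return float(np.mean(hits))

# Toy example: 3 queries, 4 documents, unit-normalized random embeddings.
rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
queries = docs[[0, 1, 2]] + 0.1 * rng.normal(size=(3, 8))  # noisy copies
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
print(recall_at_k(queries, docs, relevant_idx=[0, 1, 2], k=1))
```

LMEB's point is that strong scores on this kind of standard passage retrieval do not transfer to long-horizon memory retrieval.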
✨VQQA: An Agentic Approach for Video Evaluation and Quality Improvement
📝 Summary:
VQQA is a multi-agent framework that uses VLM critiques as semantic gradients for efficient, black-box video generation optimization via natural language. It resolves visual artifacts, significantly improving video quality for text-to-video and image-to-video tasks.
🔹 Publication Date: Published on Mar 12
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12310
• PDF: https://arxiv.org/pdf/2603.12310
• Project Page: https://yiwen-song.github.io/vqqa/
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VideoGeneration #AIAgents #VisionLanguageModels #GenerativeAI #MachineLearning
✨daVinci-Env: Open SWE Environment Synthesis at Scale
📝 Summary:
OpenSWE is the largest open framework for training software engineering agents, featuring 45,320 executable Python environments. It achieves state-of-the-art performance on SWE-bench Verified and shows substantial out-of-domain reasoning improvements.
🔹 Publication Date: Published on Mar 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13023
• PDF: https://arxiv.org/pdf/2603.13023
• Github: https://github.com/GAIR-NLP/OpenSWE
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#SoftwareEngineering #AIagents #MachineLearning #OpenSWE #DeepLearning
✨From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space
📝 Summary:
Multi-View GRPO enhances text-to-image flow model alignment by expanding the condition space for richer reward mapping and improved exploration of sample relationships.
🔹 Publication Date: Published on Mar 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12648
• PDF: https://arxiv.org/pdf/2603.12648
• Project Page: https://bujiazi.github.io/mvgrpo.github.io/
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #DataScience #MachineLearning #HuggingFace #Research
✨Spend Less, Reason Better: Budget-Aware Value Tree Search for LLM Agents
📝 Summary:
The Budget-Aware Value Tree (BAVT) optimizes LLM agent reasoning by dynamically balancing exploration and exploitation based on the remaining compute budget. It uses budget-conditioned node selection and a residual value predictor for efficient search, outperforming brute-force methods with 4x fewer resources.
🔹 Publication Date: Published on Mar 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12634
• PDF: https://arxiv.org/pdf/2603.12634
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#LLMAgents #AIResearch #Optimization #EfficientAI #ValueTreeSearch
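A minimal sketch of the general budget-conditioned idea (not the paper's algorithm; the scoring rule and data here are illustrative assumptions): a UCB-style selection whose exploration bonus shrinks as compute runs out, so the search explores early and exploits late.

```python
import math

def select_child(children, remaining_budget, total_budget, c=1.4):
    """Pick the child with the best budget-conditioned score.

    Each child is a dict with 'value' (mean reward estimate) and 'visits'.
    The exploration bonus is scaled by the fraction of budget left."""
    frac_left = remaining_budget / total_budget
    total_visits = sum(ch["visits"] for ch in children) or 1

    def score(ch):
        explore = c * math.sqrt(math.log(total_visits + 1) / (ch["visits"] + 1))
        return ch["value"] + frac_left * explore

    return max(children, key=score)

children = [
    {"name": "a", "value": 0.60, "visits": 10},
    {"name": "b", "value": 0.55, "visits": 1},
]
# With a full budget the under-visited child 'b' wins on its exploration
# bonus; with the budget nearly spent the higher-value child 'a' wins.
print(select_child(children, remaining_budget=100, total_budget=100)["name"])
print(select_child(children, remaining_budget=1, total_budget=100)["name"])
```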
✨Visual-ERM: Reward Modeling for Visual Equivalence
📝 Summary:
Visual-ERM is a multimodal generative reward model providing fine-grained visual feedback for vision-to-code tasks. It significantly improves reinforcement learning performance for chart, table, and SVG parsing, demonstrating that fine-grained visual supervision is essential.
🔹 Publication Date: Published on Mar 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13224
• PDF: https://arxiv.org/pdf/2603.13224
• Github: https://github.com/InternLM/Visual-ERM
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#ReinforcementLearning #ComputerVision #GenerativeAI #AI #DataScience
✨SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
📝 Summary:
SimRecon reconstructs cluttered scenes from real videos using a Perception-Generation-Simulation pipeline. It employs Active Viewpoint Optimization for visual fidelity and a Scene Graph Synthesizer for physical plausibility. This enables superior compositional scene representations for simulation...
🔹 Publication Date: Published on Mar 2
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.02133
• PDF: https://arxiv.org/pdf/2603.02133
• Project Page: https://xiac20.github.io/SimRecon/
• Github: https://github.com/xiac20/SimRecon
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#SceneReconstruction #ComputerVision #AI #Simulation #3DReconstruction
✨LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation
📝 Summary:
LookaheadKV enhances KV cache eviction in LLMs by accurately predicting future importance scores. It uses parameter-efficient modules, avoiding costly draft generation while maintaining high accuracy. This lightweight method significantly reduces eviction overhead and speeds up inference.
🔹 Publication Date: Published on Mar 11
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.10899
• PDF: https://arxiv.org/pdf/2603.10899
• Github: https://github.com/SamsungLabs/LookaheadKV
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#LLM #KVCache #ModelOptimization #DeepLearning #AI
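For context, score-based KV cache eviction in general looks like the sketch below (a hypothetical illustration, not LookaheadKV's method; in the paper the importance scores would come from its learned lookahead predictor, while here they are simply given):

```python
import numpy as np

def evict_kv(keys, values, scores, budget):
    """Keep only the `budget` cache entries with the highest predicted
    importance scores, preserving the original token order.

    keys/values: (seq_len, head_dim) arrays; scores: (seq_len,)
    per-token importance estimates."""
    keep = np.sort(np.argsort(-scores)[:budget])  # top-`budget` indices, ordered
    return keys[keep], values[keep], keep

seq_len, head_dim = 6, 4
rng = np.random.default_rng(1)
keys = rng.normal(size=(seq_len, head_dim))
values = rng.normal(size=(seq_len, head_dim))
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])

k2, v2, kept = evict_kv(keys, values, scores, budget=3)
print(kept)  # indices of the surviving entries: [0 2 4]
```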
✨Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation
📝 Summary:
Cheers is a unified multimodal model that decouples visual details from semantic representations for efficient joint optimization of understanding and generation. It employs a vision tokenizer, LLM-based Transformer, and cascaded flow matching. Cheers achieves state-of-the-art performance with 4x...
🔹 Publication Date: Published on Mar 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12793
• PDF: https://arxiv.org/pdf/2603.12793
• Project Page: https://huggingface.co/ai9stars/Cheers
• Github: https://github.com/AI9Stars/Cheers
🔹 Models citing this paper:
• https://huggingface.co/ai9stars/Cheers
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MultimodalAI #LLM #ComputerVision #GenerativeAI #AIResearch
✨OmniForcing: Unleashing Real-time Joint Audio-Visual Generation
📝 Summary:
OmniForcing transforms slow bidirectional audio-visual diffusion models into fast, real-time streaming generators. It tackles training instability and synchronization by using asymmetric alignment, a global prefix, and an audio sink token. This enables high-fidelity, synchronized generation at 25...
🔹 Publication Date: Published on Mar 12
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.11647
• PDF: https://arxiv.org/pdf/2603.11647
• Project Page: https://omniforcing.com/
• Github: https://github.com/OmniForcing/OmniForcing
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#GenerativeAI #AudioVisual #RealtimeAI #DiffusionModels #DeepLearning
✨HybridStitch: Pixel and Timestep Level Model Stitching for Diffusion Acceleration
📝 Summary:
HybridStitch accelerates text-to-image generation by intelligently combining large and small diffusion models. It uses the large model for complex image regions and the smaller model for simpler parts, even within a single denoising step. This approach speeds up generation by 1.83x on Stable Diff...
🔹 Publication Date: Published on Mar 8
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.07815
• PDF: https://arxiv.org/pdf/2603.07815
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #DataScience #MachineLearning #HuggingFace #Research
✨MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
📝 Summary:
The MM-CondChain benchmark evaluates multimodal large language models on deep compositional visual reasoning through multi-layer conditional workflows with mechanically verifiable conditions.
🔹 Publication Date: Published on Mar 12
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12266
• PDF: https://arxiv.org/pdf/2603.12266
• Project Page: https://accio-lab.github.io/MM-CondChain
• Github: https://accio-lab.github.io/MM-CondChain
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #DataScience #MachineLearning #HuggingFace #Research
✨Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation
📝 Summary:
A self-evolving framework for open-world embodied agents that couples execution diagnosis with knowledge distillation to improve long-horizon task performance through structured experience organization.
🔹 Publication Date: Published on Mar 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13131
• PDF: https://arxiv.org/pdf/2603.13131
• Github: https://github.com/xzw-ustc/Steve-Evolving
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #DataScience #MachineLearning #HuggingFace #Research
✨V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration
📝 Summary:
Video generative models can be adapted for image restoration tasks with minimal training data by treating restoration as a progressive generative process.
🔹 Publication Date: Published on Mar 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13089
• PDF: https://arxiv.org/pdf/2603.13089
• Project Page: https://zhengsh123.github.io/V-Bridge/
• Github: https://github.com/Zhengsh123/V-Bridge
🔹 Models citing this paper:
• https://huggingface.co/desimfj/V-Bridge
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #DataScience #MachineLearning #HuggingFace #Research
✨Multimodal OCR: Parse Anything from Documents
📝 Summary:
MOCR is a multimodal OCR approach that jointly parses text and graphics into unified representations, enabling structured document reconstruction and supporting end-to-end training with semantic relat...
🔹 Publication Date: Published on Mar 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13032
• PDF: https://arxiv.org/pdf/2603.13032
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #DataScience #MachineLearning #HuggingFace #Research
✨Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol
📝 Summary:
A novel detection framework called UCIP uses quantum statistical mechanics-inspired methods to distinguish autonomous agents with genuine continuation objectives from those pursuing continua...
🔹 Publication Date: Published on Mar 11
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.11382
• PDF: https://arxiv.org/pdf/2603.11382
• Project Page: https://lab.christopheraltman.com/
• Github: https://github.com/christopher-altman/persistence-signal-detector
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #DataScience #MachineLearning #HuggingFace #Research
✨Can Vision-Language Models Solve the Shell Game?
📝 Summary:
Vision-Language Models struggle with tracking identical visual entities, performing poorly on the VET-Bench testbed. Researchers propose Spatiotemporal Grounded Chain-of-Thought (SGCoT) to generate object trajectories as intermediate states. This method achieves over 90% accuracy, showing VLMs can ...
🔹 Publication Date: Published on Mar 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.08436
• PDF: https://arxiv.org/pdf/2603.08436
• Project Page: https://vetbench.github.io/
• Github: https://github.com/liutiedong/shellgame
🔹 Models citing this paper:
• https://huggingface.co/tiedong/Molmo2-SGCoT
✨ Datasets citing this paper:
• https://huggingface.co/datasets/tiedong/vetbench
• https://huggingface.co/datasets/tiedong/Molmo2-SGCoT
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #DataScience #MachineLearning #HuggingFace #Research
✨HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
📝 Summary:
HomeSafe-Bench presents a benchmark for vision-language models to detect unsafe actions by embodied agents in household settings. It also introduces HD-Guard, a hierarchical dual-brain architecture balancing real-time safety monitoring with detection accuracy.
🔹 Publication Date: Published on Mar 12
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.11975
• PDF: https://arxiv.org/pdf/2603.11975
• Project Page: https://pujiayue.github.io/homesafe-bench.github.io/
• Github: https://github.com/pujiayue/HomeSafe-Bench
✨ Spaces citing this paper:
• https://huggingface.co/spaces/pujiayue/HomeSafe-Bench-Leaderboard
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VisionLanguageModels #EmbodiedAI #AISafety #Robotics #Benchmark
✨NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval
📝 Summary:
NanoVDR improves visual document retrieval by distilling a large VLM teacher into a small 70M text-only query encoder. This decouples document indexing from query processing, achieving 50x lower latency and 32x fewer parameters with nearly identical quality.
🔹 Publication Date: Published on Mar 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12824
• PDF: https://arxiv.org/pdf/2603.12824
• Project Page: https://huggingface.co/nanovdr
🔹 Models citing this paper:
• https://huggingface.co/nanovdr/NanoVDR-L
• https://huggingface.co/nanovdr/NanoVDR-S-Multi
• https://huggingface.co/nanovdr/NanoVDR-S
✨ Spaces citing this paper:
• https://huggingface.co/spaces/nanovdr/NanoVDR-Demo
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VisualDocumentRetrieval #ModelDistillation #VLM #InformationRetrieval #DeepLearning
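The generic idea behind this kind of distillation can be sketched as follows (an illustrative assumption, not NanoVDR's training objective: the loss, function name, and toy data are hypothetical). The small text-only student is trained so its query embeddings match the frozen VLM teacher's embedding space:

```python
import numpy as np

def distill_loss(student_emb, teacher_emb):
    """Mean cosine distance between student and teacher query embeddings.
    Minimizing this pulls the student's embedding space onto the teacher's,
    so the student can query the teacher-built document index directly."""
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 16))
print(distill_loss(teacher, teacher))  # essentially 0.0 for a perfect match
```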
✨Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
📝 Summary:
This paper presents a novel text-motion retrieval method. It maps joint-angle motion features into Vision Transformer-compatible pseudo-images and uses an enhanced late interaction mechanism. This achieves superior performance and offers interpretable fine-grained text-motion alignments.
🔹 Publication Date: Published on Mar 10
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.09930
• PDF: https://arxiv.org/pdf/2603.09930
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MotionRetrieval #DeepLearning #ComputerVision #AIResearch #NLP