ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

📝 Summary:
MVU-Eval is a new comprehensive benchmark for evaluating multi-video understanding in Multimodal Large Language Models (MLLMs). It fills a critical gap left by existing single-video benchmarks and reveals significant performance limitations of current MLLMs in multi-video scenarios.

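💻 Illustrative sketch: scoring a model on a multi-video QA benchmark of this kind boils down to comparing predicted and reference answers per sample. The field names (videos, question, options, answer) and the answer_fn wrapper below are assumptions for illustration, not the released MVU-Eval API.

```python
# Minimal scoring loop for a multi-video QA benchmark.
# Field names and the answer_fn signature are assumptions for illustration.
from typing import Callable, Dict, List

def evaluate_multi_video_qa(
    samples: List[Dict],
    answer_fn: Callable[[List[str], str, List[str]], str],  # your MLLM wrapper
) -> float:
    """Accuracy of answer_fn over samples shaped like
    {"videos": [...], "question": str, "options": [...], "answer": str}."""
    correct = 0
    for s in samples:
        pred = answer_fn(s["videos"], s["question"], s["options"])
        correct += int(pred.strip() == s["answer"].strip())
    return correct / max(len(samples), 1)

# Toy model that always picks the first option, just to show the call shape.
toy = [{"videos": ["a.mp4", "b.mp4"], "question": "Which clip shows rain?",
        "options": ["A", "B"], "answer": "A"}]
print(evaluate_multi_video_qa(toy, lambda vids, q, opts: opts[0]))  # 1.0
```
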
🔹 Publication Date: Published on Nov 10, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.07250
• PDF: https://arxiv.org/pdf/2511.07250
• Project Page: https://huggingface.co/datasets/MVU-Eval-Team/MVU-Eval-Data
• GitHub: https://github.com/NJU-LINK/MVU-Eval

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#MLLMs #VideoUnderstanding #AI #Benchmarking #ComputerVision
TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning

📝 Summary:
TimeSearch-R improves long-form video understanding by optimizing temporal search with reinforcement learning. Its GRPO-CSV objective verifies the completeness of the searched frames, which improves reasoning and yields state-of-the-art performance on multiple video benchmarks.

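💻 Illustrative sketch: GRPO normalizes rewards within a group of sampled trajectories, and GRPO-CSV additionally scores whether the searched frames are complete. The combination and weighting below are assumptions, a minimal sketch of the idea rather than the paper's algorithm.

```python
# Group-relative advantage with an added completeness bonus,
# loosely following the GRPO-CSV idea (weighting is an assumption).
import statistics

def group_relative_advantages(task_rewards, completeness_scores, beta=0.5):
    """Mix task reward with a frame-completeness score, then normalize
    within the group of sampled search trajectories (GRPO-style)."""
    combined = [r + beta * c for r, c in zip(task_rewards, completeness_scores)]
    mean = statistics.fmean(combined)
    std = statistics.pstdev(combined) or 1.0  # guard against zero variance
    return [(x - mean) / std for x in combined]

# Four sampled search trajectories for one question:
print(group_relative_advantages(
    task_rewards=[1.0, 0.0, 1.0, 0.0],         # was the answer correct?
    completeness_scores=[0.9, 0.4, 0.6, 0.2],  # verified frame coverage
))
```
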
🔹 Publication Date: Published on Nov 7, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.05489
• PDF: https://arxiv.org/pdf/2511.05489
• GitHub: https://github.com/Time-Search/TimeSearch-R

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#VideoUnderstanding #ReinforcementLearning #DeepLearning #AIResearch #ComputerVision
EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation

📝 Summary:
EmoVid is a new multimodal, emotion-annotated video dataset designed for creative media like cartoons and movies. It bridges emotion understanding with video generation, significantly improving emotional expression and quality in generated videos. EmoVid establishes a new benchmark for affective ...

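💻 Illustrative sketch: an emotion-annotated video dataset lets you index clips by emotion label, e.g. to build conditioning sets for emotion-aware generation. The record fields below are assumptions, not the EmoVid schema.

```python
# Sketch: selecting clips by emotion label from an emotion-annotated
# video dataset (record fields are assumptions, not the EmoVid schema).
from collections import defaultdict

records = [
    {"clip": "c001.mp4", "emotion": "joy",     "source": "cartoon"},
    {"clip": "c002.mp4", "emotion": "sadness", "source": "movie"},
    {"clip": "c003.mp4", "emotion": "joy",     "source": "movie"},
]

by_emotion = defaultdict(list)
for r in records:
    by_emotion[r["emotion"]].append(r["clip"])

# e.g. a conditioning set of joyful clips for emotion-aware generation
print(by_emotion["joy"])  # ['c001.mp4', 'c003.mp4']
```
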
🔹 Publication Date: Published on Nov 14, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11002
• PDF: https://arxiv.org/pdf/2511.11002

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#EmoVid #MultimodalAI #EmotionAI #VideoGeneration #VideoUnderstanding
Dynamic Reflections: Probing Video Representations with Text Alignment

📝 Summary:
This work presents the first comprehensive study of video-text representation alignment. It reveals that alignment depends on data richness and correlates with downstream task performance, suggesting its value for general video understanding. This introduces video-text alignment as a zero-shot method ...

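💻 Illustrative sketch: one common way to quantify cross-modal alignment is a mutual k-NN score between video and caption embeddings, which works zero-shot. This is a generic alignment metric shown for illustration; whether it matches the paper's exact measure is not claimed.

```python
# Mutual k-NN alignment between video and caption embeddings
# (a generic alignment metric, not necessarily the paper's).
import numpy as np

def mutual_knn_alignment(video_emb, text_emb, k: int = 5) -> float:
    """Average fraction of shared top-k neighbors between the two
    embedding spaces; row i of each matrix describes sample i."""
    def knn(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        sim = x @ x.T
        np.fill_diagonal(sim, -np.inf)           # exclude self-matches
        return np.argsort(-sim, axis=1)[:, :k]   # top-k neighbor indices
    nv, nt = knn(video_emb), knn(text_emb)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nv, nt)]))

rng = np.random.default_rng(0)
v = rng.normal(size=(100, 64))
t = v + 0.1 * rng.normal(size=(100, 64))  # well-aligned toy pair
print(mutual_knn_alignment(v, t))         # close to 1.0
```
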
🔹 Publication Date: Published on Nov 4, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.02767
• PDF: https://arxiv.org/pdf/2511.02767
• Project Page: https://video-prh.github.io/

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#VideoUnderstanding #TextAlignment #VideoTextAI #ZeroShotLearning #RepresentationLearning
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

📝 Summary:
Text-only self-reflection is insufficient for long-form video understanding. REVISOR is a new framework that enables MLLMs to perform multimodal introspective reflection across text and visual modalities. This significantly enhances reasoning over long videos without extra fine-tuning, achieving strong ...

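💻 Illustrative sketch of the control flow of multimodal introspective reflection: draft an answer, let the model nominate video segments to re-inspect, then revise. The two-pass protocol and function signatures are assumptions for illustration, not the REVISOR implementation.

```python
# Control-flow sketch of multimodal introspective reflection
# (the two-pass protocol and signatures are assumptions).
from typing import Callable, List, Tuple

def answer_with_reflection(
    frames: List[str],
    question: str,
    mllm: Callable[[List[str], str], Tuple[str, List[int]]],
) -> str:
    """First pass drafts an answer and nominates frames to re-inspect;
    second pass re-reads only those frames and revises the draft."""
    draft, revisit_idx = mllm(frames, question)
    if not revisit_idx:                  # model is confident; keep the draft
        return draft
    focused = [frames[i] for i in revisit_idx if 0 <= i < len(frames)]
    revised, _ = mllm(focused, f"{question}\nDraft answer: {draft}\nRevise if needed.")
    return revised

# Toy stand-in model: asks to revisit frame 0, then finalizes.
def toy_mllm(frames, prompt):
    done = "Draft" in prompt
    return ("final" if done else "draft"), ([] if done else [0])

print(answer_with_reflection(["f0.jpg", "f1.jpg"], "What happens?", toy_mllm))
```
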
🔹 Publication Date: Published on Nov 17, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13026
• PDF: https://arxiv.org/pdf/2511.13026

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#MultimodalAI #VideoUnderstanding #MLLMs #AIResearch #ComputerVision
VideoP2R: Video Understanding from Perception to Reasoning

📝 Summary:
VideoP2R is a novel reinforcement fine-tuning framework for video understanding. It separately models perception and reasoning processes, using a new CoT dataset and a process-aware RL algorithm. This approach achieves state-of-the-art results on video reasoning benchmarks.

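💻 Illustrative sketch: a process-aware reward scores the perception step and the reasoning step separately instead of using a single outcome reward. The stage definitions and weights below are assumptions, not the paper's reward design.

```python
# Process-aware reward: score perception and reasoning stages separately
# (stage definitions and weights are assumptions for illustration).
def process_aware_reward(perception_ok: bool, reasoning_ok: bool,
                         w_perc: float = 0.4, w_reason: float = 0.6) -> float:
    """Weighted sum of stage-level rewards for one rollout."""
    return w_perc * float(perception_ok) + w_reason * float(reasoning_ok)

# Four rollouts: (described the video correctly?, answered correctly?)
rollouts = [(True, True), (True, False), (False, True), (False, False)]
print([process_aware_reward(p, r) for p, r in rollouts])  # [1.0, 0.4, 0.6, 0.0]
```
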
🔹 Publication Date: Published on Nov 14, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11113
• PDF: https://arxiv.org/pdf/2511.11113

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#VideoUnderstanding #ReinforcementLearning #AIResearch #ComputerVision #Reasoning
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

📝 Summary:
TimeViper is a hybrid Mamba-Transformer vision-language model for efficient long video understanding. It introduces a TransV module to compress redundant vision tokens into instruction tokens, enabling it to process over 10,000 frames. This achieves state-of-the-art performance while offering new...

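💻 Illustrative sketch: compressing a long vision-token sequence into a short instruction-token sequence can be done with cross-attention, in the spirit of the described TransV module. The layer layout and sizes below are assumptions, not the paper's architecture.

```python
# Cross-attention token compression: instruction tokens absorb the long
# vision-token sequence (layer layout and sizes are assumptions).
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, instr_tokens, vision_tokens):
        # Instruction tokens attend over the vision tokens, so downstream
        # layers only ever see the short instruction sequence.
        pooled, _ = self.attn(instr_tokens, vision_tokens, vision_tokens)
        return self.norm(instr_tokens + pooled)

vision = torch.randn(1, 10_000, 256)  # e.g. tokens from thousands of frames
instr = torch.randn(1, 32, 256)       # short instruction sequence
print(TokenCompressor()(instr, vision).shape)  # torch.Size([1, 32, 256])
```
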
🔹 Publication Date: Published on Nov 20, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16595
• PDF: https://arxiv.org/pdf/2511.16595
• Project Page: https://xuboshen.github.io/TimeViper/

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#TimeViper #VisionLanguageModels #VideoUnderstanding #MambaTransformer #DeepLearning