✨MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
📝 Summary:
MVU-Eval is a new comprehensive benchmark for evaluating Multi-Video Understanding in Multimodal Large Language Models. It addresses a critical gap in existing single-video benchmarks and reveals significant performance limitations in current MLLMs for multi-video scenarios.
🔹 Publication Date: Published on Nov 10
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.07250
• PDF: https://arxiv.org/pdf/2511.07250
• Project Page: https://huggingface.co/datasets/MVU-Eval-Team/MVU-Eval-Data
• Github: https://github.com/NJU-LINK/MVU-Eval
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MLLMs #VideoUnderstanding #AI #Benchmarking #ComputerVision
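To make the multi-video setting concrete, here is a hypothetical evaluation loop in the style of a multi-video QA benchmark. The item schema, stub model, and scoring function are illustrative assumptions, not MVU-Eval's actual API.

```python
# Hypothetical multi-video QA evaluation sketch (not MVU-Eval's real interface).

def exact_match_accuracy(predictions, answers):
    """Fraction of predicted option letters matching the gold answers."""
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

def evaluate(model_fn, items):
    """Run a model over multi-video QA items and score exact-match accuracy.

    Each item bundles several video paths with one question, since
    multi-video understanding requires reasoning across clips.
    """
    preds = [model_fn(item["videos"], item["question"], item["options"])
             for item in items]
    return exact_match_accuracy(preds, [item["answer"] for item in items])

# Toy run with a stub model that always answers "A".
items = [
    {"videos": ["clip1.mp4", "clip2.mp4"],
     "question": "Which clip shows the event first?",
     "options": ["A. clip1", "B. clip2"], "answer": "A"},
    {"videos": ["clip3.mp4", "clip4.mp4"],
     "question": "Do both clips show the same object?",
     "options": ["A. yes", "B. no"], "answer": "B"},
]
stub_model = lambda videos, q, opts: "A"
print(evaluate(stub_model, items))  # 0.5: one of two items correct
```

A real harness would replace `stub_model` with an MLLM call that actually ingests the video frames.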
✨TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
📝 Summary:
TimeSearch-R improves long-form video understanding by optimizing temporal search with reinforcement learning. Its GRPO-CSV algorithm verifies the completeness of the searched frames, improving reasoning and achieving state-of-the-art performance on multiple video benchmarks.
🔹 Publication Date: Published on Nov 7
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.05489
• PDF: https://arxiv.org/pdf/2511.05489
• Github: https://github.com/Time-Search/TimeSearch-R
==================================
#VideoUnderstanding #ReinforcementLearning #DeepLearning #AIResearch #ComputerVision
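The search-then-verify idea can be sketched as a loop that keeps selecting frames until a verifier judges the evidence sufficient. The scorer and verifier below are stand-in functions; the paper's GRPO-CSV training procedure is not reproduced here.

```python
# Hedged sketch of adaptive temporal search with a self-verification step,
# loosely inspired by TimeSearch-R. score_fn and verify_fn are toy stand-ins.

def temporal_search(num_frames, score_fn, verify_fn, budget=8, max_rounds=4):
    """Iteratively add the highest-scoring unseen frames until the verifier
    judges the selection complete or the round limit is reached."""
    selected = set()
    for _ in range(max_rounds):
        candidates = sorted(
            (i for i in range(num_frames) if i not in selected),
            key=score_fn, reverse=True)
        selected.update(candidates[:budget])
        if verify_fn(selected):  # self-verification: is the evidence sufficient?
            break
    return sorted(selected)

# Toy relevance: frames near index 50 matter; verification wants >= 12 frames.
score = lambda i: -abs(i - 50)
verify = lambda sel: len(sel) >= 12
frames = temporal_search(100, score, verify)
print(len(frames))  # 16: two rounds of 8 frames before verification passes
```

In the paper the verifier is learned and rewarded during RL; here it is just a size threshold to keep the control flow visible.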
✨EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
📝 Summary:
EmoVid is a new multimodal, emotion-annotated video dataset designed for creative media such as cartoons and movies. It bridges emotion understanding with video generation, significantly improving emotional expression and quality in generated videos, and establishes a new benchmark for this setting.
🔹 Publication Date: Published on Nov 14
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11002
• PDF: https://arxiv.org/pdf/2511.11002
==================================
#EmoVid #MultimodalAI #EmotionAI #VideoGeneration #VideoUnderstanding
✨Dynamic Reflections: Probing Video Representations with Text Alignment
📝 Summary:
This work presents the first comprehensive study of video-text representation alignment. It finds that alignment depends on data richness and correlates with downstream task performance, suggesting its value for general video understanding, and introduces video-text alignment as a zero-shot probe of video representations.
🔹 Publication Date: Published on Nov 4
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.02767
• PDF: https://arxiv.org/pdf/2511.02767
• Project Page: https://video-prh.github.io/
==================================
#VideoUnderstanding #TextAlignment #VideoTextAI #ZeroShotLearning #RepresentationLearning
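A minimal version of zero-shot alignment probing: embed videos and their captions, align them with cosine similarity, and check whether each video retrieves its paired caption. The encoders are replaced with random stand-ins here; this is a sketch of the metric, not the paper's method.

```python
import numpy as np

# Minimal sketch of probing video representations via text alignment.
# Real encoders are replaced with synthetic embeddings.

def cosine_sim_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def retrieval_accuracy(video_emb, text_emb):
    """Top-1 accuracy of matching each video to its paired caption."""
    sims = cosine_sim_matrix(video_emb, text_emb)
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(sims))))

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 64))
video_emb = text_emb + 0.1 * rng.normal(size=(5, 64))  # well-aligned pairs
print(retrieval_accuracy(video_emb, text_emb))  # 1.0 on this easy toy case
```

The same accuracy, computed with frozen video and text encoders, is the kind of zero-shot signal the study correlates with downstream performance.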
✨REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
📝 Summary:
Text-only self-reflection is insufficient for long-form video understanding. REVISOR is a new framework that enables MLLMs to perform multimodal introspective reflection across text and visual modalities, significantly enhancing reasoning over long videos without extra fine-tuning and achieving strong results.
🔹 Publication Date: Published on Nov 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13026
• PDF: https://arxiv.org/pdf/2511.13026
==================================
#MultimodalAI #VideoUnderstanding #MLLMs #AIResearch #ComputerVision
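The control flow of multimodal introspection, as opposed to text-only reflection, can be sketched as: answer, pick unseen visual evidence to revisit, then revise. Every component below is a toy stand-in under stated assumptions, not the REVISOR framework itself.

```python
# Hedged sketch of multimodal introspective reflection in the spirit of
# REVISOR: revisit frames between answer attempts instead of reflecting
# on text alone. All functions are illustrative stand-ins.

def answer(frames_seen, question):
    """Toy answerer: reports the last frame inspected, if any."""
    return "no event" if not frames_seen else "event at frame %d" % frames_seen[-1]

def reflect(all_frames, frames_seen):
    """Introspection step: choose unseen frames worth a second look."""
    return [f for f in all_frames if f not in frames_seen][:2]

def revisor_loop(all_frames, question, rounds=2):
    seen = []
    ans = answer(seen, question)
    for _ in range(rounds):
        extra = reflect(all_frames, seen)
        if not extra:
            break
        seen += extra                  # re-ground in visual evidence
        ans = answer(seen, question)   # revise the answer
    return ans

print(revisor_loop(list(range(100)), "When does the event occur?"))
```

The key structural point is that `reflect` returns frames, not text: each reflection round re-grounds the model in the video before revising.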
✨VideoP2R: Video Understanding from Perception to Reasoning
📝 Summary:
VideoP2R is a novel reinforcement fine-tuning framework for video understanding. It models perception and reasoning as separate processes, using a new chain-of-thought (CoT) dataset and a process-aware RL algorithm, and achieves state-of-the-art results on video reasoning benchmarks.
🔹 Publication Date: Published on Nov 14
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11113
• PDF: https://arxiv.org/pdf/2511.11113
==================================
#VideoUnderstanding #ReinforcementLearning #AIResearch #ComputerVision #Reasoning
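Decoupling perception from reasoning can be sketched as a two-stage pipeline where each stage has its own input and output, so each can be supervised (or rewarded) separately. The two functions below are hypothetical stand-ins, not the paper's models, dataset, or RL algorithm.

```python
# Hypothetical two-stage perception -> reasoning pipeline, illustrating the
# decoupling VideoP2R's summary describes. Both stages are toy stand-ins.

def perceive(frames):
    """Perception stage: turn raw frames into a textual description."""
    return "a person picks up a red cup, then places it on a shelf"

def reason(description, question):
    """Reasoning stage: answer from the perceived description only, so the
    two stages can be trained and evaluated independently."""
    return "yes" if "red cup" in description and "cup" in question else "no"

frames = ["frame_%03d.jpg" % i for i in range(8)]
desc = perceive(frames)
print(reason(desc, "Does the video show a cup being moved?"))  # yes
```

Because the reasoning stage sees only the perception output, a process-aware reward can score perception quality and reasoning quality as distinct signals.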
✨TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
📝 Summary:
TimeViper is a hybrid Mamba-Transformer vision-language model for efficient long video understanding. Its TransV module compresses redundant vision tokens into instruction tokens, enabling the model to process over 10,000 frames while achieving state-of-the-art performance.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16595
• PDF: https://arxiv.org/pdf/2511.16595
• Project Page: https://xuboshen.github.io/TimeViper/
==================================
#TimeViper #VisionLanguageModels #VideoUnderstanding #MambaTransformer #DeepLearning
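The compression idea, folding a huge vision-token sequence into a short instruction-token sequence, can be sketched as one cross-attention step: instruction tokens act as queries over the vision tokens. This single random-weight step is a shape-level illustration, not TimeViper's actual TransV module.

```python
import numpy as np

# Hedged sketch of compressing redundant vision tokens into instruction
# tokens, in the spirit of TimeViper's TransV module (shape-level only).

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(vision_tokens, instruction_tokens):
    """Cross-attend instruction tokens (queries) over vision tokens
    (keys/values), folding visual content into the instruction stream."""
    d = instruction_tokens.shape[-1]
    attn = softmax(instruction_tokens @ vision_tokens.T / np.sqrt(d))
    return instruction_tokens + attn @ vision_tokens  # residual update

rng = np.random.default_rng(0)
vision = rng.normal(size=(10_000, 32))   # tokens from many video frames
instr = rng.normal(size=(16, 32))        # short instruction sequence
out = compress_tokens(vision, instr)
print(out.shape)  # (16, 32): output length no longer scales with frame count
```

The payoff is visible in the output shape: downstream layers see 16 tokens regardless of how many frames fed the vision stream.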