ML Research Hub
32.6K subscribers
3.39K photos
132 videos
23 files
3.61K links
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho
VideoSSR: Video Self-Supervised Reinforcement Learning

📝 Summary:
VideoSSR is a novel self-supervised reinforcement learning framework that leverages intrinsic video information to generate high-quality training data. It uses three pretext tasks and the VideoSSR-30K dataset, improving MLLM performance across 17 benchmarks by over 5%.

🔹 Publication Date: Published on Nov 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.06281
• PDF: https://arxiv.org/pdf/2511.06281
• Project Page: https://github.com/lcqysl/VideoSSR
• Github: https://github.com/lcqysl/VideoSSR

🔹 Models citing this paper:
https://huggingface.co/yhx12/VideoSSR

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ReinforcementLearning #SelfSupervisedLearning #VideoAI #MachineLearning #DeepLearning
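The post doesn't detail VideoSSR's three pretext tasks, but the core idea is generating training labels from intrinsic video structure with no human annotation. A minimal sketch of one common video pretext task of this kind (frame-order prediction, a hypothetical stand-in for the paper's tasks):

```python
import random

def make_frame_order_sample(num_frames: int, clip_len: int = 4, seed=None):
    """Self-supervised sample from intrinsic video structure: shuffle a
    clip's frame indices and use the true temporal order as the label,
    so no human annotation is needed."""
    rng = random.Random(seed)
    start = rng.randrange(num_frames - clip_len + 1)
    original = list(range(start, start + clip_len))
    shuffled = original[:]
    rng.shuffle(shuffled)
    # Label: for each shuffled position, that frame's rank in temporal order.
    label = [sorted(shuffled).index(i) for i in shuffled]
    return shuffled, label
```

A model trained to predict `label` from the shuffled frames must learn temporal dynamics; VideoSSR's actual tasks and reward design are described in the paper.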
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

📝 Summary:
UniVA is an open-source multi-agent framework that unifies video understanding, segmentation, editing, and generation. It uses a Plan-and-Act architecture with hierarchical memory to enable complex, iterative video workflows. This system aims to advance agentic video intelligence.

🔹 Publication Date: Published on Nov 11

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.08521
• PDF: https://arxiv.org/pdf/2511.08521
• Project Page: https://univa.online/
• Github: https://github.com/univa-agent/univa

==================================


#VideoAI #AIagents #GenerativeAI #ComputerVision #OpenSource
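A toy sketch of the Plan-and-Act loop the summary describes: a planner decomposes a request into an ordered sequence of video tools, an actor executes each step, and a hierarchical memory records the global task context plus per-step traces so results can feed iterative workflows. Tool names and the keyword planner are illustrative assumptions, not UniVA's API:

```python
from dataclasses import dataclass, field

# Hypothetical tool registry standing in for UniVA's video capabilities.
TOOLS = {
    "understand": lambda clip: f"caption({clip})",
    "segment":    lambda clip: f"masks({clip})",
    "edit":       lambda clip: f"edited({clip})",
    "generate":   lambda clip: f"video({clip})",
}

@dataclass
class Memory:
    # Hierarchical memory: global task context plus per-step traces.
    global_ctx: list = field(default_factory=list)
    steps: list = field(default_factory=list)

def plan(request: str) -> list:
    """Toy planner: map keywords in the request to an ordered tool list."""
    return [name for name in TOOLS if name in request]

def act(request: str, clip: str) -> Memory:
    """Execute the plan step by step, chaining each output into the next."""
    mem = Memory(global_ctx=[request])
    for tool in plan(request):
        result = TOOLS[tool](clip)
        mem.steps.append((tool, result))
        clip = result  # iterative workflow: each step consumes the last output
    return mem
```

Chaining outputs through a shared memory is what lets one agent cover understanding, segmentation, editing, and generation in a single session.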
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

📝 Summary:
VANS is a new model for Video-Next-Event Prediction (VNEP) that generates dynamic, visually and semantically accurate video responses. It uses reinforcement learning to align a Vision-Language Model with a Video Diffusion Model, achieving state-of-the-art performance.

🔹 Publication Date: Published on Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16669
• PDF: https://arxiv.org/pdf/2511.16669
• Project Page: https://video-as-answer.github.io/
• Github: https://github.com/KlingTeam/VANS

==================================


#VideoAI #GenerativeAI #MachineLearning #ComputerVision #DeepLearning
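The "Joint-GRPO" in the title refers to Group Relative Policy Optimization, the RL algorithm used to align the two models. GRPO's key trick is computing advantages relative to a group of sampled responses instead of training a value network; a minimal sketch of that advantage computation (the joint VLM/diffusion reward design is specific to the paper):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward by the group's mean and standard deviation,
    avoiding the need for a learned value function."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # guard degenerate groups
    return [(r - mu) / sd for r in rewards]
```

Responses scoring above the group mean get positive advantage and are reinforced; those below are suppressed.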
Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

📝 Summary:
QTSplus is a query-aware token selector for long-video multimodal language models. It dynamically selects the most important visual tokens based on a text query, significantly compressing vision data and reducing latency while maintaining overall accuracy and enhancing temporal understanding.

🔹 Publication Date: Published on Nov 14

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11910
• HF Collection: https://huggingface.co/collections/AlpachinoNLP/qtsplus
• PDF: https://arxiv.org/pdf/2511.11910
• Project Page: https://qtsplus.github.io/
• Github: https://github.com/Siyou-Li/QTSplus

🔹 Models citing this paper:
https://huggingface.co/AlpachinoNLP/QTSplus-3B
https://huggingface.co/AlpachinoNLP/QTSplus-3B-FT

🔹 Spaces citing this paper:
https://huggingface.co/spaces/AlpachinoNLP/QTSplus-3B

==================================


#MultimodalAI #VideoAI #LLM #Tokenization #ComputerVision
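A sketch of the query-aware selection idea: score every visual token against the text-query embedding and keep only the top fraction before handing tokens to the LLM. Cosine similarity and the fixed keep ratio are simplifying assumptions; QTSplus's actual scorer and adaptive budget are described in the paper:

```python
import numpy as np

def select_tokens(vision_tokens: np.ndarray, query_emb: np.ndarray,
                  keep_ratio: float = 0.1):
    """Query-aware token selection sketch: rank visual tokens by cosine
    similarity to the text-query embedding and keep the top fraction,
    compressing the vision stream fed to the language model."""
    v = vision_tokens / np.linalg.norm(vision_tokens, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    scores = v @ q                           # one similarity per token
    k = max(1, int(len(vision_tokens) * keep_ratio))
    idx = np.argsort(scores)[-k:][::-1]      # indices of the k best tokens
    return vision_tokens[idx], idx
```

Because only query-relevant tokens survive, the LLM's context shrinks roughly by `keep_ratio`, which is where the latency reduction comes from.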