MiniCPM-V: A GPT-4V Level MLLM on Your Phone
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of #AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain that prevent MLLMs from being practical in real-world applications. The most notable is the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs must be deployed on high-performance cloud servers, which greatly limits their use in mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient #MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pre-training, and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) strong performance, outperforming GPT-4V-1106, Gemini Pro, and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks; (2) strong #OCR capability and 1.8M-pixel high-resolution #image perception at any aspect ratio; (3) trustworthy behavior with low hallucination rates; (4) multilingual support for 30+ languages; and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V exemplifies a promising trend: the model sizes needed for usable (e.g., GPT-4V-level) performance are rapidly decreasing while end-side computation capacity grows. Together, these trends show that GPT-4V-level MLLMs deployed on end devices are becoming increasingly feasible, unlocking a wider spectrum of real-world AI applications in the near future.
Paper: https://arxiv.org/pdf/2408.01800v1.pdf
Codes:
https://github.com/OpenBMB/MiniCPM-o
https://github.com/openbmb/minicpm-v
Datasets: Video-MME
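Quick-start sketch: the snippet below follows the transformers usage pattern shown in the OpenBMB repo's README. Treat it as a sketch rather than a definitive recipe; the remote-code API (the chat() helper and its arguments) may have changed between releases, so check the repo for the current version.
```python
# Minimal sketch of running MiniCPM-Llama3-V 2.5 via Hugging Face transformers,
# based on the usage pattern in the OpenBMB repo README; verify against the
# repo, as the remote-code API may differ across releases.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-Llama3-V-2_5"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
                                  torch_dtype=torch.float16).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("receipt.jpg").convert("RGB")  # any aspect ratio, up to ~1.8M pixels
msgs = [{"role": "user", "content": "Transcribe all text in this image."}]

# The repo exposes a chat() helper on the remote-code model class.
answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer,
                    sampling=True, temperature=0.7)
print(answer)
```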
#MachineLearning #DeepLearning #BigData #DataScience #ML #HealthTech #DataVisualization #ArtificialIntelligence #SoftwareEngineering #GenAI #ChatGPT #OpenAI #python #AI #keras #SQL #Statistics
https://t.iss.one/DataScienceT
✨When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs
📝 Summary:
A new framework explains how MLLMs resolve cross-modal conflicts by decomposing modality-following behavior into two factors: relative reasoning uncertainty and inherent modality preference. The tendency to follow a modality decreases as its relative uncertainty grows, and the inherent preference is measured at the balance point where both modalities are equally uncertain, yielding mechanistic insights.
🔹 Publication Date: Published on Nov 4
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.02243
• PDF: https://arxiv.org/pdf/2511.02243
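🔹 Code sketch: a rough illustration of the decomposition, not the authors' code; every name below is a hypothetical stand-in. Relative reasoning uncertainty can be proxied by the entropy gap between the two unimodal answer distributions, and the inherent preference is whatever following probability remains at the balance point where that gap is zero.
```python
# Toy sketch of the paper's decomposition (hypothetical throughout):
# modality following is modeled as a function of *relative* unimodal
# uncertainty; inherent preference is the value at the balance point.
import numpy as np

def entropy(p):
    """Shannon entropy of an answer distribution from a unimodal run."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return -(p * np.log(p + 1e-12)).sum()

# Answer distributions when the model sees only text vs. only the image
# for the same conflicting question (e.g., from repeated sampling).
p_text  = [0.70, 0.20, 0.10]
p_image = [0.40, 0.35, 0.25]

rel_uncertainty = entropy(p_image) - entropy(p_text)  # >0: image is less certain

def p_follow_image(rel_u, inherent_pref=0.6, slope=2.0):
    """P(follow image) falls as the image's relative uncertainty rises;
    at rel_u == 0 (the balance point) it equals the inherent preference."""
    logit = np.log(inherent_pref / (1 - inherent_pref)) - slope * rel_u
    return 1.0 / (1.0 + np.exp(-logit))

print(f"relative uncertainty: {rel_uncertainty:.3f}")
print(f"P(follow image):      {p_follow_image(rel_uncertainty):.3f}")
```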
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MLLMs #MultimodalAI #LLM #DeepLearning #AIResearch
✨SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning
📝 Summary:
SAIL-RL uses a dual-reward RL system to teach MLLMs when and how to think. This improves reasoning, reduces hallucinations, and achieves competitive performance against commercial models like GPT-4o.
🔹 Publication Date: Published on Nov 4
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.02280
• PDF: https://arxiv.org/pdf/2511.02280
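🔹 Code sketch: a schematic of the dual-reward idea with made-up reward definitions (the paper's actual rewards and weighting differ). One signal scores how the model reasons, the other whether it chose to reason at all, and the two are mixed before a GRPO-style group normalization.
```python
# Schematic dual-reward signal in the spirit of SAIL-RL (not the authors'
# implementation; all names and heuristics here are hypothetical).

def thinking_reward(trace: str, answer: str, gold: str) -> float:
    """Reward *how* the model thinks: correct answers backed by a
    non-trivial reasoning trace."""
    correct = float(answer.strip() == gold.strip())
    substantive = min(len(trace.split()) / 50.0, 1.0)  # crude depth proxy
    return correct * substantive

def judging_reward(used_thinking: bool, needs_thinking: bool) -> float:
    """Reward *when* to think: did the model reason only when warranted?"""
    return 1.0 if used_thinking == needs_thinking else 0.0

def total_reward(trace, answer, gold, used_thinking, needs_thinking,
                 alpha=0.7):
    # Weighted mix; a GRPO-style trainer would normalize this within a
    # group of sampled rollouts to obtain advantages.
    return (alpha * thinking_reward(trace, answer, gold)
            + (1 - alpha) * judging_reward(used_thinking, needs_thinking))
```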
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MLLMs #ReinforcementLearning #AI #GenerativeAI #DeepLearning
✨MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
📝 Summary:
MVU-Eval is a new comprehensive benchmark for evaluating Multi-Video Understanding in Multimodal Large Language Models. It addresses a critical gap in existing single-video benchmarks and reveals significant performance limitations in current MLLMs for multi-video scenarios.
🔹 Publication Date: Published on Nov 10
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.07250
• PDF: https://arxiv.org/pdf/2511.07250
• Project Page: https://huggingface.co/datasets/MVU-Eval-Team/MVU-Eval-Data
• Github: https://github.com/NJU-LINK/MVU-Eval
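🔹 Code sketch: since the benchmark data is hosted on the Hugging Face Hub (link above), loading it should look roughly like the snippet below. The split and field names are assumptions, so check the dataset card for the real schema.
```python
# Minimal sketch of pulling the MVU-Eval data from the Hugging Face Hub
# (repo id taken from the post's dataset link).
from datasets import load_dataset

ds = load_dataset("MVU-Eval-Team/MVU-Eval-Data", split="test")  # split name assumed
example = ds[0]
print(example.keys())  # e.g., video paths, question, options, answer (schema assumed)
```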
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MLLMs #VideoUnderstanding #AI #Benchmarking #ComputerVision
✨MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning
📝 Summary:
MathSE improves MLLMs' mathematical reasoning by iteratively refining them: it uses inference, reflection, and reward-based feedback instead of static datasets. This significantly boosts performance on tough benchmarks, outperforming leading open-source models.
🔹 Publication Date: Published on Nov 10
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.06805
• PDF: https://arxiv.org/pdf/2511.06805
• Project Page: https://zheny2751-dotcom.github.io/MathSE.github.io/
• Github: https://github.com/zheny2751-dotcom/MathSE
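🔹 Code sketch: the loop below sketches the self-evolving recipe as described in the summary, not the authors' implementation; generate, score, and finetune are hypothetical stand-ins for the MLLM, the outcome reward model, and the fine-tuning step.
```python
# Schematic of a MathSE-style self-evolving loop (not the authors' code).
from typing import Callable, List, Tuple

def self_evolve(generate: Callable[[str], str],
                score: Callable[[str, str], float],
                finetune: Callable[[List[Tuple[str, str]]], None],
                problems: List[str], rounds: int = 3) -> None:
    pool: List[Tuple[str, str]] = []  # accumulated (problem, good solution) pairs
    for _ in range(rounds):
        for prob in problems:
            sol = generate(prob)                      # inference
            if score(prob, sol) > 0.5:                # reward-guided filter
                pool.append((prob, sol))
                continue
            # Reflection: critique the failed attempt, then retry once.
            critique = generate(f"Find the flaw in this solution to {prob}: {sol}")
            revised = generate(f"Solve {prob}, avoiding this flaw: {critique}")
            if score(prob, revised) > 0.5:
                pool.append((prob, revised))
        finetune(pool)                                # evolve on collected traces

# Smoke test with trivial stubs:
self_evolve(generate=lambda p: "42", score=lambda p, s: 1.0,
            finetune=lambda pool: None, problems=["1+1?"], rounds=1)
```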
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MathematicalReasoning #MLLMs #AI #MachineLearning #ReinforcementLearning
✨SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization
📝 Summary:
SafeGRPO introduces a self-rewarded, rule-governed framework for multimodal safety alignment in MLLMs. It integrates verifiable reward construction and step-guided safety thinking to improve robustness against compositional risks and enhance reasoning stability.
🔹 Publication Date: Published on Nov 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.12982
• PDF: https://arxiv.org/pdf/2511.12982
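🔹 Code sketch: to make "rule-governed, verifiable reward" concrete, here is a toy sketch with made-up rules (the paper's actual rule set and reward shaping differ). Deterministic checks on the model's structured output replace a learned judge, so the reward is fully verifiable.
```python
# Toy rule-governed, verifiable reward in the spirit of SafeGRPO
# (hypothetical rules, not the authors').
import re

def rule_reward(response: str, is_harmful_input: bool) -> float:
    reward = 0.0
    # Format rule: the response must expose its safety reasoning in tags.
    has_think = bool(re.search(r"<think>.*?</think>", response, re.S))
    reward += 0.5 if has_think else -0.5
    # Safety rule: harmful inputs must be refused, benign ones answered.
    refused = "I can't help with that" in response
    if is_harmful_input:
        reward += 1.0 if refused else -1.0
    else:
        reward += 1.0 if not refused else -0.5
    return reward

print(rule_reward("<think>the image + text combine into a weapon request</think> "
                  "I can't help with that.", is_harmful_input=True))  # 1.5
```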
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MLLMs #AISafety #MultimodalAI #ReinforcementLearning #AIResearch
✨REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
📝 Summary:
Text-only self-reflection is insufficient for long-form video understanding. REVISOR is a new framework enabling MLLMs to perform multimodal introspective reflection across text and visual modalities. This significantly enhances reasoning over long videos without extra fine-tuning, achieving strong performance.
🔹 Publication Date: Published on Nov 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13026
• PDF: https://arxiv.org/pdf/2511.13026
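🔹 Code sketch: conceptually, the difference from text-only reflection can be sketched as below (hypothetical stand-ins throughout, not the authors' code): the reflection step may request a specific video segment to re-inspect before revising the answer, instead of reflecting on the text alone.
```python
# Conceptual sketch of multimodal (REVISOR-style) reflection: the model can
# ask to re-watch a segment of the video while reflecting. All components
# are illustrative stubs.
from typing import Callable, List, Optional, Tuple

def answer_with_revision(
    answer: Callable[[List[str], str], str],          # MLLM over frames + prompt
    pick_segment: Callable[[str, str], Optional[Tuple[int, int]]],  # introspection
    frames: List[str], question: str, max_rounds: int = 2) -> str:
    draft = answer(frames[::30], question)            # sparse first pass
    for _ in range(max_rounds):
        span = pick_segment(question, draft)          # "which part should I re-watch?"
        if span is None:                              # reflection is satisfied
            break
        lo, hi = span
        draft = answer(frames[lo:hi], question)       # dense re-inspection of the span
    return draft
```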
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MultimodalAI #VideoUnderstanding #MLLMs #AIResearch #ComputerVision
✨PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
📝 Summary:
PC-Agent is a hierarchical multi-agent framework that improves MLLM-based GUI agents on complex PC tasks. It combines an Active Perception Module with a hierarchical decision-making architecture of Manager, Progress, and Decision agents, while a Reflection agent provides feedback. It achieved a 32% improvement in task success rate.
🔹 Publication Date: Published on Feb 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2502.14282
• PDF: https://arxiv.org/pdf/2502.14282
• Github: https://github.com/X-PLUG/MobileAgent/tree/main/PC-Agent
✨ Spaces citing this paper:
• https://huggingface.co/spaces/junyangwang0410/PC-Agent
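🔹 Code sketch: a minimal rendering of the hierarchy described above, with illustrative stubs in place of the real MLLM calls and GUI perception; the actual agents in the repo are far richer.
```python
# Schematic of PC-Agent's hierarchy (illustrative stubs, not the authors'
# code): Manager decomposes, Decision acts, Reflection verifies, and the
# Progress state records completed subtasks.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    instruction: str
    subtasks: List[str] = field(default_factory=list)
    done: List[str] = field(default_factory=list)

def manager(task: Task) -> None:
    # Decompose the instruction into subtasks (stubbed as a naive split).
    task.subtasks = [s.strip() for s in task.instruction.split(",")]

def decision(subtask: str) -> str:
    return f"click/type actions for: {subtask}"   # would call an MLLM + GUI perception

def reflection(subtask: str, result: str) -> bool:
    return "error" not in result                  # would compare screenshots pre/post

def run(task: Task) -> None:
    manager(task)
    for sub in task.subtasks:
        result = decision(sub)                    # Decision agent executes
        if reflection(sub, result):               # Reflection agent verifies
            task.done.append(sub)                 # progress state updates

t = Task("open browser, search weather, save screenshot")
run(t)
print(t.done)
```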
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MultiAgentSystems #AIAgents #MLLMs #PCAutomation #DeepLearning