ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
MiniCPM-V: A GPT-4V Level MLLM on Your Phone

The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of #AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain that prevent MLLMs from being practical in real-world applications. The most notable is the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs must be deployed on high-performing cloud servers, which greatly limits their use in mobile, offline, energy-sensitive, and privacy-sensitive scenarios.

In this work, we present MiniCPM-V, a series of efficient #MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining, and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features:
(1) Strong performance, outperforming GPT-4V-1106, Gemini Pro, and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks;
(2) strong #OCR capability and 1.8M-pixel high-resolution #image perception at any aspect ratio;
(3) trustworthy behavior with low hallucination rates;
(4) multilingual support for 30+ languages;
(5) efficient deployment on mobile phones.

More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: the model sizes needed to achieve usable (e.g., GPT-4V-level) performance are rapidly decreasing, alongside the fast growth of end-side computation capacity. Together, these trends show that GPT-4V-level MLLMs deployed on end devices are becoming increasingly feasible, unlocking a wider spectrum of real-world AI applications in the near future.
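The "1.8M-pixel perception at any aspect ratio" figure implies a pixel budget rather than a fixed input resolution. As a hedged illustration only (the function below is hypothetical and is not the paper's actual adaptive image-slicing scheme), an image can be downscaled to such a budget while keeping its aspect ratio:

```python
import math

def fit_to_pixel_budget(width: int, height: int, budget: int = 1_800_000):
    """Scale (width, height) down so the image has at most `budget`
    pixels, preserving aspect ratio. Returns new integer dimensions."""
    pixels = width * height
    if pixels <= budget:
        return width, height  # already within budget, no resize needed
    scale = math.sqrt(budget / pixels)  # uniform scale factor
    return max(1, int(width * scale)), max(1, int(height * scale))

# A 4000x3000 photo (12 MP) is scaled to roughly 1.8 MP at the same 4:3 ratio.
print(fit_to_pixel_budget(4000, 3000))
```

The real MiniCPM-V pipeline partitions high-resolution inputs into slices for the visual encoder; this sketch only shows the budget arithmetic.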

Paper: https://arxiv.org/pdf/2408.01800v1.pdf

Codes:
https://github.com/OpenBMB/MiniCPM-o
https://github.com/openbmb/minicpm-v

Datasets: Video-MME

#MachineLearning #DeepLearning #BigData #DataScience #ML #HealthTech #DataVisualization #ArtificialIntelligence #SoftwareEngineering #GenAI #ChatGPT #OpenAI #python #AI #keras #SQL #Statistics

https://t.iss.one/DataScienceT ❤️
🤖🧠 PaddleOCR-VL: Redefining Multilingual Document Parsing with a 0.9B Vision-Language Model

🗓️ 20 Oct 2025
📚 AI News & Trends

In an era where information is predominantly digital, the ability to extract, interpret, and organize data from documents is crucial. From invoices and research papers to multilingual contracts and handwritten notes, document parsing stands at the intersection of vision and language. Traditional Optical Character Recognition (OCR) systems have made impressive strides, but they often fall ...

#PaddleOCR-VL #Multilingual #DocumentParsing #VisionLanguageModel #OCR #AI
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

📝 Summary:
PaddleOCR-VL is a new 0.9B vision-language model for document parsing. It uses a NaViT-style visual encoder and ERNIE-4.5, achieving state-of-the-art performance across 109 languages with minimal resources and fast inference. This model is highly suitable for practical deployment.

🔹 Publication Date: Published on Oct 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.14528
• PDF: https://arxiv.org/pdf/2510.14528
• Github: https://github.com/PaddlePaddle/PaddleOCR

🔹 Models citing this paper:
https://huggingface.co/PaddlePaddle/PaddleOCR-VL
https://huggingface.co/PaddlePaddle/PP-DocLayoutV2
https://huggingface.co/lvyufeng/PaddleOCR-VL-0.9B

Spaces citing this paper:
https://huggingface.co/spaces/PaddlePaddle/PaddleOCR-VL_Online_Demo
https://huggingface.co/spaces/markobinario/PaddleOCR-VL_Online_Demo
https://huggingface.co/spaces/waytoAGI/PaddleOCR-VL_Online_Demo

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#OCR #VisionLanguageModel #DocumentAI #DeepLearning #AI
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

📝 Summary:
olmOCR is an open-source toolkit that uses a fine-tuned vision language model to convert PDFs into clean, structured text. It enables large-scale, cost-effective extraction of trillions of tokens for training language models.

🔹 Publication Date: Published on Feb 25

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2502.18443
• PDF: https://arxiv.org/pdf/2502.18443
• Github: https://github.com/allenai/olmocr

Datasets citing this paper:
https://huggingface.co/datasets/davanstrien/test-olmocr2
https://huggingface.co/datasets/davanstrien/newspapers-olmocr2
https://huggingface.co/datasets/stckmn/ocr-output-Directive017-1761355297

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#OCR #VLMs #LLM #DataExtraction #OpenSource
HunyuanOCR Technical Report

📝 Summary:
HunyuanOCR is a lightweight Vision-Language Model for OCR, built on a unified end-to-end ViT + LLM architecture. It achieves state-of-the-art performance across diverse tasks, outperforming larger models and commercial APIs, powered by data-driven training and reinforcement-learning strategies.
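The summary describes a unified end-to-end ViT + LLM design: a vision transformer encodes the page image into visual tokens, and a language-model decoder generates the OCR text conditioned on them. As a structural sketch only (all class and function names here are hypothetical, not HunyuanOCR's API), the data flow can be illustrated with dummy components:

```python
from typing import List

class VisionEncoder:
    """Stand-in for a ViT: produces one 'embedding' per pixel row."""

    def encode(self, image: List[List[int]]) -> List[float]:
        # Real ViTs emit one embedding per image patch; averaging
        # each pixel row is just a placeholder for that step.
        return [sum(row) / len(row) for row in image]

class LanguageDecoder:
    """Stand-in for the LLM decoder conditioned on visual tokens."""

    def generate(self, visual_tokens: List[float]) -> str:
        # A real decoder emits text autoregressively; the placeholder
        # only reports how many visual tokens it was conditioned on.
        return f"<text decoded from {len(visual_tokens)} visual tokens>"

def ocr_pipeline(image: List[List[int]]) -> str:
    """End-to-end flow: image -> visual tokens -> decoded text."""
    tokens = VisionEncoder().encode(image)
    return LanguageDecoder().generate(tokens)

print(ocr_pipeline([[0, 255], [255, 0]]))
```

The point of the unified design is that both stages are trained jointly, so no separate detection/recognition pipeline is needed; this sketch only mirrors that two-stage data flow.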

🔹 Publication Date: Published on Nov 24

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.19575
• PDF: https://arxiv.org/pdf/2511.19575
• Github: https://github.com/Tencent-Hunyuan/HunyuanOCR

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#OCR #VisionLanguageModel #LLM #AI #MachineLearning
NVIDIA Nemotron Parse 1.1

📝 Summary:
Nemotron-Parse-1.1 is a lightweight OCR and document parsing model with improved capabilities. It excels in general OCR, markdown, structured tables, and text extraction from images using an encoder-decoder architecture. The model achieves competitive accuracy and is publicly released.

🔹 Publication Date: Published on Nov 25

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.20478
• PDF: https://arxiv.org/pdf/2511.20478

🔹 Models citing this paper:
https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1
https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1-TC

Spaces citing this paper:
https://huggingface.co/spaces/prithivMLmods/NVIDIA-Nemotron-Parse-OCR

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#OCR #DocumentParsing #DeepLearning #AI #NVIDIA