FastVLM: Efficient Vision Encoding for Vision Language Models
17 Dec 2024 · Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari ·
Paper: https://arxiv.org/pdf/2412.13303v1.pdf
Code: https://github.com/apple/ml-fastvlm
Datasets: GQA - TextVQA - ScienceQA
https://t.iss.one/DataScienceT
📱 WhatsApp Channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as #ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. Based on a comprehensive efficiency analysis of the interplay between image resolution, vision latency, token count, and LLM size, we introduce FastVLM, a model that achieves an optimized trade-off between latency, model size, and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Unlike previous methods, FastVLM achieves the optimal balance between visual token count and image resolution solely by scaling the input image, eliminating the need for additional token pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM achieves a 3.2× improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. Compared to LLaVA-OneVision at the highest resolution (1152×1152), #FastVLM achieves comparable performance on key benchmarks like SeedBench and MMMU, using the same 0.5B #LLM, but with 85× faster TTFT and a vision encoder that is 3.4× smaller.
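The two optimization axes above show up directly in a back-of-the-envelope TTFT model: encoding latency grows with pixel count, while LLM prefill grows with the number of visual tokens. A minimal sketch (not the authors' code; the patch size, downsampling factor, and latency constants are illustrative assumptions):
```python
# Sketch: how visual token count scales with resolution, and why an encoder
# that emits fewer tokens shrinks the LLM prefill term of TTFT.

def visual_token_count(resolution: int, patch_size: int = 16, downsample: int = 1) -> int:
    """Tokens from a patch-based encoder, optionally spatially downsampled."""
    tokens_per_side = resolution // (patch_size * downsample)
    return tokens_per_side ** 2

def approx_ttft_ms(resolution: int, encoder_ms_per_mpixel: float,
                   llm_ms_per_token: float, downsample: int = 1) -> float:
    """TTFT ~ vision encoding latency + LLM prefill over the visual tokens."""
    encode_ms = encoder_ms_per_mpixel * (resolution ** 2) / 1e6
    prefill_ms = llm_ms_per_token * visual_token_count(resolution, downsample=downsample)
    return encode_ms + prefill_ms

# A ViT-style encoder at 1152x1152 emits (1152/16)^2 = 5184 tokens; a hybrid
# encoder that downsamples 4x emits only 324, shrinking the prefill term.
print(approx_ttft_ms(1152, encoder_ms_per_mpixel=30.0, llm_ms_per_token=0.2))
print(approx_ttft_ms(1152, encoder_ms_per_mpixel=30.0, llm_ms_per_token=0.2, downsample=4))
```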
Generating Physically Stable and Buildable LEGO Designs from Text
8 May 2025 · Ava Pun, Kangle Deng, Ruixuan Liu, Deva Ramanan, Changliu Liu, Jun-Yan Zhu ·
Paper: https://arxiv.org/pdf/2505.05469v1.pdf
Code: https://github.com/AvaLovelace1/LegoGPT
Quick start: https://huggingface.co/spaces/cmu-gil/LegoGPT-Demo
Dataset: StableText2Lego
📱 WhatsApp Channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
We introduce #LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of LEGO designs, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during autoregressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts. We also develop a text-based LEGO texturing method to generate colored and textured designs. We show that our designs can be assembled manually by humans and automatically by robotic arms. We also release our new dataset, StableText2Lego, containing over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models at the project website: https://avalovelace1.github.io/LegoGPT/
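As a rough illustration of the inference-time validity check and physics-aware rollback described above, here is a hedged toy sketch (not the authors' implementation): bricks are simplified to unit-height boxes, a random proposer stands in for the model's ranked next-brick predictions, and support from below is a crude proxy for the full stability analysis.
```python
import random

def cells(brick):
    # A brick is (x, y, z, w, d): position plus footprint of a 1-high brick.
    x, y, z, w, d = brick
    return {(x + i, y + j, z) for i in range(w) for j in range(d)}

def collides(brick, design):
    occupied = set().union(*(cells(b) for b in design)) if design else set()
    return bool(cells(brick) & occupied)

def supported(brick, design):
    if brick[2] == 0:
        return True  # rests on the baseplate
    below = {(x, y, z + 1) for b in design for (x, y, z) in cells(b)}
    return bool(cells(brick) & below)  # overlaps a brick one layer down

def propose(k=8):
    # Stand-in for the model's ranked next-brick predictions.
    return [(random.randint(0, 6), random.randint(0, 6), random.randint(0, 3),
             random.choice([1, 2, 4]), random.choice([1, 2])) for _ in range(k)]

def generate(max_bricks=20, budget=500):
    design = []
    while len(design) < max_bricks and budget > 0:
        budget -= 1
        for brick in propose():
            # Validity check: prune candidates that overlap or float in midair.
            if not collides(brick, design) and supported(brick, design):
                design.append(brick)
                break
        else:
            if design:
                design.pop()  # physics-aware rollback: resample from an earlier state
    return design

print(len(generate()), "bricks placed")
```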
Saint-Étienne University has introduced a new 3D human body pose estimation pipeline designed specifically for dance analysis.
Check out the project page featuring results and an interactive demo!
#DanceAnalysis #3DPoseEstimation #DeepLearning #HumanPose #AI #MachineLearning #ComputerVisionResearch
Forwarded from Python | Machine Learning | Coding | R
This channel is for programmers, coders, and software engineers.
0️⃣ Python
1️⃣ Data Science
2️⃣ Machine Learning
3️⃣ Data Visualization
4️⃣ Artificial Intelligence
5️⃣ Data Analysis
6️⃣ Statistics
7️⃣ Deep Learning
8️⃣ Programming Languages
✅ https://t.iss.one/addlist/8_rRW2scgfRhOTc0
✅ https://t.iss.one/Codeprogrammer
Flow-GRPO: Training Flow Matching Models via Online RL
8 May 2025 · Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang ·
Paper: https://arxiv.org/pdf/2505.05470v2.pdf
Code: https://github.com/yifan123/flow_grpo
HF: https://huggingface.co/spaces/jieliu/SD3.5-M-Flow-GRPO
Datasets: DrawBench - GenEval - T2I-CompBench
Notes: Ranked #1 on Text-to-Image Generation on GenEval
🔗 Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk
📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy improves from 59% to 92%, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, very little reward hacking occurred, meaning rewards did not increase at the cost of appreciable image quality or diversity degradation.
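For the RL side, here is a minimal sketch of the group-relative advantage that GRPO-style methods compute over a group of samples drawn for one prompt; the tensor shapes and reward values are illustrative assumptions, not the paper's code.
```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards, one per sampled image.
    Normalizing within each group removes the need for a learned value baseline."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g., GenEval-style scores for 4 images sampled from one prompt
rewards = torch.tensor([[0.2, 0.9, 0.5, 0.4]])
print(group_relative_advantages(rewards))
```
The ODE-to-SDE conversion supplies the stochasticity these groups need: sampling from the matching SDE yields diverse images per prompt, which is what makes the within-group comparison informative.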
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
9 May 2025 · Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li ·
Paper: https://arxiv.org/pdf/2505.06111v2.pdf
Code: https://github.com/opendrivelab/univla
Datasets: R2R - VLN-CE - Open-X-Embodiment
🔗 Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk
📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to a single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of the pretraining compute and 1/10 of the downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
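A hedged sketch of what a latent action model can look like in this spirit: encode a pair of consecutive (DINO-style) frame features, quantize against a small codebook so the chosen index acts as a discrete "latent action", and train by reconstructing the next frame's features. All dimensions and the VQ design below are illustrative assumptions, not the released architecture.
```python
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    def __init__(self, feat_dim=768, code_dim=128, num_codes=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * feat_dim, 512), nn.GELU(),
                                     nn.Linear(512, code_dim))
        self.codebook = nn.Embedding(num_codes, code_dim)  # discrete action vocabulary
        self.decoder = nn.Sequential(nn.Linear(feat_dim + code_dim, 512), nn.GELU(),
                                     nn.Linear(512, feat_dim))

    def forward(self, feat_t, feat_t1):
        z = self.encoder(torch.cat([feat_t, feat_t1], dim=-1))
        # Nearest-codebook quantization: the chosen index is the latent action.
        action = torch.cdist(z, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(action)
        z_q = z + (z_q - z).detach()  # straight-through estimator for gradients
        pred_t1 = self.decoder(torch.cat([feat_t, z_q], dim=-1))
        return pred_t1, action

model = LatentActionModel()
f_t, f_t1 = torch.randn(4, 768), torch.randn(4, 768)
pred, act = model(f_t, f_t1)
loss = nn.functional.mse_loss(pred, f_t1)  # reconstruct the next frame's features
```
Because the action that best explains the transition must be inferred from features alone, the same discrete vocabulary can be learned from videos of many different robots (and humans), which is what enables the cross-embodiment transfer.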
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
24 Apr 2025 · Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang ·
Paper: https://arxiv.org/pdf/2504.17192v2.pdf
Code: https://github.com/going-doer/paper2code
📱 WhatsApp Channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, specifically from the original paper authors, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins. Code is available at: https://github.com/going-doer/Paper2Code.
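A minimal sketch of such a planning, analysis, and generation pipeline, with a generic `llm(prompt)` callable standing in for the specialized agents; the prompts and structure are illustrative assumptions, not the authors' code.
```python
# Three-stage paper-to-repo sketch: plan the repo, analyze each file,
# then generate code in dependency order with earlier files in context.

def paper_to_repo(llm, paper_text: str) -> dict:
    # Stage 1: planning - roadmap, architecture, file list, configuration.
    plan = llm(f"Draft a repo plan (files, dependencies, config) for:\n{paper_text}")
    files = llm(f"List the files to implement, in dependency order:\n{plan}").splitlines()

    # Stage 2: analysis - implementation-specific details per file.
    analyses = {f: llm(f"Detail what {f} must implement, per the paper:\n{paper_text}")
                for f in files}

    # Stage 3: generation - dependency-aware code, conditioned on prior files.
    repo = {}
    for f in files:
        context = "\n\n".join(f"# {name}\n{code}" for name, code in repo.items())
        repo[f] = llm(f"Write {f}.\nSpec: {analyses[f]}\nExisting code:\n{context}")
    return repo
```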
Forwarded from Thomas
🎁❗️TODAY FREE❗️🎁
Entry to our VIP channel is completely free today. Tomorrow it will cost $500! 🔥
JOIN 👇
https://t.iss.one/+VKT2Gy3kE6A4NTE5
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
6 May 2025 · Andrew Zhao, Yiran Wu, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang ·
Paper: https://arxiv.org/pdf/2505.03335v2.pdf
Code: https://github.com/LeapLabTHU/Absolute-Zero-Reasoner
📱 WhatsApp Channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (#AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall #SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
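A toy sketch of one propose-validate-solve step, with hypothetical `propose`/`solve` callables standing in for the single self-play model; the proposer reward here is a crude proxy for the solve-rate (learnability) statistic AZR actually uses, and real sandboxing of the executor is omitted.
```python
def azr_step(propose, solve, history):
    program, test_input = propose(history)   # task proposal (hypothetical API)
    try:
        scope = {}
        exec(program, scope)                 # executor validates the task
        truth = scope["f"](test_input)       # ground-truth answer via execution
    except Exception:
        return 0.0, 0.0                      # infeasible task: no reward

    answer = solve(program, test_input)      # solver predicts without executing
    solver_reward = float(answer == truth)   # verifiable outcome reward
    proposer_reward = 1.0 - solver_reward    # crude learnability proxy
    return proposer_reward, solver_reward

# Toy demo with stand-in "models".
p, s = azr_step(lambda h: ("def f(x):\n    return x * 2", 21),
                lambda prog, x: 42, history=[])
print(p, s)
```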
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
14 May 2025 · Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu ·
Paper: https://arxiv.org/pdf/2505.09568v1.pdf
Code: https://github.com/jiuhaichen/blip3o
Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (first training on image understanding and subsequently on image generation) offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset BLIP3o-60k for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.
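A hedged sketch of the central design choice: a diffusion transformer trained with a flow-matching-style objective to denoise CLIP image features (rather than VAE latents), conditioned on the multimodal LLM's hidden states. The shapes, the stand-in `nn.Transformer`, and the exact loss form are illustrative assumptions, not the released training code.
```python
import torch
import torch.nn as nn

dit = nn.Transformer(d_model=768, batch_first=True)  # stand-in for the DiT

def flow_matching_loss(clip_feats, cond):
    """clip_feats: (B, N, 768) target CLIP image features; cond: LLM hidden states."""
    noise = torch.randn_like(clip_feats)
    t = torch.rand(clip_feats.size(0), 1, 1)   # one random timestep per sample
    x_t = (1 - t) * noise + t * clip_feats     # linear interpolation path
    target_v = clip_feats - noise              # velocity the model must match
    pred_v = dit(src=cond, tgt=x_t)            # predict velocity from x_t + condition
    return nn.functional.mse_loss(pred_v, target_v)

loss = flow_matching_loss(torch.randn(2, 64, 768), torch.randn(2, 77, 768))
print(loss.item())
```
Targeting CLIP features keeps the generation objective in a semantically aligned space, which is the stated reason for the training-efficiency and quality gains over pixel-space VAE latents.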
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
13 May 2025 · Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng ·
Paper: https://arxiv.org/pdf/2505.08617v1.pdf
Code: https://github.com/zhaochen0110/openthinkimg
✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk
📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effectively. To address these gaps, we introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. It features standardized vision tool interfaces, scalable trajectory generation for policy initialization, and a flexible training environment. Furthermore, since supervised fine-tuning (SFT) on static demonstrations offers limited policy generalization for dynamic tool invocation, we propose a novel reinforcement learning (RL) framework, V-ToolRL, to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies by directly optimizing for task success using feedback from tool interactions. We empirically validate V-ToolRL on challenging chart reasoning tasks. Our RL-trained agent, built upon a Qwen2-VL-2B, significantly outperforms its SFT-initialized counterpart (+28.83 points) and surpasses established supervised tool-learning baselines like Taco and CogCom by an average of +12.7 points. Notably, it also surpasses prominent closed-source models like GPT-4.1 by +8.68 accuracy points. We hope OpenThinkIMG can serve as a foundational framework for advancing dynamic, tool-augmented visual reasoning, helping the community develop AI agents that can genuinely "think with images".
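A minimal sketch of a V-ToolRL-style rollout, where the policy alternates between tool calls and a final answer, and task success supplies the scalar reward; the tool names, the `policy` API, and the reward shape are assumptions for illustration, not the framework's actual interfaces.
```python
def rollout(policy, tools, image, question, max_steps=6):
    """Run one tool-augmented episode; returns the trajectory and final answer."""
    trajectory, observation = [], image
    for _ in range(max_steps):
        action = policy.act(question, observation, trajectory)  # e.g. {"tool": "crop", ...}
        trajectory.append(action)
        if action["tool"] == "answer":
            return trajectory, action["text"]
        fn = tools[action["tool"]]                  # invoke the external vision tool
        observation = fn(observation, **action.get("args", {}))
    return trajectory, None                         # ran out of interaction budget

def reward(pred, gold):
    # Outcome-based reward that drives the policy update (GRPO/PPO-style).
    return 1.0 if pred is not None and pred.strip() == gold.strip() else 0.0
```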
Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs
🖥 Github: https://github.com/reml-group/deliberation-on-priors
📕 Paper: https://arxiv.org/abs/2505.15210v1
✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk
📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
PIA S5 Proxy solves all problems for AI developers.
🔥 Why top AI teams choose PIA S5 Proxy:
🔹 SOCKS5 proxy: as low as $0.045/IP
✅ Global high-quality IP | No traffic limit / static IP
✅ High success rate >99.9% | Ultra-low latency | Stable anti-ban
✅ Smart crawling API, supports seamless integration
🔹 Unlimited traffic proxy: only $79/day
✅ Unlimited traffic | Unlimited concurrency | Bandwidth over 100Gbps | Customization supported
✅ Best for large-scale AI / LLM data collection
✅ Save up to 90% on crawling costs
✨ Exclusive for new users:
Enter the coupon code [AI Python] to enjoy a 10% discount!
🚀 Buy now: https://www.piaproxy.com/?co=piaproxy&ck=?ai
Forwarded from Python | Machine Learning | Coding | R
🎯 Start your professional programming journey with
#Python_Mastery_Course 🐍
Do you want to learn the most in-demand programming language in the world?
Do you dream of reaching fields like artificial intelligence, data analysis, or UI design?
📢 This course was designed to be your launch point toward the future!
________________________________________
🚀 What will you learn in this course?
🔹 Module 1: Python fundamentals (variables, data types, operations, code basics)
🔹 Module 2: Control flow (conditionals, loops, control statements)
🔹 Module 3: Data structures (lists, dictionaries, sets, tuples)
🔹 Module 4: Functions (definitions, parameters, scope, recursion)
🔹 Module 5: Modules
🔹 Module 6: Working with files and CSV files
🔹 Module 7: Professional exception handling
🔹 Module 8: Object-oriented programming (OOP)
🔹 Module 9: Advanced concepts:
✅ Generators
✅ Iterators
✅ Decorators
💡 By the end, you will be able to:
✔️ Build real projects in Python
✔️ Move confidently into advanced fields such as artificial intelligence and data analysis
✔️ Automate tasks and work with data professionally
🎥 Course format:
• Live sessions with the instructor, Dr. Mohammad Emad Arafa
• All lectures are uploaded to the site so you can watch them whenever it suits you
🕒 Course duration: 25 training hours
📅 Start date: 15 June
💰 Early-bird discount available
Contact us now and mention course code "001"
https://t.iss.one/Agartha_Support
NVIDIA introduces GENMO, a unified generalist model for human motion that seamlessly combines motion estimation and generation within a single framework. GENMO supports conditioning on videos, 2D keypoints, text, music, and 3D keyframes, enabling highly versatile motion understanding and synthesis.
Currently, no official code release is available.
Review:
https://t.ly/Q5T_Y
Paper:
https://lnkd.in/ds36BY49
Project Page:
https://lnkd.in/dAYHhuFU
#NVIDIA #GENMO #HumanMotion #DeepLearning #AI #ComputerVision #MotionGeneration #MachineLearning #MultimodalAI #3DReconstruction
Forwarded from Data Science Machine Learning Data Analysis
Generative AI with LangChain 2025
Docker Deep Dive 2025
Available now in our paid channels
Join our paid channel for free books and courses ($2 per month)
https://t.iss.one/+r_Tcx2c-oVU1OWNi
Forwarded from Thomas
🪙 +$30,560 from $300 in a month of trading! We can teach you how to earn, for FREE!
It was a challenge: a marathon from $300 to $30,000 in trading, together with Lisa!
What is the essence of the earnings? "Analyze and open a trade on the exchange, knowing where the currency rate will go. Lisa trades every day and posts signals on her channel for free."
🔹 Start: $150
🔹 Goal: $20,000
🔹 Period: 1.5 months
Join and get started; there will be no second chance 👇
https://t.iss.one/+HjHm7mxR5xllNTY5
SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding
🖥 Github: https://github.com/haoningwu3639/SpatialScore
📕 Paper: https://arxiv.org/abs/2505.17012v1
🔗 Tasks: https://paperswithcode.com/task/motion-estimation
https://t.iss.one/DataScienceT
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
20 May 2025 · Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, Hao Liu, Can Huang ·
Paper: https://arxiv.org/pdf/2505.14059v1.pdf
Code: https://github.com/bytedance/dolphin
Dataset: PubTabNet
✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk
📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
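A hedged sketch of the analyze-then-parse flow: stage one emits layout anchors in reading order, stage two parses every anchor in parallel with an element-specific prompt. The `model` API, the prompt strings, and the PIL-style `crop` call are illustrative assumptions, not the released interface.
```python
from concurrent.futures import ThreadPoolExecutor

PROMPTS = {"table": "Parse this table to HTML.",
           "formula": "Transcribe this formula as LaTeX.",
           "text": "Read this paragraph."}

def parse_page(model, page_image):
    # Stage 1: layout analysis, producing anchors in reading order,
    # e.g. [{"type": "table", "box": (x0, y0, x1, y1)}, ...]
    anchors = model.analyze(page_image)

    # Stage 2: parallel, anchor-conditioned content parsing.
    def parse_one(anchor):
        crop = page_image.crop(anchor["box"])
        prompt = PROMPTS.get(anchor["type"], PROMPTS["text"])
        return model.parse(crop, prompt=prompt)

    with ThreadPoolExecutor() as pool:
        contents = list(pool.map(parse_one, anchors))
    return list(zip(anchors, contents))
```
Parsing elements in parallel rather than decoding the whole page autoregressively is what gives the efficiency gain the abstract claims, while the reading-order anchors preserve the layout structure.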