Data Science | Machine Learning with Python for Researchers
31.5K subscribers
1.6K photos
102 videos
22 files
1.87K links
Admin: @HusseinSheikho

The Data Science and Python channel is for researchers and advanced programmers

Buy ads: https://telega.io/c/dataScienceT
πŸ’ƒ GENMO: Generalist Human Motion by NVIDIA πŸ’ƒ

NVIDIA introduces GENMO, a unified generalist model for human motion that seamlessly combines motion estimation and generation within a single framework. GENMO supports conditioning on videos, 2D keypoints, text, music, and 3D keyframes, enabling highly versatile motion understanding and synthesis.

Currently, no official code release is available.
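
Since the code is not out yet, here is a purely hypothetical sketch of what a unified multimodal conditioning interface for such a generalist motion model could look like. All names, shapes, and the placeholder generator below are assumptions, not NVIDIA's API.

```python
# Hypothetical sketch only: GENMO's code is not released, so every name and
# shape here is an assumption, not NVIDIA's implementation.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class MotionConditions:
    video: Optional[np.ndarray] = None         # (T, H, W, 3) frames
    keypoints_2d: Optional[np.ndarray] = None  # (T, J, 2) joint positions
    text: Optional[str] = None                 # free-form prompt
    music: Optional[np.ndarray] = None         # (T, D) audio features
    keyframes_3d: Optional[np.ndarray] = None  # (K, 24, 3) sparse 3D poses

def generate_motion(cond: MotionConditions, n_frames: int = 120) -> np.ndarray:
    """Placeholder generalist model: returns (n_frames, 24, 3) joint trajectories."""
    joints = 24
    rng = np.random.default_rng(0)
    motion = rng.standard_normal((n_frames, joints, 3)) * 0.01
    # A real model would fuse whichever conditions are present into one latent;
    # here we only pin provided 3D keyframes at evenly spaced frames.
    if cond.keyframes_3d is not None:
        idx = np.linspace(0, n_frames - 1, len(cond.keyframes_3d)).astype(int)
        motion[idx] = cond.keyframes_3d
    return motion

motion = generate_motion(MotionConditions(text="a person dancing"), n_frames=60)
print(motion.shape)  # (60, 24, 3)
```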

Review:
https://t.ly/Q5T_Y

Paper:
https://lnkd.in/ds36BY49

Project Page:
https://lnkd.in/dAYHhuFU

#NVIDIA #GENMO #HumanMotion #DeepLearning #AI #ComputerVision #MotionGeneration #MachineLearning #MultimodalAI #3DReconstruction


βœ‰οΈ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

πŸ“± Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Generative AI with LangChain 2025

Docker Deep Dive 2025

Available now in our paid channels

Join our paid channel for free books and courses ($2/month)
https://t.iss.one/+r_Tcx2c-oVU1OWNi
πŸ‘1
Forwarded from Thomas
πŸͺ™ +$30,560 from $300 in a month of trading! We can teach you how to earn, FREE!

It was a challenge: a trading marathon from $300 to $30,000, together with Lisa!

What is the essence of the earnings? "Analyze and open a trade on the exchange, knowing where the currency rate will go. Lisa trades every day and posts signals on her channel for free."

πŸ”Ή Start: $150
πŸ”Ή Goal: $20,000
πŸ”Ή Period: 1.5 months

Join and get started; there will be no second chance πŸ‘‡

https://t.iss.one/+HjHm7mxR5xllNTY5
SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding

πŸ–₯ Github: https://github.com/haoningwu3639/SpatialScore

πŸ“• Paper: https://arxiv.org/abs/2505.17012v1

πŸ”— Tasks: https://paperswithcode.com/task/motion-estimation

https://t.iss.one/DataScienceT 🌟
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

20 May 2025 Β· Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, Hao Liu, Can Huang Β·

Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
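
A minimal sketch of the analyze-then-parse idea described above: a sequential layout stage followed by parallel, anchor-plus-prompt element parsing. The layout detector and parser below are stand-ins, not Dolphin's actual models (see the GitHub repo for the real code).

```python
# Sketch of the two-stage "analyze-then-parse" flow; both stages are stand-ins.
from concurrent.futures import ThreadPoolExecutor

def analyze_layout(page_image):
    """Stage 1 (stand-in): return layout elements in reading order."""
    return [
        {"type": "paragraph", "bbox": (50, 40, 560, 180)},
        {"type": "table",     "bbox": (50, 200, 560, 420)},
        {"type": "formula",   "bbox": (50, 440, 560, 500)},
    ]

def parse_element(page_image, element):
    """Stage 2 (stand-in): anchor + task-specific prompt -> parsed content."""
    prompt = f"Parse this {element['type']} region."
    return {"anchor": element, "prompt": prompt, "content": "<parsed text>"}

def parse_document(page_image):
    elements = analyze_layout(page_image)      # sequential analysis stage
    with ThreadPoolExecutor() as pool:         # parallel parsing stage
        return list(pool.map(lambda e: parse_element(page_image, e), elements))

print(parse_document(page_image=None))
```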


Paper: https://arxiv.org/pdf/2505.14059v1.pdf

Code: https://github.com/bytedance/dolphin

Dataset: PubTabNet

βœ‰οΈ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

πŸ“± Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

20 May 2025 Β· Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu Β·

Large Vision-Language Models (VLMs) have shown strong capabilities in multimodal understanding and reasoning, yet they are primarily constrained by text-based reasoning processes. However, achieving seamless integration of visual and textual reasoning which mirrors human cognitive processes remains a significant challenge. In particular, effectively incorporating advanced visual input processing into reasoning mechanisms is still an open question. Thus, in this paper, we explore the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning without the need for cold-start SFT. Notably, this ability emerges natively within the model itself, leveraging its inherent grounding ability as a tool instead of depending on separate specialized models. Specifically, we propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of tool-calling behavior from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.
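
To make the "reward strategy to encourage successful tool-assisted reasoning trajectories" concrete, here is a hedged sketch of such a reward. The weights and terms are assumptions for illustration, not DeepEyes' exact reward function.

```python
# Illustrative tool-use-oriented reward; constants are assumptions, not the
# official DeepEyes reward (see https://github.com/visual-agent/deepeyes).
def trajectory_reward(answer_correct: bool, used_image_tool: bool,
                      n_tool_calls: int, max_tool_calls: int = 4) -> float:
    reward = 1.0 if answer_correct else 0.0
    # Bonus when a correct answer was reached via visual tool use,
    # so "thinking with images" trajectories get reinforced.
    if answer_correct and used_image_tool:
        reward += 0.5
    # Mild penalty for excessive tool calls to push toward efficient use.
    reward -= 0.1 * max(0, n_tool_calls - max_tool_calls)
    return reward

print(trajectory_reward(answer_correct=True, used_image_tool=True, n_tool_calls=2))   # 1.5
print(trajectory_reward(answer_correct=False, used_image_tool=True, n_tool_calls=6))  # -0.2
```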


Paper: https://arxiv.org/pdf/2505.14362v1.pdf

Code: https://github.com/visual-agent/deepeyes

Datasets: MS COCO - RefCOCO - MathVista - MathVerse

βœ‰οΈ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

πŸ“± Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
MMaDA: Multimodal Large Diffusion Language Models

21 May 2025 Β· Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang Β·

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
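
The "unified CoT format across modalities" is easiest to picture as a single template shared by text reasoning, multimodal QA, and generation prompts. The tag names below are assumptions for illustration; MMaDA's exact template is in the released code.

```python
# Illustrative unified CoT template; tag names are assumptions, not MMaDA's
# exact format (see https://github.com/gen-verse/mmada).
def to_unified_cot(task: str, question: str, reasoning: str, answer: str) -> str:
    # One structure for every modality, so the diffusion backbone always sees
    # the same task/question/think/answer layout.
    return (f"<task>{task}</task>\n"
            f"<question>{question}</question>\n"
            f"<think>{reasoning}</think>\n"
            f"<answer>{answer}</answer>")

print(to_unified_cot("text_reasoning", "What is 17 * 3?",
                     "17 * 3 = 10 * 3 + 7 * 3 = 30 + 21 = 51.", "51"))
```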


Paper: https://arxiv.org/pdf/2505.15809v1.pdf

Code: https://github.com/gen-verse/mmada

Demo (Hugging Face Space): https://huggingface.co/spaces/Gen-Verse/MMaDA

βœ‰οΈ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

πŸ“± Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

26 May 2025 Β· Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Chongyang Ma, Jiebo Luo, Li Yuan Β·

Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 16 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.
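
A tiny sketch of how the three automatic metrics named above (NexusScore, NaturalScore, GmeScore) could be reported per model. The equal-weight average is an assumption for illustration, not the paper's official aggregation.

```python
# Hypothetical aggregation of the three OpenS2V-Eval metrics; the equal
# weighting is an assumption, not the benchmark's official formula.
def opens2v_summary(nexus_score: float, natural_score: float, gme_score: float) -> dict:
    scores = {
        "NexusScore (subject consistency)": nexus_score,
        "NaturalScore (naturalness)": natural_score,
        "GmeScore (text relevance)": gme_score,
    }
    scores["Average"] = sum(scores.values()) / 3
    return scores

# Made-up example values for one model:
for name, value in opens2v_summary(0.71, 0.64, 0.82).items():
    print(f"{name}: {value:.3f}")
```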


Paper: https://arxiv.org/pdf/2505.20292v1.pdf

Codes:
https://github.com/PKU-YuanGroup/ConsisID
https://github.com/PKU-YuanGroup/OpenS2V-Nexus

Datasets: OpenS2V-5M - OpenS2V-Eval

βœ‰οΈ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

πŸ“± Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
πŸ”₯ Accelerate Your IT Career with FREE Certification Kits!
πŸš€ Get Hired Fasterβ€”Zero Cost!
Grab expert guides, labs, and courses for AWS, Azure, AI, Python, Cyber Security, and beyondβ€”100% FREE, no hidden fees!
βœ… CLICK your fieldπŸ‘‡
βœ… DOWNLOAD & dominate your goals!
πŸ”— AWS + Azure Cloud Mastery: https://bit.ly/44S0dNS
πŸ”— AI & Machine Learning Starter Kit: https://bit.ly/3FrKw5H
πŸ”— Python, Excel, Cyber Security Courses: https://bit.ly/4mFrA4g
πŸ“˜ FREE Career Hack: IT Success Roadmap E-book βž” https://bit.ly/3Z6JS49

🚨 Limited Time! Act FAST!
πŸ“± Join Our IT Study Group: https://bit.ly/43piMq8
πŸ’¬ 1-on-1 Exam Help: https://wa.link/sbpp0m
Your dream job won’t waitβ€”GRAB YOUR RESOURCES NOW! πŸ’»βœ¨
❀4πŸ‘1
RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination

28 May 2025 Β· Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong Β·

We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two-stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays to the corresponding pixel values guided by the triangle-sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.
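
A toy NumPy sketch of the two-stage token flow described above: triangle tokens attend to each other (view-independent light transport), then ray-bundle tokens attend to the transported scene tokens (view-dependent decoding). Shapes and layer counts are illustrative only, not RenderFormer's architecture.

```python
# Toy two-stage attention flow; dimensions are made up for illustration.
import numpy as np

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 64
triangle_tokens = rng.standard_normal((200, d))     # triangles + reflectance
ray_bundle_tokens = rng.standard_normal((1024, d))  # one token per pixel patch

# Stage 1 (view-independent): triangle-to-triangle light transport.
transported = attention(triangle_tokens, triangle_tokens, triangle_tokens)

# Stage 2 (view-dependent): ray-bundle tokens attend to the transported scene.
pixel_tokens = attention(ray_bundle_tokens, transported, transported)
print(pixel_tokens.shape)  # (1024, 64), decoded to pixel patches downstream
```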


Paper: https://arxiv.org/pdf/2505.21925v1.pdf

Code: https://github.com/microsoft/renderformer

Dataset: Objaverse

https://t.iss.one/DataScienceT πŸ”—
Article Title: BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

Article Date: 26 May 2025

Article Description:
Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
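
To make the per-dimension capability patterns easier to scan, here is a small generic aggregator over the scores quoted in the description. The numbers are copied from the abstract; the code itself is just an illustrative helper, not the official BizFinBench harness.

```python
# Generic per-dimension aggregator over scores quoted in the abstract above.
from collections import defaultdict

results = [
    ("Numerical Calculation", "Claude-3.5-Sonnet", 63.18),
    ("Numerical Calculation", "DeepSeek-R1", 64.04),
    ("Numerical Calculation", "Qwen2.5-VL-3B", 15.92),
    ("Reasoning", "ChatGPT-o3", 83.58),
    ("Reasoning", "Gemini-2.0-Flash", 81.15),
    ("Information Extraction", "DeepSeek-R1", 71.46),
    ("Information Extraction", "Qwen3-1.7B", 11.23),
]

by_dimension = defaultdict(list)
for dimension, model, score in results:
    by_dimension[dimension].append((model, score))

for dimension, scores in by_dimension.items():
    best_model, best_score = max(scores, key=lambda x: x[1])
    print(f"{dimension}: best = {best_model} ({best_score})")
```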

Article Download Link: https://arxiv.org/pdf/2505.19457v1.pdf

GitHub:
- https://github.com/hithink-research/bizfinbench

Datasets:
- FinQA
==================================================

For more data science resources:
https://t.iss.one/DataScienceT
Article Title: AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

Article Date: 22 May 2025

Article Description:
The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.

Article Download Link: https://arxiv.org/pdf/2505.16211v1.pdf

GitHub:
- https://github.com/jusperlee/audiotrust

Datasets:
- SafetyBench
==================================================

For more data science resources:
https://t.iss.one/DataScienceT
Article Title: Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space

Article Date: 21 May 2025

Article Description:
Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current reasoning models, however, are constrained to reasoning within the boundaries of human language, processing discrete token embeddings that represent fixed points in the semantic space. This discrete constraint restricts the expressive power and upper potential of such reasoning models, often causing incomplete exploration of reasoning paths, as standard Chain-of-Thought (CoT) methods rely on sampling one token per step. In this work, we introduce Soft Thinking, a training-free method that emulates human-like "soft" reasoning by generating soft, abstract concept tokens in a continuous concept space. These concept tokens are created by the probability-weighted mixture of token embeddings, which form the continuous concept space, enabling smooth transitions and richer representations that transcend traditional discrete boundaries. In essence, each generated concept token encapsulates multiple meanings from related discrete tokens, implicitly exploring various reasoning paths to converge effectively toward the correct answer. Empirical evaluations on diverse mathematical and coding benchmarks consistently demonstrate the effectiveness and efficiency of Soft Thinking, improving pass@1 accuracy by up to 2.48 points while simultaneously reducing token usage by up to 22.4% compared to standard CoT. Qualitative analysis further reveals that Soft Thinking outputs remain highly interpretable and readable, highlighting the potential of Soft Thinking to break the inherent bottleneck of discrete language-based reasoning. Code is available at https://github.com/eric-ai-lab/Soft-Thinking.
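
The core "concept token" idea, a probability-weighted mixture of token embeddings instead of one sampled token, can be shown in a few lines. Toy vocabulary and embeddings below; this is a sketch of the idea, not the Soft-Thinking implementation.

```python
# Sketch of a continuous "concept token" as the expected token embedding
# under the next-token distribution; toy sizes, not the released code.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 8, 4
embedding_table = rng.standard_normal((vocab_size, d_model))

def concept_token(logits: np.ndarray) -> np.ndarray:
    """Return the probability-weighted mixture of token embeddings."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs @ embedding_table  # (d_model,) soft, continuous token

logits = rng.standard_normal(vocab_size)
soft = concept_token(logits)
hard = embedding_table[int(np.argmax(logits))]  # what standard CoT would feed back
print(soft.shape, np.linalg.norm(soft - hard))
```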

Article Download Link: https://arxiv.org/pdf/2505.15778v1.pdf

GitHub:
- https://github.com/eric-ai-lab/soft-thinking

Datasets:
- No datasets information available
==================================================

For more data science resources:
https://t.iss.one/DataScienceT
Article Title: ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS

Article Date: 29 May 2025

Article Description:
Feed-forward 3D Gaussian Splatting (3DGS) models have recently emerged as a promising solution for novel view synthesis, enabling one-pass inference without the need for per-scene 3DGS optimization. However, their scalability is fundamentally constrained by the limited capacity of their encoders, leading to degraded performance or excessive memory consumption as the number of input views increases. In this work, we analyze feed-forward 3DGS frameworks through the lens of the Information Bottleneck principle and introduce ZPressor, a lightweight architecture-agnostic module that enables efficient compression of multi-view inputs into a compact latent state Z that retains essential scene information while discarding redundancy. Concretely, ZPressor enables existing feed-forward 3DGS models to scale to over 100 input views at 480P resolution on an 80GB GPU, by partitioning the views into anchor and support sets and using cross attention to compress the information from the support views into anchor views, forming the compressed latent state Z. We show that integrating ZPressor into several state-of-the-art feed-forward 3DGS models consistently improves performance under moderate input views and enhances robustness under dense view settings on two large-scale benchmarks DL3DV-10K and RealEstate10K. The video results, code and trained models are available on our project page: https://lhmd.top/zpressor.
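
A toy NumPy sketch of the anchor/support compression idea: support-view tokens are folded into anchor-view tokens via cross-attention, yielding a compact latent state Z. The partitioning rule and shapes are assumptions, not ZPressor's code.

```python
# Toy anchor/support cross-attention compression; shapes are illustrative.
import numpy as np

def cross_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
n_views, tokens_per_view, d = 100, 16, 32
views = rng.standard_normal((n_views, tokens_per_view, d))

anchor_ids = np.arange(0, n_views, 10)                  # every 10th view (assumption)
support_ids = np.setdiff1d(np.arange(n_views), anchor_ids)

anchors = views[anchor_ids].reshape(-1, d)    # queries
support = views[support_ids].reshape(-1, d)   # keys/values to be compressed

z = cross_attention(anchors, support, support) + anchors   # compact latent state Z
print(views.reshape(-1, d).shape, "->", z.shape)  # (1600, 32) -> (160, 32)
```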

Article Download Link: https://arxiv.org/pdf/2505.23734v1.pdf

GitHub:
- https://github.com/ziplab/ZPressor

Datasets:
- ACID
- RealEstate10K
- DL3DV-10K
==================================================

For more data science resources:
https://t.iss.one/DataScienceT
Article Title: Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Article Date: 20 Jan 2025

Article Description:
We introduce Zep, a novel memory layer service for AI agents that outperforms the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR) benchmark. Additionally, Zep excels in more comprehensive and challenging evaluations than DMR that better reflect real-world enterprise use cases. While existing retrieval-augmented generation (RAG) frameworks for large language model (LLM)-based agents are limited to static document retrieval, enterprise applications demand dynamic knowledge integration from diverse sources including ongoing conversations and business data. Zep addresses this fundamental limitation through its core component Graphiti -- a temporally-aware knowledge graph engine that dynamically synthesizes both unstructured conversational data and structured business data while maintaining historical relationships. In the DMR benchmark, which the MemGPT team established as their primary evaluation metric, Zep demonstrates superior performance (94.8% vs 93.4%). Beyond DMR, Zep's capabilities are further validated through the more challenging LongMemEval benchmark, which better reflects enterprise use cases through complex temporal reasoning tasks. In this evaluation, Zep achieves substantial results with accuracy improvements of up to 18.5% while simultaneously reducing response latency by 90% compared to baseline implementations. These results are particularly pronounced in enterprise-critical tasks such as cross-session information synthesis and long-term context maintenance, demonstrating Zep's effectiveness for deployment in real-world applications.
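
A hedged sketch of the kind of temporally-aware fact edge such a knowledge graph engine maintains, so historical relationships can be queried "as of" a point in time. Field names are assumptions for illustration, not Zep's or Graphiti's schema.

```python
# Hypothetical temporally-stamped fact edge; not Graphiti's actual schema
# (see https://github.com/getzep/graphiti for the real implementation).
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class TemporalEdge:
    subject: str
    predicate: str
    obj: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None = still believed true

def current_facts(edges, at: datetime):
    """Return the edges that were valid at the given moment."""
    return [e for e in edges
            if e.valid_from <= at and (e.valid_to is None or at < e.valid_to)]

edges = [
    TemporalEdge("user_42", "works_at", "Acme",
                 datetime(2023, 1, 1, tzinfo=timezone.utc),
                 datetime(2025, 3, 1, tzinfo=timezone.utc)),
    TemporalEdge("user_42", "works_at", "Globex",
                 datetime(2025, 3, 1, tzinfo=timezone.utc)),
]
print(current_facts(edges, datetime(2025, 6, 1, tzinfo=timezone.utc)))
```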

Article Download Link: https://arxiv.org/pdf/2501.13956v1.pdf

GitHub:
- https://github.com/getzep/graphiti

Datasets:
- No datasets information available
==================================================

For more data science resources:
https://t.iss.one/DataScienceT
Article Title:
UniVST: A Unified Framework for Training-free Localized Video Style Transfer

Article Date: 26 Oct 2024

Article Description:
This paper presents UniVST, a unified framework for localized video style transfer based on a diffusion model. It operates without the need for training, offering a distinct advantage over existing diffusion methods that transfer style across entire videos. The endeavors of this paper comprise: (1) A point-matching mask propagation strategy that leverages the feature maps from the DDIM inversion. This streamlines the model's architecture by obviating the need for tracking models. (2) A training-free AdaIN-guided video style transfer mechanism that operates at both the latent and attention levels. This balances content fidelity and style richness, mitigating the loss of localized details commonly associated with direct video stylization. (3) A sliding-window consistent smoothing scheme that harnesses optical flow within the pixel representation and refines predicted noise to update the latent space. This significantly enhances temporal consistency and diminishes artifacts in stylized video. Our proposed UniVST has been validated to be superior to existing methods in quantitative and qualitative metrics. It adeptly addresses the challenges of preserving the primary object's style while ensuring temporal consistency and detail preservation. Our code is available at https://github.com/QuanjianSong/UniVST.
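
For reference, the basic AdaIN operation the method builds on (Huang & Belongie, 2017) aligns the per-channel statistics of content features to those of style features. How UniVST applies it at the latent and attention levels is in the repo; this is just the standard transform on toy feature maps.

```python
# Standard AdaIN on toy feature maps; not UniVST-specific code.
import numpy as np

def adain(content: np.ndarray, style: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Align per-channel mean/std of content features to the style features.

    content, style: arrays of shape (C, H, W).
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    return s_std * (content - c_mean) / c_std + s_mean

rng = np.random.default_rng(0)
out = adain(rng.standard_normal((64, 32, 32)), rng.standard_normal((64, 32, 32)))
print(out.shape)  # (64, 32, 32)
```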

Article Download Link: https://arxiv.org/pdf/2410.20084v3.pdf

GitHub:
- https://github.com/QuanjianSong/UniVST

Datasets:
- WikiArt
- LAION-Aesthetics V2 6.5+
==================================

For more data science resources:
https://t.iss.one/DataScienceT
Article Title:
syftr: Pareto-Optimal Generative AI

Article Date: 26 May 2025

Article Description:
Retrieval-Augmented Generation (RAG) pipelines are central to applying large language models (LLMs) to proprietary or dynamic data. However, building effective RAG flows is complex, requiring careful selection among vector databases, embedding models, text splitters, retrievers, and synthesizing LLMs. The challenge deepens with the rise of agentic paradigms. Modules like verifiers, rewriters, and rerankers, each with intricate hyperparameter dependencies, have to be carefully tuned. Balancing tradeoffs between latency, accuracy, and cost becomes increasingly difficult in performance-sensitive applications. We introduce syftr, a framework that performs efficient multi-objective search over a broad space of agentic and non-agentic RAG configurations. Using Bayesian Optimization, syftr discovers Pareto-optimal flows that jointly optimize task accuracy and cost. A novel early-stopping mechanism further improves efficiency by pruning clearly suboptimal candidates. Across multiple RAG benchmarks, syftr finds flows which are on average approximately 9 times cheaper while preserving most of the accuracy of the most accurate flows on the Pareto-frontier. Furthermore, syftr's ability to design and optimize allows integrating new modules, making it even easier and faster to realize high-performing generative AI pipelines.
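
A small sketch of the Pareto-frontier idea syftr optimizes toward: keep only RAG configurations that no other configuration beats on both accuracy (higher is better) and cost (lower is better). The candidate flows below are made-up examples, and this naive filter stands in for syftr's Bayesian-optimization search.

```python
# Naive Pareto-front filter over (accuracy, cost); example flows are made up.
def pareto_front(flows):
    """flows: dicts with 'accuracy' (higher better) and 'cost' (lower better)."""
    front = []
    for f in flows:
        dominated = any(
            g["accuracy"] >= f["accuracy"] and g["cost"] <= f["cost"]
            and (g["accuracy"] > f["accuracy"] or g["cost"] < f["cost"])
            for g in flows
        )
        if not dominated:
            front.append(f)
    return front

flows = [
    {"name": "large LLM + reranker",          "accuracy": 0.86, "cost": 9.0},
    {"name": "small LLM + hybrid retrieval",  "accuracy": 0.81, "cost": 1.0},
    {"name": "small LLM, no reranker",        "accuracy": 0.70, "cost": 1.2},
]
print([f["name"] for f in pareto_front(flows)])
# ['large LLM + reranker', 'small LLM + hybrid retrieval']
```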

Article Download Link: https://arxiv.org/pdf/2505.20266v1.pdf

GitHub:
- https://github.com/datarobot/syftr

Datasets:
- FinanceBench
==================================

For more data science resources:
https://t.iss.one/DataScienceT