Data Science | Machine Learning with Python for Researchers

The Data Science and Python channel is for researchers and advanced programmers

Article Title:
GOLFer: Smaller LM-Generated Documents Hallucination Filter & Combiner for Query Expansion in Information Retrieval

Article Date: 5 Jun 2025

Article Description:
Large language model (LLM)-based query expansion for information retrieval augments queries with hypothetical documents generated by LLMs. However, its performance relies heavily on the scale of the language models (LMs), necessitating larger, more advanced LLMs. This approach is costly, computationally intensive, and often has limited accessibility. To address these limitations, we introduce GOLFer - Smaller LMs-Generated Documents Hallucination Filter & Combiner - a novel method leveraging smaller open-source LMs for query expansion. GOLFer comprises two modules: a hallucination filter and a documents combiner. The former detects and removes non-factual and inconsistent sentences in generated documents, a common issue with smaller LMs, while the latter combines the filtered content with the query using a weight vector to balance their influence. We evaluate GOLFer alongside dominant LLM-based query expansion methods on three web search and ten low-resource datasets. Experimental results demonstrate that GOLFer consistently outperforms other methods using smaller LMs and maintains competitive performance against methods using large LLMs, demonstrating its effectiveness.
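
The filter-then-combine idea lends itself to a short illustration. The sketch below is a loose interpretation of the abstract, not the GOLFer code: the `consistency_score` callable, the threshold, and the term-repetition trick for weighting in a lexical retriever are all assumptions.

```python
# Minimal sketch (not the authors' implementation): filter generated sentences
# with a hypothetical consistency scorer, then combine the survivors with the
# original query using a weight vector, as the abstract describes.
from typing import Callable, List

def expand_query(query: str,
                 generated_docs: List[str],
                 consistency_score: Callable[[str, str], float],
                 threshold: float = 0.5,
                 query_weight: float = 0.6) -> str:
    # Split generated documents into sentences (naive split for illustration).
    sentences = [s.strip() for d in generated_docs for s in d.split(".") if s.strip()]
    # Hallucination filter: keep sentences judged consistent with the query.
    kept = [s for s in sentences if consistency_score(query, s) >= threshold]
    # Documents combiner: repeat terms to approximate a weighted combination
    # for a lexical retriever (a dense retriever would weight embeddings instead).
    q_reps = max(1, round(query_weight * 10))
    d_reps = max(1, round((1.0 - query_weight) * 10))
    return " ".join([query] * q_reps + [" ".join(kept)] * d_reps)
```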

PDF Download Link:
https://arxiv.org/pdf/2506.04762v1.pdf

GitHub:
https://github.com/liuliuyuan6/GOLFer

Datasets:
• MS MARCO
• BEIR
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

🔹 Publication Date: Published on Jan 8

🔹 Abstract:
rStar-Math enhances small language models' math reasoning capabilities through Monte Carlo Tree Search and self-evolution, achieving state-of-the-art performance on various benchmarks without distillation from larger models. AI-generated summary We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naive step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
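
To make the search-with-a-reward-model idea concrete, here is a greedy stand-in for the far richer MCTS the paper describes. `propose_steps` and `ppm_score` are hypothetical placeholders for the policy SLM and the process preference model; the stopping rule is also an assumption.

```python
# Minimal sketch (assumptions throughout): a greedy, PPM-guided step search that
# stands in for the MCTS described in the abstract.
from typing import Callable, List

def solve_stepwise(problem: str,
                   propose_steps: Callable[[str, List[str]], List[str]],
                   ppm_score: Callable[[str, List[str], str], float],
                   max_depth: int = 8) -> List[str]:
    trajectory: List[str] = []
    for _ in range(max_depth):
        candidates = propose_steps(problem, trajectory)   # policy SLM proposals
        if not candidates:
            break
        # Pick the candidate step the process reward model prefers.
        best = max(candidates, key=lambda s: ppm_score(problem, trajectory, s))
        trajectory.append(best)
        if best.strip().lower().startswith("final answer"):
            break
    return trajectory
```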

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2501.04519
• PDF: https://arxiv.org/pdf/2501.04519
• Github: https://github.com/microsoft/rStar

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

🔹 Publication Date: Published on Jun 9

🔹 Abstract:
PolyVivid is a multi-subject video customization framework that uses text-image fusion, 3D-RoPE enhancement, attention-inherited identity injection, and MLLM-based data processing to ensure identity consistency and realistic video generation. AI-generated summary Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity drift. Finally, we construct an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, effectively enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.07848
• PDF: https://arxiv.org/pdf/2506.07848
• Project Page: https://sjtuplayer.github.io/projects/PolyVivid/

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better

🔹 Publication Date: Published on Jun 10

🔹 Abstract:
Autoregressive Semantic Visual Reconstruction (ASVR) improves multimodal understanding by focusing on semantic reconstruction rather than raw visual appearance, enhancing performance across various benchmarks. AI-generated summary Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.
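
A rough sketch of the training objective implied by the abstract: next-token prediction on text plus cross-entropy on discrete semantic visual tokens. The shapes, the 1:1 loss weighting, and the token vocabularies are assumptions, not the ASVR recipe.

```python
# Minimal sketch, assuming the setup in the abstract: the LVLM predicts both the
# next text token and a discrete semantic token for each image position.
import torch
import torch.nn.functional as F

def asvr_loss(text_logits: torch.Tensor,    # (B, T_text, V_text)
              text_targets: torch.Tensor,   # (B, T_text)
              vis_logits: torch.Tensor,     # (B, T_img, V_sem) semantic codebook logits
              vis_targets: torch.Tensor,    # (B, T_img) discrete semantic token ids
              vis_weight: float = 1.0) -> torch.Tensor:
    text_ce = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                              text_targets.reshape(-1))
    # Semantic visual reconstruction: cross-entropy over discrete semantic tokens,
    # rather than regressing raw pixel appearance.
    vis_ce = F.cross_entropy(vis_logits.reshape(-1, vis_logits.size(-1)),
                             vis_targets.reshape(-1))
    return text_ce + vis_weight * vis_ce
```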

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09040
• PDF: https://arxiv.org/pdf/2506.09040

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
SkyReels-V2: Infinite-length Film Generative Model

Article Date: 17 Apr 2025

Article Description:
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, which synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.

PDF Download Link:
https://arxiv.org/pdf/2504.13074v3.pdf

GitHub:
https://github.com/skyworkai/skyreels-v2

Datasets:
• No dataset information available
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

🔹 Publication Date: Published on Jun 11

🔹 Abstract:
ComfyUI-R1, a large reasoning model for automated workflow generation, demonstrates superior performance in creating AI art workflows through long chain-of-thought reasoning and reinforcement learning. AI-generated summary AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with high pass rate, node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.
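
To show what a rule-metric hybrid reward can look like in code, here is a sketch in the spirit of the abstract. The weights, the JSON-based format check, and the node-F1 computation are assumptions for illustration, not the ComfyUI-R1 reward.

```python
# Minimal sketch of a rule-metric hybrid reward: a hard format rule plus a
# node-level F1 metric against a reference workflow.
import json
from typing import Set

def hybrid_reward(generated: str, reference_nodes: Set[str],
                  w_format: float = 0.3, w_nodes: float = 0.7) -> float:
    # Rule component: the workflow must at least parse as a JSON object.
    try:
        workflow = json.loads(generated)
    except json.JSONDecodeError:
        return 0.0  # invalid format gets no reward
    if not isinstance(workflow, dict):
        return 0.0
    pred_nodes = {node.get("class_type", "") for node in workflow.get("nodes", [])}
    # Metric component: node-level F1 against the reference workflow.
    tp = len(pred_nodes & reference_nodes)
    precision = tp / len(pred_nodes) if pred_nodes else 0.0
    recall = tp / len(reference_nodes) if reference_nodes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return w_format + w_nodes * f1
```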

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09790
• PDF: https://arxiv.org/pdf/2506.09790
• Project Page: https://github.com/AIDC-AI/ComfyUI-Copilot
• Github: https://github.com/AIDC-AI/ComfyUI-Copilot

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

🔹 Publication Date: Published on Jun 11

🔹 Abstract:
ReasonMed, a large medical reasoning dataset, enhances the accuracy of medical question answering models by combining detailed reasoning paths with concise summaries, setting new benchmarks for model performance. AI-generated summary Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, where we design an Error Refiner to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09513
• PDF: https://arxiv.org/pdf/2506.09513
• Github: https://github.com/YuSun-Work/ReasonMed

🔹 Datasets citing this paper:
https://huggingface.co/datasets/YuSun-AI/ReasonMed

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence

🔹 Publication Date: Published on Jun 12

🔹 Abstract:
EmbodiedGen is a platform that generates high-quality, photorealistic 3D assets at low cost, enabling scalable and realistic embodied AI research through generative AI techniques. AI-generated summary Constructing a physically realistic and accurately scaled simulated 3D world is crucial for the training and evaluation of embodied intelligence tasks. The diversity, realism, low cost, and accessibility of 3D data assets are critical for achieving generalization and scalability in embodied AI. However, most current embodied intelligence tasks still rely heavily on traditional 3D computer graphics assets that are manually created and annotated, which suffer from high production costs and limited realism. These limitations significantly hinder the scalability of data-driven approaches. We present EmbodiedGen, a foundational platform for interactive 3D world generation. It enables the scalable generation of high-quality, controllable and photorealistic 3D assets with accurate physical properties and real-world scale in the Unified Robotics Description Format (URDF) at low cost. These assets can be directly imported into various physics simulation engines for fine-grained physical control, supporting downstream tasks in training and evaluation. EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object Generation, Scene Generation and Layout Generation. EmbodiedGen generates diverse and interactive 3D worlds composed of generative 3D assets, leveraging generative AI to address the challenges of generalization and evaluation in embodied intelligence research. Code is available at https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.10600
• PDF: https://arxiv.org/pdf/2506.10600
• Project Page: https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html
• Github: https://github.com/HorizonRobotics/EmbodiedGen.git

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
https://huggingface.co/spaces/HorizonRobotics/EmbodiedGen-Image-to-3D
https://huggingface.co/spaces/HorizonRobotics/EmbodiedGen-Texture-Gen
https://huggingface.co/spaces/HorizonRobotics/EmbodiedGen-Text-to-3D
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
Branched Schrödinger Bridge Matching

🔹 Publication Date: Published on Jun 10

🔹 Abstract:
BranchSBM, a novel generative modeling framework, extends Schrödinger Bridge Matching to model branched stochastic paths and multi-path evolution from a single initial distribution to multiple outcomes. AI-generated summary Predicting the intermediate trajectories between an initial and target distribution is a central problem in generative modeling. Existing approaches, such as flow matching and Schrödinger Bridge Matching, effectively learn mappings between two distributions by modeling a single stochastic path. However, these methods are inherently limited to unimodal transitions and cannot capture branched or divergent evolution from a common origin to multiple distinct outcomes. To address this, we introduce Branched Schrödinger Bridge Matching (BranchSBM), a novel framework that learns branched Schrödinger bridges. BranchSBM parameterizes multiple time-dependent velocity fields and growth processes, enabling the representation of population-level divergence into multiple terminal distributions. We show that BranchSBM is not only more expressive but also essential for tasks involving multi-path surface navigation, modeling cell fate bifurcations from homogeneous progenitor states, and simulating diverging cellular responses to perturbations.
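
As a purely conceptual sketch of "multiple velocity fields from one origin", the code below integrates K branch-specific velocity fields with a simple Euler rollout. In BranchSBM these fields and the growth processes are learned with matching objectives; here they are arbitrary callables, and the random branch assignment is an assumption made only for illustration.

```python
# Conceptual sketch only: push one set of initial samples toward K distinct
# terminal distributions using per-branch velocity fields.
import numpy as np
from typing import Callable, List

def branched_rollout(x0: np.ndarray,                      # (N, D) initial samples
                     velocity_fields: List[Callable],      # K callables v_k(x, t) -> (N_k, D)
                     branch_probs: np.ndarray,             # (K,) branch assignment weights
                     n_steps: int = 100) -> List[np.ndarray]:
    rng = np.random.default_rng(0)
    K = len(velocity_fields)
    assignment = rng.choice(K, size=x0.shape[0], p=branch_probs)
    dt = 1.0 / n_steps
    x = x0.copy()
    for step in range(n_steps):
        t = step * dt
        for k in range(K):
            mask = assignment == k
            if mask.any():
                x[mask] = x[mask] + dt * velocity_fields[k](x[mask], t)
    # Return samples grouped by terminal branch.
    return [x[assignment == k] for k in range(K)]
```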

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09007
• PDF: https://arxiv.org/pdf/2506.09007

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Article Date: 9 Apr 2024

Article Description:
The burgeoning interest in developing Large Language Models (LLMs) with up to trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically its 1.2B and 2.4B non-embedding-parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occur in the WSD LRS. With the WSD LRS, we are now able to efficiently study the data-model scaling law without extensive retraining experiments on both the model and data axes, from which we derive a much higher compute-optimal data-model ratio than Chinchilla Optimal. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation in diverse SLM applications. MiniCPM models are available publicly at https://github.com/OpenBMB/MiniCPM.
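
The WSD scheduler is easy to picture in code: linear warmup, a long constant ("stable") phase, then a decay phase. The sketch below is an illustration only; the linear decay form and the phase proportions in the example are assumptions, since the paper defines its own decay function.

```python
# Minimal sketch of a Warmup-Stable-Decay (WSD) learning-rate schedule.
def wsd_lr(step: int, max_lr: float, warmup_steps: int,
           stable_steps: int, decay_steps: int, min_lr: float = 0.0) -> float:
    if step < warmup_steps:                      # warmup: 0 -> max_lr
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:       # stable: hold max_lr
        return max_lr
    # decay: anneal from max_lr toward min_lr
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return min_lr + (max_lr - min_lr) * (1.0 - progress)

# Example: a 10k-step run with 5% warmup, 85% stable, 10% decay (assumed split).
schedule = [wsd_lr(s, 1e-3, 500, 8500, 1000) for s in range(10000)]
```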

PDF Download Link:
https://arxiv.org/pdf/2404.06395v3.pdf

GitHub:
https://github.com/openbmb/minicpm
https://github.com/pwc-1/Paper-9/tree/main/2/minicpm
https://github.com/pwc-1/Paper-5/tree/main/minicpm

Datasets:
• MML
• MMLU
• GSM8K
• MATH
• HumanEval
• HellaSwag
• C4
• MBPP
• MT-Bench
• BBH
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines

Article Date: 20 Dec 2023

Article Description:
Chaining language model (LM) calls as composable modules is fueling a new way of programming, but ensuring LMs adhere to important constraints requires heuristic "prompt engineering". We introduce LM Assertions, a programming construct for expressing computational constraints that LMs should satisfy. We integrate our constructs into the recent DSPy programming model for LMs, and present new strategies that allow DSPy to compile programs with LM Assertions into more reliable and accurate systems. We also propose strategies to use assertions at inference time for automatic self-refinement with LMs. We report on four diverse case studies for text generation and find that LM Assertions improve not only compliance with imposed rules but also downstream task performance, passing constraints up to 164% more often and generating up to 37% more higher-quality responses. Our reference implementation of LM Assertions is integrated into DSPy at https://github.com/stanfordnlp/dspy
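
The core idea, check a constraint on the LM output and feed violations back for a retry, can be sketched without the real DSPy API (the paper's constructs are exposed in the library as dspy.Assert and dspy.Suggest). Everything below, the generic `generate` callable, the feedback message, the retry count, is an assumption for illustration, not DSPy code.

```python
# Illustrative sketch of assertion-driven self-refinement with a generic LM call.
from typing import Callable

def generate_with_assertion(generate: Callable[[str], str],
                            prompt: str,
                            constraint: Callable[[str], bool],
                            feedback: str,
                            max_retries: int = 2) -> str:
    output = generate(prompt)
    for _ in range(max_retries):
        if constraint(output):
            return output
        # Self-refinement: feed the violated constraint back into the next attempt.
        output = generate(f"{prompt}\n\nPrevious answer: {output}\n"
                          f"Constraint violated: {feedback}\nPlease revise.")
    return output

# Example constraint: the answer must stay under 100 characters.
# result = generate_with_assertion(llm_call, "Summarize X",
#                                  lambda o: len(o) < 100,
#                                  "answer must be under 100 characters")
```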

PDF Download Link:
https://arxiv.org/pdf/2312.13382v2.pdf

GitHub:
https://github.com/stanfordnlp/dspy

Datasets:
• HotpotQA
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
Reparameterized LLM Training via Orthogonal Equivalence Transformation

🔹 Publication Date: Published on Jun 9

🔹 Abstract:
A new reParameterized training algorithm named POET uses Orthogonal Equivalence Transformation to optimize neurons, providing stable optimization and improved generalization for training large-scale neural networks including LLMs. AI-generated summary While large language models (LLMs) are driving the rapid advancement of artificial intelligence, effectively and reliably training these large models remains one of the field's most significant challenges. To address this challenge, we propose POET, a novel reParameterized training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. Specifically, POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix. Because of its provable preservation of spectral properties of weight matrices, POET can stably optimize the objective function with improved generalization. We further develop efficient approximations that make POET flexible and scalable for training large-scale neural networks. Extensive experiments validate the effectiveness and scalability of POET in training LLMs.
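
A small PyTorch sketch of the reparameterization idea: each weight matrix is expressed as an orthogonal matrix times a fixed random matrix times another orthogonal matrix. Using torch's built-in orthogonal parametrization here is my own shortcut for brevity; the paper develops its own efficient approximations, so treat this as a sketch under those assumptions.

```python
# Minimal sketch: W_eff = R @ W0 @ P with W0 fixed and R, P kept orthogonal.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class POETLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Fixed random weight matrix (registered as a buffer, never trained).
        self.register_buffer("w0", torch.randn(out_features, in_features) / in_features ** 0.5)
        # Two trainable square matrices constrained to be orthogonal.
        self.left = orthogonal(nn.Linear(out_features, out_features, bias=False))
        self.right = orthogonal(nn.Linear(in_features, in_features, bias=False))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.left.weight @ self.w0 @ self.right.weight   # R @ W0 @ P
        return x @ w.t()

layer = POETLinear(64, 32)
out = layer(torch.randn(4, 64))   # shape (4, 32)
```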

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.08001
• PDF: https://arxiv.org/pdf/2506.08001
• Project Page: https://spherelab.ai/poet/
• Github: https://github.com/Sphere-AI-Lab/poet

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

Article Date: 28 May 2025

Article Description:
Effectively retrieving, reasoning and understanding visually rich information remains a challenge for RAG methods. Traditional text-based methods cannot handle visual-related information. On the other hand, current vision-based RAG approaches are often limited by fixed pipelines and frequently struggle to reason effectively due to the insufficient activation of the fundamental capabilities of models. As RL has been proven to be beneficial for model reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories with the help of visual perception tokens and undergoing continual optimization based on these samples. Our approach highlights key limitations of RL in RAG domains: (i) Prior Multi-modal RAG approaches tend to merely incorporate images into the context, leading to insufficient reasoning token allocation and neglecting visual-specific perception; and (ii) When models interact with search engines, their queries often fail to retrieve relevant information due to the inability to articulate requirements, thereby leading to suboptimal performance. To address these challenges, we define an action space tailored for visually rich inputs, with actions including cropping and scaling, allowing the model to gather information from a coarse-to-fine perspective. Furthermore, to bridge the gap between users' original inquiries and the retriever, we employ a simple yet effective reward that integrates query rewriting and retrieval performance with a model-based reward. Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL strategies, aligning the model with real-world applications. The code is available at https://github.com/Alibaba-NLP/VRAG.

PDF Download Link:
https://arxiv.org/pdf/2505.22019v1.pdf

GitHub:
https://github.com/alibaba-nlp/vrag

Datasets:
• No dataset information available
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
Robustness and Sensitivity of BERT Models Predicting Alzheimer's Disease from Text

🔹 Publication Date: Published on Sep 24, 2021

🔹 Abstract:
Analysis reveals that BERT is robust to natural linguistic variations but insensitive to the removal of clinically important information in text for Alzheimer's disease prediction. AI-generated summary Understanding robustness and sensitivity of BERT models predicting Alzheimer's disease from text is important for both developing better classification models and for understanding their capabilities and limitations. In this paper, we analyze how a controlled amount of desired and undesired text alterations impacts performance of BERT. We show that BERT is robust to natural linguistic variations in text. On the other hand, we show that BERT is not sensitive to removing clinically important information from text.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2109.11888
• PDF: https://arxiv.org/pdf/2109.11888

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
https://huggingface.co/spaces/Jekaterina/bert-robustness
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
The Diffusion Duality

🔹 Publication Date: Published on Jun 12

🔹 Abstract:
Duo improves uniform-state discrete diffusion models by transferring techniques from Gaussian diffusion, enhancing training speed and enabling fast few-step text generation. AI-generated summary Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: https://s-sahoo.github.io/duo

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.10892
• PDF: https://arxiv.org/pdf/2506.10892
• Project Page: https://s-sahoo.com/duo/
• Github: https://github.com/s-sahoo/duo

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
ECO: Ensembling Context Optimization for Vision-Language Models

🔹 Publication Date: Published on Jul 26, 2023

🔹 Abstract:
Learning an ensemble of prompts enhances few-shot image classification using vision-language models like CLIP without increasing inference costs. AI-generated summary Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts for maximizing CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts improves considerably and consistently the results rather than relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
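
Prompt ensembling itself is simple to illustrate: score each class under several prompt contexts and average. The sketch below uses a generic text encoder callable rather than a real CLIP API, and the hand-written contexts stand in for ECO's learned contexts, so treat it as an assumption-laden illustration of the ensembling step only.

```python
# Minimal sketch of classifying an image embedding with an ensemble of prompts.
import numpy as np
from typing import Callable, List

def ensemble_classify(image_feat: np.ndarray,                  # (D,) L2-normalized image embedding
                      class_names: List[str],
                      prompt_contexts: List[str],               # e.g. ["a photo of a {}", "a drawing of a {}"]
                      encode_text: Callable[[str], np.ndarray]  # returns (D,) L2-normalized embedding
                      ) -> int:
    scores = np.zeros(len(class_names))
    for ctx in prompt_contexts:
        for i, name in enumerate(class_names):
            text_feat = encode_text(ctx.format(name))
            scores[i] += float(image_feat @ text_feat)   # cosine similarity per prompt
    # Average over contexts and pick the best-scoring class.
    return int(np.argmax(scores / len(prompt_contexts)))
```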

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2307.14063
• PDF: https://arxiv.org/pdf/2307.14063

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

Article Date: Mar 2025

Article Description:
We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. Prior video models already generate realistic videos, but they tend to leverage little 3D information, leading to inconsistencies, such as objects popping in and out of existence. Camera control, if implemented at all, is imprecise, because camera parameters are mere inputs to the neural network which must then infer how the video depends on the camera. In contrast, GEN3C is guided by a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Crucially, this means that GEN3C neither has to remember what it previously generated nor does it have to infer the image structure from the camera pose. The model, instead, can focus all its generative power on previously unobserved regions, as well as advancing the scene state to the next frame. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video. Results are best viewed in videos. Check out our webpage! https://research.nvidia.com/labs/toronto-ai/GEN3C/ (CVPR 2025)

PDF Download Link:
https://arxiv.org/pdf/2503.03751v1.pdf

GitHub:
https://github.com/nv-tlabs/GEN3C

Datasets:
• Waymo Open Dataset
• Kubric
• RealEstate10K
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner

🔹 Publication Date: Published on Jun 10

🔹 Abstract:
A novel data synthesis framework, SWE-Flow, uses unit tests to automatically infer development steps and generate a structured schedule for Test-Driven Development (TDD), significantly improving the performance of open models fine-tuned on real-world projects. AI-generated summary We introduce SWE-Flow, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, SWE-Flow automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of SWE-Flow is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step development schedule. At each step, SWE-Flow produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the SWE-Flow-Eval benchmark. Our experiments show that fine-tuning open models on this dataset significantly improves performance in TDD-based coding. To facilitate further research, we release all code, datasets, models, and Docker images at https://github.com/Hambaobao/SWE-Flow.
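
The "dependency graph to development schedule" step has a natural one-function sketch: implement callees before callers so that every intermediate snapshot can run its tests. The plain topological sort and the toy graph below are assumptions; SWE-Flow's runtime RDG construction is considerably richer than this.

```python
# Minimal sketch: order functions so each step's partial codebase only calls
# already-implemented functions, mimicking an incremental TDD schedule.
from graphlib import TopologicalSorter
from typing import Dict, List, Set

def development_schedule(depends_on: Dict[str, Set[str]]) -> List[str]:
    # depends_on maps each function to the functions it calls at runtime;
    # static_order() yields callees before their callers.
    return list(TopologicalSorter(depends_on).static_order())

# Example: checkout calls cart_total and apply_discount; cart_total calls price.
graph = {
    "price": set(),
    "cart_total": {"price"},
    "apply_discount": set(),
    "checkout": {"cart_total", "apply_discount"},
}
print(development_schedule(graph))  # e.g. ['price', 'apply_discount', 'cart_total', 'checkout']
```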

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09003
• PDF: https://arxiv.org/pdf/2506.09003
• Github: https://github.com/Hambaobao/SWE-Flow

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

🔹 Publication Date: Published on Jun 11

🔹 Abstract:
InterSyn, a large-scale dataset with tightly interleaved image-text outputs and automated quality refinement, improves multimodal understanding and generation through the SEIR method and SynJudge, an automatic evaluation tool. AI-generated summary Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality and instructional richness of current training datasets. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR) method. InterSyn features multi-turn, instruction-driven dialogues with tightly interleaved image-text responses, providing rich object diversity and rigorous automated quality refinement, making it well-suited for training next-generation instruction-following LMMs. Furthermore, to address the lack of reliable evaluation tools capable of assessing interleaved multimodal outputs, we introduce SynJudge, an automatic evaluation model designed to quantitatively assess multimodal outputs along four dimensions: text content, image content, image quality, and image-text synergy. Experimental studies show that the SEIR method leads to substantially higher dataset quality compared to an otherwise identical process without refinement. Moreover, LMMs trained on InterSyn achieve uniform performance gains across all evaluation metrics, confirming InterSyn's utility for advancing multimodal systems.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09427
• PDF: https://arxiv.org/pdf/2506.09427

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT