🔹 Title:
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
🔹 Publication Date: Published on Jun 10
🔹 Abstract:
AI-generated summary: Autoregressive Semantic Visual Reconstruction (ASVR) improves multimodal understanding by focusing on semantic reconstruction rather than raw visual appearance, enhancing performance across various benchmarks.
Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.
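Below is a minimal sketch of the training objective the abstract describes: the usual next-text-token loss plus an autoregressive loss over discrete semantic visual tokens. All tensor names, shapes, and the loss weight are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def asvr_style_loss(text_logits, text_targets,
                    visual_logits, visual_token_targets,
                    visual_weight=1.0):
    """Joint autoregressive objective (illustrative only).

    text_logits:          (B, T_text, V_text) next-token logits over the text vocab
    visual_logits:        (B, T_vis, V_vis) next-token logits over a semantic visual codebook
    visual_token_targets: (B, T_vis) discrete ids from a semantic visual tokenizer
    """
    text_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1), ignore_index=-100)
    # Targets are *semantic* token ids rather than pixel-level reconstructions,
    # which is the distinction the abstract reports as decisive.
    visual_loss = F.cross_entropy(
        visual_logits.reshape(-1, visual_logits.size(-1)),
        visual_token_targets.reshape(-1), ignore_index=-100)
    return text_loss + visual_weight * visual_loss
```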
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09040
• PDF: https://arxiv.org/pdf/2506.09040
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
Article Title:
SkyReels-V2: Infinite-length Film Generative Model
Article Date: 17 Apr 2025
Article Description:
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model that synergizes a Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and a Diffusion Forcing framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; and final high-quality SFT refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.
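The description highlights a diffusion forcing framework with non-decreasing noise schedules for long-video synthesis. The toy function below builds such a per-frame schedule, where already-generated context frames stay near-clean and noise never decreases toward future frames; the values and the linear ramp are invented for illustration only.

```python
import numpy as np

def non_decreasing_noise_schedule(num_frames, num_clean, sigma_min=0.02, sigma_max=1.0):
    """Illustrative per-frame noise levels for a diffusion-forcing-style window:
    the first `num_clean` frames (already generated context) keep minimal noise,
    and noise levels never decrease toward future frames."""
    future = max(num_frames - num_clean, 0)
    ramp = np.linspace(sigma_min, sigma_max, max(future, 1))
    schedule = np.concatenate([np.full(num_clean, sigma_min), ramp[:future]])
    assert np.all(np.diff(schedule) >= 0), "schedule must be non-decreasing"
    return schedule

print(non_decreasing_noise_schedule(num_frames=8, num_clean=3))
```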
PDF Download Link:
https://arxiv.org/pdf/2504.13074v3.pdf
GitHub:
• https://github.com/skyworkai/skyreels-v2
Datasets:
• No datasets information available
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
🔹 Title:
ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
🔹 Publication Date: Published on Jun 11
🔹 Abstract:
AI-generated summary: ComfyUI-R1, a large reasoning model for automated workflow generation, demonstrates superior performance in creating AI art workflows through long chain-of-thought reasoning and reinforcement learning.
AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with a high pass rate and high node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and the Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.
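As a rough illustration of the "fine-grained rule-metric hybrid reward" (format validity plus node-level fidelity), the sketch below gates a node-level F1 metric behind a hard format check. The workflow JSON schema, field names, and weights are assumptions, not the paper's reward.

```python
import json

def node_f1(pred_nodes, ref_nodes):
    """Node-level F1 between predicted and reference node types (simplified to sets)."""
    pred, ref = set(pred_nodes), set(ref_nodes)
    if not pred or not ref:
        return 0.0
    precision = len(pred & ref) / len(pred)
    recall = len(pred & ref) / len(ref)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def hybrid_reward(generated_text, reference_nodes, w_format=0.3, w_nodes=0.7):
    """Rule-metric hybrid reward (illustrative weights): a hard format-validity rule
    gates a softer node-fidelity metric."""
    try:
        workflow = json.loads(generated_text)  # rule: output must parse as a workflow graph
        pred_nodes = [n.get("class_type", "") for n in workflow.get("nodes", [])]
    except (json.JSONDecodeError, AttributeError):
        return 0.0                             # invalid format earns no reward
    return w_format * 1.0 + w_nodes * node_f1(pred_nodes, reference_nodes)
```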
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09790
• PDF: https://arxiv.org/pdf/2506.09790
• Project Page: https://github.com/AIDC-AI/ComfyUI-Copilot
• Github: https://github.com/AIDC-AI/ComfyUI-Copilot
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
🔹 Title:
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
🔹 Publication Date: Published on Jun 11
🔹 Abstract:
AI-generated summary: ReasonMed, a large medical reasoning dataset, enhances the accuracy of medical question answering models by combining detailed reasoning paths with concise summaries, setting new benchmarks for model performance.
Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, where we design an Error Refiner to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
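The abstract reports that pairing detailed Chain-of-Thought reasoning with a concise answer summary is the most effective fine-tuning format. A toy formatter in that spirit is sketched below; the section markers and the medical example are made up.

```python
def format_cot_plus_summary(question, cot_steps, final_answer):
    """Build a supervised fine-tuning target that keeps the detailed reasoning
    and appends a concise answer summary. Section markers are illustrative."""
    reasoning = "\n".join(f"Step {i + 1}: {step}" for i, step in enumerate(cot_steps))
    return (
        f"Question: {question}\n"
        f"<think>\n{reasoning}\n</think>\n"
        f"Answer: {final_answer}"
    )

example = format_cot_plus_summary(
    question="Which vitamin deficiency causes scurvy?",
    cot_steps=["Scurvy presents with gum bleeding and poor wound healing.",
               "These findings reflect impaired collagen synthesis.",
               "Collagen synthesis requires vitamin C as a cofactor."],
    final_answer="Vitamin C (ascorbic acid) deficiency.",
)
print(example)
```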
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09513
• PDF: https://arxiv.org/pdf/2506.09513
• Github: https://github.com/YuSun-Work/ReasonMed
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/YuSun-AI/ReasonMed
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
🔹 Title:
EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence
🔹 Publication Date: Published on Jun 12
🔹 Abstract:
AI-generated summary: EmbodiedGen is a platform that generates high-quality, photorealistic 3D assets at low cost, enabling scalable and realistic embodied AI research through generative AI techniques.
Constructing a physically realistic and accurately scaled simulated 3D world is crucial for the training and evaluation of embodied intelligence tasks. The diversity, realism, low cost, accessibility, and affordability of 3D data assets are critical for achieving generalization and scalability in embodied AI. However, most current embodied intelligence tasks still rely heavily on traditional 3D computer graphics assets that are manually created and annotated, which suffer from high production costs and limited realism. These limitations significantly hinder the scalability of data-driven approaches. We present EmbodiedGen, a foundational platform for interactive 3D world generation. It enables the scalable generation of high-quality, controllable, and photorealistic 3D assets with accurate physical properties and real-world scale in the Unified Robotics Description Format (URDF) at low cost. These assets can be directly imported into various physics simulation engines for fine-grained physical control, supporting downstream tasks in training and evaluation. EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object Generation, Scene Generation, and Layout Generation. EmbodiedGen generates diverse and interactive 3D worlds composed of generative 3D assets, leveraging generative AI to address the challenges of generalization and evaluation in embodied intelligence research. Code is available at https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.
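Since the platform exports assets in URDF with real-world scale and physical properties, here is a bare-bones sketch of writing such a file; the mesh filename, mass, inertia, and scale are placeholder values and are unrelated to EmbodiedGen's actual output.

```python
from textwrap import dedent

def write_minimal_urdf(path, name="generated_asset", mesh="asset.obj", mass_kg=0.4, scale=1.0):
    """Write a bare-bones URDF for a single rigid asset (placeholder values)."""
    urdf = dedent(f"""\
        <?xml version="1.0"?>
        <robot name="{name}">
          <link name="base_link">
            <inertial>
              <mass value="{mass_kg}"/>
              <inertia ixx="1e-3" iyy="1e-3" izz="1e-3" ixy="0" ixz="0" iyz="0"/>
            </inertial>
            <visual>
              <geometry><mesh filename="{mesh}" scale="{scale} {scale} {scale}"/></geometry>
            </visual>
            <collision>
              <geometry><mesh filename="{mesh}" scale="{scale} {scale} {scale}"/></geometry>
            </collision>
          </link>
        </robot>
        """)
    with open(path, "w") as f:
        f.write(urdf)

write_minimal_urdf("generated_asset.urdf")
```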
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.10600
• PDF: https://arxiv.org/pdf/2506.10600
• Project Page: https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html
• Github: https://github.com/HorizonRobotics/EmbodiedGen.git
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
• https://huggingface.co/spaces/HorizonRobotics/EmbodiedGen-Image-to-3D
• https://huggingface.co/spaces/HorizonRobotics/EmbodiedGen-Texture-Gen
• https://huggingface.co/spaces/HorizonRobotics/EmbodiedGen-Text-to-3D
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
🔹 Title:
Branched Schrödinger Bridge Matching
🔹 Publication Date: Published on Jun 10
🔹 Abstract:
AI-generated summary: BranchSBM, a novel generative modeling framework, extends Schrödinger Bridge Matching to model branched stochastic paths and multi-path evolution from a single initial distribution to multiple outcomes.
Predicting the intermediate trajectories between an initial and target distribution is a central problem in generative modeling. Existing approaches, such as flow matching and Schrödinger Bridge Matching, effectively learn mappings between two distributions by modeling a single stochastic path. However, these methods are inherently limited to unimodal transitions and cannot capture branched or divergent evolution from a common origin to multiple distinct outcomes. To address this, we introduce Branched Schrödinger Bridge Matching (BranchSBM), a novel framework that learns branched Schrödinger bridges. BranchSBM parameterizes multiple time-dependent velocity fields and growth processes, enabling the representation of population-level divergence into multiple terminal distributions. We show that BranchSBM is not only more expressive but also essential for tasks involving multi-path surface navigation, modeling cell fate bifurcations from homogeneous progenitor states, and simulating diverging cellular responses to perturbations.
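To make "multiple time-dependent velocity fields" concrete, the toy simulation below integrates two branches from a shared starting point with Euler-Maruyama steps. The drift functions, noise level, and two-branch setup are invented for illustration and do not reflect the learned bridges or growth processes in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def branch_drift(k, x, t):
    """Toy time-dependent velocity field for branch k (purely illustrative)."""
    targets = np.array([[2.0, 2.0], [2.0, -2.0]])  # two distinct terminal modes
    return (targets[k] - x) / max(1.0 - t, 1e-2)    # pull toward branch k's endpoint

def simulate_branches(x0, n_steps=100, sigma=0.1):
    """Simulate one trajectory per branch from a shared initial point x0."""
    dt = 1.0 / n_steps
    paths = {k: [np.array(x0, dtype=float)] for k in (0, 1)}
    for step in range(n_steps):
        t = step * dt
        for k, traj in paths.items():
            x = traj[-1]
            noise = sigma * np.sqrt(dt) * rng.standard_normal(2)
            traj.append(x + branch_drift(k, x, t) * dt + noise)
    return {k: np.stack(traj) for k, traj in paths.items()}

paths = simulate_branches(x0=[0.0, 0.0])
print({k: p[-1].round(2) for k, p in paths.items()})  # endpoints near the two modes
```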
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09007
• PDF: https://arxiv.org/pdf/2506.09007
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
Article Title:
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Article Date: 9 Apr 2024
Article Description:
The burgeoning interest in developing Large Language Models (LLMs) with up to trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occurred in the WSD LRS. With WSD LRS, we are now able to efficiently study the data-model scaling law without extensive retraining experiments on both axes of model and data, from which we derive a much higher compute-optimal data-model ratio than Chinchilla Optimal. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation in diverse SLM applications. MiniCPM models are available publicly at https://github.com/OpenBMB/MiniCPM.
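A sketch of a Warmup-Stable-Decay (WSD) learning-rate schedule as described: linear warmup, a long constant phase, then a short final decay. The phase fractions and the exponential decay shape are illustrative choices, not the paper's exact settings.

```python
def wsd_lr(step, total_steps, peak_lr=1e-3, min_lr=1e-5,
           warmup_frac=0.01, decay_frac=0.1):
    """Warmup-Stable-Decay schedule sketch (illustrative hyperparameters)."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    decay_steps = max(int(total_steps * decay_frac), 1)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:                               # warmup: linear ramp to peak
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:                                  # stable: constant peak LR
        return peak_lr
    progress = (step - stable_end) / decay_steps           # decay: anneal toward min_lr
    return peak_lr * (min_lr / peak_lr) ** progress

schedule = [wsd_lr(s, total_steps=10_000) for s in range(10_000)]
print(schedule[0], schedule[5_000], schedule[-1])
```

As the abstract suggests, the long stable phase is what makes continuous training and domain adaptation convenient: a checkpoint from it can be branched, and only the short decay phase needs to be rerun to obtain a final model.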
PDF Download Link:
https://arxiv.org/pdf/2404.06395v3.pdf
GitHub:
• https://github.com/openbmb/minicpm
• https://github.com/pwc-1/Paper-9/tree/main/2/minicpm
• https://github.com/pwc-1/Paper-5/tree/main/minicpm
Datasets:
• MML
• MMLU
• GSM8K
• MATH
• HumanEval
• HellaSwag
• C4
• MBPP
• MT-Bench
• BBH
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
Article Title:
DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines
Article Date: 20 Dec 2023
Article Description:
Chaining language model (LM) calls as composable modules is fueling a new way of programming, but ensuring LMs adhere to important constraints requires heuristic "prompt engineering". We introduce LM Assertions, a programming construct for expressing computational constraints that LMs should satisfy. We integrate our constructs into the recent DSPy programming model for LMs, and present new strategies that allow DSPy to compile programs with LM Assertions into more reliable and accurate systems. We also propose strategies to use assertions at inference time for automatic self-refinement with LMs. We report on four diverse case studies for text generation and find that LM Assertions improve not only compliance with imposed rules but also downstream task performance, passing constraints up to 164% more often and generating up to 37% more higher-quality responses. Our reference implementation of LM Assertions is integrated into DSPy at https://github.com/stanfordnlp/dspy.
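A framework-agnostic sketch of assertion-driven self-refinement in the spirit of the description: check a constraint on each LM output and, on failure, retry with the feedback message appended. This is not the DSPy API; `call_lm`, the constraint, and the prompt format are hypothetical stand-ins.

```python
def assert_and_retry(call_lm, prompt, constraint, feedback, max_retries=2):
    """Run an LM call, check a boolean constraint on the output, and retry with
    the constraint's feedback message appended when it fails (illustrative only)."""
    attempt_prompt = prompt
    output = ""
    for _ in range(max_retries + 1):
        output = call_lm(attempt_prompt)
        if constraint(output):
            return output
        # Self-refinement: surface the violated constraint to the model and retry.
        attempt_prompt = (f"{prompt}\n\nPrevious answer: {output}\n"
                          f"It violated this rule: {feedback}\nPlease revise.")
    return output  # give up after max_retries, returning the last attempt

# Example usage with a dummy "LM" that ignores the rule on the first call.
calls = iter(["a very long rambling answer " * 5, "Short answer."])
result = assert_and_retry(
    call_lm=lambda p: next(calls),
    prompt="Answer in under 10 words.",
    constraint=lambda text: len(text.split()) < 10,
    feedback="The answer must be under 10 words.",
)
print(result)
```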
PDF Download Link:
https://arxiv.org/pdf/2312.13382v2.pdf
GitHub:
• https://github.com/stanfordnlp/dspy
Datasets:
• HotpotQA
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
🔹 Title:
Reparameterized LLM Training via Orthogonal Equivalence Transformation
🔹 Publication Date: Published on Jun 9
🔹 Abstract:
AI-generated summary: A new reParameterized training algorithm named POET uses Orthogonal Equivalence Transformation to optimize neurons, providing stable optimization and improved generalization for training large-scale neural networks, including LLMs.
While large language models (LLMs) are driving the rapid advancement of artificial intelligence, effectively and reliably training these large models remains one of the field's most significant challenges. To address this challenge, we propose POET, a novel reParameterized training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. Specifically, POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix. Because of its provable preservation of the spectral properties of weight matrices, POET can stably optimize the objective function with improved generalization. We further develop efficient approximations that make POET flexible and scalable for training large-scale neural networks. Extensive experiments validate the effectiveness and scalability of POET in training LLMs.
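A rough sketch of the reparameterization described above, expressing a layer's weight as two learnable orthogonal matrices around a fixed random matrix, using PyTorch's built-in orthogonal parametrization. The layer sizes and initialization are assumptions, and the authors' efficient approximations are omitted.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class OrthogonallyReparamLinear(nn.Module):
    """y = (R @ W0 @ P) x + b, with W0 fixed random and R, P kept orthogonal."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Fixed random core weight: registered as a buffer so it is never updated.
        self.register_buffer("w0", torch.randn(out_features, in_features) / in_features**0.5)
        # Learnable square matrices constrained to the orthogonal manifold.
        self.left = orthogonal(nn.Linear(out_features, out_features, bias=False))
        self.right = orthogonal(nn.Linear(in_features, in_features, bias=False))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        w = self.left.weight @ self.w0 @ self.right.weight
        return nn.functional.linear(x, w, self.bias)

layer = OrthogonallyReparamLinear(16, 8)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 8])
```

Because the outer matrices are orthogonal, the singular values of the effective weight equal those of the fixed random matrix, which is the spectral-preservation property the abstract mentions.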
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.08001
• PDF: https://arxiv.org/pdf/2506.08001
• Project Page: https://spherelab.ai/poet/
• Github: https://github.com/Sphere-AI-Lab/poet
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
Article Title:
VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
Article Date: 28 May 2025
Article Description:
Effectively retrieving, reasoning and understanding visually rich information remains a challenge for RAG methods. Traditional text-based methods cannot handle visual-related information. On the other hand, current vision-based RAG approaches are often limited by fixed pipelines and frequently struggle to reason effectively due to the insufficient activation of the fundamental capabilities of models. As RL has been proven to be beneficial for model reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories with the help of visual perception tokens and undergoing continual optimization based on these samples. Our approach highlights key limitations of RL in RAG domains: (i) Prior Multi-modal RAG approaches tend to merely incorporate images into the context, leading to insufficient reasoning token allocation and neglecting visual-specific perception; and (ii) When models interact with search engines, their queries often fail to retrieve relevant information due to the inability to articulate requirements, thereby leading to suboptimal performance. To address these challenges, we define an action space tailored for visually rich inputs, with actions including cropping and scaling, allowing the model to gather information from a coarse-to-fine perspective. Furthermore, to bridge the gap between users' original inquiries and the retriever, we employ a simple yet effective reward that integrates query rewriting and retrieval performance with a model-based reward. Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL strategies, aligning the model with real-world applications. The code is available at https://github.com/Alibaba-NLP/VRAG.
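The description defines an action space with cropping and scaling so the model can inspect visually rich inputs from coarse to fine. The sketch below applies such actions to an image with Pillow; the action dictionary format and the relative-coordinate convention are invented for illustration.

```python
from PIL import Image

def apply_visual_action(image, action):
    """Apply a coarse-to-fine perception action to a PIL image (illustrative format).

    action examples:
      {"type": "crop", "box": (0.25, 0.25, 0.75, 0.75)}  # relative (left, top, right, bottom)
      {"type": "scale", "factor": 2.0}                    # zoom in by resampling
    """
    if action["type"] == "crop":
        w, h = image.size
        l, t, r, b = action["box"]
        return image.crop((int(l * w), int(t * h), int(r * w), int(b * h)))
    if action["type"] == "scale":
        w, h = image.size
        f = action["factor"]
        return image.resize((int(w * f), int(h * f)))
    raise ValueError(f"unknown action type: {action['type']}")

page = Image.new("RGB", (1024, 1448), "white")  # stand-in for a retrieved document page
region = apply_visual_action(page, {"type": "crop", "box": (0.1, 0.1, 0.6, 0.4)})
zoomed = apply_visual_action(region, {"type": "scale", "factor": 2.0})
print(region.size, zoomed.size)
```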
PDF Download Link:
https://arxiv.org/pdf/2505.22019v1.pdf
GitHub:
• https://github.com/alibaba-nlp/vrag
Datasets:
• No datasets information available
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
🔹 Title:
Robustness and Sensitivity of BERT Models Predicting Alzheimer's Disease from Text
🔹 Publication Date: Published on Sep 24, 2021
🔹 Abstract:
AI-generated summary: Analysis reveals that BERT is robust to natural linguistic variations but insensitive to the removal of clinically important information in text for Alzheimer's disease prediction.
Understanding the robustness and sensitivity of BERT models predicting Alzheimer's disease from text is important both for developing better classification models and for understanding their capabilities and limitations. In this paper, we analyze how a controlled amount of desired and undesired text alterations impacts the performance of BERT. We show that BERT is robust to natural linguistic variations in text. On the other hand, we show that BERT is not sensitive to removing clinically important information from text.
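A toy version of the controlled alterations the abstract refers to: deleting a fraction of (assumed) clinically salient words versus applying a harmless surface variation. The keyword list, perturbation rate, and example sentence are made up.

```python
import random

CLINICAL_TERMS = {"forget", "memory", "repeat", "confused", "lost"}  # illustrative list

def remove_clinical_info(text, fraction=0.5, seed=0):
    """Delete a fraction of the words that carry (assumed) clinically important content."""
    rng = random.Random(seed)
    words = text.split()
    hits = [i for i, w in enumerate(words) if w.lower().strip(".,") in CLINICAL_TERMS]
    drop = set(rng.sample(hits, k=int(len(hits) * fraction)))
    return " ".join(w for i, w in enumerate(words) if i not in drop)

def natural_variation(text):
    """A harmless surface variation: lowercase everything (word content preserved)."""
    return text.lower()

sample = "She tends to forget names and will repeat the same question when confused."
print(remove_clinical_info(sample))
print(natural_variation(sample))
```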
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2109.11888
• PDF: https://arxiv.org/pdf/2109.11888
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
• https://huggingface.co/spaces/Jekaterina/bert-robustness
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
🔹 Title:
The Diffusion Duality
🔹 Publication Date: Published on Jun 12
🔹 Abstract:
AI-generated summary: Duo improves uniform-state discrete diffusion models by transferring techniques from Gaussian diffusion, enhancing training speed and enabling fast few-step text generation.
Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: https://s-sahoo.github.io/duo
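The duality the abstract leans on can be illustrated with a toy corruption step: add Gaussian noise to one-hot token vectors and take an argmax, which yields a uniform-state style discrete corruption whose flip rate grows with the noise level. The vocabulary size and noise levels are arbitrary, and this is only meant to illustrate the Gaussian-to-discrete link, not the paper's exact operator.

```python
import torch

torch.manual_seed(0)

def gaussian_to_uniform_state(tokens, vocab_size, sigma):
    """Corrupt tokens by noising their one-hot vectors and taking an argmax.
    Larger sigma -> more tokens flipped to (approximately uniform) random ids."""
    one_hot = torch.nn.functional.one_hot(tokens, vocab_size).float()
    noisy = one_hot + sigma * torch.randn_like(one_hot)
    return noisy.argmax(dim=-1)

tokens = torch.randint(0, 128, (256,))
for sigma in (0.1, 0.5, 1.0, 2.0):
    corrupted = gaussian_to_uniform_state(tokens, vocab_size=128, sigma=sigma)
    flip_rate = (corrupted != tokens).float().mean().item()
    print(f"sigma={sigma}: flipped {flip_rate:.0%} of tokens")
```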
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.10892
• PDF: https://arxiv.org/pdf/2506.10892
• Project Page: https://s-sahoo.com/duo/
• Github: https://github.com/s-sahoo/duo
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
🔹 Title:
ECO: Ensembling Context Optimization for Vision-Language Models
🔹 Publication Date: Published on Jul 26, 2023
🔹 Abstract:
AI-generated summary: Learning an ensemble of prompts enhances few-shot image classification using vision-language models like CLIP without increasing inference costs.
Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts for maximizing CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts improves the results considerably and consistently, rather than relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
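A sketch of why prompt ensembling adds no inference cost: per-class text embeddings from several templates are averaged once, offline, into a single classifier vector per class, so test-time scoring is one matrix product regardless of ensemble size. The templates and model name follow standard CLIP prompt ensembling with the Hugging Face CLIP classes and are not ECO's learned contexts.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

templates = ["a photo of a {}.", "a close-up photo of a {}.", "a drawing of a {}."]
classes = ["cat", "dog", "bicycle"]

with torch.no_grad():
    class_weights = []
    for name in classes:
        prompts = [t.format(name) for t in templates]
        inputs = processor(text=prompts, return_tensors="pt", padding=True)
        emb = model.get_text_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_weights.append(emb.mean(dim=0))      # ensemble collapses to one vector per class
    class_weights = torch.stack(class_weights)
    class_weights = class_weights / class_weights.norm(dim=-1, keepdim=True)

def classify(image):
    """Score an image with a single matrix product, regardless of ensemble size."""
    pixel = processor(images=image, return_tensors="pt")["pixel_values"]
    with torch.no_grad():
        img = model.get_image_features(pixel_values=pixel)
    img = img / img.norm(dim=-1, keepdim=True)
    return classes[int((img @ class_weights.T).argmax())]
```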
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2307.14063
• PDF: https://arxiv.org/pdf/2307.14063
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
Article Title:
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
Article Date: Mar 2025
Article Description:
We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. Prior video models already generate realistic videos, but they tend to leverage little 3D information, leading to inconsistencies, such as objects popping in and out of existence. Camera control, if implemented at all, is imprecise, because camera parameters are mere inputs to the neural network which must then infer how the video depends on the camera. In contrast, GEN3C is guided by a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Crucially, this means that GEN3C neither has to remember what it previously generated nor does it have to infer the image structure from the camera pose. The model, instead, can focus all its generative power on previously unobserved regions, as well as advancing the scene state to the next frame. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video. Results are best viewed in videos. Check out our webpage: https://research.nvidia.com/labs/toronto-ai/GEN3C/ (CVPR 2025).
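The 3D cache described above starts from per-pixel depth unprojected into a point cloud. Below is a generic pinhole-camera unprojection sketch; the intrinsics, depth map, and camera pose are placeholders and the code is not tied to GEN3C's implementation.

```python
import numpy as np

def depth_to_points(depth, K, cam_to_world=np.eye(4)):
    """Unproject a depth map (H, W) into world-space 3D points using pinhole intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                      # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)                # scale rays by per-pixel depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]               # transform into world coordinates

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 2.0)                          # a flat wall 2 m away (placeholder)
points = depth_to_points(depth, K)
print(points.shape)                                       # (307200, 3)
```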
PDF Download Link:
https://arxiv.org/pdf/2503.03751v1.pdf
GitHub:
• https://github.com/nv-tlabs/GEN3C
Datasets:
• Waymo Open Dataset
• Kubric
• RealEstate10K
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
🔹 Title:
SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner
🔹 Publication Date: Published on Jun 10
🔹 Abstract:
AI-generated summary: A novel data synthesis framework, SWE-Flow, uses unit tests to automatically infer development steps and generate a structured schedule for Test-Driven Development (TDD), significantly improving the performance of open models fine-tuned on real-world projects.
We introduce SWE-Flow, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, SWE-Flow automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of SWE-Flow is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step development schedule. At each step, SWE-Flow produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the SWE-Flow-Eval benchmark. Our experiments show that fine-tuning open models on this dataset significantly improves performance in TDD-based coding. To facilitate further research, we release all code, datasets, models, and Docker images on GitHub: https://github.com/Hambaobao/SWE-Flow.
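To make the idea of deriving a development schedule from a dependency graph concrete, the sketch below topologically orders functions from a dependency mapping using Python's standard-library graphlib; the example graph is made up and this is not the paper's RDG construction.

```python
from graphlib import TopologicalSorter

# Hypothetical runtime dependencies: each function maps to the functions it calls.
runtime_deps = {
    "parse_config": set(),
    "load_data": {"parse_config"},
    "train_model": {"load_data", "parse_config"},
    "evaluate": {"train_model", "load_data"},
}

# A valid development schedule implements callees before their callers, so that
# at every step the partial codebase plus its unit tests can actually run.
schedule = list(TopologicalSorter(runtime_deps).static_order())
print(schedule)  # e.g. ['parse_config', 'load_data', 'train_model', 'evaluate']
```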
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09003
• PDF: https://arxiv.org/pdf/2506.09003
• Github: https://github.com/Hambaobao/SWE-Flow
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
• https://t.iss.one/DataScienceT
πΉ Title:
A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
πΉ Publication Date: Published on Jun 11
πΉ Abstract:
InterSyn, a large-scale dataset with tightly interleaved image-text outputs and automated quality refinement, improves multimodal understanding and generation through the SEIR method and SynJudge, an automatic evaluation tool. AI-generated summary Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR) method. InterSyn features multi-turn, instruction-driven dialogues with tightly interleaved image-text responses, providing rich object diversity and rigorous automated quality refinement, making it well-suited for training next-generation instruction-following LMMs. Furthermore, to address the lack of reliable evaluation tools capable of assessing interleaved multimodal outputs, we introduce SynJudge, an automatic evaluation model designed to quantitatively assess multimodal outputs along four dimensions: text content, image content, image quality, and image-text synergy. Experimental studies show that the SEIR method leads to substantially higher dataset quality compared to an otherwise identical process without refinement. Moreover, LMMs trained on InterSyn achieve uniform performance gains across all evaluation metrics, confirming InterSyn's utility for advancing multimodal systems.
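The SEIR loop can be pictured as generate, self-evaluate along the four SynJudge-style axes, then refine until the weakest axis clears a threshold. The sketch below is schematic, with hypothetical judge and reviser callables rather than the paper's implementation; only the four score fields come from the abstract.
```python
# Schematic sketch of a self-evaluation-with-iterative-refinement loop (assumed interfaces).
from dataclasses import dataclass

@dataclass
class JudgeScores:
    text_content: float       # the four axes named in the abstract
    image_content: float
    image_quality: float
    synergy: float            # image-text synergy

    def minimum(self) -> float:
        return min(self.text_content, self.image_content,
                   self.image_quality, self.synergy)

def seir_refine(draft, judge, reviser, threshold=0.8, max_rounds=3):
    """Refine an interleaved image-text sample until every axis passes or the budget runs out."""
    scores = judge(draft)                  # judge(draft) -> JudgeScores (hypothetical callable)
    for _ in range(max_rounds):
        if scores.minimum() >= threshold:
            break
        draft = reviser(draft, scores)     # regenerate the weakest parts (hypothetical callable)
        scores = judge(draft)              # re-score the revised draft
    return draft, scores
```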
πΉ Links:
β’ arXiv Page: https://arxiv.org/abs/2506.09427
β’ PDF: https://arxiv.org/pdf/2506.09427
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
==================================
For more data science resources:
β https://t.iss.one/DataScienceT
πΉ Title: CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects
πΉ Publication Date:
Published on May 27
πΉ Abstract:
A coordinated diffusion noise optimization framework improves whole-body manipulation of articulated objects by leveraging specialized diffusion models for body and hand motions and a unified basis point set representation for precise hand-object interaction. AI-generated summary Synthesizing whole-body manipulation of articulated objects, including body motion, hand motion, and object motion, is a critical yet challenging task with broad applications in virtual humans and robotics. The core challenges are twofold. First, achieving realistic whole-body motion requires tight coordination between the hands and the rest of the body, as their movements are interdependent during manipulation. Second, articulated object manipulation typically involves high degrees of freedom and demands higher precision, often requiring the fingers to be placed at specific regions to actuate movable parts. To address these challenges, we propose a novel coordinated diffusion noise optimization framework. Specifically, we perform noise-space optimization over three specialized diffusion models for the body, left hand, and right hand, each trained on its own motion dataset to improve generalization. Coordination naturally emerges through gradient flow along the human kinematic chain, allowing the global body posture to adapt in response to hand motion objectives with high fidelity. To further enhance precision in hand-object interaction, we adopt a unified representation based on basis point sets (BPS), where end-effector positions are encoded as distances to the same BPS used for object geometry. This unified representation captures fine-grained spatial relationships between the hand and articulated object parts, and the resulting trajectories serve as targets to guide the optimization of diffusion noise, producing highly accurate interaction motion. We conduct extensive experiments demonstrating that our method outperforms existing approaches in motion quality and physical plausibility, and enables various capabilities such as object pose control, simultaneous walking and manipulation, and whole-body generation from hand-only data.
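A small sketch of the shared basis point set (BPS) idea described above: both the object surface and the hand end-effectors are summarized by their distances to one fixed set of basis points, so they live in the same feature space and their difference yields a simple proximity signal. The basis size, random point clouds, and proximity measure below are assumptions for illustration, not the authors' exact formulation.
```python
# Minimal sketch of a shared basis point set (BPS) encoding for hands and objects.
import numpy as np

rng = np.random.default_rng(0)
basis = rng.uniform(-1.0, 1.0, size=(512, 3))       # one fixed, shared basis point set

def bps_encode(points, basis_points):
    """For each basis point, record the distance to the nearest input point."""
    d = np.linalg.norm(basis_points[:, None, :] - points[None, :, :], axis=-1)
    return d.min(axis=1)                             # shape: (num_basis,)

object_surface = rng.uniform(-0.2, 0.2, size=(2000, 3))   # hypothetical object point cloud
fingertips     = rng.uniform(-0.2, 0.2, size=(5, 3))      # hypothetical end-effector positions

obj_code  = bps_encode(object_surface, basis)
hand_code = bps_encode(fingertips, basis)

# Because both codes reference the same basis, their difference is a simple
# hand-object proximity signal that could serve as a target when optimizing
# the diffusion noise.
proximity = np.abs(obj_code - hand_code).mean()
print(f"mean BPS distance gap: {proximity:.3f}")
```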
πΉ Links:
- arXiv Page: https://arxiv.org/abs/2505.21437
- PDF: https://arxiv.org/pdf/2505.21437
- Project Page: https://phj128.github.io/page/CoDA/index.html
- Github: https://phj128.github.io/page/CoDA/index.html
πΉ Models citing this paper:
No models found
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
πΉ Publication Date:
Published on May 27
πΉ Abstract:
A coordinated diffusion noise optimization framework improves whole-body manipulation of articulated objects by leveraging specialized diffusion models for body and hand motions and a unified basis point set representation for precise hand-object interaction. AI-generated summary Synthesizing whole-body manipulation of articulated objects, including body motion, hand motion, and object motion, is a critical yet challenging task with broad applications in virtual humans and robotics. The core challenges are twofold. First, achieving realistic whole-body motion requires tight coordination between the hands and the rest of the body, as their movements are interdependent during manipulation. Second, articulated object manipulation typically involves high degrees of freedom and demands higher precision, often requiring the fingers to be placed at specific regions to actuate movable parts. To address these challenges, we propose a novel coordinated diffusion noise optimization framework. Specifically, we perform noise-space optimization over three specialized diffusion models for the body, left hand, and right hand, each trained on its own motion dataset to improve generalization. Coordination naturally emerges through gradient flow along the human kinematic chain, allowing the global body posture to adapt in response to hand motion objectives with high fidelity. To further enhance precision in hand-object interaction, we adopt a unified representation based on basis point sets (BPS), where end-effector positions are encoded as distances to the same BPS used for object geometry. This unified representation captures fine-grained spatial relationships between the hand and articulated object parts, and the resulting trajectories serve as targets to guide the optimization of diffusion noise, producing highly accurate interaction motion. We conduct extensive experiments demonstrating that our method outperforms existing approaches in motion quality and physical plausibility , and enables various capabilities such as object pose control, simultaneous walking and manipulation , and whole-body generation from hand-only data.
πΉ Links:
- arXiv Page: https://arxiv.org/abs/2505.21437
- PDF: https://arxiv.org/pdf/2505.21437
- Project Page: https://phj128.github.io/page/CoDA/index.html
- Github: https://phj128.github.io/page/CoDA/index.html
πΉ Models citing this paper:
No models found
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
β€3
πΉ Title:
Aligning Text, Images, and 3D Structure Token-by-Token
πΉ Publication Date: Published on Jun 9
πΉ Abstract:
A unified language, image, and 3D scene model framework is proposed, achieving optimal training and performance across various 3D tasks and datasets. AI-generated summary Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and robots that navigate and interact within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We evaluate performance across four core 3D tasks -- rendering, recognition, instruction-following, and question-answering -- and four 3D datasets, synthetic and real-world. We extend our approach to reconstruct complex 3D object shapes by enriching our 3D modality with quantized shape encodings, and show our model's effectiveness on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/
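To illustrate one way a structured 3D scene could be serialized into discrete tokens alongside text and image tokens, the sketch below quantizes object positions, a yaw angle, and a shape-code id into a flat token sequence. The token vocabulary and quantization ranges are our assumptions; the paper's "cookbook" covers the actual design choices.
```python
# Illustrative sketch: serializing a structured 3D scene into a discrete token stream.
import math

def quantize(value, lo, hi, bins=256):
    """Map a continuous coordinate to one of `bins` discrete token ids."""
    value = min(max(value, lo), hi)
    return int(round((value - lo) / (hi - lo) * (bins - 1)))

def scene_to_tokens(scene, lo=-5.0, hi=5.0):
    tokens = ["<scene>"]
    for obj in scene:
        tokens.append(f"<obj:{obj['class']}>")
        tokens += [f"<pos:{quantize(c, lo, hi)}>" for c in obj["position"]]
        tokens.append(f"<rot:{quantize(obj['yaw'], -math.pi, math.pi)}>")
        tokens.append(f"<shape:{obj['shape_code']}>")   # quantized shape-encoding id
    tokens.append("</scene>")
    return tokens

scene = [
    {"class": "chair", "position": (1.2, 0.0, -0.5), "yaw": 1.57, "shape_code": 132},
    {"class": "table", "position": (0.0, 0.0, 0.0),  "yaw": 0.0,  "shape_code": 87},
]
print(scene_to_tokens(scene))
```
Once a scene is a token sequence like this, it can be interleaved with text and image tokens and trained with the same next-token objective.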
πΉ Links:
β’ arXiv Page: https://arxiv.org/abs/2506.08002
β’ PDF: https://arxiv.org/pdf/2506.08002
β’ Project Page: https://glab-caltech.github.io/kyvo/
β’ Github: https://glab-caltech.github.io/kyvo/
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
==================================
For more data science resources:
β https://t.iss.one/DataScienceT