Article Title:
Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
Article Author: Fiona Ryan
Article Date: December 2024 (arXiv:2412.09586)
Article Description:
We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person's gaze target requires reasoning both about the person's appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code is available at: https://github.com/fkryan/gazelle (CVPR 2025).
PDF Download Link:
https://arxiv.org/pdf/2412.09586v1.pdf
GitHub:
• https://github.com/fkryan/gazelle
Datasets:
• No datasets information available
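To make the design described above concrete, here is a minimal PyTorch sketch of a Gaze-LLE-style head: frozen scene features, a learned person-specific positional prompt injected at the head location, and a lightweight transformer decoder producing a gaze heatmap. This is not the authors' implementation (see the linked repo); the random tokens stand in for frozen DINOv2 patch features, and the prompt-injection scheme, layer counts, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GazeDecoderSketch(nn.Module):
    """Sketch: frozen scene features + person-specific positional prompt -> gaze heatmap."""

    def __init__(self, feat_dim=768, d_model=256, grid=16):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(feat_dim, d_model)                 # project frozen features
        self.head_prompt = nn.Parameter(torch.zeros(d_model))    # learned prompt for the head location
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=3)  # lightweight gaze decoder
        self.to_heatmap = nn.Linear(d_model, 1)

    def forward(self, patch_tokens, head_bbox):
        # patch_tokens: (B, grid*grid, feat_dim) from a frozen encoder (e.g., DINOv2)
        # head_bbox: (B, 4) normalized (x1, y1, x2, y2) of the person's head
        B = patch_tokens.size(0)
        x = self.proj(patch_tokens)
        # Mark tokens whose patch centers fall inside the head box with the learned prompt.
        ys, xs = torch.meshgrid(
            torch.linspace(0, 1, self.grid), torch.linspace(0, 1, self.grid), indexing="ij")
        centers = torch.stack([xs.flatten(), ys.flatten()], dim=-1)        # (grid*grid, 2)
        x1, y1, x2, y2 = head_bbox.unbind(-1)
        inside = ((centers[None, :, 0] >= x1[:, None]) & (centers[None, :, 0] <= x2[:, None]) &
                  (centers[None, :, 1] >= y1[:, None]) & (centers[None, :, 1] <= y2[:, None]))
        x = x + inside[..., None].float() * self.head_prompt
        x = self.decoder(x)
        return self.to_heatmap(x).squeeze(-1).view(B, self.grid, self.grid)  # per-patch gaze scores

# Toy usage with random "frozen" features standing in for DINOv2 patch tokens.
tokens = torch.randn(2, 16 * 16, 768)
boxes = torch.tensor([[0.1, 0.1, 0.3, 0.3], [0.6, 0.2, 0.8, 0.4]])
print(GazeDecoderSketch()(tokens, boxes).shape)  # torch.Size([2, 16, 16])
```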
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
🔹 Publication Date: Published on Jun 8
🔹 Abstract:
Frame Guidance offers a training-free method for controlling video generation using frame-level signals, reducing memory usage and enhancing globally coherent video output. AI-generated summary: Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance method for controllable video generation based on frame-level signals, such as keyframes, style reference images, sketches, or depth maps. For practical training-free guidance, we propose a simple latent processing method that dramatically reduces memory usage, and apply a novel latent optimization strategy designed for globally coherent video generation. Frame Guidance enables effective control across diverse tasks, including keyframe guidance, stylization, and looping, without any training, and is compatible with any video model. Experimental results show that Frame Guidance can produce high-quality controlled videos for a wide range of tasks and input signals.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.07177
• PDF: https://arxiv.org/pdf/2506.07177
• Project Page: https://frame-guidance-video.github.io/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
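The general shape of training-free frame-level guidance can be sketched as gradient steps on the latents inside a denoising loop. The sampler step, per-frame decoder, and loss below are placeholders, and the paper's memory-saving latent processing and latent optimization strategy are not reproduced; this only illustrates the control loop.

```python
import torch

def frame_guided_denoise(latents, denoise_step, decode_frame, targets, steps=50, guidance_weight=1.0):
    """Sketch of training-free frame-level guidance for a video diffusion sampler.
    `targets` maps frame index -> (target_tensor, loss_fn), e.g. a keyframe with an L2 loss."""
    for t in reversed(range(steps)):
        latents = latents.detach().requires_grad_(True)
        # Guidance: decode only the controlled frames and pull them toward their targets.
        loss = latents.new_zeros(())
        for idx, (target, loss_fn) in targets.items():
            loss = loss + loss_fn(decode_frame(latents, idx), target)
        grad, = torch.autograd.grad(loss, latents)
        with torch.no_grad():
            latents = latents - guidance_weight * grad   # steer the whole latent video
            latents = denoise_step(latents, t)           # one ordinary sampler step
    return latents

# Toy stand-ins so the sketch runs end to end (a real model would supply these).
def denoise_step(z, t): return 0.98 * z                  # fake sampler update
def decode_frame(z, idx): return z[:, idx]               # fake per-frame "decode"

latents = torch.randn(1, 16, 4, 8, 8)                    # (batch, frames, C, H, W)
keyframe = torch.zeros(1, 4, 8, 8)
out = frame_guided_denoise(latents, denoise_step, decode_frame,
                           {0: (keyframe, torch.nn.functional.mse_loss)})
print(out.shape)
```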
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling
🔹 Publication Date: Published on Jun 26
🔹 Abstract:
A physics-based skinning and rigging framework called PhysRig uses volumetric representation and continuum mechanics for more realistic and physically plausible animations. AI-generated summary: Skinning and rigging are fundamental components in animation, articulated object reconstruction, motion transfer, and 4D generation. Existing approaches predominantly rely on Linear Blend Skinning (LBS), due to its simplicity and differentiability. However, LBS introduces artifacts such as volume loss and unnatural deformations, and it fails to model elastic materials like soft tissues, fur, and flexible appendages (e.g., elephant trunks, ears, and fatty tissues). In this work, we propose PhysRig: a differentiable physics-based skinning and rigging framework that overcomes these limitations by embedding the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh), which is simulated as a deformable soft-body structure driven by the animated skeleton. Our method leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion. Additionally, we introduce material prototypes, significantly reducing the learning space while maintaining high expressiveness. To evaluate our framework, we construct a comprehensive synthetic dataset using meshes from Objaverse, The Amazing Animals Zoo, and MixaMo, covering diverse object categories and motion patterns. Our method consistently outperforms traditional LBS-based approaches, generating more realistic and physically plausible results. Furthermore, we demonstrate the applicability of our framework in the pose transfer task, highlighting its versatility for articulated object modeling.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.20936
• PDF: https://arxiv.org/pdf/2506.20936
• Project Page: https://physrig.github.io/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
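PhysRig's differentiable continuum-mechanics simulator is too involved to reproduce here, but the Linear Blend Skinning baseline the abstract criticizes is just a skin-weighted sum of bone transforms. A minimal NumPy version (with made-up vertices, weights, and bone transforms) shows the formula and the kind of blending that causes its artifacts:

```python
import numpy as np

def linear_blend_skinning(vertices, weights, bone_transforms):
    """Classic LBS: each deformed vertex is a skin-weighted sum of bone transforms.

    vertices:        (V, 3) rest-pose positions
    weights:         (V, B) skinning weights, rows sum to 1
    bone_transforms: (B, 4, 4) homogeneous transforms of each bone
    """
    V = vertices.shape[0]
    homo = np.concatenate([vertices, np.ones((V, 1))], axis=1)      # (V, 4)
    per_bone = np.einsum("bij,vj->vbi", bone_transforms, homo)      # transform by every bone
    blended = np.einsum("vb,vbi->vi", weights, per_bone)            # blend with skin weights
    return blended[:, :3]

# Two vertices driven by two bones: an identity bone and a 90-degree rotation about z.
verts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
w = np.array([[1.0, 0.0], [0.5, 0.5]])
rot_z = np.eye(4)
rot_z[:3, :3] = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
# The 50/50 vertex lands at (-0.5, 0.5, 0): linear blending of rotations shrinks it
# toward the origin, the volume-loss behavior PhysRig's soft-body simulation avoids.
print(linear_blend_skinning(verts, w, np.stack([np.eye(4), rot_z])))
```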
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Scaling Test-time Compute for LLM Agents
🔹 Publication Date: Published on Jun 15
🔹 Abstract:
Systematic exploration of test-time scaling methods in large language agents reveals that computational scaling improves performance, especially through parallel sampling, sequential revision, effective verification, and increased rollout diversity. AI-generated summary: Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents and investigate the extent to which it improves their effectiveness. Specifically, we explore different test-time scaling strategies, including: (1) parallel sampling algorithms; (2) sequential revision strategies; (3) verifiers and merging methods; (4) strategies for diversifying rollouts. We carefully analyze and ablate the impact of different design strategies on applying test-time scaling to language agents, and reach the following findings: 1. Scaling test-time compute can improve the performance of agents. 2. Knowing when to reflect is important for agents. 3. Among different verification and result-merging approaches, the list-wise method performs best. 4. Increasing diversified rollouts exerts a positive effect on the agent's task performance.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.12928
• PDF: https://arxiv.org/pdf/2506.12928
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
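The recipe the abstract describes (parallel sampling of rollouts, then list-wise verification and merging) reduces to a short loop. The agent and verifier below are placeholders (a real system would call an LLM agent and an LLM verifier that sees all candidates at once); only the control flow is meant to match the described strategies.

```python
import random

def run_agent(task, seed):
    """Placeholder for one agent rollout; a real system would run an LLM agent here."""
    rng = random.Random(seed)
    return {"trajectory": f"[seed {seed}] steps for: {task}", "answer": rng.choice(["A", "B", "C"])}

def listwise_verify(task, candidates):
    """Placeholder list-wise verifier. In the paper's setup an LLM sees *all* candidates
    at once and ranks them jointly; here we fake a joint score with majority agreement."""
    counts = {}
    for c in candidates:
        counts[c["answer"]] = counts.get(c["answer"], 0) + 1
    return max(range(len(candidates)), key=lambda i: counts[candidates[i]["answer"]])

def scale_test_time(task, n_rollouts=8):
    # (1)/(4) Parallel sampling with different seeds to diversify rollouts.
    candidates = [run_agent(task, seed) for seed in range(n_rollouts)]
    # (3) Verification + merging: the list-wise method performed best in the paper.
    best = listwise_verify(task, candidates)
    return candidates[best]

print(scale_test_time("book the cheapest flight to Tokyo"))
```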
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
SymbolicAI: A framework for logic-based approaches combining generative models and solvers
Article Date: 1 Feb 2024
Article Description:
We introduce SymbolicAI, a versatile and modular framework employing a logic-based approach to concept learning and flow management in generative processes. SymbolicAI enables the seamless integration of generative models with a diverse range of solvers by treating large language models (LLMs) as semantic parsers that execute tasks based on both natural and formal language instructions, thus bridging the gap between symbolic reasoning and generative AI. We leverage probabilistic programming principles to tackle complex tasks, and utilize differentiable and classical programming paradigms with their respective strengths. The framework introduces a set of polymorphic, compositional, and self-referential operations for multi-modal data that connects multi-step generative processes and aligns their outputs with user objectives in complex workflows. As a result, we can transition between the capabilities of various foundation models with in-context learning capabilities and specialized, fine-tuned models or solvers proficient in addressing specific problems. Through these operations based on in-context learning our framework enables the creation and evaluation of explainable computational graphs. Finally, we introduce a quality measure and its empirical score for evaluating these computational graphs, and propose a benchmark that compares various state-of-the-art LLMs across a set of complex workflows. We refer to the empirical score as the "Vector Embedding for Relational Trajectory Evaluation through Cross-similarity", or VERTEX score for short. The framework codebase and benchmark are linked below.
PDF Download Link:
https://arxiv.org/pdf/2402.00854v4.pdf
GitHub:
• https://github.com/ExtensityAI/symbolicai
• https://github.com/extensityai/benchmark
• https://github.com/xpitfire/symbolicai
Datasets:
• No datasets information available
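The "LLM as semantic parser with polymorphic, compositional operations" idea can be illustrated with a toy Symbol class whose operators compile into natural-language instructions for a model backend. This is not the SymbolicAI library's actual API (see the linked repos for that); fake_llm and the operator semantics below are stand-ins chosen for illustration.

```python
def fake_llm(instruction: str, value: str) -> str:
    """Stand-in for a real LLM call; SymbolicAI would dispatch this to a model backend."""
    return f"<LLM[{instruction}]({value})>"

class Symbol:
    """Toy version of the idea: operators on symbols become natural-language
    instructions that a language model evaluates."""

    def __init__(self, value):
        self.value = str(value)

    def __or__(self, other):            # composition: pipe this symbol into another task
        return Symbol(fake_llm("combine coherently", f"{self.value} | {other}"))

    def query(self, question: str):     # semantic query over the symbol's content
        return Symbol(fake_llm(question, self.value))

    def __eq__(self, other):            # semantic (not string) equality
        verdict = fake_llm("are these semantically equivalent? answer yes/no",
                           f"{self.value} == {other}")
        return "yes" in verdict.lower()

    def __repr__(self):
        return f"Symbol({self.value!r})"

doc = Symbol("The Eiffel Tower is located in Paris.")
print(doc.query("Which city is mentioned?"))
print(doc | "Rewrite as a formal statement.")
```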
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Smaller But Better: Unifying Layout Generation with Smaller Large Language Models
Article Date: 19 Feb 2025
Article Description:
We propose LGGPT, an LLM-based model tailored for unified layout generation. First, we propose Arbitrary Layout Instruction (ALI) and Universal Layout Response (ULR) as the uniform I/O template. ALI accommodates arbitrary layout generation task inputs across multiple layout domains, enabling LGGPT to unify both task-generic and domain-generic layout generation hitherto unexplored. Collectively, ALI and ULR boast a succinct structure that forgoes superfluous tokens typically found in existing HTML-based formats, facilitating efficient instruction tuning and boosting unified generation performance. In addition, we propose an Interval Quantization Encoding (IQE) strategy that compresses ALI into a more condensed structure. IQE precisely preserves valid layout clues while eliminating the less informative placeholders, facilitating LGGPT to capture complex and variable layout generation conditions during the unified training process. Experimental results demonstrate that LGGPT achieves superior or on par performance compared to existing methods. Notably, LGGPT strikes a prominent balance between proficiency and efficiency with a compact 1.5B parameter LLM, which beats prior 7B or 175B models even in the most extensive and challenging unified scenario. Furthermore, we underscore the necessity of employing LLMs for unified layout generation and suggest that 1.5B could be an optimal parameter size by comparing LLMs of varying scales. Code is available at https://github.com/NiceRingNode/LGGPT.
PDF Download Link:
https://arxiv.org/pdf/2502.14005v1.pdf
GitHub:
• https://github.com/niceringnode/lggpt
Datasets:
• PubLayNet
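As background for the encoding step, the generic idea of turning a layout into a compact, quantized token sequence for an LLM looks like the sketch below. LGGPT's actual ALI/ULR templates and IQE rules differ in detail and are not reproduced here; the bin count, token format, and separator are assumptions for illustration only.

```python
def quantize_layout(boxes, labels, bins=128):
    """Serialize a layout into a compact token sequence by quantizing coordinates
    into `bins` discrete intervals (generic idea only, not LGGPT's exact scheme).

    boxes: list of (x, y, w, h) in [0, 1]; labels: element types.
    """
    tokens = []
    for label, (x, y, w, h) in zip(labels, boxes):
        coords = [min(int(v * bins), bins - 1) for v in (x, y, w, h)]
        tokens.append(f"{label} " + " ".join(f"<{c}>" for c in coords))
    return " | ".join(tokens)

print(quantize_layout([(0.05, 0.02, 0.9, 0.1), (0.1, 0.2, 0.8, 0.6)],
                      ["title", "image"]))
# title <6> <2> <115> <12> | image <12> <25> <102> <76>
```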
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science
Article Date: 4 Jun 2025
Article Description:
Contemporary approaches to assisted scientific discovery use language models to automatically generate large numbers of potential hypotheses to test, while also automatically generating code-based experiments to test those hypotheses. While hypotheses can be comparatively inexpensive to generate, automated experiments can be costly, particularly when run at scale (i.e. thousands of experiments). Developing the capacity to filter hypotheses based on their feasibility would allow discovery systems to run at scale, while increasing their likelihood of making significant discoveries. In this work we introduce Matter-of-Fact, a challenge dataset for determining the feasibility of hypotheses framed as claims. Matter-of-Fact includes 8.4k claims extracted from scientific articles spanning four high-impact contemporary materials science topics, including superconductors, semiconductors, batteries, and aerospace materials, while including qualitative and quantitative claims from theoretical, experimental, and code/simulation results. We show that strong baselines that include retrieval augmented generation over scientific literature and code generation fail to exceed 72% performance on this task (chance performance is 50%), while domain-expert verification suggests nearly all are solvable -- highlighting both the difficulty of this task for current models, and the potential to accelerate scientific discovery by making near-term progress.
PDF Download Link:
https://arxiv.org/pdf/2506.04410v1.pdf
GitHub:
• https://github.com/cognitiveailab/matter-of-fact
Datasets:
• COVID-Fact
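A retrieval-augmented baseline of the kind the abstract mentions is a two-step pipeline: retrieve supporting literature, then ask a model to label the claim feasible or infeasible. The retriever, corpus, and decision rule below are toy stand-ins, not the benchmark's official baseline.

```python
def retrieve(claim, corpus, k=3):
    """Toy lexical retriever standing in for a real literature search index."""
    overlap = lambda doc: len(set(claim.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def judge_feasibility(claim, evidence):
    """Placeholder for an LLM prompted to label the claim feasible/infeasible
    given the retrieved passages; the string check is a fake decision rule."""
    prompt = "Evidence:\n" + "\n".join(evidence) + f"\nClaim: {claim}\nFeasible? (yes/no)"
    return "yes" if "superconduct" in " ".join(evidence).lower() else "no"

corpus = ["LaH10 shows superconductivity near 250 K under high pressure.",
          "Graphite anodes dominate commercial lithium-ion batteries."]
claim = "A hydride superconductor can exceed 240 K at megabar pressures."
print(judge_feasibility(claim, retrieve(claim, corpus)))
```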
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs
Article Date: 23 Aug 2023
Article Description:
Despite the progress of foundation models, knowledge-based reasoning remains a persistent challenge due to their limited capacity for knowledge recall and inference. Existing methods primarily focus on encouraging these models to plan and solve problems or extensively sample reasoning chains independently. However, these methods often overlook conceptual errors and inferential fallacies, inevitably leading to a series of notorious issues such as misleading conclusions, cognitive biases, and reduced decision quality. While explicit modeling of causality is argued to hold promise in addressing these issues, contemporary research efforts have thus far fallen short in achieving causality-based foundation models. Drawing inspiration from the orchestration of diverse specialized agents collaborating to tackle intricate tasks, we propose a framework named Causal-Consistency Chain-of-Thought (CaCo-CoT) that harnesses multi-agent collaboration to bolster the faithfulness and causality of foundation models, involving a set of reasoners and evaluators. These agents collaboratively work within a reasoning-and-consensus paradigm to improve faithfulness. The reasoners are tasked with generating reasoning chains for knowledge-intensive problems by mimicking human causal reasoning. Meanwhile, the evaluator scrutinizes the causal consistency of a reasoner's reasoning chain from a non-causal and a counterfactual perspective. Our framework demonstrates significant superiority over state-of-the-art methods through extensive and comprehensive evaluations across text-based and multi-modal knowledge reasoning tasks (e.g., science question answering and commonsense reasoning).
PDF Download Link:
https://arxiv.org/pdf/2308.11914v4.pdf
GitHub:
• https://github.com/hcplab-sysu/causalvlr
• https://github.com/hcplab-sysu/causal-vlreasoning
Datasets:
• BoolQ
• ScienceQA
• Com2Sense
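The reasoning-and-consensus loop described above (several reasoners propose causal chains, an evaluator checks their causal consistency, and the surviving answers are merged) can be sketched as follows. The reasoner, evaluator, and scoring rule are placeholders for LLM calls; only the control flow mirrors the described paradigm.

```python
def reasoner(question, persona):
    """Placeholder reasoner agent: a real system prompts an LLM for a causal,
    step-by-step reasoning chain plus an answer."""
    return {"chain": f"[{persona}] step-by-step causal chain for: {question}",
            "answer": "B"}

def evaluator(question, chain):
    """Placeholder evaluator agent: scores causal consistency of a chain from a
    non-causal and a counterfactual perspective (both faked here)."""
    non_causal_ok = "causal" in chain["chain"]
    counterfactual_ok = True
    return int(non_causal_ok) + int(counterfactual_ok)

def caco_cot_sketch(question, n_reasoners=3, threshold=2):
    chains = [reasoner(question, f"reasoner-{i}") for i in range(n_reasoners)]
    accepted = [c for c in chains if evaluator(question, c) >= threshold]
    # Reasoning-and-consensus: keep causally consistent chains, then majority vote.
    votes = {}
    for c in accepted or chains:
        votes[c["answer"]] = votes.get(c["answer"], 0) + 1
    return max(votes, key=votes.get)

print(caco_cot_sketch("Why does ice float on water?"))
```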
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
🔹 Publication Date: Published on Jun 24
🔹 Abstract:
Outlier-Safe Pre-Training improves large language model quantization performance by preventing extreme activation outliers through innovative training techniques. AI-generated summary: Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.19697
• Explainer: https://arxivexplained.com/papers/outlier-safe-pre-training-for-robust-4-bit-quantization-of-large-language-models
• PDF: https://arxiv.org/pdf/2506.19697
• Project Page: https://huggingface.co/papers?q=learnable%20embedding%20projection
• Github: https://github.com/dmis-lab/Outlier-Safe-Pre-Training
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
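Two pieces of the abstract are easy to make concrete: the excess-kurtosis diagnostic it quotes (0.04 vs. 1818.56), and the Single-Scale RMSNorm component. The kurtosis function below is standard; the RMSNorm module reflects one plausible reading of "Single-Scale" (a single scalar gain instead of a per-channel gain vector), which is an assumption rather than the paper's exact definition.

```python
import torch
import torch.nn as nn

def excess_kurtosis(x: torch.Tensor) -> float:
    """Excess kurtosis of a flattened activation tensor (0 for a Gaussian);
    the paper uses this statistic to quantify activation outliers."""
    x = x.flatten().float()
    z = (x - x.mean()) / x.std(unbiased=False)
    return (z.pow(4).mean() - 3.0).item()

class SingleScaleRMSNorm(nn.Module):
    """Assumed reading of 'Single-Scale RMSNorm': normalize by the RMS but learn
    one scalar gain, so no individual channel can be selectively amplified."""

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(()))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.gain * x / rms

acts = torch.randn(4, 128, 768)
acts[0, 0, 0] = 200.0                      # inject one extreme activation outlier
print(round(excess_kurtosis(acts), 2))     # large value driven by the single outlier
print(SingleScaleRMSNorm()(acts).shape)
```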
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games
Article Date: 5 Jun 2025
Article Description:
LLMs are used predominantly in synchronous communication, where a human user and a model communicate in alternating turns. In contrast, many real-world settings are inherently asynchronous. For example, in group chats, online team meetings, or social games, there is no inherent notion of turns; therefore, the decision of when to speak forms a crucial part of the participant's decision making. In this work, we develop an adaptive asynchronous LLM-agent which, in addition to determining what to say, also decides when to say it. To evaluate our agent, we collect a unique dataset of online Mafia games, including both human participants, as well as our asynchronous agent. Overall, our agent performs on par with human players, both in game performance, as well as in its ability to blend in with the other human players. Our analysis shows that the agent's behavior in deciding when to speak closely mirrors human patterns, although differences emerge in message content. We release all our data and code to support and encourage further research for more realistic asynchronous communication between LLM agents. This work paves the way for integration of LLMs into realistic human group settings, from assistance in team discussions to educational and professional environments where complex social dynamics must be navigated.
PDF Download Link:
https://arxiv.org/pdf/2506.05309v1.pdf
GitHub:
• https://github.com/niveck/LLMafia
Datasets:
• LLMafia
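The key asynchronous design (deciding when to speak, not just what to say) amounts to a second decision point in the message loop. Both decision functions below are placeholders for LLM calls, and the heuristic inside should_speak is an assumption for illustration, not the paper's policy.

```python
import random

def should_speak(agent_state, chat_history):
    """Placeholder 'when to speak' decision: a real agent queries an LLM with the
    chat so far and its game role, returning speak/wait."""
    addressed = bool(chat_history) and agent_state["name"].lower() in chat_history[-1].lower()
    return addressed or random.random() < agent_state["talkativeness"]

def compose_message(agent_state, chat_history):
    """Placeholder 'what to say' step (an LLM call in the real agent)."""
    return f"{agent_state['name']}: I am not convinced by the last accusation."

def async_game_loop(agent_state, incoming_messages):
    chat_history = []
    for msg in incoming_messages:                      # messages arrive with no turn structure
        chat_history.append(msg)
        if should_speak(agent_state, chat_history):    # the asynchronous decision point
            chat_history.append(compose_message(agent_state, chat_history))
    return chat_history

agent = {"name": "Dana", "talkativeness": 0.2}
log = async_game_loop(agent, ["Alex: I think Dana is the mafia.", "Sam: Too early to tell."])
print("\n".join(log))
```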
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
Article Date: 28 May 2025
Article Description:
Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and appealing visual quality videos. However, existing methods primarily focus on single human animation and struggle with multi-stream audio inputs, facing incorrect binding problems between audio and persons. Additionally, they exhibit limitations in instruction-following capabilities. To solve this problem, in this paper, we propose a novel task: Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to address the challenges during multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio and person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk achieves superior performance compared to other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the powerful generation capabilities of our approach.
PDF Download Link:
https://arxiv.org/pdf/2505.22647v1.pdf
GitHub:
• https://github.com/meigen-ai/multitalk
Datasets:
• CelebV-HQ
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
SAFE: Multitask Failure Detection for Vision-Language-Action Models
🔹 Publication Date: Published on Jun 11
🔹 Abstract:
SAFE is a failure detector for vision-language-action models that generalizes to unseen tasks by learning from high-level internal features of the models. AI-generated summary: While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out-of-the-box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts, and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it on OpenVLA, pi_0, and pi_0-FAST in both simulated and real-world environments extensively. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results can be found at https://vla-safe.github.io/.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09937
• PDF: https://arxiv.org/pdf/2506.09937
• Project Page: https://vla-safe.github.io/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
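The general recipe (a lightweight probe maps a policy's internal features to a scalar failure score, with a calibrated threshold for raising alerts) can be sketched as below. The random arrays stand in for pooled VLA activations, the logistic probe and the split-conformal-style threshold are illustrative choices, and SAFE's actual detector and conformal procedure differ in detail.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for pooled internal features of a VLA policy collected from rollouts
# (label 1 = the rollout failed). Real features would come from OpenVLA / pi_0 activations.
X_train = rng.normal(size=(200, 64)); y_train = rng.integers(0, 2, 200)
X_calib = rng.normal(size=(50, 64));  y_calib = rng.integers(0, 2, 50)
X_test  = rng.normal(size=(5, 64))

# A lightweight probe maps internal features to a scalar failure score.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
calib_scores = probe.predict_proba(X_calib)[:, 1]

# Split-conformal-style threshold: flag failure only when the score exceeds the
# (1 - alpha) quantile of scores observed on *successful* calibration rollouts.
alpha = 0.1
threshold = np.quantile(calib_scores[y_calib == 0], 1 - alpha)
test_scores = probe.predict_proba(X_test)[:, 1]
print("failure scores:", np.round(test_scores, 2))
print("raise alert:   ", test_scores > threshold)
```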
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
DLNet: Direction-Aware Feature Integration for Robust Lane Detection in Complex Environments
Article Author: Zhaoxuan Lu
Article Date: Not Available
Article Description:
The rapid advancement of autonomous driving systems has created a pressing need for accurate and robust lane detection to ensure driving safety and reliability. However, lane detection still faces several critical challenges in real-world scenarios: (1) severe occlusions caused by urban traffic and complex road layouts; (2) the difficulty of handling sharp curves and large curvature variations; and (3) varying lighting conditions that blur or degrade lane markings. To address these challenges, we propose DLNet, a novel direction-aware feature integration framework that integrates both low-level geometric details and high-level semantic cues. In particular, the approach includes:
(i) a Multi-Skip Feature Attention Block (MSFAB) to refine local lane features by adaptively fusing multi-scale representations,
(ii) a Context-Aware Feature Pyramid Network (CAFPN) to enhance global context modeling under adverse conditions, and
(iii) a Directional Lane IoU (DLIoU) loss function that explicitly encodes lane directionality and curvature, providing more accurate lane overlap estimation. Extensive experiments conducted on two benchmark datasets, CULane and CurveLanes, show DLNet achieves new state-of-the-art results, with F1@50 and F1@75 scores of 81.23% and 64.75% on CULane, an F1@50 score of 86.51% on CurveLanes, and a high F1 score of 97.62 on the TUSimple dataset. The source code and pretrained models will be made publicly available at https://github.com/RDXiaoLu/DLNet.git.
PDF Download Link:
Not Available
GitHub:
• https://github.com/RDXiaoLu/DLNet
• https://github.com/RDXiaoLu/DLNet.git
• https://github.com/RDXiaoLu/DLNet/tree/main
Datasets:
• CULane
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding
Article Date: 9 Jun 2025
Article Description:
Large Language Models (LLMs) require alignment with human preferences to avoid generating offensive, false, or meaningless content. Recently, low-resource methods for LLM alignment have been popular, while still facing challenges in obtaining both high-quality and aligned content. Motivated by the observation that the difficulty of generating aligned responses is concentrated at the beginning of decoding, we propose a novel framework, Weak-to-Strong Decoding (WSD), to enhance the alignment ability of base models by the guidance of a small aligned model. The small model first drafts well-aligned beginnings, followed by the large base model to continue the rest, controlled by a well-designed auto-switch mechanism. We also collect a new dataset, GenerAlign, to fine-tune a small-sized Pilot-3B as the draft model, which effectively enhances different base models under the WSD framework to outperform all baseline methods, while avoiding degradation on downstream tasks, termed as the alignment tax. Extensive experiments are further conducted to examine the impact of different settings and time efficiency, as well as analyses on the intrinsic mechanisms of WSD in depth.
PDF Download Link:
https://arxiv.org/pdf/2506.07434v1.pdf
GitHub:
• https://github.com/F2-Song/Weak-to-Strong-Decoding
Datasets:
• No datasets information available
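The decoding scheme (a small aligned drafter writes the opening, then an auto-switch hands control to the large base model) reduces to a short loop. The token generators are placeholders for real models (the paper fine-tunes Pilot-3B as the drafter), and the switch condition here is a simple assumption; the paper's auto-switch mechanism is more sophisticated.

```python
ALIGNED_OPENING = "Sure, here is a careful and safe way to approach that:".split()

def draft_next_token(step):
    """Placeholder for the small aligned drafter (Pilot-3B in the paper)."""
    return ALIGNED_OPENING[step] if step < len(ALIGNED_OPENING) else None

def base_model_continue(prefix):
    """Placeholder for the large base model continuing from the drafted prefix."""
    return prefix + " <base-model continuation with the task-specific details>"

def auto_switch(step, last_token):
    """Stand-in switch rule: hand over once the aligned opening is finished.
    This condition is an assumption, not the paper's learned mechanism."""
    return last_token is None or step >= len(ALIGNED_OPENING)

def weak_to_strong_decode(prompt):
    tokens, step = [], 0
    while True:                                   # drafter controls the risky beginning
        tok = draft_next_token(step)
        if auto_switch(step, tok):
            break
        tokens.append(tok)
        step += 1
    prefix = prompt + " " + " ".join(tokens)
    return base_model_continue(prefix)            # base model finishes the response

print(weak_to_strong_decode("User: how do I report a phishing email?\nAssistant:"))
```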
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
Article Date: 30 Sep 2022
Article Description:
Recent Vision-Language Pre-trained (VLP) models based on dual encoder have attracted extensive attention from academia and industry due to their superior performance on various cross-modal tasks and high computational efficiency. They attempt to learn cross-modal representation using contrastive learning on image-text pairs, however, the built inter-modal correlations only rely on a single view for each modality. Actually, an image or a text contains various potential views, just as humans could capture a real-world scene via diverse descriptions or photos. In this paper, we propose ERNIE-ViL 2.0, a Multi-View Contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously, aiming at learning a more robust cross-modal representation. Specifically, we construct multiple views within each modality to learn the intra-modal correlation for enhancing the single-modal representation. Besides the inherent visual/textual views, we construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs. Pre-trained with 29M publicly available datasets, ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval. Additionally, to generalize our method to Chinese cross-modal tasks, we train ERNIE-ViL 2.0 through scaling up the pre-training datasets to 1.5B Chinese image-text pairs, resulting in significant improvements compared to previous SOTA results on Chinese cross-modal retrieval. We release our pre-trained models in https://github.com/PaddlePaddle/ERNIE.
PDF Download Link:
https://arxiv.org/pdf/2209.15270v1.pdf
GitHub:
• https://github.com/PaddlePaddle/ERNIE
Datasets:
• COCO (Common Objects in Context)
• Flickr30k
• CC12M
• COCO-CN
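The core loss idea (contrastive alignment across multiple views of each example, both within and across modalities, with object-tag sequences treated as an extra textual view) can be sketched with a symmetric InfoNCE over all view pairs. The encoders that would produce these embeddings are omitted, and the specific view set and weighting below are assumptions, not ERNIE-ViL 2.0's exact configuration.

```python
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(views, temperature=0.07):
    """Symmetric InfoNCE over every pair of views. `views` maps a view name to a
    (B, D) embedding batch; row i of every view describes the same example, so
    matching rows are positives and all other rows in the batch are negatives."""
    names = list(views)
    feats = {k: F.normalize(v, dim=-1) for k, v in views.items()}
    total, count = 0.0, 0
    for i, a in enumerate(names):
        for b in names[i + 1:]:                      # covers intra- and inter-modal pairs
            logits = feats[a] @ feats[b].T / temperature
            targets = torch.arange(logits.size(0))
            total = total + F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)
            count += 2
    return total / count

B, D = 8, 256
views = {"image_view": torch.randn(B, D),
         "image_aug_view": torch.randn(B, D),        # second visual view (intra-modal pair)
         "caption_view": torch.randn(B, D),
         "object_tags_view": torch.randn(B, D)}      # tag sequence as an extra textual view
print(multiview_contrastive_loss(views))
```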
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments
Article Date: 1 Feb 2024
Article Description:
With the surge in the development of large language models, embodied intelligence has attracted increasing attention. Nevertheless, prior works on embodied intelligence typically encode scene or historical memory in an unimodal manner, either visual or linguistic, which complicates the alignment of the model's action planning with embodied control. To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions. Specifically, we propose a novel Multimodal Environment Memory (MEM) module, facilitating the integration of embodied control with large models through the visual-language memory of scenes. This capability enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities. Furthermore, we construct an embodied question answering dataset based on a dynamic virtual cafe environment with the help of the large language model. In this virtual environment, we conduct several experiments, utilizing multiple large models through zero-shot learning, and carefully design scenarios for various situations. The experimental results showcase the promising performance of our MEIA in various embodied interactive tasks.
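A minimal sketch of the general idea of a visual-language environment memory feeding an LLM planner is shown below; the record fields, prompt layout, and the `llm` callable are hypothetical placeholders, not the paper's MEM module or action space.

```python
# Illustrative sketch of a multimodal environment memory feeding an LLM-based planner.
# The record fields, prompt format, and the stubbed `llm` callable are assumptions.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class MemoryEntry:
    step: int
    caption: str          # language description of the observed scene
    objects: List[str]    # detected object labels for that observation

@dataclass
class EnvironmentMemory:
    entries: List[MemoryEntry] = field(default_factory=list)

    def add(self, step: int, caption: str, objects: List[str]) -> None:
        self.entries.append(MemoryEntry(step, caption, objects))

    def summarize(self, last_k: int = 5) -> str:
        recent = self.entries[-last_k:]
        return "\n".join(f"[t={e.step}] {e.caption} (objects: {', '.join(e.objects)})"
                         for e in recent)

def plan_actions(task: str, memory: EnvironmentMemory, llm: Callable[[str], str]) -> List[str]:
    """Ask an LLM for a list of executable actions grounded in the scene memory."""
    prompt = (f"Task: {task}\n"
              f"Scene memory:\n{memory.summarize()}\n"
              "Return one robot action per line (e.g. 'move_to(table)', 'pick(cup)').")
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

# Toy usage with a stubbed LLM.
mem = EnvironmentMemory()
mem.add(0, "A cafe counter with a coffee machine and cups", ["coffee_machine", "cup"])
actions = plan_actions("serve a coffee to the customer", mem,
                       llm=lambda p: "move_to(counter)\npick(cup)\nuse(coffee_machine)")
```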
PDF Download Link:
https://arxiv.org/pdf/2402.00290v3.pdf
GitHub:
• https://github.com/hcplab-sysu/causalvlr
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
SkillBlender: Towards Versatile Humanoid Whole-Body Loco-Manipulation via Skill Blending
🔹 Publication Date: Published on Jun 11
🔹 Abstract:
SkillBlender is a hierarchical reinforcement learning framework that uses pretrained primitive skills to efficiently solve diverse loco-manipulation tasks for humanoid robots. AI-generated summary: Humanoid robots hold significant potential in accomplishing daily tasks across diverse environments thanks to their flexibility and human-like morphology. Recent works have made significant progress in humanoid whole-body control and loco-manipulation leveraging optimal control or reinforcement learning. However, these methods require tedious task-specific tuning for each task to achieve satisfactory behaviors, limiting their versatility and scalability to diverse tasks in daily scenarios. To that end, we introduce SkillBlender, a novel hierarchical reinforcement learning framework for versatile humanoid loco-manipulation. SkillBlender first pretrains goal-conditioned task-agnostic primitive skills, and then dynamically blends these skills to accomplish complex loco-manipulation tasks with minimal task-specific reward engineering. We also introduce SkillBench, a parallel, cross-embodiment, and diverse simulated benchmark containing three embodiments, four primitive skills, and eight challenging loco-manipulation tasks, accompanied by a set of scientific evaluation metrics balancing accuracy and feasibility. Extensive simulated experiments show that our method significantly outperforms all baselines, while naturally regularizing behaviors to avoid reward hacking, resulting in more accurate and feasible movements for diverse loco-manipulation tasks in our daily scenarios. Our code and benchmark will be open-sourced to the community to facilitate future research. Project page: https://usc-gvl.github.io/SkillBlender-web/.
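The blending mechanism can be sketched as a high-level policy that outputs weights over frozen primitive-skill policies; the network sizes, softmax blending rule, and observation layout below are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of skill blending: a high-level policy produces weights over frozen,
# pretrained primitive-skill policies, and the low-level command is their weighted mix.
import torch
import torch.nn as nn

class SkillBlendPolicy(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, skills: nn.ModuleList):
        super().__init__()
        self.skills = skills                       # pretrained primitives, kept frozen
        for p in self.skills.parameters():
            p.requires_grad_(False)
        self.blender = nn.Sequential(              # high-level policy: obs -> blend weights
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, len(skills)),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.blender(obs), dim=-1)                  # [B, S]
        skill_actions = torch.stack([s(obs) for s in self.skills], dim=1)   # [B, S, A]
        return (weights.unsqueeze(-1) * skill_actions).sum(dim=1)           # blended [B, A]

# Toy usage: four stand-in primitive skills (e.g. walking, reaching, squatting, stepping).
obs_dim, action_dim = 64, 19
skills = nn.ModuleList([nn.Linear(obs_dim, action_dim) for _ in range(4)])
policy = SkillBlendPolicy(obs_dim, action_dim, skills)
action = policy(torch.randn(2, obs_dim))
```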
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09366
• PDF: https://arxiv.org/pdf/2506.09366
• Project Page: https://usc-gvl.github.io/SkillBlender-web/
• Github: https://usc-gvl.github.io/SkillBlender-web/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
Article Date: 5 Jun 2025
Article Description:
We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline (as in MinerU's modular approach) and avoids the inefficiencies of processing full pages with giant end-to-end models (e.g., large multimodal LLMs like Qwen-VL). In SRR, document parsing is abstracted into three fundamental questions - "Where is it?" (structure), "What is it?" (recognition), and "How is it organized?" (relation) - corresponding to layout analysis, content identification, and logical ordering. This focused decomposition balances accuracy and speed: it enables efficient, scalable processing without sacrificing precision. To train and evaluate this approach, we introduce MonkeyDoc (the most comprehensive document parsing dataset to date), with 3.9 million instances spanning over ten document types in both Chinese and English. Experiments show that MonkeyOCR outperforms MinerU by an average of 5.1%, with particularly notable improvements on challenging content such as formulas (+15.0%) and tables (+8.6%). Remarkably, our 3B-parameter model surpasses much larger and top-performing models, including Qwen2.5-VL (72B) and Gemini 2.5 Pro, achieving state-of-the-art average performance on English document parsing tasks. In addition, MonkeyOCR processes multi-page documents significantly faster (0.84 pages per second compared to 0.65 for MinerU and 0.12 for Qwen2.5-VL-7B). The 3B model can be efficiently deployed for inference on a single NVIDIA 3090 GPU. Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.
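The SRR decomposition can be sketched as a three-stage pipeline; the detection, recognition, and reading-order callables below are placeholders rather than MonkeyOCR's actual models.

```python
# Illustrative decomposition of the SRR idea: structure ("where is it?"), recognition
# ("what is it?"), and relation ("how is it organized?"). The three callables are stubs.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # x0, y0, x1, y1

@dataclass
class Region:
    box: Box
    kind: str        # e.g. "text", "table", "formula"
    content: str = ""

def parse_document(page_image,
                   detect_layout: Callable[[object], List[Region]],
                   recognize: Callable[[object, Region], str],
                   order_regions: Callable[[List[Region]], List[int]]) -> List[Region]:
    # 1) Structure: locate blocks and their types on the page.
    regions = detect_layout(page_image)
    # 2) Recognition: read the content of each block (text, table markup, LaTeX, ...).
    for r in regions:
        r.content = recognize(page_image, r)
    # 3) Relation: recover the logical reading order over the blocks.
    order = order_regions(regions)
    return [regions[i] for i in order]

# Toy usage with stubbed stages (order by vertical position).
doc = parse_document(
    page_image=None,
    detect_layout=lambda img: [Region((0, 120, 600, 200), "text"),
                               Region((0, 0, 600, 100), "text")],
    recognize=lambda img, r: f"<recognized {r.kind} at {r.box}>",
    order_regions=lambda rs: sorted(range(len(rs)), key=lambda i: rs[i].box[1]),
)
```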
PDF Download Link:
https://arxiv.org/pdf/2506.05218v1.pdf
GitHub:
• https://github.com/yuliang-liu/monkeyocr
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT