Article Title:
SymbolicAI: A framework for logic-based approaches combining generative models and solvers
Article Date: 1 Feb 2024
Article Description:
We introduce SymbolicAI, a versatile and modular framework employing a logic-based approach to concept learning and flow management in generative processes. SymbolicAI enables the seamless integration of generative models with a diverse range of solvers by treating large language models (LLMs) as semantic parsers that execute tasks based on both natural and formal language instructions, thus bridging the gap between symbolic reasoning and generative AI. We leverage probabilistic programming principles to tackle complex tasks, and utilize differentiable and classical programming paradigms with their respective strengths. The framework introduces a set of polymorphic, compositional, and self-referential operations for multi-modal data that connects multi-step generative processes and aligns their outputs with user objectives in complex workflows. As a result, we can transition between the capabilities of various foundation models with in-context learning capabilities and specialized, fine-tuned models or solvers proficient in addressing specific problems. Through these operations based on in-context learning, our framework enables the creation and evaluation of explainable computational graphs. Finally, we introduce a quality measure and its empirical score for evaluating these computational graphs, and propose a benchmark that compares various state-of-the-art LLMs across a set of complex workflows. We refer to the empirical score as the "Vector Embedding for Relational Trajectory Evaluation through Cross-similarity", or VERTEX score for short. The framework codebase and benchmark are linked below.
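To make the "LLM as semantic parser" idea concrete, here is a minimal, framework-agnostic Python sketch (not the actual symai API): a Symbol wrapper whose operations either call a stubbed generative backend or route a formal sub-expression to a deterministic solver. All names (Symbol, map, solve, llm_backend) are illustrative assumptions.

from typing import Callable

# Stub "LLM" backend: in SymbolicAI this would be a real model call;
# here it is a placeholder so the sketch runs stand-alone.
def llm_backend(instruction: str, value: str) -> str:
    return f"<LLM applied '{instruction}' to '{value}'>"

class Symbol:
    """A value wrapped with composable, instruction-driven operations."""
    def __init__(self, value: str, backend: Callable[[str, str], str] = llm_backend):
        self.value = value
        self.backend = backend

    def map(self, instruction: str) -> "Symbol":
        # Treat the LLM as a semantic parser: natural-language instruction in,
        # transformed value out. Operations compose by returning new Symbols.
        return Symbol(self.backend(instruction, self.value), self.backend)

    def solve(self, expression: str) -> "Symbol":
        # Route a formal sub-task to a deterministic solver instead of the LLM.
        return Symbol(str(eval(expression, {"__builtins__": {}})), self.backend)

# Compositional workflow: generative steps interleaved with a solver step.
result = (
    Symbol("Revenue was 120 units in Q1 and 95 in Q2.")
    .map("extract the two quantities as a Python expression that sums them")
    .solve("120 + 95")              # formal step handled by a classical solver
    .map("write one sentence summarizing the total")
)
print(result.value)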
PDF Download Link:
https://arxiv.org/pdf/2402.00854v4.pdf
GitHub:
• https://github.com/ExtensityAI/symbolicai
• https://github.com/extensityai/benchmark
• https://github.com/xpitfire/symbolicai
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Smaller But Better: Unifying Layout Generation with Smaller Large Language Models
Article Date: 19 Feb 2025
Article Description:
We propose LGGPT, an LLM-based model tailored for unified layout generation. First, we propose Arbitrary Layout Instruction (ALI) and Universal Layout Response (ULR) as the uniform I/O template. ALI accommodates arbitrary layout generation task inputs across multiple layout domains, enabling LGGPT to unify both task-generic and domain-generic layout generation hitherto unexplored. Collectively, ALI and ULR boast a succinct structure that forgoes superfluous tokens typically found in existing HTML-based formats, facilitating efficient instruction tuning and boosting unified generation performance. In addition, we propose an Interval Quantization Encoding (IQE) strategy that compresses ALI into a more condensed structure. IQE precisely preserves valid layout clues while eliminating the less informative placeholders, facilitating LGGPT to capture complex and variable layout generation conditions during the unified training process. Experimental results demonstrate that LGGPT achieves superior or on par performance compared to existing methods. Notably, LGGPT strikes a prominent balance between proficiency and efficiency with a compact 1.5B parameter LLM, which beats prior 7B or 175B models even in the most extensive and challenging unified scenario. Furthermore, we underscore the necessity of employing LLMs for unified layout generation and suggest that 1.5B could be an optimal parameter size by comparing LLMs of varying scales. Code is available at https://github.com/NiceRingNode/LGGPT.
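The abstract does not spell out the IQE encoding, so the following is a hedged sketch of one plausible reading: normalized layout boxes are snapped to a fixed grid of integer bins and serialized as short, HTML-free token strings. The bin count and box format are assumptions, not LGGPT's specification.

# Illustrative only: quantize normalized layout boxes into integer bins so they
# can be serialized as compact token sequences.
NUM_BINS = 128  # assumed resolution of the coordinate grid

def quantize_box(box, num_bins=NUM_BINS):
    """box = (x, y, w, h) with coordinates normalized to [0, 1]."""
    return tuple(min(num_bins - 1, int(round(v * (num_bins - 1)))) for v in box)

def dequantize_box(qbox, num_bins=NUM_BINS):
    return tuple(v / (num_bins - 1) for v in qbox)

def serialize(element_type, qbox):
    # Compact instruction/response string: just the category and four
    # quantized coordinates, with no HTML-style wrapper tokens.
    return f"{element_type} {' '.join(map(str, qbox))}"

box = (0.12, 0.30, 0.45, 0.08)
q = quantize_box(box)
print(serialize("title", q))   # e.g. "title 15 38 57 10"
print(dequantize_box(q))       # approximate reconstruction of the original box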
PDF Download Link:
https://arxiv.org/pdf/2502.14005v1.pdf
GitHub:
• https://github.com/niceringnode/lggpt
Datasets:
• PubLayNet
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science
Article Date: 4 Jun 2025
Article Description:
Contemporary approaches to assisted scientific discovery use language models to automatically generate large numbers of potential hypotheses to test, while also automatically generating code-based experiments to test those hypotheses. While hypotheses can be comparatively inexpensive to generate, automated experiments can be costly, particularly when run at scale (i.e. thousands of experiments). Developing the capacity to filter hypotheses based on their feasibility would allow discovery systems to run at scale, while increasing their likelihood of making significant discoveries. In this work we introduce Matter-of-Fact, a challenge dataset for determining the feasibility of hypotheses framed as claims. Matter-of-Fact includes 8.4k claims extracted from scientific articles spanning four high-impact contemporary materials science topics, including superconductors, semiconductors, batteries, and aerospace materials, while including qualitative and quantitative claims from theoretical, experimental, and code/simulation results. We show that strong baselines that include retrieval augmented generation over scientific literature and code generation fail to exceed 72% performance on this task (chance performance is 50%), while domain-expert verification suggests nearly all are solvable -- highlighting both the difficulty of this task for current models, and the potential to accelerate scientific discovery by making near-term progress.
PDF Download Link:
https://arxiv.org/pdf/2506.04410v1.pdf
GitHub:
• https://github.com/cognitiveailab/matter-of-fact
Datasets:
• COVID-Fact
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs
Article Date: 23 Aug 2023
Article Description:
Despite the progress of foundation models, knowledge-based reasoning remains a persistent challenge due to their limited capacity for knowledge recall and inference. Existing methods primarily focus on encouraging these models to plan and solve problems or extensively sample reasoning chains independently. However, these methods often overlook conceptual errors and inferential fallacies, inevitably leading to a series of notorious issues such as misleading conclusions, cognitive biases, and reduced decision quality. While explicit modeling of causality is argued to hold promise in addressing these issues, contemporary research efforts have thus far fallen short in achieving causality-based foundation models. Drawing inspiration from the orchestration of diverse specialized agents collaborating to tackle intricate tasks, we propose a framework named Causal-Consistency Chain-of-Thought (CaCo-CoT) that harnesses multi-agent collaboration to bolster the faithfulness and causality of foundation models, involving a set of reasoners and evaluators. These agents collaboratively work within a reasoning-and-consensus paradigm to improve faithfulness. The reasoners are tasked with generating reasoning chains for knowledge-intensive problems by mimicking human causal reasoning. Meanwhile, the evaluator scrutinizes the causal consistency of a reasoner's reasoning chain from a non-causal and a counterfactual perspective. Our framework demonstrates significant superiority over state-of-the-art methods through extensive and comprehensive evaluations across text-based and multi-modal knowledge reasoning tasks (e.g., science question answering and commonsense reasoning).
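A minimal sketch of the reasoning-and-consensus control flow described above, with toy stand-ins for the LLM reasoners and the causal-consistency evaluator; the prompts, consistency checks, and voting details are assumptions rather than the paper's implementation.

import random
from collections import Counter

# Toy stand-ins for LLM agents; in CaCo-CoT these would be prompted models.
def reasoner(question: str, seed: int) -> dict:
    random.seed(seed)
    answer = random.choice(["A", "B"])
    return {"chain": f"reasoning chain #{seed} for '{question}'", "answer": answer}

def evaluator(chain: str) -> bool:
    # Placeholder for the non-causal and counterfactual scrutiny in the paper;
    # here every chain is accepted so the sketch runs end to end.
    return True

def caco_cot(question: str, n_reasoners: int = 3) -> str:
    candidates = [reasoner(question, s) for s in range(n_reasoners)]
    accepted = [c["answer"] for c in candidates if evaluator(c["chain"])]
    # Reasoning-and-consensus: majority vote over causally consistent answers.
    return Counter(accepted).most_common(1)[0][0] if accepted else "abstain"

print(caco_cot("Does water boil at a lower temperature at high altitude?"))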
PDF Download Link:
https://arxiv.org/pdf/2308.11914v4.pdf
GitHub:
• https://github.com/hcplab-sysu/causalvlr
• https://github.com/hcplab-sysu/causal-vlreasoning
Datasets:
• BoolQ
• ScienceQA
• Com2Sense
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
🔹 Publication Date: Published on Jun 24
🔹 Abstract:
Outlier-Safe Pre-Training improves large language model quantization performance by preventing extreme activation outliers through innovative training techniques. (AI-generated summary) Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
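The excess-kurtosis numbers quoted above are a well-defined statistic; the snippet below shows how that diagnostic is computed and why a handful of extreme activation outliers blow it up (synthetic data, not the paper's checkpoints).

import numpy as np

def excess_kurtosis(activations: np.ndarray) -> float:
    """Fisher (excess) kurtosis: E[(x - mu)^4] / sigma^4 - 3 (0 for a Gaussian)."""
    x = activations.reshape(-1).astype(np.float64)
    mu, sigma = x.mean(), x.std()
    return float(((x - mu) ** 4).mean() / sigma**4 - 3.0)

# Gaussian activations -> near-zero excess kurtosis (outlier-free regime).
well_behaved = np.random.randn(100_000)
# The same distribution with a few extreme values -> huge kurtosis, the
# pattern the paper attributes to standard Adam-style training.
outlier_ridden = np.concatenate([well_behaved, np.full(20, 300.0)])

print(f"well-behaved:  {excess_kurtosis(well_behaved):8.2f}")
print(f"with outliers: {excess_kurtosis(outlier_ridden):8.2f}")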
🔹 Links:
• arXiv Page: https://arxivexplained.com/papers/outlier-safe-pre-training-for-robust-4-bit-quantization-of-large-language-models
• PDF: https://arxiv.org/pdf/2506.19697
• Project Page: https://huggingface.co/papers?q=learnable%20embedding%20projection
• Github: https://github.com/dmis-lab/Outlier-Safe-Pre-Training
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games
Article Date: 5 Jun 2025
Article Description:
LLMs are used predominantly in synchronous communication, where a human user and a model communicate in alternating turns. In contrast, many real-world settings are inherently asynchronous. For example, in group chats, online team meetings, or social games, there is no inherent notion of turns; therefore, the decision of when to speak forms a crucial part of the participant's decision making. In this work, we develop an adaptive asynchronous LLM-agent which, in addition to determining what to say, also decides when to say it. To evaluate our agent, we collect a unique dataset of online Mafia games, including both human participants, as well as our asynchronous agent. Overall, our agent performs on par with human players, both in game performance, as well as in its ability to blend in with the other human players. Our analysis shows that the agent's behavior in deciding when to speak closely mirrors human patterns, although differences emerge in message content. We release all our data and code to support and encourage further research for more realistic asynchronous communication between LLM agents. This work paves the way for integration of LLMs into realistic human group settings, from assistance in team discussions to educational and professional environments where complex social dynamics must be navigated.
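A minimal sketch of the "decide when to speak" loop: the agent polls the conversation asynchronously and only emits a message once an urgency score crosses a threshold. The scoring function, threshold, and message content are placeholders, not the paper's learned components.

import asyncio
import random

# Toy urgency model: in the paper the timing decision is learned; a random
# score stands in here so the control flow is visible and runnable.
def urgency(chat_history: list[str]) -> float:
    return random.random()

async def async_agent(chat_history: list[str], threshold: float = 0.7, poll_s: float = 0.2):
    """Poll the conversation and decide *when* to speak, not just what to say."""
    for _ in range(10):                      # bounded loop for the demo
        if urgency(chat_history) > threshold:
            message = "I think the quiet player is suspicious."  # placeholder content model
            chat_history.append(f"agent: {message}")
            return message
        await asyncio.sleep(poll_s)          # stay silent; other players keep talking
    return None                              # chose not to speak this round

history = ["alice: who do we vote out?", "bob: no idea yet"]
print(asyncio.run(async_agent(history)))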
PDF Download Link:
https://arxiv.org/pdf/2506.05309v1.pdf
GitHub:
• https://github.com/niveck/LLMafia
Datasets:
• LLMafia
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
Article Date: 28 May 2025
Article Description:
Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and appealing visual quality videos. However, existing methods primarily focus on single human animation and struggle with multi-stream audio inputs, facing incorrect binding problems between audio and persons. Additionally, they exhibit limitations in instruction-following capabilities. To solve this problem, in this paper, we propose a novel task: Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to address the challenges during multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio and person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk achieves superior performance compared to other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the powerful generation capabilities of our approach.
PDF Download Link:
https://arxiv.org/pdf/2505.22647v1.pdf
GitHub:
• https://github.com/meigen-ai/multitalk
Datasets:
• CelebV-HQ
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
SAFE: Multitask Failure Detection for Vision-Language-Action Models
🔹 Publication Date: Published on Jun 11
🔹 Abstract:
SAFE is a failure detector for vision-language-action models that generalizes to unseen tasks by learning from high-level internal features of the models. (AI-generated summary) While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out-of-the-box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts, and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it on OpenVLA, pi_0, and pi_0-FAST in both simulated and real-world environments extensively. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results can be found at https://vla-safe.github.io/.
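As a concrete illustration of the conformal-prediction piece, the sketch below calibrates a flagging threshold on scalar failure scores from successful rollouts using a standard split-conformal quantile; the scores are synthetic and the recipe is a textbook version, not necessarily SAFE's exact procedure.

import numpy as np

def conformal_threshold(calibration_scores: np.ndarray, alpha: float = 0.1) -> float:
    """Split-conformal threshold on failure scores from *successful* calibration
    rollouts: flagging scores above it keeps the false-alarm rate near alpha."""
    n = len(calibration_scores)
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)   # finite-sample corrected quantile
    return float(np.quantile(calibration_scores, q, method="higher"))

# Hypothetical scalar failure scores (the quantity SAFE predicts from VLA features).
rng = np.random.default_rng(0)
calib_success = rng.normal(0.2, 0.1, size=200)     # successful rollouts score low
test_scores = np.array([0.15, 0.31, 0.72, 0.95])   # mix of likely success / failure

tau = conformal_threshold(calib_success, alpha=0.1)
print("threshold:", round(tau, 3))
print("flag failure:", test_scores > tau)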
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09937
• PDF: https://arxiv.org/pdf/2506.09937
• Github: https://vla-safe.github.io/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
DLNet: Direction-Aware Feature Integration for Robust Lane Detection in Complex Environments (Zhaoxuan Lu)
Article Date: Not Available
Article Description:
The rapid advancement of autonomous driving systems has created a pressing need for accurate and robust lane detection to ensure driving safety and reliability. However, lane detection still faces several critical challenges in real-world scenarios: (1) severe occlusions caused by urban traffic and complex road layouts; (2) the difficulty of handling sharp curves and large curvature variations; and (3) varying lighting conditions that blur or degrade lane markings. To address these challenges, we propose DLNet, a novel direction-aware feature integration framework that integrates both low-level geometric details and high-level semantic cues. In particular, the approach includes:
(i) a Multi-Skip Feature Attention Block (MSFAB) to refine local lane features by adaptively fusing multi-scale representations,
(ii) a Context-Aware Feature Pyramid Network (CAFPN) to enhance global context modeling under adverse conditions, and
(iii) a Directional Lane IoU (DLIoU) loss function that explicitly encodes lane directionality and curvature, providing more accurate lane overlap estimation. Extensive experiments conducted on two benchmark datasets, CULane and CurveLanes, show DLNet achieves new state-of-the-art results, with F1@50 and F1@75 scores of 81.23% and 64.75% on CULane, an F1@50 score of 86.51% on CurveLanes, and a high F1 score of 97.62 on the TUSimple dataset. The source code and pretrained models will be made publicly available at https://github.com/RDXiaoLu/DLNet.git.
PDF Download Link:
Not Available
GitHub:
• https://github.com/RDXiaoLu/DLNet
• https://github.com/RDXiaoLu/DLNet.git
• https://github.com/RDXiaoLu/DLNet/tree/main
Datasets:
• CULane
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding
Article Date: 9 Jun 2025
Article Description:
Large Language Models (LLMs) require alignment with human preferences to avoid generating offensive, false, or meaningless content. Recently, low-resource methods for LLM alignment have been popular, while still facing challenges in obtaining both high-quality and aligned content. Motivated by the observation that the difficulty of generating aligned responses is concentrated at the beginning of decoding, we propose a novel framework, Weak-to-Strong Decoding (WSD), to enhance the alignment ability of base models by the guidance of a small aligned model. The small model first drafts well-aligned beginnings, followed by the large base model to continue the rest, controlled by a well-designed auto-switch mechanism. We also collect a new dataset, GenerAlign, to fine-tune a small-sized Pilot-3B as the draft model, which effectively enhances different base models under the WSD framework to outperform all baseline methods, while avoiding degradation on downstream tasks, termed as the alignment tax. Extensive experiments are further conducted to examine the impact of different settings and time efficiency, as well as analyses on the intrinsic mechanisms of WSD in depth.
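A control-flow-only sketch of Weak-to-Strong Decoding under stated assumptions: a small aligned drafter produces the opening tokens, a fixed token budget stands in for the paper's auto-switch mechanism, and the base model continues from there. Both models are stubs so the example runs stand-alone.

# Minimal sketch: the drafter writes the aligned beginning, the base model
# finishes the response. The switch rule here (a token budget) is an assumption,
# not the paper's learned/designed auto-switch mechanism.

def draft_model_step(prefix: list[str]) -> str:
    return ["Sure,", "here", "is", "a", "safe,", "helpful", "answer:"][len(prefix)]

def base_model_step(prefix: list[str]) -> str:
    return f"<base-token-{len(prefix)}>"

def should_switch(prefix: list[str], max_draft_tokens: int = 7) -> bool:
    # Assumed switch rule: hand over once the aligned opening is in place.
    return len(prefix) >= max_draft_tokens

def weak_to_strong_decode(max_new_tokens: int = 12) -> str:
    tokens: list[str] = []
    while len(tokens) < max_new_tokens:
        step = base_model_step if should_switch(tokens) else draft_model_step
        tokens.append(step(tokens))
    return " ".join(tokens)

print(weak_to_strong_decode())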
PDF Download Link:
https://arxiv.org/pdf/2506.07434v1.pdf
GitHub:
• https://github.com/F2-Song/Weak-to-Strong-Decoding
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
Article Date: 30 Sep 2022
Article Description:
Recent Vision-Language Pre-trained (VLP) models based on dual encoder have attracted extensive attention from academia and industry due to their superior performance on various cross-modal tasks and high computational efficiency. They attempt to learn cross-modal representation using contrastive learning on image-text pairs, however, the built inter-modal correlations only rely on a single view for each modality. Actually, an image or a text contains various potential views, just as humans could capture a real-world scene via diverse descriptions or photos. In this paper, we propose ERNIE-ViL 2.0, a Multi-View Contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously, aiming at learning a more robust cross-modal representation. Specifically, we construct multiple views within each modality to learn the intra-modal correlation for enhancing the single-modal representation. Besides the inherent visual/textual views, we construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs. Pre-trained with 29M publicly available datasets, ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval. Additionally, to generalize our method to Chinese cross-modal tasks, we train ERNIE-ViL 2.0 through scaling up the pre-training datasets to 1.5B Chinese image-text pairs, resulting in significant improvements compared to previous SOTA results on Chinese cross-modal retrieval. We release our pre-trained models in https://github.com/PaddlePaddle/ERNIE.
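A hedged sketch of the multi-view contrastive objective: a standard symmetric InfoNCE loss applied across visual and textual views (caption plus object-tag sequence). The exact view pairs, temperature, and weighting used by ERNIE-ViL 2.0 are assumptions here.

import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of view embeddings (rows are paired)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy multi-view batch: two visual views and two textual views per pair
# (caption + object-tag sequence), mirroring the abstract's description.
B, D = 8, 256
img_v1, img_v2 = torch.randn(B, D), torch.randn(B, D)
txt_cap, txt_tags = torch.randn(B, D), torch.randn(B, D)

# Sum inter-modal and intra-modal terms; equal weighting is an assumption.
loss = (info_nce(img_v1, txt_cap) + info_nce(img_v1, txt_tags)
        + info_nce(img_v1, img_v2) + info_nce(txt_cap, txt_tags))
print(float(loss))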
PDF Download Link:
https://arxiv.org/pdf/2209.15270v1.pdf
GitHub:
• https://github.com/PaddlePaddle/ERNIE
Datasets:
• COCO (Common Objects in Context)
• Flickr30k
• CC12M
• COCO-CN
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments
Article Date: 1 Feb 2024
Article Description:
With the surge in the development of large language models, embodied intelligence has attracted increasing attention. Nevertheless, prior works on embodied intelligence typically encode scene or historical memory in an unimodal manner, either visual or linguistic, which complicates the alignment of the model's action planning with embodied control. To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions. Specifically, we propose a novel Multimodal Environment Memory (MEM) module, facilitating the integration of embodied control with large models through the visual-language memory of scenes. This capability enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities. Furthermore, we construct an embodied question answering dataset based on a dynamic virtual cafe environment with the help of the large language model. In this virtual environment, we conduct several experiments, utilizing multiple large models through zero-shot learning, and carefully design scenarios for various situations. The experimental results showcase the promising performance of our MEIA in various embodied interactive tasks.
PDF Download Link:
https://arxiv.org/pdf/2402.00290v3.pdf
GitHub:
• https://github.com/hcplab-sysu/causalvlr
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
SkillBlender: Towards Versatile Humanoid Whole-Body Loco-Manipulation via Skill Blending
🔹 Publication Date: Published on Jun 11
🔹 Abstract:
SkillBlender is a hierarchical reinforcement learning framework that uses pretrained primitive skills to efficiently solve diverse loco-manipulation tasks for humanoid robots. (AI-generated summary) Humanoid robots hold significant potential in accomplishing daily tasks across diverse environments thanks to their flexibility and human-like morphology. Recent works have made significant progress in humanoid whole-body control and loco-manipulation leveraging optimal control or reinforcement learning. However, these methods require tedious task-specific tuning for each task to achieve satisfactory behaviors, limiting their versatility and scalability to diverse tasks in daily scenarios. To that end, we introduce SkillBlender, a novel hierarchical reinforcement learning framework for versatile humanoid loco-manipulation. SkillBlender first pretrains goal-conditioned task-agnostic primitive skills, and then dynamically blends these skills to accomplish complex loco-manipulation tasks with minimal task-specific reward engineering. We also introduce SkillBench, a parallel, cross-embodiment, and diverse simulated benchmark containing three embodiments, four primitive skills, and eight challenging loco-manipulation tasks, accompanied by a set of scientific evaluation metrics balancing accuracy and feasibility. Extensive simulated experiments show that our method significantly outperforms all baselines, while naturally regularizing behaviors to avoid reward hacking, resulting in more accurate and feasible movements for diverse loco-manipulation tasks in our daily scenarios. Our code and benchmark will be open-sourced to the community to facilitate future research. Project page: https://usc-gvl.github.io/SkillBlender-web/.
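A toy sketch of the skill-blending idea under stated assumptions: a high-level policy outputs softmax weights over pretrained primitive skills, and the low-level action is their weighted combination. The real skills are goal-conditioned RL policies and the blending head is learned; linear maps and a random head stand in so the code runs.

import numpy as np

# Toy pretrained primitive skills mapping observations to joint-space actions.
rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, N_SKILLS = 16, 12, 4
skills = [rng.normal(size=(ACT_DIM, OBS_DIM)) * 0.1 for _ in range(N_SKILLS)]

def high_level_policy(obs: np.ndarray) -> np.ndarray:
    """Assumed blending head: produce a softmax weight per primitive skill."""
    logits = rng.normal(size=N_SKILLS)          # placeholder for a learned network
    w = np.exp(logits - logits.max())
    return w / w.sum()

def blended_action(obs: np.ndarray) -> np.ndarray:
    weights = high_level_policy(obs)
    # Dynamically blend primitive-skill actions instead of hand-tuning per task.
    return sum(w * (W @ obs) for w, W in zip(weights, skills))

obs = rng.normal(size=OBS_DIM)
print(blended_action(obs).shape)   # (12,) joint-space command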
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09366
• PDF: https://arxiv.org/pdf/2506.09366
• Project Page: https://usc-gvl.github.io/SkillBlender-web/
• Github: https://usc-gvl.github.io/SkillBlender-web/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
Article Date: 5 Jun 2025
Article Description:
We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline (as in MinerU's modular approach) and avoids the inefficiencies of processing full pages with giant end-to-end models (e.g., large multimodal LLMs like Qwen-VL). In SRR, document parsing is abstracted into three fundamental questions - "Where is it?" (structure), "What is it?" (recognition), and "How is it organized?" (relation) - corresponding to layout analysis, content identification, and logical ordering. This focused decomposition balances accuracy and speed: it enables efficient, scalable processing without sacrificing precision. To train and evaluate this approach, we introduce the MonkeyDoc (the most comprehensive document parsing dataset to date), with 3.9 million instances spanning over ten document types in both Chinese and English. Experiments show that MonkeyOCR outperforms MinerU by an average of 5.1%, with particularly notable improvements on challenging content such as formulas (+15.0%) and tables (+8.6%). Remarkably, our 3B-parameter model surpasses much larger and top-performing models, including Qwen2.5-VL (72B) and Gemini 2.5 Pro, achieving state-of-the-art average performance on English document parsing tasks. In addition, MonkeyOCR processes multi-page documents significantly faster (0.84 pages per second compared to 0.65 for MinerU and 0.12 for Qwen2.5-VL-7B). The 3B model can be efficiently deployed for inference on a single NVIDIA 3090 GPU. Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.
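The SRR decomposition maps naturally onto a three-stage pipeline; the sketch below wires stubbed structure, recognition, and relation stages together to show the data flow. The Region format and stage signatures are assumptions, not MonkeyOCR's actual interfaces.

from dataclasses import dataclass

# Stub stages for the Structure-Recognition-Relation decomposition; each would
# be a dedicated model component in MonkeyOCR.

@dataclass
class Region:
    kind: str            # "title", "text", "table", "formula", ...
    bbox: tuple          # (x0, y0, x1, y1) in page coordinates
    content: str = ""

def detect_structure(page_image) -> list[Region]:           # "Where is it?"
    return [Region("title", (50, 40, 550, 80)),
            Region("text", (50, 100, 550, 400)),
            Region("table", (50, 420, 550, 700))]

def recognize(region: Region, page_image) -> Region:        # "What is it?"
    region.content = f"<recognized {region.kind}>"
    return region

def order_relations(regions: list[Region]) -> list[Region]: # "How is it organized?"
    return sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))  # reading-order proxy

def parse_document(page_image):
    regions = [recognize(r, page_image) for r in detect_structure(page_image)]
    return [(r.kind, r.content) for r in order_relations(regions)]

print(parse_document(page_image=None))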
PDF Download Link:
https://arxiv.org/pdf/2506.05218v1.pdf
GitHub:
• https://github.com/yuliang-liu/monkeyocr
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
TradingAgents: Multi-Agents LLM Financial Trading Framework
Article Date: 28 Dec 2024
Article Description:
Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs). In finance, efforts have largely focused on single-agent systems handling specific tasks or multi-agent frameworks independently gathering data. However, the multi-agent systems' potential to replicate real-world trading firms' collaborative dynamics remains underexplored. TradingAgents proposes a novel stock trading framework inspired by trading firms, featuring LLM-powered agents in specialized roles such as fundamental analysts, sentiment analysts, technical analysts, and traders with varied risk profiles. The framework includes Bull and Bear researcher agents assessing market conditions, a risk management team monitoring exposure, and traders synthesizing insights from debates and historical data to make informed decisions. By simulating a dynamic, collaborative trading environment, this framework aims to improve trading performance. Detailed architecture and extensive experiments reveal its superiority over baseline models, with notable improvements in cumulative returns, Sharpe ratio, and maximum drawdown, highlighting the potential of multi-agent LLM frameworks in financial trading. TradingAgents is available at https://github.com/TauricResearch/TradingAgents.
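The reported metrics (cumulative return, Sharpe ratio, maximum drawdown) are standard and easy to reproduce on any return series; the snippet below computes them on synthetic daily returns in place of an actual TradingAgents backtest.

import numpy as np

def sharpe_ratio(returns: np.ndarray, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of per-period simple returns (risk-free rate ~ 0)."""
    return float(returns.mean() / returns.std(ddof=1) * np.sqrt(periods_per_year))

def max_drawdown(returns: np.ndarray) -> float:
    """Largest peak-to-trough drop of the cumulative equity curve (as a fraction)."""
    equity = np.cumprod(1.0 + returns)
    peaks = np.maximum.accumulate(equity)
    return float((1.0 - equity / peaks).max())

# Synthetic daily returns standing in for a real backtest.
rng = np.random.default_rng(42)
daily = rng.normal(0.0005, 0.01, size=252)

print(f"cumulative return: {np.prod(1 + daily) - 1:.2%}")
print(f"Sharpe ratio:      {sharpe_ratio(daily):.2f}")
print(f"max drawdown:      {max_drawdown(daily):.2%}")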
PDF Download Link:
https://arxiv.org/pdf/2412.20138v7.pdf
GitHub:
• https://github.com/tauricresearch/tradingagents
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Generalized Few-Shot Semantic Segmentation: All You Need is Fine-Tuning
🔹 Publication Date: Published on Dec 21, 2021
🔹 Abstract:
A fine-tuning solution for generalized few-shot semantic segmentation improves performance beyond meta-learning by addressing saturation and minimizing the performance gap between novel and base categories. AI-generated summary: Generalized few-shot semantic segmentation was introduced to move beyond only evaluating few-shot segmentation models on novel classes to include testing their ability to remember base classes. While the current state-of-the-art approach is based on meta-learning, it performs poorly and saturates in learning after observing only a few shots. We propose the first fine-tuning solution, and demonstrate that it addresses the saturation problem while achieving state-of-the-art results on two datasets, PASCAL-5i and COCO-20i. We also show that it outperforms existing methods, whether fine-tuning multiple final layers or only the final layer. Finally, we present a triplet loss regularization that shows how to redistribute the balance of performance between novel and base categories so that there is a smaller gap between them.
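A small PyTorch sketch of the recipe the abstract describes, namely freezing the backbone, fine-tuning only the final layer, and adding a triplet-loss regularizer; the toy encoder, class count, loss weight, and triplet sampling are assumptions, not the authors' implementation.

```python
# Hedged sketch: fine-tune only the final layer with a triplet-loss regularizer.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())  # stand-in encoder
classifier = nn.Conv2d(16, 21, 1)                                    # base + novel classes

for p in backbone.parameters():          # freeze everything except the final layer
    p.requires_grad = False

ce_loss = nn.CrossEntropyLoss(ignore_index=255)
triplet = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.SGD(classifier.parameters(), lr=1e-2)

def step(image, mask, anchor_emb, pos_emb, neg_emb, lam=0.1):
    logits = classifier(backbone(image))
    # Triplet regularization pushes novel-class embeddings away from base-class
    # ones, shrinking the novel/base performance gap.
    loss = ce_loss(logits, mask) + lam * triplet(anchor_emb, pos_emb, neg_emb)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random tensors (shapes only; not real PASCAL-5i data).
img = torch.randn(2, 3, 64, 64)
msk = torch.randint(0, 21, (2, 64, 64))
a, p, n = (torch.randn(4, 16, requires_grad=True) for _ in range(3))
print(step(img, msk, a, p, n))
```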
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2112.10982
• PDF: https://arxiv.org/pdf/2112.10982
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning
🔹 Publication Date: Published on Jul 10
🔹 Abstract:
RLEP, a reinforcement learning framework with experience replay, enhances large language model training by focusing on high-quality examples, leading to faster convergence and improved performance on math-related benchmarks. AI-generated summary: Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
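The replay idea lends itself to a short sketch: keep a buffer of verified trajectories and blend them with fresh rollouts in each mini-batch. The buffer policy, mixing ratio, reward threshold, and stubbed policy update below are assumptions, not the released RLEP code.

```python
# Hedged sketch of experience replay for LLM RL: blend replayed verified
# successes with newly generated rollouts at every update step.
import random
from collections import deque

replay_buffer = deque(maxlen=10_000)   # pool of verified successful trajectories

def collect_rollouts(policy, prompts, n=8):
    # Placeholder: a real system would sample n responses per prompt from the
    # policy and score them with a verifier; rewards here are random stand-ins.
    return [{"prompt": p, "response": f"rollout-{i}", "reward": random.random()}
            for p in prompts for i in range(n)]

def update_step(policy, prompts, replay_ratio=0.5, batch_size=16):
    fresh = collect_rollouts(policy, prompts)
    # Store verified successes so later updates can replay them.
    replay_buffer.extend(t for t in fresh if t["reward"] > 0.9)
    k = min(int(batch_size * replay_ratio), len(replay_buffer))
    batch = random.sample(fresh, batch_size - k) + random.sample(list(replay_buffer), k)
    random.shuffle(batch)
    # Placeholder for the actual policy-gradient update (e.g. a PPO/GRPO step) on `batch`.
    return len(batch), k

total, replayed = update_step(policy=None, prompts=["solve 2+2", "solve 3*7"])
print(f"mini-batch of {total} samples, {replayed} replayed successes")
```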
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.07451
• PDF: https://arxiv.org/pdf/2507.07451
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/Kwai-Klear/RLEP_dataset
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior Understanding
🔹 Publication Date: Published on Jul 16
🔹 Abstract:
A large-scale benchmark, MMHU, is proposed for human behavior analysis in autonomous driving, featuring rich annotations and diverse data sources, and benchmarking multiple tasks including motion prediction and behavior question answering. AI-generated summary: Humans are integral components of the transportation ecosystem, and understanding their behaviors is crucial to facilitating the development of safe driving systems. Although recent progress has explored various aspects of human behavior, such as motion, trajectories, and intention, a comprehensive benchmark for evaluating human behavior understanding in autonomous driving remains unavailable. In this work, we propose MMHU, a large-scale benchmark for human behavior analysis featuring rich annotations, such as human motion and trajectories, text description for human motions, human intention, and critical behavior labels relevant to driving safety. Our dataset encompasses 57k human motion clips and 1.73M frames gathered from diverse sources, including established driving datasets such as Waymo, in-the-wild videos from YouTube, and self-collected data. A human-in-the-loop annotation pipeline is developed to generate rich behavior captions. We provide a thorough dataset analysis and benchmark multiple tasks, ranging from motion prediction to motion generation and human behavior question answering, thereby offering a broad evaluation suite. Project page: https://MMHU-Benchmark.github.io.
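As a hedged illustration of how one of the benchmarked tasks (motion prediction) might be scored, the snippet below evaluates a constant-velocity baseline with average displacement error; the record schema and metric choice are assumptions for illustration, not the released MMHU format.

```python
# Hedged sketch: score a toy motion-prediction baseline with ADE.
import numpy as np

record = {  # hypothetical annotation record, not the MMHU schema
    "trajectory": np.cumsum(np.random.randn(30, 2), axis=0),  # 30 observed + future 2D steps
    "caption": "pedestrian slows down near the crosswalk",
    "intention": "cross the road",
    "critical_behavior": True,
}

def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average displacement error between predicted and ground-truth future points."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

observed, future = record["trajectory"][:20], record["trajectory"][20:]
# Constant-velocity baseline: extrapolate the last observed step.
baseline_pred = observed[-1] + np.arange(1, 11)[:, None] * (observed[-1] - observed[-2])
print("constant-velocity baseline ADE:", round(ade(baseline_pred, future), 3))
```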
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.12463
• PDF: https://arxiv.org/pdf/2507.12463
• Project Page: https://mmhu-benchmark.github.io/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Embedding Atlas: Low-Friction, Interactive Embedding Visualization
Article Date: 9 May 2025
Article Description:
Embedding projections are popular for visualizing large datasets and models. However, people often encounter "friction" when using embedding visualization tools: (1) barriers to adoption, e.g., tedious data wrangling and loading, scalability limits, no integration of results into existing workflows, and (2) limitations in possible analyses, without integration with external tools to additionally show coordinated views of metadata. In this paper, we present Embedding Atlas, a scalable, interactive visualization tool designed to make interacting with large embeddings as easy as possible. Embedding Atlas uses modern web technologies and advanced algorithms -- including density-based clustering, and automated labeling -- to provide a fast and rich data analysis experience at scale. We evaluate Embedding Atlas with a competitive analysis against other popular embedding tools, showing that Embedding Atlas's feature set specifically helps reduce friction, and report a benchmark on its real-time rendering performance with millions of points. Embedding Atlas is available as open source to support future work in embedding-based analysis.
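The snippet below sketches the generic workflow such a tool automates (2D projection, density-based clustering, automated labeling) using scikit-learn; it is an illustration of the idea only, not the Embedding Atlas API or its implementation.

```python
# Hedged sketch: project embeddings, cluster by density, and auto-label clusters.
import numpy as np
from collections import Counter
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

texts = ["cat photo", "dog photo", "cat video", "stock prices", "market crash", "bond yields"]
emb = np.random.RandomState(0).randn(len(texts), 32)     # stand-in for real embeddings

xy = PCA(n_components=2).fit_transform(emb)              # 2D projection for plotting
labels = DBSCAN(eps=3.0, min_samples=2).fit_predict(xy)  # density-based clusters

def auto_label(cluster_texts):
    # Crude automated labeling: most common token among cluster members.
    tokens = Counter(w for t in cluster_texts for w in t.split())
    return tokens.most_common(1)[0][0] if tokens else "unlabeled"

for c in sorted(set(labels)):
    members = [t for t, l in zip(texts, labels) if l == c]
    name = "noise" if c == -1 else auto_label(members)
    print(f"cluster {c} ({name}): {members}")
```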
PDF Download Link:
https://arxiv.org/pdf/2505.06386v1.pdf
GitHub:
• https://github.com/apple/embedding-atlas
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
OmniGen2: Exploration to Advanced Multimodal Generation
Article Date: 23 Jun 2025
Article Description:
In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
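A hedged architectural skeleton of the dual-pathway idea follows, with unshared text and image decoders and a decoupled discrete image tokenizer; all dimensions, module choices, and routing logic are placeholders, not the OmniGen2 architecture.

```python
# Hedged skeleton: one understanding trunk, two unshared decoding pathways,
# and a decoupled image tokenizer for the image path.
import torch
import torch.nn as nn

class DualPathwayGenerator(nn.Module):
    def __init__(self, d=256, text_vocab=32000, image_codebook=8192):
        super().__init__()
        self.understanding = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2)
        self.image_tokenizer = nn.Embedding(image_codebook, d)  # decoupled discrete image tokens
        self.text_decoder = nn.Linear(d, text_vocab)             # unshared parameters
        self.image_decoder = nn.Linear(d, image_codebook)        # unshared parameters

    def embed_image_tokens(self, token_ids):
        # Map discrete image tokens to the shared hidden dimension.
        return self.image_tokenizer(token_ids)

    def forward(self, hidden, modality="text"):
        h = self.understanding(hidden)
        return self.text_decoder(h) if modality == "text" else self.image_decoder(h)

model = DualPathwayGenerator()
text_hidden = torch.randn(1, 16, 256)                             # stand-in hidden states
image_hidden = model.embed_image_tokens(torch.randint(0, 8192, (1, 16)))
print(model(text_hidden, "text").shape, model(image_hidden, "image").shape)
```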
PDF Download Link:
https://arxiv.org/pdf/2506.18871v2.pdf
GitHub:
• https://github.com/vectorspacelab/omnigen2
Datasets:
• MM-Vet
• GenEval
• MagicBrush
• ImgEdit
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT