🔹 Title:
Learning to Skip the Middle Layers of Transformers
🔹 Publication Date: Published on Jun 26
🔹 Abstract:
A novel conditional computation architecture for Transformers dynamically skips middle layers based on the input via a gating mechanism, but does not outperform dense baselines in computational cost or validation performance. AI-generated summary: Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme, and gate sparsity with an adaptive regularization loss. We aimed to reduce compute requirements for 'simpler' tokens and potentially foster an emergent multi-level representational hierarchy but, at the scales investigated, our approach does not achieve improvements in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at https://github.com/tim-lawson/skip-middle.
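To make the gating concrete, here is a minimal PyTorch sketch of middle-outward skipping (my own illustration, not the released code): per-token gates decide how many symmetric central blocks to bypass, with nesting enforced so the skipped layers always form one contiguous middle span. The gated attention that masks skipped positions is omitted for brevity.

```python
import torch
import torch.nn as nn

class SkipMiddleStack(nn.Module):
    def __init__(self, d_model: int = 256, n_layers: int = 6, n_heads: int = 8):
        super().__init__()
        assert n_layers % 2 == 0, "even depth keeps the skipped span symmetric"
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.n_rings = n_layers // 2      # ring 0 = the innermost pair of blocks
        self.gate = nn.Linear(d_model, self.n_rings)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))   # (batch, seq, n_rings), soft gates
        # Nesting: an inner ring may only run if every ring outside it runs,
        # so the skipped layers always form one contiguous middle span.
        keep = torch.cumprod(g.flip(-1), dim=-1).flip(-1)
        n = len(self.blocks)
        for i, block in enumerate(self.blocks):
            ring = max(i, n - 1 - i) - self.n_rings   # distance from the middle
            k = keep[..., ring].unsqueeze(-1)          # per-token keep probability
            # Soft residual bypass: during training all blocks execute; real
            # compute savings require hard 0/1 gates at inference.
            x = k * block(x) + (1.0 - k) * x
        return x

h = SkipMiddleStack()(torch.randn(2, 16, 256))  # tokens with gates near 0 pass through
```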
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.21103
• PDF: https://arxiv.org/pdf/2506.21103
• Github: https://github.com/tim-lawson/skip-middle
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
🔹 Title:
Radii, masses, and transit-timing variations of the three-planet system orbiting the naked-eye star TOI-396
🔹 Publication Date: Published on Nov 22, 2024
🔹 Abstract:
Observations of TOI-396 reveal three similar-sized planets with the outermost being the densest, and indicate that the inner two planets are close to but not in a 5:3 MMR, with significant TTVs detected. AI-generated summary: TOI-396 is an F6V star ($V \approx 6.4$) orbited by three transiting planets. The orbital periods of the two innermost planets are close to the 5:3 commensurability ($P_b \sim 3.6$ d and $P_c \sim 6.0$ d). To measure the masses of the three planets, refine their radii, and investigate whether planets b and c are in MMR, we carried out HARPS RV observations and retrieved photometric data from TESS. We extracted the RVs via a skew-normal fit onto the HARPS CCFs and performed an MCMC joint analysis of the Doppler measurements and transit photometry, while employing the breakpoint method to remove stellar activity from the RV time series. We also performed a thorough TTV dynamical analysis of the system. Our analysis confirms that the three planets have similar sizes: $R_b = 2.004_{-0.047}^{+0.045}\,R_\oplus$, $R_c = 1.979_{-0.051}^{+0.054}\,R_\oplus$, and $R_d = 2.001_{-0.064}^{+0.063}\,R_\oplus$. For the first time, we have determined the RV masses for TOI-396 b and d: $M_b = 3.55_{-0.96}^{+0.94}\,M_\oplus$ ($\rho_b = 2.44_{-0.68}^{+0.69}$ g cm$^{-3}$) and $M_d = 7.1 \pm 1.6\,M_\oplus$ ($\rho_d = 4.9_{-1.1}^{+1.2}$ g cm$^{-3}$). Our results suggest a quite unusual system architecture, with the outermost planet being the densest. The Doppler reflex motion induced by TOI-396 c remains undetected in our RV time series, likely due to the proximity of $P_c$ to the star's rotation period ($P_\mathrm{rot} = 6.7 \pm 1.3$ d). We also discovered that TOI-396 b and c display significant TTVs. While the TTV dynamical analysis returns a formally precise mass for TOI-396 c ($M_{c,\mathrm{dyn}} = 2.24^{+0.13}_{-0.67}\,M_\oplus$), the result might not be accurate owing to the poor sampling of the TTV phase. We also conclude that TOI-396 b and c are close to but out of the 5:3 MMR. Our numerical simulation suggests TTV semi-amplitudes of up to 5 hours over a temporal baseline of $\sim$5.2 years.
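As a sanity check on the quoted densities (my own arithmetic, not from the paper), the bulk density implied by each mass and radius can be recomputed directly:

```python
import math

M_EARTH_G = 5.972e27   # Earth mass in grams
R_EARTH_CM = 6.371e8   # Earth radius in centimeters

def bulk_density(mass_me: float, radius_re: float) -> float:
    """Mean density in g/cm^3 from a mass in Earth masses and a radius in Earth radii."""
    volume = (4.0 / 3.0) * math.pi * (radius_re * R_EARTH_CM) ** 3
    return mass_me * M_EARTH_G / volume

print(f"rho_b ~ {bulk_density(3.55, 2.004):.2f} g/cm^3")  # ~2.43, matching the 2.44 above
print(f"rho_d ~ {bulk_density(7.10, 2.001):.2f} g/cm^3")  # ~4.89, matching the 4.9 above
```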
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2411.14911
• PDF: https://arxiv.org/pdf/2411.14911
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
🔹 Title:
AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models
🔹 Publication Date: Published on Jun 24
🔹 Abstract:
AnimaX creates multi-skeleton 3D animations by blending video diffusion model priors with skeleton-based control, using joint video-pose diffusion and shared positional encodings. AI-generated summary: We present AnimaX, a feed-forward 3D animation framework that bridges the motion priors of video diffusion models with the controllable structure of skeleton-based animation. Traditional motion synthesis methods are either restricted to fixed skeletal topologies or require costly optimization in high-dimensional deformation spaces. In contrast, AnimaX effectively transfers video-based motion knowledge to the 3D domain, supporting diverse articulated meshes with arbitrary skeletons. Our method represents 3D motion as multi-view, multi-frame 2D pose maps, and enables joint video-pose diffusion conditioned on template renderings and a textual motion prompt. We introduce shared positional encodings and modality-aware embeddings to ensure spatial-temporal alignment between video and pose sequences, effectively transferring video priors to the motion generation task. The resulting multi-view pose sequences are triangulated into 3D joint positions and converted into mesh animation via inverse kinematics. Trained on a newly curated dataset of 160,000 rigged sequences, AnimaX achieves state-of-the-art results on VBench in generalization, motion fidelity, and efficiency, offering a scalable solution for category-agnostic 3D animation. Project page: https://anima-x.github.io/
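The triangulation step mentioned above is standard multi-view geometry; below is a minimal sketch using linear (DLT) triangulation. The camera matrices and 2D keypoints are toy placeholders, not AnimaX's data.

```python
import numpy as np

def triangulate_joint(projections: list[np.ndarray], points2d: list[np.ndarray]) -> np.ndarray:
    """DLT: each view's 3x4 camera matrix and (u, v) keypoint give two linear
    constraints on the homogeneous 3D point; solve by SVD."""
    rows = []
    for P, (u, v) in zip(projections, points2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Two toy views observing the 3D joint (0, 0, 5):
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
print(triangulate_joint([P1, P2], [np.array([0.0, 0.0]), np.array([-0.2, 0.0])]))
# -> approximately [0. 0. 5.]
```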
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.19851
• PDF: https://arxiv.org/pdf/2506.19851
• Project Page: https://anima-x.github.io/
• Github: https://github.com/anima-x/anima-x
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
MNN: A Universal and Efficient Inference Engine
Article Date: 27 Feb 2020
Article Description:
Deploying deep learning models on mobile devices has drawn increasing attention recently. However, designing an efficient on-device inference engine faces great challenges from model compatibility, device diversity, and resource limitation. To deal with these challenges, we propose Mobile Neural Network (MNN), a universal and efficient inference engine tailored to mobile applications. In this paper, the contributions of MNN include: (1) presenting a mechanism called pre-inference that conducts runtime optimization; (2) delivering thorough kernel optimization on operators to achieve optimal computation performance; (3) introducing a backend abstraction module which enables hybrid scheduling and keeps the engine lightweight. Extensive benchmark experiments demonstrate that MNN performs favorably against other popular lightweight deep learning frameworks. MNN is available to the public at: https://github.com/alibaba/MNN.
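For context, a minimal inference sketch with MNN's Python bindings is below; it follows the project's published legacy session examples, but exact names may differ across versions, so treat it as illustrative. The model path is a placeholder.

```python
import numpy as np
import MNN

interpreter = MNN.Interpreter("model.mnn")       # "model.mnn" is a placeholder path
session = interpreter.createSession()             # pre-inference/runtime setup happens here
input_tensor = interpreter.getSessionInput(session)

# Wrap a host buffer in an MNN tensor and copy it into the session input.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
host = MNN.Tensor((1, 3, 224, 224), MNN.Halide_Type_Float,
                  data, MNN.Tensor_DimensionType_Caffe)
input_tensor.copyFrom(host)

interpreter.runSession(session)
output_tensor = interpreter.getSessionOutput(session)
print(output_tensor.getShape())
```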
PDF Download Link:
https://arxiv.org/pdf/2002.12418v1.pdf
GitHub:
• https://github.com/alibaba/MNN
Datasets:
• No datasets information available
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
Efficient Part-level 3D Object Generation via Dual Volume Packing
Article Date: 11 Jun 2025
Article Description:
Recent progress in 3D object generation has greatly improved both quality and efficiency. However, most existing methods generate a single mesh with all parts fused together, which limits the ability to edit or manipulate individual parts. A key challenge is that different objects may have a varying number of parts. To address this, we propose a new end-to-end framework for part-level 3D object generation. Given a single input image, our method generates high-quality 3D objects with an arbitrary number of complete and semantically meaningful parts. We introduce a dual volume packing strategy that organizes all parts into two complementary volumes, allowing for the creation of complete and interleaved parts that assemble into the final object. Experiments show that our model achieves better quality, diversity, and generalization than previous image-based part-level generation methods.
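To illustrate the packing idea (my reading of it, not the authors' code), parts can be assigned to two complementary volumes so that contacting parts never share a volume, like 2-coloring the part-contact graph:

```python
from collections import deque

def pack_into_two_volumes(n_parts: int, contacts: list[tuple[int, int]]) -> list[int]:
    """Assign each part a volume id (0 or 1) so contacting parts alternate."""
    adj = [[] for _ in range(n_parts)]
    for a, b in contacts:
        adj[a].append(b)
        adj[b].append(a)
    volume = [-1] * n_parts
    for start in range(n_parts):
        if volume[start] != -1:
            continue
        volume[start] = 0
        queue = deque([start])
        while queue:
            p = queue.popleft()
            for q in adj[p]:
                if volume[q] == -1:
                    volume[q] = 1 - volume[p]   # touching parts go to opposite volumes
                    queue.append(q)
    return volume

# A four-part chain interleaves across the two volumes:
print(pack_into_two_volumes(4, [(0, 1), (1, 2), (2, 3)]))  # [0, 1, 0, 1]
```

Odd contact cycles (where perfect separation is impossible) would need extra handling; the paper's learned strategy operates on volumetric representations rather than an explicit graph.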
PDF Download Link:
https://arxiv.org/pdf/2506.09980v1.pdf
GitHub:
• https://github.com/nvlabs/partpacker
Datasets:
• No datasets information available
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
🔹 Title:
Transformers without Normalization
🔹 Publication Date: Published on Mar 13
🔹 Abstract:
Dynamic Tanh (DyT) replaces normalization layers in Transformers, achieving equivalent or superior performance without hyperparameter tuning across various tasks. AI-generated summary: Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation DyT(x) = tanh(alpha * x), as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, S-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
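The operation is simple enough to transcribe directly from the abstract; a drop-in PyTorch module is sketched below. The learnable scalar alpha comes from the abstract; the per-channel gain and bias mirror what LayerNorm provides, and their exact form (and alpha's initialization) in the paper may differ slightly.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """DyT(x) = gamma * tanh(alpha * x) + beta, replacing LayerNorm."""
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar steepness
        self.gamma = nn.Parameter(torch.ones(dim))                # per-channel gain
        self.beta = nn.Parameter(torch.zeros(dim))                # per-channel bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

x = torch.randn(2, 16, 512)
print(DyT(512)(x).shape)  # torch.Size([2, 16, 512]) -- same interface as LayerNorm(512)
```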
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2503.10622
• PDF: https://arxiv.org/pdf/2503.10622
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
🔹 Title:
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
🔹 Publication Date: Published on Jun 19
🔹 Abstract:
GRPO-CARE, a reinforcement learning framework optimizing for consistency and correctness, outperforms standard GRPO on a new video understanding benchmark, SEED-Bench-R1, improving both performance and logical coherence in multimodal large language models. AI-generated summary: Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model's reasoning-to-answer likelihood (via a slowly evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
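A schematic of the two-tiered reward (illustrative shapes and a made-up bonus scale, not the released implementation): each rollout in a group earns the base correctness reward, plus a bonus when the reference model rates its answer as unusually likely given its own reasoning, relative to its peers.

```python
import torch

def grpo_care_reward(correct: torch.Tensor,
                     ref_logp_answer: torch.Tensor,
                     bonus_scale: float = 0.5) -> torch.Tensor:
    """correct: (G,) 0/1 answer correctness for a group of G rollouts.
    ref_logp_answer: (G,) reference-model log-likelihood of each rollout's
    answer conditioned on its own reasoning trace."""
    base = correct.float()
    # Peer comparison: consistency is judged relative to the group, not absolutely.
    consistent = (ref_logp_answer > ref_logp_answer.mean()).float()
    return base + bonus_scale * consistent * base  # bonus only for correct, coherent paths

print(grpo_care_reward(torch.tensor([1, 1, 0, 1]),
                       torch.tensor([-2.0, -9.0, -3.0, -1.5])))
# tensor([1.5000, 1.0000, 0.0000, 1.5000])
```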
🔹 Links:
• arXiv Page: https://arxiv.org/pdf/2506.16141
• PDF: https://arxiv.org/pdf/2506.16141
• Github: https://github.com/TencentARC/GRPO-CARE
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
MoonCast: High-Quality Zero-Shot Podcast Generation
Article Date: 18 Mar 2025
Article Description:
Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize natural podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To generate long audio, we adopt a long-context language model-based audio modeling approach utilizing large-scale long-context speech data. To enhance spontaneity, we utilize a podcast generation module to generate scripts with spontaneous details, which have been empirically shown to be as crucial as the text-to-speech modeling itself. Experiments demonstrate that MoonCast outperforms baselines, with particularly notable improvements in spontaneity and coherence.
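The two-stage pipeline described above can be sketched as follows; every function here is a stand-in for MoonCast's actual modules (script generation, then long-context audio modeling), so treat this purely as an architectural outline.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    text: str

def generate_script(source_text: str, speakers: list[str]) -> list[Turn]:
    """Stand-in for the podcast-script module: turn a document into
    spontaneous, interleaved dialogue (fillers, reactions, hand-offs)."""
    sentences = [s.strip() for s in source_text.split(".") if s.strip()]
    return [Turn(speakers[i % len(speakers)], s + ", right?") for i, s in enumerate(sentences)]

def synthesize(turns: list[Turn]) -> list[bytes]:
    """Stand-in for the long-context audio LM: one waveform per turn,
    conditioned on prior audio for prosodic continuity across minutes."""
    return [f"<audio {t.speaker}: {len(t.text)} chars>".encode() for t in turns]

script = generate_script("TTS has improved. Podcasts remain hard.", ["host", "guest"])
print(synthesize(script))
```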
PDF Download Link:
https://arxiv.org/pdf/2503.14345v2.pdf
GitHub:
• https://github.com/jzq2000/mooncast
Datasets:
• No datasets information available
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
CausalPFN: Amortized Causal Effect Estimation via In-Context Learning
Article Date: 9 Jun 2025
Article Description:
Causal effect estimation from observational data is fundamental across various applications. However, selecting an appropriate estimator from dozens of specialized methods demands substantial manual effort and domain expertise. We present CausalPFN, a single transformer that amortizes this workflow: trained once on a large library of simulated data-generating processes that satisfy ignorability, it infers causal effects for new observational datasets out-of-the-box. CausalPFN combines ideas from Bayesian causal inference with the large-scale training protocol of prior-fitted networks (PFNs), learning to map raw observations directly to causal effects without any task-specific adjustment. Our approach achieves superior average performance on heterogeneous and average treatment effect estimation benchmarks (IHDP, Lalonde, ACIC). Moreover, it shows competitive performance for real-world policy making on uplift modeling tasks. CausalPFN provides calibrated uncertainty estimates to support reliable decision-making based on Bayesian principles. This ready-to-use model does not require any further training or tuning and takes a step toward automated causal inference (https://github.com/vdblm/CausalPFN).
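A usage sketch of the amortized workflow (the estimator calls are hypothetical placeholders, not the package's actual API; only the toy data and the naive baseline below actually run):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(size=(n, 5))                           # covariates
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-x[:, 0])))   # confounded treatment
y = 2.0 * t + x[:, 0] + rng.normal(size=n)            # true ATE = 2.0

# estimator = CausalPFN.from_pretrained(...)   # hypothetical entry point
# effects = estimator.estimate(x, t, y)        # in-context, no task-specific fitting

# Naive difference in means is biased upward here, because x[:, 0] drives both
# treatment and outcome -- exactly the confounding an ignorability-aware
# amortized estimator is meant to adjust for.
print(y[t == 1].mean() - y[t == 0].mean())  # noticeably > 2.0
```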
PDF Download Link:
https://arxiv.org/pdf/2506.07918v1.pdf
GitHub:
• https://github.com/vdblm/CausalPFN
Datasets:
• IHDP
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
🔹 Title:
Use Property-Based Testing to Bridge LLM Code Generation and Validation
🔹 Publication Date: Published on Jun 23
🔹 Abstract:
A novel framework using Property-Based Testing and collaborative LLM-based agents improves code generation correctness and generalization. AI-generated summary: Large Language Models (LLMs) excel at code generation, but ensuring their outputs are functionally correct, especially in complex programming tasks, is a persistent challenge. While traditional Test-Driven Development (TDD) offers a path for code refinement, its efficacy with LLMs is often undermined by the scarcity of high-quality test cases or the pitfalls of automated test generation, including biased tests or inaccurate output predictions that can misdirect the correction process. This paper introduces Property-Generated Solver, a novel framework that leverages Property-Based Testing (PBT) to validate high-level program properties or invariants, instead of relying on specific input-output examples. These properties are often simpler to define and verify than directly predicting exhaustive test oracles, breaking the "cycle of self-deception" where tests might share flaws with the code they are meant to validate. Property-Generated Solver employs two collaborative LLM-based agents: a Generator dedicated to code generation and iterative refinement, and a Tester that manages the PBT life-cycle and formulates semantically rich feedback from property violations. The resulting comprehensive and actionable feedback then guides the Generator in its refinement efforts. By establishing PBT as the core validation engine within this iterative, closed-loop paradigm, Property-Generated Solver provides a robust mechanism for steering LLMs towards more correct and generalizable code. Extensive experimental results on multiple code generation benchmarks demonstrate that Property-Generated Solver achieves substantial pass@1 improvements, ranging from 23.1% to 37.3% relative gains over established TDD methods.
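For readers unfamiliar with PBT, here is what such a property looks like in practice, using the standard Hypothesis library (my own toy example; the paper's Tester agent produces properties of this kind for LLM-written code):

```python
from hypothesis import given, strategies as st

def run_length_encode(s: str) -> list[tuple[str, int]]:
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def run_length_decode(pairs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in pairs)

@given(st.text())
def test_roundtrip(s: str) -> None:
    # The property checks an invariant over all inputs, not fixed I/O pairs:
    assert run_length_decode(run_length_encode(s)) == s

test_roundtrip()  # Hypothesis generates many random and adversarial inputs
```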
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.18315
• PDF: https://arxiv.org/pdf/2506.18315
• Github: https://github.com/HeLeHanPrivate/PBTwithCodeGen
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
TinyLlama: An Open-Source Small Language Model
Article Date: 4 Jan 2024
Article Description:
We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention and Lit-GPT), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models of comparable size. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.
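The checkpoints load with the standard transformers API; the repository id below is the widely used chat release and is my assumption, so check the GitHub page above for the currently recommended weights.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

ids = tok("Small language models can", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```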
PDF Download Link:
https://arxiv.org/pdf/2401.02385v2.pdf
GitHub:
• https://github.com/Lightning-AI/lit-gpt
• https://github.com/jzhang38/tinyllama
Datasets:
• MMLU
• HellaSwag
• PIQA
• BoolQ
• WinoGrande
• DROP
• BIG-bench
• BBH
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
🔹 Title:
MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models
🔹 Publication Date: Published on Jun 6
🔹 Abstract:
A heterogeneous Mixture-of-Adapters (MoA) approach enhances parameter-efficient fine-tuning in LLMs by integrating diverse adapter experts, outperforming homogeneous MoE-LoRA methods. AI-generated summary: Recent studies integrate Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) to further enhance the performance of parameter-efficient fine-tuning (PEFT) methods in Large Language Model (LLM) applications. Existing methods employ homogeneous MoE-LoRA architectures composed of LoRA experts with either similar or identical structures and capacities. However, these approaches often suffer from representation collapse and expert load imbalance, which negatively impact the potential of LLMs. To address these challenges, we propose a heterogeneous Mixture-of-Adapters (MoA) approach. This method dynamically integrates PEFT adapter experts with diverse structures, leveraging their complementary representational capabilities to foster expert specialization, thereby enhancing the effective transfer of pre-trained knowledge to downstream tasks. MoA supports two variants: (i) Soft MoA achieves fine-grained integration by performing a weighted fusion of all expert outputs; (ii) Sparse MoA activates adapter experts sparsely based on their contribution, with negligible performance degradation. Experimental results demonstrate that heterogeneous MoA outperforms homogeneous MoE-LoRA methods in both performance and parameter efficiency. Our project is available at https://github.com/DCDmllm/MoA.
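A minimal sketch of the idea (illustrative, not the released code): heterogeneous experts differ in capacity (here, LoRA rank), a router weights them per token, and the same module covers both variants, soft fusion of all experts or sparse top-k activation.

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

class MixtureOfAdapters(nn.Module):
    def __init__(self, dim: int, ranks=(4, 8, 16), top_k=None):
        super().__init__()
        self.experts = nn.ModuleList(LoRAAdapter(dim, r) for r in ranks)  # heterogeneous capacities
        self.router = nn.Linear(dim, len(ranks))
        self.top_k = top_k  # None -> Soft MoA; an int -> Sparse MoA

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.router(x), dim=-1)                 # (B, T, E) routing weights
        if self.top_k is not None:                                # sparse variant
            _, idx = w.topk(self.top_k, dim=-1)
            mask = torch.zeros_like(w).scatter(-1, idx, 1.0)
            w = w * mask / (w * mask).sum(-1, keepdim=True)       # renormalize survivors
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        return x + (outs * w.unsqueeze(-2)).sum(-1)               # weighted fusion into residual

print(MixtureOfAdapters(64, top_k=1)(torch.randn(2, 5, 64)).shape)  # torch.Size([2, 5, 64])
```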
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.05928
• PDF: https://arxiv.org/pdf/2506.05928
• Github: https://github.com/DCDmllm/MoA
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
🔹 Title:
A Multimodal Automated Interpretability Agent
🔹 Publication Date: Published on Apr 22, 2024
🔹 Abstract:
MAIA, a multimodal automated interpretability agent, uses neural models to perform feature interpretation and failure mode discovery for other models, producing results comparable to human experimenters and aiding in reducing sensitivity to spurious features and identifying potential misclassifications. AI-generated summary: This paper describes MAIA, a Multimodal Automated Interpretability Agent. MAIA is a system that uses neural models to automate neural model understanding tasks like feature interpretation and failure mode discovery. It equips a pre-trained vision-language model with a set of tools that support iterative experimentation on subcomponents of other models to explain their behavior. These include tools commonly used by human interpretability researchers: for synthesizing and editing inputs, computing maximally activating exemplars from real-world datasets, and summarizing and describing experimental results. Interpretability experiments proposed by MAIA compose these tools to describe and explain system behavior. We evaluate applications of MAIA to computer vision models. We first characterize MAIA's ability to describe (neuron-level) features in learned representations of images. Across several trained models and a novel dataset of synthetic vision neurons with paired ground-truth descriptions, MAIA produces descriptions comparable to those generated by expert human experimenters. We then show that MAIA can aid in two additional interpretability tasks: reducing sensitivity to spurious features, and automatically identifying inputs likely to be misclassified.
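One of the listed tools, computing maximally activating exemplars, reduces to a short loop; the sketch below uses a stand-in model and random images in place of MAIA's actual tool stack.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()  # stand-in for the model under study
acts: dict[str, torch.Tensor] = {}
model.layer3.register_forward_hook(lambda m, i, o: acts.update(feat=o))

images = torch.randn(64, 3, 224, 224)  # stand-in for a real-world dataset
with torch.no_grad():
    model(images)

unit = 7                                          # the neuron (channel) under study
scores = acts["feat"][:, unit].amax(dim=(1, 2))   # peak activation per image
print(scores.topk(5).indices.tolist())            # indices of the top exemplars
```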
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2404.14394
• PDF: https://arxiv.org/pdf/2404.14394
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
🔹 Title:
Scaling Test-time Compute for LLM Agents
🔹 Publication Date: Published on Jun 15
🔹 Abstract:
Systematic exploration of test-time scaling methods in large language agents reveals that computational scaling improves performance, especially through parallel sampling, sequential revision, effective verification, and increased rollout diversity. AI-generated summary: Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents and investigate the extent to which it improves their effectiveness. Specifically, we explore different test-time scaling strategies, including: (1) parallel sampling algorithms; (2) sequential revision strategies; (3) verifiers and merging methods; (4) strategies for diversifying rollouts. We carefully analyze and ablate the impact of different design strategies on applying test-time scaling to language agents, and arrive at the following findings: 1. Scaling test-time compute can improve the performance of agents. 2. Knowing when to reflect is important for agents. 3. Among different verification and result-merging approaches, the list-wise method performs best. 4. Increasing diversified rollouts exerts a positive effect on the agent's task performance.
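Finding 3 (list-wise verification with parallel sampling) amounts to a best-of-n loop like the sketch below; the sampler and verifier are toy stand-ins for an LLM agent and a learned verifier.

```python
import random
from collections import Counter

def sample_rollouts(task: str, n: int) -> list[str]:
    rng = random.Random(0)
    return [f"plan-{i}: {task} via strategy {rng.randint(0, 3)}" for i in range(n)]

def listwise_verify(candidates: list[str]) -> list[float]:
    """List-wise scoring: each candidate is judged in the context of the whole
    list (toy heuristic: reward strategies that peers rarely propose)."""
    strategies = [c.split()[-1] for c in candidates]
    freq = Counter(strategies)
    return [1.0 / freq[s] for s in strategies]

def best_of_n(task: str, n: int = 8) -> str:
    rollouts = sample_rollouts(task, n)      # (1) parallel sampling
    scores = listwise_verify(rollouts)       # (3) list-wise verification
    return max(zip(scores, rollouts))[1]     # merge by picking the top candidate

print(best_of_n("book a flight"))
```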
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.12928
• PDF: https://arxiv.org/pdf/2506.12928
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
Learning Compact Vision Tokens for Efficient Large Multimodal Models
Article Date: 8 Jun 2025
Article Description:
Large multimodal models (LMMs) suffer significant computational challenges due to the high cost of Large Language Models (LLMs) and the quadratic complexity of processing long vision token sequences. In this paper, we explore the spatial redundancy among vision tokens and shorten the length of vision token sequences for inference acceleration. Specifically, we propose a Spatial Token Fusion (STF) method to learn compact vision tokens for short vision token sequences, where spatially adjacent tokens are fused into one. Meanwhile, a weight-frozen vision encoder cannot adapt well to the demands of extensive downstream vision-language tasks. To this end, we further introduce a Multi-Block Token Fusion (MBTF) module to supplement multi-granularity features for the reduced token sequence. Overall, we combine the STF and MBTF modules to balance token reduction and information preservation, thereby improving inference efficiency without sacrificing multimodal reasoning capabilities. Experimental results demonstrate that our method based on LLaVA-1.5 achieves comparable or even superior performance to the baseline on 8 popular vision-language benchmarks with only $25\%$ of the baseline's vision tokens. The source code and trained weights are available at https://github.com/visresearch/LLaVA-STF.
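The core reduction can be pictured as fusing each 2x2 neighborhood of the vision-token grid into one token, which takes a 24x24 grid (576 tokens) to 12x12 (144 tokens), i.e. 25% of the original. The sketch below uses plain average pooling; the actual STF module learns the fusion.

```python
import torch
import torch.nn.functional as F

def fuse_spatial_tokens(v: torch.Tensor, grid: int = 24, window: int = 2) -> torch.Tensor:
    """v: (B, grid*grid, D) vision tokens -> (B, (grid//window)**2, D)."""
    b, n, d = v.shape
    assert n == grid * grid, "tokens must form a square grid"
    v = v.view(b, grid, grid, d).permute(0, 3, 1, 2)  # (B, D, H, W)
    v = F.avg_pool2d(v, window)                       # fuse each window x window patch
    return v.flatten(2).transpose(1, 2)               # back to (B, N/window^2, D)

tokens = torch.randn(1, 576, 1024)                    # e.g. a 24x24 ViT patch grid
print(fuse_spatial_tokens(tokens).shape)              # torch.Size([1, 144, 1024])
```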
PDF Download Link:
https://arxiv.org/pdf/2506.07138v1.pdf
GitHub:
β’ https://github.com/visresearch/LLaVA-STF
Datasets:
β’ No datasets information available
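For readers who want the mechanics, below is a minimal PyTorch sketch of the spatial-fusion idea described above: each 2x2 neighborhood of vision tokens is concatenated and linearly projected into a single token, cutting a 576-token LLaVA-1.5 sequence to 144 tokens (25%). The module name and the concatenate-then-project fusion rule are illustrative assumptions, not the released LLaVA-STF code.

```python
# Minimal sketch of spatial token fusion: merge each 2x2 neighborhood of
# vision tokens into one token via a learned linear projection. The module
# and its fusion rule are illustrative, not the official LLaVA-STF code.
import torch
import torch.nn as nn

class SpatialTokenFusion(nn.Module):
    def __init__(self, dim: int, window: int = 2):
        super().__init__()
        self.window = window
        # concatenate window*window adjacent tokens, project back to dim
        self.proj = nn.Linear(dim * window * window, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, H*W, dim) from the vision encoder, with H == W
        b, n, d = tokens.shape
        h = w = int(n ** 0.5)
        k = self.window
        x = tokens.view(b, h, w, d)
        # group each k x k spatial neighborhood and flatten its channels
        x = x.view(b, h // k, k, w // k, k, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // k) * (w // k), k * k * d)
        return self.proj(x)  # (batch, H*W/4, dim) for k == 2

# 576 tokens (24x24 grid) -> 144 tokens, i.e. 25% of the original sequence
fusion = SpatialTokenFusion(dim=1024)
out = fusion(torch.randn(1, 576, 1024))
print(out.shape)  # torch.Size([1, 144, 1024])
```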
==================================
For more data science resources:
β https://t.iss.one/DataScienceT
πΉ Title:
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
πΉ Publication Date: Published on Jun 19
πΉ Abstract:
GRPO-CARE, a reinforcement learning framework optimizing for consistency and correctness, outperforms standard GRPO on a new video understanding benchmark, SEED-Bench-R1, improving both performance and logical coherence in multimodal large language models. AI-generated summary Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model's reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
πΉ Links:
β’ arXiv Page: https://arxiv.org/abs/2506.16141
β’ PDF: https://arxiv.org/pdf/2506.16141
β’ Github: https://github.com/TencentARC/GRPO-CARE
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
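To make the two-tiered reward concrete, here is a minimal NumPy sketch under stated assumptions: correctness earns a base reward, and a bonus is added when a sample's reasoning-to-answer log-likelihood under a slowly-updated reference model exceeds the group mean. The thresholding rule, the bonus weight, and restricting the bonus to correct samples are illustrative choices, not the paper's exact formulation.

```python
# Minimal sketch of a GRPO-CARE-style two-tiered reward: base correctness
# reward plus a consistency bonus for samples whose reasoning makes the
# answer more likely (under a reference model) than the group average.
# The comparison rule and bonus_weight are illustrative assumptions.
import numpy as np

def care_rewards(
    correct: np.ndarray,          # (G,) 1.0 if the final answer is correct
    ref_logp_answer: np.ndarray,  # (G,) reference-model log p(answer | reasoning)
    bonus_weight: float = 0.5,
) -> np.ndarray:
    """Two-tiered reward for a group of G sampled responses."""
    base = correct.astype(float)
    # consistency bonus: reasoning that beats the group-average likelihood
    # earns extra reward, but only when the answer is also correct
    above_peers = (ref_logp_answer > ref_logp_answer.mean()).astype(float)
    return base + bonus_weight * above_peers * base

# group-relative advantages, as in GRPO
rewards = care_rewards(
    correct=np.array([1, 1, 0, 1]),
    ref_logp_answer=np.array([-1.2, -3.5, -0.8, -2.0]),
)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(rewards, advantages)
```

In a full training loop, the reference model would be an exponential moving average of the policy, and these group-normalized advantages would replace the KL-penalized objective of standard GRPO.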
==================================
For more data science resources:
β https://t.iss.one/DataScienceT