Article Title:
SOAP: Style-Omniscient Animatable Portraits
Article Date: 8 May 2025
Article Description:
Creating animatable 3D avatars from a single image remains challenging due to style limitations (realistic, cartoon, anime) and difficulties in handling accessories or hairstyles. While 3D diffusion models advance single-view reconstruction for general objects, outputs often lack animation controls or suffer from artifacts because of the domain gap. We propose SOAP, a style-omniscient framework to generate rigged, topology-consistent avatars from any portrait. Our method leverages a multiview diffusion model trained on 24K 3D heads with multiple styles and an adaptive optimization pipeline to deform the FLAME mesh while maintaining topology and rigging via differentiable rendering. The resulting textured avatars support FACS-based animation, integrate with eyeballs and teeth, and preserve details like braided hair or accessories. Extensive experiments demonstrate the superiority of our method over state-of-the-art techniques for both single-view head modeling and diffusion-based image-to-3D generation. Our code and data are publicly available for research purposes at https://github.com/TingtingLiao/soap.
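To make the "deform a template while preserving topology" idea concrete, here is a minimal PyTorch sketch, not SOAP's implementation: it optimizes per-vertex offsets so selected vertices match targets, with a Laplacian-style smoothness term standing in for the paper's differentiable-rendering losses. The vertex count, adjacency, landmark indices, and targets are all placeholders.

```python
import torch

# Toy stand-in for an adaptive template-fitting loop. The real method fits the
# FLAME mesh against generated multiview images via differentiable rendering;
# here a landmark term plus a smoothness regularizer plays that role.
V = 500
verts = torch.randn(V, 3)                                  # template vertex positions (placeholder)
neighbors = [torch.randint(0, V, (6,)) for _ in range(V)]  # fake adjacency for the toy
lmk_idx = torch.randint(0, V, (68,))                       # landmark vertex indices (placeholder)
lmk_target = torch.randn(68, 3)                            # landmark targets from images (placeholder)

offsets = torch.zeros(V, 3, requires_grad=True)
opt = torch.optim.Adam([offsets], lr=1e-2)

def smoothness(x):
    # Keep each vertex near the mean of its neighbors, so the template's local
    # structure (and hence its topology and rigging) is preserved.
    means = torch.stack([x[n].mean(dim=0) for n in neighbors])
    return ((x - means) ** 2).sum(dim=1).mean()

for step in range(200):
    opt.zero_grad()
    deformed = verts + offsets
    loss = ((deformed[lmk_idx] - lmk_target) ** 2).sum(dim=1).mean()
    loss = loss + 0.1 * smoothness(deformed)
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```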
PDF Download Link:
https://arxiv.org/pdf/2505.05022v2.pdf
GitHub:
• https://github.com/tingtingliao/soap
Datasets:
• NeRF
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
Article Date: 29 May 2025
Article Description:
Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.
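A minimal sketch of the archive-based loop described above, assuming stub functions in place of the foundation-model rewrite and the benchmark run; the sampling weights and scores are placeholders, not the DGM's actual procedure.

```python
import random

def propose_variant(parent_code: str) -> str:
    """Stand-in for the foundation-model call that rewrites an agent's code."""
    return parent_code + f"\n# tweak {random.randint(0, 9999)}"

def evaluate(agent_code: str) -> float:
    """Stand-in for running the agent on a coding benchmark (e.g. SWE-bench)."""
    return random.random()

seed_code = "# seed coding agent"
archive = [{"code": seed_code, "score": evaluate(seed_code)}]

for generation in range(20):
    # Sample a parent, biased toward higher-scoring agents but keeping diversity.
    weights = [0.1 + a["score"] for a in archive]
    parent = random.choices(archive, weights=weights, k=1)[0]
    child_code = propose_variant(parent["code"])
    child_score = evaluate(child_code)
    # Every evaluated child is archived, so the search keeps "stepping stones"
    # rather than greedily keeping only the current best agent.
    archive.append({"code": child_code, "score": child_score})

best = max(archive, key=lambda a: a["score"])
print(f"archive size={len(archive)}, best score={best['score']:.3f}")
```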
PDF Download Link:
https://arxiv.org/pdf/2505.22954v1.pdf
GitHub:
• https://github.com/jennyzzt/dgm
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
🔹 Publication Date: Published on Jun 12
🔹 Abstract:
AI-generated summary: AniMaker, a multi-agent framework using MCTS-Gen and AniEval, generates coherent storytelling videos from text input, outperforming existing models with better quality and efficiency.
Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation's logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components: MCTS-Gen in the Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in the Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
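A rough sketch of the MCTS-inspired, budget-limited candidate selection idea, with the video generator and the context-aware reviewer replaced by stubs; the budget, exploration constant, and expand/exploit rule are assumptions, not AniMaker's actual MCTS-Gen.

```python
import math
import random

def generate_clip(shot_prompt, seed):
    """Stand-in for the Photography Agent's video-model call."""
    return f"clip({shot_prompt}, seed={seed})"

def review_clip(clip, prev_clip):
    """Stand-in for the Reviewer Agent (AniEval-style, context-aware scoring)."""
    return random.random()

def select_clip(shot_prompt, prev_clip, budget=6, c=1.4):
    candidates = []  # each entry: [clip, total_score, visits]
    for t in range(1, budget + 1):
        if not candidates or random.random() < 0.5:
            # Expand: spend budget on a brand-new candidate clip.
            clip = generate_clip(shot_prompt, seed=t)
            candidates.append([clip, review_clip(clip, prev_clip), 1])
        else:
            # Exploit/explore: re-score the most promising candidate (UCB rule).
            best = max(candidates,
                       key=lambda n: n[1] / n[2] + c * math.sqrt(math.log(t) / n[2]))
            best[1] += review_clip(best[0], prev_clip)
            best[2] += 1
    return max(candidates, key=lambda n: n[1] / n[2])[0]

storyboard = ["shot 1: hero wakes up", "shot 2: hero leaves home"]
prev = None
for shot in storyboard:
    prev = select_clip(shot, prev)
    print("selected:", prev)
```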
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.10540
• PDF: https://arxiv.org/pdf/2506.10540
• Github: https://github.com/HITsz-TMG/Anim-Director
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting
Article Date: 5 Jun 2025
Article Description:
Neural rendering techniques, including NeRF and Gaussian Splatting (GS), rely on photometric consistency to produce high-quality reconstructions. However, in real-world scenarios, it is challenging to guarantee perfect photometric consistency in acquired images. Appearance codes have been widely used to address this issue, but their modeling capability is limited, as a single code is applied to the entire image. Recently, the bilateral grid was introduced to perform pixel-wise color mapping, but it is difficult to optimize and constrain effectively. In this paper, we propose a novel multi-scale bilateral grid that unifies appearance codes and bilateral grids. We demonstrate that this approach significantly improves geometric accuracy in dynamic, decoupled autonomous driving scene reconstruction, outperforming both appearance codes and bilateral grids. This is crucial for autonomous driving, where accurate geometry is important for obstacle avoidance and control. Our method shows strong results across four datasets: Waymo, NuScenes, Argoverse, and PandaSet. We further demonstrate that the improvement in geometry is driven by the multi-scale bilateral grid, which effectively reduces floaters caused by photometric inconsistency.
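For intuition on why a bilateral grid is more expressive than a single appearance code, here is a toy PyTorch sketch of grid slicing: a low-resolution 3D grid over (luminance, y, x) stores per-cell 3x4 affine color transforms, and each pixel fetches and applies its own transform. Grid resolution and initialization are placeholders, and this is a single-scale toy, not the paper's multi-scale design.

```python
import torch
import torch.nn.functional as F

N, H, W = 1, 64, 64
D, Gh, Gw = 8, 16, 16                         # bilateral grid resolution (assumption)
image = torch.rand(N, 3, H, W)
grid = torch.zeros(N, 12, D, Gh, Gw)          # 12 = 3x4 affine coefficients per cell
grid[:, 0], grid[:, 5], grid[:, 10] = 1.0, 1.0, 1.0   # initialize to identity mapping

# Per-pixel sampling coordinates in [-1, 1]: (x, y, luminance guidance).
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
lum = image.mean(dim=1) * 2 - 1
coords = torch.stack([xs.expand(N, H, W), ys.expand(N, H, W), lum], dim=-1)
coords = coords.view(N, 1, H, W, 3)

# Slice the grid: every pixel gets its own affine coefficients.
coeffs = F.grid_sample(grid, coords, align_corners=True).reshape(N, 3, 4, H, W)

# Apply the per-pixel 3x4 affine transform to RGB (homogeneous bias term appended).
rgb1 = torch.cat([image, torch.ones(N, 1, H, W)], dim=1)
mapped = torch.einsum("nckhw,nkhw->nchw", coeffs, rgb1)
print(mapped.shape)   # (1, 3, 64, 64)
```

A single appearance code would instead apply one global transform per image, which is exactly the limitation the pixel-wise slicing above removes.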
PDF Download Link:
https://arxiv.org/pdf/2506.05280v1.pdf
GitHub:
• https://github.com/bigcileng/bilateral-driving
Datasets:
• NeRF
• nuScenes
• PandaSet
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Visual Causal Scene Refinement for Video Question Answering
Article Date: 7 May 2023
Article Description:
Existing methods for video question answering (VideoQA) often suffer from spurious correlations between different modalities, leading to a failure in identifying the dominant visual evidence and the intended question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. In this paper, to discover critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). Particularly, a set of causal front-door intervention operations is introduced to explicitly find the visual causal scenes at both segment and frame levels. Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering. The code is available at https://github.com/YangLiu9208/VCSR.
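A toy sketch, loosely inspired by the question-guided refinement and causal/non-causal split described above (it does not implement the paper's front-door intervention): segments are scored by their affinity with the question embedding, the top-k are treated as the candidate causal scene, and the lowest-scoring ones form a contrast set. Dimensions and the top-k size are assumptions.

```python
import torch
import torch.nn.functional as F

T, d = 16, 256                      # number of segments and feature dim (assumption)
segment_feats = torch.randn(T, d)   # pooled per-segment visual features (placeholder)
question_feat = torch.randn(d)      # pooled question embedding (placeholder)

# Question-guided scores: segments aligned with the question get higher weight.
scores = segment_feats @ question_feat / d ** 0.5
weights = F.softmax(scores, dim=0)

k = 4
causal_idx = weights.topk(k).indices          # candidate "causal scene" segments
noncausal_idx = (-weights).topk(k).indices    # complement used as a contrast set

# Weighted causal-scene feature that a downstream answer head could consume.
causal_feat = (weights[causal_idx, None] * segment_feats[causal_idx]).sum(0)
print("causal segments:", causal_idx.tolist(), "non-causal:", noncausal_idx.tolist())
```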
PDF Download Link:
https://arxiv.org/pdf/2305.04224v2.pdf
GitHub:
• https://github.com/yangliu9208/vcsr
• https://github.com/hcplab-sysu/causal-vlreasoning
Datasets:
• NExT-QA
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention
Article Date: 23 May 2025
Article Description:
Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, substantially reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared to previous methods with heterogeneous representations in 3D VAE, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024 resolution using only 8 GPUs, a task typically requiring at least 32 GPUs for volumetric representations at 256 resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://www.neural4d.com/research/direct3d-s2.
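A minimal sketch of the basic efficiency idea behind attending over sparse volumes: gather only the tokens at occupied voxel coordinates and run standard multi-head attention on that much shorter sequence, then scatter the result back. SSA additionally structures the sparse attention spatially; this toy only shows the dense-to-sparse gathering, and the grid size, occupancy rate, and head count are placeholders.

```python
import torch
import torch.nn as nn

R, C = 32, 64                                  # grid resolution, channels (toy values)
occupancy = torch.rand(R, R, R) < 0.02         # ~2% of voxels are active
coords = occupancy.nonzero()                   # (M, 3) active voxel coordinates
M = coords.shape[0]

dense = torch.zeros(C, R, R, R)
dense[:, coords[:, 0], coords[:, 1], coords[:, 2]] = torch.randn(C, M)

# Attention runs only over the M active tokens, not over all R^3 voxels.
tokens = dense[:, coords[:, 0], coords[:, 1], coords[:, 2]].T      # (M, C)
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
out, _ = attn(tokens[None], tokens[None], tokens[None])            # (1, M, C)

# Scatter attended features back into the sparse volume.
dense[:, coords[:, 0], coords[:, 1], coords[:, 2]] = out[0].T.detach()
print(f"attended over {M} active tokens instead of {R ** 3} dense voxels")
```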
PDF Download Link:
https://arxiv.org/pdf/2505.17412v2.pdf
GitHub:
• https://github.com/DreamTechAI/Direct3D-S2
Datasets:
• ShapeNet
• Objaverse
• Objaverse-XL
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Sparsified State-Space Models are Efficient Highway Networks
🔹 Publication Date: Published on May 27
🔹 Abstract:
AI-generated summary: Simba, a hierarchical sparsification method for state-space models, enhances efficiency and information flow in natural language tasks by pruning tokens more aggressively in upper layers.
State-space models (SSMs) offer a promising architecture for sequence modeling, providing an alternative to Transformers by replacing expensive self-attention with linear recurrences. In this paper, we propose a simple yet effective trick to enhance SSMs within given computational budgets by sparsifying them. Our intuition is that tokens in SSMs are highly redundant due to gradual recurrent updates, and dense recurrence operations block the delivery of past information. In particular, we observe that upper layers of SSMs tend to be more redundant as they encode global information, while lower layers encode local information. Motivated by this, we introduce Simba, a hierarchical sparsification method for SSMs based on token pruning. Simba sparsifies upper layers more than lower layers, encouraging the upper layers to behave like highways. To achieve this, we propose a novel token pruning criterion for SSMs, measuring the global impact of tokens on the final output by accumulating local recurrences. We demonstrate that Simba outperforms the baseline model, Mamba, with the same FLOPS in various natural language tasks. Moreover, we illustrate the effect of highways, showing that Simba not only enhances efficiency but also improves the information flow across long sequences. Code is available at https://github.com/woominsong/Simba.
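One way to read the "accumulate local recurrences" criterion: for a diagonal recurrence h_t = a_t * h_{t-1} + b_t * x_t, token t contributes (prod_{s>t} a_s) * b_t * x_t to the final state, so its magnitude can serve as a global-impact score. The sketch below follows that reading with toy tensors; the exact criterion, keep ratios, and per-layer schedule are assumptions, not Simba's published settings.

```python
import torch

T, d = 128, 16                      # sequence length, state size (toy values)
a = torch.rand(T, d) * 0.2 + 0.8    # per-token decay gates in (0.8, 1.0)
bx = torch.randn(T, d)              # per-token input contributions b_t * x_t

# prod_{s>t} a_s via an exclusive reverse cumulative sum in log space.
log_a = a.log()
tail_decay = (log_a.flip(0).cumsum(0).flip(0) - log_a).exp()
scores = (bx * tail_decay).abs().sum(dim=-1)     # global impact per token

def prune(scores, keep_ratio):
    keep = max(1, int(keep_ratio * scores.numel()))
    return scores.topk(keep).indices.sort().values

# Upper layers are pruned more aggressively than lower layers (highway-like behavior).
for layer, keep_ratio in enumerate([0.9, 0.7, 0.5, 0.3]):
    kept = prune(scores, keep_ratio)
    print(f"layer {layer}: keep {kept.numel()} / {T} tokens")
```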
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2505.20698
• PDF: https://arxiv.org/pdf/2505.20698
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
LeVo: High-Quality Song Generation with Multi-Preference Alignment
Article Date: 9 Jun 2025
Article Description:
Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in sound quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, an LM-based framework consisting of LeLM and a music codec. LeLM is capable of modeling two types of tokens in parallel: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction following, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and DPO post-training. Experimental results demonstrate that LeVo consistently outperforms existing methods on both objective and subjective metrics. Ablation studies further justify the effectiveness of our designs. Audio examples are available at https://levo-demo.github.io/. Code is released at https://github.com/tencent-ailab/songgeneration.
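For reference, the standard DPO loss that such multi-preference alignment builds on, sketched with toy tensors; the per-aspect weighting is an illustrative assumption, not LeVo's documented aggregation scheme.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on (preferred, rejected) sequence log-probabilities."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy multi-preference setup: one preference pair per aspect (e.g. musicality,
# instruction following), combined into a single post-training objective.
aspects = {"musicality": 1.0, "instruction_following": 0.5}   # weights are assumptions
total = 0.0
for aspect, weight in aspects.items():
    pol_w, pol_l = torch.randn(8), torch.randn(8)     # placeholder log-probs
    ref_w, ref_l = torch.randn(8), torch.randn(8)
    total = total + weight * dpo_loss(pol_w, pol_l, ref_w, ref_l)
print(float(total))
```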
PDF Download Link:
https://arxiv.org/pdf/2506.07520v2.pdf
GitHub:
• https://github.com/tencent-ailab/songgeneration
Datasets:
• 100style
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Forwarded from Python | Machine Learning | Coding | R
This channel is for programmers, coders, and software engineers.
0️⃣ Python
1️⃣ Data Science
2️⃣ Machine Learning
3️⃣ Data Visualization
4️⃣ Artificial Intelligence
5️⃣ Data Analysis
6️⃣ Statistics
7️⃣ Deep Learning
8️⃣ Programming Languages
✅ https://t.iss.one/addlist/8_rRW2scgfRhOTc0
✅ https://t.iss.one/Codeprogrammer
🔹 Title:
MoCha: Towards Movie-Grade Talking Character Synthesis
🔹 Publication Date: Published on Mar 30
🔹 Abstract:
AI-generated summary: MoCha generates realistic talking character animations from speech and text using a speech-video attention mechanism and joint training on speech-labeled and text-labeled data, enabling multi-character conversations and superior realism.
Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking head, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability and generalization.
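A small sketch of what a speech-video window attention mask could look like: each video token only attends to speech tokens in a local window around its temporally aligned position. The token counts, window size, and alignment rule are assumptions for illustration, not MoCha's actual configuration.

```python
import torch

num_video, num_speech, window = 24, 96, 8        # toy sizes (assumptions)
ratio = num_speech / num_video

# Additive mask: -inf outside each video token's local speech window.
mask = torch.full((num_video, num_speech), float("-inf"))
for i in range(num_video):
    center = int(i * ratio)
    lo, hi = max(0, center - window), min(num_speech, center + window + 1)
    mask[i, lo:hi] = 0.0

# The mask is added to cross-attention logits before the softmax.
q = torch.randn(num_video, 64)    # video queries (placeholder)
k = torch.randn(num_speech, 64)   # speech keys (placeholder)
v = torch.randn(num_speech, 64)   # speech values (placeholder)
attn = torch.softmax(q @ k.T / 64 ** 0.5 + mask, dim=-1)
out = attn @ v
print(out.shape)                   # (24, 64)
```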
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2503.23307
• PDF: https://arxiv.org/pdf/2503.23307
• Project Page: https://congwei1230.github.io/MoCha/
• Github: https://github.com/congwei1230/MoChaBench
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/CongWei1230/MoCha-Generation-on-MoChaBench-Visualizer
• https://huggingface.co/datasets/CongWei1230/MoChaBench
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution
🔹 Publication Date: Published on Jun 24
🔹 Abstract:
AI-generated summary: Researchers propose design principles for cascaded video super-resolution models to improve high-resolution video generation, introducing degradation strategies, timestep sampling, noise augmentation, and interleaving temporal units with sparse local attention.
Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach involves decoupling the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we focus on studying key design principles for the latter cascaded VSR models, which are underexplored currently. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies, (2) noise augmentation effects on low-resolution (LR) inputs. These findings directly inform our architectural and training innovations. Finally, we introduce interleaving temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.
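To illustrate the general shape of building (LR, HR) pairs with a degradation plus noise augmentation on the LR conditioning input, here is a generic PyTorch sketch; the blur kernel, downsampling factor, and noise level are placeholders, not the paper's base-model-matched degradation strategies.

```python
import torch
import torch.nn.functional as F

def degrade(hr, scale=4, blur_sigma=1.0, noise_std=0.05):
    """Toy degradation: separable Gaussian blur + bicubic downsample + LR noise."""
    k = torch.arange(-2, 3, dtype=torch.float32)
    g = torch.exp(-k ** 2 / (2 * blur_sigma ** 2))
    g = (g / g.sum()).view(1, 1, 1, 5).repeat(hr.shape[1], 1, 1, 1)
    x = F.conv2d(hr, g, padding=(0, 2), groups=hr.shape[1])          # horizontal blur
    x = F.conv2d(x, g.transpose(2, 3), padding=(2, 0), groups=hr.shape[1])  # vertical blur
    lr = F.interpolate(x, scale_factor=1 / scale, mode="bicubic", align_corners=False)
    # Mild Gaussian noise augmentation on the LR conditioning input.
    return lr + noise_std * torch.randn_like(lr)

hr_frames = torch.rand(2, 3, 128, 128)     # a couple of toy HR frames
lr_frames = degrade(hr_frames)
print(lr_frames.shape)                      # (2, 3, 32, 32)
```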
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.19838
• PDF: https://arxiv.org/pdf/2506.19838
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Forwarded from Python | Machine Learning | Coding | R
Top 50 LLM Interview Questions!
A comprehensive resource that covers traditional ML basics, model architectures, real-world case studies, and theoretical foundations.
👇👇👇👇👇👇
✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Forwarded from Python | Machine Learning | Coding | R
LLM Interview Questions.pdf
71.2 KB
Top 50 LLM Interview Questions!
#LLM #AIInterviews #MachineLearning #DeepLearning #NLP #LLMInterviewPrep #ModelArchitectures #AITheory #TechInterviews #MLBasics #InterviewQuestions #LargeLanguageModels
✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
🔹 Title:
Is a PET all you need? A multi-modal study for Alzheimer's disease using 3D CNNs
🔹 Publication Date: Published on Jul 5, 2022
🔹 Abstract:
AI-generated summary: A systematic evaluation of multi-modal deep neural networks for Alzheimer's disease diagnosis shows that FDG-PET performs better than sMRI and that multi-modal fusion does not improve accuracy.
Alzheimer's Disease (AD) is the most common form of dementia and often difficult to diagnose due to the multifactorial etiology of dementia. Recent works on neuroimaging-based computer-aided diagnosis with deep neural networks (DNNs) showed that fusing structural magnetic resonance images (sMRI) and fluorodeoxyglucose positron emission tomography (FDG-PET) leads to improved accuracy in a study population of healthy controls and subjects with AD. However, this result conflicts with the established clinical knowledge that FDG-PET better captures AD-specific pathologies than sMRI. Therefore, we propose a framework for the systematic evaluation of multi-modal DNNs and critically re-evaluate single- and multi-modal DNNs based on FDG-PET and sMRI for binary healthy vs. AD, and three-way healthy/mild cognitive impairment/AD classification. Our experiments demonstrate that a single-modality network using FDG-PET performs better than MRI (accuracy 0.91 vs 0.87) and does not show improvement when combined. This conforms with the established clinical knowledge on AD biomarkers, but raises questions about the true benefit of multi-modal DNNs. We argue that future work on multi-modal fusion should systematically assess the contribution of individual modalities following our proposed evaluation framework. Finally, we encourage the community to go beyond healthy vs. AD classification and focus on differential diagnosis of dementia, where fusing multi-modal image information conforms with a clinical need.
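A minimal single-modality baseline of the kind compared in such studies: a small 3D CNN that maps a volumetric scan (e.g. FDG-PET) to three classes. Layer sizes and input resolution are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Toy 3D CNN classifier: one input channel (a single modality), three classes.
model = nn.Sequential(
    nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.BatchNorm3d(8), nn.ReLU(),
    nn.MaxPool3d(2),
    nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.BatchNorm3d(16), nn.ReLU(),
    nn.MaxPool3d(2),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(16, 3),                        # healthy / MCI / AD
)

pet = torch.randn(2, 1, 96, 96, 96)          # toy batch of volumes
logits = model(pet)
print(logits.shape)                           # (2, 3)
```

A multi-modal variant would add a second input branch for sMRI and fuse features before the classifier, which is exactly the comparison the paper's evaluation framework systematizes.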
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2207.02094
• PDF: https://arxiv.org/pdf/2207.02094
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Solving Inequality Proofs with Large Language Models
🔹 Publication Date: Published on Jun 9
🔹 Abstract:
AI-generated summary: The investigation into inequality proving using large language models uncovers significant challenges in constructing rigorous proofs, revealing gaps between finding answers and generating valid step-wise solutions.
Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.
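A tiny sketch of how a "final-answer judge plus step-wise judges" aggregation could be wired up, with plain-Python stubs standing in for the LLM judge calls; the flaw names and pass/fail logic are illustrative assumptions, not the paper's rubric.

```python
def final_answer_judge(solution, gold):
    """Stub for the final-answer judge (in practice, an LLM equivalence check)."""
    return solution["answer"] == gold

def make_stepwise_judge(flaw_name):
    def judge(solution):
        """Stub step-wise judge: passes if the named flaw is absent."""
        return flaw_name not in solution.get("flaws", [])
    return judge

STEP_JUDGES = [make_stepwise_judge(f) for f in
               ["toy_case_generalization", "unjustified_bound",
                "algebra_error", "circular_reasoning"]]   # flaw names are assumptions

def grade(solution, gold):
    # A solution only counts as correct if the final answer matches AND every
    # step-wise judge passes, which is what separates answer-finding from proving.
    if not final_answer_judge(solution, gold):
        return False
    return all(judge(solution) for judge in STEP_JUDGES)

sol = {"answer": "C >= 2", "flaws": ["unjustified_bound"]}
print(grade(sol, gold="C >= 2"))   # False: right answer, but a step-wise judge fails
```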
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.07927
• PDF: https://arxiv.org/pdf/2506.07927
• Github: https://ineqmath.github.io/#visualization
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/AI4Math/IneqMath
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
TradingAgents: Multi-Agents LLM Financial Trading Framework
Article Date: 28 Dec 2024
Article Description:
Significant progress has been made in automated problem-solving using societies of agents powered by large language models (LLMs). In finance, efforts have largely focused on single-agent systems handling specific tasks or multi-agent frameworks independently gathering data. However, the multi-agent systems' potential to replicate real-world trading firms' collaborative dynamics remains underexplored. TradingAgents proposes a novel stock trading framework inspired by trading firms, featuring LLM-powered agents in specialized roles such as fundamental analysts, sentiment analysts, technical analysts, and traders with varied risk profiles. The framework includes Bull and Bear researcher agents assessing market conditions, a risk management team monitoring exposure, and traders synthesizing insights from debates and historical data to make informed decisions. By simulating a dynamic, collaborative trading environment, this framework aims to improve trading performance. Detailed architecture and extensive experiments reveal its superiority over baseline models, with notable improvements in cumulative returns, Sharpe ratio, and maximum drawdown, highlighting the potential of multi-agent LLM frameworks in financial trading. TradingAgents is available at https://github.com/TauricResearch/TradingAgents.
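A rough sketch of the role-based orchestration described above, with a stubbed llm() call in place of real model queries; the role names follow the description, but the routing, veto rule, and outputs are placeholders rather than the TradingAgents implementation.

```python
def llm(role, context):
    """Stub for an LLM call; returns a placeholder opinion string."""
    return f"[{role}] opinion on: {context[:40]}..."

ANALYSTS = ["fundamental_analyst", "sentiment_analyst", "technical_analyst"]

def trading_round(market_data):
    # 1) Specialized analysts each produce a report from the market data.
    reports = {a: llm(a, market_data) for a in ANALYSTS}
    # 2) Bull and Bear researchers debate over the combined reports.
    bull = llm("bull_researcher", " ".join(reports.values()))
    bear = llm("bear_researcher", " ".join(reports.values()))
    # 3) A trader synthesizes the debate into a decision.
    decision = llm("trader", bull + " " + bear)
    # 4) A risk manager can veto the decision (toy veto rule).
    if "veto" in llm("risk_manager", decision):
        return "HOLD"
    return decision

print(trading_round("AAPL daily OHLCV and latest news headlines"))
```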
PDF Download Link:
https://arxiv.org/pdf/2412.20138v7.pdf
GitHub:
• https://github.com/tauricresearch/tradingagents
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Configurable Preference Tuning with Rubric-Guided Synthetic Data
🔹 Publication Date: Published on Jun 13
🔹 Abstract:
AI-generated summary: Configurable Preference Tuning enables language models to dynamically adjust their behavior based on human-interpretable directives, using rubric-guided preference data for fine-tuning and inference-time modulation.
Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a singular, static set of preferences, limiting adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework for endowing language models with the ability to dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes like writing style. By fine-tuning with these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced and context-dependent human feedback. Several experimental artifacts, such as training code, generated datasets and fine-tuned models are released at https://github.com/vicgalle/configurable-preference-tuning
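A small sketch of what rubric-conditioned preference pairs could look like: each record carries a system prompt derived from a rubric entry plus a chosen/rejected completion pair generated under that prompt. The rubric text and the generate() stub are illustrative placeholders, not the released dataset format.

```python
# Toy rubric: each style maps to a system prompt describing the desired attribute.
RUBRIC = {
    "terse": "Respond in at most two sentences, with no filler.",
    "florid": "Respond in an ornate, metaphor-rich style.",
}

def generate(system_prompt, user_prompt, quality):
    """Stub for an LLM call producing an on-rubric or off-rubric completion."""
    return f"<{quality} completion under '{system_prompt[:24]}...'>"

def build_pair(style, user_prompt):
    system_prompt = RUBRIC[style]
    return {
        "system": system_prompt,
        "prompt": user_prompt,
        "chosen": generate(system_prompt, user_prompt, "on-rubric"),
        "rejected": generate(system_prompt, user_prompt, "off-rubric"),
    }

dataset = [build_pair(style, "Explain overfitting.") for style in RUBRIC]
print(dataset[0]["system"], "->", dataset[0]["chosen"])
```

Pairs like these can then be fed to a standard DPO-style trainer with the system prompt prepended, which is what lets the tuned model switch styles at inference time.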
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.11702
• PDF: https://arxiv.org/pdf/2506.11702
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
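Below is a small, hypothetical Python sketch of how rubric-conditioned preference pairs could be assembled in the spirit of CPT: a rubric-derived system prompt defines the desired attribute, the "chosen" response is generated under that directive and the "rejected" one under the opposite directive, and the system prompt is kept so it can condition behavior at inference time. Field names follow the common DPO convention (prompt/chosen/rejected); the rubric text and the generate stub are assumptions, not the paper's released code.

# Hypothetical sketch: rubric-guided synthetic preference pairs for DPO-style tuning.
import json

RUBRIC = {
    "concise": "terse, technical, free of rhetorical flourishes",
    "ornate": "flowery, verbose, rich in metaphor",
}

def generate(system_prompt: str, user_prompt: str) -> str:
    # Stand-in for any instruction-following LLM; returns a placeholder string.
    return f"<completion written under directive: {system_prompt}>"

def make_pair(user_prompt: str, target: str = "concise") -> dict:
    opposite = "ornate" if target == "concise" else "concise"
    sys_target = f"Write in a style that is {RUBRIC[target]}."
    sys_opposite = f"Write in a style that is {RUBRIC[opposite]}."
    return {
        "system": sys_target,                      # reused verbatim at inference time
        "prompt": user_prompt,
        "chosen": generate(sys_target, user_prompt),
        "rejected": generate(sys_opposite, user_prompt),
    }

print(json.dumps(make_pair("Explain gradient descent."), indent=2))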
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
❤1
🔹 Title:
PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity
🔹 Publication Date: Published on Mar 10
🔹 Abstract:
PLADIS leverages sparse attention in cross-attention layers to enhance pre-trained text-to-image diffusion models, improving text alignment and human preference without additional training. Diffusion models have shown impressive results in generating high-quality conditional samples using guidance techniques such as Classifier-Free Guidance (CFG). However, existing methods often require additional training or neural function evaluations (NFEs), making them incompatible with guidance-distilled models. They also rely on heuristic approaches that require identifying target layers. In this work, we propose a novel and efficient method, termed PLADIS, which boosts pre-trained models (U-Net/Transformer) by leveraging sparse attention. Specifically, we extrapolate query-key correlations using softmax and its sparse counterpart in the cross-attention layer during inference, without requiring extra training or NFEs. By leveraging the noise robustness of sparse attention, PLADIS unleashes the latent potential of text-to-image diffusion models, enabling them to excel in areas where they once struggled. It integrates seamlessly with guidance techniques, including guidance-distilled models. Extensive experiments show notable improvements in text alignment and human preference, offering a highly efficient and universally applicable solution. A rough sketch of the dense-to-sparse attention extrapolation appears at the end of this entry.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2503.07677
• PDF: https://arxiv.org/pdf/2503.07677
• Github: https://cubeyoung.github.io/pladis-proejct/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
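As a rough illustration of mixing dense softmax attention with a sparse counterpart at inference time, the PyTorch sketch below computes both softmax and sparsemax attention weights and extrapolates between them with a scale factor. The sparsemax routine follows Martins & Astudillo (2016); the particular extrapolation rule and where it is applied are assumptions made for illustration and may differ from the exact PLADIS formulation.

# Illustrative sketch (PyTorch): extrapolating dense softmax attention toward a sparse one.
import torch

def sparsemax(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Sparsemax projection onto the probability simplex (Martins & Astudillo, 2016).
    z, _ = torch.sort(scores, dim=dim, descending=True)
    cumsum = z.cumsum(dim)
    k = torch.arange(1, scores.size(dim) + 1, device=scores.device, dtype=scores.dtype)
    shape = [1] * scores.dim()
    shape[dim] = -1
    support = (1 + k.view(shape) * z > cumsum).to(scores.dtype)
    k_support = support.sum(dim=dim, keepdim=True)
    tau = ((z * support).sum(dim=dim, keepdim=True) - 1) / k_support
    return torch.clamp(scores - tau, min=0.0)

def extrapolated_cross_attention(q, k, v, scale: float = 2.0):
    # A* = scale * A_sparse + (1 - scale) * A_softmax; scale > 1 pushes past the
    # dense weights toward the sparse ones (assumed combination rule, inference only).
    logits = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    a_dense = logits.softmax(dim=-1)
    a_sparse = sparsemax(logits, dim=-1)
    a_star = scale * a_sparse + (1.0 - scale) * a_dense
    return a_star @ v

q, k, v = (torch.randn(1, 8, 77, 64) for _ in range(3))  # (batch, heads, tokens, dim)
print(extrapolated_cross_attention(q, k, v).shape)        # torch.Size([1, 8, 77, 64])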
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
❤1
Article Title:
Urban1960SatSeg: Unsupervised Semantic Segmentation of Mid-20th century Urban Landscapes with Satellite Imageries
Article Date: 11 Jun 2025
Article Description:
Historical satellite imagery, such as mid-20th century Keyhole data, offers rare insights into early urban development and long-term transformation. However, severe quality degradation (e.g., distortion, misalignment, and spectral scarcity) and the absence of annotations have long hindered semantic segmentation on such historical RS imagery. To bridge this gap and enhance understanding of urban development, we introduce Urban1960SatBench, an annotated segmentation dataset based on historical satellite imagery with the earliest observation time among all existing segmentation datasets, along with a benchmark framework for unsupervised segmentation tasks, Urban1960SatUSM. First, Urban1960SatBench serves as a novel, expertly annotated semantic segmentation dataset built on mid-20th century Keyhole imagery, covering 1,240 km² and key urban classes (buildings, roads, farmland, water). As the earliest segmentation dataset of its kind, it provides a pioneering benchmark for historical urban understanding. Second, Urban1960SatUSM (Unsupervised Segmentation Model) is a novel unsupervised semantic segmentation framework for historical RS imagery. It employs a confidence-aware alignment mechanism and a focal-confidence loss built on a self-supervised learning architecture, generating robust pseudo-labels and adaptively prioritizing prediction difficulty and label reliability to improve unsupervised segmentation on noisy historical data without manual supervision. Experiments show that Urban1960SatUSM significantly outperforms existing unsupervised segmentation methods on Urban1960SatSeg for segmenting historical urban scenes, showing promise for quantitative studies of long-term urban change using modern computer vision. Our benchmark and supplementary material are available at https://github.com/Tianxiang-Hao/Urban1960SatSeg. A short sketch of a confidence-weighted focal loss appears below the links.
PDF Download Link:
https://arxiv.org/pdf/2506.09476v1.pdf
GitHub:
• https://github.com/tianxiang-hao/urban1960satseg
Datasets:
• No datasets information available
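The sketch below illustrates one plausible form of a confidence-weighted focal objective over pseudo-labels, in the spirit of the focal-confidence loss described above: a focal term down-weights easy pixels while a per-pixel confidence map down-weights unreliable pseudo-labels. The function name, the thresholding, and the exact weighting are assumptions for illustration; the paper's actual loss may differ.

# Hypothetical sketch (PyTorch): confidence-weighted focal loss over pseudo-labels.
import torch
import torch.nn.functional as F

def focal_confidence_loss(logits, pseudo_labels, confidence,
                          gamma: float = 2.0, conf_threshold: float = 0.5):
    # logits: (B, C, H, W); pseudo_labels: (B, H, W) int64; confidence: (B, H, W) in [0, 1]
    ce = F.cross_entropy(logits, pseudo_labels, reduction="none")  # per-pixel cross-entropy
    pt = torch.exp(-ce)                                            # prob of the pseudo class
    focal = (1.0 - pt) ** gamma * ce                               # emphasize hard pixels
    weight = confidence * (confidence > conf_threshold).float()    # trust reliable labels more
    return (weight * focal).sum() / weight.sum().clamp(min=1.0)

logits = torch.randn(2, 4, 64, 64)            # 4 classes: building / road / farmland / water
labels = torch.randint(0, 4, (2, 64, 64))     # pseudo-labels
conf = torch.rand(2, 64, 64)                  # per-pixel pseudo-label confidence
print(focal_confidence_loss(logits, labels, conf))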
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
❤2
🔹 Title:
MMSearch-R1: Incentivizing LMMs to Search
🔹 Publication Date: Published on Jun 25
🔹 Abstract:
MMSearch-R1, a reinforcement learning framework, enables large multimodal models to perform efficient, on-demand, multi-turn search in real-world environments, outperforming existing approaches. Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt-engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behavior. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them, guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs, and we curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient, on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search. A brief sketch of such a reward appears at the end of this entry.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.20670
• PDF: https://arxiv.org/pdf/2506.20670
• Github: https://github.com/EvolvingLMMs-Lab/multimodal-search-r1
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
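The tiny sketch below shows one plausible shape for an outcome-based reward with a search penalty: correct final answers are rewarded and each search call incurs a small cost, so the policy learns to search only when it is actually needed. The exact magnitudes and the condition under which the penalty applies are assumptions for illustration, not the paper's reward specification.

# Hypothetical sketch: outcome-based reward with a search penalty.
def outcome_reward(answer_correct: bool, num_search_calls: int,
                   search_penalty: float = 0.1) -> float:
    if not answer_correct:
        return 0.0                      # no credit for a wrong answer, searched or not
    return max(1.0 - search_penalty * num_search_calls, 0.0)  # mild cost per search call

# A correct answer with zero searches beats a correct answer that needed two searches.
print(outcome_reward(True, 0), outcome_reward(True, 2), outcome_reward(False, 3))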
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
❤1