Article Title:
MAGREF: Masked Guidance for Any-Reference Video Generation
Article Date: 29 May 2025
Article Description:
Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject inference, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF
PDF Download Link:
https://arxiv.org/pdf/2505.23742v1.pdf
GitHub:
• https://github.com/magref-video/magref
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
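The core idea in the MAGREF entry above is conditioning via pixel-wise channel concatenation with a region mask. Below is a minimal, illustrative PyTorch sketch of that general conditioning pattern; the tensor shapes, mask, and conv layer are assumptions for illustration, not the released MAGREF code.

```python
# Minimal sketch (not the official MAGREF code): conditioning a video
# denoiser by concatenating reference latents and a region mask along
# the channel dimension. All shapes/names are illustrative assumptions.
import torch
import torch.nn as nn

B, C, T, H, W = 1, 4, 8, 32, 32           # batch, latent channels, frames, height, width

noisy_video = torch.randn(B, C, T, H, W)  # noisy video latents at some diffusion step
ref_latents = torch.randn(B, C, T, H, W)  # reference-image latents broadcast over time
region_mask = torch.rand(B, 1, T, H, W)   # soft mask marking where each subject may appear

# Pixel-wise channel concatenation: conditions live in extra input channels.
cond_input = torch.cat([noisy_video, ref_latents, region_mask], dim=1)  # (B, 2C+1, T, H, W)

first_conv = nn.Conv3d(in_channels=2 * C + 1, out_channels=128, kernel_size=3, padding=1)
features = first_conv(cond_input)
print(features.shape)  # torch.Size([1, 128, 8, 32, 32])
```

The appeal of this pattern is that the conditions enter as extra input channels, so only the width of the first layer changes.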
Article Title:
RFUAV: A Benchmark Dataset for Unmanned Aerial Vehicle Detection and Identification
Article Date: 12 Mar 2025
Article Description:
In this paper, we propose RFUAV as a new benchmark dataset for radio-frequency based (RF-based) unmanned aerial vehicle (UAV) identification and address the following challenges: Firstly, many existing datasets feature a restricted variety of drone types and insufficient volumes of raw data, which fail to meet the demands of practical applications. Secondly, existing datasets often lack raw data covering a broad range of signal-to-noise ratios (SNR), or do not provide tools for transforming raw data to different SNR levels. This limitation undermines the validity of model training and evaluation. Lastly, many existing datasets do not offer open-access evaluation tools, leading to a lack of unified evaluation standards in current research within this field. RFUAV comprises approximately 1.3 TB of raw frequency data collected from 37 distinct UAVs using the Universal Software Radio Peripheral (USRP) device in real-world environments. Through in-depth analysis of the RF data in RFUAV, we define a drone feature sequence called RF drone fingerprint, which aids in distinguishing drone signals. In addition to the dataset, RFUAV provides a baseline preprocessing method and model evaluation tools. Rigorous experiments demonstrate that these preprocessing methods achieve state-of-the-art (SOTA) performance using the provided evaluation tools. The RFUAV dataset and baseline implementation are publicly available at https://github.com/kitoweeknd/RFUAV/.
PDF Download Link:
https://arxiv.org/pdf/2503.09033v2.pdf
GitHub:
• https://github.com/kitoweeknd/RFUAV
Datasets:
• RFUAV
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
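The RFUAV entry above mentions tools for transforming raw recordings to different SNR levels. As a rough illustration of how such a transform typically works, here is a generic AWGN-scaling sketch for real-valued signals; it is not the RFUAV toolkit.

```python
# Minimal sketch, not the RFUAV toolkit: degrading a clean recording to a
# target SNR by adding white Gaussian noise scaled from the signal power.
import numpy as np

def set_snr(signal: np.ndarray, target_snr_db: float, rng=None) -> np.ndarray:
    """Return a noisy copy of `signal` with approximately `target_snr_db` dB SNR."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(np.abs(signal) ** 2)
    noise_power = signal_power / (10 ** (target_snr_db / 10.0))
    noise = rng.normal(scale=np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: a synthetic narrowband tone pushed down to 0 dB SNR.
t = np.arange(0, 1, 1 / 10_000)
clean = np.sin(2 * np.pi * 1_000 * t)
noisy = set_snr(clean, target_snr_db=0.0)
```

For complex IQ captures the noise power would be split across the I and Q components, but the scaling rule is the same.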
🔹 Title:
Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction
🔹 Publication Date: Published on Jun 15
🔹 Abstract:
ChartIR uses structured instruction and iterative refinement to improve MLLM performance in chart-to-code generation by separating visual understanding and code translation tasks. AI-generated summary: Recently, multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. While they have achieved impressive results on various vision tasks, their performance on chart-to-code generation remains suboptimal. This task requires MLLMs to generate executable code that can reproduce a given chart, demanding not only precise visual understanding but also accurate translation of visual elements into structured code. Directly prompting MLLMs to perform this complex task often yields unsatisfactory results. To address this challenge, we propose ChartIR, an iterative refinement method based on structured instruction. First, we distinguish two tasks: visual understanding and code translation. To accomplish the visual understanding component, we design two types of structured instructions: description and difference. The description instruction captures the visual elements of the reference chart, while the difference instruction characterizes the discrepancies between the reference chart and the generated chart. These instructions effectively transform visual features into language representations, thereby facilitating the subsequent code translation process. Second, we decompose the overall chart generation pipeline into two stages: initial code generation and iterative refinement, enabling progressive enhancement of the final output. Experimental results show that, compared to other methods, our method achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.14837
• PDF: https://arxiv.org/pdf/2506.14837
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
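The ChartIR entry above splits chart-to-code generation into an initial draft plus description/difference-guided refinement rounds. A schematic sketch of that loop follows; `query_mllm` and `render_chart` are hypothetical placeholders, and the prompts are simplified stand-ins rather than ChartIR's actual instructions.

```python
# Schematic sketch of the two-stage idea (initial generation, then iterative
# refinement). The helpers below are hypothetical placeholders, not ChartIR's API.
def query_mllm(prompt: str, images: list) -> str:
    raise NotImplementedError("call your multimodal LLM here")

def render_chart(code: str):
    raise NotImplementedError("execute the plotting code and return an image")

def chart_to_code(reference_img, max_rounds: int = 3) -> str:
    # Stage 1: describe the reference chart, then draft code from the description.
    description = query_mllm("Describe the chart's visual elements.", [reference_img])
    code = query_mllm(f"Write plotting code for this chart.\n{description}", [reference_img])
    # Stage 2: render, compare against the reference, and refine.
    for _ in range(max_rounds):
        rendered = render_chart(code)
        difference = query_mllm("List discrepancies between these two charts.",
                                [reference_img, rendered])
        if "no differences" in difference.lower():
            break
        code = query_mllm(f"Revise the code to fix:\n{difference}\n\nCode:\n{code}",
                          [reference_img, rendered])
    return code
```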
🔹 Title:
Show-o2: Improved Native Unified Multimodal Models
🔹 Publication Date: Published on Jun 18
🔹 Abstract:
Show-o2 leverages autoregressive modeling and flow matching within a 3D causal variational autoencoder to create unified visual representations for multimodal understanding and generation tasks. AI-generated summary: This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.15564
• PDF: https://arxiv.org/pdf/2506.15564
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
• https://huggingface.co/spaces/showlab/Show-o
• https://huggingface.co/spaces/svjack/Show-o
• https://huggingface.co/spaces/showlab/Show-o-512
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
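Show-o2's flow head is trained with flow matching. For readers unfamiliar with the objective, here is a minimal, generic rectified-flow/flow-matching training step on toy tensors; it is illustrative only, and Show-o2's actual heads, latents, and schedules differ.

```python
# Generic flow-matching training step (not Show-o2's implementation): the model
# regresses the velocity between a noise sample and a data sample on a linear path.
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t], dim=-1))

model = TinyVelocityNet()
x1 = torch.randn(32, 16)            # "data" latents (toy stand-in)
x0 = torch.randn(32, 16)            # Gaussian noise
t = torch.rand(32, 1)               # random time in [0, 1]
x_t = (1 - t) * x0 + t * x1         # linear interpolation path
target_velocity = x1 - x0           # constant velocity along the path
loss = nn.functional.mse_loss(model(x_t, t), target_velocity)
loss.backward()
```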
🔹 Title:
Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
🔹 Publication Date: Published on Jun 16
🔹 Abstract:
Stream-Omni, a large multimodal model, integrates text, vision, and speech by efficiently aligning modalities using sequence-dimension concatenation for vision and layer-dimension mapping for speech, achieving strong performance with less data. AI-generated summary: The emergence of GPT-4o-like large multimodal models (LMMs) has raised the exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representations of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns the vision and speech to the text based on their relationships. For vision that is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech that is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimensional mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.13642
• PDF: https://arxiv.org/pdf/2506.13642
• Project Page: https://github.com/ictnlp/Stream-Omni
• Github: https://github.com/ictnlp/Stream-Omni
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
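Stream-Omni's speech-text alignment is described as CTC-based. The sketch below shows the generic CTC loss such an alignment builds on, using random tensors with assumed shapes; it is not Stream-Omni's layer-dimension mapping code.

```python
# Generic CTC speech-to-text alignment loss (illustrative shapes only).
import torch
import torch.nn as nn

vocab_size, blank_id = 32, 0
T_frames, batch, max_text_len = 50, 4, 12

# Per-frame class log-probabilities from a speech encoder head: (T, N, C).
log_probs = torch.randn(T_frames, batch, vocab_size).log_softmax(dim=-1)

# Text token targets (non-blank ids) and their lengths.
targets = torch.randint(1, vocab_size, (batch, max_text_len))
input_lengths = torch.full((batch,), T_frames, dtype=torch.long)
target_lengths = torch.randint(5, max_text_len + 1, (batch,), dtype=torch.long)

ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(float(loss))
```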
🔹 Title:
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
🔹 Publication Date: Published on Jun 13
🔹 Abstract:
LLMs perform well on implementation-heavy competitive programming problems but struggle with nuanced algorithmic reasoning, as highlighted by LiveCodeBench Pro. AI-generated summary: Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that are continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, domains where expert humans still excel. We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications. High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning. LiveCodeBench Pro thus highlights the significant gap to human grandmaster levels, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.11928
• PDF: https://arxiv.org/pdf/2506.11928
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
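The 53% pass@1 figure above uses the pass@k family of metrics. For reference, this is the standard unbiased pass@k estimator commonly used in code-generation evaluation; it is a general formula, not something specific to LiveCodeBench Pro.

```python
# Standard unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k),
# where n samples are drawn per problem and c of them pass the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated, c = samples that passed, k = evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 10 correct, report pass@1.
print(round(pass_at_k(n=200, c=10, k=1), 3))  # 0.05
```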
🔹 Title:
Mathesis: Towards Formal Theorem Proving from Natural Languages
🔹 Publication Date: Published on Jun 8
🔹 Abstract:
Recent advances in large language models show strong promise for formal reasoning. However, most LLM-based theorem provers have long been constrained by the need for expert-written formal statements as inputs, limiting their applicability to real-world problems expressed in natural language. We tackle this gap with Mathesis, the first end-to-end theorem proving pipeline processing informal problem statements. It contributes Mathesis-Autoformalizer, the first autoformalizer using reinforcement learning to enhance the formalization ability of natural language problems, aided by our novel LeanScorer framework for nuanced formalization quality assessment. It also proposes a Mathesis-Prover, which generates formal proofs from the formalized statements. To evaluate the real-world applicability of end-to-end formal theorem proving, we introduce Gaokao-Formal, a benchmark of 488 complex problems from China's national college entrance exam. Our approach is carefully designed, with a thorough study of each component. Experiments demonstrate Mathesis's effectiveness, with the autoformalizer outperforming the best baseline by 22% in pass-rate on Gaokao-Formal. The full system surpasses other model combinations, achieving 64% accuracy on MiniF2F with pass@32 and a state-of-the-art 18% on Gaokao-Formal.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.07047
• PDF: https://arxiv.org/pdf/2506.07047
• Github: https://github.com/Huawei-AI4Math/Mathesis
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer-wise Pooled Representations
🔹 Publication Date: Published on Jun 16
🔹 Abstract:
A new evaluation metric called the Alignment Quality Index (AQI) assesses the alignment of large language models by analyzing latent space activations, capturing clustering quality to detect misalignments and fake alignment, and complementing existing behavioral proxies. AI-generated summary: Alignment is no longer a luxury, it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking. To address this issue, we introduce the Alignment Quality Index (AQI). This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding-invariant tool for behavior-agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI's correlation with external judges and ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.13901
• PDF: https://arxiv.org/pdf/2506.13901
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
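AQI combines several cluster-separation indices over safe/unsafe activations. As a small illustration of the kind of measurement involved, the sketch below scores synthetic Gaussian activations standing in for pooled hidden states; it is not the paper's implementation.

```python
# Minimal sketch (not the AQI implementation): scoring how well "safe" vs
# "unsafe" activations separate in latent space with two clustering indices.
import numpy as np
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Stand-ins for pooled hidden states: 200 safe and 200 unsafe activations.
safe = rng.normal(loc=0.0, scale=1.0, size=(200, 64))
unsafe = rng.normal(loc=3.0, scale=1.0, size=(200, 64))

X = np.vstack([safe, unsafe])
labels = np.array([0] * len(safe) + [1] * len(unsafe))

dbs = davies_bouldin_score(X, labels)      # lower = better-separated clusters
chi = calinski_harabasz_score(X, labels)   # higher = better-separated clusters
print(f"Davies-Bouldin: {dbs:.3f}, Calinski-Harabasz: {chi:.1f}")
```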
🔹 Title:
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
🔹 Publication Date: Published on Jan 4, 2024
🔹 Abstract:
Co-training of supervised behavior cloning with static and mobile manipulation datasets improves the success rates of mobile manipulation tasks using a whole-body teleoperation system. AI-generated summary: Imitation learning from human demonstrations has shown impressive performance in robotics. However, most results focus on table-top manipulation, lacking the mobility and dexterity necessary for generally useful tasks. In this work, we develop a system for imitating mobile manipulation tasks that are bimanual and require whole-body control. We first present Mobile ALOHA, a low-cost and whole-body teleoperation system for data collection. It augments the ALOHA system with a mobile base, and a whole-body teleoperation interface. Using data collected with Mobile ALOHA, we then perform supervised behavior cloning and find that co-training with existing static ALOHA datasets boosts performance on mobile manipulation tasks. With 50 demonstrations for each task, co-training can increase success rates by up to 90%, allowing Mobile ALOHA to autonomously complete complex mobile manipulation tasks such as sauteing and serving a piece of shrimp, opening a two-door wall cabinet to store heavy cooking pots, calling and entering an elevator, and lightly rinsing a used pan using a kitchen faucet. Project website: https://mobile-aloha.github.io
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2401.02117
• PDF: https://arxiv.org/pdf/2401.02117
• Github: https://mobile-aloha.github.io/
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/lerobot/aloha_mobile_cabinet
• https://huggingface.co/datasets/lerobot/aloha_mobile_chair
• https://huggingface.co/datasets/lerobot/aloha_mobile_wipe_wine
• https://huggingface.co/datasets/lerobot/aloha_mobile_wash_pan
🔹 Spaces citing this paper:
• https://huggingface.co/spaces/fracapuano/remoteserver
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
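The key recipe above is co-training behavior cloning on static ALOHA data plus mobile demonstrations. A minimal sketch of one way to mix the two sources during training is shown below; the toy tensors and the 50/50 sampling weight are assumptions, not the paper's exact setup.

```python
# Illustrative co-training data mixing (not the Mobile ALOHA training code):
# sample mini-batches from static and mobile demonstration datasets with a
# fixed mixing weight despite their size imbalance.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-ins for demonstration datasets of (observation, action) pairs.
static_demos = TensorDataset(torch.randn(500, 10), torch.randn(500, 14))
mobile_demos = TensorDataset(torch.randn(50, 10), torch.randn(50, 14))
combined = ConcatDataset([static_demos, mobile_demos])

# Give the two sources equal total sampling probability.
weights = torch.cat([
    torch.full((len(static_demos),), 0.5 / len(static_demos)),
    torch.full((len(mobile_demos),), 0.5 / len(mobile_demos)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
loader = DataLoader(combined, batch_size=32, sampler=sampler)

obs, actions = next(iter(loader))
print(obs.shape, actions.shape)  # torch.Size([32, 10]) torch.Size([32, 14])
```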
🔹 Title:
RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
🔹 Publication Date: Published on Jun 18
🔹 Abstract:
RE-IMAGINE evaluates the reasoning abilities of Large Language Models by generating variations of problems that cannot be solved by memorization, indicating reliance on statistical recall. AI-generated summary: Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.15455
• PDF: https://arxiv.org/pdf/2506.15455
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
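RE-IMAGINE perturbs problems at a symbolic level so answers cannot be memorized. The toy sketch below illustrates the general idea of intervention-style variation on a templated word problem; it is a deliberately simple stand-in, not the paper's pipeline.

```python
# Toy illustration of symbolic problem variation: keep the problem as a
# template plus values, perturb the values, and recompute the gold answer.
import random

TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"

def answer(values: dict) -> int:
    # Ground-truth answer computed from the symbolic values, not memorized text.
    return values["a"] + values["b"]

def make_variation(rng: random.Random) -> tuple[str, int]:
    values = {"name": rng.choice(["Ava", "Noah", "Lin"]),
              "a": rng.randint(2, 99), "b": rng.randint(2, 99)}
    return TEMPLATE.format(**values), answer(values)

rng = random.Random(7)
for _ in range(3):
    question, gold = make_variation(rng)
    print(question, "->", gold)
```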
🔹 Title:
InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
🔹 Publication Date: Published on Jun 11
🔹 Abstract:
A novel framework for end-to-end human animation with multi-modal conditions enables high-quality video generation through explicit layout control and region-specific modality matching. AI-generated summary: End-to-end human animation with rich multi-modal conditions, e.g., text, image and audio, has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method can automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject a local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables the high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09984
• PDF: https://arxiv.org/pdf/2506.09984
• Github: https://zhenzhiwang.github.io/interacthuman/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
SOAP: Style-Omniscient Animatable Portraits
Article Date: 8 May 2025
Article Description:
Creating animatable 3D avatars from a single image remains challenging due to style limitations (realistic, cartoon, anime) and difficulties in handling accessories or hairstyles. While 3D diffusion models advance single-view reconstruction for general objects, outputs often lack animation controls or suffer from artifacts because of the domain gap. We propose SOAP, a style-omniscient framework to generate rigged, topology-consistent avatars from any portrait. Our method leverages a multiview diffusion model trained on 24K 3D heads with multiple styles and an adaptive optimization pipeline to deform the FLAME mesh while maintaining topology and rigging via differentiable rendering. The resulting textured avatars support FACS-based animation, integrate with eyeballs and teeth, and preserve details like braided hair or accessories. Extensive experiments demonstrate the superiority of our method over state-of-the-art techniques for both single-view head modeling and diffusion-based generation of Image-to-3D. Our code and data are publicly available for research purposes at https://github.com/TingtingLiao/soap.
PDF Download Link:
https://arxiv.org/pdf/2505.05022v2.pdf
GitHub:
• https://github.com/tingtingliao/soap
Datasets:
• NeRF
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
Article Date: 29 May 2025
Article Description:
Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting, version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.
PDF Download Link:
https://arxiv.org/pdf/2505.22954v1.pdf
GitHub:
• https://github.com/jennyzzt/dgm
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
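The DGM loop above alternates archive sampling, foundation-model edits, and benchmark validation. Here is a bare-bones sketch of that control flow; `propose_child` and `evaluate_on_benchmark` are hypothetical placeholders, and the parent-selection rule is a simplification of the paper's open-ended sampling.

```python
# Bare-bones sketch of an archive-based self-improvement loop (illustrative only).
import random

def propose_child(parent_code: str) -> str:
    raise NotImplementedError("ask a foundation model to modify the agent's code")

def evaluate_on_benchmark(agent_code: str) -> float:
    raise NotImplementedError("run the agent on coding tasks and return its score")

def evolve(seed_agent: str, generations: int = 100, rng=None):
    rng = rng or random.Random(0)
    archive = [{"code": seed_agent, "score": evaluate_on_benchmark(seed_agent)}]
    for _ in range(generations):
        # Sample a parent from the archive (here: biased toward higher scores).
        parent = max(rng.sample(archive, k=min(3, len(archive))), key=lambda a: a["score"])
        child_code = propose_child(parent["code"])
        child = {"code": child_code, "score": evaluate_on_benchmark(child_code)}
        archive.append(child)  # keep even weaker children: open-ended exploration
    return max(archive, key=lambda a: a["score"])
```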
🔹 Title:
AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
🔹 Publication Date: Published on Jun 12
🔹 Abstract:
AniMaker, a multi-agent framework using MCTS-Gen and AniEval, generates coherent storytelling videos from text input, outperforming existing models with better quality and efficiency. AI-generated summary: Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation's logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components: MCTS-Gen in the Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in the Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.10540
• PDF: https://arxiv.org/pdf/2506.10540
• Github: https://github.com/HITsz-TMG/Anim-Director
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting
Article Date: 5 Jun 2025
Article Description:
Neural rendering techniques, including NeRF and Gaussian Splatting (GS), rely on photometric consistency to produce high-quality reconstructions. However, in real-world scenarios, it is challenging to guarantee perfect photometric consistency in acquired images. Appearance codes have been widely used to address this issue, but their modeling capability is limited, as a single code is applied to the entire image. Recently, the bilateral grid was introduced to perform pixel-wise color mapping, but it is difficult to optimize and constrain effectively. In this paper, we propose a novel multi-scale bilateral grid that unifies appearance codes and bilateral grids. We demonstrate that this approach significantly improves geometric accuracy in dynamic, decoupled autonomous driving scene reconstruction, outperforming both appearance codes and bilateral grids. This is crucial for autonomous driving, where accurate geometry is important for obstacle avoidance and control. Our method shows strong results across four datasets: Waymo, NuScenes, Argoverse, and PandaSet. We further demonstrate that the improvement in geometry is driven by the multi-scale bilateral grid, which effectively reduces floaters caused by photometric inconsistency.
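For context, a bilateral grid is typically a coarse 3D grid over image position and a guidance intensity whose cells store affine color transforms that are "sliced" at each pixel. The NumPy sketch below shows that per-pixel slicing for a single scale with nearest-neighbour lookup; it is only a simplified illustration of the mechanism the paper builds on, not the proposed multi-scale, jointly optimized variant.

```python
import numpy as np


def slice_bilateral_grid(image: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Apply a bilateral grid of affine color transforms to an RGB image.

    image: (H, W, 3) floats in [0, 1]
    grid:  (GH, GW, GZ, 3, 4) -- each cell holds a 3x4 affine color transform;
           the GZ bins are indexed by the pixel's luminance (guidance signal).
    Uses nearest-neighbour lookup for clarity; real implementations slice
    with trilinear interpolation.
    """
    h, w, _ = image.shape
    gh, gw, gz, _, _ = grid.shape
    luminance = image.mean(axis=-1)                     # simple guidance map
    ys = np.clip(np.arange(h) * gh // h, 0, gh - 1)
    xs = np.clip(np.arange(w) * gw // w, 0, gw - 1)
    zs = np.clip((luminance * gz).astype(int), 0, gz - 1)
    out = np.empty_like(image)
    for i in range(h):
        for j in range(w):
            affine = grid[ys[i], xs[j], zs[i, j]]       # 3x4 transform for this pixel
            rgb1 = np.append(image[i, j], 1.0)          # homogeneous color vector
            out[i, j] = affine @ rgb1
    return np.clip(out, 0.0, 1.0)


if __name__ == "__main__":
    img = np.random.rand(32, 32, 3)
    identity_grid = np.tile(np.eye(3, 4), (4, 4, 8, 1, 1))   # identity transforms
    np.testing.assert_allclose(slice_bilateral_grid(img, identity_grid), img, atol=1e-6)
```

A single appearance code corresponds to collapsing this grid to one cell per image, which is why the per-pixel formulation has more modeling capacity but is harder to constrain.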
PDF Download Link:
https://arxiv.org/pdf/2506.05280v1.pdf
GitHub:
• https://github.com/bigcileng/bilateral-driving
Datasets:
• NeRF
• nuScenes
• PandaSet
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Visual Causal Scene Refinement for Video Question Answering
Article Date: 7 May 2023
Article Description:
Existing methods for video question answering (VideoQA) often suffer from spurious correlations between different modalities, leading to a failure in identifying the dominant visual evidence and the intended question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. In this paper, to discover critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). Particularly, a set of causal front-door intervention operations is introduced to explicitly find the visual causal scenes at both segment and frame levels. Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scene and achieving robust video question answering. The code is available at https://github.com/YangLiu9208/VCSR.
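As a generic illustration of question-guided refinement, the sketch below attention-pools the frames of a segment using the question embedding as the query, so question-relevant frames dominate the resulting segment feature. This is a common pattern assumed here for illustration, not the actual QGR module.

```python
import numpy as np


def question_guided_pooling(frame_feats: np.ndarray,
                            question_feat: np.ndarray,
                            temperature: float = 1.0) -> np.ndarray:
    """Attention-pool the frames of one segment using the question as query.

    frame_feats:   (T, D) features of T consecutive frames
    question_feat: (D,)   embedding of the question
    Returns a single (D,) segment feature weighted toward question-relevant frames.
    """
    scores = frame_feats @ question_feat / (np.sqrt(frame_feats.shape[1]) * temperature)
    scores -= scores.max()                          # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum() # softmax over frames
    return weights @ frame_feats


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(16, 256))             # 16 frames, 256-d features
    question = rng.normal(size=256)
    print(question_guided_pooling(frames, question).shape)   # (256,)
```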
PDF Download Link:
https://arxiv.org/pdf/2305.04224v2.pdf
GitHub:
• https://github.com/yangliu9208/vcsr
• https://github.com/hcplab-sysu/causal-vlreasoning
Datasets:
• NExT-QA
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention
Article Date: 23 May 2025
Article Description:
Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, substantially reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared to previous methods with heterogeneous representations in 3D VAE, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024 resolution using only 8 GPUs, a task typically requiring at least 32 GPUs for volumetric representations at 256 resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://www.neural4d.com/research/direct3d-s2.
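The efficiency argument rests on attending only over tokens from occupied cells of the sparse volume instead of the full dense grid. The NumPy sketch below shows that generic gather-then-attend pattern for a single head; the paper's Spatial Sparse Attention additionally exploits spatial structure and optimized kernels, so treat this purely as an illustration of why sparsity helps.

```python
import numpy as np


def sparse_voxel_attention(features: np.ndarray,
                           occupancy: np.ndarray,
                           wq: np.ndarray, wk: np.ndarray, wv: np.ndarray) -> np.ndarray:
    """Single-head self-attention over the occupied voxels of a sparse volume.

    features:   (X, Y, Z, D) dense feature grid (mostly empty)
    occupancy:  (X, Y, Z)    boolean mask of occupied voxels
    wq, wk, wv: (D, D)       projection matrices
    Only the N occupied voxels attend to each other, so cost scales with N^2
    instead of (X*Y*Z)^2.
    """
    tokens = features[occupancy]                     # (N, D) gather active voxels
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(tokens.shape[1])
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = np.zeros_like(features)
    out[occupancy] = attn @ v                        # scatter back to the grid
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(16, 16, 16, 32))
    occ = rng.random((16, 16, 16)) < 0.05            # roughly 5% of voxels occupied
    w = [rng.normal(size=(32, 32)) * 0.1 for _ in range(3)]
    print(sparse_voxel_attention(feats, occ, *w).shape)
```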
PDF Download Link:
https://arxiv.org/pdf/2505.17412v2.pdf
GitHub:
• https://github.com/DreamTechAI/Direct3D-S2
Datasets:
• ShapeNet
• Objaverse
• Objaverse-XL
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Sparsified State-Space Models are Efficient Highway Networks
🔹 Publication Date: Published on May 27
🔹 Abstract:
Simba, a hierarchical sparsification method for state-space models, enhances efficiency and information flow in natural language tasks by pruning tokens more aggressively in upper layers. AI-generated summary: State-space models (SSMs) offer a promising architecture for sequence modeling, providing an alternative to Transformers by replacing expensive self-attention with linear recurrences. In this paper, we propose a simple yet effective trick to enhance SSMs within given computational budgets by sparsifying them. Our intuition is that tokens in SSMs are highly redundant due to gradual recurrent updates, and dense recurrence operations block the delivery of past information. In particular, we observe that upper layers of SSMs tend to be more redundant as they encode global information, while lower layers encode local information. Motivated by this, we introduce Simba, a hierarchical sparsification method for SSMs based on token pruning. Simba sparsifies upper layers more than lower layers, encouraging the upper layers to behave like highways. To achieve this, we propose a novel token pruning criterion for SSMs, measuring the global impact of tokens on the final output by accumulating local recurrences. We demonstrate that Simba outperforms the baseline model, Mamba, with the same FLOPS in various natural language tasks. Moreover, we illustrate the effect of highways, showing that Simba not only enhances efficiency but also improves the information flow across long sequences. Code is available at https://github.com/woominsong/Simba.
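To make the pruning schedule concrete, the sketch below keeps a shrinking fraction of tokens at each layer (pruning upper layers harder) and ranks tokens with a placeholder importance score built from a decaying accumulation of hidden states. Both the score and the layer update are illustrative stand-ins, not the paper's actual criterion or an SSM implementation.

```python
import numpy as np


def token_importance(hidden: np.ndarray, decay: float = 0.9) -> np.ndarray:
    """Placeholder importance score: combine each token's own magnitude with
    the magnitude of a decaying accumulation of preceding tokens."""
    t, d = hidden.shape
    acc = np.zeros(d)
    scores = np.empty(t)
    for i in range(t):
        acc = decay * acc + hidden[i]
        scores[i] = np.linalg.norm(hidden[i]) + 0.1 * np.linalg.norm(acc)
    return scores


def hierarchical_prune(hidden: np.ndarray, num_layers: int,
                       min_keep: float = 0.25) -> np.ndarray:
    """Run a stack of placeholder layers; each layer keeps a fraction of its
    incoming tokens that shrinks linearly from 1.0 (bottom) to min_keep (top)."""
    for layer in range(num_layers):
        keep_frac = 1.0 - (1.0 - min_keep) * layer / max(num_layers - 1, 1)
        k = max(1, int(round(keep_frac * hidden.shape[0])))
        idx = np.sort(np.argsort(token_importance(hidden))[-k:])  # keep temporal order
        hidden = np.tanh(hidden[idx])      # placeholder for the real SSM layer update
    return hidden


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(128, 64))         # 128 tokens, 64-d features
    print(hierarchical_prune(x, num_layers=6).shape)
```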
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2505.20698
• PDF: https://arxiv.org/pdf/2505.20698
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT