Data Science | Machine Learning with Python for Researchers
Admin: @HusseinSheikho

The Data Science and Python channel is for researchers and advanced programmers

Buy ads: https://telega.io/c/dataScienceT
🔹 Title:
Mathesis: Towards Formal Theorem Proving from Natural Languages

🔹 Publication Date: Published on Jun 8

🔹 Abstract:
Recent advances in large language models show strong promise for formal reasoning. However, most LLM-based theorem provers have long been constrained by the need for expert-written formal statements as inputs, limiting their applicability to real-world problems expressed in natural language. We tackle this gap with Mathesis, the first end-to-end theorem proving pipeline processing informal problem statements. It contributes Mathesis-Autoformalizer, the first autoformalizer using reinforcement learning to enhance the formalization ability of natural language problems, aided by our novel LeanScorer framework for nuanced formalization quality assessment. It also proposes a Mathesis-Prover, which generates formal proofs from the formalized statements. To evaluate the real-world applicability of end-to-end formal theorem proving, we introduce Gaokao-Formal, a benchmark of 488 complex problems from China's national college entrance exam. Our approach is carefully designed, with a thorough study of each component. Experiments demonstrate Mathesis's effectiveness, with the autoformalizer outperforming the best baseline by 22% in pass-rate on Gaokao-Formal. The full system surpasses other model combinations, achieving 64% accuracy on MiniF2F with pass@32 and a state-of-the-art 18% on Gaokao-Formal.
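
The abstract describes an end-to-end pipeline: an autoformalizer turns the informal statement into a Lean statement, then a prover samples candidate proofs that are checked formally (pass@n). Below is a minimal sketch of that two-stage structure; the model calls and the Lean checker are hypothetical stubs, not the released Mathesis components.

```python
# Minimal autoformalize-then-prove pipeline mirroring the structure described
# in the abstract. The LLM call and Lean checker are placeholder stubs.

def call_llm(prompt: str) -> str:
    return "sorry"  # stub; replace with a real model call


def lean_check(statement: str, proof: str) -> bool:
    return proof != "sorry"  # stub; replace with a real Lean kernel check


def autoformalize(informal_statement: str) -> str:
    """Stage 1: translate a natural-language problem into a Lean theorem statement."""
    prompt = f"Translate into a Lean 4 theorem statement:\n{informal_statement}"
    return call_llm(prompt)


def prove(formal_statement: str, n_attempts: int = 32) -> str | None:
    """Stage 2: sample candidate proofs, return the first one the checker accepts (pass@n)."""
    for _ in range(n_attempts):
        candidate = call_llm(f"Complete the proof:\n{formal_statement}")
        if lean_check(formal_statement, candidate):
            return candidate
    return None


if __name__ == "__main__":
    stmt = autoformalize("Prove that the sum of two even integers is even.")
    print(prove(stmt))
```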

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.07047
• PDF: https://arxiv.org/pdf/2506.07047
• Github: https://github.com/Huawei-AI4Math/Mathesis

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer-Wise Pooled Representations

🔹 Publication Date: Published on Jun 16

🔹 Abstract:
A new evaluation metric called Alignment Quality Index (AQI) assesses the alignment of large language models by analyzing latent-space activations, capturing clustering quality to detect misalignments and fake alignment, and complementing existing behavioral proxies. AI-generated summary: Alignment is no longer a luxury; it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking. To address this issue, we introduce the Alignment Quality Index (AQI). This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding-invariant tool for behavior-agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI's correlation with external judges and its ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
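
The metric is built from standard cluster-validity indices computed over pooled safe vs. unsafe activations. Here is a minimal sketch using only the two indices available in scikit-learn (Davies-Bouldin and Calinski-Harabasz); the combination into a single score is an illustrative choice, not the paper's exact AQI formulation.

```python
# Illustrative cluster-separation score over pooled hidden activations.
# The paper additionally combines the Dunn and Xie-Beni indices, and its
# exact aggregation differs from this simple ratio.
import numpy as np
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score


def alignment_separation_score(activations: np.ndarray, labels: np.ndarray) -> float:
    """activations: (n_prompts, hidden_dim) pooled representations.
    labels: 0 for safe prompts, 1 for unsafe prompts."""
    dbs = davies_bouldin_score(activations, labels)      # lower = better separated
    chi = calinski_harabasz_score(activations, labels)   # higher = better separated
    # Simple monotone combination: larger when clusters are well separated.
    return chi / (1.0 + dbs)


rng = np.random.default_rng(0)
safe = rng.normal(0.0, 1.0, size=(200, 64))
unsafe = rng.normal(3.0, 1.0, size=(200, 64))
X = np.vstack([safe, unsafe])
y = np.array([0] * 200 + [1] * 200)
print(alignment_separation_score(X, y))
```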

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.13901
• PDF: https://arxiv.org/pdf/2506.13901

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

🔹 Publication Date: Published on Jan 4, 2024

🔹 Abstract:
Co-training of supervised behavior cloning with static and mobile manipulation datasets improves the success rates of mobile manipulation tasks using a whole-body teleoperation system. AI-generated summary: Imitation learning from human demonstrations has shown impressive performance in robotics. However, most results focus on table-top manipulation, lacking the mobility and dexterity necessary for generally useful tasks. In this work, we develop a system for imitating mobile manipulation tasks that are bimanual and require whole-body control. We first present Mobile ALOHA, a low-cost, whole-body teleoperation system for data collection. It augments the ALOHA system with a mobile base and a whole-body teleoperation interface. Using data collected with Mobile ALOHA, we then perform supervised behavior cloning and find that co-training with existing static ALOHA datasets boosts performance on mobile manipulation tasks. With 50 demonstrations for each task, co-training can increase success rates by up to 90%, allowing Mobile ALOHA to autonomously complete complex mobile manipulation tasks such as sauteing and serving a piece of shrimp, opening a two-door wall cabinet to store heavy cooking pots, calling and entering an elevator, and lightly rinsing a used pan using a kitchen faucet. Project website: https://mobile-aloha.github.io
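
The key recipe is simple: mix batches from the existing static ALOHA data and the new mobile teleoperation data, and train the policy with a plain supervised behavior-cloning loss. Below is a minimal sketch of that co-training loop; the datasets, mixing schedule, and policy network are illustrative placeholders.

```python
# Sketch of co-training: alternate batches between static and mobile datasets
# and fit the policy with a supervised behavior-cloning (regression) loss.
from itertools import cycle

import torch
from torch.utils.data import TensorDataset, DataLoader

obs_dim, act_dim = 32, 14
static_ds = TensorDataset(torch.randn(1000, obs_dim), torch.randn(1000, act_dim))
mobile_ds = TensorDataset(torch.randn(100, obs_dim), torch.randn(100, act_dim))
static_loader = cycle(DataLoader(static_ds, batch_size=32, shuffle=True))
mobile_loader = cycle(DataLoader(mobile_ds, batch_size=32, shuffle=True))

policy = torch.nn.Sequential(torch.nn.Linear(obs_dim, 256), torch.nn.ReLU(),
                             torch.nn.Linear(256, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

for step in range(10):
    # Roughly 50/50 mixing between the two data sources, one source per step.
    loader = mobile_loader if step % 2 == 0 else static_loader
    obs, act = next(loader)
    loss = torch.nn.functional.mse_loss(policy(obs), act)  # BC regression loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```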

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2401.02117
• PDF: https://arxiv.org/pdf/2401.02117
• Github: https://mobile-aloha.github.io/

🔹 Datasets citing this paper:
https://huggingface.co/datasets/lerobot/aloha_mobile_cabinet
https://huggingface.co/datasets/lerobot/aloha_mobile_chair
https://huggingface.co/datasets/lerobot/aloha_mobile_wipe_wine
https://huggingface.co/datasets/lerobot/aloha_mobile_wash_pan

🔹 Spaces citing this paper:
https://huggingface.co/spaces/fracapuano/remoteserver
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation

🔹 Publication Date: Published on Jun 18

🔹 Abstract:
RE-IMAGINE evaluates the reasoning abilities of Large Language Models by generating variations of problems that cannot be solved by memorization, indicating reliance on statistical recall. AI-generated summary: Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
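
The core idea is to represent each problem symbolically and intervene on that representation so answers cannot be retrieved from memory. Here is a toy sketch of such variation generation on a single arithmetic template; the template and mutation rules are illustrative, not the paper's pipeline.

```python
# Toy illustration of symbolic problem variation: a template plus operands,
# with interventions on the operands so the answer cannot be memorized.
import random

TEMPLATE = "A shop sells {n} apples at {p} dollars each. What is the total cost?"


def ground_truth(n: int, p: int) -> int:
    return n * p


def make_variations(k: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    problems = []
    for _ in range(k):
        n, p = rng.randint(2, 50), rng.randint(2, 20)  # intervene on operands
        problems.append({"question": TEMPLATE.format(n=n, p=p),
                         "answer": ground_truth(n, p)})
    return problems


for item in make_variations(3):
    print(item)
```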

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.15455
• PDF: https://arxiv.org/pdf/2506.15455

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

🔹 Publication Date: Published on Jun 11

🔹 Abstract:
A novel framework for end-to-end human animation with multi-modal conditions enables high-quality video generation through explicit layout control and region-specific modality matching. AI-generated summary: End-to-end human animation with rich multi-modal conditions (e.g., text, image, and audio) has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method automatically infers layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject the local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.
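
The central mechanism is region-specific binding: each identity's condition (e.g., its audio embedding) is injected only inside that identity's predicted spatial mask rather than globally. Below is a minimal tensor-level sketch of that idea; the shapes and the additive injection are simplifications of the paper's design.

```python
# Sketch of region-specific condition injection: each identity's audio
# embedding is added only inside that identity's layout mask.
import torch

B, C, H, W = 1, 64, 32, 32
num_ids = 2
video_feat = torch.randn(B, C, H, W)                          # denoised video features
audio_emb = torch.randn(B, num_ids, C)                        # one embedding per identity
masks = torch.softmax(torch.randn(B, num_ids, H, W), dim=1)   # per-identity layout masks


def inject_local_audio(video_feat, audio_emb, masks):
    out = video_feat
    for i in range(audio_emb.shape[1]):
        cond = audio_emb[:, i].unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        out = out + masks[:, i:i + 1] * cond                  # add only inside mask i
    return out


print(inject_local_audio(video_feat, audio_emb, masks).shape)  # torch.Size([1, 64, 32, 32])
```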

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09984
• PDF: https://arxiv.org/pdf/2506.09984
• Github: https://zhenzhiwang.github.io/interacthuman/

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
SOAP: Style-Omniscient Animatable Portraits

Article Date: 8 May 2025

Article Description:
Creating animatable 3D avatars from a single image remains challenging due to style limitations (realistic, cartoon, anime) and difficulties in handling accessories or hairstyles. While 3D diffusion models advance single-view reconstruction for general objects, outputs often lack animation controls or suffer from artifacts because of the domain gap. We propose SOAP, a style-omniscient framework to generate rigged, topology-consistent avatars from any portrait. Our method leverages a multiview diffusion model trained on 24K 3D heads with multiple styles and an adaptive optimization pipeline to deform the FLAME mesh while maintaining topology and rigging via differentiable rendering. The resulting textured avatars support FACS-based animation, integrate with eyeballs and teeth, and preserve details like braided hair or accessories. Extensive experiments demonstrate the superiority of our method over state-of-the-art techniques for both single-view head modeling and diffusion-based image-to-3D generation. Our code and data are publicly available for research purposes at https://github.com/TingtingLiao/soap.

PDF Download Link:
https://arxiv.org/pdf/2505.05022v2.pdf

GitHub:
https://github.com/tingtingliao/soap

Datasets:
• NeRF
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

Article Date: 29 May 2025

Article Description:
Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.
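
The loop itself is easy to state: keep an archive of agents, sample a parent, let a foundation model propose a modified child, evaluate it empirically on coding benchmarks, and add it to the archive. Here is a skeleton of that loop with stubbed proposal and evaluation functions standing in for the real foundation-model call and sandboxed benchmark runs.

```python
# Skeleton of the archive-driven self-improvement loop described above.
# The proposal and evaluation functions are stubs.
import random


def propose_child(parent_code: str) -> str:
    return parent_code + "\n# (foundation-model-proposed modification would go here)"  # stub


def evaluate(agent_code: str) -> float:
    return random.random()  # stub: fraction of benchmark tasks solved


archive = [{"code": "# seed coding agent", "score": evaluate("# seed coding agent")}]

for step in range(20):
    parent = random.choice(archive)             # open-ended: sample anywhere in the archive
    child_code = propose_child(parent["code"])  # self-modification proposed by a foundation model
    child_score = evaluate(child_code)          # empirical validation replaces formal proof
    archive.append({"code": child_code, "score": child_score})  # archive keeps diverse agents

best = max(archive, key=lambda a: a["score"])
print(f"archive size={len(archive)}, best score={best['score']:.2f}")
```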

PDF Download Link:
https://arxiv.org/pdf/2505.22954v1.pdf

GitHub:
https://github.com/jennyzzt/dgm

Datasets:
• No datasets information available
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation

🔹 Publication Date: Published on Jun 12

🔹 Abstract:
AniMaker, a multi-agent framework using MCTS-Gen and AniEval, generates coherent storytelling videos from text input, outperforming existing models with better quality and efficiency. AI-generated summary: Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation's logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components: MCTS-Gen in the Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in the Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
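
A much-simplified view of the clip-selection step: for each shot, generate several candidate clips and keep the one that scores best when judged together with the previously selected clip. The generator and scorer below are stubs, and this greedy best-of-N stands in for the paper's MCTS-inspired MCTS-Gen search and AniEval scoring.

```python
# Simplified multi-candidate clip generation with context-aware selection.
import random


def generate_clip(shot_prompt: str, seed: int) -> str:
    return f"clip({shot_prompt}, seed={seed})"   # stub video generator


def score_in_context(prev_clip: str | None, clip: str) -> float:
    return random.random()                       # stub: consistency + quality score


def make_animation(storyboard: list[str], n_candidates: int = 4) -> list[str]:
    selected: list[str] = []
    for shot in storyboard:
        candidates = [generate_clip(shot, s) for s in range(n_candidates)]
        prev = selected[-1] if selected else None
        # Keep the candidate that looks best given the preceding clip.
        selected.append(max(candidates, key=lambda c: score_in_context(prev, c)))
    return selected


print(make_animation(["hero wakes up", "hero leaves the house", "hero meets a dragon"]))
```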

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.10540
• PDF: https://arxiv.org/pdf/2506.10540
• Github: https://github.com/HITsz-TMG/Anim-Director

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting

Article Date: 5 Jun 2025

Article Description:
Neural rendering techniques, including NeRF and Gaussian Splatting (GS), rely on photometric consistency to produce high-quality reconstructions. However, in real-world scenarios, it is challenging to guarantee perfect photometric consistency in acquired images. Appearance codes have been widely used to address this issue, but their modeling capability is limited, as a single code is applied to the entire image. Recently, the bilateral grid was introduced to perform pixel-wise color mapping, but it is difficult to optimize and constrain effectively. In this paper, we propose a novel multi-scale bilateral grid that unifies appearance codes and bilateral grids. We demonstrate that this approach significantly improves geometric accuracy in dynamic, decoupled autonomous driving scene reconstruction, outperforming both appearance codes and bilateral grids. This is crucial for autonomous driving, where accurate geometry is important for obstacle avoidance and control. Our method shows strong results across four datasets: Waymo, NuScenes, Argoverse, and PandaSet. We further demonstrate that the improvement in geometry is driven by the multi-scale bilateral grid, which effectively reduces floaters caused by photometric inconsistency.
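
For intuition, a bilateral grid performs pixel-wise color mapping by "slicing" a small 3D grid of corrections with each pixel's position and luminance. Below is a minimal single-scale sketch of that slicing step in PyTorch; the multi-scale design and the Gaussian-splatting training loop from the paper are omitted.

```python
# Minimal bilateral-grid slicing: a learnable grid of RGB offsets is sampled
# per pixel by (x, y, luminance), so different pixels get different corrections.
import torch
import torch.nn.functional as F

B, H, W = 1, 64, 64
image = torch.rand(B, 3, H, W)

# Grid of RGB offsets: (B, 3 color channels, 8 luminance bins, 16 x 16 spatial cells).
grid = torch.zeros(B, 3, 8, 16, 16, requires_grad=True)


def slice_bilateral_grid(image, grid):
    lum = image.mean(dim=1, keepdim=True)                      # (B, 1, H, W), guide in [0, 1]
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    xs, ys = xs.expand(B, H, W), ys.expand(B, H, W)
    z = lum.squeeze(1) * 2 - 1                                 # luminance -> [-1, 1]
    coords = torch.stack([xs, ys, z], dim=-1).unsqueeze(1)     # (B, 1, H, W, 3) = (x, y, z)
    offsets = F.grid_sample(grid, coords, align_corners=True)  # (B, 3, 1, H, W)
    return image + offsets.squeeze(2)


out = slice_bilateral_grid(image, grid)
print(out.shape)  # torch.Size([1, 3, 64, 64])
```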

PDF Download Link:
https://arxiv.org/pdf/2506.05280v1.pdf

GitHub:
https://github.com/bigcileng/bilateral-driving

Datasets:
• NeRF
• nuScenes
• PandaSet
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
Visual Causal Scene Refinement for Video Question Answering

Article Date: 7 May 2023

Article Description:
Existing methods for video question answering (VideoQA) often suffer from spurious correlations between different modalities, leading to a failure in identifying the dominant visual evidence and the intended question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. In this paper, to discover critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). Particularly, a set of causal front-door intervention operations is introduced to explicitly find the visual causal scenes at both segment and frame levels. Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering. The code is available at https://github.com/YangLiu9208/VCSR.

PDF Download Link:
https://arxiv.org/pdf/2305.04224v2.pdf

GitHub:
https://github.com/yangliu9208/vcsr
https://github.com/hcplab-sysu/causal-vlreasoning

Datasets:
• NExT-QA
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

Article Date: 23 May 2025

Article Description:
Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, substantially reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared to previous methods with heterogeneous representations in 3D VAEs, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024 resolution using only 8 GPUs, a task typically requiring at least 32 GPUs for volumetric representations at 256 resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://www.neural4d.com/research/direct3d-s2.
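
The efficiency gain comes from attending only over occupied voxels instead of the dense grid. Here is a minimal sketch of that gather-attend-scatter pattern using a stock attention layer; the paper's SSA is a custom, far more structured kernel inside a Diffusion Transformer.

```python
# Sketch of sparse attention over occupied voxels: gather tokens, attend,
# scatter the results back into the dense grid.
import torch
import torch.nn as nn

D = H = W = 32
dim, n_heads = 64, 8
occupancy = torch.rand(D, H, W) > 0.97           # ~3% of voxels are occupied
dense_feats = torch.randn(D, H, W, dim)

coords = occupancy.nonzero(as_tuple=False)       # (n_occ, 3); could feed positional encodings
tokens = dense_feats[occupancy]                  # (n_occ, dim) gathered sparse tokens

attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
out, _ = attn(tokens.unsqueeze(0), tokens.unsqueeze(0), tokens.unsqueeze(0))

# Scatter the attended features back into the dense volume.
updated = dense_feats.clone()
updated[occupancy] = out.squeeze(0)
print(tokens.shape, updated.shape)
```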

PDF Download Link:
https://arxiv.org/pdf/2505.17412v2.pdf

GitHub:
https://github.com/DreamTechAI/Direct3D-S2

Datasets:
• ShapeNet
• Objaverse
• Objaverse-XL
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
Sparsified State-Space Models are Efficient Highway Networks

🔹 Publication Date: Published on May 27

🔹 Abstract:
Simba, a hierarchical sparsification method for state-space models, enhances efficiency and information flow in natural language tasks by pruning tokens more aggressively in upper layers. AI-generated summary: State-space models (SSMs) offer a promising architecture for sequence modeling, providing an alternative to Transformers by replacing expensive self-attention with linear recurrences. In this paper, we propose a simple yet effective trick to enhance SSMs within given computational budgets by sparsifying them. Our intuition is that tokens in SSMs are highly redundant due to gradual recurrent updates, and dense recurrence operations block the delivery of past information. In particular, we observe that upper layers of SSMs tend to be more redundant as they encode global information, while lower layers encode local information. Motivated by this, we introduce Simba, a hierarchical sparsification method for SSMs based on token pruning. Simba sparsifies upper layers more than lower layers, encouraging the upper layers to behave like highways. To achieve this, we propose a novel token pruning criterion for SSMs, measuring the global impact of tokens on the final output by accumulating local recurrences. We demonstrate that Simba outperforms the baseline model, Mamba, with the same FLOPS in various natural language tasks. Moreover, we illustrate the effect of highways, showing that Simba not only enhances efficiency but also improves the information flow across long sequences. Code is available at https://github.com/woominsong/Simba.
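
The pruning mechanism reduces to two ingredients: a per-token importance score and a keep ratio that shrinks toward the upper layers. Below is a minimal sketch of that schedule and top-k selection; the importance score here is a stand-in for the paper's accumulated-recurrence criterion.

```python
# Sketch of hierarchical token pruning: upper layers keep fewer tokens,
# selected by a per-token importance score.
import torch


def keep_ratio(layer: int, n_layers: int, min_ratio: float = 0.25) -> float:
    # Prune linearly more aggressively toward the upper layers.
    return 1.0 - (1.0 - min_ratio) * layer / max(n_layers - 1, 1)


def prune_tokens(x: torch.Tensor, scores: torch.Tensor, ratio: float) -> torch.Tensor:
    """x: (batch, seq, dim); scores: (batch, seq) importance per token."""
    k = max(1, int(x.shape[1] * ratio))
    kept = scores.topk(k, dim=1).indices.sort(dim=1).values   # keep original order
    return torch.gather(x, 1, kept.unsqueeze(-1).expand(-1, -1, x.shape[-1]))


x = torch.randn(2, 128, 64)
n_layers = 4
for layer in range(n_layers):
    scores = x.norm(dim=-1)            # stand-in for the accumulated-recurrence criterion
    x = prune_tokens(x, scores, keep_ratio(layer, n_layers))
    print(f"layer {layer}: {x.shape[1]} tokens kept")
```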

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2505.20698
• PDF: https://arxiv.org/pdf/2505.20698

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
LeVo: High-Quality Song Generation with Multi-Preference Alignment

Article Date: 9 Jun 2025

Article Description:
Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in sound quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, an LM-based framework consisting of LeLM and a music codec. LeLM is capable of parallelly modeling two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction following, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and DPO post-training. Experimental results demonstrate that LeVo consistently outperforms existing methods on both objective and subjective metrics. Ablation studies further justify the effectiveness of our designs. Audio examples are available at https://levo-demo.github.io/. Code is released at https://github.com/tencent-ailab/songgeneration.
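
The multi-preference alignment step builds on Direct Preference Optimization. Here is a minimal sketch of the standard DPO loss on (chosen, rejected) pairs of sequence log-probabilities; LeVo applies this across several preference dimensions with its own semi-automatic data construction, which is not shown.

```python
# Minimal DPO loss on preference pairs of sequence log-probabilities.
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Each argument: (batch,) summed log-probs of the chosen (w) / rejected (l) sample."""
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()


# Toy numbers: the policy already prefers the chosen sample slightly.
pw, pl = torch.tensor([-95.0]), torch.tensor([-102.0])
rw, rl = torch.tensor([-100.0]), torch.tensor([-100.0])
print(dpo_loss(pw, pl, rw, rl))
```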

PDF Download Link:
https://arxiv.org/pdf/2506.07520v2.pdf

GitHub:
https://github.com/tencent-ailab/songgeneration

Datasets:
• 100style
==================================

For more data science resources:

https://t.iss.one/DataScienceT
This channel is for programmers, coders, and software engineers.

0️⃣ Python
1️⃣ Data Science
2️⃣ Machine Learning
3️⃣ Data Visualization
4️⃣ Artificial Intelligence
5️⃣ Data Analysis
6️⃣ Statistics
7️⃣ Deep Learning
8️⃣ Programming Languages

https://t.iss.one/addlist/8_rRW2scgfRhOTc0

https://t.iss.one/Codeprogrammer
🔹 Title:
MoCha: Towards Movie-Grade Talking Character Synthesis

🔹 Publication Date: Published on Mar 30

🔹 Abstract:
MoCha generates realistic talking character animations from speech and text using a speech-video attention mechanism and joint training on speech-labeled and text-labeled data, enabling multi-character conversations and superior realism. AI-generated summary: Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task that generates talking character animations directly from speech and text. Unlike talking head synthesis, Talking Characters aims to generate the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability, and generalization.
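
The synchronization idea is a windowed cross-attention: each video token may only attend to audio tokens in a local temporal window around its own timestep. Below is a minimal sketch of building such a mask; the window size, token rates, and boolean-mask convention are illustrative choices, not the paper's exact mechanism.

```python
# Sketch of a speech-video window attention mask: each video token attends
# only to audio tokens near its own position in time.
import torch


def window_attention_mask(n_video: int, n_audio: int, window: int = 3) -> torch.Tensor:
    """Returns an (n_video, n_audio) bool mask; True = attention allowed."""
    # Map each video token to its nearest audio position (audio is usually denser).
    centers = torch.linspace(0, n_audio - 1, n_video).round().long()
    audio_pos = torch.arange(n_audio)
    return (audio_pos.unsqueeze(0) - centers.unsqueeze(1)).abs() <= window


mask = window_attention_mask(n_video=16, n_audio=64, window=4)
print(mask.shape, mask.sum(dim=1))  # each video token sees ~9 nearby audio tokens
```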

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2503.23307
• PDF: https://arxiv.org/pdf/2503.23307
• Project Page: https://congwei1230.github.io/MoCha/
• Github: https://github.com/congwei1230/MoChaBench

🔹 Datasets citing this paper:
https://huggingface.co/datasets/CongWei1230/MoCha-Generation-on-MoChaBench-Visualizer
https://huggingface.co/datasets/CongWei1230/MoChaBench

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution

🔹 Publication Date: Published on Jun 24

🔹 Abstract:
Researchers propose design principles for cascaded video super-resolution models to improve high-resolution video generation, introducing degradation strategies, timestep sampling, noise augmentation, and interleaved temporal units with sparse local attention. AI-generated summary: Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach involves decoupling the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we focus on studying key design principles for the latter cascaded VSR models, which are currently underexplored. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies and (2) noise augmentation effects on low-resolution (LR) inputs. These findings directly inform our architectural and training innovations. Finally, we introduce an interleaving temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.
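
One of the concrete ingredients is noise augmentation on the low-resolution conditioning input, so the VSR model trains on conditions that resemble the imperfect outputs of the upstream base generator. Here is a minimal sketch of building such a training condition; the degradation and noise range are illustrative, not the paper's calibrated settings.

```python
# Sketch of building a noise-augmented LR condition from an HR training frame.
import torch
import torch.nn.functional as F


def prepare_lr_condition(hr_frames: torch.Tensor, scale: int = 4,
                         max_noise: float = 0.1) -> torch.Tensor:
    """hr_frames: (B, C, H, W) ground-truth frames used to build training pairs."""
    lr = F.interpolate(hr_frames, scale_factor=1 / scale, mode="bilinear",
                       align_corners=False)                    # degrade to LR
    lr_up = F.interpolate(lr, size=hr_frames.shape[-2:], mode="bilinear",
                          align_corners=False)                 # back to HR size as condition
    sigma = torch.rand(hr_frames.shape[0], 1, 1, 1) * max_noise
    return lr_up + sigma * torch.randn_like(lr_up)             # noise-augmented condition


cond = prepare_lr_condition(torch.rand(2, 3, 256, 256))
print(cond.shape)  # torch.Size([2, 3, 256, 256])
```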

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.19838
• PDF: https://arxiv.org/pdf/2506.19838

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Top 50 LLM Interview Questions!

A comprehensive resource that covers traditional ML basics, model architectures, real-world case studies, and theoretical foundations.

👇👇👇👇👇👇

✉️ Our Telegram channels: https://t.iss.one/addlist/0f6vfFbEMdAwODBk

📱 Our WhatsApp channel: https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
🔹 Title:
Is a PET all you need? A multi-modal study for Alzheimer's disease using 3D CNNs

🔹 Publication Date: Published on Jul 5, 2022

🔹 Abstract:
A systematic evaluation of multi-modal deep neural networks for Alzheimer's disease diagnosis shows that FDG-PET performs better than sMRI and that multi-modal fusion does not improve accuracy. AI-generated summary: Alzheimer's Disease (AD) is the most common form of dementia and is often difficult to diagnose due to the multifactorial etiology of dementia. Recent works on neuroimaging-based computer-aided diagnosis with deep neural networks (DNNs) showed that fusing structural magnetic resonance images (sMRI) and fluorodeoxyglucose positron emission tomography (FDG-PET) leads to improved accuracy in a study population of healthy controls and subjects with AD. However, this result conflicts with the established clinical knowledge that FDG-PET better captures AD-specific pathologies than sMRI. Therefore, we propose a framework for the systematic evaluation of multi-modal DNNs and critically re-evaluate single- and multi-modal DNNs based on FDG-PET and sMRI for binary healthy vs. AD classification and three-way healthy / mild cognitive impairment / AD classification. Our experiments demonstrate that a single-modality network using FDG-PET performs better than one using MRI (accuracy 0.91 vs. 0.87) and does not show improvement when the two are combined. This conforms with the established clinical knowledge on AD biomarkers, but raises questions about the true benefit of multi-modal DNNs. We argue that future work on multi-modal fusion should systematically assess the contribution of individual modalities following our proposed evaluation framework. Finally, we encourage the community to go beyond healthy vs. AD classification and focus on the differential diagnosis of dementia, where fusing multi-modal image information conforms with a clinical need.
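
The compared models are single-modality 3D CNN classifiers (one trained on FDG-PET volumes, one on sMRI), with multi-modal variants fusing their features. Below is a minimal sketch of such a single-modality backbone; the architecture and input size are illustrative, not the paper's exact network.

```python
# Minimal single-modality 3D CNN classifier for volumetric brain scans.
import torch
import torch.nn as nn


class Simple3DCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        # volume: (B, 1, D, H, W) -- an FDG-PET or sMRI scan.
        return self.classifier(self.features(volume).flatten(1))


model = Simple3DCNN()
print(model(torch.randn(2, 1, 64, 64, 64)).shape)  # torch.Size([2, 2])
```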

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2207.02094
• PDF: https://arxiv.org/pdf/2207.02094

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT