Data Science | Machine Learning with Python for Researchers

The Data Science and Python channel is for researchers and advanced programmers.

Buy ads: https://telega.io/c/dataScienceT
πŸ”Ή Title:
MOSPA: Human Motion Generation Driven by Spatial Audio

πŸ”Ή Publication Date: Published on Jul 16

πŸ”Ή Abstract:
AI-generated summary: A diffusion-based generative framework, MOSPA, is introduced to model human motion in response to spatial audio, achieving state-of-the-art performance using the newly created SAM dataset.

Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA could generate diverse realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our model and dataset will be open-sourced upon acceptance. Please refer to our supplementary video for more details.
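Since the MOSPA code is not yet released, here is a minimal PyTorch sketch of the general idea the abstract describes: a diffusion denoiser whose inputs fuse noisy motion frames with per-frame spatial-audio features. All module names, dimensions, and the concatenation-based fusion are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch of a spatial-audio-conditioned diffusion denoiser (assumed
# architecture, NOT the released MOSPA code): noisy motion frames and spatial
# audio features are fused by concatenation and passed through a small MLP
# that predicts the added noise.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioConditionedDenoiser(nn.Module):
    def __init__(self, motion_dim=66, audio_dim=128, hidden=256):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.fuse = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, motion_dim),  # predicts per-frame noise
        )

    def forward(self, noisy_motion, audio_feat, t):
        # noisy_motion: (B, T, motion_dim), audio_feat: (B, T, audio_dim), t: (B,)
        temb = self.time_embed(t.float().view(-1, 1, 1).expand(-1, noisy_motion.size(1), 1))
        return self.fuse(torch.cat([noisy_motion, audio_feat, temb], dim=-1))

# One training step of the standard DDPM objective: predict the noise added
# to clean motion at a random diffusion step.
model = AudioConditionedDenoiser()
motion = torch.randn(4, 120, 66)   # e.g. 120 frames of pose features (assumed layout)
audio = torch.randn(4, 120, 128)   # per-frame spatial-audio features (assumed)
t = torch.randint(0, 1000, (4,))
noise = torch.randn_like(motion)
alpha_bar = torch.cos(t.float() / 1000 * math.pi / 2) ** 2  # toy noise schedule
noisy = alpha_bar.view(-1, 1, 1).sqrt() * motion + (1 - alpha_bar).view(-1, 1, 1).sqrt() * noise
loss = F.mse_loss(model(noisy, audio, t), noise)
loss.backward()
```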

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2507.11949
β€’ PDF: https://arxiv.org/pdf/2507.11949
β€’ Project Page: https://frank-zy-dou.github.io/projects/MOSPA/index.html

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀6
πŸ™πŸ’Έ 500$ FOR THE FIRST 500 WHO JOIN THE CHANNEL! πŸ™πŸ’Έ

Join our channel today for free! Tomorrow it will cost 500$!

https://t.iss.one/+QHlfCJcO2lRjZWVl

You can join at this link! πŸ‘†πŸ‘‡

https://t.iss.one/+QHlfCJcO2lRjZWVl
❀2
πŸ”Ή Title:
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

πŸ”Ή Publication Date: Published on Jul 14

πŸ”Ή Abstract:
AI-generated summary: A new dataset, EmRACE-3K, evaluates vision-language models in embodied settings, showing limitations in spatial reasoning and long-horizon planning, and demonstrates improvements through supervised and reinforcement learning fine-tuning.

Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/pdf/2507.10548
β€’ PDF: https://arxiv.org/pdf/2507.10548
β€’ Project Page: https://mxllc.github.io/EmbRACE-3K/
β€’ Github: https://mxllc.github.io/EmbRACE-3K

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀1
πŸ”Ή Title:
Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

πŸ”Ή Publication Date: Published on Jul 11

πŸ”Ή Abstract:
AI-generated summary: A novel image tokenizer built on pre-trained vision foundation models improves image reconstruction, generation quality, and token efficiency, enhancing autoregressive generation and class-conditional synthesis.

Leveraging the powerful representations of pre-trained vision foundation models -- traditionally used for visual comprehension -- we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation -- achieving a gFID of 2.07 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.
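The semantic reconstruction objective named in the abstract can be illustrated with a short, hedged sketch: the tokenizer's decoded features are pulled toward the frozen foundation model's features with a cosine-similarity term, alongside a standard pixel loss. The loss weights and feature shapes below are assumptions, not the VFMTok release.

```python
# Minimal sketch (assumed, not the VFMTok code) of a semantic reconstruction
# objective: decoded features are aligned with the frozen vision foundation
# model's features via cosine similarity, next to a pixel reconstruction loss.
import torch
import torch.nn.functional as F

def semantic_reconstruction_loss(decoded_feats, frozen_vfm_feats):
    # Both tensors: (B, N_tokens, C). Higher cosine similarity -> lower loss.
    return 1.0 - F.cosine_similarity(decoded_feats, frozen_vfm_feats, dim=-1).mean()

def tokenizer_loss(recon_img, target_img, decoded_feats, frozen_vfm_feats, lam=1.0):
    pixel_loss = F.mse_loss(recon_img, target_img)
    sem_loss = semantic_reconstruction_loss(decoded_feats, frozen_vfm_feats.detach())
    return pixel_loss + lam * sem_loss

# Example with random tensors standing in for encoder/decoder outputs.
recon, target = torch.randn(2, 3, 256, 256), torch.randn(2, 3, 256, 256)
dec_f, vfm_f = torch.randn(2, 196, 768), torch.randn(2, 196, 768)
print(tokenizer_loss(recon, target, dec_f, vfm_f))
```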

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2507.08441
β€’ PDF: https://arxiv.org/pdf/2507.08441

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀3
This channel is for Programmers, Coders, and Software Engineers.

0️⃣ Python
1️⃣ Data Science
2️⃣ Machine Learning
3️⃣ Data Visualization
4️⃣ Artificial Intelligence
5️⃣ Data Analysis
6️⃣ Statistics
7️⃣ Deep Learning
8️⃣ Programming Languages

βœ… https://t.iss.one/addlist/8_rRW2scgfRhOTc0

βœ… https://t.iss.one/Codeprogrammer
πŸ”Ή Title:
A Survey of Context Engineering for Large Language Models

πŸ”Ή Publication Date: Published on Jul 17

πŸ”Ή Abstract:
AI-generated summary: Context Engineering systematically optimizes information payloads for Large Language Models, addressing gaps in generating sophisticated, long-form outputs.

The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1300 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.
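To make the survey's component taxonomy concrete, here is a deliberately tiny retrieval-augmented-generation sketch covering context retrieval, context processing/management, and the hand-off to generation. The corpus, overlap scoring, and prompt template are illustrative assumptions of a generic RAG pipeline, not an implementation from the paper.

```python
# Generic RAG sketch: context retrieval, context processing/management, and
# a prompt that would be handed to an LLM for the generation step.
from collections import Counter

CORPUS = [
    "Context engineering optimizes the information payload given to an LLM.",
    "Retrieval-augmented generation injects retrieved documents into the prompt.",
    "Memory systems let agents persist information across turns.",
]

def retrieve(query, corpus, k=2):
    # Context retrieval: rank documents by simple word overlap with the query.
    q = Counter(query.lower().split())
    scored = [(sum((q & Counter(d.lower().split())).values()), d) for d in corpus]
    return [d for s, d in sorted(scored, reverse=True)[:k] if s > 0]

def build_context(query, docs, budget_chars=400):
    # Context processing/management: assemble and truncate to a fixed budget.
    ctx = "\n".join(docs)[:budget_chars]
    return f"Context:\n{ctx}\n\nQuestion: {query}\nAnswer:"

prompt = build_context("What does retrieval-augmented generation do?",
                       retrieve("retrieval augmented generation", CORPUS))
print(prompt)  # this prompt would be passed to an LLM for generation
```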

πŸ”Ή Links:
β€’ arXiv Page: https://huggingface.co/collections/Maxwell-Jia/daily-arxiv-668d5e8d30bab29956b66b8d
β€’ PDF: https://arxiv.org/pdf/2507.13334
β€’ Github: https://github.com/Meirtz/Awesome-Context-Engineering

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀2
πŸ”Ή Title:
Perception-Aware Policy Optimization for Multimodal Reasoning

πŸ”Ή Publication Date: Published on Jul 8

πŸ”Ή Abstract:
AI-generated summary: Perception-Aware Policy Optimization (PAPO) enhances reinforcement learning with verifiable rewards for multimodal reasoning by integrating an implicit perception loss, improving visual perception and reasoning.

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.
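A hedged sketch of the "KL divergence term added to the GRPO objective" idea: compare the policy's token distributions when it sees the original image versus a corrupted one, and reward a larger divergence so the model is pushed to actually use the visual input. The masking strategy, sign, and weighting here are assumptions; consult the paper and repo for the exact formulation.

```python
# Sketch of an implicit-perception-style KL term (assumed form, not PAPO's code).
import torch
import torch.nn.functional as F

def implicit_perception_kl(logits_with_image, logits_with_masked_image):
    # Token-level KL( p(. | original image) || p(. | corrupted image) ), averaged.
    logp = F.log_softmax(logits_with_image, dim=-1)
    logq = F.log_softmax(logits_with_masked_image, dim=-1)
    return F.kl_div(logq, logp, log_target=True, reduction="batchmean")

def papo_style_objective(policy_loss, logits_img, logits_masked, gamma=0.01):
    # GRPO-style policy loss minus a weighted perception term: minimizing the
    # total encourages a LARGER divergence, i.e. stronger image dependence.
    return policy_loss - gamma * implicit_perception_kl(logits_img, logits_masked)

# Toy example with random logits over a vocabulary of 32 tokens.
logits_img = torch.randn(8, 32)
logits_masked = torch.randn(8, 32)
print(papo_style_objective(torch.tensor(0.5), logits_img, logits_masked))
```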

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2507.06448
β€’ PDF: https://arxiv.org/pdf/2507.06448
β€’ Project Page: https://mikewangwzhl.github.io/PAPO
β€’ Github: https://mikewangwzhl.github.io/PAPO/

πŸ”Ή Datasets citing this paper:
β€’ https://huggingface.co/datasets/PAPOGalaxy/PAPO_ViRL39K_train
β€’ https://huggingface.co/datasets/PAPOGalaxy/PAPO_MMK12_test

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀1
πŸ”Ή Title:
MIRIX: Multi-Agent Memory System for LLM-Based Agents

πŸ”Ή Publication Date: Published on Jul 10

πŸ”Ή Abstract:
AI-generated summary: MIRIX, a modular multi-agent memory system, enhances AI memory capabilities by integrating diverse memory types and a dynamic framework, achieving superior performance in multimodal and long-form conversation benchmarks.

Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.
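The idea of routing information into typed memory stores (the six types named in the abstract) can be sketched with a small data structure. The routing and retrieval heuristics below are illustrative assumptions, not MIRIX's implementation, which uses agentic controllers and embedding-based retrieval.

```python
# Toy typed-memory store: write entries under a memory type and retrieve by keyword.
from collections import defaultdict

MEMORY_TYPES = ["core", "episodic", "semantic", "procedural", "resource", "knowledge_vault"]

class TypedMemory:
    def __init__(self):
        self.stores = defaultdict(list)          # memory type -> list of entries

    def write(self, memory_type, entry):
        assert memory_type in MEMORY_TYPES, f"unknown memory type: {memory_type}"
        self.stores[memory_type].append(entry)

    def retrieve(self, query, memory_type=None):
        # Naive keyword match; a real system would use embeddings and ranking.
        pools = [memory_type] if memory_type else MEMORY_TYPES
        return [e for t in pools for e in self.stores[t] if query.lower() in e.lower()]

mem = TypedMemory()
mem.write("episodic", "2024-05-01: user asked about spatial audio datasets")
mem.write("semantic", "The SAM dataset pairs spatial audio with human motion")
print(mem.retrieve("spatial audio"))
```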

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2507.07957
β€’ PDF: https://arxiv.org/pdf/2507.07957
β€’ Project Page: https://mirix.io/
β€’ Github: https://github.com/Mirix-AI/MIRIX

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀2
πŸ”Ή Title:
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

πŸ”Ή Publication Date: Published on Jul 16

πŸ”Ή Abstract:
AI-generated summary: Mono-InternVL, an advanced monolithic Multimodal Large Language Model, integrates visual experts and improved pre-training strategies to enhance visual learning and reduce computational costs while maintaining competitive performance.

This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but also leads to relatively expensive data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and re-organizes the pre-training process in an efficient manner. During inference, it includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs, while still maintaining competitive performance with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.
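The "new visual parameter space inside a frozen pre-trained LLM" idea can be illustrated with a modality-routed feed-forward layer: text tokens use the original (frozen) FFN while image tokens use a newly added visual expert, so only the visual parameters receive gradients. This is a simplified, assumed sketch; the real model uses many layers and additional experts (see the released code).

```python
# Simplified modality-routed mixture-of-experts FFN (assumed, not the repo code).
import torch
import torch.nn as nn

class ModalityMoEFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.text_expert = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.visual_expert = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        for p in self.text_expert.parameters():   # freeze the pretrained language pathway
            p.requires_grad = False

    def forward(self, hidden, is_visual):
        # hidden: (B, T, d_model); is_visual: (B, T) boolean modality mask
        text_out = self.text_expert(hidden)
        visual_out = self.visual_expert(hidden)
        return torch.where(is_visual.unsqueeze(-1), visual_out, text_out)

layer = ModalityMoEFFN()
h = torch.randn(2, 10, 512)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, :4] = True                      # pretend the first 4 tokens are image patches
print(layer(h, mask).shape)             # torch.Size([2, 10, 512])
```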

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2507.12566
β€’ PDF: https://arxiv.org/pdf/2507.12566
β€’ Github: https://github.com/OpenGVLab/Mono-InternVL

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀1
πŸ”Ή Title:
Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation

πŸ”Ή Publication Date: Published on Jun 24

πŸ”Ή Abstract:
AI-generated summary: Radial Attention, a scalable sparse attention mechanism, improves efficiency and preserves video quality in diffusion models by leveraging spatiotemporal energy decay.

Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal distance between tokens increase, akin to the physical decay of signal or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with O(n log n) complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard O(n^2) dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9Γ— speedup over the original dense attention. With minimal tuning, it enables video generation up to 4Γ— longer while reducing training costs by up to 4.4Γ— compared to direct fine-tuning and accelerating inference by up to 3.7Γ— compared to dense attention inference.
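The static mask the abstract describes, where each token attends to spatially nearby tokens and the window shrinks with temporal distance, can be sketched directly. The exact decay schedule (halving the window per frame of temporal distance) is an assumption for illustration, not the paper's mask.

```python
# Build a boolean attention mask over (frames x tokens_per_frame) positions:
# spatial window shrinks as the temporal gap between query and key frames grows.
import numpy as np

def radial_mask(num_frames, tokens_per_frame, base_window=8, max_temporal_gap=4):
    n = num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    for qf in range(num_frames):
        for kf in range(num_frames):
            gap = abs(qf - kf)
            if gap > max_temporal_gap:
                continue                             # too far apart in time: no attention
            window = max(base_window >> gap, 1)      # assumed halving per frame of distance
            for qs in range(tokens_per_frame):
                lo, hi = max(0, qs - window), min(tokens_per_frame, qs + window + 1)
                q = qf * tokens_per_frame + qs
                mask[q, kf * tokens_per_frame + lo : kf * tokens_per_frame + hi] = True
    return mask

m = radial_mask(num_frames=6, tokens_per_frame=16)
print(m.shape, m.mean())  # attended fraction is far below dense attention's 1.0
```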

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2506.19852
β€’ PDF: https://arxiv.org/pdf/2506.19852
β€’ Project Page: https://hanlab.mit.edu/projects/radial-attention
β€’ Github: https://github.com/mit-han-lab/radial-attention

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀2
πŸ”Ή Title:
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

πŸ”Ή Publication Date: Published on Jun 30

πŸ”Ή Abstract:
AI-generated summary: Self-play in zero-sum games using SPIRAL enhances reasoning capabilities in language models through self-improvement and transfer learning.

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
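Role-conditioned advantage estimation (RAE) can be sketched as keeping a separate baseline per role and subtracting it from that role's returns, which matters in zero-sum self-play where the two players see mirrored rewards. The EMA baseline below is an assumption about the estimator; see the SPIRAL repo for the actual formulation.

```python
# Toy role-conditioned advantage estimator: one running baseline per role.
from collections import defaultdict

class RoleConditionedAdvantage:
    def __init__(self, decay=0.95):
        self.decay = decay
        self.baseline = defaultdict(float)   # role -> running mean return

    def __call__(self, role, episode_return):
        adv = episode_return - self.baseline[role]
        # update the role-specific baseline after computing the advantage
        self.baseline[role] = self.decay * self.baseline[role] + (1 - self.decay) * episode_return
        return adv

rae = RoleConditionedAdvantage()
# In zero-sum self-play the two roles see mirrored returns (+1 / -1), so a
# single shared baseline would be misleading; per-role baselines keep both centered.
for ret0, ret1 in [(1, -1), (1, -1), (-1, 1)]:
    print(rae("player_0", ret0), rae("player_1", ret1))
```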

πŸ”Ή Links:
β€’ arXiv Page: https://arxivexplained.com/papers/spiral-self-play-on-zero-sum-games-incentivizes-reasoning-via-multi-agent-multi-turn-reinforcement-learning
β€’ PDF: https://arxiv.org/pdf/2506.24119
β€’ Project Page: https://benjamin-eecs.github.io/blog/2025/spiral/
β€’ Github: https://github.com/spiral-rl/spiral

πŸ”Ή Datasets citing this paper:
β€’ https://huggingface.co/datasets/spiral-rl/Spiral-Kuhn-Poker-Qwen3-32B-SFT

πŸ”Ή Spaces citing this paper:
β€’ https://huggingface.co/spaces/kaushikvr06/reasoning-simulator
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀2
πŸ”Ή Title:
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering

πŸ”Ή Publication Date: Published on Jul 15

πŸ”Ή Abstract:
AI-generated summary: DrafterBench is an open-source benchmark for evaluating LLM agents in technical drawing revision, assessing their capabilities in structured data comprehension, function execution, instruction following, and critical reasoning.

Large Language Model (LLM) agents have shown great potential for solving real-world problems and promise to be a solution for tasks automation in industry. However, more benchmarks are needed to systematically evaluate automation agents from an industrial perspective, for example, in Civil Engineering. Therefore, we propose DrafterBench for the comprehensive evaluation of LLM agents in the context of technical drawing revision, a representation task in civil engineering. DrafterBench contains twelve types of tasks summarized from real-world drawing files, with 46 customized functions/tools and 1920 tasks in total. DrafterBench is an open-source benchmark to rigorously test AI agents' proficiency in interpreting intricate and long-context instructions, leveraging prior knowledge, and adapting to dynamic instruction quality via implicit policy awareness. The toolkit comprehensively assesses distinct capabilities in structured data comprehension, function execution, instruction following, and critical reasoning. DrafterBench offers detailed analysis of task accuracy and error statistics, aiming to provide deeper insight into agent capabilities and identify improvement targets for integrating LLMs in engineering applications. Our benchmark is available at https://github.com/Eason-Li-AIS/DrafterBench, with the test set hosted at https://huggingface.co/datasets/Eason666/DrafterBench.

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2507.11527
β€’ PDF: https://arxiv.org/pdf/2507.11527
β€’ Github: https://github.com/Eason-Li-AIS/DrafterBench

πŸ”Ή Datasets citing this paper:
β€’ https://huggingface.co/datasets/Eason666/DrafterBench

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀3
πŸ”Ή Title:
Calligrapher: Freestyle Text Image Customization

πŸ”Ή Publication Date: Published on Jun 30

πŸ”Ή Abstract:
AI-generated summary: Calligrapher uses a diffusion-based framework with self-distillation and localized style injection to generate high-quality, stylistically consistent digital typography.

We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2506.24123
β€’ PDF: https://arxiv.org/pdf/2506.24123
β€’ Project Page: https://calligrapher2025.github.io/Calligrapher/
β€’ Github: https://github.com/Calligrapher2025/Calligrapher

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀2
πŸ”Ή Title:
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

πŸ”Ή Publication Date: Published on Jul 14

πŸ”Ή Abstract:
AI-generated summary: A large-scale dataset named SpeakerVid-5M is introduced for audio-visual dyadic interactive virtual human generation, featuring diverse interactions and high-quality data for various virtual human tasks.

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark VidChatBench for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2507.09862
β€’ PDF: https://arxiv.org/pdf/2507.09862
β€’ Project Page: https://dorniwang.github.io/SpeakerVid-5M/
β€’ Github: https://dorniwang.github.io/SpeakerVid-5M/

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀2
πŸ”Ή Title:
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

πŸ”Ή Publication Date: Published on Jul 10

πŸ”Ή Abstract:
AI-generated summary: OST-Bench evaluates multimodal large language models in online spatio-temporal reasoning tasks, revealing challenges in handling complex spatial cues and long-term memory in real-world scenarios.

Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2507.07984
β€’ PDF: https://arxiv.org/pdf/2507.07984
β€’ Project Page: https://rbler1234.github.io/OSTBench.github.io/
β€’ Github: https://github.com/OpenRobotLab/OST-Bench

πŸ”Ή Datasets citing this paper:
β€’ https://huggingface.co/datasets/rbler/OST-Bench

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀1
Important channel to get a job.
πŸ”Ή Title:
Agentic Reinforced Policy Optimization

πŸ”Ή Publication Date: Published on Jul 26

πŸ”Ή Abstract:
AI-generated summary: Agentic Reinforced Policy Optimization (ARPO) is a novel RL algorithm that enhances multi-turn LLM-based agents by adaptive uncertainty management and advantage attribution, outperforming trajectory-level RL algorithms with reduced resource usage.

Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO's superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at https://github.com/dongguanting/ARPO
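The entropy-based adaptive rollout idea can be sketched as: measure the entropy of the next-token distribution right after a tool response, and branch extra partial rollouts at steps where the entropy jump over the pre-tool baseline is large. The threshold and branching counts below are assumptions; the released ARPO code defines the actual mechanism.

```python
# Hedged sketch of entropy-based branching after tool calls (not ARPO's code).
import torch
import torch.nn.functional as F

def token_entropy(logits):
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p.clamp_min(1e-9))).sum(-1).mean().item()

def extra_branches(entropy_before_tool, entropy_after_tool, max_branches=4, threshold=0.2):
    # More uncertainty right after the tool output -> sample more continuations there.
    jump = entropy_after_tool - entropy_before_tool
    if jump <= threshold:
        return 0                      # keep only the single global trajectory
    return min(max_branches, int(jump / threshold))

before = token_entropy(torch.randn(1, 32000) * 3.0)   # sharper pre-tool distribution
after = token_entropy(torch.randn(1, 32000) * 0.5)    # flatter distribution after the tool call
print(extra_branches(before, after))
```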

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2507.19849
β€’ PDF: https://arxiv.org/pdf/2507.19849
β€’ Project Page: https://github.com/dongguanting/ARPO
β€’ Github: https://github.com/dongguanting/ARPO

πŸ”Ή Datasets citing this paper:
β€’ https://huggingface.co/datasets/dongguanting/ARPO-SFT-54K
β€’ https://huggingface.co/datasets/dongguanting/ARPO-RL-DeepSearch-1K
β€’ https://huggingface.co/datasets/dongguanting/ARPO-RL-Reasoning-10K

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀3
πŸ”Ή Title:
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

πŸ”Ή Publication Date: Published on Jul 28

πŸ”Ή Abstract:
AI-generated summary: A multimodal model that processes visual, audio, and textual signals for structured comprehension of real-world short videos improves video search, recommendation, and engagement.

Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on an H20 GPU.

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2507.20939
β€’ PDF: https://arxiv.org/pdf/2507.20939
β€’ Project Page: https://tencentarc.github.io/posts/arc-video-announcement/
β€’ Github: https://github.com/TencentARC/ARC-Hunyuan-Video-7B

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀2
πŸ”Ή Title:
Reconstructing 4D Spatial Intelligence: A Survey

πŸ”Ή Publication Date: Published on Jul 28

πŸ”Ή Abstract:
AI-generated summary: A survey organizes methods for reconstructing 4D spatial intelligence from visual observations into five progressive levels, offering analysis and identifying future research directions.

Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision, with broad real-world applications. These range from entertainment domains like movies, where the focus is often on reconstructing fundamental visual elements, to embodied AI, which emphasizes interaction modeling and physical realism. Fueled by rapid advances in 3D representations and deep learning architectures, the field has evolved quickly, outpacing the scope of previous surveys. Additionally, existing surveys rarely offer a comprehensive analysis of the hierarchical structure of 4D scene reconstruction. To address this gap, we present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence: (1) Level 1 -- reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) Level 2 -- reconstruction of 3D scene components (e.g., objects, humans, structures); (3) Level 3 -- reconstruction of 4D dynamic scenes; (4) Level 4 -- modeling of interactions among scene components; and (5) Level 5 -- incorporation of physical laws and constraints. We conclude the survey by discussing the key challenges at each level and highlighting promising directions for advancing toward even richer levels of 4D spatial intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence.

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2507.21045
β€’ PDF: https://arxiv.org/pdf/2507.21045
β€’ Github: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀4
πŸ”Ή Title:
Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning

πŸ”Ή Publication Date: Published on Jul 28

πŸ”Ή Abstract:
AI-generated summary: Rep-MTL optimizes multi-task learning by leveraging task saliency in shared representations to promote complementarity and reduce negative transfer.

Despite the promise of Multi-Task Learning in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts via optimizer-centric loss scaling and gradient manipulation strategies, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizers, especially for facilitating the inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits the representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead of pure conflict-solving, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law exponent analysis demonstrates Rep-MTL's efficacy in balancing task-specific learning and cross-task sharing. The project page is available at the link listed below.
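One way to read "representation-level task saliency" is to measure how strongly each task's loss depends on the shared representation (via gradient norms), turn those saliencies into a distribution, and penalize low entropy so no single task dominates the shared space. The sketch below is this illustrative reading under stated assumptions, not the Rep-MTL algorithm; the repo linked below has the real implementation.

```python
# Hedged sketch: per-task saliency from gradients w.r.t. the shared
# representation, plus an entropy term that favors balanced saliencies.
import torch

def task_saliency_entropy(shared_repr, task_losses):
    sal = []
    for loss in task_losses:
        g, = torch.autograd.grad(loss, shared_repr, retain_graph=True, create_graph=True)
        sal.append(g.norm())
    p = torch.softmax(torch.stack(sal), dim=0)          # saliency distribution over tasks
    entropy = -(p * torch.log(p.clamp_min(1e-9))).sum()
    return p, entropy

# Toy two-task setup sharing one representation.
shared = torch.randn(8, 16, requires_grad=True)
head_a, head_b = torch.nn.Linear(16, 1), torch.nn.Linear(16, 1)
loss_a = head_a(shared).pow(2).mean()
loss_b = head_b(shared).abs().mean()
p, ent = task_saliency_entropy(shared, [loss_a, loss_b])
total = loss_a + loss_b - 0.1 * ent     # maximizing entropy balances task saliencies
total.backward()
print(p.detach(), ent.item())
```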

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2507.21049
β€’ PDF: https://arxiv.org/pdf/2507.21049
β€’ Project Page: https://jacky1128.github.io/RepMTL/
β€’ Github: https://github.com/Jacky1128/Rep-MTL

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀4πŸ‘1