Data Science | Machine Learning with Python for Researchers

The Data Science and Python channel is for researchers and advanced programmers

Article Title:
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Article Date: 20 May 2025

Article Description:
Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin.
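
The analyze-then-parse flow can be illustrated with a short sketch. The snippet below is only a structural illustration of the two stages, assuming a hypothetical interface (`generate_layout` and `parse_element` are stand-ins, not Dolphin's actual API); stage 2 runs per-element parsing in parallel, which is where the efficiency gain comes from.

```python
# Minimal sketch of an analyze-then-parse pipeline (hypothetical interface).
from concurrent.futures import ThreadPoolExecutor

def generate_layout(page_image):
    # Stage 1 (assumed): return layout elements in reading order,
    # each with a type and a bounding box.
    return [
        {"type": "paragraph", "box": (50, 40, 500, 120)},
        {"type": "table", "box": (50, 140, 500, 320)},
        {"type": "formula", "box": (50, 340, 500, 380)},
    ]

def parse_element(page_image, element):
    # Stage 2 (assumed): the element crop plus a task-specific prompt is fed
    # back to the model; here we just echo a placeholder result.
    prompts = {"paragraph": "Read the text.",
               "table": "Parse the table to HTML.",
               "formula": "Transcribe the formula to LaTeX."}
    return {"type": element["type"], "prompt": prompts[element["type"]],
            "content": f"<parsed {element['type']}>"}

def parse_page(page_image):
    anchors = generate_layout(page_image)           # analyze
    with ThreadPoolExecutor() as pool:              # parse anchors in parallel
        return list(pool.map(lambda el: parse_element(page_image, el), anchors))

print(parse_page("page.png"))
```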

PDF Download Link:
https://arxiv.org/pdf/2505.14059v1.pdf

GitHub:
https://github.com/bytedance/dolphin

Datasets:
• PubTabNet
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View

Article Date: 5 Jun 2025

Article Description:
Reconstructing semantic-aware 3D scenes from sparse views is a challenging yet essential research direction, driven by the demands of emerging applications such as virtual reality and embodied AI. Existing per-scene optimization methods require dense input views and incur high computational costs, while generalizable approaches often struggle to reconstruct regions outside the input view cone. In this paper, we propose OGGSplat, an open Gaussian growing method that expands the field-of-view in generalizable 3D reconstruction. Our key insight is that the semantic attributes of open Gaussians provide strong priors for image extrapolation, enabling both semantic consistency and visual plausibility. Specifically, once open Gaussians are initialized from sparse views, we introduce an RGB-semantic consistent inpainting module applied to selected rendered views. This module enforces bidirectional control between an image diffusion model and a semantic diffusion model. The inpainted regions are then lifted back into 3D space for efficient and progressive Gaussian parameter optimization. To evaluate our method, we establish a Gaussian Outpainting (GO) benchmark that assesses both semantic and generative quality of reconstructed open-vocabulary scenes. OGGSplat also demonstrates promising semantic-aware scene reconstruction capabilities when provided with two view images captured directly from a smartphone camera.

PDF Download Link:
https://arxiv.org/pdf/2506.05204v1.pdf

GitHub:
https://github.com/Yanbo-23/OGGSplat

Datasets:
• S3DIS
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
Facial Appearance Capture at Home with Patch-Level Reflectance Prior

Article Date: 4 Jun 2025

Article Description:
Existing facial appearance capture methods can reconstruct plausible facial reflectance from smartphone-recorded videos. However, the reconstruction quality is still far behind the ones based on studio recordings. This paper fills the gap by developing a novel daily-used solution with a co-located smartphone and flashlight video capture setting in a dim room. To enhance the quality, our key observation is to solve facial reflectance maps within the data distribution of studio-scanned ones. Specifically, we first learn a diffusion prior over the Light Stage scans and then steer it to produce the reflectance map that best matches the captured images. We propose to train the diffusion prior at the patch level to improve generalization ability and training stability, as current Light Stage datasets are in ultra-high resolution but limited in data size. Tailored to this prior, we propose a patch-level posterior sampling technique to sample seamless full-resolution reflectance maps from this patch-level diffusion model. Experiments demonstrate our method closes the quality gap between low-cost and studio recordings by a large margin, opening the door for everyday users to clone themselves to the digital world. Our code will be released at https://github.com/yxuhan/DoRA.
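
The patch-level idea can be illustrated with a generic overlap-and-blend routine: sample patches on an overlapping grid and blend them with a smooth window so no seams appear at patch borders. This is only a stand-in for the paper's patch-level posterior sampling (here `sample_patch` just returns noise), but it shows why overlapping windows yield a seamless full-resolution map.

```python
import numpy as np

def sample_patch(y0, x0, size, rng):
    # Stand-in for the patch-level diffusion prior: returns a random patch.
    return rng.random((size, size, 3))

def assemble_full_map(height, width, patch=64, stride=32, seed=0):
    rng = np.random.default_rng(seed)
    out = np.zeros((height, width, 3))
    weight = np.zeros((height, width, 1))
    # Smooth (Hann) window so overlapping patches blend without visible seams.
    win = np.outer(np.hanning(patch), np.hanning(patch))[..., None] + 1e-6
    for y0 in range(0, height - patch + 1, stride):
        for x0 in range(0, width - patch + 1, stride):
            out[y0:y0 + patch, x0:x0 + patch] += win * sample_patch(y0, x0, patch, rng)
            weight[y0:y0 + patch, x0:x0 + patch] += win
    return out / np.maximum(weight, 1e-6)

reflectance = assemble_full_map(256, 256)
print(reflectance.shape)  # (256, 256, 3)
```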

PDF Download Link:
https://arxiv.org/pdf/2506.03478v1.pdf

GitHub:
https://github.com/yxuhan/dora

Datasets:
• NeRF
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

🔹 Publication Date: Published on Jul 15

🔹 Abstract:
DIJA is a framework that exploits safety weaknesses in diffusion-based large language models by constructing adversarial prompts, demonstrating significant vulnerabilities in their alignment mechanisms.

AI-generated summary: Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.11097
• PDF: https://arxiv.org/pdf/2507.11097
• Github: https://github.com/ZichenWen1/DIJA

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
4KAgent: Agentic Any Image to 4K Super-Resolution

🔹 Publication Date: Published on Jul 9

🔹 Abstract:
4KAgent, a unified agentic super-resolution system, enhances low-resolution images to 4K using profiling, perception, and restoration agents, achieving state-of-the-art performance across various imaging domains.

AI-generated summary: We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) a Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) a Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: https://4kagent.github.io.
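
The execution-reflection loop with a quality-driven selection policy can be sketched as follows; the `experts` and the `quality` score are trivial placeholders (a real system would use restoration models and IQA experts such as NIQE or MUSIQ), so this is an illustration of the control flow, not of 4KAgent itself.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))  # stand-in for a degraded low-resolution input

# Placeholder "experts": each proposes a candidate output for the current step.
experts = {
    "identity": lambda img: img,
    "sharpen":  lambda img: np.clip(img * 1.1 - 0.05, 0.0, 1.0),
    "upscale":  lambda img: np.kron(img, np.ones((2, 2))),  # nearest-neighbour 2x
}

def quality(img):
    # Placeholder no-reference score (higher is better).
    return float(img.std()) + 0.01 * min(img.shape)

def restore(img, max_steps=3):
    # Execution-reflection loop with a quality-driven selection policy:
    # every expert proposes a candidate, the best-scoring one is kept, and the
    # loop stops once no expert improves the score.
    for _ in range(max_steps):
        best = max((fn(img) for fn in experts.values()), key=quality)
        if quality(best) <= quality(img):
            break
        img = best
    return img

print(restore(image).shape)  # (512, 512) after three 2x upscales here
```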

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.07105
• PDF: https://arxiv.org/pdf/2507.07105
• Project Page: https://4kagent.github.io/

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs

🔹 Publication Date: Published on Jun 17

🔹 Abstract:
This study investigates long-context performance of diffusion LLMs compared to auto-regressive LLMs, identifies their unique characteristics, and proposes LongLLaDA, a training-free method for extending context windows.

AI-generated summary: Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a unique characteristic of diffusion LLMs: unlike auto-regressive LLMs, they maintain remarkably stable perplexity during direct context extrapolation. Furthermore, where auto-regressive models fail outright during the Needle-In-A-Haystack task with context exceeding their pretrained length, we discover diffusion LLMs exhibit a distinct local perception phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. Consequently, this study establishes the first context extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs.
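
The NTK-based RoPE extrapolation mentioned above is the standard trick of enlarging the RoPE base rather than interpolating positions. A generic NumPy sketch is below; the `dim/(dim-2)` exponent is the common NTK-aware formulation and not necessarily LongLLaDA's exact recipe.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0, scale=1.0):
    # NTK-aware context extension: instead of interpolating positions, enlarge
    # the RoPE base so low-frequency components stretch to longer contexts.
    if scale > 1.0:
        base = base * scale ** (head_dim / (head_dim - 2))
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x, positions, base=10000.0, scale=1.0):
    # x: (seq_len, head_dim) with head_dim even.
    freqs = rope_frequencies(x.shape[-1], base, scale)   # (head_dim/2,)
    angles = np.outer(positions, freqs)                  # (seq_len, head_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Example: extend a model pretrained on 4k tokens to a 16k context (scale=4).
x = np.random.default_rng(0).standard_normal((16_384, 64))
y = apply_rope(x, np.arange(16_384), scale=4.0)
print(y.shape)
```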

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.14429
• PDF: https://arxiv.org/pdf/2506.14429

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
T-LoRA: Single Image Diffusion Model Customization Without Overfitting

🔹 Publication Date: Published on Jul 8

🔹 Abstract:
T-LoRA, a timestep-dependent low-rank adaptation framework, enhances diffusion model personalization with a dynamic fine-tuning strategy and orthogonal initialization, achieving better concept fidelity and text alignment in data-limited settings.

AI-generated summary: While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. In our work we show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques. They achieve a superior balance between concept fidelity and text alignment, highlighting the potential of T-LoRA in data-limited and resource-constrained scenarios. Code is available at https://github.com/ControlGenAI/T-LoRA.
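
A rough sketch of the two ingredients, timestep-dependent rank constraints and orthogonal initialization, is shown below. The linear rank schedule (fewer active ranks at noisier timesteps) is an assumption for illustration; the paper's exact parametrization may differ.

```python
import torch
import torch.nn as nn

class TimestepLoRALinear(nn.Module):
    """Sketch of a timestep-dependent LoRA adapter (assumed rank schedule)."""
    def __init__(self, base: nn.Linear, max_rank: int = 16, max_t: int = 1000):
        super().__init__()
        self.base = base
        self.max_rank, self.max_t = max_rank, max_t
        self.down = nn.Parameter(torch.empty(max_rank, base.in_features))
        self.up = nn.Parameter(torch.zeros(base.out_features, max_rank))
        # Orthogonal initialization keeps adapter components independent.
        nn.init.orthogonal_(self.down)

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        # Noisier (higher) timesteps overfit more easily, so use fewer ranks there.
        active = max(1, int(self.max_rank * (1 - t / self.max_t)))
        mask = torch.zeros(self.max_rank, device=x.device)
        mask[:active] = 1.0
        delta = x @ self.down.T @ torch.diag(mask) @ self.up.T
        return self.base(x) + delta

layer = TimestepLoRALinear(nn.Linear(128, 128))
x = torch.randn(4, 128)
print(layer(x, t=900).shape, layer(x, t=100).shape)
```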

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.05964
• PDF: https://arxiv.org/pdf/2507.05964
• Github: https://github.com/ControlGenAI/T-LoRA

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

🔹 Publication Date: Published on Jul 2

🔹 Abstract:
Multimodal foundation models, despite being primarily trained on image-text tasks, demonstrate respectable performance across various vision tasks when adapted through prompt chaining, though they fall short compared to specialized models.

AI-generated summary: Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc.). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks. 6) Reasoning models, e.g. o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.
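
Prompt chaining here means decomposing a vision task into a sequence of text-promptable queries. The toy sketch below shows one plausible chain for image classification (coarse group first, then fine label); `call_model` is a hypothetical stub standing in for a multimodal API call, not the paper's benchmarking framework.

```python
# Hypothetical API wrapper: in practice this would send the image and prompt to
# a multimodal model endpoint; here it returns canned answers for illustration.
def call_model(image_path: str, prompt: str) -> str:
    canned = {"group": "animal", "label": "tabby cat"}
    return canned["group"] if "Which group" in prompt else canned["label"]

GROUPS = {
    "animal": ["tabby cat", "golden retriever", "red fox"],
    "vehicle": ["pickup truck", "mountain bike", "city bus"],
}

def classify(image_path: str) -> str:
    # Step 1: narrow the label space with a coarse question.
    group = call_model(image_path,
                       f"Which group fits this image best? Options: {list(GROUPS)}. "
                       "Answer with one word.")
    # Step 2: ask for the fine-grained label within that group only.
    return call_model(image_path,
                      f"Pick the best label from {GROUPS[group]}. "
                      "Answer with the label only.")

print(classify("cat.jpg"))  # -> "tabby cat"
```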

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.01955
• PDF: https://arxiv.org/pdf/2507.01955
• Project Page: https://fm-vision-evals.epfl.ch/
• Github: https://github.com/EPFL-VILAB/fm-vision-evals

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

🔹 Publication Date: Published on Jul 18

🔹 Abstract:
Franca, an open-source vision foundation model, achieves high performance using a transparent training pipeline and novel clustering and disentanglement techniques.

AI-generated summary: We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
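
The nested Matryoshka clustering projector can be sketched as a set of heads that each read a nested prefix of the feature vector and score a progressively larger prototype set. The prefix sizes, prototype counts, and the omission of the Sinkhorn-Knopp assignment step below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MatryoshkaClusteringProjector(nn.Module):
    """Sketch: nested prefixes of the feature feed heads of growing granularity."""
    def __init__(self, dim=768, prefix_dims=(96, 192, 384, 768),
                 n_prototypes=(256, 1024, 4096, 16384)):
        super().__init__()
        self.prefix_dims = prefix_dims
        # One linear "prototype" head per nesting level; no extra trunk parameters.
        self.heads = nn.ModuleList(
            nn.Linear(d, k, bias=False) for d, k in zip(prefix_dims, n_prototypes)
        )

    def forward(self, feats: torch.Tensor):
        # feats: (batch, dim). Each head scores only the first `d` channels, so
        # coarse heads reuse the same features that fine heads refine further.
        return [head(feats[:, :d]) for d, head in zip(self.prefix_dims, self.heads)]

proj = MatryoshkaClusteringProjector()
logits = proj(torch.randn(8, 768))
print([l.shape for l in logits])  # coarse-to-fine cluster logits
```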

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.14137
• PDF: https://arxiv.org/pdf/2507.14137
• Github: https://github.com/valeoai/Franca

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
Cautious Optimizers: Improving Training with One Line of Code

Article Date: 25 Nov 2024

Article Description:
AdamW has been the default optimizer for transformer pretraining. For many years, our community searched for faster and more stable optimizers with only constrained positive outcomes. In this work, we propose a single-line modification in PyTorch to any momentum-based optimizer, which we rename cautious optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and it does not break the convergence guarantee under the Lyapunov analysis. In addition, a whole new family of optimizers is revealed by our theoretical insight. Among them, we pick the simplest one for empirical experiments, showing not only speed-up on Llama and MAE pretraining up to 1.47 times, but also better results in LLM post-training tasks. Code is available at https://github.com/kyleliang919/C-Optim.
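
The "cautious" idea is to apply an update component only where it agrees in sign with the current gradient. Below is a minimal sketch of that mask applied to plain SGD with momentum; refer to the C-Optim repository for the exact C-AdamW/C-Lion formulation (including how the surviving entries are rescaled).

```python
import torch

def cautious_momentum_step(param, grad, buf, lr=0.01, momentum=0.9):
    # Plain SGD-with-momentum update...
    buf.mul_(momentum).add_(grad)
    update = buf
    # ...made "cautious": only apply components whose direction agrees with the
    # current gradient (the single-line mask described in the abstract).
    mask = (update * grad > 0).to(update.dtype)
    param.add_(update * mask, alpha=-lr)

p = torch.randn(5)
g = torch.randn(5)
buf = torch.zeros(5)
cautious_momentum_step(p, g, buf)
print(p)
```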

PDF Download Link:
https://arxiv.org/pdf/2411.16085v3.pdf

GitHub:
https://github.com/kyleliang919/c-optim
https://github.com/huggingface/pytorch-image-models
https://github.com/zhaoolee/garss

Datasets:
• GLUE
• QNLI
• C4
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

🔹 Publication Date: Published on Jul 6

🔹 Abstract:
DreamVLA improves robot manipulation through a VLA framework that incorporates world knowledge, dynamic-region guidance, and a diffusion-based transformer to ensure clear, disentangled representations for action planning.

AI-generated summary: Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks and a 4.44 average length on the CALVIN ABC-D benchmarks.
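
The block-wise structured attention can be sketched as a boolean mask that forbids the dynamic, spatial, and semantic token groups from attending to one another while leaving attention within each group (and to other tokens) open; which other token groups exist and how they attend is an assumption here.

```python
import torch

def blockwise_mask(group_sizes: dict, isolated=("dynamic", "spatial", "semantic")):
    """Boolean attention mask (True = attention allowed).

    Tokens in the `isolated` groups may attend within their own group and to any
    non-isolated tokens, but not to each other, keeping their representations
    disentangled.
    """
    names = list(group_sizes)
    labels = sum(([n] * group_sizes[n] for n in names), [])
    n = len(labels)
    mask = torch.ones(n, n, dtype=torch.bool)
    for i, qi in enumerate(labels):
        for j, kj in enumerate(labels):
            if qi in isolated and kj in isolated and qi != kj:
                mask[i, j] = False
    return mask

mask = blockwise_mask({"obs": 4, "dynamic": 2, "spatial": 2, "semantic": 2})
print(mask.int())
```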

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.04447
• PDF: https://arxiv.org/pdf/2507.04447
• Project Page: https://zhangwenyao1.github.io/DreamVLA/
• Github: https://github.com/Zhangwenyao1/DreamVLA

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
How to Train Your LLM Web Agent: A Statistical Diagnosis

🔹 Publication Date: Published on Jul 5

🔹 Abstract:
A study on compute allocation for post-training LLM-based web agents finds that combining supervised fine-tuning with on-policy reinforcement learning improves performance and reduces computational costs compared to either method alone.

AI-generated summary: LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
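
The bootstrapping step can be illustrated generically: resample evaluation scores to estimate, say, the expected best score attainable under a limited sweep budget. The numbers below are synthetic, and the statistic is only an example of the kind of estimate such a study produces.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-configuration evaluation scores (e.g. success rate).
scores_sft_only = rng.beta(6, 4, size=1370)
scores_sft_rl = rng.beta(7, 3, size=1370)

def expected_best(scores, budget, n_boot=10_000):
    # Bootstrap the score of the best configuration found with a given sweep budget.
    draws = rng.choice(scores, size=(n_boot, budget), replace=True)
    return draws.max(axis=1).mean()

for budget in (4, 16, 64):
    print(budget,
          round(expected_best(scores_sft_only, budget), 3),
          round(expected_best(scores_sft_rl, budget), 3))
```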

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.04103
• PDF: https://arxiv.org/pdf/2507.04103

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
RedOne: Revealing Domain-specific LLM Post-Training in Social Networking Services

🔹 Publication Date: Published on Jul 13

🔹 Abstract:
RedOne, a domain-specific LLM, enhances performance across multiple SNS tasks through a three-stage training strategy, improving generalization and reducing harmful content exposure.

AI-generated summary: As a primary medium for modern information dissemination, social networking services (SNS) have experienced rapid growth, which has posed significant challenges for platform content management and interaction quality improvement. Recently, the development of large language models (LLMs) has offered potential solutions, but existing studies focus on isolated tasks, which not only encounter diminishing benefit from data scaling within individual scenarios but also fail to flexibly adapt to diverse real-world contexts. To address these challenges, we introduce RedOne, a domain-specific LLM designed to break the performance bottleneck of single-task baselines and establish a comprehensive foundation for the SNS. RedOne was developed through a three-stage training strategy consisting of continue pretraining, supervised fine-tuning, and preference optimization, using a large-scale real-world dataset. Through extensive experiments, RedOne maintains strong general capabilities and achieves an average improvement of up to 14.02% across 8 major SNS tasks and 7.56% on an SNS bilingual evaluation benchmark, compared with base models. Furthermore, through online testing, RedOne reduced the exposure rate in harmful content detection by 11.23% and improved the click page rate in post-view search by 14.95% compared with single-task finetuned baseline models. These results establish RedOne as a robust domain-specific LLM for SNS, demonstrating excellent generalization across various tasks and promising applicability in real-world scenarios.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.10605
• PDF: https://arxiv.org/pdf/2507.10605

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
Article Title:
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

Article Date: 26 Jun 2025

Article Description:
While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. Like professionals in the creative industries, such generation requires sophisticated reasoning about items such as visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal large language model generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics and excels in the out-of-distribution Movie Gen Audio benchmark. The demo page is available at https://ThinkSound-Demo.github.io.

PDF Download Link:
https://arxiv.org/pdf/2506.21448v1.pdf

GitHub:
https://github.com/FunAudioLLM/ThinkSound

Datasets:
• AudioSet
• VGG-Sound
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
MOSPA: Human Motion Generation Driven by Spatial Audio

🔹 Publication Date: Published on Jul 16

🔹 Abstract:
A diffusion-based generative framework, MOSPA, is introduced to model human motion in response to spatial audio, achieving state-of-the-art performance using the newly created SAM dataset.

AI-generated summary: Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA could generate diverse, realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our model and dataset will be open-sourced upon acceptance. Please refer to our supplementary video for more details.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.11949
• PDF: https://arxiv.org/pdf/2507.11949
• Project Page: https://frank-zy-dou.github.io/projects/MOSPA/index.html

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

🔹 Publication Date: Published on Jul 14

🔹 Abstract:
A new dataset, EmRACE-3K, evaluates vision-language models in embodied settings, showing limitations in spatial reasoning and long-horizon planning, and demonstrates improvements through supervised and reinforcement learning fine-tuning.

AI-generated summary: Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.10548
• PDF: https://arxiv.org/pdf/2507.10548
• Project Page: https://mxllc.github.io/EmbRACE-3K/

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
🔹 Title:
Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

🔹 Publication Date: Published on Jul 11

🔹 Abstract:
A novel image tokenizer built on pre-trained vision foundation models improves image reconstruction, generation quality, and token efficiency, enhancing autoregressive generation and class-conditional synthesis.

AI-generated summary: Leveraging the powerful representations of pre-trained vision foundation models -- traditionally used for visual comprehension -- we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation -- achieving a gFID of 2.07 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.
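
A semantic reconstruction objective of the kind described can be sketched as a cosine-similarity loss between the tokenizer's decoded features and the frozen foundation model's features; the projection head and the exact loss form below are assumptions, not VFMTok's implementation.

```python
import torch
import torch.nn.functional as F

def semantic_reconstruction_loss(decoded: torch.Tensor, frozen_feats: torch.Tensor,
                                 proj: torch.nn.Linear) -> torch.Tensor:
    """Align tokenizer outputs with frozen vision-foundation-model features.

    decoded:      (batch, n_tokens, d_dec) features reconstructed by the tokenizer.
    frozen_feats: (batch, n_tokens, d_vfm) features from the frozen encoder.
    proj:         learned projection mapping d_dec -> d_vfm (assumed component).
    """
    pred = F.normalize(proj(decoded), dim=-1)
    target = F.normalize(frozen_feats, dim=-1).detach()  # encoder stays frozen
    return (1.0 - (pred * target).sum(dim=-1)).mean()    # 1 - cosine similarity

proj = torch.nn.Linear(256, 768)
loss = semantic_reconstruction_loss(torch.randn(2, 196, 256),
                                    torch.randn(2, 196, 768), proj)
print(loss.item())
```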

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.08441
• PDF: https://arxiv.org/pdf/2507.08441

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT
This channel is for programmers, coders, and software engineers.

0️⃣ Python
1️⃣ Data Science
2️⃣ Machine Learning
3️⃣ Data Visualization
4️⃣ Artificial Intelligence
5️⃣ Data Analysis
6️⃣ Statistics
7️⃣ Deep Learning
8️⃣ Programming Languages

https://t.iss.one/addlist/8_rRW2scgfRhOTc0

https://t.iss.one/Codeprogrammer
🔹 Title:
A Survey of Context Engineering for Large Language Models

🔹 Publication Date: Published on Jul 17

🔹 Abstract:
Context Engineering systematically optimizes information payloads for Large Language Models, addressing gaps in generating sophisticated, long-form outputs.

AI-generated summary: The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing, and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1300 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.
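
At its simplest, context engineering is deciding what goes into the prompt. The toy sketch below assembles a retrieval-augmented context under a word budget with a bag-of-words score; both the scoring and the budget policy are simplistic placeholders for the far richer techniques the survey catalogs.

```python
# Toy retrieval-augmented context assembly: score documents against the query,
# then pack the best ones into the prompt under a rough word budget.
def score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def build_context(query: str, corpus: list, budget_words: int = 40) -> str:
    ranked = sorted(corpus, key=lambda doc: score(query, doc), reverse=True)
    picked, used = [], 0
    for doc in ranked:
        words = len(doc.split())
        if used + words > budget_words:
            break
        picked.append(doc)
        used += words
    return ("Context:\n" + "\n".join(f"- {d}" for d in picked)
            + f"\n\nQuestion: {query}")

corpus = [
    "RoPE scaling extends the context window of pretrained transformers.",
    "Sinkhorn-Knopp normalizes assignment matrices for clustering.",
    "Retrieval-augmented generation injects retrieved passages into the prompt.",
]
print(build_context("How does retrieval-augmented generation extend the prompt?", corpus))
```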

🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.13334
• PDF: https://arxiv.org/pdf/2507.13334
• Github: https://github.com/Meirtz/Awesome-Context-Engineering

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:

https://t.iss.one/DataScienceT