Data Science | Machine Learning with Python for Researchers

The Data Science and Python channel is for researchers and advanced programmers

🔹 Title: Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents

🔹 Publication Date: Published on Jul 31

🔹 Abstract: Reinforcement Learning enhances generalizable spatial reasoning and interaction in 3D environments through cross-view goal specification and automated task synthesis, achieving zero-shot generalization and improved interaction success rates. AI-generated summary: While Reinforcement Learning (RL) has achieved remarkable success in language modeling, its triumph hasn't yet fully translated to visuomotor agents. A primary challenge in RL models is their tendency to overfit specific tasks or environments, thereby hindering the acquisition of generalizable behaviors across diverse settings. This paper provides a preliminary answer to this challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can achieve zero-shot generalization to unseen worlds. Specifically, we explore RL's potential to enhance generalizable spatial reasoning and interaction capabilities in 3D worlds. To address challenges in multi-task RL representation, we analyze and establish cross-view goal specification as a unified multi-task goal space for visuomotor policies. Furthermore, to overcome the significant bottleneck of manual task design, we propose automated task synthesis within the highly customizable Minecraft environment for large-scale multi-task RL training, and we construct an efficient distributed RL framework to support this. Experimental results show RL significantly boosts interaction success rates by 4x and enables zero-shot generalization of spatial reasoning across diverse environments, including real-world settings. Our findings underscore the immense potential of RL training in 3D simulated environments, especially those amenable to large-scale task generation, for significantly advancing visuomotor agents' spatial reasoning.

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2507.23698

• PDF: https://arxiv.org/pdf/2507.23698

• Github: https://github.com/CraftJarvis/ROCKET-3

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

🔹 Publication Date: Published on Aug 4

🔹 Abstract: Seed Diffusion Preview, a discrete-state diffusion language model, achieves fast inference speeds through parallel generation, outperforming Mercury and Gemini Diffusion in speed and quality. AI-generated summary: We present Seed Diffusion Preview, a large-scale language model based on discrete-state diffusion, offering remarkably fast inference speed. Thanks to non-sequential, parallel generation, discrete diffusion models provide a notable speedup that mitigates the inherent latency of token-by-token decoding, as demonstrated recently (e.g., Mercury Coder, Gemini Diffusion). Seed Diffusion Preview achieves an inference speed of 2,146 tokens/s on H20 GPUs while maintaining competitive performance across a sweep of standard code evaluation benchmarks, significantly faster than the contemporary Mercury and Gemini Diffusion models, establishing a new state of the art on the speed-quality Pareto frontier for code models.
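
For intuition, here is a minimal sketch of confidence-based parallel decoding of the kind used by discrete diffusion / masked generative models in general. The abstract does not describe Seed Diffusion's actual sampler, so the refinement schedule, the `mask_id`, and the HF-style `model(...)` call are illustrative assumptions.

```python
import torch

@torch.no_grad()
def parallel_decode(model, prompt_ids, gen_len=128, steps=8, mask_id=0):
    # Start with the generation region fully masked, then fill it in over a
    # small, fixed number of parallel refinement steps (instead of one token
    # per forward pass as in autoregressive decoding).
    ids = torch.cat([prompt_ids,
                     torch.full((gen_len,), mask_id, dtype=prompt_ids.dtype,
                                device=prompt_ids.device)])
    gen = slice(len(prompt_ids), len(ids))
    for step in range(steps):
        logits = model(ids.unsqueeze(0)).logits[0, gen]   # (gen_len, vocab); HF-style output assumed
        conf, pred = logits.softmax(-1).max(-1)           # per-position confidence and argmax token
        masked = ids[gen] == mask_id
        remaining = int(masked.sum().item())
        if remaining == 0:
            break
        k = max(1, remaining // (steps - step))           # commit enough to finish by the last step
        conf = conf.masked_fill(~masked, float("-inf"))   # never re-commit finished positions
        commit = conf.topk(k).indices
        ids[gen][commit] = pred[commit]
    return ids[gen]
```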

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.02193

• PDF: https://arxiv.org/pdf/2508.02193

• Project Page: https://seed.bytedance.com/en/seed_diffusion

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

🔹 Publication Date: Published on Aug 1

🔹 Abstract: LAMIC, a Layout-Aware Multi-Image Composition framework, extends single-reference diffusion models to multi-reference scenarios using attention mechanisms, achieving state-of-the-art performance in controllable image synthesis without training. AI-generated summary: In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
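
The abstract names Group Isolation Attention only at a high level; one plausible reading is an attention mask that keeps tokens of different reference images from attending to each other while leaving text tokens shared by all. The sketch below builds such a mask and is an illustration of that reading, not LAMIC's exact implementation.

```python
import torch

def group_isolation_mask(group_ids: torch.Tensor, text_len: int) -> torch.Tensor:
    """group_ids: (num_image_tokens,) id of the reference image each image token
    belongs to. Returns an additive mask of shape (L, L), L = text_len + num_image_tokens:
    0 where attention is allowed, -inf where it is blocked."""
    n_img = group_ids.numel()
    L = text_len + n_img
    allowed = torch.ones(L, L, dtype=torch.bool)            # text tokens stay fully shared
    same_group = group_ids[:, None] == group_ids[None, :]   # (n_img, n_img)
    allowed[text_len:, text_len:] = same_group               # isolate image tokens by group
    mask = torch.zeros(L, L)
    mask[~allowed] = float("-inf")
    return mask

# Example: two reference images with 4 and 3 tokens, plus 5 text tokens.
mask = group_isolation_mask(torch.tensor([0, 0, 0, 0, 1, 1, 1]), text_len=5)
```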

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.00477

• PDF: https://arxiv.org/pdf/2508.00477

• Github: https://github.com/Suchenl/LAMIC

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

🔹 Publication Date: Published on Jul 23

🔹 Abstract: PRIX, an end-to-end driving architecture using only camera data, achieves state-of-the-art performance with a Context-aware Recalibration Transformer, outperforming larger multimodal planners in efficiency and scalability. AI-generated summary: While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors, and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be available at https://maxiuw.github.io/prix.

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2507.17596
• PDF: https://arxiv.org/pdf/2507.17596

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success

🔹 Publication Date: Published on Aug 6

🔹 Abstract: A lightweight, hyperparameter-free RL algorithm, VL-DAC, enables VLMs to learn generalized policies from inexpensive simulators, improving performance on real-world benchmarks without sacrificing image understanding accuracy. AI-generated summary: Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions -- a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.
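
The key design here is the decoupling: policy updates at the action-token level, value learning at the environment-step level. A minimal sketch of how such a loss could be wired up is below; the tensor names, shapes, and the use of a plain clipped PPO surrogate are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def vl_dac_losses(new_logp, old_logp, step_adv, token_to_step,
                  step_values, step_returns, clip_eps=0.2):
    # new_logp, old_logp: (num_action_tokens,) log-probs of the sampled action tokens
    # step_adv:           (num_steps,) advantage estimate per environment step
    # token_to_step:      (num_action_tokens,) index of the step each token belongs to
    # step_values/step_returns: (num_steps,) critic prediction and return per step
    adv = step_adv[token_to_step]                  # broadcast each step's advantage to its tokens
    ratio = (new_logp - old_logp).exp()
    policy_loss = -torch.min(ratio * adv,
                             ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv).mean()
    # The critic is fit once per environment step, not per token.
    value_loss = F.mse_loss(step_values, step_returns)
    return policy_loss, value_loss
```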

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.04280

• PDF: https://arxiv.org/pdf/2508.04280

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation

🔹 Publication Date: Published on Aug 5

🔹 Abstract: Skywork UniPic, a 1.5 billion-parameter autoregressive model, unifies image understanding, text-to-image generation, and image editing with state-of-the-art performance on commodity hardware. AI-generated summary: We introduce Skywork UniPic, a 1.5 billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture (eliminating the need for task-specific adapters or inter-module connectors) and demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024 x 1024 images with under 15 GB of GPU memory (e.g., RTX 4090). Three design choices underpin these results: (1) a decoupled encoding strategy that leverages a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, all feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule scaling from 256 x 256 to 1024 x 1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100 million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at https://huggingface.co/Skywork/Skywork-UniPic-1.5B.

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.03320

• PDF: https://arxiv.org/pdf/2508.03320

• Github: https://github.com/SkyworkAI/UniPic

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
https://huggingface.co/spaces/Skywork/UniPic
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

🔹 Publication Date: Published on Aug 5

🔹 Abstract: SonicMaster, a unified generative model, improves music audio quality by addressing various artifacts using text-based control and a flow-matching generative training paradigm. AI-generated summary: Music recordings often suffer from audio quality issues such as excessive reverberation, distortion, clipping, tonal imbalances, and a narrowed stereo image, especially when created in non-professional settings without specialized equipment or expertise. These problems are typically corrected using separate specialized tools and manual adjustments. In this paper, we introduce SonicMaster, the first unified generative model for music restoration and mastering that addresses a broad spectrum of audio artifacts with text-based control. SonicMaster is conditioned on natural language instructions to apply targeted enhancements, or can operate in an automatic mode for general restoration. To train this model, we construct the SonicMaster dataset, a large dataset of paired degraded and high-quality tracks, by simulating common degradation types with nineteen degradation functions belonging to five enhancement groups: equalization, dynamics, reverb, amplitude, and stereo. Our approach leverages a flow-matching generative training paradigm to learn an audio transformation that maps degraded inputs to their cleaned, mastered versions guided by text prompts. Objective audio quality metrics demonstrate that SonicMaster significantly improves sound quality across all artifact categories. Furthermore, subjective listening tests confirm that listeners prefer SonicMaster's enhanced outputs over the original degraded audio, highlighting the effectiveness of our unified approach.
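
For reference, a standard conditional flow-matching objective of the kind the abstract alludes to is shown below; the exact parameterization and conditioning used by SonicMaster may differ.

```latex
% Rectified-flow style conditional flow matching (generic form):
% x_1: clean mastered audio (latent), x_0 ~ N(0, I), x_t = (1 - t) x_0 + t x_1,
% c: conditioning (degraded input and text instruction), v_\theta: learned velocity field.
\mathcal{L}_{\mathrm{FM}}
  = \mathbb{E}_{t \sim \mathcal{U}[0,1],\, x_0,\, x_1,\, c}
    \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2
```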

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.03448

• PDF: https://arxiv.org/pdf/2508.03448

• Project Page: https://amaai-lab.github.io/SonicMaster/

🔹 Datasets citing this paper:
https://huggingface.co/datasets/amaai-lab/SonicMasterDataset

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe

🔹 Publication Date: Published on Aug 3

🔹 Abstract: Voxlect is a benchmark for evaluating speech foundation models on dialect classification and downstream applications across multiple languages and dialects. AI-generated summary: We present Voxlect, a novel benchmark for modeling dialects and regional languages worldwide using speech foundation models. Specifically, we report comprehensive benchmark evaluations on dialects and regional language varieties in English, Arabic, Mandarin and Cantonese, Tibetan, Indic languages, Thai, Spanish, French, German, Brazilian Portuguese, and Italian. Our study used over 2 million training utterances from 30 publicly available speech corpora that are provided with dialectal information. We evaluate the performance of several widely used speech foundation models in classifying speech dialects. We assess the robustness of the dialectal models under noisy conditions and present an error analysis that highlights modeling results aligned with geographic continuity. In addition to benchmarking dialect classification, we demonstrate several downstream applications enabled by Voxlect. Specifically, we show that Voxlect can be applied to augment existing speech recognition datasets with dialect information, enabling a more detailed analysis of ASR performance across dialectal variations. Voxlect is also used as a tool to evaluate the performance of speech generation systems. Voxlect is publicly available under the RAIL family of licenses at: https://github.com/tiantiaf0627/voxlect.
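
As a rough illustration of how a speech foundation model can be probed for dialect classification, the sketch below freezes a generic encoder and trains only a linear head on mean-pooled features; the encoder choice, pooling, and head are assumptions, not Voxlect's exact recipe.

```python
import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, AutoModel

encoder_name = "facebook/wav2vec2-base"          # assumption: any speech foundation model works here
feature_extractor = AutoFeatureExtractor.from_pretrained(encoder_name)
encoder = AutoModel.from_pretrained(encoder_name).eval()

class DialectProbe(nn.Module):
    def __init__(self, hidden_size: int, num_dialects: int):
        super().__init__()
        self.head = nn.Linear(hidden_size, num_dialects)

    def forward(self, waveforms):
        # waveforms: list of 1-D numpy arrays sampled at 16 kHz
        inputs = feature_extractor(waveforms, sampling_rate=16_000,
                                   return_tensors="pt", padding=True)
        with torch.no_grad():                     # keep the foundation model frozen
            hidden = encoder(**inputs).last_hidden_state   # (B, T, H)
        pooled = hidden.mean(dim=1)               # utterance-level embedding
        return self.head(pooled)

probe = DialectProbe(encoder.config.hidden_size, num_dialects=10)
```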

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.01691

• PDF: https://arxiv.org/pdf/2508.01691

• Github: https://github.com/tiantiaf0627/voxlect/tree/main

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding

🔹 Publication Date: Published on Aug 2

🔹 Abstract: A benchmark and model for 3D occupancy grounding using natural language and voxel-level annotations improve object perception in autonomous driving. AI-generated summary: Visual grounding aims to identify objects or regions in a scene based on natural language descriptions, essential for spatially aware perception in autonomous driving. However, existing visual grounding tasks typically depend on bounding boxes that often fail to capture fine-grained details. Not all voxels within a bounding box are occupied, resulting in inaccurate object representations. To address this, we introduce a benchmark for 3D occupancy grounding in challenging outdoor scenes. Built on the nuScenes dataset, it integrates natural language with voxel-level occupancy annotations, offering more precise object perception compared to the traditional grounding task. Moreover, we propose GroundingOcc, an end-to-end model designed for 3D occupancy grounding through multi-modal learning. It combines visual, textual, and point cloud features to predict object location and occupancy information from coarse to fine. Specifically, GroundingOcc comprises a multimodal encoder for feature extraction, an occupancy head for voxel-wise predictions, and a grounding head to refine localization. Additionally, a 2D grounding module and a depth estimation module enhance geometric understanding, thereby boosting model performance. Extensive experiments on the benchmark demonstrate that our method outperforms existing baselines on 3D occupancy grounding. The dataset is available at https://github.com/RONINGOD/GroundingOcc.

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.01197

• PDF: https://arxiv.org/pdf/2508.01197

• Github: https://github.com/RONINGOD/GroundingOcc

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: R-Zero: Self-Evolving Reasoning LLM from Zero Data

🔹 Publication Date: Published on Aug 7

🔹 Abstract: R-Zero is a self-evolving framework that autonomously generates and learns from its own training data, improving reasoning capabilities in LLMs without human-curated tasks. AI-generated summary: Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks and labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
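
A high-level sketch of the Challenger/Solver co-evolution loop implied by the abstract is shown below. The reward shaping (peaking when the Solver succeeds about half the time) matches the "edge of capability" idea, but the `propose_tasks` / `attempt` / `rl_update` hooks are placeholders, not the paper's exact formulation.

```python
def r_zero_round(challenger, solver, n_tasks=512, k_attempts=8):
    # Challenger proposes a batch of candidate tasks.
    tasks = challenger.propose_tasks(n_tasks)

    solver_rewards, challenger_rewards = [], []
    for task in tasks:
        # Solver attempts each task several times; the empirical success rate
        # approximates how hard the task is for the current Solver.
        attempts = [solver.attempt(task) for _ in range(k_attempts)]
        success_rate = sum(a.is_correct for a in attempts) / k_attempts
        # Challenger reward is highest for tasks near the edge of the Solver's
        # capability (success rate around 0.5), per the abstract.
        challenger_rewards.append(1.0 - abs(success_rate - 0.5) * 2.0)
        solver_rewards.append([a.is_correct for a in attempts])

    # Both models are updated separately and co-evolve over rounds.
    challenger.rl_update(tasks, challenger_rewards)
    solver.rl_update(tasks, solver_rewards)
```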

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.05004

• PDF: https://arxiv.org/pdf/2508.05004

• Project Page: https://chengsong-huang.github.io/R-Zero.github.io/

• Github: https://github.com/Chengsong-Huang/R-Zero

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks

🔹 Publication Date: Published on Jul 26

🔹 Abstract: AgentTTS, an LLM-agent-based framework, optimizes compute allocation for multi-stage complex tasks, improving performance and robustness compared to traditional methods. AI-generated summary: Test-time scaling (TTS) enhances the performance of large language models (LLMs) by allocating additional compute resources during inference. However, existing research primarily investigates TTS in single-stage tasks, while many real-world problems are multi-stage complex tasks, composed of a sequence of heterogeneous subtasks, each requiring an LLM with a specific capability. Therefore, we study a novel problem: test-time compute-optimal scaling in multi-stage complex tasks, aiming to select suitable models and allocate budgets per subtask to maximize overall performance. TTS in multi-stage tasks introduces two fundamental challenges: (i) the combinatorial search space of model and budget allocations, combined with the high cost of inference, makes brute-force search impractical; and (ii) the optimal model and budget allocations across subtasks are interdependent, increasing the complexity of the compute-optimal search. To address this gap, we conduct extensive pilot experiments on four tasks across six datasets, deriving three empirical insights characterizing the behavior of LLMs in multi-stage complex tasks. Informed by these insights, we propose AgentTTS, an LLM-agent-based framework that autonomously searches for compute-optimal allocations through iterative feedback-driven interactions with the execution environment. Experimental results demonstrate that AgentTTS significantly outperforms traditional and other LLM-based baselines in search efficiency, and shows improved robustness to varying training set sizes and enhanced interpretability.

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.00890

• PDF: https://arxiv.org/pdf/2508.00890

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: Flow Equivariant Recurrent Neural Networks

🔹 Publication Date: Published on Jul 20

🔹 Abstract: Equivariant neural network architectures are extended to handle time-parameterized transformations, improving performance in sequence models like RNNs. AI-generated summary: Data arrives at our senses as a continuous stream, smoothly transforming from one instant to the next. These smooth transformations can be viewed as continuous symmetries of the environment that we inhabit, defining equivalence relations between stimuli over time. In machine learning, neural network architectures that respect symmetries of their data are called equivariant and have provable benefits in terms of generalization ability and sample efficiency. To date, however, equivariance has been considered only for static transformations and feed-forward networks, limiting its applicability to sequence models, such as recurrent neural networks (RNNs), and corresponding time-parameterized sequence transformations. In this work, we extend equivariant network theory to this regime of 'flows' -- one-parameter Lie subgroups capturing natural transformations over time, such as visual motion. We begin by showing that standard RNNs are generally not flow equivariant: their hidden states fail to transform in a geometrically structured manner for moving stimuli. We then show how flow equivariance can be introduced, and demonstrate that these models significantly outperform their non-equivariant counterparts in terms of training speed, length generalization, and velocity generalization, on both next-step prediction and sequence classification. We present this work as a first step towards building sequence models that respect the time-parameterized symmetries which govern the world around us.
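
As a minimal formal statement of the idea, one can write a flow-equivariance condition for a recurrent update roughly as below; the notation is illustrative and the paper's precise definition may differ.

```latex
% Recurrent update: h_{t+1} = f(x_t, h_t).
% \{\rho_s\} is a one-parameter flow acting on inputs (e.g. visual motion with
% velocity \nu), with a corresponding action on hidden states. Flow equivariance
% asks that shifting the input sequence along the flow shifts the hidden states
% along with it:
f\!\left(\rho_{t\nu}\, x_t,\; \rho_{t\nu}\, h_t\right) \;=\; \rho_{(t+1)\nu}\, f(x_t, h_t)
\qquad \text{for all } t \text{ and velocities } \nu .
```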

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2507.14793

• PDF: https://arxiv.org/pdf/2507.14793

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation

🔹 Publication Date: Published on Aug 5

🔹 Abstract: LongVie, an end-to-end autoregressive framework, addresses temporal consistency and visual degradation in ultra-long video generation through unified noise initialization, global control signal normalization, multi-modal control, and degradation-aware training. AI-generated summary: Controllable ultra-long video generation is a fundamental yet challenging task. Although existing methods are effective for short clips, they struggle to scale due to issues such as temporal inconsistency and visual degradation. In this paper, we initially investigate and identify three key factors: separate noise initialization, independent control signal normalization, and the limitations of single-modality guidance. To address these issues, we propose LongVie, an end-to-end autoregressive framework for controllable long video generation. LongVie introduces two core designs to ensure temporal consistency: 1) a unified noise initialization strategy that maintains consistent generation across clips, and 2) global control signal normalization that enforces alignment in the control space throughout the entire video. To mitigate visual degradation, LongVie employs 3) a multi-modal control framework that integrates both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals, complemented by 4) a degradation-aware training strategy that adaptively balances modality contributions over time to preserve visual quality. We also introduce LongVGenBench, a comprehensive benchmark consisting of 100 high-resolution videos spanning diverse real-world and synthetic environments, each lasting over one minute. Extensive experiments show that LongVie achieves state-of-the-art performance in long-range controllability, consistency, and quality.
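
The two consistency mechanisms named in the abstract are simple to state in code: reuse one noise initialization across clips, and normalize dense control signals with whole-video statistics rather than per-clip statistics. The sketch below is illustrative only; the shapes and the min-max normalization are assumptions, not LongVie's exact implementation.

```python
import torch

def unified_noise(shape, seed=1234, device="cpu"):
    # Same seed for every clip -> identical initial latent noise across clips.
    g = torch.Generator(device=device).manual_seed(seed)
    return torch.randn(shape, generator=g, device=device)

def normalize_controls_globally(control_clips):
    # control_clips: list of (T, H, W) tensors, e.g. per-clip depth maps.
    full = torch.cat(control_clips, dim=0)
    lo, hi = full.min(), full.max()               # statistics over the whole video,
    return [(c - lo) / (hi - lo + 1e-8)           # not recomputed per clip
            for c in control_clips]
```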

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.03694

• PDF: https://arxiv.org/pdf/2508.03694

• Project Page: https://vchitect.github.io/LongVie-project

• Github: https://github.com/Vchitect/LongVie

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

🔹 Publication Date: Published on Jul 31

🔹 Abstract: Transformer-based text-to-image diffusion models show varying degrees of content-style separation in generated artworks, as revealed by cross-attention heatmaps. AI-generated summary Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.
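
In the spirit of the paper's analysis, cross-attention maps captured during sampling can be aggregated into per-token heatmaps roughly as follows; the capture hooks are library-specific and omitted here, and the shapes below are assumptions.

```python
import torch

def token_heatmaps(attn_maps, token_indices, out_hw):
    # attn_maps: list of (heads, H_i * W_i, num_text_tokens) cross-attention maps
    #            collected across layers/timesteps (assumes square latent grids).
    # token_indices: prompt-token positions to attribute (e.g. content vs. style words).
    # out_hw: (H, W) of the output heatmap.
    maps = []
    for a in attn_maps:
        side = int(a.shape[1] ** 0.5)
        m = a.mean(0)[:, token_indices].mean(-1)          # average over heads and selected tokens
        m = m.reshape(1, 1, side, side)
        maps.append(torch.nn.functional.interpolate(m, size=out_hw, mode="bilinear"))
    heat = torch.cat(maps).mean(0)[0]                      # average over layers/timesteps
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
```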

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2507.23313

• PDF: https://arxiv.org/pdf/2507.23313

• Project Page: https://thecowofrembrandt.islab.di.unimi.it/

• Github: https://github.com/umilISLab/artistic-prompt-interpretation

🔹 Datasets citing this paper:
https://huggingface.co/datasets/sergiopicascia/thecowofrembrandt

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning

🔹 Publication Date: Published on Aug 7

🔹 Abstract: DeepPHY evaluates Vision Language Models' physical reasoning and control through simulated environments with varying difficulty levels. AI-generated summary: Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, usually necessitating an understanding of the physics rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs' understanding of and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control.

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.05405

• PDF: https://arxiv.org/pdf/2508.05405

• Project Page: https://github.com/XinrunXu/DeepPHY

• Github: https://github.com/XinrunXu/DeepPHY

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
This channel is for Programmers, Coders, and Software Engineers.

0️⃣ Python
1️⃣ Data Science
2️⃣ Machine Learning
3️⃣ Data Visualization
4️⃣ Artificial Intelligence
5️⃣ Data Analysis
6️⃣ Statistics
7️⃣ Deep Learning
8️⃣ Programming Languages

https://t.iss.one/addlist/8_rRW2scgfRhOTc0

https://t.iss.one/Codeprogrammer
🔹 Title: OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art Transformers for Biomedical NER Across 12 Public Datasets

🔹 Publication Date: Published on Aug 3

🔹 Abstract: OpenMed NER, a suite of open-source transformer models using DAPT and LoRA, achieves state-of-the-art performance on diverse biomedical NER benchmarks with high efficiency and low computational cost. AI-generated summary: Named-entity recognition (NER) is fundamental to extracting structured information from the >80% of healthcare data that resides in unstructured clinical notes and biomedical literature. Despite recent advances with large language models, achieving state-of-the-art performance across diverse entity types while maintaining computational efficiency remains a significant challenge. We introduce OpenMed NER, a suite of open-source, domain-adapted transformer models that combine lightweight domain-adaptive pre-training (DAPT) with parameter-efficient Low-Rank Adaptation (LoRA). Our approach performs cost-effective DAPT on a 350k-passage corpus compiled from ethically sourced, publicly available research repositories and de-identified clinical notes (PubMed, arXiv, and MIMIC-III) using DeBERTa-v3, PubMedBERT, and BioELECTRA backbones. This is followed by task-specific fine-tuning with LoRA, which updates less than 1.5% of model parameters. We evaluate our models on 12 established biomedical NER benchmarks spanning chemicals, diseases, genes, and species. OpenMed NER achieves new state-of-the-art micro-F1 scores on 10 of these 12 datasets, with substantial gains across diverse entity types. Our models advance the state-of-the-art on foundational disease and chemical benchmarks (e.g., BC5CDR-Disease, +2.70 pp), while delivering even larger improvements of over 5.3 and 9.7 percentage points on more specialized gene and clinical cell line corpora. This work demonstrates that strategically adapted open-source models can surpass closed-source solutions. This performance is achieved with remarkable efficiency: training completes in under 12 hours on a single GPU with a low carbon footprint (…)
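
Below is a minimal sketch of LoRA-based task fine-tuning on a DeBERTa-v3 token-classification backbone, in the spirit of the recipe described; the rank, alpha, label count, and target-module names are illustrative assumptions rather than the paper's reported settings.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

backbone = "microsoft/deberta-v3-base"            # a DAPT checkpoint would be loaded here instead
tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForTokenClassification.from_pretrained(backbone, num_labels=5)  # label count is an assumption

lora_cfg = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=16, lora_alpha=32, lora_dropout=0.1,        # illustrative hyperparameters
    target_modules=["query_proj", "value_proj"],  # DeBERTa-v3 attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()                # only a small fraction of weights is trainable
```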

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.01630

• PDF: https://arxiv.org/pdf/2508.01630

• Project Page: https://huggingface.co/OpenMed

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
https://huggingface.co/spaces/OpenMed/openmed-clinical-ner

https://huggingface.co/spaces/lz-12/space
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

🔹 Publication Date: Published on Aug 3

🔹 Abstract: A new training paradigm and situated embedding models (SitEmb) enhance retrieval performance by conditioning short text chunks on broader context windows, outperforming state-of-the-art models with fewer parameters. AI-generated summary: Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
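
Conceptually, a situated chunk embedding conditions a short chunk on its surrounding context while retrieval still returns only the chunk. The sketch below approximates this naively by prefixing the context before encoding with an off-the-shelf model; the paper argues that exactly this kind of naive strategy is insufficient, which is what motivates training dedicated SitEmb models.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("BAAI/bge-m3")      # same backbone family as SitEmb-v1

def situated_embeddings(chunks, window=2):
    # Each chunk is embedded together with its neighbouring chunks as context,
    # but only the chunk itself is returned as retrieval evidence later.
    texts = []
    for i, chunk in enumerate(chunks):
        context = " ".join(chunks[max(0, i - window): i + window + 1])
        texts.append(f"Context: {context}\nPassage: {chunk}")
    return encoder.encode(texts, normalize_embeddings=True)

def retrieve(query, chunks, embs, top_k=3):
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = embs @ q                              # cosine similarity (embeddings are normalized)
    return [chunks[i] for i in np.argsort(-scores)[:top_k]]
```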

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.01959

• PDF: https://arxiv.org/pdf/2508.01959

• Project Page: https://huggingface.co/SituatedEmbedding

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title: On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

🔹 Publication Date: Published on Aug 7

🔹 Abstract: Dynamic Fine-Tuning (DFT) improves the generalization of Large Language Models (LLMs) by dynamically rescaling gradients, outperforming standard Supervised Fine-Tuning (SFT) and showing competitive results in offline reinforcement learning. AI-generated summary: We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of this token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
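
One way to read the described "single-line change" is to reweight each token's cross-entropy by that token's predicted probability, treated as a constant. The sketch below shows that reading; tensor names are assumptions, and the authors' repository has the exact implementation.

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, labels, ignore_index=-100):
    # logits: (B, T, V); labels: (B, T) with ignore_index on padded positions.
    logp = F.log_softmax(logits, dim=-1)
    mask = labels != ignore_index
    safe_labels = labels.clamp_min(0)                                    # avoid gathering at -100
    token_logp = logp.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    token_prob = token_logp.exp().detach()          # dynamic rescaling factor, no gradient
    # Standard SFT would average -token_logp; DFT-style rescaling multiplies by token_prob.
    return -(token_prob * token_logp)[mask].mean()
```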

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.05629

• PDF: https://arxiv.org/pdf/2508.05629

• Github: https://github.com/yongliang-wu/DFT

🔹 Datasets citing this paper:
No datasets found

🔹 Spaces citing this paper:
No spaces found
==================================

For more data science resources:
https://t.iss.one/DataScienceT