🔹 Title: 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding
🔹 Publication Date: Published on Jul 31
🔹 Abstract: 3D-R1 enhances 3D scene understanding through a high-quality synthetic dataset, reinforcement learning with GRPO, and dynamic view selection, achieving significant improvements in reasoning and generalization. AI-generated summary: Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to the scarcity of high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with chain-of-thought (CoT) annotations, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro; it serves as cold-start initialization data for 3D-R1. Moreover, we apply an RLHF-style policy-optimization algorithm, GRPO, during reinforcement learning to strengthen reasoning, and introduce three reward functions: a perception reward, a semantic similarity reward, and a format reward that together maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.
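As a rough illustration of how the three rewards named above could be combined during GRPO-style training, the sketch below defines hypothetical perception (3D IoU), semantic-similarity (cosine), and format (template check) reward terms with illustrative weights; the exact formulations used in 3D-R1 are not reproduced here.
```python
import re
import numpy as np

def perception_reward(pred_box, gt_box):
    """Hypothetical perception reward: 3D IoU between predicted and ground-truth
    axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    pred_box, gt_box = np.asarray(pred_box, float), np.asarray(gt_box, float)
    lo, hi = np.maximum(pred_box[:3], gt_box[:3]), np.minimum(pred_box[3:], gt_box[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    union = np.prod(pred_box[3:] - pred_box[:3]) + np.prod(gt_box[3:] - gt_box[:3]) - inter
    return float(inter / union) if union > 0 else 0.0

def semantic_reward(pred_emb, gt_emb):
    """Hypothetical semantic-similarity reward: cosine similarity of answer embeddings."""
    pred_emb, gt_emb = np.asarray(pred_emb, float), np.asarray(gt_emb, float)
    denom = np.linalg.norm(pred_emb) * np.linalg.norm(gt_emb) + 1e-8
    return float(np.dot(pred_emb, gt_emb) / denom)

def format_reward(answer: str) -> float:
    """Hypothetical format reward: 1 if the answer follows a
    <think>...</think><answer>...</answer> template, else 0."""
    pattern = r"(?s)<think>.*</think>\s*<answer>.*</answer>"
    return 1.0 if re.fullmatch(pattern, answer.strip()) else 0.0

def total_reward(pred_box, gt_box, pred_emb, gt_emb, answer,
                 w_perc=1.0, w_sem=1.0, w_fmt=0.5):
    """Weighted sum of the three reward terms (weights are illustrative)."""
    return (w_perc * perception_reward(pred_box, gt_box)
            + w_sem * semantic_reward(pred_emb, gt_emb)
            + w_fmt * format_reward(answer))
```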
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2507.23478
• PDF: https://arxiv.org/pdf/2507.23478
• Project Page: https://aigeeksgroup.github.io/3D-R1
• Github: https://github.com/AIGeeksGroup/3D-R1
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Multi-Label Knowledge Distillation
🔹 Publication Date: Published on Aug 12, 2023
🔹 Abstract:
The proposed method improves multi-label knowledge distillation by decomposing it into binary classification problems and leveraging label-wise embeddings to enhance feature representation distinctiveness. AI-generated summary: Existing knowledge distillation methods typically work by imparting the knowledge of output logits or intermediate feature maps from the teacher network to the student network, which is very successful in multi-class single-label learning. However, these methods can hardly be extended to the multi-label learning scenario, where each instance is associated with multiple semantic labels, because the prediction probabilities do not sum to one and feature maps of the whole example may ignore minor classes in such a scenario. In this paper, we propose a novel multi-label knowledge distillation method. On one hand, it exploits the informative semantic knowledge from the logits by dividing the multi-label learning problem into a set of binary classification problems; on the other hand, it enhances the distinctiveness of the learned feature representations by leveraging the structural information of label-wise embeddings. Experimental results on multiple benchmark datasets validate that the proposed method can avoid knowledge counteraction among labels, thus achieving superior performance against diverse competing methods. Our code is available at: https://github.com/penghui-yang/L2D
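To make the binary-decomposition idea concrete, here is a minimal distillation loss that treats each label as an independent binary problem and matches the teacher's and student's per-label probabilities; it is a simplified sketch rather than the exact L2D objective, and the temperature value is an assumption.
```python
import torch
import torch.nn.functional as F

def multi_label_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Simplified multi-label distillation: per-label sigmoid probabilities
    (which need not sum to one) are matched with binary cross-entropy,
    i.e. one binary problem per label."""
    t_prob = torch.sigmoid(teacher_logits / temperature).detach()
    s_prob = torch.sigmoid(student_logits / temperature)
    return F.binary_cross_entropy(s_prob, t_prob)

# Usage: logits of shape (batch, num_labels)
student_logits = torch.randn(8, 20, requires_grad=True)
teacher_logits = torch.randn(8, 20)
loss = multi_label_distillation_loss(student_logits, teacher_logits)
loss.backward()
```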
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2308.06453
• PDF: https://arxiv.org/pdf/2308.06453
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: PixNerd: Pixel Neural Field Diffusion
🔹 Publication Date: Published on Jul 31
🔹 Abstract: Pixel Neural Field Diffusion (PixNerd) achieves high-quality image generation in a single-scale, single-stage process without VAEs or complex pipelines, and extends to text-to-image applications with competitive performance. AI-generated summary: The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieve 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieves a competitive 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark.
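The patch-wise neural-field decoding can be pictured as a small MLP that maps a per-patch latent plus normalized pixel coordinates to RGB values. The toy module below sketches that idea with placeholder dimensions; it is not the actual PixNerd decoder.
```python
import torch
import torch.nn as nn

class PatchNeuralFieldDecoder(nn.Module):
    """Toy sketch of patch-wise neural-field decoding: an MLP maps a per-patch
    latent plus normalized (x, y) coordinates to RGB values. Sizes are illustrative."""
    def __init__(self, latent_dim=256, hidden=128, patch_size=16):
        super().__init__()
        self.patch_size = patch_size
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, patch_latents):
        # patch_latents: (num_patches, latent_dim)
        p = self.patch_size
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, p), torch.linspace(-1, 1, p), indexing="ij")
        coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)        # (p*p, 2)
        n = patch_latents.shape[0]
        lat = patch_latents[:, None, :].expand(n, p * p, -1)         # (n, p*p, D)
        inp = torch.cat([lat, coords[None].expand(n, -1, -1)], dim=-1)
        rgb = self.mlp(inp)                                          # (n, p*p, 3)
        return rgb.view(n, p, p, 3)

decoder = PatchNeuralFieldDecoder()
patches = decoder(torch.randn(4, 256))   # decode 4 patches to 16x16 RGB
```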
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2507.23268
• PDF: https://arxiv.org/pdf/2507.23268
• Project Page: https://huggingface.co/spaces/MCG-NJU/PixNerd
• Github: https://github.com/MCG-NJU/PixNerd
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
• https://huggingface.co/spaces/MCG-NJU/PixNerd
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: CellForge: Agentic Design of Virtual Cell Models
🔹 Publication Date: Published on Aug 4
🔹 Abstract: CellForge, an agentic system using a multi-agent framework, transforms raw single-cell multi-omics data into optimized computational models for virtual cells, outperforming state-of-the-art methods in single-cell perturbation prediction. AI-generated summary: Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to quantitatively predict cellular responses to diverse perturbations. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system that leverages a multi-agent framework to transform biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training virtual cell models and running inference. The framework integrates three core modules: Task Analysis for dataset characterization and relevant literature retrieval; Method Design, where specialized agents collaboratively develop optimized modeling strategies; and Experiment Execution for automated code generation. The agents in the Design module are split into experts with differing perspectives and a central moderator, and must exchange solutions until they reach a reasonable consensus. We demonstrate CellForge's capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives provides better solutions than directly addressing a modeling challenge. Our code is publicly available at https://github.com/gersteinlab/CellForge.
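The Method Design loop described above (experts propose, a moderator checks for consensus, repeat) can be caricatured with a few stub functions. In CellForge the proposals and moderation would be LLM calls, so everything below, from the agent names to the consensus score, is a hypothetical stand-in rather than the actual pipeline.
```python
from dataclasses import dataclass

@dataclass
class Proposal:
    author: str
    strategy: str

def expert_propose(name: str, task: str, prior: list[Proposal]) -> Proposal:
    """Stub expert: in a real system this would be an LLM call conditioned on the
    task analysis and the other experts' latest proposals."""
    seen = ", ".join(p.strategy for p in prior) or "none"
    return Proposal(name, f"{name}-strategy for '{task}' (considering: {seen})")

def moderator_consensus(proposals: list[Proposal]) -> float:
    """Stub moderator: fraction of proposals that reference every other expert,
    used here as a toy proxy for agreement."""
    authors = [p.author for p in proposals]
    hits = sum(all(a in p.strategy for a in authors if a != p.author) for p in proposals)
    return hits / len(proposals)

def method_design(task: str, experts: list[str], max_rounds: int = 5, threshold: float = 0.9):
    """Iterate propose/critique rounds until the moderator reports enough consensus."""
    proposals: list[Proposal] = []
    for _ in range(max_rounds):
        proposals = [expert_propose(name, task, proposals) for name in experts]
        if moderator_consensus(proposals) >= threshold:
            break
    return proposals

plans = method_design("single-cell perturbation prediction",
                      ["stats", "deep-learning", "biology"])
```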
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.02276
• PDF: https://arxiv.org/pdf/2508.02276
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: Qwen-Image Technical Report
🔹 Publication Date: Published on Aug 4
🔹 Abstract: Qwen-Image, an image generation model, advances text rendering and image editing through a comprehensive data pipeline, progressive training, and a dual-encoding mechanism. AI-generated summary: We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.
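A minimal way to picture the dual-encoding mechanism is a conditioner that projects the semantic tokens (from the vision-language encoder) and the reconstructive VAE latents into one conditioning sequence for the editing module. The module below is an illustrative sketch with placeholder dimensions, not the Qwen-Image implementation.
```python
import torch
import torch.nn as nn

class DualEncodingConditioner(nn.Module):
    """Sketch of dual encoding: the same input image is encoded twice, once by a
    semantic encoder (standing in for Qwen2.5-VL) and once by a reconstructive VAE
    encoder, and the two streams are projected and concatenated into a single
    conditioning sequence. All sizes are placeholders."""
    def __init__(self, sem_dim=1024, vae_dim=16, cond_dim=512):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, cond_dim)   # semantic tokens -> cond space
        self.vae_proj = nn.Linear(vae_dim, cond_dim)   # VAE latent patches -> cond space

    def forward(self, sem_tokens, vae_latents):
        # sem_tokens:  (B, N_sem, sem_dim) from the vision-language encoder
        # vae_latents: (B, N_lat, vae_dim) from the VAE encoder, flattened spatially
        cond = torch.cat([self.sem_proj(sem_tokens), self.vae_proj(vae_latents)], dim=1)
        return cond  # (B, N_sem + N_lat, cond_dim), fed to the editing module

cond = DualEncodingConditioner()(torch.randn(2, 77, 1024), torch.randn(2, 1024, 16))
```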
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.02324
• PDF: https://arxiv.org/pdf/2508.02324
• Github: https://github.com/QwenLM/Qwen-Image
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: ReMoMask: Retrieval-Augmented Masked Motion Generation
🔹 Publication Date: Published on Aug 4
🔹 Abstract: ReMoMask, a unified framework, addresses limitations in text-to-motion generation by integrating a Bidirectional Momentum Text-Motion Model, Semantic Spatio-temporal Attention, and RAG-Classifier-Free Guidance, achieving state-of-the-art performance on the HumanML3D and KIT-ML benchmarks. AI-generated summary: Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) a Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) a Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance incorporates minor unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, achieving 3.88% and 10.97% improvements in FID scores on HumanML3D and KIT-ML, respectively, compared to the previous SOTA method RAG-T2M. Code: https://github.com/AIGeeksGroup/ReMoMask. Website: https://aigeeksgroup.github.io/ReMoMask.
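The momentum-queue idea behind the Bidirectional Momentum Text-Motion Model is essentially MoCo-style contrastive learning: a slowly updated key encoder fills a fixed-size queue of negatives, so the number of negatives no longer depends on the batch size. The sketch below assumes simple linear encoders, a queue of 8192 entries, and a temperature of 0.07; none of these match ReMoMask's actual configuration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MomentumQueue(nn.Module):
    """MoCo-style momentum queue sketch: query encoder is trained, key encoder is
    an exponential moving average, and keys are pushed into a fixed-size queue of
    negatives that decouples negative count from batch size."""
    def __init__(self, encoder_q: nn.Module, encoder_k: nn.Module,
                 dim=256, queue_size=8192, momentum=0.999):
        super().__init__()
        self.encoder_q, self.encoder_k, self.m = encoder_q, encoder_k, momentum
        self.encoder_k.load_state_dict(self.encoder_q.state_dict())
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _update_key_encoder(self):
        for pq, pk in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            pk.data.mul_(self.m).add_(pq.data, alpha=1 - self.m)

    @torch.no_grad()
    def _enqueue(self, keys):
        b, i = keys.shape[0], int(self.ptr)
        self.queue[i:i + b] = keys                   # assumes queue_size % batch == 0
        self.ptr[0] = (i + b) % self.queue.shape[0]

    def forward(self, queries, keys_input):
        q = F.normalize(self.encoder_q(queries), dim=1)
        with torch.no_grad():
            self._update_key_encoder()
            k = F.normalize(self.encoder_k(keys_input), dim=1)
        # positives: matching text/motion pairs; negatives: the whole queue
        logits = torch.cat([(q * k).sum(1, keepdim=True), q @ self.queue.t()], dim=1)
        labels = torch.zeros(q.shape[0], dtype=torch.long)
        self._enqueue(k)
        return F.cross_entropy(logits / 0.07, labels)

loss = MomentumQueue(nn.Linear(64, 256), nn.Linear(64, 256))(
    torch.randn(32, 64), torch.randn(32, 64))
```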
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.02605
• PDF: https://arxiv.org/pdf/2508.02605
• Project Page: https://aigeeksgroup.github.io/ReMoMask/
• Github: https://github.com/AIGeeksGroup/ReMoMask
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: Embedding-Aware Quantum-Classical SVMs for Scalable Quantum Machine Learning
🔹 Publication Date: Published on Jul 28
🔹 Abstract: Combining Vision Transformer embeddings with quantum-classical pipelines achieves quantum advantage in classification tasks, demonstrating the importance of embedding choice in quantum machine learning. AI-generated summary: Quantum Support Vector Machines face scalability challenges due to high-dimensional quantum states and hardware limitations. We propose an embedding-aware quantum-classical pipeline combining class-balanced k-means distillation with pretrained Vision Transformer embeddings. Our key finding: ViT embeddings uniquely enable quantum advantage, achieving up to 8.02% accuracy improvements over classical SVMs on Fashion-MNIST and 4.42% on MNIST, while CNN features show performance degradation. Using 16-qubit tensor network simulation via cuTensorNet, we provide the first systematic evidence that quantum kernel advantage depends critically on embedding choice, revealing fundamental synergy between transformer attention and quantum feature spaces. This provides a practical pathway for scalable quantum machine learning that leverages modern neural architectures.
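Below is a small sketch of the classical side of this pipeline: class-balanced k-means distillation compresses per-class embeddings into centroids, and a kernel SVM is trained on the result. The embeddings are random stand-ins for ViT features, and a classical RBF kernel replaces the 16-qubit quantum kernel so the example stays runnable; treat every number as an assumption.
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def class_balanced_kmeans_distill(X, y, per_class=32, seed=0):
    """Class-balanced k-means distillation: cluster each class's embeddings and
    keep the centroids, yielding a compact, balanced training set."""
    Xs, ys = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        k = min(per_class, len(Xc))
        km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(Xc)
        Xs.append(km.cluster_centers_)
        ys.append(np.full(k, c))
    return np.vstack(Xs), np.concatenate(ys)

# Random stand-ins for ViT embeddings; real usage would extract pretrained ViT
# features from MNIST / Fashion-MNIST images first.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 768)).astype(np.float32)
y = rng.integers(0, 10, size=2000)

X_small, y_small = class_balanced_kmeans_distill(X, y, per_class=16)
# The paper feeds the distilled set to a quantum kernel (tensor-network simulation);
# a classical RBF-kernel SVM stands in here to keep the sketch self-contained.
clf = SVC(kernel="rbf", gamma="scale").fit(X_small, y_small)
print(clf.score(X[:200], y[:200]))
```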
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.00024
• PDF: https://arxiv.org/pdf/2508.00024
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: Personalized Safety Alignment for Text-to-Image Diffusion Models
🔹 Publication Date: Published on Aug 2
🔹 Abstract: A personalized safety alignment framework integrates user-specific profiles into text-to-image diffusion models to better align generated content with individual safety preferences. AI-generated summary: Text-to-image diffusion models have revolutionized visual content generation, but current safety mechanisms apply uniform standards that often fail to account for individual user preferences. These models overlook the diverse safety boundaries shaped by factors like age, mental health, and personal beliefs. To address this, we propose Personalized Safety Alignment (PSA), a framework that allows user-specific control over safety behaviors in generative models. PSA integrates personalized user profiles into the diffusion process, adjusting the model's behavior to match individual safety preferences while preserving image quality. We introduce a new dataset, Sage, which captures user-specific safety preferences and incorporates these profiles through a cross-attention mechanism. Experiments show that PSA outperforms existing methods in harmful content suppression and aligns generated content better with user constraints, achieving higher Win Rate and Pass Rate scores. Our code, data, and models are publicly available at https://torpedo2648.github.io/PSAlign/.
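Conceptually, injecting a user profile through cross-attention looks like the residual block below, where intermediate diffusion features attend over encoded profile tokens; the dimensions and exact placement are assumptions for illustration, not the PSA implementation.
```python
import torch
import torch.nn as nn

class ProfileCrossAttention(nn.Module):
    """Sketch of profile conditioning via cross-attention: image tokens query the
    encoded user-profile tokens, and the result is added residually. Dimensions
    are placeholders."""
    def __init__(self, token_dim=320, profile_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=token_dim, num_heads=heads,
                                          kdim=profile_dim, vdim=profile_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, image_tokens, profile_tokens):
        # image_tokens:   (B, N_img, token_dim)   intermediate diffusion features
        # profile_tokens: (B, N_prof, profile_dim) encoded user safety profile
        attended, _ = self.attn(self.norm(image_tokens), profile_tokens, profile_tokens)
        return image_tokens + attended  # residual injection of profile information

out = ProfileCrossAttention()(torch.randn(2, 64, 320), torch.randn(2, 8, 768))
```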
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.01151
• PDF: https://arxiv.org/pdf/2508.01151
• Github: https://m-e-agi-lab.github.io/PSAlign/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: Cyber-Zero: Training Cybersecurity Agents without Runtime
🔹 Publication Date: Published on Jul 29
🔹 Abstract: Cyber-Zero synthesizes agent trajectories from CTF writeups to train runtime-free cybersecurity LLMs, achieving state-of-the-art performance on benchmarks. AI-generated summary: Large Language Models (LLMs) have achieved remarkable success in software engineering tasks when trained with executable runtime environments, particularly in resolving GitHub issues. However, such runtime environments are often unavailable in other domains, especially cybersecurity, where challenge configurations and execution contexts are ephemeral or restricted. We present Cyber-Zero, the first runtime-free framework for synthesizing high-quality agent trajectories to train cybersecurity LLMs. Cyber-Zero leverages publicly available CTF writeups and employs persona-driven LLM simulation to reverse-engineer runtime behaviors and generate realistic, long-horizon interaction sequences without actual environments. Using trajectories synthesized by Cyber-Zero, we train LLM-based agents that achieve up to 13.1% absolute performance gains over baseline models on three prominent CTF benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best model, Cyber-Zero-32B, establishes new state-of-the-art performance among open-weight models, matching the capabilities of proprietary systems like DeepSeek-V3-0324 and Claude-3.5-Sonnet while offering superior cost-effectiveness, and demonstrating that runtime-free trajectory synthesis can effectively democratize the development of state-of-the-art cybersecurity agents.
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.00910
• PDF: https://arxiv.org/pdf/2508.00910
• Project Page: https://github.com/amazon-science/Cyber-Zero
• Github: https://github.com/amazon-science/Cyber-Zero
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
🔹 Publication Date: Published on Jul 4, 2024
🔹 Abstract:
LLM-jp, a collaborative project, develops open-source and powerful Japanese large language models with over 1,500 participants. AI-generated summary: This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2407.03963
• PDF: https://arxiv.org/pdf/2407.03963
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: Scalable Multi-Task Reinforcement Learning for Generalizable Spatial Intelligence in Visuomotor Agents
🔹 Publication Date: Published on Jul 31
🔹 Abstract: Reinforcement Learning enhances generalizable spatial reasoning and interaction in 3D environments through cross-view goal specification and automated task synthesis, achieving zero-shot generalization and improved interaction success rates. AI-generated summary: While Reinforcement Learning (RL) has achieved remarkable success in language modeling, its triumph hasn't yet fully translated to visuomotor agents. A primary challenge in RL models is their tendency to overfit specific tasks or environments, thereby hindering the acquisition of generalizable behaviors across diverse settings. This paper provides a preliminary answer to this challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can achieve zero-shot generalization to unseen worlds. Specifically, we explore RL's potential to enhance generalizable spatial reasoning and interaction capabilities in 3D worlds. To address challenges in multi-task RL representation, we analyze and establish cross-view goal specification as a unified multi-task goal space for visuomotor policies. Furthermore, to overcome the significant bottleneck of manual task design, we propose automated task synthesis within the highly customizable Minecraft environment for large-scale multi-task RL training, and we construct an efficient distributed RL framework to support this. Experimental results show RL significantly boosts interaction success rates by 4× and enables zero-shot generalization of spatial reasoning across diverse environments, including real-world settings. Our findings underscore the immense potential of RL training in 3D simulated environments, especially those amenable to large-scale task generation, for significantly advancing visuomotor agents' spatial reasoning.
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2507.23698
• PDF: https://arxiv.org/pdf/2507.23698
• Github: https://github.com/CraftJarvis/ROCKET-3
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
🔹 Publication Date: Published on Aug 4
🔹 Abstract: Seed Diffusion Preview, a discrete-state diffusion language model, achieves fast inference speeds through parallel generation, outperforming Mercury and Gemini Diffusion in speed and quality. AI-generated summary: We present Seed Diffusion Preview, a large-scale language model based on discrete-state diffusion, offering remarkably fast inference speed. Thanks to non-sequential, parallel generation, discrete diffusion models provide a notable speedup to mitigate the inherent latency of token-by-token decoding, as demonstrated recently (e.g., Mercury Coder, Gemini Diffusion). Seed Diffusion Preview achieves an inference speed of 2,146 tokens/s on H20 GPUs while maintaining competitive performance across a sweep of standard code evaluation benchmarks, significantly faster than contemporary Mercury and Gemini Diffusion, establishing a new state of the art on the speed-quality Pareto frontier for code models.
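The parallel, non-sequential decoding of a masked discrete-diffusion LM can be sketched as a loop in which the model scores all masked positions at once and commits only the most confident fraction per step. The function below is a generic illustration of that paradigm with a toy random "model", not Seed Diffusion's actual sampler.
```python
import torch
import torch.nn.functional as F

def parallel_unmask_step(model, tokens, mask_id, keep_frac=0.25):
    """One illustrative parallel-decoding step: score every masked position
    simultaneously and commit the most confident predictions; the rest stay
    masked for later refinement steps."""
    is_masked = tokens == mask_id
    if not is_masked.any():
        return tokens
    logits = model(tokens)                              # (B, T, vocab)
    conf, pred = F.softmax(logits, dim=-1).max(dim=-1)  # per-position confidence
    conf = conf.masked_fill(~is_masked, -1.0)           # only consider masked slots
    k = max(1, int(keep_frac * is_masked.sum().item()))
    flat_idx = conf.flatten().topk(k).indices           # most confident masked slots
    out = tokens.clone()
    out.view(-1)[flat_idx] = pred.view(-1)[flat_idx]    # commit those tokens in parallel
    return out

# Toy usage: random logits over a 100-token vocabulary, a few refinement passes.
B, T, V, MASK = 2, 16, 100, 0
toy_model = lambda t: torch.randn(t.shape[0], t.shape[1], V)
seq = torch.full((B, T), MASK)
for _ in range(6):
    seq = parallel_unmask_step(toy_model, seq, MASK)
```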
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.02193
• PDF: https://arxiv.org/pdf/2508.02193
• Project Page: https://seed.bytedance.com/en/seed_diffusion
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer
🔹 Publication Date: Published on Aug 1
🔹 Abstract: LAMIC, a Layout-Aware Multi-Image Composition framework, extends single-reference diffusion models to multi-reference scenarios using attention mechanisms, achieving state-of-the-art performance in controllable image synthesis without training. AI-generated summary: In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
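Group Isolation Attention can be approximated by an attention mask that only lets tokens attend within their own reference group; the helper below builds such a mask and applies it with PyTorch's scaled_dot_product_attention. It is a toy reconstruction of the idea, and the grouping scheme is an assumption rather than LAMIC's exact rule.
```python
import torch
import torch.nn.functional as F

def group_isolation_mask(group_ids: torch.Tensor) -> torch.Tensor:
    """Toy Group Isolation Attention mask: token i may attend to token j only when
    they carry the same reference-group id, keeping entities from different
    reference images disentangled. Returns a boolean mask with True at allowed
    positions (the convention of F.scaled_dot_product_attention)."""
    return group_ids[:, None] == group_ids[None, :]          # (T, T), True = attend

# Example: 3 tokens from reference A (group 0), 3 from reference B (group 1).
ids = torch.tensor([0, 0, 0, 1, 1, 1])
mask = group_isolation_mask(ids)

q = k = v = torch.randn(1, 1, 6, 32)                         # (batch, heads, tokens, dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```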
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.00477
• PDF: https://arxiv.org/pdf/2508.00477
• Github: https://github.com/Suchenl/LAMIC
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
🔹 Publication Date: Published on Jul 23
🔹 Abstract:
PRIX, an end-to-end driving architecture using only camera data, achieves state-of-the-art performance with a Context-aware Recalibration Transformer, outperforming larger multimodal planners in efficiency and scalability. AI-generated summary: While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors, and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be available at https://maxiuw.github.io/prix.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.17596
• PDF: https://arxiv.org/pdf/2507.17596
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success
🔹 Publication Date: Published on Aug 6
🔹 Abstract: A lightweight, hyperparameter-free RL algorithm, VL-DAC, enables VLMs to learn generalized policies from inexpensive simulators, improving performance on real-world benchmarks without sacrificing image understanding accuracy. AI-generated summary: Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions -- a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.
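The decoupling at the heart of VL-DAC, PPO updates on individual action tokens with value learned only per environment step, can be summarized by the loss sketch below. Tensor shapes, the way advantages are broadcast to tokens, and the clipping constant are assumptions; this is not the authors' reference implementation.
```python
import torch
import torch.nn.functional as F

def vl_dac_losses(new_logp, old_logp, step_values, step_returns,
                  advantages, clip_eps=0.2):
    """Sketch of the decoupled objective: a PPO clipped policy loss over action
    tokens plus a value regression computed only at the environment-step level.

    new_logp, old_logp: (num_action_tokens,) log-probs of sampled action tokens
    advantages:         (num_action_tokens,) per-token copy of the step advantage
    step_values:        (num_env_steps,)     critic outputs, one per environment step
    step_returns:       (num_env_steps,)     discounted returns, one per environment step
    """
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()      # token-level PPO update
    value_loss = F.mse_loss(step_values, step_returns)       # step-level value learning
    return policy_loss, value_loss

pl, vl = vl_dac_losses(torch.randn(40), torch.randn(40), torch.randn(5),
                       torch.randn(5), torch.randn(40))
```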
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.04280
• PDF: https://arxiv.org/pdf/2508.04280
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation
🔹 Publication Date: Published on Aug 5
🔹 Abstract: Skywork UniPic, a 1.5 billion-parameter autoregressive model, unifies image understanding, text-to-image generation, and image editing with state-of-the-art performance on commodity hardware. AI-generated summary We introduce Skywork UniPic, a 1.5 billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture, eliminating the need for task-specific adapters or inter-module connectors, and demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024 x 1024 images with under 15 GB of GPU memory (e.g., RTX 4090). These results rest on three design choices: (1) a decoupled encoding strategy that leverages a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, both feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule scaling from 256 x 256 to 1024 x 1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100 million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at https://huggingface.co/Skywork/Skywork-UniPic-1.5B.
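A toy sketch of the decoupled-encoder/shared-decoder routing described in point (1): two stand-in encoders feed one autoregressive decoder, selected by task. Dimensions, module choices, and the routing interface are illustrative assumptions, not Skywork's code.
```python
# Toy routing of task-specific encoders into a shared autoregressive decoder.
import torch
import torch.nn as nn

class UnifiedARModel(nn.Module):
    def __init__(self, dim=512, vocab=8192, layers=4):
        super().__init__()
        self.understanding_enc = nn.Linear(768, dim)   # stands in for a SigLIP2-style encoder
        self.generation_enc = nn.Linear(256, dim)      # stands in for a masked AR image encoder
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.shared_decoder = nn.TransformerDecoder(dec_layer, num_layers=layers)
        self.tok_embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, image_feats, task):
        # Route image features through the task-appropriate encoder,
        # then decode token sequences against that shared memory.
        enc = self.understanding_enc if task == "understand" else self.generation_enc
        memory = enc(image_feats)
        x = self.tok_embed(tokens)
        return self.head(self.shared_decoder(x, memory))

model = UnifiedARModel()
logits = model(torch.randint(0, 8192, (1, 10)), torch.randn(1, 49, 768), "understand")
print(logits.shape)  # torch.Size([1, 10, 8192])
```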
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.03320
• PDF: https://arxiv.org/pdf/2508.03320
• Github: https://github.com/SkyworkAI/UniPic
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
• https://huggingface.co/spaces/Skywork/UniPic
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering
🔹 Publication Date: Published on Aug 5
🔹 Abstract: SonicMaster, a unified generative model, improves music audio quality by addressing various artifacts using text-based control and a flow-matching generative training paradigm. AI-generated summary Music recordings often suffer from audio quality issues such as excessive reverberation, distortion, clipping, tonal imbalances, and a narrowed stereo image, especially when created in non-professional settings without specialized equipment or expertise. These problems are typically corrected using separate specialized tools and manual adjustments. In this paper, we introduce SonicMaster, the first unified generative model for music restoration and mastering that addresses a broad spectrum of audio artifacts with text-based control. SonicMaster is conditioned on natural language instructions to apply targeted enhancements, or can operate in an automatic mode for general restoration. To train this model, we construct the SonicMaster dataset, a large dataset of paired degraded and high-quality tracks, by simulating common degradation types with nineteen degradation functions belonging to five enhancement groups: equalization, dynamics, reverb, amplitude, and stereo. Our approach leverages a flow-matching generative training paradigm to learn an audio transformation that maps degraded inputs to their cleaned, mastered versions guided by text prompts. Objective audio quality metrics demonstrate that SonicMaster significantly improves sound quality across all artifact categories. Furthermore, subjective listening tests confirm that listeners prefer SonicMaster's enhanced outputs over the original degraded audio, highlighting the effectiveness of our unified approach.
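For readers unfamiliar with the flow-matching paradigm the abstract names, here is a generic conditional flow-matching training step on toy latents. The network, the way conditioning is fused, and all shapes are placeholders, not SonicMaster's architecture.
```python
# Generic conditional flow matching: regress the constant velocity of a
# linear noise-to-target path, conditioned on degraded-audio and text embeddings.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, dim=128, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_step(model, clean, degraded_cond, text_cond):
    """clean: (B, dim) target audio latents; conds: (B, cond_dim) embeddings."""
    b = clean.size(0)
    t = torch.rand(b, 1)                        # random time in [0, 1]
    noise = torch.randn_like(clean)
    x_t = (1 - t) * noise + t * clean           # linear path from noise to clean
    target_v = clean - noise                    # constant velocity along that path
    cond = degraded_cond + text_cond            # toy fusion of the two conditions
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()

model = VelocityNet()
loss = flow_matching_step(model, torch.randn(8, 128), torch.randn(8, 64), torch.randn(8, 64))
loss.backward()
```
At inference, one would integrate the learned velocity field from noise toward a clean latent while conditioning on the degraded track and the text instruction.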
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.03448
• PDF: https://arxiv.org/pdf/2508.03448
• Github: https://amaai-lab.github.io/SonicMaster/
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/amaai-lab/SonicMasterDataset
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe
🔹 Publication Date: Published on Aug 3
🔹 Abstract: Voxlect is a benchmark for evaluating speech foundation models on dialect classification and downstream applications across multiple languages and dialects. AI-generated summary We present Voxlect, a novel benchmark for modeling dialects and regional languages worldwide using speech foundation models. Specifically, we report comprehensive benchmark evaluations on dialects and regional language varieties in English, Arabic, Mandarin and Cantonese, Tibetan, Indic languages, Thai, Spanish, French, German, Brazilian Portuguese, and Italian. Our study used over 2 million training utterances from 30 publicly available speech corpora that are provided with dialectal information. We evaluate the performance of several widely used speech foundation models in classifying speech dialects. We assess the robustness of the dialectal models under noisy conditions and present an error analysis that highlights modeling results aligned with geographic continuity. In addition to benchmarking dialect classification, we demonstrate several downstream applications enabled by Voxlect. Specifically, we show that Voxlect can be applied to augment existing speech recognition datasets with dialect information, enabling a more detailed analysis of ASR performance across dialectal variations. Voxlect is also used as a tool to evaluate the performance of speech generation systems. Voxlect is publicly available under the RAIL family license at: https://github.com/tiantiaf0627/voxlect.
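A common way such a benchmark evaluates speech foundation models is to train a small probe on frozen utterance embeddings and measure dialect-classification accuracy. The sketch below shows that setup with assumed embedding shapes; it is not Voxlect's actual evaluation code.
```python
# Linear/MLP probe over frozen speech-foundation-model features for dialect ID.
# Embedding extraction itself is abstracted away; shapes are assumptions.
import torch
import torch.nn as nn

class DialectProbe(nn.Module):
    def __init__(self, embed_dim=768, num_dialects=16):
        super().__init__()
        self.classifier = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(),
                                        nn.Linear(256, num_dialects))

    def forward(self, utterance_embeddings):       # (B, T, D) frame-level features
        pooled = utterance_embeddings.mean(dim=1)  # average over time
        return self.classifier(pooled)

probe = DialectProbe()
logits = probe(torch.randn(4, 200, 768))           # 4 utterances, 200 frames each
pred = logits.argmax(dim=-1)
print(pred.shape)  # torch.Size([4])
```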
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.01691
• PDF: https://arxiv.org/pdf/2508.01691
• Github: https://github.com/tiantiaf0627/voxlect/tree/main
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: A Coarse-to-Fine Approach to Multi-Modality 3D Occupancy Grounding
🔹 Publication Date: Published on Aug 2
🔹 Abstract: A benchmark and model for 3D occupancy grounding using natural language and voxel-level annotations improve object perception in autonomous driving. AI-generated summary Visual grounding aims to identify objects or regions in a scene based on natural language descriptions, which is essential for spatially aware perception in autonomous driving. However, existing visual grounding tasks typically depend on bounding boxes, which often fail to capture fine-grained details: not all voxels within a bounding box are occupied, resulting in inaccurate object representations. To address this, we introduce a benchmark for 3D occupancy grounding in challenging outdoor scenes. Built on the nuScenes dataset, it integrates natural language with voxel-level occupancy annotations, offering more precise object perception than the traditional grounding task. Moreover, we propose GroundingOcc, an end-to-end model designed for 3D occupancy grounding through multi-modal learning. It combines visual, textual, and point cloud features to predict object location and occupancy information from coarse to fine. Specifically, GroundingOcc comprises a multimodal encoder for feature extraction, an occupancy head for voxel-wise predictions, and a grounding head to refine localization. Additionally, a 2D grounding module and a depth estimation module enhance geometric understanding, thereby boosting model performance. Extensive experiments on the benchmark demonstrate that our method outperforms existing baselines on 3D occupancy grounding. The dataset is available at https://github.com/RONINGOD/GroundingOcc.
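To make the multi-modal fusion and voxel-wise prediction concrete, the sketch below lets learned voxel queries attend over concatenated image, text, and point-cloud tokens, then emits occupancy logits and a coarse box. Every dimension and the fusion scheme are illustrative assumptions, not the released GroundingOcc model.
```python
# Schematic coarse-to-fine grounding: fuse modality tokens into voxel queries,
# predict per-voxel occupancy and a coarse 3D box for localization.
import torch
import torch.nn as nn

class OccupancyGrounder(nn.Module):
    def __init__(self, dim=128, grid=(16, 16, 4)):
        super().__init__()
        self.grid = grid
        n_vox = grid[0] * grid[1] * grid[2]
        self.vox_query = nn.Parameter(torch.randn(n_vox, dim))
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.occ_head = nn.Linear(dim, 1)      # voxel-wise occupancy logit
        self.box_head = nn.Linear(dim, 7)      # coarse box: (x, y, z, w, l, h, yaw)

    def forward(self, img_feat, txt_feat, pts_feat):
        # Concatenate modality tokens and let voxel queries attend to them.
        memory = torch.cat([img_feat, txt_feat, pts_feat], dim=1)   # (B, N, dim)
        q = self.vox_query.unsqueeze(0).expand(memory.size(0), -1, -1)
        fused, _ = self.fuse(q, memory, memory)
        occ = self.occ_head(fused).view(-1, *self.grid)             # (B, X, Y, Z)
        box = self.box_head(fused.mean(dim=1))                      # coarse localization
        return occ, box

model = OccupancyGrounder()
occ, box = model(torch.randn(2, 100, 128), torch.randn(2, 12, 128), torch.randn(2, 256, 128))
print(occ.shape, box.shape)  # torch.Size([2, 16, 16, 4]) torch.Size([2, 7])
```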
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2508.01197
• PDF: https://arxiv.org/pdf/2508.01197
• Github: https://github.com/RONINGOD/GroundingOcc
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT