🔹 Title:
JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment
🔹 Publication Date: Published on Jul 28
🔹 Abstract:
A flow-matching-based model enhances lyrics-to-song generation by providing word-level control over vocal timing and duration, improving quality through aesthetic alignment and surpassing current models in music-specific attributes. AI-generated summary: Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high-quality, faithful audio outputs capturing speech and acoustic events. However, there is still much room for improvement in creative audio generation, which primarily involves music and songs. Recent open lyrics-to-song models, such as DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack the fine-grained word-level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing song generation with word-level timing and duration control, allowing fine-grained vocal control. To enhance the quality of generated songs and better align them with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need for manual data annotations. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset JAME. We show that JAM outperforms existing models in terms of music-specific attributes.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.20880
• PDF: https://arxiv.org/pdf/2507.20880
• Project Page: https://declare-lab.github.io/jamify
• Github: https://declare-lab.github.io/jamify
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
https://t.iss.one/DataScienceT
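The flow-matching objective behind models like JAM can be sketched in a few lines: sample a noise point, interpolate it toward the clean target, and regress the model's predicted velocity onto the true displacement. This is a minimal generic sketch of rectified-flow-style training, not JAM's actual architecture; `velocity_model` is a hypothetical stand-in for the real network.

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_model(x_t, t, cond):
    # Hypothetical stand-in for the real transformer: a toy linear map.
    return 0.5 * x_t + cond - t

def flow_matching_loss(x1, cond):
    """One flow-matching training step on a batch of clean latents x1,
    conditioned on (e.g.) lyric and timing embeddings `cond`."""
    x0 = rng.standard_normal(x1.shape)        # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))    # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1             # linear interpolation path
    v_target = x1 - x0                        # ground-truth velocity along it
    v_pred = velocity_model(x_t, t, cond)
    return float(np.mean((v_pred - v_target) ** 2))

loss = flow_matching_loss(rng.standard_normal((4, 8)), np.zeros((4, 8)))
```

At inference, the learned velocity field is integrated from noise to data with an ODE solver; word-level timing control would enter through the conditioning tensor.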
🔹 Title:
Deep Researcher with Test-Time Diffusion
🔹 Publication Date: Published on Jul 21
🔹 Abstract:
The Test-Time Diffusion Deep Researcher (TTD-DR) framework uses a diffusion process with iterative refinement and external information retrieval to generate high-quality research reports, outperforming existing methods. AI-generated summary: Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a "denoising" process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report-writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.16075
• PDF: https://arxiv.org/pdf/2507.16075
• Github: https://github.com/codelion/optillm/tree/main/optillm/plugins/deep_research/sample_reports
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
https://t.iss.one/DataScienceT
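The draft-centric loop described above, which starts from a skeleton, then alternates retrieval and "denoising" revisions, can be sketched as plain control flow. All function names here (`retrieve`, `revise`) are hypothetical stubs; a real agent would call a search API and an LLM at those points.

```python
def retrieve(query):
    # Hypothetical retrieval stub; a real agent would query a search API here.
    return f"evidence for: {query}"

def revise(draft, evidence):
    # Hypothetical "denoising" step: fold the retrieved evidence into the draft.
    return draft + [evidence]

def ttd_dr_report(question, steps=3):
    """Sketch of TTD-DR's loop: start from a preliminary skeleton draft,
    then iterate retrieval-informed revision for a fixed number of steps."""
    draft = [f"skeleton: {question}"]
    for step in range(steps):
        # The current draft would normally shape the next search query.
        query = f"{question} (gap at step {step})"
        draft = revise(draft, retrieve(query))
    return draft

report = ttd_dr_report("What limits on-device LLM inference?")
```

The key design point is that the draft persists across iterations, so information gathered early is never discarded, only refined.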
🔹 Title:
SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
🔹 Publication Date: Published on Jul 28
🔹 Abstract:
SmallThinker, designed for local devices with limited resources, uses advanced architectural innovations to achieve high performance without requiring GPU hardware. AI-generated summary: While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we utilize a NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1GB and 8GB of memory respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.20984
• PDF: https://arxiv.org/pdf/2507.20984
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
https://t.iss.one/DataScienceT
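The pre-attention routing idea is that top-k expert selection depends only on the hidden state *before* attention, so the chosen experts' weights can be prefetched from slow storage while attention is still computing. Below is a generic top-k MoE router sketch illustrating that ordering, not SmallThinker's actual router; the shapes and expert count are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def pre_attention_route(hidden, router_w, k=2):
    """Select top-k experts from the pre-attention hidden state.
    Because this runs before attention, an inference engine can overlap
    expert-weight prefetch from disk with the attention computation."""
    logits = hidden @ router_w                 # one score per expert
    top_k = np.argsort(logits)[-k:][::-1]      # best-first expert indices
    probs = np.exp(logits - logits.max())      # softmax gate values
    probs /= probs.sum()
    return top_k, probs[top_k]

hidden = rng.standard_normal(16)               # toy hidden state, dim 16
router_w = rng.standard_normal((16, 8))        # 8 hypothetical experts
experts, gates = pre_attention_route(hidden, router_w, k=2)
```

In a real engine, `experts` would be handed to a prefetch thread immediately, and the gates applied when the expert FFN outputs are combined after attention.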
🔹 Title:
Region-based Cluster Discrimination for Visual Representation Learning
🔹 Publication Date: Published on Jul 26
🔹 Abstract:
RICE enhances region-level visual and OCR capabilities through a novel Region Transformer and cluster discrimination loss, achieving superior performance across dense prediction and perception tasks. AI-generated summary: Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks, such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at https://github.com/deepglint/MVT.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.20025
• PDF: https://arxiv.org/pdf/2507.20025
• Github: https://github.com/deepglint/MVT
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
https://t.iss.one/DataScienceT
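Cluster discrimination treats representation learning as classification against a large set of cluster centers. A generic sketch of that idea, framed as temperature-scaled cross-entropy between a region embedding and cosine similarities to centroids, is shown below; this is not RICE's exact loss, and the dimensions and cluster count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster_discrimination_loss(region_emb, centroids, label, temp=0.1):
    """Cross-entropy of one region embedding against cluster centers:
    the region should score highest against its assigned cluster `label`."""
    # L2-normalise so logits are temperature-scaled cosine similarities.
    region_emb = region_emb / np.linalg.norm(region_emb)
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    logits = centroids @ region_emb / temp
    # Numerically stable log-softmax, then negative log-likelihood.
    log_probs = logits - logits.max() - np.log(np.sum(np.exp(logits - logits.max())))
    return float(-log_probs[label])

emb = rng.standard_normal(32)              # one region embedding
centers = rng.standard_normal((100, 32))   # 100 hypothetical region clusters
loss = cluster_discrimination_loss(emb, centers, label=7)
```

Because each region only needs a classification against centroids, this formulation shards naturally across devices for large-scale distributed training.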
🔹 Title:
Ovis-U1 Technical Report
🔹 Publication Date: Published on Jun 29
🔹 Abstract:
Ovis-U1, a 3-billion-parameter model, combines multimodal understanding, text-to-image generation, and image editing, achieving state-of-the-art performance on various benchmarks. AI-generated summary: In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.23044
• PDF: https://arxiv.org/pdf/2506.23044
• Github: https://github.com/AIDC-AI/Ovis-U1
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
• https://huggingface.co/spaces/AIDC-AI/Ovis-U1-3B
• https://huggingface.co/spaces/evalstate/Ovis-U1-3B
• https://huggingface.co/spaces/LLMTestSaurav/Ovis-U1-Demo
• https://huggingface.co/spaces/innoai/Ovis-U1-3B-cpu
==================================
For more data science resources:
https://t.iss.one/DataScienceT
🔹 Title:
AFRDA: Attentive Feature Refinement for Domain Adaptive Semantic Segmentation
🔹 Publication Date: Published on Jul 23
🔹 Abstract:
The Adaptive Feature Refinement (AFR) module enhances unsupervised domain adaptive semantic segmentation by refining high-resolution features with low-resolution logits and integrating high-frequency components, leading to improved segmentation performance. AI-generated summary: In Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS), a model is trained on labeled source domain data (e.g., synthetic images) and adapted to an unlabeled target domain (e.g., real-world images) without access to target annotations. Existing UDA-SS methods often struggle to balance fine-grained local details with global contextual information, leading to segmentation errors in complex regions. To address this, we introduce the Adaptive Feature Refinement (AFR) module, which enhances segmentation accuracy by refining high-resolution features using semantic priors from low-resolution logits. AFR also integrates high-frequency components, which capture fine-grained structures and provide crucial boundary information, improving object delineation. Additionally, AFR adaptively balances local and global information through uncertainty-driven attention, reducing misclassifications. Its lightweight design allows seamless integration into HRDA-based UDA methods, leading to state-of-the-art segmentation performance. Our approach improves existing UDA-SS methods by 1.05% mIoU on GTA V --> Cityscapes and 1.04% mIoU on Synthia --> Cityscapes. The implementation of our framework is available at: https://github.com/Masrur02/AFRDA
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.17957
• PDF: https://arxiv.org/pdf/2507.17957
• Github: https://github.com/Masrur02/AFRDA
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
https://t.iss.one/DataScienceT
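The uncertainty-driven balancing described above can be illustrated with a toy 1-D "image": upsample the low-resolution logits, measure per-pixel entropy, and use it to weight how much the semantic prior overrides the high-resolution signal. This is a simplified sketch of the idea under assumed shapes, not the AFR module's actual computation.

```python
import numpy as np

rng = np.random.default_rng(0)

def afr_refine(hi_feat, lo_logits):
    """Blend a high-res feature with a low-res semantic prior, weighted
    by the prior's per-pixel uncertainty (entropy).

    hi_feat:   high-res signal, shape (W,)          -- toy 1-D image
    lo_logits: low-res class logits, shape (w, C)   -- W divisible by w
    """
    scale = hi_feat.shape[0] // lo_logits.shape[0]
    up = np.repeat(lo_logits, scale, axis=0)                  # nearest-neighbour upsample
    probs = np.exp(up) / np.exp(up).sum(axis=1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-8)).sum(axis=1)     # uncertainty per pixel
    weight = entropy / np.log(probs.shape[1])                 # normalise to [0, 1]
    prior = probs.argmax(axis=1).astype(float)                # semantic prior signal
    # Confident pixels lean on the prior; uncertain ones keep local detail.
    return (1.0 - weight) * prior + weight * hi_feat

hi = rng.standard_normal(8)          # high-res feature, width 8
lo = rng.standard_normal((4, 3))     # low-res logits: width 4, 3 classes
refined = afr_refine(hi, lo)
```

The 2-D case is identical in spirit, with bilinear upsampling and a learned attention map replacing the plain entropy weight.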
🔹 Title:
Offline Reinforcement Learning from Datasets with Structured Non-Stationarity
🔹 Publication Date: Published on May 23, 2024
🔹 Abstract:
A novel Offline RL method uses Contrastive Predictive Coding to handle non-stationary transition and reward functions in datasets, outperforming baselines on various control tasks. AI-generated summary: Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy. Offline RL aims to solve this issue by using transitions collected by a different behavior policy. We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode. We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation. We analyze our proposed method and show that it performs well in simple continuous control tasks and challenging, high-dimensional locomotion tasks. We show that our method often achieves the oracle performance and performs better than baselines.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2405.14114
• PDF: https://arxiv.org/pdf/2405.14114
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/johannesack/OfflineRLStructuredNonstationary
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
https://t.iss.one/DataScienceT
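At the core of Contrastive Predictive Coding is an InfoNCE-style loss: an encoded transition (the anchor) should score higher against a transition from the same episode (sharing the same hidden dynamics) than against transitions from other episodes. The sketch below shows that generic estimator with made-up toy vectors; it is not the paper's exact training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(anchor, positive, negatives, temp=0.5):
    """InfoNCE loss: classify the positive among (1 + n_neg) candidates.
    `positive` comes from the anchor's episode; `negatives` from others."""
    candidates = np.vstack([positive[None, :], negatives])  # positive at index 0
    scores = candidates @ anchor / temp                     # dot-product similarity
    # Stable log-softmax; the loss is the NLL of the true (index 0) candidate.
    log_probs = scores - scores.max() - np.log(np.sum(np.exp(scores - scores.max())))
    return float(-log_probs[0])

anchor = rng.standard_normal(16)                  # encoded transition
positive = anchor + 0.1 * rng.standard_normal(16) # same-episode transition (toy)
negatives = rng.standard_normal((8, 16))          # transitions from other episodes
loss = info_nce(anchor, positive, negatives)
```

Minimizing this loss forces the encoder to capture exactly the episode-level factor that varies across the dataset, which the policy can then condition on.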
🔹 Title:
MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions
🔹 Publication Date: Published on Jul 29
🔹 Abstract:
A systematic assessment of honesty in Multimodal Large Language Models (MLLMs) using a large-scale benchmark reveals that models often fail to appropriately refuse unanswerable visual questions, highlighting the need for multimodal honesty alignment methods. AI-generated summary: Recently, Multimodal Large Language Models (MLLMs) have achieved considerable advancements in vision-language tasks, yet produce potentially harmful or untrustworthy content. Despite substantial work investigating the trustworthiness of language models, MLLMs' capability to act honestly, especially when faced with visually unanswerable questions, remains largely underexplored. This work presents the first systematic assessment of honesty behaviors across various MLLMs. We ground honesty in models' response behaviors to unanswerable visual questions, define four representative types of such questions, and construct MoHoBench, a large-scale MLLM honesty benchmark consisting of 12k+ visual question samples, whose quality is guaranteed by multi-stage filtering and human verification. Using MoHoBench, we benchmarked the honesty of 28 popular MLLMs and conducted a comprehensive analysis. Our findings show that: (1) most models fail to appropriately refuse to answer when necessary, and (2) MLLMs' honesty is not solely a language modeling issue but is deeply influenced by visual information, necessitating the development of dedicated methods for multimodal honesty alignment. Therefore, we implemented initial alignment methods using supervised and preference learning to improve honesty behavior, providing a foundation for future work on trustworthy MLLMs. Our data and code can be found at https://github.com/DSTTSD/MoHoBench.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.21503
• PDF: https://arxiv.org/pdf/2507.21503
• Github: https://github.com/DSTTSD/MoHoBench
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
https://t.iss.one/DataScienceT
πΉ Title: X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
πΉ Publication Date:
Published on Jul 29
πΉ Abstract:
Reinforcement learning enhances discrete autoregressive modeling for image and language generation, achieving high-quality image generation and instruction-following capabilities. AI-generated summary Numerous efforts have been made to extend the "next token prediction" paradigm to visual contents, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributable to cumulative errors during autoregressive inference or information loss incurred during the discretization process. Probably due to this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and substantially enhance the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework, termed X-Omni, comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.
πΉ Links:
β’ arXiv Page: https://arxiv.org/abs/2507.22058
β’ PDF: https://arxiv.org/pdf/2507.22058
β’ Project Page: https://x-omni-team.github.io
β’ Github: https://github.com/X-Omni-Team/X-Omni
πΉ Spaces citing this paper:
β’ https://huggingface.co/spaces/zhangxiaosong18/X-Omni-En
β’ https://huggingface.co/spaces/zhangxiaosong18/X-Omni-Zh
==================================
For more data science resources:
β https://t.iss.one/DataScienceT
πΉ Title:
FedRand: Enhancing Privacy in Federated Learning with Randomized LoRA Subparameter Updates
πΉ Publication Date: Published on Mar 10
πΉ Abstract:
FedRand framework enhances data privacy in federated learning by keeping a subset of LoRA parameters private, reducing the risk of membership inference attacks while maintaining model accuracy. AI-generated summary Federated Learning (FL) is a widely used framework for training models in a decentralized manner, ensuring that the central server does not have direct access to data from local clients. However, this approach may still fail to fully preserve data privacy, as models from local clients are exposed to the central server during the aggregation process. This issue becomes even more critical when training vision-language models (VLMs) with FL, as VLMs can easily memorize training data instances, making them vulnerable to membership inference attacks (MIAs). To address this challenge, we propose the FedRand framework, which avoids disclosing the full set of client parameters. In this framework, each client randomly selects subparameters of Low-Rank Adaptation (LoRA) from the server and keeps the remaining counterparts of the LoRA weights as private parameters. After training both sets of parameters on the client's private dataset, only the non-private client parameters are sent back to the server for aggregation. This approach mitigates the risk of exposing client-side VLM parameters, thereby enhancing data privacy. We empirically validate that FedRand improves robustness against MIAs compared to relevant baselines while achieving accuracy comparable to methods that communicate full LoRA parameters across several benchmark datasets.
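The key mechanism is the random split of LoRA parameters into a shared and a private subset per client. A minimal sketch of that split, under my own naming assumptions (not the authors' code, and with parameter names standing in for actual LoRA weight tensors):

```python
# Illustrative FedRand-style split: randomly mark a fraction of LoRA
# parameter names as private; only the shared remainder would be uploaded
# to the server for aggregation. Names and fractions are hypothetical.
import random

def split_lora_params(param_names: list[str], private_fraction: float,
                      seed: int):
    """Randomly mark a fraction of LoRA parameter names as private."""
    rng = random.Random(seed)
    shuffled = param_names[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * private_fraction)
    private = set(shuffled[:cut])
    shared = [n for n in param_names if n not in private]
    return shared, private

names = [f"layer{i}.lora_{m}" for i in range(4) for m in ("A", "B")]
shared, private = split_lora_params(names, private_fraction=0.5, seed=0)
# Only `shared` parameters are sent back to the server;
# `private` ones never leave the client.
print(len(shared), len(private))  # 4 4
```

The privacy benefit comes from the server never observing a complete, coherent set of client LoRA weights in any round.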
πΉ Links:
β’ arXiv Page: https://arxiv.org/abs/2503.07216
β’ PDF: https://arxiv.org/pdf/2503.07216
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
==================================
For more data science resources:
β https://t.iss.one/DataScienceT
πΉ Title: CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
πΉ Publication Date: Published on Jul 18
πΉ Abstract: CUDA-L1, an automated reinforcement learning framework, significantly improves CUDA optimization across various GPU architectures, achieving substantial speedups without human expertise. AI-generated summary The exponential growth in demand for GPU computing resources, driven by the rapid advancement of Large Language Models, has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models (e.g. R1, o1) achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization. CUDA-L1 achieves substantial performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x17.7 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x449. Furthermore, the model also demonstrates excellent portability across GPU architectures, achieving average speedups of x17.8 on H100, x19.0 on RTX 3090, x16.5 on L40, x14.7 on H800, and x13.9 on H20 despite being optimized specifically for A100. Beyond these benchmark results, CUDA-L1 demonstrates several remarkable properties: 1) It discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) It uncovers fundamental principles of CUDA optimization; 3) It identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that harm performance. The capabilities of CUDA-L1 demonstrate that reinforcement learning can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. More importantly, the trained RL model extends the acquired reasoning abilities to new kernels.
This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources.
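The reward signal the abstract describes is simply measured speedup over a reference kernel, gated on correctness. A hedged sketch (my own illustration; timing and kernel execution are stubbed, and the zero-on-incorrect policy is an assumption, not a detail confirmed by the abstract):

```python
# Sketch of a speedup-based RL reward: reward a candidate kernel purely by
# its measured runtime improvement over a baseline, zeroed if it is wrong.
# Real use would benchmark compiled CUDA kernels; here times are given.

def speedup_reward(baseline_ms: float, candidate_ms: float,
                   correct: bool) -> float:
    """Reward = speedup factor if the candidate kernel is correct, else 0."""
    if not correct or candidate_ms <= 0:
        return 0.0
    return baseline_ms / candidate_ms

print(speedup_reward(17.7, 1.0, correct=True))   # → 17.7
print(speedup_reward(17.7, 1.0, correct=False))  # → 0.0
```

Because the signal needs no labeled data, only a timer and a correctness check, it matches the abstract's claim that no human expertise or annotation enters the loop.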
πΉ Paper Links:
β’ arXiv Page: https://arxiv.org/abs/2507.14111
β’ PDF: https://arxiv.org/pdf/2507.14111
πΉ Datasets citing this paper:
β’ https://huggingface.co/datasets/deepreinforce-ai/CUDA-L1
πΉ Spaces citing this paper:
No spaces found
==================================
For more data science resources:
β https://t.iss.one/DataScienceT
Forwarded from Python | Machine Learning | Coding | R
Join our WhatsApp channel
There are dedicated resources only for WhatsApp users
https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
Forwarded from Python | Machine Learning | Coding | R
π Become an Agentic AI Builder: Free 12-Week Certification by Ready Tensor
Ready Tensor's Agentic AI Developer Certification is a free, project-first 12-week program designed to help you build and deploy real-world agentic AI systems. You'll complete three portfolio-ready projects using tools like LangChain, LangGraph, and vector databases, while deploying production-ready agents with FastAPI or Streamlit.
The course focuses on developing autonomous AI agents that can plan, reason, use memory, and act safely in complex environments. Certification is earned not by watching lectures but by building: each project is reviewed against rigorous standards.
You can start anytime, and new cohorts begin monthly. Ideal for developers and engineers ready to go beyond chat prompts and start building true agentic systems.
π Apply now: https://www.readytensor.ai/agentic-ai-cert/
πΉ Title:
Music Arena: Live Evaluation for Text-to-Music
πΉ Publication Date: Published on Jul 28
πΉ Abstract:
Music Arena provides a scalable, interactive platform for evaluating text-to-music models through user-generated preferences and detailed feedback. AI-generated summary We present Music Arena, an open platform for scalable human preference evaluation of text-to-music (TTM) models. Soliciting human preferences via listening studies is the gold standard for evaluation in TTM, but these studies are expensive to conduct and difficult to compare, as study protocols may differ across systems. Moreover, human preferences might help researchers align their TTM systems or improve automatic evaluation metrics, but an open and renewable source of preferences does not currently exist. We aim to fill these gaps by offering *live* evaluation for TTM. In Music Arena, real-world users input text prompts of their choosing and compare outputs from two TTM systems, and their preferences are used to compile a leaderboard. While Music Arena follows recent evaluation trends in other AI domains, we also design it with key features tailored to music: an LLM-based routing system to navigate the heterogeneous type signatures of TTM systems, and the collection of *detailed* preferences including listening data and natural language feedback. We also propose a rolling data release policy with user privacy guarantees, providing a renewable source of preference data and increasing platform transparency. Through its standardized evaluation protocol, transparent data access policies, and music-specific features, Music Arena not only addresses key challenges in the TTM ecosystem but also demonstrates how live evaluation can be thoughtfully adapted to the unique characteristics of specific AI domains. Music Arena is available at: https://music-arena.org
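How do pairwise A-vs-B votes become a leaderboard? Music Arena's exact ranking method isn't stated in this summary; a standard approach used by other arena-style platforms is an Elo-style rating update, shown here purely for illustration:

```python
# Standard Elo update for one pairwise preference vote (illustrative of
# arena-style leaderboards generally, not necessarily Music Arena's method).

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """One Elo rating update after a single A-vs-B preference vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two equally-rated systems; A wins the vote and gains rating, B loses it.
a, b = elo_update(1000.0, 1000.0, a_wins=True)
print(round(a), round(b))  # 1016 984
```

Aggregating many such votes across users yields a stable ranking even though each individual comparison is noisy.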
πΉ Links:
β’ arXiv Page: https://arxiv.org/abs/2507.20900
β’ PDF: https://arxiv.org/pdf/2507.20900
β’ Github: https://github.com/gclef-cmu/music-arena
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
==================================
For more data science resources:
β https://t.iss.one/DataScienceT
πΉ Title:
Privacy-Aware Energy Consumption Modeling of Connected Battery Electric Vehicles using Federated Learning
πΉ Publication Date: Published on Dec 12, 2023
πΉ Abstract:
Federated Learning methods like FedAvg and FedPer improve BEV energy consumption prediction while protecting user privacy. AI-generated summary Battery Electric Vehicles (BEVs) are increasingly significant in modern cities due to their potential to reduce air pollution. Precise and real-time estimation of their energy consumption is imperative for effective itinerary planning and optimizing vehicle systems, which can reduce driving range anxiety and decrease energy costs. As public awareness of data privacy increases, adopting approaches that safeguard data privacy in the context of BEV energy consumption modeling is crucial. Federated Learning (FL) is a promising solution that mitigates the risk of exposing sensitive information to third parties by allowing local data to remain on devices and only sharing model updates with a central server. Our work investigates the potential of using FL methods, such as FedAvg and FedPer, to improve BEV energy consumption prediction while maintaining user privacy. We conducted experiments using data from 10 BEVs under simulated real-world driving conditions. Our results demonstrate that the FedAvg-LSTM model achieved a reduction of up to 67.84% in the MAE of the prediction results. Furthermore, we explored various real-world scenarios and discussed how FL methods can be employed in those cases. Our findings show that FL methods can effectively improve the performance of BEV energy consumption prediction while maintaining user privacy.
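FedAvg, the aggregation rule the abstract names, is a sample-weighted average of client model parameters. A minimal sketch, with flat vectors standing in for the real LSTM weights:

```python
# FedAvg in miniature: the server averages client parameter vectors,
# weighted by each client's number of training samples. The vectors here
# are toy stand-ins for real model weights.

def fedavg(client_weights: list[list[float]], sample_counts: list[int]):
    """Sample-weighted average of client parameter vectors."""
    total = sum(sample_counts)
    dim = len(client_weights[0])
    avg = [0.0] * dim
    for w, n in zip(client_weights, sample_counts):
        for i in range(dim):
            avg[i] += (n / total) * w[i]
    return avg

# Client 2 has 3x the data of client 1, so its weights dominate the average.
print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))  # [2.5, 3.5]
```

Only these averaged updates cross the network; raw driving data never leaves the vehicle, which is the privacy property the paper builds on.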
πΉ Links:
β’ arXiv Page: https://arxiv.org/abs/2312.07371
β’ PDF: https://arxiv.org/pdf/2312.07371
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
==================================
For more data science resources:
β https://t.iss.one/DataScienceT
πΉ Title: Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
πΉ Publication Date: Published on Jul 31
πΉ Abstract: Seed-Prover, a lemma-style reasoning model using Lean, achieves high performance in formal theorem proving and automated mathematical reasoning through iterative refinement and specialized geometry support. AI-generated summary LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose Seed-Prover, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine, Seed-Geometry, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.
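The "clear supervision signal" the abstract contrasts with natural language comes from Lean's type checker: a proof either compiles or it doesn't. A toy example of the kind of formally verified statement involved (my own illustration, far simpler than the IMO-level goals the paper targets):

```lean
-- Lean 4: the checker either accepts this proof term or rejects it,
-- giving an unambiguous pass/fail reward for an RL-trained prover.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Lemma-style proving, as the paper describes it, decomposes a hard goal into many small verified pieces like this, each of which Lean can check independently and feed back to the model.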
πΉ Paper Links:
β’ arXiv Page: https://arxiv.org/abs/2507.23726
β’ PDF: https://arxiv.org/pdf/2507.23726
β’ Github: https://github.com/ByteDance-Seed/Seed-Prover
πΉ Models citing this paper:
No models found
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
==================================
For more data science resources:
β https://t.iss.one/DataScienceT
πΉ Title: RecGPT Technical Report
πΉ Publication Date: Published on Jul 30
πΉ Abstract: RecGPT integrates large language models into recommender systems to focus on user intent, improving content diversity and satisfaction while enhancing merchant and platform performance. AI-generated summary: Recommender systems are among the most impactful applications of artificial intelligence, serving as critical infrastructure connecting users, merchants, and platforms. However, most current industrial systems remain heavily reliant on historical co-occurrence patterns and log-fitting objectives, i.e., optimizing for past user interactions without explicitly modeling user intent. This log-fitting approach often leads to overfitting to narrow historical preferences, failing to capture users' evolving and latent interests. As a result, it reinforces filter bubbles and long-tail phenomena, ultimately harming user experience and threatening the sustainability of the whole recommendation ecosystem. To address these challenges, we rethink the overall design paradigm of recommender systems and propose RecGPT, a next-generation framework that places user intent at the center of the recommendation pipeline. By integrating large language models (LLMs) into the key stages of user interest mining, item retrieval, and explanation generation, RecGPT transforms log-fitting recommendation into an intent-centric process. To effectively align general-purpose LLMs with these domain-specific recommendation tasks at scale, RecGPT incorporates a multi-stage training paradigm that integrates reasoning-enhanced pre-alignment and self-training evolution, guided by a Human-LLM cooperative judge system. RecGPT has been fully deployed on the Taobao App. Online experiments demonstrate that RecGPT achieves consistent performance gains across stakeholders: users benefit from increased content diversity and satisfaction, while merchants and the platform gain greater exposure and conversions.
These comprehensive improvements across all stakeholders validate that LLM-driven, intent-centric design can foster a more sustainable and mutually beneficial recommendation ecosystem.
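The intent-centric pipeline the summary describes (interest mining, then retrieval, then explanation) can be sketched roughly as follows. This is an illustrative toy only: every function name is hypothetical, and the keyword-overlap scoring is a stand-in for the LLM components RecGPT actually uses at each stage.

```python
# Toy sketch of an intent-centric recommendation pipeline in the shape the
# abstract describes. All names are hypothetical; keyword overlap stands in
# for the LLM-based reasoning RecGPT performs at each stage.

def mine_interests(history):
    """Stand-in for LLM-based interest mining: turn raw interactions into
    explicit intent tags rather than relying on co-occurrence alone."""
    return sorted({tag for item in history for tag in item["tags"]})

def retrieve_items(intents, catalog, k=3):
    """Stand-in for intent-conditioned retrieval: score catalog items by
    overlap with mined intents instead of historical click counts."""
    scored = [(len(set(item["tags"]) & set(intents)), item) for item in catalog]
    return [item for score, item in sorted(scored, key=lambda s: -s[0])[:k] if score > 0]

def explain(item, intents):
    """Stand-in for LLM-generated, intent-grounded explanations."""
    matched = sorted(set(item["tags"]) & set(intents))
    return f"Recommended '{item['name']}' because you showed interest in {matched}."

history = [{"name": "trail shoes", "tags": ["running", "outdoor"]},
           {"name": "tent", "tags": ["camping", "outdoor"]}]
catalog = [{"name": "headlamp", "tags": ["camping", "outdoor"]},
           {"name": "office chair", "tags": ["furniture"]},
           {"name": "running socks", "tags": ["running"]}]

intents = mine_interests(history)
for item in retrieve_items(intents, catalog):
    print(explain(item, intents))
```

The point of the shape is that every stage consumes and produces explicit intent, so retrieval and explanation are conditioned on what the user wants rather than on what they happened to click.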
πΉ Paper Links:
β’ arXiv Page: https://arxiv.org/abs/2507.22879
β’ PDF: https://arxiv.org/pdf/2507.22879
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
==================================
For more data science resources:
β https://t.iss.one/DataScienceT
πΉ Title: Beyond Linear Bottlenecks: Spline-Based Knowledge Distillation for Culturally Diverse Art Style Classification
πΉ Publication Date: Published on Jul 31
πΉ Abstract: Enhancing dual-teacher self-supervised frameworks with Kolmogorov-Arnold Networks improves art style classification by better modeling nonlinear feature correlations and disentangling complex style manifolds. AI-generated summary: Art style classification remains a formidable challenge in computational aesthetics due to the scarcity of expertly labeled datasets and the intricate, often nonlinear interplay of stylistic elements. While recent dual-teacher self-supervised frameworks reduce reliance on labeled data, their linear projection layers and localized focus struggle to model global compositional context and complex style-feature interactions. We enhance the dual-teacher knowledge distillation framework to address these limitations by replacing conventional MLP projection and prediction heads with Kolmogorov-Arnold Networks (KANs). Our approach retains complementary guidance from two teacher networks, one emphasizing localized texture and brushstroke patterns, the other capturing broader stylistic hierarchies, while leveraging KANs' spline-based activations to model nonlinear feature correlations with mathematical precision. Experiments on WikiArt and Pandora18k demonstrate that our approach outperforms the base dual-teacher architecture in Top-1 accuracy. Our findings highlight the importance of KANs in disentangling complex style manifolds, leading to better linear-probe accuracy than MLP projections.
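The spline-based activations the summary refers to can be illustrated with a toy KAN-style projection head, where every input-to-output edge carries its own learnable 1-D function instead of a scalar weight. The sketch below is a simplified stand-in, not the paper's implementation: it uses piecewise-linear splines rather than B-spline bases, has no training loop, and all names are hypothetical.

```python
import random

class LinearSplineKANLayer:
    """Toy KAN-style layer: each input->output edge owns a learnable 1-D
    spline (piecewise-linear here for brevity; real KANs use B-splines).
    Output_j = sum_i spline_ij(x_i), so nonlinearity lives on the edges."""

    def __init__(self, in_dim, out_dim, num_knots=8, lo=-1.0, hi=1.0, seed=0):
        rng = random.Random(seed)
        self.in_dim, self.out_dim = in_dim, out_dim
        self.lo, self.hi = lo, hi
        self.step = (hi - lo) / (num_knots - 1)
        self.knots = [lo + k * self.step for k in range(num_knots)]
        # One value per knot per (input, output) edge -- the trainable weights.
        self.values = [[[rng.gauss(0.0, 0.1) for _ in range(num_knots)]
                        for _ in range(out_dim)] for _ in range(in_dim)]

    def _interp(self, vals, x):
        # Evaluate one edge spline at x by linear interpolation on the grid.
        x = min(max(x, self.lo), self.hi)
        k = min(int((x - self.lo) / self.step), len(self.knots) - 2)
        t = (x - self.knots[k]) / self.step
        return (1 - t) * vals[k] + t * vals[k + 1]

    def forward(self, x):
        # x: list of in_dim floats -> list of out_dim floats.
        return [sum(self._interp(self.values[i][j], x[i]) for i in range(self.in_dim))
                for j in range(self.out_dim)]

head = LinearSplineKANLayer(in_dim=4, out_dim=2)
out = head.forward([0.3, -0.5, 0.9, 0.0])
print(len(out))  # 2
```

The contrast with an MLP head is that an MLP applies a fixed nonlinearity after a linear mix, whereas here each feature is passed through its own learned curve before summation, which is what lets such heads fit nonlinear feature correlations directly.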
πΉ Paper Links:
β’ arXiv Page: https://arxiv.org/abs/2507.23436
β’ PDF: https://arxiv.org/pdf/2507.23436
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
==================================
πΉ Title:
4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture
πΉ Publication Date: Published on Jul 7
πΉ Abstract:
A high-speed 4D capturing system using low-FPS cameras with asynchronous capture and video-diffusion-based artifact correction enhances reconstruction quality. AI-generated summary: Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low-FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system that uses only low-FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of the cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On the processing side, we propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, since asynchrony reduces the number of viewpoints available at each timestamp. Specifically, we train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.
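The staggered-start arithmetic behind the asynchronous capture scheme is straightforward to check: with G camera groups at a base rate of 25 FPS, offsetting each group's start by 1/(25·G) seconds makes the union of all captures sample the scene at G·25 FPS (4 groups give 100 FPS, 8 give 200 FPS). A minimal sketch, with function and variable names that are illustrative rather than taken from the paper's code:

```python
def stagger_schedule(num_groups, base_fps=25.0, num_frames=5):
    """Stagger camera-group start times so the union of captures samples
    the scene at num_groups * base_fps, using only base_fps cameras."""
    period = 1.0 / base_fps                       # time between frames per group
    offsets = [g * period / num_groups for g in range(num_groups)]
    # Merge every group's frame timestamps into one global timeline.
    times = sorted(off + k * period for off in offsets for k in range(num_frames))
    return offsets, times

offsets, times = stagger_schedule(num_groups=4)
effective_fps = 1.0 / (times[1] - times[0])       # uniform gap between samples
print(round(effective_fps))  # 100
```

The trade-off the abstract then addresses follows directly from this schedule: at any single timestamp only one group (a fraction of the cameras) has fired, so per-timestamp reconstruction is sparse-view, which is what the video-diffusion artifact-fix model compensates for.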
πΉ Links:
β’ arXiv Page: https://arxiv.org/abs/2507.05163
β’ PDF: https://arxiv.org/pdf/2507.05163
β’ Github: https://openimaginglab.github.io/4DSloMo/
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
==================================