🔹 Title:
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering
🔹 Publication Date: Published on Jul 15
🔹 Abstract:
AI-generated summary: DrafterBench is an open-source benchmark for evaluating LLM agents in technical drawing revision, assessing their capabilities in structured data comprehension, function execution, instruction following, and critical reasoning.
Large Language Model (LLM) agents have shown great potential for solving real-world problems and promise to be a solution for task automation in industry. However, more benchmarks are needed to systematically evaluate automation agents from an industrial perspective, for example, in Civil Engineering. Therefore, we propose DrafterBench for the comprehensive evaluation of LLM agents in the context of technical drawing revision, a representative task in civil engineering. DrafterBench contains twelve types of tasks summarized from real-world drawing files, with 46 customized functions/tools and 1920 tasks in total. DrafterBench is an open-source benchmark to rigorously test AI agents' proficiency in interpreting intricate and long-context instructions, leveraging prior knowledge, and adapting to dynamic instruction quality via implicit policy awareness. The toolkit comprehensively assesses distinct capabilities in structured data comprehension, function execution, instruction following, and critical reasoning. DrafterBench offers detailed analysis of task accuracy and error statistics, aiming to provide deeper insight into agent capabilities and identify improvement targets for integrating LLMs in engineering applications. Our benchmark is available at https://github.com/Eason-Li-AIS/DrafterBench, with the test set hosted at https://huggingface.co/datasets/Eason666/DrafterBench.
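A minimal sketch of pulling the hosted test set with the `datasets` library; the split names and per-task fields are not documented in this post, so the code only inspects whatever the dataset card actually exposes (assuming the default config loads without extra arguments).

```python
# Minimal sketch: inspect the DrafterBench test set hosted on the Hugging Face Hub.
# Assumption: the default config loads without arguments; field names are unknown here.
from datasets import load_dataset

ds = load_dataset("Eason666/DrafterBench")
print(ds)                                   # available splits and columns
first_split = next(iter(ds.values()))
print(first_split[0])                       # one task record, whatever its schema is
```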
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.11527
• PDF: https://arxiv.org/pdf/2507.11527
• Github: https://github.com/Eason-Li-AIS/DrafterBench
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/Eason666/DrafterBench
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Calligrapher: Freestyle Text Image Customization
🔹 Publication Date: Published on Jun 30
🔹 Abstract:
AI-generated summary: Calligrapher uses a diffusion-based framework with self-distillation and localized style injection to generate high-quality, stylistically consistent digital typography.
We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.
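The "trainable style encoder, which comprises both Qformer and linear layers" can be pictured with a toy PyTorch module: learnable queries cross-attend to reference-image features and a linear projection emits style tokens for injection. This is an illustrative sketch only, not the authors' implementation; all layer sizes and names are assumptions.

```python
# Toy sketch of a Q-Former-style style encoder (illustrative, not Calligrapher's code).
import torch
import torch.nn as nn

class ToyStyleEncoder(nn.Module):
    def __init__(self, feat_dim=768, num_queries=16, style_dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(feat_dim, style_dim), nn.GELU(),
                                  nn.Linear(style_dim, style_dim))

    def forward(self, ref_feats):                      # ref_feats: (B, N, feat_dim)
        q = self.queries.unsqueeze(0).expand(ref_feats.size(0), -1, -1)
        style, _ = self.cross_attn(q, ref_feats, ref_feats)
        return self.proj(style)                        # (B, num_queries, style_dim)

style_tokens = ToyStyleEncoder()(torch.randn(2, 196, 768))
print(style_tokens.shape)                              # torch.Size([2, 16, 768])
```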
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.24123
• PDF: https://arxiv.org/pdf/2506.24123
• Project Page: https://calligrapher2025.github.io/Calligrapher/
• Github: https://github.com/Calligrapher2025/Calligrapher
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
🔹 Publication Date: Published on Jul 14
🔹 Abstract:
AI-generated summary: A large-scale dataset named SpeakerVid-5M is introduced for audio-visual dyadic interactive virtual human generation, featuring diverse interactions and high-quality data for various virtual human tasks.
The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: the audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch, and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark, VidChatBench, for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/
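Because the corpus is organized along interaction type (dialogue, single, listening, multi-turn branches) and data quality (pre-training vs. curated SFT subsets), a downstream loader would typically filter clip metadata on those two axes. The sketch below assumes a hypothetical per-clip metadata JSONL with `branch` and `subset` keys; this is not the released schema, just an illustration of the selection step.

```python
# Hypothetical sketch: select curated dyadic-dialogue clips for SFT.
# The file name and the "branch"/"subset" keys are assumptions, not SpeakerVid-5M's schema.
import json

with open("speakervid5m_metadata.jsonl") as f:
    clips = [json.loads(line) for line in f]

sft_dialogue = [c for c in clips
                if c.get("branch") == "dialogue" and c.get("subset") == "sft"]
print(f"{len(sft_dialogue)} dialogue clips in the curated SFT subset")
```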
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.09862
• PDF: https://arxiv.org/pdf/2507.09862
• Project Page: https://dorniwang.github.io/SpeakerVid-5M/
• Github: https://dorniwang.github.io/SpeakerVid-5M/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
🔹 Publication Date: Published on Jul 10
🔹 Abstract:
AI-generated summary: OST-Bench evaluates multimodal large language models in online spatio-temporal reasoning tasks, revealing challenges in handling complex spatial cues and long-term memory in real-world scenarios.
Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/
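A minimal sketch for pulling the released question-answer pairs from the Hub; the configs, splits, and column names are not stated in this post, so the code only lists what the dataset actually exposes (assuming the default config loads without extra arguments).

```python
# Minimal sketch: inspect the OST-Bench QA data on the Hugging Face Hub.
# No split or column names are assumed; we just print what is available.
from datasets import load_dataset

ds = load_dataset("rbler/OST-Bench")
for split_name, split in ds.items():
    print(split_name, len(split), split.column_names)
```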
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.07984
• PDF: https://arxiv.org/pdf/2507.07984
• Project Page: https://rbler1234.github.io/OSTBench.github.io/
• Github: https://github.com/OpenRobotLab/OST-Bench
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/rbler/OST-Bench
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Agentic Reinforced Policy Optimization
🔹 Publication Date: Published on Jul 26
🔹 Abstract:
AI-generated summary: Agentic Reinforced Policy Optimization (ARPO) is a novel RL algorithm that enhances multi-turn LLM-based agents by adaptive uncertainty management and advantage attribution, outperforming trajectory-level RL algorithms with reduced resource usage.
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO's superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at https://github.com/dongguanting/ARPO
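The entropy-based adaptive rollout can be pictured with a small sketch: after a tool response, measure the entropy of the model's next-token distributions and branch extra step-level rollouts only where that uncertainty spikes. This is a schematic reading of the abstract, not the released ARPO code; the threshold and the branching rule are made-up illustration values.

```python
# Schematic entropy-gated branching rule (illustrative, not the official ARPO code).
# token_probs: per-step next-token distributions right after a tool response, shape (T, V).
import numpy as np

def step_entropy(token_probs: np.ndarray) -> np.ndarray:
    p = np.clip(token_probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)               # entropy per generation step

def should_branch(token_probs: np.ndarray, threshold: float = 2.5) -> bool:
    # Spawn additional step-level rollouts when post-tool uncertainty is high.
    return float(step_entropy(token_probs).mean()) > threshold

probs = np.random.dirichlet(np.ones(50), size=8)        # toy (T=8, V=50) distributions
print(should_branch(probs))
```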
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.19849
• PDF: https://arxiv.org/pdf/2507.19849
• Project Page: https://github.com/dongguanting/ARPO
• Github: https://github.com/dongguanting/ARPO
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/dongguanting/ARPO-SFT-54K
• https://huggingface.co/datasets/dongguanting/ARPO-RL-DeepSearch-1K
• https://huggingface.co/datasets/dongguanting/ARPO-RL-Reasoning-10K
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
🔹 Publication Date: Published on Jul 28
🔹 Abstract:
AI-generated summary: A multimodal model that processes visual, audio, and textual signals for structured comprehension of real-world short videos improves video search, recommendation, and engagement.
Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on an H20 GPU.
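"Multi-granularity timestamped captioning" implies structured outputs keyed by time spans. The layout below is a purely hypothetical illustration of what such a structured comprehension result could look like; it is not the model's documented output schema, and all keys and values are invented for the example.

```python
# Hypothetical structured, timestamped comprehension result (illustrative schema only).
structured_result = {
    "summary": "A creator reviews a budget phone and recommends it for students.",
    "captions": [
        {"start": 0.0,  "end": 8.5,  "text": "Intro and unboxing."},
        {"start": 8.5,  "end": 31.0, "text": "Camera and battery tests."},
        {"start": 31.0, "end": 60.0, "text": "Verdict and call to action."},
    ],
}
for c in structured_result["captions"]:
    print(f'[{c["start"]:>5.1f}s-{c["end"]:>5.1f}s] {c["text"]}')
```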
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.20939
• PDF: https://arxiv.org/pdf/2507.20939
• Project Page: https://tencentarc.github.io/posts/arc-video-announcement/
• Github: https://github.com/TencentARC/ARC-Hunyuan-Video-7B
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Reconstructing 4D Spatial Intelligence: A Survey
🔹 Publication Date: Published on Jul 28
🔹 Abstract:
AI-generated summary: A survey organizes methods for reconstructing 4D spatial intelligence from visual observations into five progressive levels, offering analysis and identifying future research directions.
Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision, with broad real-world applications. These range from entertainment domains like movies, where the focus is often on reconstructing fundamental visual elements, to embodied AI, which emphasizes interaction modeling and physical realism. Fueled by rapid advances in 3D representations and deep learning architectures, the field has evolved quickly, outpacing the scope of previous surveys. Additionally, existing surveys rarely offer a comprehensive analysis of the hierarchical structure of 4D scene reconstruction. To address this gap, we present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence: (1) Level 1 -- reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) Level 2 -- reconstruction of 3D scene components (e.g., objects, humans, structures); (3) Level 3 -- reconstruction of 4D dynamic scenes; (4) Level 4 -- modeling of interactions among scene components; and (5) Level 5 -- incorporation of physical laws and constraints. We conclude the survey by discussing the key challenges at each level and highlighting promising directions for advancing toward even richer levels of 4D spatial intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence.
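For quick reference, the survey's five-level hierarchy condenses into a small lookup table; the short labels below simply paraphrase the level descriptions given in the abstract.

```python
# The survey's five progressive levels of 4D spatial intelligence, as a lookup table.
LEVELS_4D = {
    1: "Low-level 3D attributes (depth, pose, point maps)",
    2: "3D scene components (objects, humans, structures)",
    3: "4D dynamic scenes",
    4: "Interactions among scene components",
    5: "Physical laws and constraints",
}
for level, description in LEVELS_4D.items():
    print(f"Level {level}: {description}")
```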
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.21045
• PDF: https://arxiv.org/pdf/2507.21045
• Github: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning
🔹 Publication Date: Published on Jul 28
🔹 Abstract:
AI-generated summary: Rep-MTL optimizes multi-task learning by leveraging task saliency in shared representations to promote complementarity and reduce negative transfer.
Despite the promise of Multi-Task Learning in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts via optimizer-centric loss scaling and gradient manipulation strategies, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizers, especially for facilitating inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead of pure conflict-solving, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law exponent analysis demonstrates Rep-MTL's efficacy in balancing task-specific learning and cross-task sharing. The project page is available at https://jacky1128.github.io/RepMTL/.
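To make "entropy-based penalization of task saliency" concrete, here is an illustrative toy: per-task saliency scores over shared-representation channels are normalized into distributions and an entropy term regularizes how peaked each task's saliency is. Whether entropy is pushed up or down, and on what exactly the saliency is computed, are design details of the paper; the snippet below is one possible reading, not the Rep-MTL implementation.

```python
# Illustrative toy (not the Rep-MTL code): entropy regularization over
# per-task saliency distributions on a shared representation.
import torch
import torch.nn.functional as F

def saliency_entropy_penalty(task_saliency: torch.Tensor) -> torch.Tensor:
    # task_saliency: (num_tasks, feat_dim) saliency scores on shared features.
    p = F.softmax(task_saliency, dim=-1)                 # per-task distribution over features
    entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=-1)
    return -entropy.mean()                               # one choice: penalize peaked saliency

penalty = saliency_entropy_penalty(torch.rand(4, 256))
print(penalty.item())
```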
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.21049
• PDF: https://arxiv.org/pdf/2507.21049
• Project Page: https://jacky1128.github.io/RepMTL/
• Github: https://github.com/Jacky1128/Rep-MTL
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment
🔹 Publication Date: Published on Jul 28
🔹 Abstract:
AI-generated summary: A flow-matching-based model enhances lyrics-to-song generation by providing word-level control over vocal timing and duration, improving quality through aesthetic alignment and surpassing current models in music-specific attributes.
Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high-quality and faithful audio outputs capturing speech and acoustic events. However, there is still much room for improvement in creative audio generation that primarily involves music and songs. Recent open lyrics-to-song models, such as DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack fine-grained word-level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing word-level timing and duration control in song generation, allowing fine-grained vocal control. To enhance the quality of generated songs to better align with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need for manual data annotations. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset JAME. We show that JAM outperforms the existing models in terms of music-specific attributes.
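Word-level timing and duration control implies conditioning on lyrics annotated with per-word onsets and durations. The layout below is a hypothetical illustration of such an input; the field names and units are assumptions, not JAM's actual conditioning format.

```python
# Hypothetical word-level timing conditioning for lyrics-to-song generation.
# Field names and units (seconds) are illustrative, not JAM's released format.
lyrics_with_timing = [
    {"word": "hello",    "onset": 0.00, "duration": 0.40},
    {"word": "darkness", "onset": 0.45, "duration": 0.70},
    {"word": "my",       "onset": 1.20, "duration": 0.25},
    {"word": "old",      "onset": 1.50, "duration": 0.30},
    {"word": "friend",   "onset": 1.85, "duration": 0.80},
]
total = lyrics_with_timing[-1]["onset"] + lyrics_with_timing[-1]["duration"]
print(f"vocal segment spans {total:.2f}s over {len(lyrics_with_timing)} words")
```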
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.20880
• PDF: https://arxiv.org/pdf/2507.20880
• Project Page: https://declare-lab.github.io/jamify
• Github: https://declare-lab.github.io/jamify
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Deep Researcher with Test-Time Diffusion
🔹 Publication Date: Published on Jul 21
🔹 Abstract:
AI-generated summary: The Test-Time Diffusion Deep Researcher (TTD-DR) framework uses a diffusion process with iterative refinement and external information retrieval to generate high-quality research reports, outperforming existing methods.
Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a "denoising" process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.
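The draft-as-noisy-state idea can be sketched as a retrieval-informed refinement loop: start from a preliminary draft, retrieve evidence conditioned on it, revise, and stop once a quality criterion is met. The helpers below (`retrieve`, `revise`, `quality`) are placeholders standing in for LLM and search calls; this is a schematic sketch, not the TTD-DR system.

```python
# Schematic sketch of a draft-centric "denoising" loop with retrieval
# (placeholder helpers; not the TTD-DR implementation).
def ttd_dr_sketch(question, retrieve, revise, quality, max_steps=8, target=0.9):
    draft = f"Preliminary outline for: {question}"
    for _ in range(max_steps):
        evidence = retrieve(question, draft)         # external search informed by the draft
        draft = revise(draft, evidence)              # "denoise" the draft with new evidence
        if quality(draft) >= target:                 # stop once the report is good enough
            break
    return draft

# Trivial stand-ins so the sketch runs end to end.
report = ttd_dr_sketch(
    "How do retrieval-augmented agents plan long reports?",
    retrieve=lambda q, d: ["snippet"],
    revise=lambda d, e: d + " +revised",
    quality=lambda d: d.count("+revised") / 8,
)
print(report)
```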
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.16075
• PDF: https://arxiv.org/pdf/2507.16075
• Github: https://github.com/codelion/optillm/tree/main/optillm/plugins/deep_research/sample_reports
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
🔹 Publication Date: Published on Jul 28
🔹 Abstract:
AI-generated summary: SmallThinker, designed for local devices with limited resources, uses advanced architectural innovations to achieve high performance without requiring GPU hardware.
While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we utilize a NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1GB and 8GB of memory respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.
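Since the checkpoints are public on the Hub, a standard `transformers` loading sketch applies to the 4B-A0.6B instruct model; whether the repo needs `trust_remote_code`, ships a chat template, or expects a specific revision are assumptions to verify on the model card.

```python
# Minimal sketch: run the released 4B-A0.6B instruct checkpoint with transformers.
# Assumptions: trust_remote_code is required for the custom MoE architecture,
# and the tokenizer ships a chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PowerInfer/SmallThinker-4BA0.6B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Summarize the SmallThinker design in one sentence."}],
    tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```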
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.20984
• PDF: https://arxiv.org/pdf/2507.20984
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Region-based Cluster Discrimination for Visual Representation Learning
🔹 Publication Date: Published on Jul 26
🔹 Abstract:
AI-generated summary: RICE enhances region-level visual and OCR capabilities through a novel Region Transformer and cluster discrimination loss, achieving superior performance across dense prediction and perception tasks.
Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks, such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks, including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at https://github.com/deepglint/MVT.
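Cluster discrimination at the region level reduces, in schematic form, to classifying each region embedding against a large bank of cluster centers under a cross-entropy objective. The toy below only shows that general shape of loss; it is not the RICE training objective, and the sizes, temperature, and pseudo-labels are invented for illustration.

```python
# Toy sketch of a region-level cluster discrimination loss (illustrative only).
import torch
import torch.nn.functional as F

num_regions, dim, num_clusters = 32, 512, 10_000
region_emb = F.normalize(torch.randn(num_regions, dim), dim=-1)
centers = F.normalize(torch.randn(num_clusters, dim), dim=-1)
cluster_ids = torch.randint(num_clusters, (num_regions,))    # pseudo-labels from clustering

logits = region_emb @ centers.t() / 0.07                      # temperature-scaled similarities
loss = F.cross_entropy(logits, cluster_ids)
print(loss.item())
```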
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.20025
• PDF: https://arxiv.org/pdf/2507.20025
• Github: https://github.com/deepglint/MVT
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Ovis-U1 Technical Report
🔹 Publication Date: Published on Jun 29
🔹 Abstract:
AI-generated summary: Ovis-U1, a 3-billion-parameter model, combines multimodal understanding, text-to-image generation, and image editing, achieving state-of-the-art performance in various benchmarks.
In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.
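The demo Spaces listed below can be queried programmatically with `gradio_client`; the endpoint names and parameters are not documented in this post, so the sketch only connects to the public Space and prints whatever API it advertises.

```python
# Minimal sketch: inspect the public Ovis-U1 demo Space's API with gradio_client.
# Endpoint names and arguments are unknown here, so we only list what the Space exposes.
from gradio_client import Client

client = Client("AIDC-AI/Ovis-U1-3B")
client.view_api()          # prints available endpoints and their parameters
```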
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.23044
• PDF: https://arxiv.org/pdf/2506.23044
• Github: https://github.com/AIDC-AI/Ovis-U1
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
• https://huggingface.co/spaces/AIDC-AI/Ovis-U1-3B
• https://huggingface.co/spaces/evalstate/Ovis-U1-3B
• https://huggingface.co/spaces/LLMTestSaurav/Ovis-U1-Demo
• https://huggingface.co/spaces/innoai/Ovis-U1-3B-cpu
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
AFRDA: Attentive Feature Refinement for Domain Adaptive Semantic Segmentation
🔹 Publication Date: Published on Jul 23
🔹 Abstract:
The Adaptive Feature Refinement (AFR) module enhances unsupervised domain adaptive semantic segmentation by refining high-resolution features with low-resolution logits and integrating high-frequency components, leading to improved segmentation performance. AI-generated summary In Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS), a model is trained on labeled source domain data (e.g., synthetic images) and adapted to an unlabeled target domain (e.g., real-world images) without access to target annotations. Existing UDA-SS methods often struggle to balance fine-grained local details with global contextual information, leading to segmentation errors in complex regions. To address this, we introduce the Adaptive Feature Refinement (AFR) module, which enhances segmentation accuracy by refining high-resolution features using semantic priors from low-resolution logits. AFR also integrates high-frequency components, which capture fine-grained structures and provide crucial boundary information, improving object delineation. Additionally, AFR adaptively balances local and global information through uncertainty-driven attention, reducing misclassifications. Its lightweight design allows seamless integration into HRDA-based UDA methods, leading to state-of-the-art segmentation performance. Our approach improves existing UDA-SS methods by 1.05% mIoU on GTA V → Cityscapes and 1.04% mIoU on Synthia → Cityscapes. The implementation of our framework is available at: https://github.com/Masrur02/AFRDA
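The refinement recipe above translates into a compact module: upsample the low-resolution logits as a semantic prior, gate that prior with an entropy-based uncertainty map, and add back a high-frequency residual of the high-resolution features. The sketch below is a hedged approximation that assumes a simple concatenate-and-convolve fusion rather than the paper's exact design.
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class AFRSketch(nn.Module):
    """Hedged approximation of an AFR-style refinement block."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.prior_proj = nn.Conv2d(num_classes, feat_dim, kernel_size=1)
        self.fuse = nn.Conv2d(feat_dim * 2, feat_dim, kernel_size=3, padding=1)

    def forward(self, feat_hr: torch.Tensor, logits_lr: torch.Tensor) -> torch.Tensor:
        # 1) Semantic prior: upsample low-res logits to the high-res feature grid.
        logits_up = F.interpolate(logits_lr, size=feat_hr.shape[-2:],
                                  mode="bilinear", align_corners=False)
        prob = torch.softmax(logits_up, dim=1)
        prior = self.prior_proj(prob)

        # 2) Uncertainty-driven gate: high entropy -> trust the prior less.
        entropy = -(prob * prob.clamp_min(1e-8).log()).sum(dim=1, keepdim=True)
        gate = 1.0 - entropy / math.log(prob.shape[1])

        # 3) High-frequency residual: feature minus its local average (boundary detail).
        high_freq = feat_hr - F.avg_pool2d(feat_hr, kernel_size=3, stride=1, padding=1)

        refined = self.fuse(torch.cat([feat_hr, gate * prior], dim=1))
        return refined + high_freq


if __name__ == "__main__":
    afr = AFRSketch(feat_dim=64, num_classes=19)
    out = afr(torch.randn(2, 64, 128, 128), torch.randn(2, 19, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 128, 128])
```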
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.17957
• PDF: https://arxiv.org/pdf/2507.17957
• Github: https://github.com/Masrur02/AFRDA
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Offline Reinforcement Learning from Datasets with Structured Non-Stationarity
🔹 Publication Date: Published on May 23, 2024
🔹 Abstract:
A novel Offline RL method uses Contrastive Predictive Coding to handle non-stationary transition and reward functions in datasets, outperforming baselines in various control tasks. AI-generated summary Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy. Offline RL aims to solve this issue by using transitions collected by a different behavior policy. We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode. We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation. We analyze our proposed method and show that it performs well in simple continuous control tasks and challenging, high-dimensional locomotion tasks. We show that our method often achieves oracle-level performance and outperforms the baselines.
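A minimal sketch of the contrastive idea, under the assumption that each episode is summarized by a mean-pooled transition encoding and that an InfoNCE loss asks each episode's context to identify its successor among the batch; the authors' actual encoder and objective may differ.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EpisodeEncoder(nn.Module):
    """Summarizes one episode's transitions into a latent context vector."""
    def __init__(self, transition_dim: int, latent_dim: int = 32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(transition_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
        self.predict_next = nn.Linear(latent_dim, latent_dim)

    def forward(self, transitions: torch.Tensor) -> torch.Tensor:
        # transitions: (E, T, transition_dim) -> (E, latent_dim)
        return self.phi(transitions).mean(dim=1)


def cpc_loss(enc: EpisodeEncoder, episodes: torch.Tensor) -> torch.Tensor:
    """episodes: (E, T, D) consecutive episodes drawn from the offline dataset."""
    ctx = enc(episodes)                           # (E, latent)
    pred = enc.predict_next(ctx[:-1])             # predict the next episode's context
    logits = pred @ ctx[1:].t()                   # similarity to all candidate successors
    labels = torch.arange(logits.shape[0])        # the true successor is the positive
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    enc = EpisodeEncoder(transition_dim=10)
    episodes = torch.randn(8, 50, 10)             # 8 consecutive episodes, 50 steps each
    print(cpc_loss(enc, episodes).item())
```
Because the latent factor drifts between episodes but stays constant within one, this successor-prediction task is exactly what pushes the encoder to capture the hidden, slowly changing variable.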
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2405.14114
• PDF: https://arxiv.org/pdf/2405.14114
🔹 Datasets citing this paper:
• https://huggingface.co/datasets/johannesack/OfflineRLStructuredNonstationary
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions
🔹 Publication Date:
Published on Jul 29
🔹 Abstract:
A systematic assessment of honesty in Multimodal Large Language Models (MLLMs) using a large-scale benchmark reveals that models often fail to appropriately refuse unanswerable visual questions, highlighting the need for multimodal honesty alignment methods. AI-generated summary Recently, Multimodal Large Language Models (MLLMs) have achieved considerable advancements in vision-language tasks, yet produce potentially harmful or untrustworthy content. Despite substantial work investigating the trustworthiness of language models, MLLMs' capability to act honestly, especially when faced with visually unanswerable questions, remains largely underexplored. This work presents the first systematic assessment of honesty behaviors across various MLLMs. We ground honesty in models' response behaviors to unanswerable visual questions, define four representative types of such questions, and construct MoHoBench, a large-scale MLLM honesty benchmark consisting of 12k+ visual question samples, whose quality is guaranteed by multi-stage filtering and human verification. Using MoHoBench, we benchmarked the honesty of 28 popular MLLMs and conducted a comprehensive analysis. Our findings show that: (1) most models fail to appropriately refuse to answer when necessary, and (2) MLLMs' honesty is not solely a language modeling issue, but is deeply influenced by visual information, necessitating the development of dedicated methods for multimodal honesty alignment. Therefore, we implemented initial alignment methods using supervised and preference learning to improve honesty behavior, providing a foundation for future work on trustworthy MLLMs. Our data and code can be found at https://github.com/DSTTSD/MoHoBench.
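For intuition on how such a benchmark can be scored, the snippet below computes a per-type refusal rate over unanswerable visual questions. The keyword-based is_refusal heuristic and the 'context-missing' question type are placeholders, not the benchmark's actual judge or taxonomy.
```python
from collections import defaultdict

REFUSAL_CUES = ("cannot be determined", "not visible", "unanswerable",
                "i can't tell", "i cannot answer", "not enough information")


def is_refusal(answer: str) -> bool:
    """Crude stand-in for a refusal judge: keyword matching on the answer."""
    a = answer.lower()
    return any(cue in a for cue in REFUSAL_CUES)


def honesty_report(samples):
    """samples: iterable of dicts with keys 'question_type' and 'model_answer'."""
    stats = defaultdict(lambda: [0, 0])            # type -> [refusals, total]
    for s in samples:
        stats[s["question_type"]][0] += int(is_refusal(s["model_answer"]))
        stats[s["question_type"]][1] += 1
    return {t: refused / total for t, (refused, total) in stats.items()}


if __name__ == "__main__":
    demo = [
        {"question_type": "context-missing",
         "model_answer": "It cannot be determined from the image."},
        {"question_type": "context-missing",
         "model_answer": "The car is red."},
    ]
    print(honesty_report(demo))                    # {'context-missing': 0.5}
```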
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.21503
• PDF: https://arxiv.org/pdf/2507.21503
• Github: https://github.com/DSTTSD/MoHoBench
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
🔹 Publication Date:
Published on Jul 29
🔹 Abstract:
Reinforcement learning enhances discrete autoregressive modeling for image and language generation, achieving high-quality image generation and instruction-following capabilities. AI-generated summary Numerous efforts have been made to extend the "next token prediction" paradigm to visual contents, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributed to cumulative errors during autoregressive inference or information loss incurred during the discretization process. Probably due to this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and largely enhance the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework, termed X-Omni, comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.
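To illustrate what reinforcement learning on discrete image tokens can look like mechanically, here is a generic REINFORCE-with-baseline sketch: sample image tokens autoregressively, score each completed sequence with a scalar reward, and weight the sampled log-probabilities by the advantage. This is a textbook policy-gradient skeleton with assumed model/reward_fn interfaces, not X-Omni's actual training recipe.
```python
import torch


def policy_gradient_step(model, prompt_ids, reward_fn, optimizer, max_new_tokens=64):
    """model(ids) -> logits of shape (B, T, vocab); reward_fn(ids) -> (B,) rewards."""
    ids = prompt_ids.clone()                                  # (B, T0) prompt tokens
    log_probs = []
    for _ in range(max_new_tokens):                           # sample discrete image tokens
        logits = model(ids)[:, -1, :]
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()
        log_probs.append(dist.log_prob(tok))
        ids = torch.cat([ids, tok.unsqueeze(1)], dim=1)

    rewards = reward_fn(ids)                                  # e.g. aesthetics / instruction score
    advantage = rewards - rewards.mean()                      # mean baseline for variance reduction
    seq_log_prob = torch.stack(log_probs, dim=1).sum(dim=1)   # (B,) sequence log-probability
    loss = -(advantage.detach() * seq_log_prob).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```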
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2507.22058
• PDF: https://arxiv.org/pdf/2507.22058
• Project Page: https://x-omni-team.github.io
• Github: https://github.com/X-Omni-Team/X-Omni
🔹 Spaces citing this paper:
• https://huggingface.co/spaces/zhangxiaosong18/X-Omni-En
• https://huggingface.co/spaces/zhangxiaosong18/X-Omni-Zh
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
FedRand: Enhancing Privacy in Federated Learning with Randomized LoRA Subparameter Updates
🔹 Publication Date: Published on Mar 10
🔹 Abstract:
The FedRand framework enhances data privacy in federated learning by keeping a subset of LoRA parameters private, reducing the risk of membership inference attacks while maintaining model accuracy. AI-generated summary Federated Learning (FL) is a widely used framework for training models in a decentralized manner, ensuring that the central server does not have direct access to data from local clients. However, this approach may still fail to fully preserve data privacy, as models from local clients are exposed to the central server during the aggregation process. This issue becomes even more critical when training vision-language models (VLMs) with FL, as VLMs can easily memorize training data instances, making them vulnerable to membership inference attacks (MIAs). To address this challenge, we propose the FedRand framework, which avoids disclosing the full set of client parameters. In this framework, each client randomly selects subparameters of Low-Rank Adaptation (LoRA) from the server and keeps the remaining counterparts of the LoRA weights as private parameters. After training both sets of parameters on the client's private dataset, only the non-private client parameters are sent back to the server for aggregation. This approach mitigates the risk of exposing client-side VLM parameters, thereby enhancing data privacy. We empirically validate that FedRand improves robustness against MIAs compared to relevant baselines while achieving accuracy comparable to methods that communicate full LoRA parameters across several benchmark datasets.
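The core mechanism, that only a randomly chosen part of each LoRA pair ever leaves the client, fits in a few lines. In the sketch below, the per-layer split rule (keep either A or B private) and the plain-averaging aggregation are assumptions for illustration; see the paper for the actual protocol.
```python
import random

import torch


def split_lora_params(lora_state: dict, seed: int) -> tuple[dict, dict]:
    """lora_state maps 'layer.lora_A' / 'layer.lora_B' -> tensors.

    For each layer, randomly designate one of the two matrices as private;
    only the other half is returned to the server.
    """
    rng = random.Random(seed)
    shared, private = {}, {}
    layers = {k.rsplit(".", 1)[0] for k in lora_state}
    for layer in sorted(layers):
        keep_a_private = rng.random() < 0.5
        a_key, b_key = f"{layer}.lora_A", f"{layer}.lora_B"
        (private if keep_a_private else shared)[a_key] = lora_state[a_key]
        (shared if keep_a_private else private)[b_key] = lora_state[b_key]
    return shared, private


def aggregate(client_updates: list[dict]) -> dict:
    """Average whatever each key's contributing clients actually sent back."""
    merged = {}
    for update in client_updates:
        for k, v in update.items():
            merged.setdefault(k, []).append(v)
    return {k: torch.stack(vs).mean(dim=0) for k, vs in merged.items()}


if __name__ == "__main__":
    state = {"attn.lora_A": torch.randn(8, 512), "attn.lora_B": torch.randn(512, 8)}
    shared, private = split_lora_params(state, seed=0)
    print("shared to server:", list(shared), "| kept private:", list(private))
```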
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2503.07216
• PDF: https://arxiv.org/pdf/2503.07216
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT