Article Title:
VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
Article Date: 28 May 2025
Article Description:
Effectively retrieving, reasoning and understanding visually rich information remains a challenge for RAG methods. Traditional text-based methods cannot handle visual-related information. On the other hand, current vision-based RAG approaches are often limited by fixed pipelines and frequently struggle to reason effectively due to the insufficient activation of the fundamental capabilities of models. As RL has been proven to be beneficial for model reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories with the help of visual perception tokens and undergoing continual optimization based on these samples. Our approach highlights key limitations of RL in RAG domains: (i) Prior multi-modal RAG approaches tend to merely incorporate images into the context, leading to insufficient reasoning token allocation and neglecting visual-specific perception; and (ii) When models interact with search engines, their queries often fail to retrieve relevant information due to the inability to articulate requirements, thereby leading to suboptimal performance. To address these challenges, we define an action space tailored for visually rich inputs, with actions including cropping and scaling, allowing the model to gather information from a coarse-to-fine perspective. Furthermore, to bridge the gap between users' original inquiries and the retriever, we employ a simple yet effective reward that integrates query rewriting and retrieval performance with a model-based reward. Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL strategies, aligning the model with real-world applications. The code is available at https://github.com/Alibaba-NLP/VRAG.
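A minimal sketch of what coarse-to-fine crop/scale actions over a retrieved page image could look like (the function names and the PIL-based implementation are assumptions for illustration, not the released action space):

```python
from PIL import Image

def crop_region(page: Image.Image, bbox: tuple[float, float, float, float]) -> Image.Image:
    """Crop a normalized (x0, y0, x1, y1) region from a retrieved page image."""
    w, h = page.size
    x0, y0, x1, y1 = bbox
    return page.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))

def scale_up(region: Image.Image, factor: float = 2.0) -> Image.Image:
    """Enlarge a cropped region so fine-grained text becomes legible to the VLM."""
    w, h = region.size
    return region.resize((int(w * factor), int(h * factor)))

# Example: zoom into the lower-right quadrant of a retrieved document page.
page = Image.open("retrieved_page.png")          # hypothetical file
zoomed = scale_up(crop_region(page, (0.5, 0.5, 1.0, 1.0)))
zoomed.save("zoomed_patch.png")
```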
PDF Download Link:
https://arxiv.org/pdf/2505.22019v1.pdf
GitHub:
• https://github.com/alibaba-nlp/vrag
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Robustness and Sensitivity of BERT Models Predicting Alzheimer's Disease from Text
🔹 Publication Date: Published on Sep 24, 2021
🔹 Abstract:
AI-generated summary: Analysis reveals that BERT is robust to natural linguistic variations but insensitive to the removal of clinically important information in text for Alzheimer's disease prediction.
Understanding robustness and sensitivity of BERT models predicting Alzheimer's disease from text is important for both developing better classification models and for understanding their capabilities and limitations. In this paper, we analyze how a controlled amount of desired and undesired text alterations impacts the performance of BERT. We show that BERT is robust to natural linguistic variations in text. On the other hand, we show that BERT is not sensitive to removing clinically important information from text.
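A minimal sketch of the kind of perturbation probe described above, comparing predictions on an original transcript, a paraphrase, and a version with clinically important content removed (the model path and example sentences are placeholders, not the paper's data):

```python
from transformers import pipeline

# Hypothetical fine-tuned BERT classifier for AD-vs-control transcripts.
clf = pipeline("text-classification", model="path/to/bert-ad-classifier")

original = "the boy is on the stool reaching for the cookie jar while the sink overflows"
# Sensitivity probe: clinically important content (the overflowing sink) removed.
info_removed = "the boy is on the stool reaching for the cookie jar"
# Robustness probe: natural linguistic variation with the same content.
paraphrased = "a boy stands on a stool to reach the cookie jar as the sink overflows"

for name, text in [("original", original), ("info removed", info_removed), ("paraphrase", paraphrased)]:
    print(name, clf(text)[0])  # compare how the label/score shifts across the three variants
```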
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2109.11888
• PDF: https://arxiv.org/pdf/2109.11888
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
• https://huggingface.co/spaces/Jekaterina/bert-robustness
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
The Diffusion Duality
🔹 Publication Date: Published on Jun 12, 2025
🔹 Abstract:
AI-generated summary: Duo improves uniform-state discrete diffusion models by transferring techniques from Gaussian diffusion, enhancing training speed and enabling fast few-step text generation.
Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: https://s-sahoo.github.io/duo
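A toy numerical illustration of the stated duality (not the paper's training code): perturbing one-hot token vectors with Gaussian noise and projecting back via argmax yields a discrete corruption whose flip rate grows toward uniform as the noise level increases.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_tokens = 50, 1000
tokens = rng.integers(vocab_size, size=n_tokens)
one_hot = np.eye(vocab_size)[tokens]

for sigma in (0.1, 0.5, 2.0, 8.0):
    noisy = one_hot + sigma * rng.normal(size=one_hot.shape)   # Gaussian diffusion step
    corrupted = noisy.argmax(axis=-1)                          # project back to discrete tokens
    flip_rate = (corrupted != tokens).mean()
    print(f"sigma={sigma:4.1f}  fraction of tokens flipped: {flip_rate:.2f}")
```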
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.10892
• PDF: https://arxiv.org/pdf/2506.10892
• Project Page: https://s-sahoo.com/duo/
• Github: https://github.com/s-sahoo/duo
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
ECO: Ensembling Context Optimization for Vision-Language Models
🔹 Publication Date: Published on Jul 26, 2023
🔹 Abstract:
AI-generated summary: Learning an ensemble of prompts enhances few-shot image classification using vision-language models like CLIP without increasing inference costs.
Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts for maximizing CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts considerably and consistently improves the results compared to relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
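ECO learns its prompt contexts; the sketch below only illustrates the ensembling step, averaging the embeddings of several hand-written prompts per class with the Hugging Face CLIP API (templates, class names, and the test image are placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["dog", "cat", "car"]
templates = ["a photo of a {}", "a blurry photo of a {}", "a close-up of a {}"]

with torch.no_grad():
    # Encode every (template, class) prompt and average per class -> one ensembled text embedding each.
    text_embs = []
    for c in classes:
        inputs = processor(text=[t.format(c) for t in templates], return_tensors="pt", padding=True)
        emb = model.get_text_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        text_embs.append(emb.mean(dim=0))
    text_embs = torch.stack(text_embs)
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)

    image = Image.open("example.jpg")                     # hypothetical test image
    img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    probs = (100.0 * img_emb @ text_embs.T).softmax(dim=-1)
    print(dict(zip(classes, probs.squeeze(0).tolist())))
```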
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2307.14063
• PDF: https://arxiv.org/pdf/2307.14063
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
Article Date: Mar 2025
Article Description:
We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. Prior video models already generate realistic videos, but they tend to leverage little 3D information, leading to inconsistencies, such as objects popping in and out of existence. Camera control, if implemented at all, is imprecise, because camera parameters are mere inputs to the neural network which must then infer how the video depends on the camera. In contrast, GEN3C is guided by a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Crucially, this means that GEN3C neither has to remember what it previously generated nor does it have to infer the image structure from the camera pose. The model, instead, can focus all its generative power on previously unobserved regions, as well as advancing the scene state to the next frame. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video. Results are best viewed in videos. Check out our webpage: https://research.nvidia.com/labs/toronto-ai/GEN3C/ (CVPR 2025)
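A minimal sketch of the 3D-cache idea under simple pinhole assumptions (illustrative only, not the released pipeline): lift predicted depth into a point cloud, then reproject it under a user-specified camera to condition the next frames.

```python
import numpy as np

def unproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift a depth map (H, W) into camera-space 3D points using pinhole intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # normalized camera rays
    return rays * depth.reshape(-1, 1)       # scale each ray by its per-pixel depth

def project(points: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Render 3D points into a new camera pose (R, t); returns pixel coordinates."""
    cam = points @ R.T + t
    pix = cam @ K.T
    return pix[:, :2] / pix[:, 2:3]

# Toy example: a flat depth map, then a small lateral shift of the camera.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
depth = np.full((480, 640), 2.0)             # stand-in for predicted pixel-wise depth
cloud = unproject(depth, K)
new_pix = project(cloud, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
print(new_pix.shape)                         # (H*W, 2) reprojected pixel locations
```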
PDF Download Link:
https://arxiv.org/pdf/2503.03751v1.pdf
GitHub:
• https://github.com/nv-tlabs/GEN3C
Datasets:
• Waymo Open Dataset
• Kubric
• RealEstate10K
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner
🔹 Publication Date: Published on Jun 10, 2025
🔹 Abstract:
AI-generated summary: A novel data synthesis framework, SWE-Flow, uses unit tests to automatically infer development steps and generate a structured schedule for Test-Driven Development (TDD), significantly improving the performance of open models fine-tuned on real-world projects.
We introduce SWE-Flow, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, SWE-Flow automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of SWE-Flow is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step development schedule. At each step, SWE-Flow produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the SWE-Flow-Eval benchmark. Our experiments show that fine-tuning open models on this dataset significantly improves performance in TDD-based coding. To facilitate further research, we release all code, datasets, models, and Docker images at https://github.com/Hambaobao/SWE-Flow.
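A toy sketch of the schedule-from-dependencies idea (the graph, function names, and use of networkx are assumptions for illustration, not the released implementation):

```python
import networkx as nx

# Toy Runtime Dependency Graph: an edge f -> g means "f calls g at test time",
# so g must be implemented before f in the development schedule.
rdg = nx.DiGraph()
rdg.add_edges_from([
    ("test_checkout", "checkout"),
    ("checkout", "compute_total"),
    ("checkout", "apply_discount"),
    ("apply_discount", "compute_total"),
])

# A topological order of the reversed graph gives a TDD-style schedule:
# implement callees first, then the functions (and finally the tests) that depend on them.
schedule = list(nx.topological_sort(rdg.reverse()))
print(schedule)   # e.g. ['compute_total', 'apply_discount', 'checkout', 'test_checkout']
```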
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09003
• PDF: https://arxiv.org/pdf/2506.09003
• Github: https://github.com/Hambaobao/SWE-Flow
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
🔹 Publication Date: Published on Jun 11, 2025
🔹 Abstract:
AI-generated summary: InterSyn, a large-scale dataset with tightly interleaved image-text outputs and automated quality refinement, improves multimodal understanding and generation through the SEIR method and SynJudge, an automatic evaluation tool.
Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality and instructional richness of current training datasets. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR) method. InterSyn features multi-turn, instruction-driven dialogues with tightly interleaved image-text responses, providing rich object diversity and rigorous automated quality refinement, making it well-suited for training next-generation instruction-following LMMs. Furthermore, to address the lack of reliable evaluation tools capable of assessing interleaved multimodal outputs, we introduce SynJudge, an automatic evaluation model designed to quantitatively assess multimodal outputs along four dimensions: text content, image content, image quality, and image-text synergy. Experimental studies show that the SEIR method leads to substantially higher dataset quality compared to an otherwise identical process without refinement. Moreover, LMMs trained on InterSyn achieve uniform performance gains across all evaluation metrics, confirming InterSyn's utility for advancing multimodal systems.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09427
• PDF: https://arxiv.org/pdf/2506.09427
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title: CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects
🔹 Publication Date: Published on May 27, 2025
🔹 Abstract:
AI-generated summary: A coordinated diffusion noise optimization framework improves whole-body manipulation of articulated objects by leveraging specialized diffusion models for body and hand motions and a unified basis point set representation for precise hand-object interaction.
Synthesizing whole-body manipulation of articulated objects, including body motion, hand motion, and object motion, is a critical yet challenging task with broad applications in virtual humans and robotics. The core challenges are twofold. First, achieving realistic whole-body motion requires tight coordination between the hands and the rest of the body, as their movements are interdependent during manipulation. Second, articulated object manipulation typically involves high degrees of freedom and demands higher precision, often requiring the fingers to be placed at specific regions to actuate movable parts. To address these challenges, we propose a novel coordinated diffusion noise optimization framework. Specifically, we perform noise-space optimization over three specialized diffusion models for the body, left hand, and right hand, each trained on its own motion dataset to improve generalization. Coordination naturally emerges through gradient flow along the human kinematic chain, allowing the global body posture to adapt in response to hand motion objectives with high fidelity. To further enhance precision in hand-object interaction, we adopt a unified representation based on basis point sets (BPS), where end-effector positions are encoded as distances to the same BPS used for object geometry. This unified representation captures fine-grained spatial relationships between the hand and articulated object parts, and the resulting trajectories serve as targets to guide the optimization of diffusion noise, producing highly accurate interaction motion. We conduct extensive experiments demonstrating that our method outperforms existing approaches in motion quality and physical plausibility, and enables various capabilities such as object pose control, simultaneous walking and manipulation, and whole-body generation from hand-only data.
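A minimal NumPy sketch of the unified BPS encoding described above (basis size, point counts, and coordinates are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
basis = rng.uniform(-1.0, 1.0, size=(1024, 3))      # fixed basis point set shared by all encodings

def bps_encode(points: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Encode a point set as, for each basis point, the distance to its nearest input point."""
    d = np.linalg.norm(basis[:, None, :] - points[None, :, :], axis=-1)
    return d.min(axis=1)

object_surface = rng.uniform(-0.3, 0.3, size=(2048, 3))   # stand-in for sampled object geometry
fingertip = np.array([[0.25, 0.1, 0.0]])                  # stand-in for an end-effector position

# Both the object and the hand end-effector live in the same fixed BPS feature space,
# so their encodings can be compared or concatenated as a unified representation.
obj_feat = bps_encode(object_surface, basis)    # shape (1024,)
hand_feat = bps_encode(fingertip, basis)        # shape (1024,)
print(obj_feat.shape, hand_feat.shape)
```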
🔹 Links:
- arXiv Page: https://arxiv.org/abs/2505.21437
- PDF: https://arxiv.org/pdf/2505.21437
- Project Page: https://phj128.github.io/page/CoDA/index.html
- Github: https://phj128.github.io/page/CoDA/index.html
🔹 Models citing this paper:
No models found
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
🔹 Title:
Aligning Text, Images, and 3D Structure Token-by-Token
🔹 Publication Date: Published on Jun 9, 2025
🔹 Abstract:
AI-generated summary: A unified language, image, and 3D scene model framework is proposed, achieving optimal training and performance across various 3D tasks and datasets.
Creating machines capable of understanding the world in 3D is essential in assisting designers who build and edit 3D environments and robots navigating and interacting within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes and provide a detailed ''cookbook'' outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We evaluate performance across four core 3D tasks -- rendering, recognition, instruction-following, and question-answering -- and four 3D datasets, synthetic and real-world. We extend our approach to reconstruct complex 3D object shapes by enriching our 3D modality with quantized shape encodings, and show our model's effectiveness on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.08002
• PDF: https://arxiv.org/pdf/2506.08002
• Project Page: https://glab-caltech.github.io/kyvo/
• Github: https://glab-caltech.github.io/kyvo/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs
Article Date: 23 Aug 2023
Article Description:
Despite the progress of foundation models, knowledge-based reasoning remains a persistent challenge due to their limited capacity for knowledge recall and inference. Existing methods primarily focus on encouraging these models to plan and solve problems or extensively sample reasoning chains independently. However, these methods often overlook conceptual errors and inferential fallacies, inevitably leading to a series of notorious issues such as misleading conclusions, cognitive biases, and reduced decision quality. While explicit modeling of causality is argued to hold promise in addressing these issues, contemporary research efforts have thus far fallen short in achieving causality-based foundation models. Drawing inspiration from the orchestration of diverse specialized agents collaborating to tackle intricate tasks, we propose a framework named Causal-Consistency Chain-of-Thought (CaCo-CoT) that harnesses multi-agent collaboration to bolster the faithfulness and causality of foundation models, involving a set of reasoners and evaluators. These agents collaboratively work within a reasoning-and-consensus paradigm to improve faithfulness. The reasoners are tasked with generating reasoning chains for knowledge-intensive problems by mimicking human causal reasoning. Meanwhile, the evaluator scrutinizes the causal consistency of a reasoner's reasoning chain from a non-causal and a counterfactual perspective. Our framework demonstrates significant superiority over state-of-the-art methods through extensive and comprehensive evaluations across text-based and multi-modal knowledge reasoning tasks (e.g., science question answering and commonsense reasoning).
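A skeletal sketch of a reasoning-and-consensus loop in this spirit (the `call_llm` client and the prompts are hypothetical placeholders, not the paper's implementation):

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def caco_cot(question: str, n_reasoners: int = 3, max_rounds: int = 2) -> str:
    """Sample reasoner chains, have an evaluator check their causal consistency,
    and return an answer only from a chain that passes the check."""
    answer = ""
    for _ in range(max_rounds):
        chains = [call_llm(f"Answer step by step, making the causal chain explicit:\n{question}")
                  for _ in range(n_reasoners)]
        verdicts = [call_llm("Check this reasoning for non-causal leaps and counterfactual "
                             f"inconsistencies. Reply PASS or FAIL with a reason:\n{c}")
                    for c in chains]
        passed = [c for c, v in zip(chains, verdicts) if v.strip().upper().startswith("PASS")]
        if passed:   # consensus reached: use the first causally consistent chain
            answer = call_llm(f"Extract the final answer from:\n{passed[0]}")
            break
    return answer
```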
PDF Download Link:
https://arxiv.org/pdf/2308.11914v4.pdf
GitHub:
• https://github.com/hcplab-sysu/causal-vlreasoning
• https://github.com/hcplab-sysu/causalvlr
Datasets:
• BoolQ
• ScienceQA
• Com2Sense
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos
Article Date: 5 Jun 2025
Article Description:
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks that focus on the temporal aspect link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 25.82.
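A minimal PyTorch sketch of the second-stage contrastive alignment (the additive fusion of query-video and modification-text embeddings is an assumption for illustration; the paper's fusion may differ):

```python
import torch
import torch.nn.functional as F

def covr_contrastive_loss(query_vid: torch.Tensor,
                          mod_text: torch.Tensor,
                          target_vid: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch: fuse (query video, modification text) and pull the result
    toward its paired target-video embedding, pushing away the other targets in the batch."""
    composed = F.normalize(query_vid + mod_text, dim=-1)   # simple additive fusion (assumed)
    targets = F.normalize(target_vid, dim=-1)
    logits = composed @ targets.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy batch of 8 pre-extracted embeddings of dimension 512.
q, t, y = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
print(covr_contrastive_loss(q, t, y).item())
```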
PDF Download Link:
https://arxiv.org/pdf/2506.05274v1.pdf
GitHub:
• https://github.com/ucf-crcv/tf-covr
Datasets:
• Fashion IQ
• FineGym
• CIRCO
• FineDiving
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
Multiple Object Stitching for Unsupervised Representation Learning
Article Date: 9 Jun 2025
Article Description:
Contrastive learning for single object centric images has achieved remarkable progress on unsupervised representation, but suffers inferior performance on the widespread images that contain multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine the unsupervised representation for multi-object images. Specifically, we construct the multi-object images by stitching the single object centric ones, where the objects in the synthesized multi-object images are predetermined. Hence, compared to the existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representations of each object in multi-object images, thus providing more detailed representations for complicated downstream tasks, such as object detection and semantic segmentation. Experimental results on ImageNet, CIFAR and COCO datasets demonstrate that our proposed method achieves the leading unsupervised representation performance on both single object centric images and multi-object ones. The source code is available at https://github.com/visresearch/MultipleObjectStitching.
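A minimal sketch of the stitching step (grid layout and tensor shapes are illustrative; the paper's construction may differ):

```python
import torch

def stitch_grid(images: torch.Tensor, grid: int = 2) -> torch.Tensor:
    """Stitch grid*grid single-object images (N, C, H, W) into one multi-object image.
    Object identities and positions are known by construction, giving free correspondences."""
    n, c, h, w = images.shape
    assert n == grid * grid
    rows = [torch.cat(list(images[i * grid:(i + 1) * grid]), dim=-1) for i in range(grid)]
    return torch.cat(rows, dim=-2)   # (C, grid*H, grid*W)

# Toy example: four 3x64x64 object-centric crops -> one 3x128x128 multi-object image.
singles = torch.rand(4, 3, 64, 64)
multi = stitch_grid(singles)
print(multi.shape)   # torch.Size([3, 128, 128])
```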
PDF Download Link:
https://arxiv.org/pdf/2506.07364v1.pdf
GitHub:
• https://github.com/visresearch/MultipleObjectStitching
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
HtFLlib: A Comprehensive Heterogeneous Federated Learning Library and Benchmark
Article Date: 4 Jun 2025
Article Description:
As AI evolves, collaboration among heterogeneous models helps overcome data scarcity by enabling knowledge transfer across institutions and devices. Traditional Federated Learning (FL) only supports homogeneous models, limiting collaboration among clients with heterogeneous model architectures. To address this, Heterogeneous Federated Learning (HtFL) methods are developed to enable collaboration across diverse heterogeneous models while tackling the data heterogeneity issue at the same time. However, a comprehensive benchmark for standardized evaluation and analysis of the rapidly growing HtFL methods is lacking. Firstly, the highly varied datasets, model heterogeneity scenarios, and different method implementations become hurdles to making easy and fair comparisons among HtFL methods. Secondly, the effectiveness and robustness of HtFL methods are under-explored in various scenarios, such as the medical domain and sensor signal modality. To fill this gap, we introduce the first Heterogeneous Federated Learning Library (HtFLlib), an easy-to-use and extensible framework that integrates multiple datasets and model heterogeneity scenarios, offering a robust benchmark for research and practical applications. Specifically, HtFLlib integrates (1) 12 datasets spanning various domains, modalities, and data heterogeneity scenarios; (2) 40 model architectures, ranging from small to large, across three modalities; (3) a modularized and easy-to-extend HtFL codebase with implementations of 10 representative HtFL methods; and (4) systematic evaluations in terms of accuracy, convergence, computation costs, and communication costs. We emphasize the advantages and potential of state-of-the-art HtFL methods and hope that HtFLlib will catalyze advancing HtFL research and enable its broader applications. The code is released at https://github.com/TsingZ0/HtFLlib.
PDF Download Link:
https://arxiv.org/pdf/2506.03954v1.pdf
GitHub:
• https://github.com/tsingz0/htfllib
• https://github.com/TsingZ0/GFL
• https://github.com/TsingZ0/HtFL
Datasets:
• CIFAR-10
• CIFAR-100
• Oxford 102 Flower
• AG News
• DomainNet
• PAMAP2
• COVIDx
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
🔹 Publication Date: Published on Jun 12, 2025
🔹 Abstract:
AI-generated summary: VRBench is a long narrative video benchmark designed to evaluate models' multi-step reasoning and procedural validity through human-labeled question-answering pairs and a human-AI collaborative framework with a multi-phase evaluation pipeline.
We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.10857
• PDF: https://arxiv.org/pdf/2506.10857
• Project Page: https://vrbench.github.io/
• Github: https://github.com/OpenGVLab/VRBench
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
VLMs Can Aggregate Scattered Training Patches
Article Date: 4 Jun 2025
Article Description:
One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples in their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches, scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or text references. For instance, if trained on image patches from a bloody scene paired with the description "safe," VLMs may later describe the full image, or a text reference to the scene, as "safe." We define the core ability of VLMs enabling this attack as visual stitching -- the ability to integrate visual information spread across multiple training samples that share the same textual descriptions. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each (image, ID) pair into {(patch, ID)} pairs at different granularity for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text references. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like "safe" or "unsafe", demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at https://github.com/ZHZisZZ/visual-stitching.
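A minimal sketch of how an (image, ID) pair can be split into (patch, ID) finetuning pairs, as in the probe described above (the file path and ID string are placeholders):

```python
from PIL import Image

def make_patch_pairs(image_path: str, image_id: str, grid: int = 4):
    """Split one (image, ID) pair into grid*grid (patch, ID) pairs; every benign-looking
    patch shares the same textual label, which is what enables visual stitching."""
    img = Image.open(image_path)
    w, h = img.size
    pw, ph = w // grid, h // grid
    pairs = []
    for i in range(grid):
        for j in range(grid):
            patch = img.crop((j * pw, i * ph, (j + 1) * pw, (i + 1) * ph))
            pairs.append((patch, image_id))
    return pairs

pairs = make_patch_pairs("example_image.png", image_id="ID_4271")
print(len(pairs))   # 16 (patch, ID) finetuning samples from a single source image
```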
PDF Download Link:
https://arxiv.org/pdf/2506.03614v1.pdf
GitHub:
• https://github.com/zhziszz/visual-stitching
Datasets:
• No datasets information available
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
Efficient Medical VIE via Reinforcement Learning
🔹 Publication Date: Published on Jun 16
🔹 Abstract:
An RLVR framework using fine-tuned Qwen2.5-VL-7B achieves state-of-the-art performance in medical VIE with limited annotated samples, enhancing reasoning and the balance between precision and recall. AI-generated summary: Visual Information Extraction (VIE) converts unstructured document images into structured formats like JSON, which is critical for medical applications such as report analysis and online consultations. Traditional methods rely on OCR and language models, while end-to-end multimodal models offer direct JSON generation. However, domain-specific schemas and high annotation costs limit their effectiveness in medical VIE. We base our approach on the Reinforcement Learning with Verifiable Rewards (RLVR) framework to address these challenges using only 100 annotated samples. Our approach ensures dataset diversity, uses a balanced precision-recall reward mechanism to reduce hallucinations and improve field coverage, and applies innovative sampling strategies to enhance reasoning capabilities. Fine-tuning Qwen2.5-VL-7B with our RLVR method, we achieve state-of-the-art performance on medical VIE tasks, significantly improving F1, precision, and recall. While our models excel on tasks similar to medical datasets, performance drops on dissimilar tasks, highlighting the need for domain-specific optimization. Case studies further demonstrate the value of reasoning during training and inference for VIE.
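A minimal sketch of what a balanced precision-recall reward over extracted JSON fields might look like, using the harmonic mean so that hallucinated fields and missing fields are both penalized; the exact matching rules and weighting used in the paper may differ.
```python
# Illustrative sketch of a balanced precision-recall reward for VIE-style
# JSON extraction; the paper's exact reward design is not reproduced here.
def field_reward(pred: dict, gold: dict) -> float:
    """Reward in [0, 1]: harmonic mean of field-level precision and recall.
    Precision penalizes hallucinated or wrong fields; recall penalizes
    missing fields, so neither can be gamed in isolation."""
    if not pred and not gold:
        return 1.0
    correct = sum(1 for k, v in pred.items() if k in gold and gold[k] == v)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    gold = {"patient_name": "A. Chen", "diagnosis": "pneumonia", "date": "2024-11-02"}
    # One hallucinated field ("ward") and one missing field ("date").
    pred = {"patient_name": "A. Chen", "diagnosis": "pneumonia", "ward": "3F"}
    print(round(field_reward(pred, gold), 3))  # 0.667
```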
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.13363
• PDF: https://arxiv.org/pdf/2506.13363
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
Article Title:
AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora
Article Date: 29 May 2025
Article Description:
We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.
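A rough sketch of the two LLM-driven stages described above, triple extraction followed by conceptualization of the extracted instances into broader schema categories; call_llm is a placeholder that returns canned output here, and the actual prompts and parsing live in the linked repository.
```python
# Very rough sketch of schema-free KG construction in the spirit of the
# paper: (1) prompt an LLM to extract triples from raw text, (2) ask it to
# conceptualize the extracted entities into broad semantic categories.
# `call_llm` is a stub, not a real API client.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns canned JSON for the demo."""
    if "Extract triples" in prompt:
        return json.dumps([["Marie Curie", "won", "Nobel Prize in Physics"]])
    return json.dumps({"Marie Curie": "Scientist", "Nobel Prize in Physics": "Award"})

def extract_triples(text: str):
    out = call_llm(f"Extract triples (subject, relation, object) as JSON from:\n{text}")
    return [tuple(t) for t in json.loads(out)]

def conceptualize(entities):
    out = call_llm("Assign a broad semantic category to each entity as JSON: "
                   + ", ".join(sorted(entities)))
    return json.loads(out)

if __name__ == "__main__":
    text = "Marie Curie won the Nobel Prize in Physics in 1903."
    triples = extract_triples(text)
    schema = conceptualize({e for s, _, o in triples for e in (s, o)})
    print(triples)   # extracted (subject, relation, object) triples
    print(schema)    # induced instance -> category mapping
```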
PDF Download Link:
https://arxiv.org/pdf/2505.23628v1.pdf
GitHub:
• https://github.com/hkust-knowcomp/autoschemakg
Datasets:
• MML
• MMLU
• HotpotQA
• YAGO
• WikiHow
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
🔹 Title:
DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
🔹 Publication Date: Published on Jun 12
🔹 Abstract:
A Diffusion Transformer-based framework generates high-fidelity human-product demonstration videos by preserving identities and spatial relationships, using masked cross-attention and structured text encoding. AI-generated summary: In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.
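A minimal PyTorch sketch of the masked cross-attention idea: latent video tokens attend to concatenated human/product reference tokens under a mask that could be derived from the body mesh and product bounding boxes. The shapes, the toy masking rule, and the residual injection are illustrative assumptions, not the paper's exact architecture.
```python
# Minimal masked cross-attention sketch: latent tokens (queries) attend to
# human + product reference tokens (keys/values); a boolean mask restricts
# which latent tokens may see which reference tokens. Illustrative only.
import torch
import torch.nn as nn

class MaskedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, latents, ref_tokens, attend_mask):
        """
        latents:     (B, N, D) video/latent tokens (queries)
        ref_tokens:  (B, M, D) concatenated human + product reference tokens
        attend_mask: (B, N, M) bool, True where attention is DISALLOWED
        """
        # MultiheadAttention expects a (B*num_heads, N, M) mask when 3D.
        mask = attend_mask.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(latents, ref_tokens, ref_tokens, attn_mask=mask)
        return latents + out  # residual injection of reference information

if __name__ == "__main__":
    B, N, M, D = 2, 16, 8, 64
    layer = MaskedCrossAttention(D)
    latents = torch.randn(B, N, D)
    refs = torch.randn(B, M, D)  # e.g. first half human tokens, second half product tokens
    mask = torch.zeros(B, N, M, dtype=torch.bool)
    mask[:, : N // 2, M // 2:] = True  # toy rule: first half of latents ignores product tokens
    print(layer(latents, refs, mask).shape)  # torch.Size([2, 16, 64])
```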
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.10568
• PDF: https://arxiv.org/pdf/2506.10568
• Project Page: https://submit2025-dream.github.io/DreamActor-H1/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT