🔹 Title:
A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
🔹 Publication Date: Published on Jun 11
🔹 Abstract:
AI-generated summary: InterSyn, a large-scale dataset with tightly interleaved image-text outputs and automated quality refinement, improves multimodal understanding and generation through the SEIR method and SynJudge, an automatic evaluation tool.
Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR) method. InterSyn features multi-turn, instruction-driven dialogues with tightly interleaved image-text responses, providing rich object diversity and rigorous automated quality refinement, making it well-suited for training next-generation instruction-following LMMs. Furthermore, to address the lack of reliable evaluation tools capable of assessing interleaved multimodal outputs, we introduce SynJudge, an automatic evaluation model designed to quantitatively assess multimodal outputs along four dimensions: text content, image content, image quality, and image-text synergy. Experimental studies show that the SEIR method leads to substantially higher dataset quality compared to an otherwise identical process without refinement. Moreover, LMMs trained on InterSyn achieve uniform performance gains across all evaluation metrics, confirming InterSyn's utility for advancing multimodal systems.
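Since SynJudge scores each output along four separate dimensions, a natural follow-up question is how those scores combine into a single number. The sketch below is a minimal, assumed aggregation (0-10 scale per dimension, uniform weights), not the paper's actual scoring procedure.
```python
from dataclasses import dataclass

@dataclass
class SynJudgeScores:
    # Assumed 0-10 scale per dimension; the paper defines the dimensions,
    # but the scale and weights here are illustrative assumptions.
    text_content: float
    image_content: float
    image_quality: float
    image_text_synergy: float

def overall_score(s: SynJudgeScores,
                  weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted mean of the four SynJudge dimensions, normalized to [0, 1]."""
    dims = (s.text_content, s.image_content, s.image_quality, s.image_text_synergy)
    total = sum(w * d for w, d in zip(weights, dims))
    return total / (10.0 * sum(weights))

# Example: a response with strong text but weak image-text synergy.
print(overall_score(SynJudgeScores(9.0, 8.0, 7.5, 4.0)))  # 0.7125
```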
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.09427
• PDF: https://arxiv.org/pdf/2506.09427
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
🔹 Title: CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects
🔹 Publication Date:
Published on May 27
🔹 Abstract:
AI-generated summary: A coordinated diffusion noise optimization framework improves whole-body manipulation of articulated objects by leveraging specialized diffusion models for body and hand motions and a unified basis point set representation for precise hand-object interaction.
Synthesizing whole-body manipulation of articulated objects, including body motion, hand motion, and object motion, is a critical yet challenging task with broad applications in virtual humans and robotics. The core challenges are twofold. First, achieving realistic whole-body motion requires tight coordination between the hands and the rest of the body, as their movements are interdependent during manipulation. Second, articulated object manipulation typically involves high degrees of freedom and demands higher precision, often requiring the fingers to be placed at specific regions to actuate movable parts. To address these challenges, we propose a novel coordinated diffusion noise optimization framework. Specifically, we perform noise-space optimization over three specialized diffusion models for the body, left hand, and right hand, each trained on its own motion dataset to improve generalization. Coordination naturally emerges through gradient flow along the human kinematic chain, allowing the global body posture to adapt in response to hand motion objectives with high fidelity. To further enhance precision in hand-object interaction, we adopt a unified representation based on basis point sets (BPS), where end-effector positions are encoded as distances to the same BPS used for object geometry. This unified representation captures fine-grained spatial relationships between the hand and articulated object parts, and the resulting trajectories serve as targets to guide the optimization of diffusion noise, producing highly accurate interaction motion. We conduct extensive experiments demonstrating that our method outperforms existing approaches in motion quality and physical plausibility, and enables various capabilities such as object pose control, simultaneous walking and manipulation, and whole-body generation from hand-only data.
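The basis point set (BPS) encoding mentioned in the abstract is easy to illustrate: a fixed set of basis points encodes object geometry as nearest-surface distances, and an end-effector position is encoded as its distances to the same basis points. A conceptual NumPy sketch under assumed sizes, not the paper's code:
```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed basis point set shared by the object and the end-effectors (assumed 256 points).
basis = rng.uniform(-1.0, 1.0, size=(256, 3))

def bps_encode_object(surface_points: np.ndarray) -> np.ndarray:
    """For each basis point, distance to the nearest object surface point."""
    d = np.linalg.norm(basis[:, None, :] - surface_points[None, :, :], axis=-1)
    return d.min(axis=1)                                   # shape (256,)

def bps_encode_effector(position: np.ndarray) -> np.ndarray:
    """End-effector position encoded as distances to the same basis points."""
    return np.linalg.norm(basis - position[None, :], axis=-1)  # shape (256,)

object_surface = rng.uniform(-0.3, 0.3, size=(5000, 3))    # toy point cloud
fingertip = np.array([0.25, 0.1, -0.05])

obj_feat = bps_encode_object(object_surface)
hand_feat = bps_encode_effector(fingertip)
print(obj_feat.shape, hand_feat.shape)                     # (256,) (256,)
```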
🔹 Links:
- arXiv Page: https://arxiv.org/abs/2505.21437
- PDF: https://arxiv.org/pdf/2505.21437
- Project Page: https://phj128.github.io/page/CoDA/index.html
- Github: https://phj128.github.io/page/CoDA/index.html
🔹 Models citing this paper:
No models found
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
🔹 Title:
Aligning Text, Images, and 3D Structure Token-by-Token
🔹 Publication Date: Published on Jun 9
🔹 Abstract:
AI-generated summary: A unified language, image, and 3D scene model framework is proposed, achieving optimal training and performance across various 3D tasks and datasets.
Creating machines capable of understanding the world in 3D is essential in assisting designers who build and edit 3D environments and robots navigating and interacting within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We evaluate performance across four core 3D tasks -- rendering, recognition, instruction-following, and question-answering -- and four 3D datasets, synthetic and real-world. We extend our approach to reconstruct complex 3D object shapes by enriching our 3D modality with quantized shape encodings, and show our model's effectiveness on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/
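To make "autoregressive models for structured 3D scenes" concrete, the sketch below serializes a toy object list into discrete tokens a language model could consume; the vocabulary, quantization grid, and field order are illustrative assumptions rather than the paper's actual format.
```python
NUM_BINS = 128            # assumed quantization resolution for coordinates
COORD_RANGE = (-4.0, 4.0)

def quantize(x: float) -> int:
    lo, hi = COORD_RANGE
    x = min(max(x, lo), hi)
    return int(round((x - lo) / (hi - lo) * (NUM_BINS - 1)))

def scene_to_tokens(objects):
    """Serialize [{'cls': str, 'pos': (x, y, z), 'size': s}, ...] into text tokens."""
    tokens = ["<scene>"]
    for obj in objects:
        tokens.append(f"<obj:{obj['cls']}>")
        tokens.extend(f"<p{quantize(c)}>" for c in obj["pos"])
        tokens.append(f"<s{quantize(obj['size'])}>")
    tokens.append("</scene>")
    return tokens

scene = [
    {"cls": "chair", "pos": (0.5, 0.0, -1.2), "size": 0.8},
    {"cls": "table", "pos": (1.4, 0.0, -0.7), "size": 1.5},
]
print(" ".join(scene_to_tokens(scene)))
```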
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.08002
• PDF: https://arxiv.org/pdf/2506.08002
• Project Page: https://glab-caltech.github.io/kyvo/
• Github: https://glab-caltech.github.io/kyvo/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs
Article Date: 23 Aug 2023
Article Description:
Despite the progress of foundation models, knowledge-based reasoning remains a persistent challenge due to their limited capacity for knowledge recall and inference. Existing methods primarily focus on encouraging these models to plan and solve problems or extensively sample reasoning chains independently. However, these methods often overlook conceptual errors and inferential fallacies, inevitably leading to a series of notorious issues such as misleading conclusions, cognitive biases, and reduced decision quality. While explicit modeling of causality is argued to hold promise in addressing these issues, contemporary research efforts have thus far fallen short in achieving causality-based foundation models. Drawing inspiration from the orchestration of diverse specialized agents collaborating to tackle intricate tasks, we propose a framework named Causal-Consistency Chain-of-Thought (CaCo-CoT) that harnesses multi-agent collaboration to bolster the faithfulness and causality of foundation models, involving a set of reasoners and evaluators. These agents collaboratively work within a reasoning-and-consensus paradigm to improve faithfulness. The reasoners are tasked with generating reasoning chains for knowledge-intensive problems by mimicking human causal reasoning. Meanwhile, the evaluator scrutinizes the causal consistency of a reasoner's reasoning chain from a non-causal and a counterfactual perspective. Our framework demonstrates significant superiority over state-of-the-art methods through extensive and comprehensive evaluations across text-based and multi-modal knowledge reasoning tasks (e.g., science question answering and commonsense reasoning).
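The reasoning-and-consensus paradigm can be sketched as a small control loop: several reasoner calls produce candidate chains, an evaluator checks each chain's causal consistency (including a counterfactual reading), and the answer backed by the most consistent chains wins. The call_llm interface and prompts below are placeholders, not the released CaCo-CoT code.
```python
from collections import Counter
from typing import Callable, List, Tuple

def caco_cot(question: str,
             call_llm: Callable[[str], str],     # placeholder LLM interface
             n_reasoners: int = 3) -> str:
    """Toy reasoning-and-consensus loop in the spirit of CaCo-CoT."""
    chains: List[Tuple[str, str]] = []
    for _ in range(n_reasoners):
        chain = call_llm(f"Reason step by step, mimicking causal reasoning:\n{question}")
        answer = call_llm(f"Given this reasoning, state only the final answer:\n{chain}")
        chains.append((chain, answer.strip()))

    votes = Counter()
    for chain, answer in chains:
        # Evaluator checks causal consistency from non-causal and counterfactual views.
        verdict = call_llm(
            "Does this reasoning stay causally consistent with the question, "
            "including under a counterfactual reading? Reply CONSISTENT or INCONSISTENT.\n"
            f"Question: {question}\nReasoning: {chain}"
        )
        if "INCONSISTENT" not in verdict.upper():
            votes[answer] += 1

    return votes.most_common(1)[0][0] if votes else chains[0][1]
```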
PDF Download Link:
https://arxiv.org/pdf/2308.11914v4.pdf
GitHub:
• https://github.com/hcplab-sysu/causal-vlreasoning
• https://github.com/hcplab-sysu/causalvlr
Datasets:
• BoolQ
• ScienceQA
• Com2Sense
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos
Article Date: 5 Jun 2025
Article Description:
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks focusing on the temporal aspect link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state of the art from 19.83 to 25.82.
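The second training stage, aligning the composed <query video, modification text> embedding with target-video embeddings, reduces to a standard InfoNCE objective over a batch. The sketch below assumes precomputed embeddings and a simple additive fusion; it is a generic formulation, not the TF-CoVR-Base implementation.
```python
import torch
import torch.nn.functional as F

def composed_retrieval_infonce(query_vid: torch.Tensor,
                               mod_text: torch.Tensor,
                               target_vid: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE between composed queries and their target videos.

    query_vid, mod_text, target_vid: (B, D) embeddings; row i of target_vid is
    treated as the positive for row i of the composed query (a simplification,
    since TF-CoVR allows multiple valid targets per query).
    """
    composed = F.normalize(query_vid + mod_text, dim=-1)   # assumed fusion: additive
    targets = F.normalize(target_vid, dim=-1)
    logits = composed @ targets.t() / temperature           # (B, B) similarity matrix
    labels = torch.arange(composed.size(0), device=composed.device)
    return F.cross_entropy(logits, labels)

loss = composed_retrieval_infonce(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```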
PDF Download Link:
https://arxiv.org/pdf/2506.05274v1.pdf
GitHub:
• https://github.com/ucf-crcv/tf-covr
Datasets:
• Fashion IQ
• FineGym
• CIRCO
• FineDiving
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
Multiple Object Stitching for Unsupervised Representation Learning
Article Date: 9 Jun 2025
Article Description:
Contrastive learning for single-object-centric images has achieved remarkable progress on unsupervised representation, but suffers from inferior performance on the widespread images with multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine the unsupervised representation for multi-object images. Specifically, we construct the multi-object images by stitching the single-object-centric ones, where the objects in the synthesized multi-object images are predetermined. Hence, compared to the existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representations of each object in a multi-object image, thus providing more detailed representations for complicated downstream tasks, such as object detection and semantic segmentation. Experimental results on ImageNet, CIFAR and COCO datasets demonstrate that our proposed method achieves the leading unsupervised representation performance on both single-object-centric images and multi-object ones. The source code is available at https://github.com/visresearch/MultipleObjectStitching.
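The stitching operation itself is simple to reproduce: tile several single-object-centric images into one canvas and record which cell each object occupies, so object-level correspondences come for free. A minimal NumPy sketch under assumed image sizes, not the released code:
```python
import numpy as np

def stitch_images(images, grid: int = 2):
    """Tile grid*grid single-object images (H, W, 3) into one multi-object image.

    Returns the stitched image and, for each source image, the (row, col) cell
    it occupies; these placements are the free object correspondences used as
    supervision between stitched views.
    """
    assert len(images) == grid * grid
    h, w, _ = images[0].shape
    canvas = np.zeros((grid * h, grid * w, 3), dtype=images[0].dtype)
    placements = []
    for idx, img in enumerate(images):
        r, c = divmod(idx, grid)
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = img
        placements.append((r, c))
    return canvas, placements

singles = [np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8) for _ in range(4)]
multi, cells = stitch_images(singles)
print(multi.shape, cells)   # (128, 128, 3) [(0, 0), (0, 1), (1, 0), (1, 1)]
```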
PDF Download Link:
https://arxiv.org/pdf/2506.07364v1.pdf
GitHub:
• https://github.com/visresearch/MultipleObjectStitching
Datasets:
• No datasets information available
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
HtFLlib: A Comprehensive Heterogeneous Federated Learning Library and Benchmark
Article Date: 4 Jun 2025
Article Description:
As AI evolves, collaboration among heterogeneous models helps overcome data scarcity by enabling knowledge transfer across institutions and devices. Traditional Federated Learning (FL) only supports homogeneous models, limiting collaboration among clients with heterogeneous model architectures. To address this, Heterogeneous Federated Learning (HtFL) methods are developed to enable collaboration across diverse heterogeneous models while tackling the data heterogeneity issue at the same time. However, a comprehensive benchmark for standardized evaluation and analysis of the rapidly growing HtFL methods is lacking. Firstly, the highly varied datasets, model heterogeneity scenarios, and different method implementations become hurdles to making easy and fair comparisons among HtFL methods. Secondly, the effectiveness and robustness of HtFL methods are under-explored in various scenarios, such as the medical domain and sensor signal modality. To fill this gap, we introduce the first Heterogeneous Federated Learning Library (HtFLlib), an easy-to-use and extensible framework that integrates multiple datasets and model heterogeneity scenarios, offering a robust benchmark for research and practical applications. Specifically, HtFLlib integrates (1) 12 datasets spanning various domains, modalities, and data heterogeneity scenarios; (2) 40 model architectures, ranging from small to large, across three modalities; (3) a modularized and easy-to-extend HtFL codebase with implementations of 10 representative HtFL methods; and (4) systematic evaluations in terms of accuracy, convergence, computation costs, and communication costs. We emphasize the advantages and potential of state-of-the-art HtFL methods and hope that HtFLlib will catalyze advancing HtFL research and enable its broader applications. The code is released at https://github.com/TsingZ0/HtFLlib.
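To give a flavor of what heterogeneous collaboration can look like, the sketch below shows one common HtFL pattern: clients with different architectures share class prototypes (per-class mean features) instead of model weights. This is a generic illustration of the idea, not HtFLlib's API.
```python
import numpy as np

def client_prototypes(features: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Per-class mean feature vectors computed locally on one client."""
    dim = features.shape[1]
    protos = np.zeros((num_classes, dim))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(axis=0)
    return protos

def aggregate_prototypes(client_protos) -> np.ndarray:
    """Server-side aggregation: average prototypes across clients."""
    return np.mean(np.stack(client_protos), axis=0)

# Two clients with different backbones but a shared 128-d feature space (an assumption).
c1 = client_prototypes(np.random.randn(200, 128), np.random.randint(0, 10, 200), 10)
c2 = client_prototypes(np.random.randn(150, 128), np.random.randint(0, 10, 150), 10)
global_protos = aggregate_prototypes([c1, c2])
print(global_protos.shape)  # (10, 128)
```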
PDF Download Link:
https://arxiv.org/pdf/2506.03954v1.pdf
GitHub:
• https://github.com/tsingz0/htfllib
• https://github.com/TsingZ0/GFL
• https://github.com/TsingZ0/HtFL
Datasets:
• CIFAR-10
• CIFAR-100
• Oxford 102 Flower
• AG News
• DomainNet
• PAMAP2
• COVIDx
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
🔹 Title:
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
🔹 Publication Date: Published on Jun 12
🔹 Abstract:
AI-generated summary: VRBench is a long narrative video benchmark designed to evaluate models' multi-step reasoning and procedural validity through human-labeled question-answering pairs and a human-AI collaborative framework with a multi-phase evaluation pipeline.
We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.
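A process-level metric of the kind described can be approximated by checking how many reference reasoning steps a model's chain covers. The sketch below uses crude token overlap as a stand-in for the paper's LLM-guided progress-level scoring, purely to make the idea concrete.
```python
def step_coverage(model_steps, reference_steps, threshold: float = 0.5) -> float:
    """Fraction of reference steps matched by at least one model step.

    Token overlap is a toy proxy for VRBench's LLM-guided scoring.
    """
    def overlap(a: str, b: str) -> float:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(len(tb), 1)

    matched = sum(
        any(overlap(m, ref) >= threshold for m in model_steps)
        for ref in reference_steps
    )
    return matched / max(len(reference_steps), 1)

refs = ["the letter is hidden in the drawer", "Anna reads the letter at night"]
pred = ["Anna finds a letter inside the drawer", "she reads it later that night"]
print(step_coverage(pred, refs))  # 1.0 for this toy example
```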
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.10857
• PDF: https://arxiv.org/pdf/2506.10857
• Project Page: https://vrbench.github.io/
• Github: https://github.com/OpenGVLab/VRBench
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
VLMs Can Aggregate Scattered Training Patches
Article Date: 4 Jun 2025
Article Description:
One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples in their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches, scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or text references. For instance, if trained on image patches from a bloody scene paired with the descriptions "safe," VLMs may later describe the full image, or a text reference to the scene, as "safe." We define the core ability of VLMs enabling this attack as $\textit{visual stitching}$ -- the ability to integrate visual information spread across multiple training samples that share the same textual descriptions. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each $(\texttt{image}, \texttt{ID})$ pair into $\{(\texttt{patch}, \texttt{ID})\}$ pairs at different granularity for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text references. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like "safe" or "unsafe", demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at https://github.com/ZHZisZZ/visual-stitching.
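The core data operation in the study, splitting an (image, ID) pair into (patch, ID) pairs at a chosen granularity, is easy to reproduce. The sketch below is a generic version of that splitting step (not the repository's code), with a made-up synthetic ID for illustration.
```python
import numpy as np

def split_into_patches(image: np.ndarray, image_id: str, grid: int):
    """Split an (H, W, C) image into grid*grid patches, each paired with the same ID.

    Mirrors the (image, ID) -> {(patch, ID)} construction used to probe visual stitching.
    """
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    pairs = []
    for r in range(grid):
        for c in range(grid):
            patch = image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            pairs.append((patch, image_id))
    return pairs

img = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
pairs = split_into_patches(img, image_id="ID_00417", grid=4)   # "ID_00417" is a toy ID
print(len(pairs), pairs[0][0].shape)                           # 16 (56, 56, 3)
```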
PDF Download Link:
https://arxiv.org/pdf/2506.03614v1.pdf
GitHub:
• https://github.com/zhziszz/visual-stitching
Datasets:
• No datasets information available
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
🔹 Title:
Efficient Medical VIE via Reinforcement Learning
🔹 Publication Date: Published on Jun 16
🔹 Abstract:
AI-generated summary: An RLVR framework using fine-tuned Qwen2.5-VL-7B achieves state-of-the-art performance in medical VIE with limited annotated samples, enhancing reasoning and the balance between precision and recall.
Visual Information Extraction (VIE) converts unstructured document images into structured formats like JSON, critical for medical applications such as report analysis and online consultations. Traditional methods rely on OCR and language models, while end-to-end multimodal models offer direct JSON generation. However, domain-specific schemas and high annotation costs limit their effectiveness in medical VIE. We base our approach on the Reinforcement Learning with Verifiable Rewards (RLVR) framework to address these challenges using only 100 annotated samples. Our approach ensures dataset diversity, a balanced precision-recall reward mechanism to reduce hallucinations and improve field coverage, and innovative sampling strategies to enhance reasoning capabilities. Fine-tuning Qwen2.5-VL-7B with our RLVR method, we achieve state-of-the-art performance on medical VIE tasks, significantly improving F1, precision, and recall. While our models excel on tasks similar to medical datasets, performance drops on dissimilar tasks, highlighting the need for domain-specific optimization. Case studies further demonstrate the value of reasoning during training and inference for VIE.
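The balanced precision-recall reward can be pictured as a field-level F1 between the JSON a model emits and the annotated JSON: precision penalizes hallucinated fields, recall penalizes missing ones. The sketch below is one plausible reading, not the authors' exact reward.
```python
def field_f1_reward(predicted: dict, gold: dict) -> float:
    """Field-level F1 between predicted and gold key-value pairs.

    Precision penalizes hallucinated fields, recall penalizes missing ones,
    so the reward balances the two as described in the summary.
    """
    pred_items = {(k, str(v).strip()) for k, v in predicted.items()}
    gold_items = {(k, str(v).strip()) for k, v in gold.items()}
    if not pred_items or not gold_items:
        return 0.0
    tp = len(pred_items & gold_items)
    precision = tp / len(pred_items)
    recall = tp / len(gold_items)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

# Toy example with hypothetical report fields.
gold = {"patient_name": "Li Wei", "hemoglobin": "13.2 g/dL", "date": "2024-03-01"}
pred = {"patient_name": "Li Wei", "hemoglobin": "13.2 g/dL", "doctor": "Dr. Chen"}
print(field_f1_reward(pred, gold))  # 0.666...
```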
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.13363
• PDF: https://arxiv.org/pdf/2506.13363
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora
Article Date: 29 May 2025
Article Description:
We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.
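At the data-structure level, the pipeline amounts to extracting (head, relation, tail) triples and grouping entities under induced concepts. The sketch below shows that shape with placeholder extraction and conceptualization callables; it illustrates the idea rather than the AutoSchemaKG code.
```python
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable, Iterable, List

@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str

def build_graph(documents: Iterable[str],
                extract_triples: Callable[[str], List[Triple]],  # placeholder LLM extractor
                conceptualize: Callable[[str], str]):            # placeholder: entity -> concept
    """Extract triples from text and induce a schema by grouping entities into concepts."""
    schema = defaultdict(set)
    for doc in documents:
        for t in extract_triples(doc):
            for entity in (t.head, t.tail):
                schema[conceptualize(entity)].add(entity)
    return schema

# Toy stand-ins to show the shape of the outputs.
demo_extract = lambda doc: [Triple("Marie Curie", "awarded", "Nobel Prize in Physics")]
demo_concept = lambda e: "Person" if e == "Marie Curie" else "Award"
print(dict(build_graph(["..."], demo_extract, demo_concept)))
```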
PDF Download Link:
https://arxiv.org/pdf/2505.23628v1.pdf
GitHub:
• https://github.com/hkust-knowcomp/autoschemakg
Datasets:
• MML
• MMLU
• HotpotQA
• YAGO
• WikiHow
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
🔹 Title:
DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
🔹 Publication Date: Published on Jun 12
🔹 Abstract:
AI-generated summary: A Diffusion Transformer-based framework generates high-fidelity human-product demonstration videos by preserving identities and spatial relationships, using masked cross-attention and structured text encoding.
In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.
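Masked cross-attention, as used for the paired human-product references, is ordinary cross-attention with a mask that restricts which reference tokens each query may attend to. The tensor shapes and mask layout below are assumptions for illustration, not the model's actual layer.
```python
import torch

def masked_cross_attention(q, k, v, attend_mask):
    """q: (B, Nq, D) video-latent queries; k, v: (B, Nr, D) reference tokens.

    attend_mask: (B, Nq, Nr) boolean, True where a query may attend to a
    reference token (e.g., product-region queries attend to product tokens only).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (B, Nq, Nr)
    scores = scores.masked_fill(~attend_mask, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    return attn @ v                                         # (B, Nq, D)

B, Nq, Nr, D = 1, 16, 8, 64
q, k, v = torch.randn(B, Nq, D), torch.randn(B, Nr, D), torch.randn(B, Nr, D)
mask = torch.zeros(B, Nq, Nr, dtype=torch.bool)
mask[:, :8, :4] = True    # assumed: first half of queries see human reference tokens
mask[:, 8:, 4:] = True    # assumed: second half see product reference tokens
print(masked_cross_attention(q, k, v, mask).shape)  # torch.Size([1, 16, 64])
```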
🔹 Links:
• arXiv Page: https://arxiv.org/abs/2506.10568
• PDF: https://arxiv.org/pdf/2506.10568
• Github: https://submit2025-dream.github.io/DreamActor-H1/
🔹 Datasets citing this paper:
No datasets found
🔹 Spaces citing this paper:
No spaces found
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
MAGREF: Masked Guidance for Any-Reference Video Generation
Article Date: 29 May 2025
Article Description:
Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject inference, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF
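The pixel-wise channel concatenation described above amounts to stacking reference-image latents (plus a region mask) with the noisy video latents along the channel axis before denoising. A shape-level sketch with assumed latent dimensions, not MAGREF's code:
```python
import torch

def concat_reference_condition(noisy_latents: torch.Tensor,
                               reference_latents: torch.Tensor,
                               region_mask: torch.Tensor) -> torch.Tensor:
    """Concatenate conditioning signals along the channel dimension.

    noisy_latents:     (B, C, T, H, W) video latents being denoised
    reference_latents: (B, C, H, W)    encoded reference image(s)
    region_mask:       (B, 1, H, W)    region-aware mask for the subject
    """
    t = noisy_latents.size(2)
    ref = reference_latents.unsqueeze(2).expand(-1, -1, t, -1, -1)   # broadcast over time
    msk = region_mask.unsqueeze(2).expand(-1, -1, t, -1, -1)
    return torch.cat([noisy_latents, ref, msk], dim=1)               # (B, 2C+1, T, H, W)

x = concat_reference_condition(torch.randn(2, 16, 8, 32, 32),
                               torch.randn(2, 16, 32, 32),
                               torch.ones(2, 1, 32, 32))
print(x.shape)  # torch.Size([2, 33, 8, 32, 32])
```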
PDF Download Link:
https://arxiv.org/pdf/2505.23742v1.pdf
GitHub:
• https://github.com/magref-video/magref
Datasets:
• No datasets information available
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
Article Title:
RFUAV: A Benchmark Dataset for Unmanned Aerial Vehicle Detection and Identification
Article Date: 12 Mar 2025
Article Description:
In this paper, we propose RFUAV as a new benchmark dataset for radio-frequency based (RF-based) unmanned aerial vehicle (UAV) identification and address the following challenges: Firstly, many existing datasets feature a restricted variety of drone types and insufficient volumes of raw data, which fail to meet the demands of practical applications. Secondly, existing datasets often lack raw data covering a broad range of signal-to-noise ratios (SNR), or do not provide tools for transforming raw data to different SNR levels. This limitation undermines the validity of model training and evaluation. Lastly, many existing datasets do not offer open-access evaluation tools, leading to a lack of unified evaluation standards in current research within this field. RFUAV comprises approximately 1.3 TB of raw frequency data collected from 37 distinct UAVs using the Universal Software Radio Peripheral (USRP) device in real-world environments. Through in-depth analysis of the RF data in RFUAV, we define a drone feature sequence called RF drone fingerprint, which aids in distinguishing drone signals. In addition to the dataset, RFUAV provides a baseline preprocessing method and model evaluation tools. Rigorous experiments demonstrate that these preprocessing methods achieve state-of-the-art (SOTA) performance using the provided evaluation tools. The RFUAV dataset and baseline implementation are publicly available at https://github.com/kitoweeknd/RFUAV/.
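The "transform raw data to different SNR levels" tooling boils down to scaling additive white Gaussian noise against the measured signal power. A generic sketch for complex IQ samples, not the RFUAV toolkit itself:
```python
import numpy as np

def set_snr(iq: np.ndarray, target_snr_db: float, rng=np.random.default_rng(0)) -> np.ndarray:
    """Add complex white Gaussian noise so the result has the requested SNR,
    treating the input recording as (approximately) the clean signal."""
    signal_power = np.mean(np.abs(iq) ** 2)
    noise_power = signal_power / (10 ** (target_snr_db / 10))
    noise = np.sqrt(noise_power / 2) * (rng.standard_normal(iq.shape)
                                        + 1j * rng.standard_normal(iq.shape))
    return iq + noise

# Toy drone-like tone at an assumed sample rate, degraded to 5 dB SNR.
fs, f0 = 1e6, 50e3
t = np.arange(10_000) / fs
clean = np.exp(2j * np.pi * f0 * t)
noisy = set_snr(clean, target_snr_db=5.0)
measured = 10 * np.log10(np.mean(np.abs(clean) ** 2) / np.mean(np.abs(noisy - clean) ** 2))
print(round(measured, 1))  # ~5.0
```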
PDF Download Link:
https://arxiv.org/pdf/2503.09033v2.pdf
GitHub:
• https://github.com/kitoweeknd/RFUAV
Datasets:
• RFUAV
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
πΉ Title:
Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction
πΉ Publication Date: Published on Jun 15
πΉ Abstract:
ChartIR uses structured instruction and iterative refinement to improve MLLM performance on chart-to-code generation by separating visual understanding and code translation tasks. AI-generated summary: Recently, multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. While they have achieved impressive results on various vision tasks, their performance on chart-to-code generation remains suboptimal. This task requires MLLMs to generate executable code that can reproduce a given chart, demanding not only precise visual understanding but also accurate translation of visual elements into structured code. Directly prompting MLLMs to perform this complex task often yields unsatisfactory results. To address this challenge, we propose ChartIR, an iterative refinement method based on structured instruction. First, we distinguish two tasks: visual understanding and code translation. To accomplish the visual understanding component, we design two types of structured instructions: description and difference. The description instruction captures the visual elements of the reference chart, while the difference instruction characterizes the discrepancies between the reference chart and the generated chart. These instructions effectively transform visual features into language representations, thereby facilitating the subsequent code translation process. Second, we decompose the overall chart generation pipeline into two stages: initial code generation and iterative refinement, enabling progressive enhancement of the final output. Experimental results show that, compared to other methods, our method achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.
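To make the two-stage pipeline above concrete, here is a minimal, hypothetical sketch of a ChartIR-style loop; `describe`, `diff`, `generate`, and `render` are placeholders for the MLLM calls and the chart renderer, not the authors' implementation or API.

```python
# Minimal, hypothetical sketch of a ChartIR-style pipeline (not the authors' code).
# `describe`, `diff`, `generate`, and `render` stand in for MLLM calls and a chart
# renderer; any chat-style MLLM client (e.g. for Qwen2-VL or GPT-4o) could back them.

def chart_to_code(reference_image, describe, diff, generate, render, n_rounds=3):
    # Stage 1: initial code generation from a structured "description" instruction.
    description = describe(reference_image)
    code = generate(f"Write plotting code that reproduces this chart:\n{description}")
    # Stage 2: iterative refinement guided by a structured "difference" instruction.
    for _ in range(n_rounds):
        candidate = render(code)                     # execute code -> candidate chart image
        feedback = diff(reference_image, candidate)  # discrepancies, phrased in language
        if not feedback.strip():                     # no remaining differences reported
            break
        code = generate(
            f"Revise the plotting code.\nCurrent code:\n{code}\nDifferences to fix:\n{feedback}"
        )
    return code
```

Stopping when the difference report comes back empty is one plausible criterion; the paper itself may use a fixed number of rounds or other signals.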
πΉ Links:
β’ arXiv Page: https://arxiv.org/abs/2506.14837
β’ PDF: https://arxiv.org/pdf/2506.14837
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
==================================
For more data science resources:
β https://t.iss.one/DataScienceT
πΉ Title:
Show-o2: Improved Native Unified Multimodal Models
πΉ Publication Date: Published on Jun 18
πΉ Abstract:
Show-o2 leverages autoregressive modeling and flow matching within a 3D causal variational autoencoder to create unified visual representations for multimodal understanding and generation tasks. AI-generated summary: This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
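For readers unfamiliar with flow matching, the snippet below shows a generic rectified-flow-style training objective of the kind the abstract refers to, written in PyTorch; the `flow_head` call signature and the use of LLM features as conditioning are assumptions for illustration, not Show-o2's actual flow head or training recipe.

```python
# Generic rectified-flow / flow-matching training objective (illustrative only).
import torch
import torch.nn.functional as F

def flow_matching_loss(flow_head, latents, cond):
    """latents: clean visual latents x1 (e.g. from a 3D causal VAE); cond: conditioning features."""
    x1 = latents
    x0 = torch.randn_like(x1)                                  # noise endpoint of the path
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                               # point on the straight-line path
    v_target = x1 - x0                                         # constant velocity along that path
    v_pred = flow_head(xt, t.flatten(), cond)                  # model predicts the velocity field
    return F.mse_loss(v_pred, v_target)
```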
πΉ Links:
β’ arXiv Page: https://arxiv.org/abs/2506.15564
β’ PDF: https://arxiv.org/pdf/2506.15564
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
β’ https://huggingface.co/spaces/showlab/Show-o
β’ https://huggingface.co/spaces/svjack/Show-o
β’ https://huggingface.co/spaces/showlab/Show-o-512
==================================
For more data science resources:
β https://t.iss.one/DataScienceT
πΉ Title:
Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
πΉ Publication Date: Published on Jun 16
πΉ Abstract:
Stream-Omni, a large multimodal model, integrates text, vision, and speech by efficiently aligning modalities using sequence-dimension concatenation for vision and layer-dimension mapping for speech, achieving strong performance with less data. AI-generated summary: The emergence of GPT-4o-like large multimodal models (LMMs) has spurred exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representations of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments that can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns vision and speech to text based on their relationships. For vision, which is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech, which is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
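The two alignment strategies described above can be illustrated with generic tensors, as in the hedged sketch below; the shapes, the blank-token convention, and the overall layout are assumptions rather than Stream-Omni's actual interfaces.

```python
# Toy illustration of the two alignment strategies with random tensors (not Stream-Omni's API).
import torch

B, T_txt, T_img, T_speech, D, V = 2, 16, 8, 50, 512, 32000     # toy sizes

text_emb = torch.randn(B, T_txt, D)
image_emb = torch.randn(B, T_img, D)
# Vision-text alignment: concatenate along the sequence dimension and feed the LLM backbone.
llm_input = torch.cat([image_emb, text_emb], dim=1)             # (B, T_img + T_txt, D)

# Speech-text alignment: a CTC objective over frame-level logits from a speech-facing
# layer, mapping speech frames onto text tokens without lengthening the LLM sequence.
speech_log_probs = torch.randn(T_speech, B, V).log_softmax(-1)  # (T, B, V), as CTCLoss expects
text_tokens = torch.randint(1, V, (B, T_txt))                   # target token ids (0 = blank)
input_lengths = torch.full((B,), T_speech, dtype=torch.long)
target_lengths = torch.full((B,), T_txt, dtype=torch.long)
ctc_loss = torch.nn.CTCLoss(blank=0)(speech_log_probs, text_tokens, input_lengths, target_lengths)
```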
πΉ Links:
β’ arXiv Page: https://arxiv.org/abs/2506.13642
β’ PDF: https://arxiv.org/pdf/2506.13642
β’ Project Page: https://github.com/ictnlp/Stream-Omni
β’ Github: https://github.com/ictnlp/Stream-Omni
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
==================================
For more data science resources:
β https://t.iss.one/DataScienceT
πΉ Title:
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
πΉ Publication Date: Published on Jun 13
πΉ Abstract:
LLMs perform well on implementation-heavy competitive programming problems but struggle with nuanced algorithmic reasoning, as highlighted by LiveCodeBench Pro. AI-generated summary: Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that are continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, domains where expert humans still excel. We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications. High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning. LiveCodeBench Pro thus highlights the significant gap to human grandmaster levels, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.
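As context for the 53% pass@1 figure, the snippet below computes the standard unbiased pass@k estimator (Chen et al., 2021); this is the conventional definition of the metric, not code taken from LiveCodeBench Pro itself.

```python
# Standard unbiased pass@k estimator; shown only to define the metric behind pass@1.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated per problem, c: samples that passed, k: evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem with 3 passing gives an estimated pass@1 of 0.3.
print(pass_at_k(n=10, c=3, k=1))
```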
πΉ Links:
β’ arXiv Page: https://arxiv.org/abs/2506.11928
β’ PDF: https://arxiv.org/pdf/2506.11928
πΉ Datasets citing this paper:
No datasets found
πΉ Spaces citing this paper:
No spaces found
==================================
For more data science resources:
β https://t.iss.one/DataScienceT