Data Science | Machine Learning with Python for Researchers
31.7K subscribers
1.92K photos
102 videos
22 files
2.19K links
Admin: @HusseinSheikho

The Data Science and Python channel is for researchers and advanced programmers

Buy ads: https://telega.io/c/dataScienceT
πŸ”Ή Title:
ECO: Ensembling Context Optimization for Vision-Language Models

πŸ”Ή Publication Date: Published on Jul 26, 2023

πŸ”Ή Abstract:
Learning an ensemble of prompts enhances few-shot image classification using vision-language models like CLIP without increasing inference costs. AI-generated summary: Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts for maximizing CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse, and possibly shorter, contexts considerably and consistently improves the results compared to relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
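As a rough illustration of the idea above, here is a minimal PyTorch sketch of prompt ensembling for CLIP-style classification (not the authors' implementation); the encoder stand-in, embedding sizes, and number of prompts are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of prompt ensembling for CLIP-style few-shot classification.
# `encode_text` stands in for a frozen CLIP text encoder; only the learned
# context vectors are trainable. Sizes below are illustrative assumptions.
num_prompts, ctx_len, dim, num_classes = 4, 8, 512, 10
contexts = torch.nn.Parameter(torch.randn(num_prompts, ctx_len, dim) * 0.02)
class_embs = torch.randn(num_classes, dim)   # frozen class-name embeddings

def encode_text(ctx, cls_emb):
    # Placeholder for the text encoder: pool context tokens plus the class token.
    return F.normalize(ctx.mean(dim=0) + cls_emb, dim=-1)

def classify(image_feat):
    image_feat = F.normalize(image_feat, dim=-1)
    logits = 0.0
    for p in range(num_prompts):   # average similarities over the prompt ensemble
        text_feats = torch.stack([encode_text(contexts[p], c) for c in class_embs])
        logits = logits + image_feat @ text_feats.T
    # Text features can be precomputed once, so per-image inference cost is unchanged.
    return logits / num_prompts

print(classify(torch.randn(dim)).shape)   # torch.Size([10])
```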

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2307.14063
β€’ PDF: https://arxiv.org/pdf/2307.14063

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀4
Article Title:
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

Article Date: Mar 2025

Article Description:
We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. Prior video models already generate realistic videos, but they tend to leverage little 3D information, leading to inconsistencies, such as objects popping in and out of existence. Camera control, if implemented at all, is imprecise, because camera parameters are mere inputs to the neural network which must then infer how the video depends on the camera. In contrast, GEN3C is guided by a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Crucially, this means that GEN3C neither has to remember what it previously generated nor does it have to infer the image structure from the camera pose. The model, instead, can focus all its generative power on previously unobserved regions, as well as advancing the scene state to the next frame. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video. Results are best viewed in videos. Check out our webpage: https://research.nvidia.com/labs/toronto-ai/GEN3C/ (CVPR 2025).
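For intuition, here is a small NumPy sketch of the 3D-cache idea: unproject a seed frame's depth into a point cloud, then re-render it under a user-provided camera to obtain 2D conditioning. The camera intrinsics, image size, and random depth map are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

# Sketch: build a point-cloud "3D cache" from a seed frame's depth, then
# splat it into a new view to use as 2D conditioning for the video model.
# Intrinsics, image size, and the random depth map are illustrative only.
H, W = 120, 160
K = np.array([[150.0, 0, W / 2], [0, 150.0, H / 2], [0, 0, 1]])
depth = np.random.uniform(2.0, 5.0, (H, W))        # stand-in for predicted depth

# Unproject pixels of the seed frame into 3D camera coordinates.
v, u = np.mgrid[0:H, 0:W]
pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(float)
pts = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)

# New camera: a small translation to the right (user-provided trajectory).
R, t = np.eye(3), np.array([0.3, 0.0, 0.0])
cam = (R @ pts.T).T + t

# Project the cached points into the new view; keep points in front of the camera.
proj = (K @ cam.T).T
uv = proj[:, :2] / proj[:, 2:3]
valid = (proj[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
render = np.zeros((H, W))
render[uv[valid, 1].astype(int), uv[valid, 0].astype(int)] = depth.reshape(-1)[valid]
print(render.shape)   # this rendering (plus colors) would condition the next frames
```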

PDF Download Link:
https://arxiv.org/pdf/2503.03751v1.pdf

GitHub:
β€’ https://github.com/nv-tlabs/GEN3C

Datasets:
β€’ Waymo Open Dataset
β€’ Kubric
β€’ RealEstate10K
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀2
πŸ”Ή Title:
SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner

πŸ”Ή Publication Date: Published on Jun 10

πŸ”Ή Abstract:
A novel data synthesis framework, SWE-Flow, uses unit tests to automatically infer development steps and generate a structured schedule for Test-Driven Development (TDD), significantly improving the performance of open models fine-tuned on real-world projects. AI-generated summary: We introduce **SWE-Flow**, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, **SWE-Flow** automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of **SWE-Flow** is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step *development schedule*. At each step, **SWE-Flow** produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the **SWE-Flow-Eval** benchmark. Our experiments show that fine-tuning open models on this dataset significantly improves performance in TDD-based coding. To facilitate further research, we release all code, datasets, models, and Docker images at [Github](https://github.com/Hambaobao/SWE-Flow).
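A toy sketch of the general idea of turning a runtime dependency graph into an ordered TDD schedule (implement callees before callers); the graph edges and function names below are invented for illustration and do not reflect SWE-Flow's actual implementation.

```python
import networkx as nx

# Toy runtime dependency graph: an edge A -> B means test/function A calls B,
# so B must exist before A can pass. Names are illustrative only.
rdg = nx.DiGraph()
rdg.add_edges_from([
    ("test_checkout", "compute_total"),
    ("compute_total", "apply_discount"),
    ("compute_total", "sum_items"),
])

# A TDD-style schedule implements callees before callers: reverse topological order.
schedule = list(reversed(list(nx.topological_sort(rdg))))
for step, func in enumerate(schedule, 1):
    # Each step would come with a partial codebase and the unit tests it must satisfy.
    print(f"step {step}: implement {func}")
```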

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2506.09003
β€’ PDF: https://arxiv.org/pdf/2506.09003
β€’ Github: https://github.com/Hambaobao/SWE-Flow

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀2
πŸ”Ή Title:
A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

πŸ”Ή Publication Date: Published on Jun 11

πŸ”Ή Abstract:
InterSyn, a large-scale dataset with tightly interleaved image-text outputs and automated quality refinement, improves multimodal understanding and generation through the SEIR method and SynJudge, an automatic evaluation tool. AI-generated summary: Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR) method. InterSyn features multi-turn, instruction-driven dialogues with tightly interleaved image-text responses, providing rich object diversity and rigorous automated quality refinement, making it well-suited for training next-generation instruction-following LMMs. Furthermore, to address the lack of reliable evaluation tools capable of assessing interleaved multimodal outputs, we introduce SynJudge, an automatic evaluation model designed to quantitatively assess multimodal outputs along four dimensions: text content, image content, image quality, and image-text synergy. Experimental studies show that the SEIR method leads to substantially higher dataset quality compared to an otherwise identical process without refinement. Moreover, LMMs trained on InterSyn achieve uniform performance gains across all evaluation metrics, confirming InterSyn's utility for advancing multimodal systems.

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2506.09427
β€’ PDF: https://arxiv.org/pdf/2506.09427

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀1
πŸ”Ή Title: CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects

πŸ”Ή Publication Date:
Published on May 27

πŸ”Ή Abstract:
A coordinated diffusion noise optimization framework improves whole-body manipulation of articulated objects by leveraging specialized diffusion models for body and hand motions and a unified basis point set representation for precise hand-object interaction. AI-generated summary: Synthesizing whole-body manipulation of articulated objects, including body motion, hand motion, and object motion, is a critical yet challenging task with broad applications in virtual humans and robotics. The core challenges are twofold. First, achieving realistic whole-body motion requires tight coordination between the hands and the rest of the body, as their movements are interdependent during manipulation. Second, articulated object manipulation typically involves high degrees of freedom and demands higher precision, often requiring the fingers to be placed at specific regions to actuate movable parts. To address these challenges, we propose a novel coordinated diffusion noise optimization framework. Specifically, we perform noise-space optimization over three specialized diffusion models for the body, left hand, and right hand, each trained on its own motion dataset to improve generalization. Coordination naturally emerges through gradient flow along the human kinematic chain, allowing the global body posture to adapt in response to hand motion objectives with high fidelity. To further enhance precision in hand-object interaction, we adopt a unified representation based on basis point sets (BPS), where end-effector positions are encoded as distances to the same BPS used for object geometry. This unified representation captures fine-grained spatial relationships between the hand and articulated object parts, and the resulting trajectories serve as targets to guide the optimization of diffusion noise, producing highly accurate interaction motion. We conduct extensive experiments demonstrating that our method outperforms existing approaches in motion quality and physical plausibility, and enables various capabilities such as object pose control, simultaneous walking and manipulation, and whole-body generation from hand-only data.
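A minimal NumPy sketch of the basis point set (BPS) idea mentioned above: both object geometry and end-effector positions are encoded as distances to one shared set of basis points. The sizes and random point clouds are placeholders.

```python
import numpy as np

# Basis point set (BPS) sketch: a fixed set of basis points encodes any point
# cloud as the distance from each basis point to its nearest cloud point.
# Sizes and random geometry below are illustrative placeholders.
rng = np.random.default_rng(0)
basis = rng.uniform(-1, 1, (64, 3))                 # shared basis point set

def bps_encode(points, basis):
    # Distance from each basis point to the nearest point of the cloud.
    d = np.linalg.norm(basis[:, None, :] - points[None, :, :], axis=-1)
    return d.min(axis=1)                            # shape: (num_basis,)

object_pts = rng.uniform(-1, 1, (500, 3))           # articulated-object surface samples
hand_keypoints = rng.uniform(-1, 1, (21, 3))        # end-effector positions

obj_code = bps_encode(object_pts, basis)
hand_code = bps_encode(hand_keypoints, basis)
# Because both codes live in the same basis, their relation captures fine-grained
# hand-object spatial structure that can serve as a target for noise optimization.
print(obj_code.shape, hand_code.shape)
```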

πŸ”Ή Links:
- arXiv Page: https://arxiv.org/abs/2505.21437
- PDF: https://arxiv.org/pdf/2505.21437
- Project Page: https://phj128.github.io/page/CoDA/index.html

πŸ”Ή Models citing this paper:
No models found

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
❀3
πŸ”Ή Title:
Aligning Text, Images, and 3D Structure Token-by-Token

πŸ”Ή Publication Date: Published on Jun 9

πŸ”Ή Abstract:
A unified language, image, and 3D scene model framework is proposed, achieving optimal training and performance across various 3D tasks and datasets. AI-generated summary: Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and robots navigating and interacting within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We evaluate performance across four core 3D tasks -- rendering, recognition, instruction-following, and question-answering -- and four 3D datasets, synthetic and real-world. We extend our approach to reconstruct complex 3D object shapes by enriching our 3D modality with quantized shape encodings, and show our model's effectiveness on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/
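A toy illustration of what token-by-token alignment across modalities can look like: each modality is mapped into one shared discrete vocabulary and concatenated into a single autoregressive sequence. The vocabulary sizes, offsets, and example token ids are invented for illustration and are not the paper's actual tokenization.

```python
# Toy illustration of a shared token space for text, image, and 3D-scene tokens.
# Vocabulary sizes and the example inputs are invented for illustration.
TEXT_VOCAB, IMAGE_VOCAB, SCENE_VOCAB = 32000, 8192, 4096
IMG_OFFSET = TEXT_VOCAB                      # image codes live after text ids
SCENE_OFFSET = TEXT_VOCAB + IMAGE_VOCAB      # 3D-scene codes after image codes

def to_sequence(text_ids, image_codes, scene_codes):
    # One flat autoregressive sequence; special separator tokens would be added in practice.
    return (
        list(text_ids)
        + [IMG_OFFSET + c for c in image_codes]
        + [SCENE_OFFSET + c for c in scene_codes]
    )

seq = to_sequence(text_ids=[17, 993, 4], image_codes=[5, 1200], scene_codes=[7, 42])
print(seq)   # a single token stream that a standard LLM can train on with next-token prediction
```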

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2506.08002
β€’ PDF: https://arxiv.org/pdf/2506.08002
β€’ Project Page: https://glab-caltech.github.io/kyvo/

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀3πŸ‘3
Article Title:
Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs

Article Date: 23 Aug 2023

Article Description:
Despite the progress of foundation models, knowledge-based reasoning remains a persistent challenge due to their limited capacity for knowledge recall and inference. Existing methods primarily focus on encouraging these models to plan and solve problems or extensively sample reasoning chains independently. However, these methods often overlook conceptual errors and inferential fallacies, inevitably leading to a series of notorious issues such as misleading conclusions, cognitive biases, and reduced decision quality. While explicit modeling of causality is argued to hold promise in addressing these issues, contemporary research efforts have thus far fallen short in achieving causality-based foundation models. Drawing inspiration from the orchestration of diverse specialized agents collaborating to tackle intricate tasks, we propose a framework named Causal-Consistency Chain-of-Thought (CaCo-CoT) that harnesses multi-agent collaboration to bolster the faithfulness and causality of foundation models, involving a set of reasoners and evaluators. These agents collaboratively work within a reasoning-and-consensus paradigm to improve faithfulness. The reasoners are tasked with generating reasoning chains for knowledge-intensive problems by mimicking human causal reasoning. Meanwhile, the evaluator scrutinizes the causal consistency of a reasoner's reasoning chain from a non-causal and a counterfactual perspective. Our framework demonstrates significant superiority over state-of-the-art methods through extensive and comprehensive evaluations across text-based and multi-modal knowledge reasoning tasks (e.g., science question answering and commonsense reasoning).
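A hedged sketch of a reasoning-and-consensus loop in the spirit of the reasoner/evaluator setup described above; `call_llm` is a placeholder for any chat-completion client, and the prompts, canned replies, and majority vote are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a reasoner/evaluator loop with consensus; not the paper's code.
def call_llm(prompt: str) -> str:
    # Placeholder for any chat-completion API; canned replies keep the sketch runnable.
    if prompt.startswith("Check"):
        return "The chain is causally consistent."
    return "Premise -> cause -> effect, so the answer is X."

def answer_with_consensus(question: str, num_reasoners: int = 3) -> str:
    accepted = []
    for _ in range(num_reasoners):
        chain = call_llm(f"Reason step by step, citing causal links:\n{question}")
        # The evaluator checks causal consistency from non-causal and counterfactual angles.
        verdict = call_llm(f"Check this reasoning chain for causal consistency:\n{chain}")
        if "consistent" in verdict.lower():
            accepted.append(chain)
    # Consensus: majority vote over the accepted reasoning chains.
    return max(set(accepted), key=accepted.count) if accepted else "no consensus"

print(answer_with_consensus("Why does ice float on water?"))
```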

PDF Download Link:
https://arxiv.org/pdf/2308.11914v4.pdf

GitHub:
β€’ https://github.com/hcplab-sysu/causal-vlreasoning
β€’ https://github.com/hcplab-sysu/causalvlr

Datasets:
β€’ BoolQ
β€’ ScienceQA
β€’ Com2Sense
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀6
Article Title:
From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

Article Date: 5 Jun 2025

Article Description:
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks that focus on the temporal aspect link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics, we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 25.82.
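A short PyTorch sketch of stage (ii): aligning the composed (query video + modification text) embedding with its target-video embedding via an InfoNCE-style contrastive loss. The encoders are abstracted away, and the shapes and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch of contrastive alignment between composed-query embeddings and their
# matching target-video embeddings. Shapes and temperature are illustrative.
def contrastive_alignment_loss(composed, targets, temperature=0.07):
    composed = F.normalize(composed, dim=-1)      # (B, D) composed-query embeddings
    targets = F.normalize(targets, dim=-1)        # (B, D) matching target-video embeddings
    logits = composed @ targets.T / temperature   # (B, B) similarity matrix
    labels = torch.arange(composed.size(0))       # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```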

PDF Download Link:
https://arxiv.org/pdf/2506.05274v1.pdf

GitHub:
β€’ https://github.com/ucf-crcv/tf-covr

Datasets:
β€’ Fashion IQ
β€’ FineGym
β€’ CIRCO
β€’ FineDiving
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀6
Article Title:
Multiple Object Stitching for Unsupervised Representation Learning

Article Date: 9 Jun 2025

Article Description:
Contrastive learning for single object centric images has achieved remarkable progress on unsupervised representation, but suffers from inferior performance on widespread images with multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine the unsupervised representation for multi-object images. Specifically, we construct the multi-object images by stitching the single object centric ones, where the objects in the synthesized multi-object images are predetermined. Hence, compared to the existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representations of each object in a multi-object image, thus providing more detailed representations for complicated downstream tasks, such as object detection and semantic segmentation. Experimental results on ImageNet, CIFAR and COCO datasets demonstrate that our proposed method achieves the leading unsupervised representation performance on both single object centric images and multi-object ones. The source code is available at https://github.com/visresearch/MultipleObjectStitching.
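A small NumPy sketch of the stitching idea: tile single-object images into one multi-object image, so object locations (and hence correspondences) are known without annotation. The 2x2 layout and random "images" are placeholders.

```python
import numpy as np

# Sketch of the stitching idea: tile single-object images into one multi-object
# image; because placement is known, object correspondences come for free.
# Image sizes and the random "images" are placeholders for real crops.
def stitch_2x2(images):
    assert len(images) == 4
    top = np.concatenate(images[:2], axis=1)
    bottom = np.concatenate(images[2:], axis=1)
    stitched = np.concatenate([top, bottom], axis=0)
    h, w = images[0].shape[:2]
    # Each object's location in the stitched image is predetermined.
    boxes = [(0, 0, h, w), (0, w, h, 2 * w), (h, 0, 2 * h, w), (h, w, 2 * h, 2 * w)]
    return stitched, boxes

singles = [np.random.rand(32, 32, 3) for _ in range(4)]
multi, boxes = stitch_2x2(singles)
print(multi.shape, boxes[1])   # (64, 64, 3) and the known region of object 1
```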

PDF Download Link:
https://arxiv.org/pdf/2506.07364v1.pdf

GitHub:
β€’ https://github.com/visresearch/MultipleObjectStitching

Datasets:
β€’ No datasets information available
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀3
Article Title:
HtFLlib: A Comprehensive Heterogeneous Federated Learning Library and Benchmark

Article Date: 4 Jun 2025

Article Description:
As AI evolves, collaboration among heterogeneous models helps overcome data scarcity by enabling knowledge transfer across institutions and devices. Traditional Federated Learning (FL) only supports homogeneous models, limiting collaboration among clients with heterogeneous model architectures. To address this, Heterogeneous Federated Learning (HtFL) methods are developed to enable collaboration across diverse heterogeneous models while tackling the data heterogeneity issue at the same time. However, a comprehensive benchmark for standardized evaluation and analysis of the rapidly growing HtFL methods is lacking. Firstly, the highly varied datasets, model heterogeneity scenarios, and different method implementations become hurdles to making easy and fair comparisons among HtFL methods. Secondly, the effectiveness and robustness of HtFL methods are under-explored in various scenarios, such as the medical domain and sensor signal modality. To fill this gap, we introduce the first Heterogeneous Federated Learning Library (HtFLlib), an easy-to-use and extensible framework that integrates multiple datasets and model heterogeneity scenarios, offering a robust benchmark for research and practical applications. Specifically, HtFLlib integrates (1) 12 datasets spanning various domains, modalities, and data heterogeneity scenarios; (2) 40 model architectures, ranging from small to large, across three modalities; (3) a modularized and easy-to-extend HtFL codebase with implementations of 10 representative HtFL methods; and (4) systematic evaluations in terms of accuracy, convergence, computation costs, and communication costs. We emphasize the advantages and potential of state-of-the-art HtFL methods and hope that HtFLlib will catalyze advancing HtFL research and enable its broader applications. The code is released at https://github.com/TsingZ0/HtFLlib.

PDF Download Link:
https://arxiv.org/pdf/2506.03954v1.pdf

GitHub:
β€’ https://github.com/tsingz0/htfllib
β€’ https://github.com/TsingZ0/GFL
β€’ https://github.com/TsingZ0/HtFL

Datasets:
β€’ CIFAR-10
β€’ CIFAR-100
β€’ Oxford 102 Flower
β€’ AG News
β€’ DomainNet
β€’ PAMAP2
β€’ COVIDx
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀2
πŸ”Ή Title:
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

πŸ”Ή Publication Date: Published on Jun 12

πŸ”Ή Abstract:
VRBench is a long narrative video benchmark designed to evaluate models' multi-step reasoning and procedural validity through human-labeled question-answering pairs and a human-AI collaborative framework with a multi-phase evaluation pipeline. AI-generated summary: We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain comprehensively from multiple dimensions. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.
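A small sketch of combining outcome-level MCQ correctness with a process-level score over the reasoning chain; the step judge and the 50/50 weighting are illustrative assumptions, not VRBench's actual metric.

```python
# Sketch: outcome- and process-level evaluation of multi-step video reasoning.
# The step judge and the equal weighting are illustrative assumptions.
def judge_step(step: str) -> float:
    # Stand-in for an LLM-guided scorer of one reasoning step (0.0-1.0).
    return 1.0 if step.strip() else 0.0

def evaluate(pred_choice, gold_choice, reasoning_steps):
    outcome = 1.0 if pred_choice == gold_choice else 0.0   # MCQ correctness
    process = sum(judge_step(s) for s in reasoning_steps) / max(len(reasoning_steps), 1)
    return 0.5 * outcome + 0.5 * process

print(evaluate("B", "B", ["locate event at 00:14", "attribute cause to character A"]))
```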

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2506.10857
β€’ PDF: https://arxiv.org/pdf/2506.10857
β€’ Project Page: https://vrbench.github.io/
β€’ Github: https://github.com/OpenGVLab/VRBench

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀4
Article Title:
VLMs Can Aggregate Scattered Training Patches

Article Date: 4 Jun 2025

Article Description:
One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples in their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches, scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or text references. For instance, if trained on image patches from a bloody scene paired with the description "safe," VLMs may later describe the full image, or a text reference to the scene, as "safe." We define the core ability of VLMs enabling this attack as $\textit{visual stitching}$ -- the ability to integrate visual information spread across multiple training samples that share the same textual descriptions. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each $(\texttt{image}, \texttt{ID})$ pair into $\{(\texttt{patch}, \texttt{ID})\}$ pairs at different granularity for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text reference. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like "safe" or "unsafe", demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at https://github.com/ZHZisZZ/visual-stitching.
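A minimal NumPy sketch of splitting an (image, ID) pair into (patch, ID) pairs at a chosen granularity, as in the probing setup described above; the image size and grid are illustrative.

```python
import numpy as np

# Sketch of splitting an (image, ID) pair into (patch, ID) pairs at a chosen
# granularity, as used to probe visual stitching. Sizes are illustrative.
def split_into_patches(image, grid=4):
    h, w = image.shape[0] // grid, image.shape[1] // grid
    return [
        image[i * h:(i + 1) * h, j * w:(j + 1) * w]
        for i in range(grid)
        for j in range(grid)
    ]

image = np.random.rand(64, 64, 3)                    # stands in for a real training image
patches = split_into_patches(image, grid=4)
training_pairs = [(p, "ID-0427") for p in patches]   # same synthetic ID on every patch
print(len(training_pairs), training_pairs[0][0].shape)   # 16 patches, each (16, 16, 3)
```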

PDF Download Link:
https://arxiv.org/pdf/2506.03614v1.pdf

GitHub:
β€’ https://github.com/zhziszz/visual-stitching

Datasets:
β€’ No datasets information available
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀4
πŸ”Ή Title:
Efficient Medical VIE via Reinforcement Learning

πŸ”Ή Publication Date: Published on Jun 16

πŸ”Ή Abstract:
An RLVR framework using fine-tuned Qwen2.5-VL-7B achieves state-of-the-art performance in medical VIE with limited annotated samples, enhancing reasoning and the balance between precision and recall. AI-generated summary: Visual Information Extraction (VIE) converts unstructured document images into structured formats like JSON, critical for medical applications such as report analysis and online consultations. Traditional methods rely on OCR and language models, while end-to-end multimodal models offer direct JSON generation. However, domain-specific schemas and high annotation costs limit their effectiveness in medical VIE. We base our approach on the Reinforcement Learning with Verifiable Rewards (RLVR) framework to address these challenges using only 100 annotated samples. Our approach ensures dataset diversity, a balanced precision-recall reward mechanism to reduce hallucinations and improve field coverage, and innovative sampling strategies to enhance reasoning capabilities. Fine-tuning Qwen2.5-VL-7B with our RLVR method, we achieve state-of-the-art performance on medical VIE tasks, significantly improving F1, precision, and recall. While our models excel on tasks similar to medical datasets, performance drops on dissimilar tasks, highlighting the need for domain-specific optimization. Case studies further demonstrate the value of reasoning during training and inference for VIE.
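A small sketch of what a balanced precision-recall reward over extracted JSON fields could look like; the field names and the F1-style combination are assumptions, not necessarily the paper's exact reward design.

```python
# Sketch of a balanced precision-recall reward for extracted JSON fields.
# Field names and the harmonic-mean (F1-style) combination are illustrative.
def field_reward(predicted: dict, gold: dict) -> float:
    pred_items = set(predicted.items())
    gold_items = set(gold.items())
    if not pred_items or not gold_items:
        return 0.0
    tp = len(pred_items & gold_items)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_items)   # penalizes hallucinated fields
    recall = tp / len(gold_items)      # penalizes missed fields
    return 2 * precision * recall / (precision + recall)

pred = {"patient_name": "Li Wei", "diagnosis": "flu", "ward": "3B"}
gold = {"patient_name": "Li Wei", "diagnosis": "flu"}
print(round(field_reward(pred, gold), 3))   # 0.8: one hallucinated field lowers precision
```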

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2506.13363
β€’ PDF: https://arxiv.org/pdf/2506.13363

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀2
Article Title:
AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

Article Date: 29 May 2025

Article Description:
We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.
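A hedged structural sketch of the extract-then-conceptualize flow described above (an LLM proposes triples, then assigns entities to semantic categories); `call_llm` is a placeholder to be replaced with a real client, and the prompts are illustrative, not AutoSchemaKG's actual ones.

```python
import json

# Structural sketch: extract (head, relation, tail) triples with an LLM, then
# conceptualize each entity into a semantic category, inducing a schema on the fly.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def extract_triples(passage: str):
    raw = call_llm(
        "Return a JSON list of triples [{'head':..., 'relation':..., 'tail':...}] "
        f"for the entities and events in:\n{passage}"
    )
    return json.loads(raw)

def conceptualize(entity: str) -> str:
    # Maps an instance to a schema-level concept, e.g. "Marie Curie" -> "scientist".
    return call_llm(f"Give one general semantic category for: {entity}")

def build_graph(passages):
    schema, edges = {}, []
    for passage in passages:
        for t in extract_triples(passage):
            schema.setdefault(t["head"], conceptualize(t["head"]))
            schema.setdefault(t["tail"], conceptualize(t["tail"]))
            edges.append((t["head"], t["relation"], t["tail"]))
    return schema, edges   # dynamically induced schema + knowledge-graph edges
```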

PDF Download Link:
https://arxiv.org/pdf/2505.23628v1.pdf

GitHub:
β€’ https://github.com/hkust-knowcomp/autoschemakg

Datasets:
β€’ MML
β€’ MMLU
β€’ HotpotQA
β€’ YAGO
β€’ WikiHow
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀5
πŸ”Ή Title:
DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers

πŸ”Ή Publication Date: Published on Jun 12

πŸ”Ή Abstract:
A Diffusion Transformer-based framework generates high-fidelity human-product demonstration videos by preserving identities and spatial relationships, using masked cross-attention and structured text encoding. AI-generated summary: In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.
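A minimal PyTorch sketch of masked cross-attention, where video tokens may only attend to the reference tokens allowed by a binary mask; the shapes and mask pattern are illustrative, not the paper's architecture.

```python
import torch

# Minimal masked cross-attention: video tokens attend only to the reference
# tokens (e.g. human / product) permitted by a binary mask. Shapes are illustrative.
def masked_cross_attention(video_tokens, ref_tokens, mask):
    # video_tokens: (B, Nq, D), ref_tokens: (B, Nk, D), mask: (B, Nq, Nk) of {0, 1}
    d = video_tokens.size(-1)
    scores = video_tokens @ ref_tokens.transpose(1, 2) / d**0.5
    scores = scores.masked_fill(mask == 0, float("-inf"))   # block disallowed pairs
    attn = torch.softmax(scores, dim=-1)
    return attn @ ref_tokens

B, Nq, Nk, D = 2, 16, 8, 64
mask = torch.zeros(B, Nq, Nk)
mask[:, :, :4] = 1          # e.g. these queries may only look at human-reference tokens
out = masked_cross_attention(torch.randn(B, Nq, D), torch.randn(B, Nk, D), mask)
print(out.shape)            # torch.Size([2, 16, 64])
```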

πŸ”Ή Links:
β€’ arXiv Page: https://arxiv.org/abs/2506.10568
β€’ PDF: https://arxiv.org/pdf/2506.10568
β€’ Project Page: https://submit2025-dream.github.io/DreamActor-H1/

πŸ”Ή Datasets citing this paper:
No datasets found

πŸ”Ή Spaces citing this paper:
No spaces found
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀1
Article Title:
MAGREF: Masked Guidance for Any-Reference Video Generation

Article Date: 29 May 2025

Article Description:
Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject inference, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF
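A tiny PyTorch sketch of pixel-wise channel concatenation with a region-aware mask: reference latents are masked and stacked with the video latents along the channel dimension before entering the denoiser. All shapes and the random mask are illustrative assumptions.

```python
import torch

# Sketch of pixel-wise channel concatenation for reference conditioning.
# Shapes, the broadcast of reference features over time, and the random
# region mask are illustrative placeholders, not the paper's exact design.
B, C, T, H, W = 1, 4, 8, 32, 32
video_latents = torch.randn(B, C, T, H, W)
ref_latents = torch.randn(B, C, T, H, W)                   # reference features per frame
region_mask = (torch.rand(B, 1, T, H, W) > 0.5).float()    # region-aware dynamic mask

# Stack along the channel dimension: 4 video + 4 masked reference + 1 mask = 9 channels.
conditioned = torch.cat([video_latents, ref_latents * region_mask, region_mask], dim=1)
print(conditioned.shape)    # torch.Size([1, 9, 8, 32, 32]) -> fed to the diffusion backbone
```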

PDF Download Link:
https://arxiv.org/pdf/2505.23742v1.pdf

GitHub:
β€’ https://github.com/magref-video/magref

Datasets:
β€’ No datasets information available
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀2
Article Title:
RFUAV: A Benchmark Dataset for Unmanned Aerial Vehicle Detection and Identification

Article Date: 12 Mar 2025

Article Description:
In this paper, we propose RFUAV as a new benchmark dataset for radio-frequency based (RF-based) unmanned aerial vehicle (UAV) identification and address the following challenges: Firstly, many existing datasets feature a restricted variety of drone types and insufficient volumes of raw data, which fail to meet the demands of practical applications. Secondly, existing datasets often lack raw data covering a broad range of signal-to-noise ratios (SNR), or do not provide tools for transforming raw data to different SNR levels. This limitation undermines the validity of model training and evaluation. Lastly, many existing datasets do not offer open-access evaluation tools, leading to a lack of unified evaluation standards in current research within this field. RFUAV comprises approximately 1.3 TB of raw frequency data collected from 37 distinct UAVs using the Universal Software Radio Peripheral (USRP) device in real-world environments. Through in-depth analysis of the RF data in RFUAV, we define a drone feature sequence called RF drone fingerprint, which aids in distinguishing drone signals. In addition to the dataset, RFUAV provides a baseline preprocessing method and model evaluation tools. Rigorous experiments demonstrate that these preprocessing methods achieve state-of-the-art (SOTA) performance using the provided evaluation tools. The RFUAV dataset and baseline implementation are publicly available at https://github.com/kitoweeknd/RFUAV/.
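A short NumPy sketch of transforming a clean capture to a target SNR by adding complex white Gaussian noise, illustrating the kind of SNR-level tooling described above (not the repository's actual implementation).

```python
import numpy as np

# Sketch: scale additive white Gaussian noise so a clean complex baseband
# signal reaches a target SNR (in dB). The toy signal below is illustrative.
def set_snr(signal: np.ndarray, target_snr_db: float) -> np.ndarray:
    signal_power = np.mean(np.abs(signal) ** 2)
    noise_power = signal_power / (10 ** (target_snr_db / 10))
    noise = np.sqrt(noise_power / 2) * (
        np.random.randn(*signal.shape) + 1j * np.random.randn(*signal.shape)
    )
    return signal + noise

iq = np.exp(1j * 2 * np.pi * 0.01 * np.arange(4096))   # toy complex baseband signal
noisy = set_snr(iq, target_snr_db=5.0)
print(noisy.dtype, noisy.shape)
```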

PDF Download Link:
https://arxiv.org/pdf/2503.09033v2.pdf

GitHub:
β€’ https://github.com/kitoweeknd/RFUAV

Datasets:
β€’ RFUAV
==================================

For more data science resources:

βœ“ https://t.iss.one/DataScienceT
❀1