✨Video-CoE: Reinforcing Video Event Prediction via Chain of Events
📝 Summary:
Video-CoE introduces a Chain of Events (CoE) paradigm to improve video event prediction. It addresses MLLM limitations in logical reasoning and visual utilization by constructing temporal event chains and using enhanced training. CoE achieves state-of-the-art performance on video event prediction (VEP) benchmarks.
🔹 Publication Date: Published on Mar 16
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.14935
• PDF: https://arxiv.org/pdf/2603.14935
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VideoEventPrediction #ChainOfEvents #MLLM #ComputerVision #AI
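Below is a minimal, illustrative sketch of the chain-of-events idea from the summary above: representing a clip's observed events as an ordered temporal chain and formatting a next-event prediction prompt from it. The Event fields and prompt wording are assumptions for illustration, not the paper's actual pipeline or training setup.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    start_s: float          # event start time in seconds
    end_s: float            # event end time in seconds
    description: str        # short natural-language description of the event

def build_coe_prompt(events: List[Event], question: str) -> str:
    """Format observed events as an ordered temporal chain, then ask the
    model to continue the chain by predicting the next event."""
    chain = " -> ".join(
        f"[{e.start_s:.1f}-{e.end_s:.1f}s] {e.description}"
        for e in sorted(events, key=lambda e: e.start_s)
    )
    return (
        "Observed chain of events:\n"
        f"{chain}\n\n"
        f"Question: {question}\n"
        "Reason over the chain step by step, then predict the next event."
    )

if __name__ == "__main__":
    events = [
        Event(0.0, 3.2, "a person fills a kettle with water"),
        Event(3.2, 7.5, "the kettle is placed on the stove"),
        Event(7.5, 12.0, "the person takes a mug from the cupboard"),
    ]
    print(build_coe_prompt(events, "What is the person most likely to do next?"))
```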
✨Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass
📝 Summary:
CHROMM is a unified framework that jointly reconstructs cameras, scene point clouds, and human meshes from multi-person multi-view videos. It integrates strong priors, handles scale discrepancies, and uses multi-view fusion for faster, more robust human-scene reconstruction.
🔹 Publication Date: Published on Mar 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12789
• PDF: https://arxiv.org/pdf/2603.12789
• Project Page: https://nstar1125.github.io/chromm
• Github: https://nstar1125.github.io/chromm/
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#3DReconstruction #ComputerVision #HumanSceneReconstruction #MultiViewVideo #AIResearch
✨V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning
📝 Summary:
V-JEPA 2.1 is a self-supervised model learning dense visual representations for images and videos. It combines dense predictive loss, deep self-supervision, multi-modal tokenizers, and scaling to achieve state-of-the-art performance across various benchmarks, significantly advancing visual understanding.
🔹 Publication Date: Published on Mar 15
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.14482
• PDF: https://arxiv.org/pdf/2603.14482
• Project Page: https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
• Github: https://github.com/facebookresearch/vjepa2
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#SelfSupervisedLearning #ComputerVision #DeepLearning #AI #VideoUnderstanding
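A toy PyTorch sketch of a JEPA-style dense predictive objective as described above: a context encoder sees only unmasked tokens, an EMA target encoder sees everything, and a predictor regresses the target embeddings at the masked positions (a latent-space loss, no pixel reconstruction). The tiny transformer, dimensions, and momentum value are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, N_TOKENS, BATCH = 64, 16, 2

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
target_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
target_encoder.load_state_dict(encoder.state_dict())
for p in target_encoder.parameters():
    p.requires_grad_(False)                           # target is updated by EMA, not gradients
predictor = nn.Linear(DIM, DIM)

tokens = torch.randn(BATCH, N_TOKENS, DIM)            # stand-in for patch/tubelet embeddings
mask = torch.zeros(BATCH, N_TOKENS, dtype=torch.bool)
mask[:, N_TOKENS // 2:] = True                        # mask the second half of the tokens

ctx = encoder(tokens * (~mask).unsqueeze(-1))         # context path: masked tokens zeroed out
with torch.no_grad():
    tgt = target_encoder(tokens)                      # target path: full input, gradient-free

pred = predictor(ctx)
loss = F.smooth_l1_loss(pred[mask], tgt[mask])        # dense loss only on masked positions
loss.backward()
print(f"dense predictive loss: {loss.item():.4f}")

# EMA update of the target encoder (momentum 0.99), as in JEPA-style training.
with torch.no_grad():
    for p_t, p_s in zip(target_encoder.parameters(), encoder.parameters()):
        p_t.mul_(0.99).add_(p_s, alpha=0.01)
```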
✨Prompt-Free Universal Region Proposal Network
📝 Summary:
PF-RPN is a novel network that identifies potential objects without needing external prompts, improving flexibility. It uses Sparse Image-Aware Adapters and Cascade Self-Prompting to localize objects, validated across 19 datasets. This method works across diverse domains with limited data.
🔹 Publication Date: Published on Mar 18
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.17554
• PDF: https://arxiv.org/pdf/2603.17554
• Github: https://github.com/tangqh03/PF-RPN
🔹 Models citing this paper:
• https://huggingface.co/tangqh/PF-RPN
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#ObjectDetection #ComputerVision #DeepLearning #RPN #PromptFreeAI
✨EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
📝 Summary:
EffectErase is a new video object removal method that effectively erases dynamic objects and their visual effects. It introduces VOR, a large dataset for training, and uses reciprocal learning with task-aware guidance for high-quality results.
🔹 Publication Date: Published on Mar 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.19224
• PDF: https://arxiv.org/pdf/2603.19224
• Project Page: https://henghuiding.com/EffectErase/
• Github: https://github.com/FudanCVL/EffectErase
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VideoEditing #ComputerVision #ObjectRemoval #DeepLearning #AI
✨VID-AD: A Dataset for Image-Level Logical Anomaly Detection under Vision-Induced Distraction
📝 Summary:
VID-AD is a dataset for logical anomaly detection in industrial inspection, specifically addressing challenges from visual distractions. A new language-based framework is also proposed, which uses text descriptions and contrastive learning to capture logical attributes.
🔹 Publication Date: Published on Mar 14
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13964
• PDF: https://arxiv.org/pdf/2603.13964
• Github: https://github.com/nkthiroto/VID-AD
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AnomalyDetection #IndustrialInspection #ComputerVision #MachineLearning #Datasets
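A rough sketch of the language-based idea described above: aligning image embeddings with text descriptions of logical attributes (e.g. "two screws present", "cable routed left") via a symmetric contrastive loss. The random embeddings stand in for a pretrained vision/text encoder such as CLIP; this is not the paper's exact framework.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: the i-th image should match the i-th attribute text."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

batch, dim = 8, 256
image_features = torch.randn(batch, dim, requires_grad=True)   # stand-in image embeddings
attribute_text_features = torch.randn(batch, dim)               # stand-in attribute-text embeddings
loss = contrastive_loss(image_features, attribute_text_features)
loss.backward()
print(f"contrastive alignment loss: {loss.item():.4f}")
```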
✨DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising
📝 Summary:
DreamPartGen generates 3D objects by modeling part geometry and appearance with Duplex Part Latents. It captures inter-part relationships using Relational Semantic Latents for improved text-shape alignment. A co-denoising process ensures consistency and achieves state-of-the-art results.
🔹 Publication Date: Published on Mar 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.19216
• PDF: https://arxiv.org/pdf/2603.19216
• Project Page: https://plan-lab.github.io/dreampartgen
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#3DGeneration #GenerativeAI #DeepLearning #ComputerVision #TextTo3D
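A heavily simplified sketch of co-denoising several part latents jointly, so that each part's update is conditioned on all the others and the parts stay mutually consistent. The tiny MLP denoiser and linear schedule are placeholders; the paper's Duplex Part Latents and Relational Semantic Latents are not modeled here.

```python
import torch
import torch.nn as nn

N_PARTS, DIM, STEPS = 4, 32, 10

# Placeholder denoiser: sees all part latents plus the shared timestep at once.
denoiser = nn.Sequential(
    nn.Linear(N_PARTS * DIM + 1, 128), nn.GELU(), nn.Linear(128, N_PARTS * DIM))

latents = torch.randn(N_PARTS, DIM)              # one latent per object part, starting from noise
for step in reversed(range(STEPS)):
    t = torch.tensor([step / STEPS])             # normalized timestep shared by all parts
    joint = torch.cat([latents.flatten(), t])    # condition every part on every other part
    eps = denoiser(joint).view(N_PARTS, DIM)     # predicted noise for all parts at once
    latents = latents - (1.0 / STEPS) * eps      # crude shared update step

print("denoised part latents:", tuple(latents.shape))
```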
✨WorldAgents: Can Foundation Image Models be Agents for 3D World Models?
📝 Summary:
This research investigates if 2D foundation image models inherently possess 3D world modeling capabilities. It proposes an agentic framework to leverage this, demonstrating that 2D models can synthesize expansive, consistent 3D worlds.
🔹 Publication Date: Published on Mar 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.19708
• PDF: https://arxiv.org/pdf/2603.19708
• Project Page: https://ziyaerkoc.com/worldagents/
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #ComputerVision #3DWorldModels #GenerativeAI #FoundationModels
✨LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
📝 Summary:
LumosX enhances text-to-video generation by improving face-attribute alignment and subject consistency. It uses a new data pipeline to infer subject dependencies and Relational Attention mechanisms to explicitly link subjects with attributes, achieving state-of-the-art personalized multi-subject video generation.
🔹 Publication Date: Published on Mar 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.20192
• PDF: https://arxiv.org/pdf/2603.20192
• Project Page: https://jiazheng-xing.github.io/lumosx-home/
• Github: https://github.com/alibaba-damo-academy/Lumos-Custom
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#TextToVideo #VideoGeneration #PersonalizedAI #ComputerVision #DeepLearning
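A small illustration of the relate-each-subject-to-its-own-attributes idea above: an attention mask that lets each subject's query token attend only to that subject's attribute tokens, so attributes cannot leak across identities. The token contents and shapes are invented for the demo and are not the paper's Relational Attention implementation.

```python
import torch
import torch.nn.functional as F

dim = 32
subjects = {"person_A": 3, "person_B": 2}            # number of attribute tokens per subject

queries = torch.randn(len(subjects), dim)            # one query token per subject
keys = torch.randn(sum(subjects.values()), dim)      # attribute tokens, grouped per subject
values = torch.randn_like(keys)

# Build a (num_subjects, num_attributes) mask: True where attention is allowed.
mask = torch.zeros(len(subjects), keys.size(0), dtype=torch.bool)
offset = 0
for row, n_attr in enumerate(subjects.values()):
    mask[row, offset:offset + n_attr] = True
    offset += n_attr

scores = queries @ keys.t() / dim ** 0.5
scores = scores.masked_fill(~mask, float("-inf"))    # block subject-to-foreign-attribute links
attn = F.softmax(scores, dim=-1)
out = attn @ values
print("attention pattern (rows = subjects):")
print(attn.round(decimals=2))
```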
✨Teaching an Agent to Sketch One Part at a Time
📝 Summary:
Researchers developed an agent that generates vector sketches incrementally, one part at a time. It uses a multi-modal language model and process-reward reinforcement learning with a new part-annotated dataset. This enables controllable and editable text-to-vector sketch generation.
🔹 Publication Date: Published on Mar 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.19500
• PDF: https://arxiv.org/pdf/2603.19500
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #GenerativeAI #MachineLearning #ComputerVision #ReinforcementLearning
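A toy sketch of process-reward reinforcement learning for part-by-part generation, as in the summary above: the policy emits one "part" at a time and receives a reward after every step (not only at the end), updated here with plain REINFORCE. The part vocabulary, state encoding, and reward are placeholders, not the paper's model or dataset.

```python
import torch
import torch.nn as nn

PARTS = ["head", "body", "leg", "tail", "<stop>"]
policy = nn.Sequential(nn.Linear(len(PARTS), 64), nn.Tanh(), nn.Linear(64, len(PARTS)))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def step_reward(history: list, part: str) -> float:
    """Placeholder process reward: +1 for a new, non-stop part, else 0."""
    return 1.0 if part != "<stop>" and part not in history else 0.0

state = torch.zeros(len(PARTS))                    # crude summary of parts drawn so far
history, log_probs, rewards = [], [], []
for _ in range(6):                                 # draw up to six parts
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    part = PARTS[action.item()]
    log_probs.append(dist.log_prob(action))
    rewards.append(step_reward(history, part))     # per-step (process) reward
    history.append(part)
    state = state.clone()
    state[action] += 1.0
    if part == "<stop>":
        break

returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)   # reward-to-go per step
loss = -(torch.stack(log_probs) * returns).sum()             # REINFORCE objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("parts drawn:", history, "| loss:", round(loss.item(), 3))
```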
✨HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
📝 Summary:
HopChain is a framework that synthesizes multi-hop vision-language reasoning data to improve VLMs. This data features logically dependent reasoning chains, addressing VLMs' struggle with complex reasoning. Training with HopChain data significantly enhances generalizable VLM performance across diverse benchmarks.
🔹 Publication Date: Published on Mar 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.17024
• PDF: https://arxiv.org/pdf/2603.17024
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VLMs #DataSynthesis #MultiHopReasoning #AIResearch #ComputerVision
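A minimal sketch of how single-hop facts could be chained into a multi-hop question with logically dependent steps, mirroring the idea above. The facts and question templates are invented for illustration; the paper's synthesis pipeline operates on real vision-language data.

```python
# Two single-hop facts: (entity, relation) -> answer. Placeholders only.
facts = {
    ("the red car", "parked next to"): "a fire hydrant",
    ("a fire hydrant", "painted"): "yellow",
}

def compose_two_hop(entity: str) -> dict:
    """Chain two single-hop facts: hop 2 asks about hop 1's answer."""
    (e1, rel1), mid = next((k, v) for k, v in facts.items() if k[0] == entity)
    (e2, rel2), final = next((k, v) for k, v in facts.items() if k[0] == mid)
    question = f"What is the object {rel1} {e1} {rel2}?"
    reasoning = [f"{e1} is {rel1} {mid}.", f"{mid} is {rel2} {final}."]
    return {"question": question, "reasoning_chain": reasoning, "answer": final}

print(compose_two_hop("the red car"))
```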
✨TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
📝 Summary:
TerraScope is a new VLM for Earth Observation enabling pixel-grounded geospatial reasoning. It offers modality-flexible and multi-temporal capabilities, outperforming existing models on a new benchmark for accurate and interpretable results.
🔹 Publication Date: Published on Mar 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.19039
• PDF: https://arxiv.org/pdf/2603.19039
• Project Page: https://shuyansy.github.io/terrascope/
• Github: https://github.com/shuyansy/Earth-Observation-VLMs
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#EarthObservation #VLM #Geospatial #RemoteSensing #ComputerVision
✨HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
📝 Summary:
HiMu is a training-free framework for long video QA. It efficiently selects relevant frames using hierarchical query decomposition with lightweight multimodal experts, preserving temporal and cross-modal structure. HiMu advances the efficiency-accuracy Pareto front.
🔹 Publication Date: Published on Mar 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.18558
• PDF: https://arxiv.org/pdf/2603.18558
• Project Page: https://danbenami.github.io/HiMu.io/
• Github: https://github.com/DanBenAmi/HiMu
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VideoQA #MultimodalAI #ComputerVision #MachineLearning #AI
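A rough, training-free sketch of the frame-selection idea above: decompose the question into sub-queries, score every frame against each sub-query, and keep the top frames per sub-query so each reasoning step gets visual evidence. The hard-coded decomposition and random similarity scores are stand-ins for the paper's lightweight multimodal experts (e.g. CLIP-style scoring).

```python
import numpy as np

rng = np.random.default_rng(0)

def decompose(query: str) -> list:
    """Placeholder decomposition; the paper uses multimodal experts for this."""
    return ["who is present?", "what action happens?", "what happens afterwards?"]

def select_frames(query: str, num_frames: int = 300, per_subquery: int = 4) -> list:
    sub_queries = decompose(query)
    scores = rng.random((len(sub_queries), num_frames))       # stand-in frame/sub-query similarity
    keep = set()
    for row in scores:
        keep.update(np.argsort(row)[-per_subquery:].tolist())  # top frames for this sub-query
    return sorted(keep)

print(select_frames("Why does the chef leave the kitchen at the end?"))
```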
✨Versatile Editing of Video Content, Actions, and Dynamics without Training
📝 Summary:
DynaEdit is a training-free method for versatile video editing using pretrained text-to-video models. It addresses limitations in handling complex edits, actions, and object interactions by solving technical issues like misalignment and jitter, achieving state-of-the-art results.
🔹 Publication Date: Published on Mar 18
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.17989
• PDF: https://arxiv.org/pdf/2603.17989
• Project Page: https://dynaedit.github.io
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VideoEditing #TextToVideo #GenerativeAI #ComputerVision #AIResearch
✨From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
📝 Summary:
This paper shifts VLM image tampering detection from coarse object masks to pixel-level analysis with semantic understanding. It introduces a new taxonomy, benchmark, and metrics to evaluate both localization accuracy and the meaning of image modifications. This offers a more rigorous standard for evaluating tampering detection.
🔹 Publication Date: Published on Mar 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.20193
• PDF: https://arxiv.org/pdf/2603.20193
• Github: https://github.com/VILA-Lab/PIXAR
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VLM #ImageTampering #DeepfakeDetection #ComputerVision #AIResearch
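A small sketch of pixel-level tamper-localization scoring (IoU and F1 between a predicted binary mask and the ground-truth modified region). Only the localization side is illustrated here; the paper's metrics additionally assess the semantics of the modification.

```python
import numpy as np

def localization_scores(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Pixel-level IoU and F1 between predicted and ground-truth tamper masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn + 1e-9)
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return {"iou": float(iou), "f1": float(f1)}

gt_mask = np.zeros((64, 64), dtype=bool); gt_mask[10:30, 10:30] = True      # ground-truth tampered region
pred_mask = np.zeros((64, 64), dtype=bool); pred_mask[12:32, 12:32] = True  # model's predicted region
print(localization_scores(pred_mask, gt_mask))
```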