✨Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
📝 Summary:
This paper introduces process-driven image generation, an iterative method with interleaved textual and visual reasoning. It decomposes synthesis into planning, drafting, reflecting, and refining steps. Dense step-wise supervision ensures consistency and interpretability of intermediate states.
🔹 Publication Date: Published on Apr 8
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.04746
• PDF: https://arxiv.org/pdf/2604.04746
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#ImageGeneration #GenerativeAI #ArtificialIntelligence #DeepLearning #ComputerVision
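The plan → draft → reflect → refine loop described in the summary can be sketched as a generic skeleton; the four step functions here are hypothetical stand-ins for the paper's textual and visual reasoning steps, not its actual interface:

```python
def process_driven_generate(prompt, plan, draft, reflect, refine, max_rounds=3):
    """Interleaved loop: plan -> draft -> (reflect -> refine)*.

    `steps` records every intermediate state, mirroring the dense
    step-wise supervision the summary mentions.
    """
    steps = []
    layout = plan(prompt)                 # textual planning step
    image = draft(layout)                 # initial visual draft
    steps += [("plan", layout), ("draft", image)]
    for _ in range(max_rounds):
        critique = reflect(image)         # textual reflection on the draft
        steps.append(("reflect", critique))
        if critique is None:              # nothing left to fix: stop early
            break
        image = refine(image, critique)   # visual refinement step
        steps.append(("refine", image))
    return image, steps
```

With toy callables where the "image" is just a set of scene elements, the loop converges once every planned element is present.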
✨VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
📝 Summary:
VRAG-RL introduces a reinforcement learning framework to empower vision-language models for understanding visually rich information. It uses adaptive visual perception and query optimization to enhance retrieval and reasoning, overcoming limitations of current RAG methods.
🔹 Publication Date: Published on May 28, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2505.22019
• PDF: https://arxiv.org/pdf/2505.22019
• Github: https://github.com/Alibaba-NLP/VRAG
🔹 Models citing this paper:
• https://huggingface.co/Qiuchen-Wang/Qwen2.5-VL-7B-VRAG
==================================
#RAG #ReinforcementLearning #VisionLanguageModels #ComputerVision #AI
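The iterative retrieve → reason → rewrite cycle the summary describes is, in skeleton form, a standard agentic-RAG loop; the three callables below are illustrative stand-ins, not VRAG-RL's actual components:

```python
def vrag_style_loop(question, retrieve, reason, rewrite, max_iters=4):
    """Iterative retrieval-reasoning: answer when evidence suffices,
    otherwise optimize the query and retrieve again."""
    query = question
    evidence = []
    for _ in range(max_iters):
        evidence += retrieve(query)          # e.g. page crops / regions
        answer = reason(question, evidence)  # try to answer from evidence
        if answer is not None:
            return answer, evidence
        query = rewrite(question, evidence)  # query optimization step
    return None, evidence
```

An RL framework in this setting would reward the final answer (and intermediate retrieval quality) to train the reason/rewrite policies jointly.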
✨RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
📝 Summary:
RefineAnything is a multimodal diffusion model for region-specific image refinement. It fixes local detail collapse while strictly preserving backgrounds using a Focus-and-Refine strategy and boundary-aware loss. This provides a practical solution for high-precision local editing.
🔹 Publication Date: Published on Apr 8
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.06870
• PDF: https://arxiv.org/pdf/2604.06870
• Project Page: https://limuloo.github.io/RefineAnything/
• Github: https://github.com/limuloo/RefineAnything
==================================
#DiffusionModels #ImageEditing #ComputerVision #DeepLearning #GenerativeAI
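The "strictly preserve the background, focus supervision on the edited region and its boundary" idea can be illustrated with a toy 1-D masked loss; this is a plain-Python sketch under assumed weighting, not the paper's actual boundary-aware loss:

```python
def region_refine_loss(pred, target, mask, boundary_weight=4.0):
    """Masked L1 loss over a 1-D signal: background (mask == 0) contributes
    nothing, so it is preserved by construction; pixels adjacent to the
    mask boundary get extra weight."""
    total, count = 0.0, 0
    n = len(pred)
    for i in range(n):
        if mask[i] == 0:
            continue                          # background: untouched
        left = mask[i - 1] if i > 0 else 0
        right = mask[i + 1] if i < n - 1 else 0
        weight = boundary_weight if 0 in (left, right) else 1.0
        total += weight * abs(pred[i] - target[i])
        count += 1
    return total / max(count, 1)
```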
✨CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
📝 Summary:
CT-1 is a Vision-Language-Camera model that improves camera-controllable video generation. It uses a Diffusion Transformer and Wavelet Regularization Loss to accurately estimate camera trajectories, enabling precise video synthesis. This achieves 25.7% better accuracy than prior methods.
🔹 Publication Date: Published on Apr 10
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.09201
• PDF: https://arxiv.org/pdf/2604.09201
• Project Page: https://gulucaptain.github.io/Camera-Transformer-1/
• Github: https://github.com/gulucaptain/Camera-Transformer-1
==================================
#AI #VideoGeneration #ComputerVision #DiffusionModels #VisionLanguageModels
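A wavelet regularization loss generally penalizes high-frequency coefficients of a signal; a minimal 1-D Haar version (an assumed, generic form — the summary does not specify CT-1's exact loss) looks like this:

```python
def haar_step(x):
    """One Haar level: pairwise averages (low-pass) and pairwise
    differences (high-pass) of an even-length 1-D signal."""
    lo = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    hi = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return lo, hi

def wavelet_l1(x):
    """L1 norm of the high-pass Haar coefficients: small for smooth
    trajectories, large for jittery ones."""
    _, hi = haar_step(x)
    return sum(abs(h) for h in hi)
```

Applied to an estimated camera trajectory, such a term favors smooth motion: a constant path scores 0.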
✨WildDet3D: Scaling Promptable 3D Detection in the Wild
📝 Summary:
WildDet3D is a unified architecture for open-world 3D object detection, accepting multiple prompt types and integrating geometric cues. It leverages WildDet3D-Data, the largest 3D dataset, to achieve state-of-the-art performance across benchmarks, with significant gains from incorporating depth information.
🔹 Publication Date: Published on Apr 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08626
• PDF: https://arxiv.org/pdf/2604.08626
• Project Page: https://allenai.github.io/WildDet3D/
• Github: https://github.com/allenai/WildDet3D
==================================
#3DObjectDetection #ComputerVision #DeepLearning #AI #Datasets
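"Accepting multiple prompt types" implies a dispatch layer in the unified interface; a hypothetical router (the prompt shapes are assumptions, not WildDet3D's API) might look like:

```python
def route_prompt(prompt):
    """Classify a detection prompt as text, 3-D point, or 3-D box so it
    can be sent to the matching encoder."""
    if isinstance(prompt, str):
        return "text"
    if isinstance(prompt, (tuple, list)) and len(prompt) == 6:
        return "box3d"    # (x, y, z, w, h, d)
    if isinstance(prompt, (tuple, list)) and len(prompt) == 3:
        return "point3d"  # (x, y, z)
    raise ValueError("unsupported prompt type")
```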
✨Structured Causal Video Reasoning via Multi-Objective Alignment
📝 Summary:
This paper introduces Structured Event Facts for explicit causal video reasoning, moving beyond unstructured methods. It uses a multi-objective reinforcement learning pipeline to balance training goals, leading to Factum-4B. This model achieves reliable, stronger performance on complex temporal video reasoning tasks.
🔹 Publication Date: Published on Apr 6
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.04415
• PDF: https://arxiv.org/pdf/2604.04415
==================================
#CausalAI #VideoReasoning #ReinforcementLearning #ComputerVision #AIResearch
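Balancing several training goals in a multi-objective RL pipeline is often done by scalarizing per-objective rewards into one signal; this generic weighted sum (the objective names are illustrative, not Factum-4B's actual recipe) shows the idea:

```python
def scalarize_rewards(rewards, weights):
    """Weighted-sum scalarization of named objectives into a single
    scalar RL reward; keys of both dicts must match."""
    assert rewards.keys() == weights.keys(), "objective mismatch"
    return sum(weights[k] * rewards[k] for k in rewards)
```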
✨3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis
📝 Summary:
3DTV is a feedforward network combining lightweight geometry and learning for real-time, robust sparse-view interpolation. It generates novel views efficiently without scene-specific optimization, making it practical for interactive applications.
🔹 Publication Date: Published on Apr 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.11211
• PDF: https://arxiv.org/pdf/2604.11211
• Project Page: https://stefanmschulz.github.io/3DTV_webpage/
• Github: https://github.com/StefanMSchulz/3DTV
==================================
#ViewSynthesis #DeepLearning #ComputerVision #NeuralNetworks #RealTimeAI
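For context, the degenerate baseline that geometry-aware feedforward interpolation improves on is a plain per-pixel blend of the two source views, which ghosts whenever content moves between views:

```python
def naive_blend(view_a, view_b, t):
    """Per-pixel linear interpolation between two views at t in [0, 1];
    no geometry is used, so misaligned content produces ghosting."""
    return [(1 - t) * a + t * b for a, b in zip(view_a, view_b)]
```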
✨ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video
📝 Summary:
ReconPhys is the first feedforward framework to jointly learn physical attribute estimation and 3D Gaussian Splatting reconstruction from a single video. It offers significantly faster inference and superior reconstruction quality for non-rigid objects compared to prior optimization-based methods.
🔹 Publication Date: Published on Apr 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.07882
• PDF: https://arxiv.org/pdf/2604.07882
• Project Page: https://chuanshuogushi.github.io/ReconPhys/
• Github: https://chuanshuogushi.github.io/ReconPhys/
==================================
#ComputerVision #3DReconstruction #GaussianSplatting #DeepLearning #AIResearch
✨VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
📝 Summary:
VEFX-Bench offers a large human-annotated video editing dataset and VEFX-Reward, a specialized model for quality assessment. This benchmark allows standardized comparison, showing current models struggle with instruction following and edit locality.
🔹 Publication Date: Published on Apr 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.16272
• PDF: https://arxiv.org/pdf/2604.16272
• Project Page: https://xiangbogaobarry.github.io/VEFX-Bench/
==================================
#VideoEditing #VFX #AI #ComputerVision #Benchmarks
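"Edit locality" can be quantified as the share of total pixel change that falls inside the intended edit region; this toy metric over flattened frames is an illustrative proxy, not VEFX-Bench's actual scoring:

```python
def edit_locality(before, after, mask):
    """Fraction of total absolute change inside the edit mask
    (1.0 = perfectly local edit)."""
    inside = sum(abs(a - b) for b, a, m in zip(before, after, mask) if m)
    total = sum(abs(a - b) for b, a in zip(before, after))
    return inside / total if total else 1.0
```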
✨NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results
📝 Summary:
This paper overviews the NTIRE 2026 Challenge on Video Saliency Prediction. Participants developed automatic saliency map prediction for videos using a novel 2,000-video dataset with crowdsourced fixations. Over 20 teams submitted, and all challenge data is now publicly available.
🔹 Publication Date: Published on Apr 16
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.14816
• PDF: https://arxiv.org/pdf/2604.14816
• Project Page: https://www.codabench.org/competitions/12842/
• Github: https://github.com/msu-video-group/NTIRE26_Saliency_Prediction
==================================
#VideoSaliency #ComputerVision #NTIRE #MachineLearning #SaliencyPrediction
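Saliency-map predictions in challenges like this are commonly scored with the linear correlation coefficient (CC) against the fixation density map; a plain-Python version over flattened maps:

```python
def saliency_cc(pred, gt):
    """Pearson correlation between a predicted saliency map and a
    ground-truth fixation density map (both flattened)."""
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gt) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gt))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_g = sum((g - mg) ** 2 for g in gt)
    return cov / (var_p * var_g) ** 0.5
```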
✨Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
📝 Summary:
This paper improves vision-language models for compositional reasoning by using concreteness-based negative sample selection and a novel margin-based loss. The resulting framework, Slipform, achieves state-of-the-art accuracy on compositional benchmarks and cross-modal retrieval.
🔹 Publication Date: Published on Apr 14
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.13313
• PDF: https://arxiv.org/pdf/2604.13313
==================================
#VisionLanguage #DeepLearning #AIResearch #ComputerVision #NLP
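A margin-based loss over mined negatives generally takes a hinge form: each negative's similarity must sit at least a margin below the positive pair's. This generic version (not Slipform's exact objective) illustrates it:

```python
def margin_contrastive_loss(sim_pos, sims_neg, margin=0.2):
    """Hinge loss: penalize each negative whose similarity comes within
    `margin` of the positive-pair similarity."""
    return sum(max(0.0, margin - (sim_pos - s)) for s in sims_neg)
```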
✨CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
📝 Summary:
CityRAG generates long-term, physically grounded video sequences that maintain environmental consistency and support complex navigation through real-world geography, using geo-registered data as context.
🔹 Publication Date: Published on Apr 21
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.19741
• PDF: https://arxiv.org/pdf/2604.19741
• Project Page: https://cityrag.github.io/
==================================
#VideoGeneration #GenerativeAI #SpatialAI #ComputerVision #UrbanSimulation
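"Geo-registered data as context" suggests a spatial retrieval step: fetch records near the current camera position to condition generation. A hypothetical tile lookup (the field names are assumptions, not CityRAG's schema):

```python
def nearby_tiles(position, tiles, radius):
    """Return geo-registered records within `radius` of the camera
    position (2-D Euclidean distance)."""
    px, py = position
    return [t for t in tiles
            if ((t["x"] - px) ** 2 + (t["y"] - py) ** 2) ** 0.5 <= radius]
```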
✨DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
📝 Summary:
DeVI enables physically plausible dexterous robot control by leveraging text-conditioned synthetic videos through a hybrid tracking reward that combines 3D and 2D tracking for improved hand-object interaction.
🔹 Publication Date: Published on Apr 22
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.20841
• PDF: https://arxiv.org/pdf/2604.20841
• Project Page: https://snuvclab.github.io/devi/
• Github: https://github.com/snuvclab/devi
==================================
#Robotics #AI #ComputerVision #HumanRobotInteraction #DeepLearning
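A hybrid tracking reward that mixes 3D and 2D tracking errors can be sketched as a weighted exp(-error) shaping term; this generic form is an assumption, not DeVI's published formula:

```python
import math

def hybrid_tracking_reward(err_3d, err_2d, w3d=0.5, w2d=0.5, scale=1.0):
    """Imitation reward in (0, 1]: 1.0 at zero error, decaying
    exponentially as 3-D keypoint or 2-D projected error grows."""
    return w3d * math.exp(-scale * err_3d) + w2d * math.exp(-scale * err_2d)
```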
✨3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
📝 Summary:
3D-VCD is a new inference-time framework that reduces hallucinations in 3D embodied agents. It constructs distorted 3D scene graphs and contrasts predictions to suppress ungrounded tokens. This improves reasoning on 3D benchmarks without retraining.
🔹 Publication Date: Published on Apr 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08645
• PDF: https://arxiv.org/pdf/2604.08645
• Project Page: https://plan-lab.github.io/projects/3d-vcd
==================================
#3DLLM #EmbodiedAI #HallucinationMitigation #ComputerVision #AIResearch
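Contrastive decoding generally adjusts next-token logits by the gap between the faithful input and the distorted one; this generic update (not necessarily 3D-VCD's exact formulation) shows how ungrounded tokens get suppressed:

```python
def contrastive_decode(logits_full, logits_distorted, alpha=1.0):
    """Boost tokens the intact 3-D scene supports more than the distorted
    scene does; tokens that score equally well on both inputs (i.e.
    ungrounded ones) fall behind grounded tokens."""
    return [lf + alpha * (lf - ld)
            for lf, ld in zip(logits_full, logits_distorted)]
```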