✨Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
📝 Summary:
This paper introduces process-driven image generation, an iterative method with interleaved textual and visual reasoning. It decomposes synthesis into planning, drafting, reflecting, and refining steps. Dense step-wise supervision ensures consistency and interpretability of intermediate states.
🔹 Publication Date: Published on Apr 8
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.04746
• PDF: https://arxiv.org/pdf/2604.04746
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#ImageGeneration #GenerativeAI #ArtificialIntelligence #DeepLearning #ComputerVision
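The plan → draft → reflect → refine loop described in the summary can be sketched as a generic skeleton; the four step functions here are hypothetical stand-ins for the paper's textual and visual reasoning steps, not its actual interface:

```python
def process_driven_generate(prompt, plan, draft, reflect, refine, max_rounds=3):
    """Interleaved loop: plan -> draft -> (reflect -> refine)*.

    `steps` records every intermediate state, mirroring the dense
    step-wise supervision the summary mentions.
    """
    steps = []
    layout = plan(prompt)                 # textual planning step
    image = draft(layout)                 # initial visual draft
    steps += [("plan", layout), ("draft", image)]
    for _ in range(max_rounds):
        critique = reflect(image)         # textual reflection on the draft
        steps.append(("reflect", critique))
        if critique is None:              # nothing left to fix: stop early
            break
        image = refine(image, critique)   # visual refinement step
        steps.append(("refine", image))
    return image, steps
```

With toy callables where the "image" is just a set of scene elements, the loop converges once every planned element is present.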
✨VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
📝 Summary:
VRAG-RL introduces a reinforcement learning framework to empower vision-language models for understanding visually rich information. It uses adaptive visual perception and query optimization to enhance retrieval and reasoning, overcoming limitations of current RAG methods.
🔹 Publication Date: Published on May 28, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2505.22019
• PDF: https://arxiv.org/pdf/2505.22019
• Github: https://github.com/Alibaba-NLP/VRAG
🔹 Models citing this paper:
• https://huggingface.co/Qiuchen-Wang/Qwen2.5-VL-7B-VRAG
==================================
#RAG #ReinforcementLearning #VisionLanguageModels #ComputerVision #AI
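The iterative retrieve → reason → rewrite cycle the summary describes is, in skeleton form, a standard agentic-RAG loop; the three callables below are illustrative stand-ins, not VRAG-RL's actual components:

```python
def vrag_style_loop(question, retrieve, reason, rewrite, max_iters=4):
    """Iterative retrieval-reasoning: answer when evidence suffices,
    otherwise optimize the query and retrieve again."""
    query = question
    evidence = []
    for _ in range(max_iters):
        evidence += retrieve(query)          # e.g. page crops / regions
        answer = reason(question, evidence)  # try to answer from evidence
        if answer is not None:
            return answer, evidence
        query = rewrite(question, evidence)  # query optimization step
    return None, evidence
```

An RL framework in this setting would reward the final answer (and intermediate retrieval quality) to train the reason/rewrite policies jointly.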
✨RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
📝 Summary:
RefineAnything is a multimodal diffusion model for region-specific image refinement. It fixes local detail collapse while strictly preserving backgrounds using a Focus-and-Refine strategy and boundary-aware loss. This provides a practical solution for high-precision local editing.
🔹 Publication Date: Published on Apr 8
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.06870
• PDF: https://arxiv.org/pdf/2604.06870
• Project Page: https://limuloo.github.io/RefineAnything/
• Github: https://github.com/limuloo/RefineAnything
==================================
#DiffusionModels #ImageEditing #ComputerVision #DeepLearning #GenerativeAI
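The "strictly preserve the background, focus supervision on the edited region and its boundary" idea can be illustrated with a toy 1-D masked loss; this is a plain-Python sketch under assumed weighting, not the paper's actual boundary-aware loss:

```python
def region_refine_loss(pred, target, mask, boundary_weight=4.0):
    """Masked L1 loss over a 1-D signal: background (mask == 0) contributes
    nothing, so it is preserved by construction; pixels adjacent to the
    mask boundary get extra weight."""
    total, count = 0.0, 0
    n = len(pred)
    for i in range(n):
        if mask[i] == 0:
            continue                          # background: untouched
        left = mask[i - 1] if i > 0 else 0
        right = mask[i + 1] if i < n - 1 else 0
        weight = boundary_weight if 0 in (left, right) else 1.0
        total += weight * abs(pred[i] - target[i])
        count += 1
    return total / max(count, 1)
```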
✨CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
📝 Summary:
CT-1 is a Vision-Language-Camera model that improves camera-controllable video generation. It uses a Diffusion Transformer and Wavelet Regularization Loss to accurately estimate camera trajectories, enabling precise video synthesis. This achieves 25.7% better accuracy than prior methods.
🔹 Publication Date: Published on Apr 10
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.09201
• PDF: https://arxiv.org/pdf/2604.09201
• Project Page: https://gulucaptain.github.io/Camera-Transformer-1/
• Github: https://github.com/gulucaptain/Camera-Transformer-1
==================================
#AI #VideoGeneration #ComputerVision #DiffusionModels #VisionLanguageModels
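A wavelet regularization loss generally penalizes high-frequency coefficients of a signal; a minimal 1-D Haar version (an assumed, generic form — the summary does not specify CT-1's exact loss) looks like this:

```python
def haar_step(x):
    """One Haar level: pairwise averages (low-pass) and pairwise
    differences (high-pass) of an even-length 1-D signal."""
    lo = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    hi = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return lo, hi

def wavelet_l1(x):
    """L1 norm of the high-pass Haar coefficients: small for smooth
    trajectories, large for jittery ones."""
    _, hi = haar_step(x)
    return sum(abs(h) for h in hi)
```

Applied to an estimated camera trajectory, such a term favors smooth motion: a constant path scores 0.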
✨WildDet3D: Scaling Promptable 3D Detection in the Wild
📝 Summary:
WildDet3D is a unified architecture for open-world 3D object detection, accepting multiple prompt types and integrating geometric cues. It leverages WildDet3D-Data, the largest 3D dataset, to achieve state-of-the-art performance across benchmarks, with significant gains from incorporating depth information.
🔹 Publication Date: Published on Apr 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08626
• PDF: https://arxiv.org/pdf/2604.08626
• Project Page: https://allenai.github.io/WildDet3D/
• Github: https://github.com/allenai/WildDet3D
==================================
#3DObjectDetection #ComputerVision #DeepLearning #AI #Datasets
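"Accepting multiple prompt types" implies a dispatch layer in the unified interface; a hypothetical router (the prompt shapes are assumptions, not WildDet3D's API) might look like:

```python
def route_prompt(prompt):
    """Classify a detection prompt as text, 3-D point, or 3-D box so it
    can be sent to the matching encoder."""
    if isinstance(prompt, str):
        return "text"
    if isinstance(prompt, (tuple, list)) and len(prompt) == 6:
        return "box3d"    # (x, y, z, w, h, d)
    if isinstance(prompt, (tuple, list)) and len(prompt) == 3:
        return "point3d"  # (x, y, z)
    raise ValueError("unsupported prompt type")
```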
✨Structured Causal Video Reasoning via Multi-Objective Alignment
📝 Summary:
This paper introduces Structured Event Facts for explicit causal video reasoning, moving beyond unstructured methods. It uses a multi-objective reinforcement learning pipeline to balance training goals, leading to Factum-4B. This model achieves reliable, stronger performance on complex temporal video reasoning tasks.
🔹 Publication Date: Published on Apr 6
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.04415
• PDF: https://arxiv.org/pdf/2604.04415
==================================
#CausalAI #VideoReasoning #ReinforcementLearning #ComputerVision #AIResearch
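Balancing several training goals in a multi-objective RL pipeline is often done by scalarizing per-objective rewards into one signal; this generic weighted sum (the objective names are illustrative, not Factum-4B's actual recipe) shows the idea:

```python
def scalarize_rewards(rewards, weights):
    """Weighted-sum scalarization of named objectives into a single
    scalar RL reward; keys of both dicts must match."""
    assert rewards.keys() == weights.keys(), "objective mismatch"
    return sum(weights[k] * rewards[k] for k in rewards)
```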
✨3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis
📝 Summary:
3DTV is a feedforward network combining lightweight geometry and learning for real-time, robust sparse-view interpolation. It generates novel views efficiently without scene-specific optimization, making it practical for interactive applications.
🔹 Publication Date: Published on Apr 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.11211
• PDF: https://arxiv.org/pdf/2604.11211
• Project Page: https://stefanmschulz.github.io/3DTV_webpage/
• Github: https://github.com/StefanMSchulz/3DTV
==================================
#ViewSynthesis #DeepLearning #ComputerVision #NeuralNetworks #RealTimeAI
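For context, the degenerate baseline that geometry-aware feedforward interpolation improves on is a plain per-pixel blend of the two source views, which ghosts whenever content moves between views:

```python
def naive_blend(view_a, view_b, t):
    """Per-pixel linear interpolation between two views at t in [0, 1];
    no geometry is used, so misaligned content produces ghosting."""
    return [(1 - t) * a + t * b for a, b in zip(view_a, view_b)]
```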
✨ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video
📝 Summary:
ReconPhys is the first feedforward framework to jointly learn physical attribute estimation and 3D Gaussian Splatting reconstruction from a single video. It offers significantly faster inference and superior reconstruction quality for non-rigid objects compared to prior optimization-based methods.
🔹 Publication Date: Published on Apr 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.07882
• PDF: https://arxiv.org/pdf/2604.07882
• Project Page: https://chuanshuogushi.github.io/ReconPhys/
• Github: https://chuanshuogushi.github.io/ReconPhys/
==================================
#ComputerVision #3DReconstruction #GaussianSplatting #DeepLearning #AIResearch
✨VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
📝 Summary:
VEFX-Bench offers a large human-annotated video editing dataset and VEFX-Reward, a specialized model for quality assessment. This benchmark allows standardized comparison, showing current models struggle with instruction following and edit locality.
🔹 Publication Date: Published on Apr 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.16272
• PDF: https://arxiv.org/pdf/2604.16272
• Project Page: https://xiangbogaobarry.github.io/VEFX-Bench/
==================================
#VideoEditing #VFX #AI #ComputerVision #Benchmarks
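"Edit locality" can be quantified as the share of total pixel change that falls inside the intended edit region; this toy metric over flattened frames is an illustrative proxy, not VEFX-Bench's actual scoring:

```python
def edit_locality(before, after, mask):
    """Fraction of total absolute change inside the edit mask
    (1.0 = perfectly local edit)."""
    inside = sum(abs(a - b) for b, a, m in zip(before, after, mask) if m)
    total = sum(abs(a - b) for b, a in zip(before, after))
    return inside / total if total else 1.0
```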
✨NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results
📝 Summary:
This paper overviews the NTIRE 2026 Challenge on Video Saliency Prediction. Participants developed automatic saliency map prediction for videos using a novel 2,000-video dataset with crowdsourced fixations. Over 20 teams submitted, and all challenge data is now publicly available.
🔹 Publication Date: Published on Apr 16
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.14816
• PDF: https://arxiv.org/pdf/2604.14816
• Project Page: https://www.codabench.org/competitions/12842/
• Github: https://github.com/msu-video-group/NTIRE26_Saliency_Prediction
==================================
#VideoSaliency #ComputerVision #NTIRE #MachineLearning #SaliencyPrediction
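Saliency-map predictions in challenges like this are commonly scored with the linear correlation coefficient (CC) against the fixation density map; a plain-Python version over flattened maps:

```python
def saliency_cc(pred, gt):
    """Pearson correlation between a predicted saliency map and a
    ground-truth fixation density map (both flattened)."""
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gt) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gt))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_g = sum((g - mg) ** 2 for g in gt)
    return cov / (var_p * var_g) ** 0.5
```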
✨Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
📝 Summary:
This paper improves vision-language models for compositional reasoning by using concreteness-based negative sample selection and a novel margin-based loss. The resulting framework, Slipform, achieves state-of-the-art accuracy on compositional benchmarks and cross-modal retrieval.
🔹 Publication Date: Published on Apr 14
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.13313
• PDF: https://arxiv.org/pdf/2604.13313
==================================
#VisionLanguage #DeepLearning #AIResearch #ComputerVision #NLP
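A margin-based loss over mined negatives generally takes a hinge form: each negative's similarity must sit at least a margin below the positive pair's. This generic version (not Slipform's exact objective) illustrates it:

```python
def margin_contrastive_loss(sim_pos, sims_neg, margin=0.2):
    """Hinge loss: penalize each negative whose similarity comes within
    `margin` of the positive-pair similarity."""
    return sum(max(0.0, margin - (sim_pos - s)) for s in sims_neg)
```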
✨CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
📝 Summary:
CityRAG generates long-term, physically grounded video sequences that maintain environmental consistency and support complex navigation through real-world geography, using geo-registered data as context.
🔹 Publication Date: Published on Apr 21
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.19741
• PDF: https://arxiv.org/pdf/2604.19741
• Project Page: https://cityrag.github.io/
==================================
#VideoGeneration #GenerativeAI #SpatialAI #ComputerVision #UrbanSimulation
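"Geo-registered data as context" suggests a spatial retrieval step: fetch records near the current camera position to condition generation. A hypothetical tile lookup (the field names are assumptions, not CityRAG's schema):

```python
def nearby_tiles(position, tiles, radius):
    """Return geo-registered records within `radius` of the camera
    position (2-D Euclidean distance)."""
    px, py = position
    return [t for t in tiles
            if ((t["x"] - px) ** 2 + (t["y"] - py) ** 2) ** 0.5 <= radius]
```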
✨DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
📝 Summary:
DeVI enables physically plausible dexterous robot control by leveraging text-conditioned synthetic videos through a hybrid tracking reward that combines 3D and 2D tracking for improved hand-object interaction.
🔹 Publication Date: Published on Apr 22
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.20841
• PDF: https://arxiv.org/pdf/2604.20841
• Project Page: https://snuvclab.github.io/devi/
• Github: https://github.com/snuvclab/devi
==================================
#Robotics #AI #ComputerVision #HumanRobotInteraction #DeepLearning
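A hybrid tracking reward that mixes 3D and 2D tracking errors can be sketched as a weighted exp(-error) shaping term; this generic form is an assumption, not DeVI's published formula:

```python
import math

def hybrid_tracking_reward(err_3d, err_2d, w3d=0.5, w2d=0.5, scale=1.0):
    """Imitation reward in (0, 1]: 1.0 at zero error, decaying
    exponentially as 3-D keypoint or 2-D projected error grows."""
    return w3d * math.exp(-scale * err_3d) + w2d * math.exp(-scale * err_2d)
```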
✨3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
📝 Summary:
3D-VCD is a new inference-time framework that reduces hallucinations in 3D embodied agents. It constructs distorted 3D scene graphs and contrasts predictions to suppress ungrounded tokens. This improves reasoning on 3D benchmarks without retraining.
🔹 Publication Date: Published on Apr 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08645
• PDF: https://arxiv.org/pdf/2604.08645
• Project Page: https://plan-lab.github.io/projects/3d-vcd
==================================
#3DLLM #EmbodiedAI #HallucinationMitigation #ComputerVision #AIResearch
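Contrastive decoding generally adjusts next-token logits by the gap between the faithful input and the distorted one; this generic update (not necessarily 3D-VCD's exact formulation) shows how ungrounded tokens get suppressed:

```python
def contrastive_decode(logits_full, logits_distorted, alpha=1.0):
    """Boost tokens the intact 3-D scene supports more than the distorted
    scene does; tokens that score equally well on both inputs (i.e.
    ungrounded ones) fall behind grounded tokens."""
    return [lf + alpha * (lf - ld)
            for lf, ld in zip(logits_full, logits_distorted)]
```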