ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho

Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

📝 Summary:
This paper introduces process-driven image generation, an iterative method with interleaved textual and visual reasoning. It decomposes synthesis into planning, drafting, reflecting, and refining steps. Dense step-wise supervision ensures consistency and interpretability of intermediate states.
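
🔹 Illustrative Sketch:
A minimal sketch of the plan → draft → reflect → refine loop the summary describes, assuming callable `planner`, `generator`, and `critic` components; their real interfaces are defined in the paper, not here.

```python
# Hypothetical sketch of interleaved textual/visual reasoning: plan in text,
# draft an image, reflect in text, then refine, until the critic is satisfied.

def generate_with_process(prompt, planner, generator, critic, max_rounds=4):
    plan = planner(prompt)                        # textual plan of strokes/regions
    image = generator(prompt, plan)               # initial draft
    for _ in range(max_rounds):
        feedback = critic(prompt, plan, image)    # textual reflection on the draft
        if feedback.get("done"):                  # intermediate states stay inspectable
            break
        plan = planner(prompt, feedback=feedback)     # revise the plan
        image = generator(prompt, plan, init=image)   # refine from the current state
    return image, plan
```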

🔹 Publication Date: Published on Apr 8

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.04746
• PDF: https://arxiv.org/pdf/2604.04746

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ImageGeneration #GenerativeAI #ArtificialIntelligence #DeepLearning #ComputerVision

VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

📝 Summary:
VRAG-RL introduces a reinforcement learning framework to empower vision-language models for understanding visually rich information. It uses adaptive visual perception and query optimization to enhance retrieval and reasoning, overcoming limitations of current RAG methods.
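
🔹 Illustrative Sketch:
A minimal rollout sketch of the iterative perceive-retrieve-reason loop, assuming a simple action space (rewrite query / crop region / answer); the actual action space, retriever, and RL training are specified in the paper.

```python
# Hypothetical policy loop: alternate query optimization, adaptive visual
# perception (cropping page regions), and answering, collecting a trajectory
# that an RL trainer could later score.

def rollout(policy, retriever, question, max_steps=6):
    context = list(retriever(question, k=2))    # initial retrieved page images
    trajectory = []
    for _ in range(max_steps):
        action = policy(question, context)      # dict: {"type": ..., ...}
        trajectory.append(action)
        if action["type"] == "answer":
            return action["text"], trajectory   # terminal action
        if action["type"] == "rewrite":         # query optimization
            question = action["query"]
            context += list(retriever(question, k=2))
        elif action["type"] == "crop":          # adaptive visual perception
            context.append(action["region"])    # zoomed-in page region
    return None, trajectory
```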

🔹 Publication Date: Published on May 28, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2505.22019
• PDF: https://arxiv.org/pdf/2505.22019
• Github: https://github.com/Alibaba-NLP/VRAG

🔹 Models citing this paper:
https://huggingface.co/Qiuchen-Wang/Qwen2.5-VL-7B-VRAG

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#RAG #ReinforcementLearning #VisionLanguageModels #ComputerVision #AI

RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

📝 Summary:
RefineAnything is a multimodal diffusion model for region-specific image refinement. It fixes local detail collapse while strictly preserving backgrounds using a Focus-and-Refine strategy and boundary-aware loss. This provides a practical solution for high-precision local editing.
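
🔹 Illustrative Sketch:
One concrete reading of a boundary-aware loss, as a sketch: up-weight a thin band around the edit mask so the refined region blends into an untouched background. The band width and weights below are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def boundary_aware_loss(pred, target, mask, band=5, w_edge=4.0):
    """pred/target: (B,3,H,W) images; mask: (B,1,H,W) binary edit region."""
    kernel = torch.ones(1, 1, band, band, device=mask.device)
    hits = F.conv2d(mask, kernel, padding=band // 2)
    dilated = (hits > 0).float()              # mask grown by the band
    eroded = (hits >= band * band).float()    # mask shrunk by the band
    edge = dilated - eroded                   # thin band around the boundary
    # Background stays supervised (weight 1) so it is strictly preserved,
    # while the boundary band gets extra weight to avoid visible seams.
    weights = mask + w_edge * edge + (1 - dilated)
    return (weights * (pred - target).abs()).mean()
```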

🔹 Publication Date: Published on Apr 8

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.06870
• PDF: https://arxiv.org/pdf/2604.06870
• Project Page: https://limuloo.github.io/RefineAnything/
• Github: https://github.com/limuloo/RefineAnything

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#DiffusionModels #ImageEditing #ComputerVision #DeepLearning #GenerativeAI

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

📝 Summary:
CT-1 is a Vision-Language-Camera model that improves camera-controllable video generation. It uses a Diffusion Transformer and Wavelet Regularization Loss to accurately estimate camera trajectories, enabling precise video synthesis. This achieves 25.7% better accuracy than prior methods.
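
🔹 Illustrative Sketch:
A sketch of what a wavelet regularization on camera trajectories can look like: a one-level Haar transform splits each trajectory into a coarse path and high-frequency detail, and both bands are matched to the ground truth. CT-1's actual loss and wavelet choice may differ.

```python
import torch

def haar_1d(x):
    """One-level Haar transform. x: (B, T, D) trajectory with even T."""
    even, odd = x[:, 0::2], x[:, 1::2]
    return (even + odd) / 2 ** 0.5, (even - odd) / 2 ** 0.5  # approx, detail

def wavelet_reg_loss(pred_traj, gt_traj):
    pa, pd = haar_1d(pred_traj)   # coarse path + fine jitter of the prediction
    ga, gd = haar_1d(gt_traj)
    return (pa - ga).abs().mean() + (pd - gd).abs().mean()

loss = wavelet_reg_loss(torch.randn(2, 16, 6), torch.randn(2, 16, 6))
```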

🔹 Publication Date: Published on Apr 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.09201
• PDF: https://arxiv.org/pdf/2604.09201
• Project Page: https://gulucaptain.github.io/Camera-Transformer-1/
• Github: https://github.com/gulucaptain/Camera-Transformer-1

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#AI #VideoGeneration #ComputerVision #DiffusionModels #VisionLanguageModels

WildDet3D: Scaling Promptable 3D Detection in the Wild

📝 Summary:
WildDet3D is a unified architecture for open-world 3D object detection, accepting multiple prompt types and integrating geometric cues. It leverages WildDet3D-Data, the largest 3D dataset, to achieve state-of-the-art performance across benchmarks, with significant gains from incorporating depth information.
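
🔹 Illustrative Sketch:
A hypothetical interface sketch for promptable 3D detection with geometric cues: one model, several prompt types, depth fused at encoding time. Class and method names here are placeholders, not the released API.

```python
from dataclasses import dataclass

@dataclass
class Prompt:
    kind: str       # "text" | "box" | "point"
    value: object   # e.g. "chair", a 2D box, or a 3D click

def detect_3d(model, image, depth, prompts):
    feats = model.encode(image, depth)        # fuse RGB with geometric cues
    boxes = []
    for p in prompts:
        query = model.embed_prompt(p.kind, p.value)   # one query per prompt
        boxes.extend(model.decode(feats, query))      # 3D boxes for that prompt
    return boxes
```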

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08626
• PDF: https://arxiv.org/pdf/2604.08626
• Project Page: https://allenai.github.io/WildDet3D/
• Github: https://github.com/allenai/WildDet3D

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#3DObjectDetection #ComputerVision #DeepLearning #AI #Datasets

Structured Causal Video Reasoning via Multi-Objective Alignment

📝 Summary:
This paper introduces Structured Event Facts for explicit causal video reasoning, moving beyond unstructured methods. It uses a multi-objective reinforcement learning pipeline to balance training goals, leading to Factum-4B. This model achieves reliable, stronger performance on complex temporal video reasoning tasks.
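
🔹 Illustrative Sketch:
A minimal sketch of multi-objective reward balancing via weighted scalarization; the objective names and weights are illustrative, since the paper defines its own objectives and balancing scheme.

```python
def combined_reward(scores: dict, weights: dict) -> float:
    """Weighted scalarization of per-objective rewards in [0, 1]."""
    return sum(weights[k] * scores[k] for k in weights)

# Hypothetical objectives for structured causal reasoning:
scores = {"structure": 0.9, "causality": 0.7, "answer": 0.8}
weights = {"structure": 0.3, "causality": 0.4, "answer": 0.3}
print(combined_reward(scores, weights))   # ≈ 0.79
```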

🔹 Publication Date: Published on Apr 6

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.04415
• PDF: https://arxiv.org/pdf/2604.04415

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#CausalAI #VideoReasoning #ReinforcementLearning #ComputerVision #AIResearch

3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis

📝 Summary:
3DTV is a feedforward network combining lightweight geometry and learning for real-time, robust sparse-view interpolation. It generates novel views efficiently without scene-specific optimization, making it practical for interactive applications.
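
🔹 Illustrative Sketch:
A toy sketch of the feedforward idea: warp two source views toward the target pose (warping omitted here), then blend them with predicted per-pixel weights in a single pass, with no per-scene optimization. The blend network is a stand-in for 3DTV's actual architecture.

```python
import torch

class BlendNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(6, 32, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(32, 1, 3, padding=1), torch.nn.Sigmoid(),
        )

    def forward(self, warped_a, warped_b):
        w = self.net(torch.cat([warped_a, warped_b], dim=1))  # per-pixel weight
        return w * warped_a + (1 - w) * warped_b              # novel view

view = BlendNet()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```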

🔹 Publication Date: Published on Apr 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.11211
• PDF: https://arxiv.org/pdf/2604.11211
• Project Page: https://stefanmschulz.github.io/3DTV_webpage/
• Github: https://github.com/StefanMSchulz/3DTV

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ViewSynthesis #DeepLearning #ComputerVision #NeuralNetworks #RealTimeAI

ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video

📝 Summary:
ReconPhys is the first feedforward framework to jointly learn physical attribute estimation and 3D Gaussian Splatting reconstruction from a single video. It offers significantly faster inference and superior reconstruction quality for non-rigid objects compared to prior optimization-based methods.
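
🔹 Illustrative Sketch:
A shape-level sketch of a joint feedforward head: one video encoding is decoded into 3D Gaussian parameters and physical attributes in a single pass. The feature size, Gaussian count, and attribute choices are assumptions for illustration.

```python
import torch

class JointHead(torch.nn.Module):
    def __init__(self, feat_dim=256, n_gaussians=1024):
        super().__init__()
        self.n = n_gaussians
        # Per Gaussian: position(3) + scale(3) + rotation(4) + opacity(1) + color(3) = 14
        self.gaussians = torch.nn.Linear(feat_dim, n_gaussians * 14)
        self.physics = torch.nn.Linear(feat_dim, 2)   # e.g. stiffness, damping

    def forward(self, video_feat):                    # (B, feat_dim)
        g = self.gaussians(video_feat).view(-1, self.n, 14)
        return g, self.physics(video_feat)

gaussians, phys = JointHead()(torch.randn(1, 256))
```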

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.07882
• PDF: https://arxiv.org/pdf/2604.07882
• Project Page: https://chuanshuogushi.github.io/ReconPhys/

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ComputerVision #3DReconstruction #GaussianSplatting #DeepLearning #AIResearch

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

📝 Summary:
VEFX-Bench offers a large human-annotated video editing dataset and VEFX-Reward, a specialized model for quality assessment. The benchmark enables standardized comparison and shows that current models struggle with instruction following and edit locality.
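
🔹 Illustrative Sketch:
A sketch of how a learned reward model can rank candidate edits, in the spirit of VEFX-Reward; `reward_model` is a placeholder scoring callable, and the released model's interface may differ.

```python
def rank_edits(reward_model, instruction, source_video, candidates):
    """Score each edited clip against the instruction; best edit first."""
    scored = [(reward_model(instruction, source_video, c), c) for c in candidates]
    return sorted(scored, key=lambda s: s[0], reverse=True)
```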

🔹 Publication Date: Published on Apr 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.16272
• PDF: https://arxiv.org/pdf/2604.16272
• Project Page: https://xiangbogaobarry.github.io/VEFX-Bench/

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#VideoEditing #VFX #AI #ComputerVision #Benchmarks

NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results

📝 Summary:
This paper overviews the NTIRE 2026 Challenge on Video Saliency Prediction. Participants developed methods for automatic saliency-map prediction on videos, using a novel 2,000-video dataset with crowdsourced fixations. Over 20 teams submitted solutions, and all challenge data is now publicly available.
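
🔹 Illustrative Sketch:
For reference, one standard saliency metric: the linear correlation coefficient (CC) between predicted and ground-truth saliency maps. The challenge's exact metric suite is defined on the competition page.

```python
import numpy as np

def cc(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pearson correlation between two saliency maps of the same shape."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float((p * g).mean())

print(cc(np.random.rand(64, 64), np.random.rand(64, 64)))  # ~0 for unrelated maps
```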

🔹 Publication Date: Published on Apr 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.14816
• PDF: https://arxiv.org/pdf/2604.14816
• Project Page: https://www.codabench.org/competitions/12842/
• Github: https://github.com/msu-video-group/NTIRE26_Saliency_Prediction

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#VideoSaliency #ComputerVision #NTIRE #MachineLearning #SaliencyPrediction

Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

📝 Summary:
This paper improves vision-language models for compositional reasoning by using concreteness-based negative sample selection and a novel margin-based loss. The resulting framework, Slipform, achieves state-of-the-art accuracy on compositional benchmarks and cross-modal retrieval.
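
🔹 Illustrative Sketch:
A sketch of the two ingredients the summary names: picking the most concrete perturbed captions as hard negatives, and a margin loss that makes the positive caption beat each negative. The concreteness scorer is a placeholder, and the exact loss in the paper may differ.

```python
import torch
import torch.nn.functional as F

def mine_negatives(candidates, concreteness, k=4):
    """Keep the k most concrete perturbed captions as hard negatives."""
    return sorted(candidates, key=concreteness, reverse=True)[:k]

def margin_loss(img_emb, pos_emb, neg_embs, margin=0.2):
    """img_emb, pos_emb: (D,); neg_embs: (N, D); all L2-normalized."""
    pos_sim = img_emb @ pos_emb            # similarity to the true caption
    neg_sims = neg_embs @ img_emb          # similarities to mined negatives
    return F.relu(margin - pos_sim + neg_sims).mean()
```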

🔹 Publication Date: Published on Apr 14

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.13313
• PDF: https://arxiv.org/pdf/2604.13313

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#VisionLanguage #DeepLearning #AIResearch #ComputerVision #NLP

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

📝 Summary:
CityRAG generates long-term, physically grounded video sequences that maintain environmental consistency and support complex navigation through real-world geography, using geo-registered data as context.
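
🔹 Illustrative Sketch:
A sketch of the retrieval half of such a pipeline: given the camera's coordinates, fetch nearby geo-registered records to condition generation on. The haversine lookup is illustrative; CityRAG's actual retrieval and conditioning are described in the paper.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def retrieve_context(db, lat, lon, radius_km=0.2):
    """db: list of dicts with 'lat', 'lon', and geo-registered scene metadata."""
    return [r for r in db if haversine_km(lat, lon, r["lat"], r["lon"]) <= radius_km]
```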

🔹 Publication Date: Published on Apr 21

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.19741
• PDF: https://arxiv.org/pdf/2604.19741
• Project Page: https://cityrag.github.io/

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#VideoGeneration #GenerativeAI #SpatialAI #ComputerVision #UrbanSimulation

DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

📝 Summary:
DeVI enables physically plausible dexterous robot control by leveraging text-conditioned synthetic videos through a hybrid tracking reward that combines 3D and 2D tracking for improved hand-object interaction.
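
🔹 Illustrative Sketch:
A sketch of a hybrid tracking reward: penalize both 3D keypoint error and 2D reprojection error against the reference video, then map the result into (0, 1]. The weights and exponential shaping are assumptions for illustration.

```python
import numpy as np

def hybrid_tracking_reward(kp3d, ref3d, kp2d, ref2d, w3d=0.7, w2d=0.3):
    """kp3d/ref3d: (K, 3) metric keypoints; kp2d/ref2d: (K, 2) pixel keypoints."""
    e3d = np.linalg.norm(kp3d - ref3d, axis=-1).mean()   # 3D tracking error
    e2d = np.linalg.norm(kp2d - ref2d, axis=-1).mean()   # 2D tracking error
    return float(np.exp(-(w3d * e3d + w2d * e2d)))       # 1.0 = perfect tracking
```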

🔹 Publication Date: Published on Apr 22

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.20841
• PDF: https://arxiv.org/pdf/2604.20841
• Project Page: https://snuvclab.github.io/devi/
• Github: https://github.com/snuvclab/devi

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#Robotics #AI #ComputerVision #HumanRobotInteraction #DeepLearning

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

📝 Summary:
3D-VCD is a new inference-time framework that reduces hallucinations in 3D embodied agents. It constructs distorted 3D scene graphs and contrasts predictions to suppress ungrounded tokens. This improves reasoning on 3D benchmarks without retraining.
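
🔹 Illustrative Sketch:
The token-level contrast follows the standard contrastive-decoding recipe: amplify logits from the faithful 3D context and subtract logits obtained from the distorted scene graph. How 3D-VCD builds the distorted graphs is the paper's contribution; alpha here is an assumed tunable weight.

```python
import torch

def contrastive_logits(logits_clean, logits_distorted, alpha=1.0):
    """Suppress tokens the model would emit even without grounded 3D context."""
    return (1 + alpha) * logits_clean - alpha * logits_distorted

next_token = torch.argmax(
    contrastive_logits(torch.randn(32000), torch.randn(32000))
)
```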

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08645
• PDF: https://arxiv.org/pdf/2604.08645
• Project Page: https://plan-lab.github.io/projects/3d-vcd

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#3DLLM #EmbodiedAI #HallucinationMitigation #ComputerVision #AIResearch