ML Research Hub
32.5K subscribers
6K photos
385 videos
24 files
6.49K links
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
Visual-ERM: Reward Modeling for Visual Equivalence

📝 Summary:
Visual-ERM is a multimodal generative reward model providing fine-grained visual feedback for vision-to-code tasks. It significantly improves reinforcement learning performance for chart, table, and SVG parsing, demonstrating that fine-grained visual supervision is essential.

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13224
• PDF: https://arxiv.org/pdf/2603.13224
• Github: https://github.com/InternLM/Visual-ERM

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ReinforcementLearning #ComputerVision #GenerativeAI #AI #DataScience
SimRecon: SimReady Compositional Scene Reconstruction from Real Videos

📝 Summary:
SimRecon reconstructs cluttered scenes from real videos using a Perception-Generation-Simulation pipeline. It employs Active Viewpoint Optimization for visual fidelity and a Scene Graph Synthesizer for physical plausibility. This enables superior compositional scene representations for simulation.

🔹 Publication Date: Published on Mar 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.02133
• PDF: https://arxiv.org/pdf/2603.02133
• Project Page: https://xiac20.github.io/SimRecon/
• Github: https://github.com/xiac20/SimRecon

==================================


#SceneReconstruction #ComputerVision #AI #Simulation #3DReconstruction
Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

📝 Summary:
Cheers is a unified multimodal model that decouples visual details from semantic representations for efficient joint optimization of understanding and generation. It employs a vision tokenizer, LLM-based Transformer, and cascaded flow matching. Cheers achieves state-of-the-art performance with 4x...
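The summary mentions cascaded flow matching; the cascade itself isn't detailed here, but the core flow-matching objective it builds on can be sketched in a few lines. The single-stage setup, the `v_theta` interface, and the toy vectors below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def flow_matching_loss(x0, x1, v_theta, rng=None):
    """One flow-matching training step (illustrative, single stage):
    sample t ~ U(0, 1), build the linear interpolant
    x_t = (1 - t) * x0 + t * x1, and regress the model's velocity
    v_theta(x_t, t) onto the constant straight-line target x1 - x0."""
    rng = rng or np.random.default_rng()
    t = rng.uniform()
    xt = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    err = v_theta(xt, t) - target
    return float((err ** 2).mean())

# toy check: a noise/data pair with an oracle velocity model
rng = np.random.default_rng(0)
x0 = rng.normal(size=16)          # noise sample
x1 = rng.normal(size=16)          # data sample
oracle = lambda xt, t: x1 - x0    # knows the true straight-line velocity
loss = flow_matching_loss(x0, x1, oracle, rng=rng)
```

An oracle model drives the loss to exactly zero, which is a quick sanity check when wiring up a real network in place of `oracle`.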

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12793
• PDF: https://arxiv.org/pdf/2603.12793
• Project Page: https://huggingface.co/ai9stars/Cheers
• Github: https://github.com/AI9Stars/Cheers

🔹 Models citing this paper:
https://huggingface.co/ai9stars/Cheers

==================================


#MultimodalAI #LLM #ComputerVision #GenerativeAI #AIResearch
Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

📝 Summary:
This paper presents a novel text-motion retrieval method. It maps joint-angle motion features into Vision Transformer-compatible pseudo-images and uses an enhanced late interaction mechanism. This achieves superior performance and offers interpretable fine-grained text-motion alignments.
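The token-patch late interaction described above is in the spirit of ColBERT-style MaxSim scoring, where each text token matches its best patch and the per-token maxima are summed. A minimal sketch, assuming L2-normalized embeddings; the function name and toy data are hypothetical, not from the paper.

```python
import numpy as np

def late_interaction_score(text_tokens, motion_patches):
    """MaxSim late interaction: each text token picks its best-matching
    motion-image patch, and the per-token maxima are summed.
    Rows are assumed to be L2-normalized embeddings."""
    # similarity matrix: (num_text_tokens, num_patches)
    sim = text_tokens @ motion_patches.T
    # best patch per text token, summed over tokens
    return float(sim.max(axis=1).sum())

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# toy example: 3 text tokens vs 4 patch embeddings (dim 8)
rng = np.random.default_rng(0)
text = l2norm(rng.normal(size=(3, 8)))
patches = l2norm(rng.normal(size=(4, 8)))
score = late_interaction_score(text, patches)
```

Because the score decomposes per text token, the argmax patch for each token also yields the interpretable fine-grained alignments the summary mentions.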

🔹 Publication Date: Published on Mar 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.09930
• PDF: https://arxiv.org/pdf/2603.09930

==================================


#MotionRetrieval #DeepLearning #ComputerVision #AIResearch #NLP
SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation

📝 Summary:
SNCE is a novel training objective for large-codebook discrete image generators. It supervises models with a soft categorical distribution over neighboring tokens, based on embedding proximity, instead of hard one-hot targets. This approach significantly improves convergence speed and overall generation quality.
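The soft-target idea can be sketched concretely: replace the one-hot target with a softmax over negative embedding distances to the ground-truth token, then train with cross-entropy against that distribution. The temperature `tau` and toy codebook below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def soft_neighbor_targets(codebook, target_idx, tau=0.5):
    """Soft categorical target over a token codebook: probability mass
    concentrates on entries whose embeddings lie near the ground-truth
    token, via a softmax over negative squared distances (temperature tau)."""
    d2 = ((codebook - codebook[target_idx]) ** 2).sum(axis=1)
    logits = -d2 / tau
    p = np.exp(logits - logits.max())
    return p / p.sum()

def soft_cross_entropy(pred_logits, target_dist):
    """Cross-entropy against the soft target instead of a one-hot vector."""
    m = pred_logits.max()
    logp = pred_logits - (m + np.log(np.exp(pred_logits - m).sum()))
    return float(-(target_dist * logp).sum())

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # toy codebook of 16 token embeddings
p = soft_neighbor_targets(codebook, target_idx=3)
```

The ground-truth token still receives the largest probability (its distance to itself is zero), so the objective reduces toward one-hot supervision as `tau` shrinks.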

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15150
• PDF: https://arxiv.org/pdf/2603.15150

==================================


#ImageGeneration #DeepLearning #ComputerVision #GeometryAware #AIResearch
Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods

📝 Summary:
STALL is a training-free, model-agnostic detector for generated videos. It jointly models spatial and temporal evidence from real-data statistics within a probabilistic framework. STALL consistently outperforms prior image and video-based baselines, improving reliable detection.
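The paper's exact statistics aren't given in this summary, but the general shape of "likelihood under real-data statistics" can be sketched with diagonal Gaussians fitted to spatial and temporal features. The diagonal-Gaussian form, the feature dimensions, and the joint sum are assumptions for illustration, not STALL's actual model.

```python
import numpy as np

def fit_gaussian(feats):
    """Fit per-dimension Gaussian statistics on features from real videos."""
    mu = feats.mean(axis=0)
    var = feats.var(axis=0) + 1e-6   # small floor for numerical stability
    return mu, var

def log_likelihood(x, mu, var):
    """Diagonal-Gaussian log-likelihood of a feature vector."""
    return float(-0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum())

def joint_score(spatial, temporal, spatial_stats, temporal_stats):
    """Joint spatial-temporal score under real-data statistics;
    low values flag a video as likely generated."""
    return (log_likelihood(spatial, *spatial_stats)
            + log_likelihood(temporal, *temporal_stats))

# fit on toy "real" features, then compare a typical vs. an outlier sample
rng = np.random.default_rng(1)
real_spatial = rng.normal(size=(500, 8))
real_temporal = rng.normal(size=(500, 4))
ss, ts = fit_gaussian(real_spatial), fit_gaussian(real_temporal)
typical = joint_score(np.zeros(8), np.zeros(4), ss, ts)
outlier = joint_score(np.full(8, 6.0), np.full(4, 6.0), ss, ts)
```

Since no detector is trained, the method only needs statistics of real data, which is what makes the approach training-free and model-agnostic.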

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15026
• PDF: https://arxiv.org/pdf/2603.15026
• Project Page: https://omerbenhayun.github.io/stall-video/
• Github: https://github.com/OmerBenHayun/stall-video

==================================


#Deepfakes #VideoDetection #ComputerVision #AI #DigitalForensics
GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering

📝 Summary:
GlyphPrinter improves visual text rendering by addressing glyph accuracy. It introduces Region-Grouped DPO (R-GDPO) with region-level preferences from the GlyphCorrector dataset, significantly enhancing glyph precision and outperforming existing methods.
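R-GDPO builds on the standard DPO objective; a minimal sketch follows, where the region-level averaging is a hypothetical reading of "region-level preferences" rather than the paper's exact formulation.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (policy log-ratio of
    the preferred sample minus that of the rejected sample))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def region_grouped_dpo(region_pairs, beta=0.1):
    """Hypothetical region-level variant: average the DPO loss over
    per-region preference pairs (e.g. one pair per text region)
    instead of a single whole-image pair."""
    losses = [dpo_loss(*p, beta=beta) for p in region_pairs]
    return sum(losses) / len(losses)
```

With a zero margin the loss sits at log 2, and it decreases as the preferred region's log-ratio pulls ahead of the rejected one's, which is the gradient signal that sharpens per-region glyph accuracy.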

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15616
• PDF: https://arxiv.org/pdf/2603.15616
• Project Page: https://henghuiding.com/GlyphPrinter/
• Github: https://github.com/FudanCVL/GlyphPrinter

==================================


#GlyphRendering #DeepLearning #ComputerVision #AIResearch #TextRendering
Learning Latent Proxies for Controllable Single-Image Relighting

📝 Summary:
Single-image relighting is challenging due to unobserved geometry and materials. LightCtrl introduces a diffusion model guided by sparse, physically meaningful cues from a latent proxy encoder and lighting-aware masks. This enables photometrically faithful relighting with accurate control, outperforming prior approaches.

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15555
• PDF: https://arxiv.org/pdf/2603.15555

==================================


#ImageRelighting #DiffusionModels #ComputerVision #DeepLearning #AIResearch
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

📝 Summary:
IOMM is a data-efficient framework for UMM visual generation. It pre-trains with image-only data then fine-tunes with mixed data, achieving SOTA performance while significantly reducing computational costs.
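The image-only pre-training stage relies on masked modeling, which can be sketched in a few lines; the zero mask token, the mask ratio, and the `predict_fn` interface below are illustrative assumptions rather than IOMM's actual architecture.

```python
import numpy as np

def masked_reconstruction_loss(patches, predict_fn, mask_ratio=0.75, rng=None):
    """One masked-modeling step: hide a random subset of patch vectors,
    have predict_fn reconstruct all patches from the visible ones, and
    score MSE only on the masked positions."""
    rng = rng or np.random.default_rng()
    n = patches.shape[0]
    n_mask = max(1, int(mask_ratio * n))
    masked = rng.choice(n, size=n_mask, replace=False)
    visible = patches.copy()
    visible[masked] = 0.0            # stand-in mask token: zeros
    pred = predict_fn(visible)
    err = pred[masked] - patches[masked]
    return float((err ** 2).mean())

rng = np.random.default_rng(0)
patches = rng.normal(size=(32, 8))   # toy image as 32 patch vectors
oracle_loss = masked_reconstruction_loss(patches, lambda v: patches, rng=rng)
```

Because the target is the image itself, this objective needs no captions, which is what makes the image-only pre-training stage cheap relative to paired multimodal data.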

🔹 Publication Date: Published on Mar 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.16139
• PDF: https://arxiv.org/pdf/2603.16139
• Github: https://github.com/LINs-lab/IOMM

==================================


#UMMVisualGeneration #MaskedModeling #EfficientAI #ComputerVision #GenerativeAI
WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

📝 Summary:
Waypoint Diffusion Transformers (WiT) address trajectory conflicts in pixel-space flow matching using semantic waypoints from pre-trained vision models. WiT disentangles generation paths into segments, accelerating training convergence. It outperforms pixel-space baselines and speeds up JiT training.

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15132
• PDF: https://arxiv.org/pdf/2603.15132
• Project Page: https://hainuo-wang.github.io/WiT/
• Github: https://github.com/hainuo-wang/WiT

==================================


#DiffusionModels #Transformers #ComputerVision #DeepLearning #AI
Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

📝 Summary:
VLA models struggle to integrate visual detail for action generation. DeepVision-VLA enhances visual representations via multi-level feature injection and action-guided pruning. This significantly boosts performance on robotic tasks.

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15618
• PDF: https://arxiv.org/pdf/2603.15618
• Project Page: https://deepvision-vla.github.io/

==================================


#VLAModels #ComputerVision #Robotics #DeepLearning #FoundationModels
Video-CoE: Reinforcing Video Event Prediction via Chain of Events

📝 Summary:
Video-CoE introduces a Chain of Events (CoE) paradigm to improve video event prediction. It addresses MLLM limitations in logical reasoning and visual utilization by constructing temporal event chains and using enhanced training. CoE achieves state-of-the-art performance on VEP benchmarks.

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.14935
• PDF: https://arxiv.org/pdf/2603.14935

==================================


#VideoEventPrediction #ChainOfEvents #MLLM #ComputerVision #AI
Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass

📝 Summary:
CHROMM is a unified framework that jointly reconstructs cameras, scene point clouds, and human meshes from multi-person multi-view videos. It integrates strong priors, handles scale discrepancies, and uses multi-view fusion for faster, more robust human-scene reconstruction.

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12789
• PDF: https://arxiv.org/pdf/2603.12789
• Project Page: https://nstar1125.github.io/chromm

==================================


#3DReconstruction #ComputerVision #HumanSceneReconstruction #MultiViewVideo #AIResearch
V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

📝 Summary:
V-JEPA 2.1 is a self-supervised model learning dense visual representations for images and videos. It combines dense predictive loss, deep self-supervision, multi-modal tokenizers, and scaling to achieve state-of-the-art performance across various benchmarks, significantly advancing visual understanding.

🔹 Publication Date: Published on Mar 15

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.14482
• PDF: https://arxiv.org/pdf/2603.14482
• Project Page: https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
• Github: https://github.com/facebookresearch/vjepa2

==================================


#SelfSupervisedLearning #ComputerVision #DeepLearning #AI #VideoUnderstanding
Prompt-Free Universal Region Proposal Network

📝 Summary:
PF-RPN is a novel network that identifies potential objects without needing external prompts, improving flexibility. It uses Sparse Image-Aware Adapters and Cascade Self-Prompting to localize objects, validated across 19 datasets. This method works across diverse domains with limited data.

🔹 Publication Date: Published on Mar 18

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.17554
• PDF: https://arxiv.org/pdf/2603.17554
• Github: https://github.com/tangqh03/PF-RPN

🔹 Models citing this paper:
https://huggingface.co/tangqh/PF-RPN

==================================


#ObjectDetection #ComputerVision #DeepLearning #RPN #PromptFreeAI
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

📝 Summary:
EffectErase is a new video object removal method that effectively erases dynamic objects and their visual effects. It introduces VOR, a large dataset for training, and uses reciprocal learning with task-aware guidance for high-quality results.

🔹 Publication Date: Published on Mar 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.19224
• PDF: https://arxiv.org/pdf/2603.19224
• Project Page: https://henghuiding.com/EffectErase/
• Github: https://github.com/FudanCVL/EffectErase

==================================


#VideoEditing #ComputerVision #ObjectRemoval #DeepLearning #AI
VID-AD: A Dataset for Image-Level Logical Anomaly Detection under Vision-Induced Distraction

📝 Summary:
VID-AD is a dataset for logical anomaly detection in industrial inspection, specifically addressing challenges from visual distractions. A new language-based framework is also proposed, which uses text descriptions and contrastive learning to capture logical attributes.
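One plausible reading of "text descriptions + contrastive learning" is a CLIP-style check that scores an image embedding against descriptions of valid logical configurations; the scoring rule and toy embeddings below are assumptions for illustration, not the paper's framework.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def logical_anomaly_score(image_emb, normal_text_embs):
    """CLIP-style logical check: 1 minus the best cosine similarity
    between the image embedding and text descriptions of valid
    configurations. Higher scores suggest a logical anomaly."""
    sims = l2norm(normal_text_embs) @ l2norm(image_emb)
    return float(1.0 - sims.max())

rng = np.random.default_rng(0)
texts = l2norm(rng.normal(size=(5, 32)))            # "normal" descriptions
normal_img = texts[2] + 0.01 * rng.normal(size=32)  # matches one description
odd_img = rng.normal(size=32)                       # unrelated content
```

An image close to any valid description scores near zero, while one matching no description scores high, which is how text can encode the logical attributes that purely visual detectors miss under distraction.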

🔹 Publication Date: Published on Mar 14

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13964
• PDF: https://arxiv.org/pdf/2603.13964
• Github: https://github.com/nkthiroto/VID-AD

==================================


#AnomalyDetection #IndustrialInspection #ComputerVision #MachineLearning #Datasets