ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
A Mixed Diet Makes DINO An Omnivorous Vision Encoder

📝 Summary:
The Omnivorous Vision Encoder learns modality-agnostic features by aligning multi-modal scene inputs and distilling semantics from a frozen teacher model. This resolves poor cross-modal alignment in existing encoders, yielding consistent, powerful embeddings for various modalities.
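
💡 Illustrative sketch (not the paper's code): the distillation idea is that a student embedding for each modality is pulled toward a frozen teacher's embedding of the same scene, so all modalities land in one feature space. The shapes and the cosine-alignment loss here are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def distillation_loss(student_feats, teacher_feats):
    """Cosine-alignment loss: pull per-modality student embeddings
    toward the frozen teacher's embedding of the same scene."""
    s = l2_normalize(student_feats)
    t = l2_normalize(teacher_feats)  # teacher is frozen: no gradient flows here
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 256))            # frozen teacher embeddings (RGB)
rgb_student = teacher + 0.01 * rng.normal(size=(4, 256))   # nearly aligned modality
depth_student = rng.normal(size=(4, 256))      # unaligned modality at initialization

# An aligned modality has near-zero loss; an unaligned one does not.
print(distillation_loss(rgb_student, teacher) < distillation_loss(depth_student, teacher))
```

Training would minimize this loss per modality, which is what yields the modality-agnostic embeddings the summary describes.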

🔹 Publication Date: Published on Feb 27

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.24181
• PDF: https://arxiv.org/pdf/2602.24181

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#MultimodalAI #ComputerVision #DeepLearning #SelfSupervisedLearning #AIResearch
HyPER-GAN: Hybrid Patch-Based Image-to-Image Translation for Real-Time Photorealism Enhancement

📝 Summary:
HyPER-GAN is a lightweight U-Net based model for real-time photorealism enhancement. Its hybrid training strategy, using real-world patches, improves visual realism, semantic consistency, and inference speed over state-of-the-art methods.

🔹 Publication Date: Published on Mar 11

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.10604
• PDF: https://arxiv.org/pdf/2603.10604
• Github: https://github.com/stefanos50/HyPER-GAN

==================================

#GAN #ComputerVision #DeepLearning #ImageProcessing #Photorealism
Visual-ERM: Reward Modeling for Visual Equivalence

📝 Summary:
Visual-ERM is a multimodal generative reward model providing fine-grained visual feedback for vision-to-code tasks. It significantly improves reinforcement learning performance for chart, table, and SVG parsing, demonstrating that fine-grained visual supervision is essential.
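
💡 Illustrative sketch (not the paper's code): the key idea of fine-grained visual feedback is to reward partial correctness of the rendered output rather than a single pass/fail signal. The pixel-match reward below is a hypothetical stand-in for the model's learned reward.

```python
import numpy as np

def finegrained_visual_reward(pred_render, ref_render):
    """Hypothetical fine-grained reward: fraction of matching cells between
    the rendering of generated code and the reference image, giving dense
    credit instead of a binary success signal."""
    return float(np.mean(pred_render == ref_render))

ref = np.zeros((8, 8), dtype=int)
ref[:4] = 1                          # toy reference render (e.g. a chart region)
good = ref.copy()
good[0, 0] ^= 1                      # one cell wrong -> reward near 1
bad = np.zeros((8, 8), dtype=int)    # half the cells wrong -> reward 0.5
print(finegrained_visual_reward(good, ref), finegrained_visual_reward(bad, ref))
```

A dense reward like this gives the RL policy a gradient toward "almost right" code, which a binary reward cannot.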

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13224
• PDF: https://arxiv.org/pdf/2603.13224
• Github: https://github.com/InternLM/Visual-ERM

==================================

#ReinforcementLearning #ComputerVision #GenerativeAI #AI #DataScience
SimRecon: SimReady Compositional Scene Reconstruction from Real Videos

📝 Summary:
SimRecon reconstructs cluttered scenes from real videos using a Perception-Generation-Simulation pipeline. It employs Active Viewpoint Optimization for visual fidelity and a Scene Graph Synthesizer for physical plausibility. This enables superior compositional scene representations for simulation.

🔹 Publication Date: Published on Mar 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.02133
• PDF: https://arxiv.org/pdf/2603.02133
• Project Page: https://xiac20.github.io/SimRecon/
• Github: https://github.com/xiac20/SimRecon

==================================

#SceneReconstruction #ComputerVision #AI #Simulation #3DReconstruction
Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

📝 Summary:
Cheers is a unified multimodal model that decouples visual details from semantic representations for efficient joint optimization of understanding and generation. It employs a vision tokenizer, LLM-based Transformer, and cascaded flow matching. Cheers achieves state-of-the-art performance with 4x...

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12793
• PDF: https://arxiv.org/pdf/2603.12793
• Project Page: https://huggingface.co/ai9stars/Cheers
• Github: https://github.com/AI9Stars/Cheers

🔹 Models citing this paper:
https://huggingface.co/ai9stars/Cheers

==================================

#MultimodalAI #LLM #ComputerVision #GenerativeAI #AIResearch
Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

📝 Summary:
This paper presents a novel text-motion retrieval method. It maps joint-angle motion features into Vision Transformer-compatible pseudo-images and uses an enhanced late interaction mechanism. This achieves superior performance and offers interpretable fine-grained text-motion alignments.
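
💡 Illustrative sketch (not the paper's code): late interaction in the ColBERT style scores a query by letting each text token pick its best-matching patch of the motion pseudo-image and summing those maxima. The dimensions below are assumptions.

```python
import numpy as np

def late_interaction_score(text_tokens, patch_embs):
    """ColBERT-style MaxSim: each text token attends to its best-matching
    motion-image patch; per-token maxima are summed into one score."""
    t = text_tokens / np.linalg.norm(text_tokens, axis=-1, keepdims=True)
    p = patch_embs / np.linalg.norm(patch_embs, axis=-1, keepdims=True)
    sim = t @ p.T                        # (n_tokens, n_patches) cosine similarities
    return float(sim.max(axis=1).sum())  # best patch per token, summed

rng = np.random.default_rng(1)
patches = rng.normal(size=(49, 64))                            # 7x7 pseudo-image patches
matching_text = patches[:5] + 0.05 * rng.normal(size=(5, 64))  # tokens close to some patches
random_text = rng.normal(size=(5, 64))                         # unrelated tokens

print(late_interaction_score(matching_text, patches) > late_interaction_score(random_text, patches))
```

The per-token max also explains the interpretability claim: each text token can be traced to the specific patch (i.e., joint-angle region) it matched.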

🔹 Publication Date: Published on Mar 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.09930
• PDF: https://arxiv.org/pdf/2603.09930

==================================

#MotionRetrieval #DeepLearning #ComputerVision #AIResearch #NLP
SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation

📝 Summary:
SNCE is a novel training objective for large-codebook discrete image generators. It supervises models with a soft categorical distribution over neighboring tokens, based on embedding proximity, instead of hard one-hot targets. This approach significantly improves convergence speed and overall generation quality.
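
💡 Illustrative sketch (not the paper's code): the geometry-aware target can be built by softmaxing negative embedding distances to the ground-truth codebook entry, so nearby tokens receive probability mass instead of zero. The temperature and codebook size are assumptions.

```python
import numpy as np

def soft_neighbor_targets(codebook, target_idx, tau=0.1):
    """Geometry-aware soft target: softmax over codebook entries weighted by
    embedding proximity to the ground-truth token, replacing one-hot labels."""
    d = np.linalg.norm(codebook - codebook[target_idx], axis=-1)  # distances to target
    logits = -d / tau
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(2)
codebook = rng.normal(size=(16, 8))   # toy 16-entry codebook
t = soft_neighbor_targets(codebook, target_idx=3)
print(np.argmax(t) == 3, round(float(t.sum()), 6))  # peaks on the true token, sums to 1
```

The generator is then trained with cross-entropy against this soft distribution, so predicting a geometric neighbor of the true token is penalized less than predicting a distant one.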

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15150
• PDF: https://arxiv.org/pdf/2603.15150

==================================

#ImageGeneration #DeepLearning #ComputerVision #GeometryAware #AIResearch
Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods

📝 Summary:
STALL is a training-free, model-agnostic detector for generated videos. It jointly models spatial and temporal evidence from real-data statistics within a probabilistic framework. STALL consistently outperforms prior image and video-based baselines, improving reliable detection.
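
💡 Illustrative sketch (not the paper's code): the training-free principle is to fit simple statistics on real data once, then score a clip by a spatial (per-frame) likelihood plus a temporal (frame-difference) likelihood. The diagonal-Gaussian model and feature shapes below are assumptions.

```python
import numpy as np

def gaussian_loglik(x, mu, var, eps=1e-6):
    """Diagonal-Gaussian log-likelihood per row of x."""
    return -0.5 * np.sum(np.log(2 * np.pi * (var + eps)) + (x - mu) ** 2 / (var + eps), axis=-1)

def st_score(video_feats, real_mu, real_var, diff_mu, diff_var):
    """Combine per-frame (spatial) likelihood under real-data statistics with
    frame-difference (temporal) likelihood; low scores flag generated video."""
    spatial = gaussian_loglik(video_feats, real_mu, real_var).mean()
    diffs = np.diff(video_feats, axis=0)
    temporal = gaussian_loglik(diffs, diff_mu, diff_var).mean()
    return float(spatial + temporal)

rng = np.random.default_rng(4)
real_videos = rng.normal(size=(100, 16, 32))      # real clips: clips x frames x feat-dim
mu, var = real_videos.mean((0, 1)), real_videos.var((0, 1))
d = np.diff(real_videos, axis=1).reshape(-1, 32)
dmu, dvar = d.mean(0), d.var(0)

real_clip = rng.normal(size=(16, 32))             # in-distribution clip
fake_clip = 3.0 * rng.normal(size=(16, 32))       # off-distribution stand-in for generated video
print(st_score(real_clip, mu, var, dmu, dvar) > st_score(fake_clip, mu, var, dmu, dvar))
```

No detector is trained: only statistics of real data are estimated, which is what makes the approach model-agnostic across video generators.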

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15026
• PDF: https://arxiv.org/pdf/2603.15026
• Project Page: https://omerbenhayun.github.io/stall-video/
• Github: https://github.com/OmerBenHayun/stall-video

==================================

#Deepfakes #VideoDetection #ComputerVision #AI #DigitalForensics
GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering

📝 Summary:
GlyphPrinter improves visual text rendering by addressing glyph accuracy. It introduces Region-Grouped DPO (R-GDPO) with region-level preferences from the GlyphCorrector dataset, significantly enhancing glyph precision and outperforming existing methods.
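
💡 Illustrative sketch (not the paper's code): a region-grouped DPO objective computes the standard DPO margin per text region (e.g. per rendered word) and averages the losses, rather than scoring the whole image once. The per-region log-probabilities below are made-up numbers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def region_grouped_dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Sketch of a region-grouped DPO objective: one preference margin per
    region between the preferred (glyph-correct) and rejected rendering,
    averaged over regions instead of pooled over the whole image."""
    margins = beta * ((logp_w - ref_w) - (logp_l - ref_l))  # one margin per region
    return float(np.mean(-np.log(sigmoid(margins))))

# Per-region log-probs for preferred vs rejected renderings under the policy,
# with a shared reference model.
logp_w = np.array([-1.0, -0.8, -1.2])
logp_l = np.array([-1.5, -2.0, -1.4])
ref = np.array([-1.2, -1.1, -1.3])
loss = region_grouped_dpo_loss(logp_w, logp_l, ref, ref)
print(loss)
```

Grouping by region means a single wrong glyph cannot hide behind an otherwise good image: its region contributes its own preference term.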

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15616
• PDF: https://arxiv.org/pdf/2603.15616
• Project Page: https://henghuiding.com/GlyphPrinter/
• Github: https://github.com/FudanCVL/GlyphPrinter

==================================

#GlyphRendering #DeepLearning #ComputerVision #AIResearch #TextRendering
Learning Latent Proxies for Controllable Single-Image Relighting

📝 Summary:
Single-image relighting is challenging due to unobserved geometry and materials. LightCtrl introduces a diffusion model guided by sparse, physically meaningful cues from a latent proxy encoder and lighting-aware masks. This enables photometrically faithful relighting with accurate control, outperforming prior approaches.

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15555
• PDF: https://arxiv.org/pdf/2603.15555

==================================

#ImageRelighting #DiffusionModels #ComputerVision #DeepLearning #AIResearch
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

📝 Summary:
IOMM is a data-efficient framework for UMM visual generation. It pre-trains on image-only data, then fine-tunes on mixed data, achieving SOTA performance while significantly reducing computational costs.
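
💡 Illustrative sketch (not the paper's code): masked-modeling pre-training hides a large fraction of image patches and computes the reconstruction loss only at masked positions, MAE-style. The mask ratio, patch grid, and zero-predictor below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_mask(n_patches, mask_ratio=0.75):
    """Sample a binary mask over image patches; only masked positions
    contribute to the reconstruction loss."""
    n_mask = int(n_patches * mask_ratio)
    idx = rng.permutation(n_patches)
    mask = np.zeros(n_patches, dtype=bool)
    mask[idx[:n_mask]] = True
    return mask

patches = rng.normal(size=(196, 32))   # 14x14 patch tokens of one image
mask = random_mask(len(patches))
pred = np.zeros_like(patches)          # stand-in for the decoder's output
loss = float(np.mean((pred[mask] - patches[mask]) ** 2))  # loss on masked patches only
print(mask.sum(), loss > 0.0)
```

Because the objective needs no text, this stage runs on image-only data, which is the source of the data efficiency the summary claims.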

🔹 Publication Date: Published on Mar 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.16139
• PDF: https://arxiv.org/pdf/2603.16139
• Github: https://github.com/LINs-lab/IOMM

==================================

#UMMVisualGeneration #MaskedModeling #EfficientAI #ComputerVision #GenerativeAI
WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

📝 Summary:
Waypoint Diffusion Transformers (WiT) address trajectory conflicts in pixel-space flow matching using semantic waypoints from pre-trained vision models. WiT disentangles generation paths into segments, accelerating training convergence. It outperforms pixel-space baselines and speeds up JiT training.
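
💡 Illustrative sketch (not the paper's code): splitting the generation path at a waypoint replaces the single straight noise-to-image interpolation with two segments, noise → waypoint → image. The piecewise-linear path and equal split at t = 0.5 below are simplifying assumptions.

```python
import numpy as np

def waypoint_path(x0, w, x1, t):
    """Piecewise-linear generation path through a semantic waypoint w:
    the straight flow-matching interpolant from x0 to x1 is split into
    two segments, x0 -> w on t in [0, 0.5] and w -> x1 on t in (0.5, 1]."""
    t = np.asarray(t)
    first = x0 + (2 * t)[..., None] * (w - x0)
    second = w + (2 * t - 1)[..., None] * (x1 - w)
    return np.where(t[..., None] <= 0.5, first, second)

x0 = np.zeros(2)              # noise sample
w = np.array([1.0, 0.0])      # semantic waypoint from a pre-trained vision model
x1 = np.array([1.0, 1.0])     # data sample
print(waypoint_path(x0, w, x1, np.array([0.0, 0.5, 1.0])))
```

Routing different samples through different waypoints is what lets segments avoid crossing, which is the trajectory-conflict problem the summary refers to.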

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15132
• PDF: https://arxiv.org/pdf/2603.15132
• Project Page: https://hainuo-wang.github.io/WiT/
• Github: https://github.com/hainuo-wang/WiT

==================================

#DiffusionModels #Transformers #ComputerVision #DeepLearning #AI
Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

📝 Summary:
VLA models struggle to integrate visual detail for action generation. DeepVision-VLA enhances visual representations via multi-level feature injection and action-guided pruning. This significantly boosts performance on robotic tasks.

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15618
• PDF: https://arxiv.org/pdf/2603.15618
• Project Page: https://deepvision-vla.github.io/

==================================

#VLAModels #ComputerVision #Robotics #DeepLearning #FoundationModels
Video-CoE: Reinforcing Video Event Prediction via Chain of Events

📝 Summary:
Video-CoE introduces a Chain of Events (CoE) paradigm to improve video event prediction. It addresses MLLM limitations in logical reasoning and visual utilization by constructing temporal event chains and using enhanced training. CoE achieves state-of-the-art performance on VEP benchmarks.

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.14935
• PDF: https://arxiv.org/pdf/2603.14935

==================================

#VideoEventPrediction #ChainOfEvents #MLLM #ComputerVision #AI
Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass

📝 Summary:
CHROMM is a unified framework that jointly reconstructs cameras, scene point clouds, and human meshes from multi-person multi-view videos. It integrates strong priors, handles scale discrepancies, and uses multi-view fusion for faster, more robust human-scene reconstruction.

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12789
• PDF: https://arxiv.org/pdf/2603.12789
• Project Page: https://nstar1125.github.io/chromm

==================================

#3DReconstruction #ComputerVision #HumanSceneReconstruction #MultiViewVideo #AIResearch
V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

📝 Summary:
V-JEPA 2.1 is a self-supervised model learning dense visual representations for images and videos. It combines dense predictive loss, deep self-supervision, multi-modal tokenizers, and scaling to achieve state-of-the-art performance across various benchmarks, significantly advancing visual understanding.
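
💡 Illustrative sketch (not the paper's code): a dense predictive objective regresses target features at every masked spatio-temporal patch location, rather than one loss per clip. The tensor shapes, mask ratio, and MSE loss below are assumptions.

```python
import numpy as np

def dense_predictive_loss(pred_dense, target_dense, mask):
    """Dense predictive objective: mean squared error between predicted and
    target patch features, computed only at masked spatio-temporal locations."""
    diff = (pred_dense - target_dense) ** 2
    return float(diff[mask].mean())

rng = np.random.default_rng(5)
target = rng.normal(size=(8, 196, 64))     # frames x patches x feature-dim
pred = target + 0.1 * rng.normal(size=target.shape)  # stand-in predictor output
mask = rng.random((8, 196)) < 0.5          # predict only at masked locations
print(dense_predictive_loss(pred, target, mask))
```

Supervising every patch location is what produces the dense features the title refers to, as opposed to a single clip-level embedding.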

🔹 Publication Date: Published on Mar 15

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.14482
• PDF: https://arxiv.org/pdf/2603.14482
• Project Page: https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
• Github: https://github.com/facebookresearch/vjepa2

==================================

#SelfSupervisedLearning #ComputerVision #DeepLearning #AI #VideoUnderstanding