ML Research Hub
32.5K subscribers
6K photos
385 videos
24 files
6.49K links
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation

📝 Summary:
SNCE is a novel training objective for large-codebook discrete image generators. It supervises models with a soft categorical distribution over neighboring tokens, based on embedding proximity, instead of hard one-hot targets. This approach significantly improves convergence speed and overall generation quality.
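
The soft-target construction can be sketched in a few lines; the toy codebook, squared-distance metric, and temperature `tau` below are illustrative assumptions, not the paper's actual choices:

```python
import numpy as np

def soft_targets(codebook: np.ndarray, target_idx: int, tau: float = 0.1) -> np.ndarray:
    """Soft categorical target over codebook entries, weighted by
    embedding proximity to the ground-truth token (a sketch, not SNCE's
    exact formulation)."""
    # Squared Euclidean distance from the target embedding to every entry.
    d2 = np.sum((codebook - codebook[target_idx]) ** 2, axis=1)
    logits = -d2 / tau          # closer entries get higher weight
    logits -= logits.max()      # numerical stability
    p = np.exp(logits)
    return p / p.sum()          # a distribution peaked at target_idx

# Toy 4-entry codebook in a 2-D embedding space.
cb = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [2.0, 2.0]])
t = soft_targets(cb, target_idx=0)
```

Closer codebook entries receive higher target probability, so near-miss predictions are penalized less than distant ones.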

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15150
• PDF: https://arxiv.org/pdf/2603.15150

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#ImageGeneration #DeepLearning #ComputerVision #GeometryAware #AIResearch
Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods

📝 Summary:
STALL is a training-free, model-agnostic detector for generated videos. It jointly models spatial and temporal evidence from real-data statistics within a probabilistic framework. STALL consistently outperforms prior image- and video-based baselines, improving detection reliability.
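
A minimal sketch of the spatial-temporal likelihood idea, assuming diagonal Gaussians fitted to features of real videos (the feature extractors and the Gaussian form are placeholders here, not STALL's actual statistics):

```python
import numpy as np

def fit_gaussian(feats):
    """Diagonal Gaussian fitted to per-frame features of real videos."""
    return feats.mean(axis=0), feats.var(axis=0) + 1e-6

def log_likelihood(x, mean, var):
    """Per-sample log-likelihood under a diagonal Gaussian."""
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var), axis=-1)

def st_score(spatial, temporal, sp_stats, tp_stats):
    """Joint score: low likelihood under real-data statistics flags a
    video as likely generated (illustrative only)."""
    return log_likelihood(spatial, *sp_stats).mean() + log_likelihood(temporal, *tp_stats).mean()

rng = np.random.default_rng(0)
# Stand-ins for spatial / temporal features of genuine videos.
sp_stats = fit_gaussian(rng.normal(0.0, 1.0, (200, 8)))
tp_stats = fit_gaussian(rng.normal(0.0, 1.0, (200, 8)))
# A real-looking clip scores higher than one far from real statistics.
real = st_score(rng.normal(0.0, 1.0, (50, 8)), rng.normal(0.0, 1.0, (50, 8)), sp_stats, tp_stats)
fake = st_score(rng.normal(3.0, 1.0, (50, 8)), rng.normal(3.0, 1.0, (50, 8)), sp_stats, tp_stats)
```

Because no detector is trained, the approach stays model-agnostic: only statistics of real data are needed.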

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15026
• PDF: https://arxiv.org/pdf/2603.15026
• Project Page: https://omerbenhayun.github.io/stall-video/
• Github: https://github.com/OmerBenHayun/stall-video

==================================


#Deepfakes #VideoDetection #ComputerVision #AI #DigitalForensics
GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering

📝 Summary:
GlyphPrinter improves visual text rendering by addressing glyph accuracy. It introduces Region-Grouped DPO (R-GDPO) with region-level preferences from the GlyphCorrector dataset, significantly enhancing glyph precision and outperforming existing methods.
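
As a hedged sketch, a region-grouped DPO objective could compute a standard DPO preference term per text region and average them; the grouping and weighting below are assumptions, not R-GDPO's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def region_grouped_dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """DPO preference loss computed per text region, then averaged.
    logp_w / logp_l: per-region log-probs of the preferred / dispreferred
    rendering under the policy; ref_*: the same under a frozen reference."""
    margins = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return float(-np.mean(np.log(sigmoid(margins))))

# Two regions where the policy prefers the correct glyphs slightly more
# than the reference model does.
lp_w = np.array([-1.0, -0.8]); lp_l = np.array([-2.0, -1.5])
rf_w = np.array([-1.2, -1.0]); rf_l = np.array([-1.8, -1.4])
loss = region_grouped_dpo_loss(lp_w, lp_l, rf_w, rf_l)
```

Grouping preferences by region lets the objective reward correcting one misrendered word without diluting the signal across the whole image.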

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15616
• PDF: https://arxiv.org/pdf/2603.15616
• Project Page: https://henghuiding.com/GlyphPrinter/
• Github: https://github.com/FudanCVL/GlyphPrinter

==================================


#GlyphRendering #DeepLearning #ComputerVision #AIResearch #TextRendering
Learning Latent Proxies for Controllable Single-Image Relighting

📝 Summary:
Single-image relighting is challenging due to unobserved geometry and materials. LightCtrl introduces a diffusion model guided by sparse, physically meaningful cues from a latent proxy encoder and lighting-aware masks. This enables photometrically faithful relighting with accurate control, outperforming prior methods.

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15555
• PDF: https://arxiv.org/pdf/2603.15555

==================================


#ImageRelighting #DiffusionModels #ComputerVision #DeepLearning #AIResearch
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

📝 Summary:
IOMM is a data-efficient framework for UMM visual generation. It pre-trains with image-only data, then fine-tunes with mixed data, achieving SOTA performance while significantly reducing computational costs.
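
The image-only stage rests on masked modeling of image tokens; a generic sketch (the mask ratio, token vocabulary, and `mask_id` are illustrative, not IOMM's actual configuration):

```python
import numpy as np

def random_mask(tokens, mask_ratio, mask_id, rng):
    """Mask a fraction of image tokens for masked-modeling pre-training.
    Returns the corrupted sequence and the boolean mask of positions the
    model must reconstruct."""
    n = tokens.size
    k = int(round(mask_ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=k, replace=False)] = True
    corrupted = tokens.copy()
    corrupted[mask] = mask_id    # a reserved id outside the codebook
    return corrupted, mask

rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=64)   # 64 tokens from a 1024-entry codebook
corrupted, mask = random_mask(tokens, mask_ratio=0.5, mask_id=1024, rng=rng)
```

Because the reconstruction targets come from the image itself, this stage needs no paired text, which is what makes image-only pre-training possible.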

🔹 Publication Date: Published on Mar 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.16139
• PDF: https://arxiv.org/pdf/2603.16139
• Github: https://github.com/LINs-lab/IOMM

==================================


#UMMVisualGeneration #MaskedModeling #EfficientAI #ComputerVision #GenerativeAI
WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

📝 Summary:
Waypoint Diffusion Transformers (WiT) address trajectory conflicts in pixel-space flow matching using semantic waypoints from pre-trained vision models. WiT disentangles generation paths into segments, accelerating training convergence. It outperforms pixel-space baselines and speeds up JiT training.

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15132
• PDF: https://arxiv.org/pdf/2603.15132
• Project Page: https://hainuo-wang.github.io/WiT/
• Github: https://github.com/hainuo-wang/WiT

==================================


#DiffusionModels #Transformers #ComputerVision #DeepLearning #AI
Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

📝 Summary:
VLA models struggle to integrate visual detail for action generation. DeepVision-VLA enhances visual representations via multi-level feature injection and action-guided pruning. This significantly boosts performance on robotic tasks.

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15618
• PDF: https://arxiv.org/pdf/2603.15618
• Project Page: https://deepvision-vla.github.io/

==================================


#VLAModels #ComputerVision #Robotics #DeepLearning #FoundationModels
Video-CoE: Reinforcing Video Event Prediction via Chain of Events

📝 Summary:
Video-CoE introduces a Chain of Events (CoE) paradigm to improve video event prediction. It addresses MLLM limitations in logical reasoning and visual utilization by constructing temporal event chains and using enhanced training. CoE achieves state-of-the-art performance on VEP benchmarks.

🔹 Publication Date: Published on Mar 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.14935
• PDF: https://arxiv.org/pdf/2603.14935

==================================


#VideoEventPrediction #ChainOfEvents #MLLM #ComputerVision #AI
Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass

📝 Summary:
CHROMM is a unified framework that jointly reconstructs cameras, scene point clouds, and human meshes from multi-person multi-view videos. It integrates strong priors, handles scale discrepancies, and uses multi-view fusion for faster, more robust human-scene reconstruction.

🔹 Publication Date: Published on Mar 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12789
• PDF: https://arxiv.org/pdf/2603.12789
• Project Page: https://nstar1125.github.io/chromm

==================================


#3DReconstruction #ComputerVision #HumanSceneReconstruction #MultiViewVideo #AIResearch
V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

📝 Summary:
V-JEPA 2.1 is a self-supervised model learning dense visual representations for images and videos. It combines dense predictive loss, deep self-supervision, multi-modal tokenizers, and scaling to achieve state-of-the-art performance across various benchmarks, significantly advancing visual understanding.
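
A dense predictive loss of the kind mentioned above can be sketched as feature regression restricted to masked patches (a generic JEPA-style objective; the exact V-JEPA 2.1 loss is not reproduced here):

```python
import numpy as np

def dense_prediction_loss(pred, target, mask):
    """Regress target features at masked patch positions only.
    pred/target: (num_patches, dim); mask: 1 where the patch was hidden."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)   # per-patch MSE
    denom = max(float(mask.sum()), 1.0)                # avoid divide-by-zero
    return float((per_patch * mask).sum() / denom)

pred = np.ones((4, 3))
target = np.zeros((4, 3))
mask = np.array([1.0, 1.0, 0.0, 0.0])   # only the first two patches are masked
loss = dense_prediction_loss(pred, target, mask)
```

Supervising every masked patch, rather than a single pooled embedding, is what yields dense features usable for downstream localization tasks.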

🔹 Publication Date: Published on Mar 15

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.14482
• PDF: https://arxiv.org/pdf/2603.14482
• Project Page: https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
• Github: https://github.com/facebookresearch/vjepa2

==================================


#SelfSupervisedLearning #ComputerVision #DeepLearning #AI #VideoUnderstanding
Prompt-Free Universal Region Proposal Network

📝 Summary:
PF-RPN is a novel network that identifies potential objects without needing external prompts, improving flexibility. It uses Sparse Image-Aware Adapters and Cascade Self-Prompting to localize objects, validated across 19 datasets. This method works across diverse domains with limited data.

🔹 Publication Date: Published on Mar 18

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.17554
• PDF: https://arxiv.org/pdf/2603.17554
• Github: https://github.com/tangqh03/PF-RPN

🔹 Models citing this paper:
https://huggingface.co/tangqh/PF-RPN

==================================


#ObjectDetection #ComputerVision #DeepLearning #RPN #PromptFreeAI
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

📝 Summary:
EffectErase is a new video object removal method that effectively erases dynamic objects and their visual effects. It introduces VOR, a large dataset for training, and uses reciprocal learning with task-aware guidance for high-quality results.

🔹 Publication Date: Published on Mar 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.19224
• PDF: https://arxiv.org/pdf/2603.19224
• Project Page: https://henghuiding.com/EffectErase/
• Github: https://github.com/FudanCVL/EffectErase

==================================


#VideoEditing #ComputerVision #ObjectRemoval #DeepLearning #AI
VID-AD: A Dataset for Image-Level Logical Anomaly Detection under Vision-Induced Distraction

📝 Summary:
VID-AD is a dataset for logical anomaly detection in industrial inspection, specifically addressing challenges from visual distractions. A new language-based framework is also proposed, which uses text descriptions and contrastive learning to capture logical attributes.

🔹 Publication Date: Published on Mar 14

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13964
• PDF: https://arxiv.org/pdf/2603.13964
• Github: https://github.com/nkthiroto/VID-AD

==================================


#AnomalyDetection #IndustrialInspection #ComputerVision #MachineLearning #Datasets
DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

📝 Summary:
DreamPartGen generates 3D objects by modeling part geometry and appearance with Duplex Part Latents. It captures inter-part relationships using Relational Semantic Latents for improved text-shape alignment. A co-denoising process ensures consistency and achieves state-of-the-art results.

🔹 Publication Date: Published on Mar 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.19216
• PDF: https://arxiv.org/pdf/2603.19216
• Project Page: https://plan-lab.github.io/dreampartgen

==================================


#3DGeneration #GenerativeAI #DeepLearning #ComputerVision #TextTo3D