ML Research Hub
32.3K subscribers
6.74K photos
473 videos
24 files
7.35K links
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

📝 Summary:
AnomalyVFM is a framework that enhances vision foundation models for zero-shot anomaly detection through synthetic dataset generation and parameter-efficient adaptation, achieving superior performance...

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.20524
• PDF: https://arxiv.org/pdf/2601.20524
• Project Page: https://maticfuc.github.io/anomaly_vfm/
• Github: https://github.com/MaticFuc/AnomalyVFM

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#AI #DataScience #MachineLearning #HuggingFace #Research
VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

📝 Summary:
VRAG-RL introduces a reinforcement learning framework to empower vision-language models for understanding visually rich information. It uses adaptive visual perception and query optimization to enhance retrieval and reasoning, overcoming limitations of current RAG methods.

🔹 Publication Date: Published on May 28, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2505.22019
• PDF: https://arxiv.org/pdf/2505.22019
• Github: https://github.com/Alibaba-NLP/VRAG

🔹 Models citing this paper:
https://huggingface.co/Qiuchen-Wang/Qwen2.5-VL-7B-VRAG

==================================

#RAG #ReinforcementLearning #VisionLanguageModels #ComputerVision #AI
Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

📝 Summary:
Researchers investigate how reinforcement learning with verifiable rewards can improve visual reasoning accuracy while maintaining logical consistency and visual grounding in multimodal reasoning mode...

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08476
• PDF: https://arxiv.org/pdf/2604.08476

==================================

#AI #DataScience #MachineLearning #HuggingFace #Research
Small Vision-Language Models are Smart Compressors for Long Video Understanding

📝 Summary:
Tempo is an efficient framework that compresses long videos for multimodal understanding by using a small vision-language model for temporal compression and adaptive token allocation to maintain inten...

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08120
• PDF: https://arxiv.org/pdf/2604.08120
• Project Page: https://feielysia.github.io/tempo-page/
• Github: https://feielysia.github.io/tempo-page/

🔹 Models citing this paper:
https://huggingface.co/Vision-CAIR/Tempo-6B

Spaces citing this paper:
https://huggingface.co/spaces/Vision-CAIR/Tempo

==================================

#AI #DataScience #MachineLearning #HuggingFace #Research
CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

📝 Summary:
A geometry-guided method for multi-camera depth estimation that improves consistency across overlapping images using cylindrical spatial attention mechanisms.

🔹 Publication Date: Published on Apr 8

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16428
• PDF: https://arxiv.org/pdf/2511.16428
• Project Page: https://abualhanud.github.io/CylinderDepthPage/
• Github: https://abualhanud.github.io/CylinderDepthPage/

==================================

#AI #DataScience #MachineLearning #HuggingFace #Research
Training a Student Expert via Semi-Supervised Foundation Model Distillation

📝 Summary:
A semi-supervised distillation framework compresses vision foundation models into compact experts for instance segmentation. It uses limited labeled and abundant unlabeled data, employing a novel instance-aware contrastive loss. The student models outperform their teachers and state-of-the-art semi-supervised knowledge distillation (SSKD) methods.

🔹 Publication Date: Published on Apr 4

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.03841
• PDF: https://arxiv.org/pdf/2604.03841

==================================

#AI #DataScience #MachineLearning #HuggingFace #Research
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

📝 Summary:
Post-trained model capabilities can be transferred across different model scales through linear alignment of latent subspace directions without requiring retraining.

🔹 Publication Date: Published on Apr 7

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.06377
• PDF: https://arxiv.org/pdf/2604.06377

==================================

#MachineLearning #AI #DeepLearning #ModelTransfer #SubspaceAlignment
QEIL v2: Heterogeneous Computing for Edge Intelligence via Roofline-Derived Pareto-Optimal Energy Modeling and Multi-Objective Orchestration

📝 Summary:
QEIL v2 improves energy efficiency and performance of large language model inference on edge devices through physics-based adaptive optimization and workload-aware resource allocation.

🔹 Publication Date: Published on Apr 5

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.06057
• PDF: https://arxiv.org/pdf/2602.06057

==================================

#AI #DataScience #MachineLearning #HuggingFace #Research
Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

📝 Summary:
Current vision-language models struggle to infer structured cultural metadata from images consistently across cultures. This paper introduces a new cross-cultural benchmark for this task. Results show models give fragmented, inconsistent, and weakly grounded predictions, revealing significant lim...

🔹 Publication Date: Published on Apr 8

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.07338
• PDF: https://arxiv.org/pdf/2604.07338
• Project Page: https://huggingface.co/datasets/Carolyn-Jiang/Metadata-Inference

Datasets citing this paper:
https://huggingface.co/datasets/Carolyn-Jiang/Metadata-Inference

==================================

#AI #DataScience #MachineLearning #HuggingFace #Research
📝 12 Essential Articles for Data Scientists

🏷 Article: Seq2Seq Learning with NN
https://arxiv.org/pdf/1409.3215
An introduction to sequence-to-sequence (Seq2Seq) models, the foundation of neural machine translation.

🏷 Article: GANs
https://arxiv.org/pdf/1406.2661
An introduction to Generative Adversarial Networks (GANs) and the concept of generating synthetic data. This forms the basis for creating images and videos with artificial intelligence.

🏷 Article: Attention is All You Need
https://arxiv.org/pdf/1706.03762
This paper revolutionized natural language processing. It introduced the Transformer architecture, which underlies GPT, BERT, and modern large language models.
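As an illustration, here is a minimal pure-Python sketch of the scaled dot-product attention at the heart of the Transformer (plain lists instead of tensors, a single head, no masking; real implementations use batched matrix multiplies):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: lists of vectors (lists of floats); K and V have equal length.
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of the query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```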

🏷 Article: Deep Residual Learning
https://arxiv.org/pdf/1512.03385
This work introduced the ResNet model, enabling neural networks to achieve greater depth and accuracy without compromising the learning process.
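The core idea fits in a few lines: the block computes a residual F(x) and adds the input back, so an identity mapping is trivial to represent. A toy fully-connected sketch (the paper's blocks are convolutional):

```python
def relu(x):
    return max(0.0, x)

def residual_block(x, w1, w2):
    # x: input vector; w1, w2: square weight matrices (lists of rows).
    # Inner transform F(x): two linear layers with a ReLU in between.
    h = [relu(sum(wij * xj for wij, xj in zip(row, x))) for row in w1]
    f = [sum(wij * hj for wij, hj in zip(row, h)) for row in w2]
    # Skip connection: the block learns a residual added to x, so an
    # all-zero F leaves the input unchanged (identity mapping).
    return [fi + xi for fi, xi in zip(f, x)]
```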

🏷 Article: Batch Normalization
https://arxiv.org/pdf/1502.03167
This paper introduced a technique that facilitates faster and more stable training of neural networks.
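A sketch of the normalization step for one feature across a mini-batch (training-time statistics only; a real layer also keeps running estimates for inference):

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    # batch: scalar activations of one feature across a mini-batch.
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # Normalize to zero mean / unit variance, then apply the learnable
    # scale (gamma) and shift (beta).
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
```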

🏷 Article: Dropout
https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
A straightforward method designed to prevent overfitting in neural networks.
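The method itself fits in a few lines. This sketch uses the common "inverted dropout" convention, scaling survivors at train time so inference needs no change:

```python
import random

def dropout(xs, p, training=True, rng=random):
    # At train time, zero each unit with probability p and scale the
    # survivors by 1/(1-p) so the expected activation is unchanged.
    if not training or p == 0.0:
        return list(xs)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in xs]
```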

🏷 Article: ImageNet Classification with DCNN
https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
The landmark AlexNet paper, whose deep convolutional network decisively won the 2012 ImageNet challenge and sparked the modern deep learning boom in image recognition.

🏷 Article: Support-Vector Machines
https://link.springer.com/content/pdf/10.1007/BF00994018.pdf
This seminal work introduced the Support Vector Machine (SVM) algorithm, a widely utilized method for data classification.
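A toy illustration of the soft-margin idea: one subgradient step on the regularized hinge loss. This is a linear, primal-form sketch only; the paper's formulation solves the dual and supports kernels:

```python
def hinge_subgradient_step(w, b, x, y, lr=0.1, lam=0.01):
    # One subgradient step on lam/2 * ||w||^2 + max(0, 1 - y*(w.x + b)),
    # with label y in {-1, +1}.
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    if margin < 1:  # point inside the margin: hinge term is active
        w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
        b = b + lr * y
    else:           # correctly classified with margin: only regularize
        w = [wi - lr * lam * wi for wi in w]
    return w, b
```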

🏷 Article: A Few Useful Things to Know About ML
https://homes.cs.washington.edu/~pedro/papers/cacm12.pdf
A comprehensive collection of practical and empirical insights regarding machine learning.

🏷 Article: Gradient Boosting Machine
https://www.cse.iitb.ac.in/~soumen/readings/papers/Friedman1999GreedyFuncApprox.pdf
This paper introduced the "Gradient Boosting" method, which serves as the foundation for many modern machine learning models, including XGBoost and LightGBM.
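A minimal sketch of the idea for squared loss, where the negative gradient is simply the residual, so each round fits a small learner (here a 1-D decision stump) to the current residuals:

```python
def fit_stump(xs, residuals):
    # Find the threshold split minimizing squared error against the
    # residuals, predicting the mean residual on each side.
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left) if left else 0.0
        rmean = sum(right) / len(right) if right else 0.0
        err = sum((r - (lmean if x <= t else rmean)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, n_rounds=20, lr=0.5):
    # Each round fits a stump to the residuals of the current ensemble,
    # then adds it with a shrinkage factor lr.
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)
```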

🏷 Article: Latent Dirichlet Allocation
https://jmlr.org/papers/volume3/blei03a/blei03a.pdf
This work introduced a model for text analysis capable of identifying the topics discussed within an article.

🏷 Article: Random Forests
https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
This paper introduced the "Random Forest" algorithm, a powerful ensemble method that aggregates many decorrelated decision trees to achieve higher accuracy than any single tree.
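A toy sketch of the two ingredients, bootstrap sampling and majority voting, using depth-1 "trees" on scalar inputs (real forests also subsample features at each split):

```python
import random

def bootstrap_sample(data, rng):
    # Sample n points with replacement (the "bagging" step).
    return [rng.choice(data) for _ in data]

def fit_stump(data):
    # data: list of (x, label) with scalar x; pick the threshold whose
    # majority-vote split misclassifies the fewest points.
    best = None
    for t, _ in data:
        left = [y for x, y in data if x <= t]
        right = [y for x, y in data if x > t]
        lvote = max(set(left), key=left.count) if left else 0
        rvote = max(set(right), key=right.count) if right else 0
        err = sum((lvote if x <= t else rvote) != y for x, y in data)
        if best is None or err < best[0]:
            best = (err, t, lvote, rvote)
    _, t, lvote, rvote = best
    return lambda x: lvote if x <= t else rvote

def random_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_trees)]
    def predict(x):
        votes = [tree(x) for tree in trees]
        return max(set(votes), key=votes.count)  # majority vote
    return predict
```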

https://t.iss.one/CodeProgrammer 🌟
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

📝 Summary:
RefineAnything is a multimodal diffusion model for region-specific image refinement. It fixes local detail collapse while strictly preserving backgrounds using a Focus-and-Refine strategy and boundary-aware loss. This provides a practical solution for high-precision local editing.

🔹 Publication Date: Published on Apr 8

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.06870
• PDF: https://arxiv.org/pdf/2604.06870
• Project Page: https://limuloo.github.io/RefineAnything/
• Github: https://github.com/limuloo/RefineAnything

==================================

#DiffusionModels #ImageEditing #ComputerVision #DeepLearning #GenerativeAI
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

📝 Summary:
Matrix-Game 3.0 is a memory-augmented diffusion model achieving real-time 720p interactive video generation with long-term temporal consistency. It uses an advanced data engine, a self-correction training framework with memory, and efficient inference strategies. This enables practical, industria...

🔹 Publication Date: Published on Apr 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08995
• PDF: https://arxiv.org/pdf/2604.08995
• Project Page: https://matrix-game-v3.github.io/

==================================

#DiffusionModels #VideoGeneration #RealTimeAI #GenerativeAI #MachineLearning
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

📝 Summary:
CT-1 is a Vision-Language-Camera model that improves camera-controllable video generation. It uses a Diffusion Transformer and Wavelet Regularization Loss to accurately estimate camera trajectories, enabling precise video synthesis. This achieves 25.7% better accuracy than prior methods.

🔹 Publication Date: Published on Apr 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.09201
• PDF: https://arxiv.org/pdf/2604.09201
• Project Page: https://gulucaptain.github.io/Camera-Transformer-1/
• Github: https://github.com/gulucaptain/Camera-Transformer-1

==================================

#AI #VideoGeneration #ComputerVision #DiffusionModels #VisionLanguageModels
ELT: Elastic Looped Transformers for Visual Generation

📝 Summary:
Elastic Looped Transformers utilize recurrent transformer architecture with weight-sharing and intra-loop self-distillation to achieve parameter-efficient visual generation with adjustable computation...

🔹 Publication Date: Published on Apr 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.09168
• PDF: https://arxiv.org/pdf/2604.09168

==================================

#AI #DataScience #MachineLearning #HuggingFace #Research
VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

📝 Summary:
VisionFoundry creates synthetic visual question answering data using LLMs and text-to-image models to improve VLM visual perception. Training with this targeted data significantly boosts model performance on visual perception benchmarks like MMVP and CV-Bench-3D.

🔹 Publication Date: Published on Apr 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.09531
• PDF: https://arxiv.org/pdf/2604.09531
• Project Page: https://zlab-princeton.github.io/VisionFoundry/

==================================

#VLM #VisualPerception #SyntheticData #LLM #AI
EXAONE 4.5 Technical Report

📝 Summary:
EXAONE 4.5 is LG AI Research's first open-weight vision language model, integrating a visual encoder into EXAONE 4.0. It enhances document understanding and general language capabilities through targeted data and extended context, outperforming similar models in document tasks.

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08644
• PDF: https://arxiv.org/pdf/2604.08644
• Github: https://github.com/LG-AI-EXAONE/EXAONE-4.5

==================================

#VisionLanguageModel #AI #DocumentUnderstanding #MultimodalAI #OpenSourceAI
FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

📝 Summary:
FORGE introduces a multimodal manufacturing dataset, revealing that MLLM performance is limited by domain-specific knowledge, not visual grounding. Fine-tuning on FORGE's annotations significantly improves accuracy, offering a path for domain-adapted MLLMs.

🔹 Publication Date: Published on Apr 8

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.07413
• PDF: https://arxiv.org/pdf/2604.07413
• Project Page: https://ai4manufacturing.github.io/forge-web/
• Github: https://github.com/AI4Manufacturing/FORGE

Datasets citing this paper:
https://huggingface.co/datasets/AI4Manufacturing/forge

==================================

#FORGE #MLLM #ManufacturingAI #MultimodalAI #DomainAdaptation
Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

📝 Summary:
A novel cross-modal emotion transfer approach generates expressive talking face videos by modeling emotion semantic vectors between speech and visual feature spaces, achieving superior emotion accurac...

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.07786
• PDF: https://arxiv.org/pdf/2604.07786
• Project Page: https://chanhyeok-choi.github.io/C-MET/
• Github: https://github.com/ChanHyeok-Choi/C-MET

==================================

#AI #DataScience #MachineLearning #HuggingFace #Research
WildDet3D: Scaling Promptable 3D Detection in the Wild

📝 Summary:
WildDet3D is a unified architecture for open-world 3D object detection, accepting multiple prompt types and integrating geometric cues. It leverages WildDet3D-Data, the largest 3D dataset, to achieve state-of-the-art performance across benchmarks, with significant gains from incorporating depth i...

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08626
• PDF: https://arxiv.org/pdf/2604.08626
• Project Page: https://allenai.github.io/WildDet3D/
• Github: https://github.com/allenai/WildDet3D

==================================

#3DObjectDetection #ComputerVision #DeepLearning #AI #Datasets
Structured Causal Video Reasoning via Multi-Objective Alignment

📝 Summary:
This paper introduces Structured Event Facts for explicit causal video reasoning, moving beyond unstructured methods. It uses a multi-objective reinforcement learning pipeline to balance training goals, leading to Factum-4B. This model achieves reliable, stronger performance on complex temporal v...

🔹 Publication Date: Published on Apr 6

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.04415
• PDF: https://arxiv.org/pdf/2604.04415

==================================

#CausalAI #VideoReasoning #ReinforcementLearning #ComputerVision #AIResearch
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

📝 Summary:
ECHO is an efficient diffusion model for chest X-ray report generation. It achieves fast one-step-per-block inference using Direct Conditional Distillation and Response-Asymmetric Diffusion. ECHO delivers an 8x speedup and improved accuracy over state-of-the-art methods.

🔹 Publication Date: Published on Apr 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.09450
• PDF: https://arxiv.org/pdf/2604.09450
• Project Page: https://echo-midea-airc.github.io/
• Github: https://github.com/clf28/ECHO

==================================

#AI #DataScience #MachineLearning #HuggingFace #Research