ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

📝 Summary:
HeBA introduces a heterogeneous bottleneck adapter framework for Vision-Language Models. It uses modality-specific processing, such as convolutions for images and linear projections for text, combined with a compression bottleneck and active gradient initialization. This design improves few-shot learning.
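The paper's exact architecture isn't given in this summary, but the adapter pattern it describes can be sketched minimally as follows. All names, dimensions, and the zero-init choice for the up-projection are illustrative assumptions, not HeBA's actual design (in particular, this does not implement the paper's "active gradient initialization"):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                            # hidden size, bottleneck rank (assumed)

def text_branch(x, W_proj):
    # Modality-specific step for text: a dense linear projection.
    return x @ W_proj

def image_branch(x):
    # Modality-specific step for images: a depthwise 1D convolution over the
    # patch sequence, standing in for the paper's convolutional processing.
    kernel = np.array([0.25, 0.5, 0.25])
    return np.stack([np.convolve(x[:, c], kernel, mode="same")
                     for c in range(x.shape[1])], axis=1)

def bottleneck(x, W_down, W_up):
    # Shared compression bottleneck with a residual connection.
    h = np.maximum(x @ W_down, 0.0)     # ReLU in the low-rank space
    return x + h @ W_up

W_proj = np.eye(d)
W_down = rng.normal(0.0, 0.02, (d, r))
W_up = np.zeros((r, d))                 # zero-init: adapter starts as identity

text_tokens = rng.normal(size=(6, d))
adapted = bottleneck(text_branch(text_tokens, W_proj), W_down, W_up)
assert np.allclose(adapted, text_tokens)   # identity at initialization
```

The zero-initialized up-projection is a common adapter trick (the frozen backbone's behavior is preserved at step 0); whether HeBA uses it is not stated in the summary.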

🔹 Publication Date: Published on Mar 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.16653
• PDF: https://arxiv.org/pdf/2603.16653
• Project Page: https://huggingface.co/papers?q=dense%20linear%20projections
• Github: https://github.com/Jahid12012021/VLM-HeBA

==================================

For more data science resources:
https://t.iss.one/DataScienceT

#VisionLanguageModels #DeepLearning #AIResearch #ModelAdapters #FewShotLearning
Tinted Frames: Question Framing Blinds Vision-Language Models

📝 Summary:
Vision-language models suffer selective blindness, where linguistic framing degrades visual attention and performance. Constrained framings reduce focus on relevant image regions. A new prompt-tuning method improves visual grounding and performance across different framings.
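The summary names prompt tuning as the remedy for framing sensitivity. A minimal sketch of the standard soft-prompt setup is below; the dimensions and names are hypothetical, and the paper's specific tuning objective is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx = 32, 4                           # embedding dim, learnable context length

# Learnable soft-prompt vectors, prepended to the frozen question embeddings.
soft_prompt = rng.normal(0.0, 0.02, (n_ctx, d))
question = rng.normal(size=(7, d))         # embeddings of a "framed" question

def build_input(prompt, q):
    # During tuning, only `prompt` receives gradients; the question embeddings
    # and the VLM backbone stay frozen.
    return np.concatenate([prompt, q], axis=0)

x = build_input(soft_prompt, question)
assert x.shape == (n_ctx + 7, d)
```

The idea is that the learned context vectors absorb the nuisance variation introduced by different question framings, so visual attention no longer shifts with the wording.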

🔹 Publication Date: Published on Mar 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.19203
• PDF: https://arxiv.org/pdf/2603.19203
• Project Page: https://davidhalladay.github.io/tinted_frames_demo/
• Github: https://github.com/davidhalladay/Tinted-Frames

==================================

#VisionLanguageModels #PromptEngineering #AIAttention #DeepLearning #AIResearch
SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

📝 Summary:
Sparse Embedding Modulation (SEM) debiases vision-language models by operating in a sparse autoencoder latent space. SEM precisely modulates bias-relevant neurons while preserving semantic information, achieving substantial fairness gains in retrieval and classification tasks.
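The encode-modulate-decode loop the summary describes can be sketched as below. The weights, the latent indices, and the scalar-gain form of the modulation are all assumptions for illustration; the paper's actual procedure for identifying bias-relevant latents is not shown:

```python
import numpy as np

rng = np.random.default_rng(0)

def sem_modulate(z, W_enc, b_enc, W_dec, bias_idx, alpha=0.0):
    # Encode the embedding into an overcomplete sparse latent space.
    h = np.maximum(z @ W_enc + b_enc, 0.0)       # ReLU sparse code
    # Scale only the latents flagged as bias-relevant: alpha=0 removes them,
    # alpha=1 leaves the representation untouched.
    h = h.copy()
    h[:, bias_idx] = alpha * h[:, bias_idx]
    return h @ W_dec                             # decode back to embedding space

d, k = 16, 64                                    # embedding dim, latent dim (assumed)
W_enc = rng.normal(0.0, 0.1, (d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(0.0, 0.1, (k, d))
z = rng.normal(size=(5, d))                      # batch of CLIP-style embeddings
bias_idx = [3, 17, 42]                           # hypothetical bias-relevant latents

debiased = sem_modulate(z, W_enc, b_enc, W_dec, bias_idx, alpha=0.0)
unchanged = sem_modulate(z, W_enc, b_enc, W_dec, bias_idx, alpha=1.0)
```

Because only a handful of latents are touched, the remaining sparse code, and hence most of the semantic content, passes through unchanged, which is the property behind the "preserving semantic information" claim.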

🔹 Publication Date: Published on Mar 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.19028
• PDF: https://arxiv.org/pdf/2603.19028
• Project Page: https://sparse-embedding-modulation.github.io/
• Github: https://github.com/mardgui/SEM

==================================

#VisionLanguageModels #BiasCorrection #MachineLearning #AIResearch #DeepLearning
VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

📝 Summary:
VFIG is a vision-language model that converts raster images into scalable vector graphics (SVG). It employs a 66K dataset and hierarchical training for high-fidelity conversion, outperforming open-source models and matching proprietary ones.

🔹 Publication Date: Published on Mar 25

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.24575
• PDF: https://arxiv.org/pdf/2603.24575
• Project Page: https://vfig-proj.github.io/
• Github: https://github.com/RAIVNLab/VFig

🔹 Models citing this paper:
https://huggingface.co/XunmeiLiu/VFIG-4B

🔹 Spaces citing this paper:
https://huggingface.co/spaces/allenai/VFig-Image2SVG-Demo

==================================

#VisionLanguageModels #SVG #VectorGraphics #AI #ComputerVision
Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models

📝 Summary:
Know3D integrates vision-language models into 3D generation via latent hidden-state injection. This enables language-controlled synthesis of unseen back-views, transforming stochastic hallucination into a semantically guided process for 3D assets.

🔹 Publication Date: Published on Mar 24

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.22782
• PDF: https://arxiv.org/pdf/2603.22782
• Project Page: https://xishuxishu.github.io/Know3D.github.io/
• Github: https://github.com/xishuxishu/Know3D

🔹 Spaces citing this paper:
https://huggingface.co/spaces/xishushu/Know3D

==================================

#3DGeneration #VisionLanguageModels #GenerativeAI #DeepLearning #AIResearch
A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI

📝 Summary:
This paper finds that even state-of-the-art multi-billion parameter AI models struggle with surgical tool detection, a seemingly simple task. Scaling models further offers diminishing returns, suggesting fundamental limitations for current Vision Language Models in surgical use cases beyond just ...

🔹 Publication Date: Published on Mar 28

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.27341
• PDF: https://arxiv.org/pdf/2603.27341

==================================

#SurgicalAI #MedicalAI #FoundationModels #VisionLanguageModels #AIHealthcare
LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

📝 Summary:
LinguDistill enables recovery of linguistic capabilities in vision-language models through adapter-free distillation using frozen language models as teachers, achieving performance close to pre-adaptation levels.
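A distillation objective with a frozen LM teacher is typically a temperature-scaled KL term over selected token positions; a minimal sketch is below. The masking scheme, temperature, and toy dimensions are assumptions, and this is not the paper's exact "selective" criterion:

```python
import numpy as np

def softmax(x, T=1.0):
    e = np.exp((x - x.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, text_mask, T=2.0):
    # KL(teacher || student) averaged over the positions selected by
    # `text_mask` (e.g. text-only tokens), with temperature scaling.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = (p * (np.log(p + 1e-9) - np.log(q + 1e-9))).sum(axis=-1)
    return float((kl * text_mask).sum() / text_mask.sum()) * T * T

rng = np.random.default_rng(0)
student = rng.normal(size=(8, 100))      # adapted VLM logits over a toy vocabulary
teacher = rng.normal(size=(8, 100))      # frozen LM logits (the teacher)
mask = np.array([1, 1, 1, 0, 0, 1, 1, 1], dtype=float)  # distill text positions only

loss = distill_loss(student, teacher, mask)
assert distill_loss(teacher, teacher, mask) < 1e-6   # zero when student matches teacher
```

The mask is what makes the distillation selective: visual-token positions are excluded, so the student only pulls its text distribution back toward the frozen teacher's.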

🔹 Publication Date: Published on Apr 1

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.00829
• PDF: https://arxiv.org/pdf/2604.00829

==================================

#VisionLanguageModels #NLP #ModelDistillation #ArtificialIntelligence #MachineLearning
Vero: An Open RL Recipe for General Visual Reasoning

📝 Summary:
Vero is an open vision-language model family that achieves state-of-the-art visual reasoning performance through scaled reinforcement learning data across diverse tasks, demonstrating that broad data ...

🔹 Publication Date: Published on Apr 6

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.04917
• PDF: https://arxiv.org/pdf/2604.04917
• Project Page: https://vero-reasoning.github.io/

==================================

#VisualReasoning #ReinforcementLearning #VisionLanguageModels #AIResearch #DeepLearning
VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

📝 Summary:
VRAG-RL introduces a reinforcement learning framework to empower vision-language models for understanding visually rich information. It uses adaptive visual perception and query optimization to enhance retrieval and reasoning, overcoming limitations of current RAG methods.

🔹 Publication Date: Published on May 28, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2505.22019
• PDF: https://arxiv.org/pdf/2505.22019
• Github: https://github.com/Alibaba-NLP/VRAG

🔹 Models citing this paper:
https://huggingface.co/Qiuchen-Wang/Qwen2.5-VL-7B-VRAG

==================================

#RAG #ReinforcementLearning #VisionLanguageModels #ComputerVision #AI
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

📝 Summary:
CT-1 is a Vision-Language-Camera model that improves camera-controllable video generation. It uses a Diffusion Transformer and Wavelet Regularization Loss to accurately estimate camera trajectories, enabling precise video synthesis. This achieves 25.7% better accuracy than prior methods.
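The summary names a Wavelet Regularization Loss but not its form. One plausible minimal reading, penalizing high-frequency Haar detail coefficients of the predicted camera trajectory, is sketched below; the one-level Haar choice and all names here are assumptions, not CT-1's actual loss:

```python
import numpy as np

def haar_detail(x):
    # One-level Haar transform along time: the detail coefficients capture
    # high-frequency jitter in a 1D camera-parameter trajectory.
    even, odd = x[0::2], x[1::2]
    return (even - odd) / np.sqrt(2.0)

def wavelet_reg_loss(pred_traj):
    # Penalize high-frequency energy so estimated trajectories stay smooth.
    return float((haar_detail(pred_traj) ** 2).mean())

smooth = np.linspace(0.0, 1.0, 16)             # smooth camera path
jittery = smooth + np.where(np.arange(16) % 2 == 0, 0.1, -0.1)
assert wavelet_reg_loss(jittery) > wavelet_reg_loss(smooth)
```

A term like this would be added to the trajectory-estimation objective so the Diffusion Transformer's camera conditioning is not driven by frame-to-frame noise.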

🔹 Publication Date: Published on Apr 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.09201
• PDF: https://arxiv.org/pdf/2604.09201
• Project Page: https://gulucaptain.github.io/Camera-Transformer-1/
• Github: https://github.com/gulucaptain/Camera-Transformer-1

==================================

#AI #VideoGeneration #ComputerVision #DiffusionModels #VisionLanguageModels