✨HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models
📝 Summary:
HeBA introduces a heterogeneous bottleneck adapter framework for Vision-Language Models. It uses modality-specific processing, such as convolutions for images and linear projections for text, combined with a compression bottleneck and active gradient initialization. This design improves few-shot learning.
🔹 Publication Date: Published on Mar 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.16653
• PDF: https://arxiv.org/pdf/2603.16653
• Project Page: https://huggingface.co/papers?q=dense%20linear%20projections
• Github: https://github.com/Jahid12012021/VLM-HeBA
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VisionLanguageModels #DeepLearning #AIResearch #ModelAdapters #FewShotLearning
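The adapter design described above (modality-specific front-ends feeding a shared compression bottleneck, with a zero-style initialization so training starts from the frozen backbone's behavior) can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the dimensions, the smoothing kernel standing in for a convolution, and the zero-initialized up-projection are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def bottleneck_adapter(x, W_down, W_up):
    # down-project to a narrow bottleneck, nonlinearity, up-project, residual add
    h = np.maximum(x @ W_down, 0.0)
    return x + h @ W_up

d, r = 16, 4  # feature dim, bottleneck dim (illustrative sizes)
W_down = rng.normal(scale=0.02, size=(d, r))
W_up = np.zeros((r, d))  # zero-init up-projection: adapter starts as the identity

# text branch: plain linear tokens straight into the shared bottleneck
x_text = rng.normal(size=(2, d))
out_text = bottleneck_adapter(x_text, W_down, W_up)

# image branch: a 1-D smoothing over patch tokens stands in for a convolution
x_img = rng.normal(size=(2, 9, d))  # 9 patch tokens per image
kernel = np.array([0.25, 0.5, 0.25])
x_smooth = np.stack([np.apply_along_axis(
    lambda c: np.convolve(c, kernel, mode="same"), 0, p) for p in x_img])
out_img = bottleneck_adapter(x_smooth, W_down, W_up)
```

With the up-projection zeroed, the adapter's residual branch contributes nothing at step zero, so the pretrained model's outputs are preserved until gradients begin shaping the bottleneck.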
✨Tinted Frames: Question Framing Blinds Vision-Language Models
📝 Summary:
Vision-language models suffer selective blindness, where linguistic framing degrades visual attention and performance. Constrained framings reduce focus on relevant image regions. A new prompt-tuning method improves visual grounding and performance across different framings.
🔹 Publication Date: Published on Mar 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.19203
• PDF: https://arxiv.org/pdf/2603.19203
• Project Page: https://davidhalladay.github.io/tinted_frames_demo/
• Github: https://github.com/davidhalladay/Tinted-Frames
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VisionLanguageModels #PromptEngineering #AIAttention #DeepLearning #AIResearch
✨SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models
📝 Summary:
Sparse Embedding Modulation (SEM) debiases vision-language models by operating in a sparse autoencoder latent space. SEM precisely modulates bias-relevant neurons while preserving semantic information, achieving substantial fairness gains in retrieval and classification tasks.
🔹 Publication Date: Published on Mar 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.19028
• PDF: https://arxiv.org/pdf/2603.19028
• Project Page: https://sparse-embedding-modulation.github.io/
• Github: https://github.com/mardgui/SEM
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VisionLanguageModels #BiasCorrection #MachineLearning #AIResearch #DeepLearning
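The core SEM idea, encoding an embedding into a sparse overcomplete latent space, scaling down a handful of bias-relevant units, and decoding back, can be sketched as below. This is a hedged numpy illustration only: the encoder/decoder weights, the pseudo-inverse decoder, and the choice of bias dimensions are placeholders, not the paper's trained sparse autoencoder.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 32                       # embedding dim, overcomplete latent dim
W_enc = rng.normal(size=(d, k)) / np.sqrt(d)
W_dec = np.linalg.pinv(W_enc)      # decoder as pseudo-inverse, for the sketch

def debias(embeds, bias_dims, scale=0.0):
    lat = np.maximum(embeds @ W_enc, 0.0)  # sparse ReLU latents
    lat = lat.copy()
    lat[:, bias_dims] *= scale             # down-modulate bias-relevant units only
    return lat @ W_dec                     # decode back to embedding space

emb = rng.normal(size=(4, d))
clean = debias(emb, bias_dims=[3, 17], scale=0.0)
```

Because only the selected latent units are touched, the remaining dimensions, which carry the semantic content, pass through the decoder unchanged.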
✨VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models
📝 Summary:
VFIG is a vision-language model that converts raster images into scalable vector graphics (SVG). It employs a 66K dataset and hierarchical training for high-fidelity conversion, outperforming open-source models and matching proprietary ones.
🔹 Publication Date: Published on Mar 25
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.24575
• PDF: https://arxiv.org/pdf/2603.24575
• Project Page: https://vfig-proj.github.io/
• Github: https://github.com/RAIVNLab/VFig
🔹 Models citing this paper:
• https://huggingface.co/XunmeiLiu/VFIG-4B
✨ Spaces citing this paper:
• https://huggingface.co/spaces/allenai/VFig-Image2SVG-Demo
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VisionLanguageModels #SVG #VectorGraphics #AI #ComputerVision
✨Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models
📝 Summary:
Know3D integrates vision-language models into 3D generation via latent hidden-state injection. This enables language-controlled synthesis of unseen back-views, transforming stochastic hallucination into a semantically guided process for 3D assets.
🔹 Publication Date: Published on Mar 24
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.22782
• PDF: https://arxiv.org/pdf/2603.22782
• Project Page: https://xishuxishu.github.io/Know3D.github.io/
• Github: https://github.com/xishuxishu/Know3D
✨ Spaces citing this paper:
• https://huggingface.co/spaces/xishushu/Know3D
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#3DGeneration #VisionLanguageModels #GenerativeAI #DeepLearning #AIResearch
✨A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI
📝 Summary:
This paper finds that even state-of-the-art multi-billion-parameter AI models struggle with surgical tool detection, a seemingly simple task. Scaling models further offers diminishing returns, suggesting fundamental limitations of current Vision-Language Models in surgical use cases.
🔹 Publication Date: Published on Mar 28
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.27341
• PDF: https://arxiv.org/pdf/2603.27341
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#SurgicalAI #MedicalAI #FoundationModels #VisionLanguageModels #AIHealthcare
✨LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation
📝 Summary:
LinguDistill enables recovery of linguistic capabilities in vision-language models through adapter-free distillation using frozen language models as teachers, achieving performance close to pre-adaptation levels.
🔹 Publication Date: Published on Apr 1
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.00829
• PDF: https://arxiv.org/pdf/2604.00829
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VisionLanguageModels #NLP #ModelDistillation #ArtificialIntelligence #MachineLearning
✨Vero: An Open RL Recipe for General Visual Reasoning
📝 Summary:
Vero is an open vision-language model family that achieves state-of-the-art visual reasoning performance by scaling reinforcement learning data across diverse tasks.
🔹 Publication Date: Published on Apr 6
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.04917
• PDF: https://arxiv.org/pdf/2604.04917
• Project Page: https://vero-reasoning.github.io/
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VisualReasoning #ReinforcementLearning #VisionLanguageModels #AIResearch #DeepLearning
✨VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
📝 Summary:
VRAG-RL introduces a reinforcement learning framework to empower vision-language models for understanding visually rich information. It uses adaptive visual perception and query optimization to enhance retrieval and reasoning, overcoming limitations of current RAG methods.
🔹 Publication Date: Published on May 28, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2505.22019
• PDF: https://arxiv.org/pdf/2505.22019
• Github: https://github.com/Alibaba-NLP/VRAG
🔹 Models citing this paper:
• https://huggingface.co/Qiuchen-Wang/Qwen2.5-VL-7B-VRAG
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#RAG #ReinforcementLearning #VisionLanguageModels #ComputerVision #AI
✨CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
📝 Summary:
CT-1 is a Vision-Language-Camera model that improves camera-controllable video generation. It uses a Diffusion Transformer and Wavelet Regularization Loss to accurately estimate camera trajectories, enabling precise video synthesis. This achieves 25.7% better accuracy than prior methods.
🔹 Publication Date: Published on Apr 10
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.09201
• PDF: https://arxiv.org/pdf/2604.09201
• Project Page: https://gulucaptain.github.io/Camera-Transformer-1/
• Github: https://github.com/gulucaptain/Camera-Transformer-1
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #VideoGeneration #ComputerVision #DiffusionModels #VisionLanguageModels
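The wavelet regularization idea mentioned above, supervising a predicted camera trajectory at both coarse and fine temporal scales, can be illustrated with a one-level Haar transform. This is a hypothetical sketch of such a loss, not the paper's actual formulation: the Haar decomposition, equal term weighting, and trajectory shape are all assumptions.

```python
import numpy as np

def haar_level(x):
    # one-level Haar transform along time: pairwise averages and differences
    even, odd = x[::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def wavelet_reg_loss(pred_traj, true_traj):
    # penalize mismatch in both coarse motion (averages) and fine jitter (details)
    pa, pd = haar_level(pred_traj)
    ta, td = haar_level(true_traj)
    return np.mean((pa - ta) ** 2) + np.mean((pd - td) ** 2)

# toy trajectories: 8 timesteps of 3-D camera positions
t = np.linspace(0.0, 1.0, 8)[:, None] * np.ones((1, 3))
loss = wavelet_reg_loss(t + 0.1, t)
```

Splitting the error across average and detail bands lets smooth drift and high-frequency shake be penalized separately, which is the usual motivation for wavelet-domain trajectory losses.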