✨Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
📝 Summary:
A novel fine-tuning method improves Vision Transformer robustness to distribution shifts. It aligns ViT attention with AI-generated concept masks, shifting focus from spurious correlations to semantic features. This boosts out-of-distribution performance and model interpretability.
🔹 Publication Date: Published on Mar 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.08309
• PDF: https://arxiv.org/pdf/2603.08309
• Project Page: https://yonisgit.github.io/concept-ft/
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#AI #ComputerVision #VisionTransformers #MLRobustness #ModelInterpretability
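The attention-alignment idea above can be pictured as a loss that rewards attention mass falling inside a concept mask. The paper's actual objective is not reproduced here; the function name, toy attention maps, and log form below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def attention_alignment_loss(attn, concept_mask, eps=1e-8):
    """Penalise attention mass that falls outside a binary concept mask
    (1 = semantic region, 0 = likely-spurious background).

    attn: (N,) non-negative attention weights over N patches.
    concept_mask: (N,) binary mask marking concept patches.
    """
    attn = attn / (attn.sum() + eps)              # normalise to a distribution
    mass_on_concept = float((attn * concept_mask).sum())
    return -np.log(mass_on_concept + eps)         # small when mass is on-concept

# Attention focused on the concept patch incurs a low loss,
# attention on background (spurious) patches a high one.
mask = np.array([0.0, 1.0, 0.0])
on_concept = attention_alignment_loss(np.array([0.05, 0.90, 0.05]), mask)
on_spurious = attention_alignment_loss(np.array([0.90, 0.05, 0.05]), mask)
assert on_concept < on_spurious
```

During fine-tuning such a term would be added to the task loss, so gradients shift attention from spurious background toward the masked concept.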
✨TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events
📝 Summary:
TAPFormer is a new transformer framework for robust arbitrary point tracking. It uses Transient Asynchronous Fusion to bridge low-rate frames and high-rate events, and Cross-modal Locally Weighted Fusion for adaptive attention. This method significantly outperforms existing trackers.
🔹 Publication Date: Published on Mar 5
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.04989
• PDF: https://arxiv.org/pdf/2603.04989
• Project Page: https://tapformer.github.io/
• Github: https://github.com/ljx1002/TAPFormer
🔹 Models citing this paper:
• https://huggingface.co/ljx1002/tapformer
✨ Datasets citing this paper:
• https://huggingface.co/datasets/ljx1002/tapformer
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#PointTracking #Transformers #ComputerVision #EventCameras #DeepLearning
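The Cross-modal Locally Weighted Fusion mentioned above is, in spirit, a per-location adaptive weighting of the two modalities. A toy numpy sketch under that reading (the confidence maps, softmax form, and function name are assumptions for illustration, not TAPFormer's API):

```python
import numpy as np

def locally_weighted_fusion(frame_feat, event_feat, frame_conf, event_conf):
    """Fuse frame and event features with per-location adaptive weights:
    a softmax over local confidence scores lets events dominate during
    fast motion and frames dominate in static, well-lit regions.

    *_feat: (H, W) feature maps; *_conf: (H, W) confidence maps.
    """
    w = np.exp(np.stack([frame_conf, event_conf]))   # (2, H, W)
    w = w / w.sum(axis=0)                            # softmax over modalities
    return w[0] * frame_feat + w[1] * event_feat

frames = np.full((2, 2), 1.0)
events = np.full((2, 2), 3.0)
# Equal confidence -> plain average of the two modalities.
fused = locally_weighted_fusion(frames, events, np.zeros((2, 2)), np.zeros((2, 2)))
assert np.allclose(fused, 2.0)
```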
✨TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery
📝 Summary:
TALON is a test-time adaptation framework for on-the-fly category discovery. It dynamically updates prototypes and encoder parameters, while calibrating logits, to improve novel class recognition and prevent category explosion. This approach significantly outperforms existing methods.
🔹 Publication Date: Published on Mar 9
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.08075
• PDF: https://arxiv.org/pdf/2603.08075
• Github: https://github.com/ynanwu/TALON
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MachineLearning #DeepLearning #CategoryDiscovery #TestTimeAdaptation #ComputerVision
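The dynamic prototype updates and logit calibration described above can be sketched as a simple test-time loop. The momentum value, novelty threshold, and function names below are illustrative assumptions, not TALON's implementation:

```python
import numpy as np

def update_prototype(proto, feature, momentum=0.9):
    """EMA update of a class prototype with a newly assigned test sample."""
    proto = momentum * proto + (1.0 - momentum) * feature
    return proto / np.linalg.norm(proto)             # keep prototypes unit-norm

def calibrated_logits(feature, prototypes, temperature=0.1, novel_threshold=0.5):
    """Cosine logits over known prototypes plus one calibrated 'novel'
    logit: a new category is opened only when every known similarity
    falls below the threshold, which guards against category explosion."""
    feature = feature / np.linalg.norm(feature)
    sims = prototypes @ feature                      # cosine similarity per class
    return np.append(sims, novel_threshold) / temperature

protos = np.eye(2, 3)                                # two known unit prototypes
known = calibrated_logits(np.array([1.0, 0.1, 0.0]), protos)
novel = calibrated_logits(np.array([0.0, 0.0, 1.0]), protos)
assert known.argmax() == 0                           # matches known class 0
assert novel.argmax() == 2                           # routed to the novel slot
```

In an on-the-fly loop, a sample routed to a known class would refresh that prototype via `update_prototype`; a sample routed to the novel slot would seed a new prototype.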
✨4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video
📝 Summary:
4DEquine is a new framework for 4D equine reconstruction from monocular video. It disentangles motion using spatio-temporal transformers and appearance with 3D Gaussian avatars. Training on synthetic data, it achieves state-of-the-art results on real-world datasets.
🔹 Publication Date: Published on Mar 10
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.10125
• PDF: https://arxiv.org/pdf/2603.10125
• Project Page: https://luoxue-star.github.io/4DEquine_Project_Page/
• Github: https://github.com/luoxue-star/4DEquine
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#ComputerVision #4DReconstruction #DeepLearning #Equine #AI
✨A Mixed Diet Makes DINO An Omnivorous Vision Encoder
📝 Summary:
The Omnivorous Vision Encoder learns modality-agnostic features by aligning multi-modal scene inputs and distilling semantics from a frozen teacher model. This resolves poor cross-modal alignment in existing encoders, yielding consistent, powerful embeddings for various modalities.
🔹 Publication Date: Published on Feb 27
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.24181
• PDF: https://arxiv.org/pdf/2602.24181
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MultimodalAI #ComputerVision #DeepLearning #SelfSupervisedLearning #AIResearch
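Distilling semantics from a frozen teacher, as described above, typically means pulling each modality's embedding of a scene toward the teacher's embedding of the same scene. A hedged numpy sketch (cosine distance is one common choice; the paper may use a different distillation objective):

```python
import numpy as np

def distill_loss(student_emb, teacher_emb):
    """Cosine-distance distillation: align the student's embedding of a
    non-RGB modality (e.g. depth) with the frozen teacher's embedding
    of the same scene, so all modalities land in one shared space."""
    s = student_emb / np.linalg.norm(student_emb)
    t = teacher_emb / np.linalg.norm(teacher_emb)
    return 1.0 - float(s @ t)                        # 0 = aligned, 2 = opposite

# Scale-invariant: only the direction of the embeddings matters.
assert abs(distill_loss(np.array([2.0, 0.0]), np.array([1.0, 0.0]))) < 1e-9
```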
✨HyPER-GAN: Hybrid Patch-Based Image-to-Image Translation for Real-Time Photorealism Enhancement
📝 Summary:
HyPER-GAN is a lightweight U-Net-based model for real-time photorealism enhancement. Its hybrid training strategy, using real-world patches, improves visual realism, semantic consistency, and inference speed over state-of-the-art methods.
🔹 Publication Date: Published on Mar 11
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.10604
• PDF: https://arxiv.org/pdf/2603.10604
• Github: https://github.com/stefanos50/HyPER-GAN
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#GAN #ComputerVision #DeepLearning #ImageProcessing #Photorealism
✨Visual-ERM: Reward Modeling for Visual Equivalence
📝 Summary:
Visual-ERM is a multimodal generative reward model providing fine-grained visual feedback for vision-to-code tasks. It significantly improves reinforcement learning performance for chart, table, and SVG parsing, demonstrating that fine-grained visual supervision is essential.
🔹 Publication Date: Published on Mar 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13224
• PDF: https://arxiv.org/pdf/2603.13224
• Github: https://github.com/InternLM/Visual-ERM
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#ReinforcementLearning #ComputerVision #GenerativeAI #AI #DataScience
✨SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
📝 Summary:
SimRecon reconstructs cluttered scenes from real videos using a Perception-Generation-Simulation pipeline. It employs Active Viewpoint Optimization for visual fidelity and a Scene Graph Synthesizer for physical plausibility. This enables superior compositional scene representations for simulation...
🔹 Publication Date: Published on Mar 2
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.02133
• PDF: https://arxiv.org/pdf/2603.02133
• Project Page: https://xiac20.github.io/SimRecon/
• Github: https://github.com/xiac20/SimRecon
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#SceneReconstruction #ComputerVision #AI #Simulation #3DReconstruction
✨Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation
📝 Summary:
Cheers is a unified multimodal model that decouples visual details from semantic representations for efficient joint optimization of understanding and generation. It employs a vision tokenizer, LLM-based Transformer, and cascaded flow matching. Cheers achieves state-of-the-art performance with 4x...
🔹 Publication Date: Published on Mar 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.12793
• PDF: https://arxiv.org/pdf/2603.12793
• Project Page: https://huggingface.co/ai9stars/Cheers
• Github: https://github.com/AI9Stars/Cheers
🔹 Models citing this paper:
• https://huggingface.co/ai9stars/Cheers
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MultimodalAI #LLM #ComputerVision #GenerativeAI #AIResearch
✨Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
📝 Summary:
This paper presents a novel text-motion retrieval method. It maps joint-angle motion features into Vision Transformer-compatible pseudo-images and uses an enhanced late interaction mechanism. This achieves superior performance and offers interpretable fine-grained text-motion alignments.
🔹 Publication Date: Published on Mar 10
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.09930
• PDF: https://arxiv.org/pdf/2603.09930
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#MotionRetrieval #DeepLearning #ComputerVision #AIResearch #NLP
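The "enhanced late interaction mechanism" above belongs to the ColBERT family: each text token scores its best-matching patch of the motion pseudo-image, and the per-token maxima are summed. A toy numpy sketch of the vanilla form (the paper's enhancement itself is not detailed in this summary):

```python
import numpy as np

def late_interaction_score(text_tokens, motion_patches):
    """Sum-of-max-similarity scoring between L2-normalised text token
    embeddings (T, D) and motion-image patch embeddings (P, D).
    The per-token argmax doubles as an interpretable token-to-patch
    alignment, which is where the fine-grained interpretability comes from."""
    sims = text_tokens @ motion_patches.T            # (T, P) cosine similarities
    return float(sims.max(axis=1).sum()), sims.argmax(axis=1)

tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
patches = np.array([[0.0, 1.0], [1.0, 0.0]])
score, alignment = late_interaction_score(tokens, patches)
assert score == 2.0 and list(alignment) == [1, 0]
```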
✨SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation
📝 Summary:
SNCE is a novel training objective for large-codebook discrete image generators. It supervises models with a soft categorical distribution over neighboring tokens, based on embedding proximity, instead of hard one-hot targets. This approach significantly improves convergence speed and overall gen...
🔹 Publication Date: Published on Mar 16
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15150
• PDF: https://arxiv.org/pdf/2603.15150
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#ImageGeneration #DeepLearning #ComputerVision #GeometryAware #AIResearch
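The soft target described above can be built as a softmax over negative embedding distances to the ground-truth token. A minimal numpy sketch (the temperature `tau` and the toy one-dimensional codebook are assumptions; the paper's exact kernel may differ):

```python
import numpy as np

def soft_targets(codebook, target_idx, tau=1.0):
    """Replace the one-hot target for codebook token `target_idx` with a
    softmax over negative squared embedding distances, so predicting a
    near-identical neighbouring token is only mildly penalised.

    codebook: (K, D) token embeddings."""
    d2 = ((codebook - codebook[target_idx]) ** 2).sum(axis=1)
    logits = -d2 / tau
    logits = logits - logits.max()                   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Nearby tokens receive more target mass than distant ones.
codebook = np.array([[0.0], [0.1], [5.0]])
p = soft_targets(codebook, target_idx=0)
assert p[0] > p[1] > p[2]
```

Training then minimises cross-entropy against `p` instead of the one-hot vector, which is what softens the supervision signal.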
✨Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
📝 Summary:
STALL is a training-free, model-agnostic detector for generated videos. It jointly models spatial and temporal evidence from real-data statistics within a probabilistic framework. STALL consistently outperforms prior image and video-based baselines, improving reliable detection.
🔹 Publication Date: Published on Mar 16
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15026
• PDF: https://arxiv.org/pdf/2603.15026
• Project Page: https://omerbenhayun.github.io/stall-video/
• Github: https://github.com/OmerBenHayun/stall-video
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#Deepfakes #VideoDetection #ComputerVision #AI #DigitalForensics
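A training-free detector of this kind fits simple densities to spatial and temporal statistics of real videos, then scores candidates by joint negative log-likelihood. A Gaussian toy sketch (the paper's actual statistics and density model are not specified in this summary):

```python
import numpy as np

def joint_nll_score(spatial_stat, temporal_stat, real_spatial, real_temporal):
    """Score a candidate video by how unlikely its spatial and temporal
    statistics are under Gaussians fitted to a reference set of REAL
    videos; generated videos should receive higher scores."""
    def nll(x, ref):
        mu, sigma = ref.mean(), ref.std() + 1e-8
        z = (x - mu) / sigma
        return 0.5 * z * z + np.log(sigma)
    return float(nll(spatial_stat, real_spatial) + nll(temporal_stat, real_temporal))

real_s = np.array([0.9, 1.0, 1.1, 1.0])              # stats from real videos
real_t = np.array([0.2, 0.25, 0.3, 0.25])
# A real-like candidate scores lower than a statistical outlier.
assert joint_nll_score(1.0, 0.25, real_s, real_t) < joint_nll_score(4.0, 1.5, real_s, real_t)
```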
✨GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
📝 Summary:
GlyphPrinter improves visual text rendering by addressing glyph accuracy. It introduces Region-Grouped DPO (R-GDPO) with region-level preferences from the GlyphCorrector dataset, significantly enhancing precision and outperforming existing methods.
🔹 Publication Date: Published on Mar 16
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15616
• PDF: https://arxiv.org/pdf/2603.15616
• Project Page: https://henghuiding.com/GlyphPrinter/
• Github: https://github.com/FudanCVL/GlyphPrinter
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#GlyphRendering #DeepLearning #ComputerVision #AIResearch #TextRendering
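Region-grouped DPO applies the standard DPO objective per text region rather than per image. The underlying per-region loss below is the standard DPO formulation; the region grouping itself is the paper's contribution and is only gestured at here:

```python
import numpy as np

def region_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one text region: logp_w / logp_l are the
    policy's log-likelihoods of the glyph-correct and glyph-corrupted
    renderings of that region; ref_* come from a frozen reference
    model. R-GDPO-style training would average this over regions."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -float(np.log(1.0 / (1.0 + np.exp(-margin))))

# Zero margin gives log 2; preferring the correct glyph lowers the loss.
assert abs(region_dpo_loss(0.0, 0.0, 0.0, 0.0) - np.log(2.0)) < 1e-9
assert region_dpo_loss(5.0, -5.0, 0.0, 0.0) < np.log(2.0)
```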
✨Learning Latent Proxies for Controllable Single-Image Relighting
📝 Summary:
Single-image relighting is challenging due to unobserved geometry and materials. LightCtrl introduces a diffusion model guided by sparse, physically meaningful cues from a latent proxy encoder and lighting-aware masks. This enables photometrically faithful relighting with accurate control, outper...
🔹 Publication Date: Published on Mar 16
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15555
• PDF: https://arxiv.org/pdf/2603.15555
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#ImageRelighting #DiffusionModels #ComputerVision #DeepLearning #AIResearch
✨Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
📝 Summary:
IOMM is a data-efficient framework for UMM visual generation. It pre-trains with image-only data then fine-tunes with mixed data, achieving SOTA performance while significantly reducing computational costs.
🔹 Publication Date: Published on Mar 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.16139
• PDF: https://arxiv.org/pdf/2603.16139
• Github: https://github.com/LINs-lab/IOMM
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#UMMVisualGeneration #MaskedModeling #EfficientAI #ComputerVision #GenerativeAI
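IOMM's exact objective isn't given in this summary; the image-only pre-training stage presumably resembles masked image modeling, where a random subset of patches is hidden and loss is taken only on the masked positions. A generic MAE-style numpy sketch under that assumption:

```python
import numpy as np

def masked_recon_loss(patches, recon, mask_ratio=0.75, seed=0):
    """Hide `mask_ratio` of the patches and compute mean-squared
    reconstruction error ONLY on the hidden ones, so pre-training
    needs images alone (no paired text).

    patches, recon: (N, D) ground-truth and reconstructed patches."""
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    masked = rng.choice(n, size=int(n * mask_ratio), replace=False)
    return float(((patches[masked] - recon[masked]) ** 2).mean())

patches = np.arange(16.0).reshape(8, 2)
assert masked_recon_loss(patches, patches.copy()) == 0.0   # perfect reconstruction
```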
✨WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation
📝 Summary:
Waypoint Diffusion Transformers (WiT) address trajectory conflicts in pixel-space flow matching using semantic waypoints from pre-trained vision models. WiT disentangles generation paths into segments, accelerating training convergence. It outperforms pixel-space baselines and speeds up JiT trainin...
🔹 Publication Date: Published on Mar 16
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.15132
• PDF: https://arxiv.org/pdf/2603.15132
• Project Page: https://hainuo-wang.github.io/WiT/
• Github: https://github.com/hainuo-wang/WiT
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#DiffusionModels #Transformers #ComputerVision #DeepLearning #AI
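Disentangling the generation path into segments can be pictured as routing the noise-to-image trajectory through a semantic waypoint, with each segment interpolated separately. A toy numpy sketch (the 0.5 split point and linear interpolants are simplifying assumptions, not WiT's parameterisation):

```python
import numpy as np

def waypoint_path(x0, waypoint, x1, t):
    """Piecewise-linear flow-matching path noise -> waypoint -> image:
    for t < 0.5 interpolate x0 -> waypoint, afterwards waypoint -> x1.
    Conditioning velocity targets on the segment keeps trajectories
    from different images from conflicting mid-path."""
    if t < 0.5:
        s = 2.0 * t
        return (1.0 - s) * x0 + s * waypoint
    s = 2.0 * (t - 0.5)
    return (1.0 - s) * waypoint + s * x1

x0, wp, x1 = np.zeros(2), np.array([1.0, 1.0]), np.array([4.0, 0.0])
assert np.allclose(waypoint_path(x0, wp, x1, 0.0), x0)
assert np.allclose(waypoint_path(x0, wp, x1, 0.5), wp)
assert np.allclose(waypoint_path(x0, wp, x1, 1.0), x1)
```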