✨Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
📝 Summary:
DSR Suite addresses vision language models' weak dynamic spatial reasoning. It creates 4D training data from videos using an automated pipeline and integrates geometric priors via a Geometry Selection Module. This significantly enhances VLMs' dynamic spatial reasoning capability while maintaining gen...
🔹 Publication Date: Published on Dec 23, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.20557
• PDF: https://arxiv.org/pdf/2512.20557
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VisionLanguageModels #SpatialReasoning #4D #ComputerVision #AIResearch
✨Latent Implicit Visual Reasoning
📝 Summary:
Large Multimodal Models struggle with visual reasoning due to their text-centric nature and the limitations of prior methods. This paper introduces a task-agnostic mechanism for LMMs to discover and use visual reasoning tokens without explicit supervision. The approach achieves state-of-the-art resul...
🔹 Publication Date: Published on Dec 24, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.21218
• PDF: https://arxiv.org/pdf/2512.21218
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#LMMs #VisualReasoning #AI #ComputerVision #DeepLearning
✨Spatia: Video Generation with Updatable Spatial Memory
📝 Summary:
Spatia is a video generation framework that improves long-term consistency by using an updatable 3D scene point cloud as persistent spatial memory. It iteratively generates video clips and updates this memory via visual SLAM, enabling realistic videos and 3D-aware interactive editing.
🔹 Publication Date: Published on Dec 17, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.15716
• PDF: https://arxiv.org/pdf/2512.15716
• Project Page: https://zhaojingjing713.github.io/Spatia/
• Github: https://github.com/ZhaoJingjing713/Spatia
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VideoGeneration #GenerativeAI #ComputerVision #3DReconstruction #SLAM
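The generate-then-update loop described in the summary can be sketched in a few lines. This is an illustrative toy, not the paper's actual API; all function and variable names here are hypothetical stand-ins.

```python
# Hypothetical sketch of Spatia's generate-then-update loop: each clip is
# conditioned on a persistent point-cloud memory, which visual SLAM would
# then extend with newly observed geometry. Names are illustrative only.

def run_spatia(num_clips, generate_clip, update_memory):
    """Iteratively generate clips conditioned on a persistent spatial memory."""
    memory = []          # persistent 3D scene points, e.g. (x, y, z) tuples
    clips = []
    for _ in range(num_clips):
        clip = generate_clip(memory)          # condition on current memory
        memory = update_memory(memory, clip)  # SLAM-style memory update
        clips.append(clip)
    return clips, memory

# Toy stand-ins: each "clip" records how many points it was conditioned on,
# and each update contributes one new 3D point.
clips, memory = run_spatia(
    num_clips=3,
    generate_clip=lambda mem: {"frames": 16, "seen_points": len(mem)},
    update_memory=lambda mem, clip: mem + [(float(len(mem)), 0.0, 0.0)],
)
print(len(memory))  # memory accumulates across clips instead of resetting
```

The point of the structure is that the memory outlives any single clip, which is what gives the generated video its long-term consistency.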
✨How Much 3D Do Video Foundation Models Encode?
📝 Summary:
A new framework quantifies 3D understanding in Video Foundation Models (VidFMs). Trained only on video, VidFMs show strong 3D awareness, often surpassing expert 3D models and providing insights for 3D AI.
🔹 Publication Date: Published on Dec 23, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.19949
• PDF: https://arxiv.org/pdf/2512.19949
• Project Page: https://vidfm-3d-probe.github.io/
• Github: https://vidfm-3d-probe.github.io
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VideoFoundationModels #3DUnderstanding #ComputerVision #AIResearch #DeepLearning
✨Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
📝 Summary:
Fast3R is a Transformer-based method for efficient and scalable multi-view 3D reconstruction. It processes many images in parallel in a single forward pass, improving speed and accuracy over pairwise approaches like DUSt3R.
🔹 Publication Date: Published on Jan 23, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2501.13928
• PDF: https://arxiv.org/pdf/2501.13928
• Github: https://github.com/naver/dust3r/pull/16
🔹 Models citing this paper:
• https://huggingface.co/jedyang97/Fast3R_ViT_Large_512
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#3DReconstruction #ComputerVision #Transformers #Fast3R #DeepLearning
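The contrast with pairwise methods can be made concrete: rather than running a model over every image pair, all views are flattened into one joint token sequence so global attention can relate every view to every other in a single pass. The sketch below is a toy illustration of that sequence layout, not the real model.

```python
# Toy illustration of the single-pass idea behind Fast3R (not the actual
# architecture): patch tokens from all views are concatenated into one
# sequence, tagged with a view index, and processed together. A pairwise
# approach like DUSt3R would instead need N*(N-1)/2 forward passes.

def build_joint_sequence(num_views, patches_per_view):
    """Return (token, view_id) pairs forming one joint transformer input."""
    return [(f"v{v}_p{p}", v)
            for v in range(num_views)
            for p in range(patches_per_view)]

seq = build_joint_sequence(num_views=4, patches_per_view=3)
print(len(seq))  # all 12 tokens attend jointly in one forward pass
```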
✨InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
📝 Summary:
InsertAnywhere is a framework for realistic video object insertion. It uses 4D-aware mask generation for geometric consistency and an extended diffusion model for appearance-faithful synthesis, outperforming existing methods.
🔹 Publication Date: Published on Dec 19, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.17504
• PDF: https://arxiv.org/pdf/2512.17504
• Project Page: https://myyzzzoooo.github.io/InsertAnywhere/
• Github: https://github.com/myyzzzoooo/InsertAnywhere
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VideoEditing #DiffusionModels #ComputerVision #DeepLearning #GenerativeAI
✨UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
📝 Summary:
UniPercept-Bench provides a unified framework and datasets for perceptual-level image understanding (aesthetics, quality, structure, and texture). The UniPercept model, trained with DAPT and T-ARL, outperforms MLLMs, generalizes across VR and VQA, and acts as a text-to-image reward model.
🔹 Publication Date: Published on Dec 25, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.21675
• PDF: https://arxiv.org/pdf/2512.21675
• Project Page: https://thunderbolt215.github.io/Unipercept-project/
• Github: https://github.com/thunderbolt215/UniPercept
🔹 Models citing this paper:
• https://huggingface.co/Thunderbolt215215/UniPercept
✨ Datasets citing this paper:
• https://huggingface.co/datasets/Thunderbolt215215/UniPercept-Bench
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#ImageUnderstanding #ComputerVision #AIResearch #PerceptualAI #DeepLearning
✨Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
📝 Summary:
Transparent objects are hard to perceive. This work observes that video diffusion models can synthesize transparent phenomena and repurposes one accordingly. The resulting DKT model, trained on a new dataset, achieves zero-shot SOTA for depth and normal estimation of transparent objects, proving diffusion knows tr...
🔹 Publication Date: Published on Dec 29, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.23705
• PDF: https://arxiv.org/pdf/2512.23705
• Project Page: https://daniellli.github.io/projects/DKT/
• Github: https://github.com/Daniellli/DKT
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#ComputerVision #DiffusionModels #DepthEstimation #TransparentObjects #AIResearch
✨SpotEdit: Selective Region Editing in Diffusion Transformers
📝 Summary:
SpotEdit is a training-free framework for selective image editing in diffusion transformers. It avoids reprocessing stable regions by reusing their features, combining them with edited areas. This reduces computation and preserves unchanged regions, enhancing efficiency and precision.
🔹 Publication Date: Published on Dec 26, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22323
• PDF: https://arxiv.org/pdf/2512.22323
• Project Page: https://biangbiang0321.github.io/SpotEdit.github.io
• Github: https://biangbiang0321.github.io/SpotEdit.github.io
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#ImageEditing #DiffusionModels #ComputerVision #AIResearch #DeepLearning
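The feature-reuse idea from the summary reduces to a masked blend: cached features are kept where the edit mask is off and recomputed only where it is on. A minimal sketch, with all names hypothetical and no claim about the paper's actual implementation:

```python
# Hedged sketch of SpotEdit-style selective recomputation: features from
# stable regions are reused as-is, and only masked (edited) regions pay
# the recomputation cost. Purely illustrative.

def selective_update(cached, recompute, mask):
    """Blend cached features with freshly computed ones under a binary mask."""
    recomputed = 0
    out = []
    for feat, m in zip(cached, mask):
        if m:
            out.append(recompute(feat))  # edited region: recompute
            recomputed += 1
        else:
            out.append(feat)             # stable region: reuse, no compute
    return out, recomputed

features = [1.0, 2.0, 3.0, 4.0]
mask = [0, 1, 0, 0]  # only one region is being edited
out, n = selective_update(features, lambda f: f * 10, mask)
print(out, n)  # [1.0, 20.0, 3.0, 4.0] 1
```

Only one of four regions is recomputed here, which is exactly the source of both the efficiency gain and the preservation of unchanged regions.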
✨Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
📝 Summary:
Dream-VL and Dream-VLA are diffusion-based vision-language and vision-language-action models. They achieve state-of-the-art performance in visual planning and robotic control, surpassing autoregressive baselines via their diffusion backbone's superior action generation.
🔹 Publication Date: Published on Dec 27, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.22615
• PDF: https://arxiv.org/pdf/2512.22615
• Project Page: https://hkunlp.github.io/blog/2025/dream-vlx/
• Github: https://github.com/DreamLM/Dream-VLX
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#VisionLanguageModels #DiffusionModels #Robotics #AI #ComputerVision
✨GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
📝 Summary:
GaMO improves sparse-view 3D reconstruction by using geometry-aware multi-view outpainting. It expands existing views to enhance scene coverage and consistency. This achieves state-of-the-art quality 25x faster than prior methods, with reduced computational cost.
🔹 Publication Date: Published on Dec 31, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.25073
• PDF: https://arxiv.org/pdf/2512.25073
• Project Page: https://yichuanh.github.io/GaMO/
• Github: https://yichuanh.github.io/GaMO/
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#3DReconstruction #ComputerVision #DiffusionModels #GaMO #AI
✨Guiding a Diffusion Transformer with the Internal Dynamics of Itself
📝 Summary:
This paper introduces Internal Guidance (IG) for diffusion models, which adds auxiliary supervision to intermediate layers during training and extrapolates outputs during sampling. This simple strategy significantly improves training efficiency and generation quality. IG achieves state-of-the-art F...
🔹 Publication Date: Published on Dec 30, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.24176
• PDF: https://arxiv.org/pdf/2512.24176
• Project Page: https://zhouxingyu13.github.io/Internal-Guidance/
• Github: https://github.com/CVL-UESTC/Internal-Guidance
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#DiffusionModels #AI #DeepLearning #GenerativeAI #ComputerVision
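The extrapolation step at sampling time can be written as a one-line update in the spirit of classifier-free guidance: push the final prediction further away from an intermediate-layer prediction. The function name and guidance weight `w` below are assumptions for illustration, not the paper's notation.

```python
# Illustrative sketch of guidance-style extrapolation between a model's
# intermediate-layer prediction and its final prediction: the output is
# pushed beyond the final prediction, away from the intermediate one.

def extrapolate(intermediate, final, w):
    """Extrapolate past `final` along the direction from `intermediate`."""
    return [f + w * (f - i) for i, f in zip(intermediate, final)]

inter = [0.0, 1.0]
final = [1.0, 2.0]
print(extrapolate(inter, final, w=0.5))  # [1.5, 2.5]
```

With `w = 0` this reduces to the plain final prediction; larger `w` amplifies whatever the deeper layers added on top of the intermediate estimate.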
✨Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation
📝 Summary:
DiffusionGS is a novel single-stage 3D diffusion model that directly generates 3D Gaussian point clouds from a single image. It ensures strong view consistency from any prompt view. This method achieves superior quality and is over 5x faster than state-of-the-art techniques.
🔹 Publication Date: Published on Nov 21, 2024
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2411.14384
• PDF: https://arxiv.org/pdf/2411.14384
• Project Page: https://caiyuanhao1998.github.io/project/DiffusionGS/
• Github: https://github.com/caiyuanhao1998/Open-DiffusionGS
🔹 Models citing this paper:
• https://huggingface.co/CaiYuanhao/DiffusionGS
✨ Datasets citing this paper:
• https://huggingface.co/datasets/CaiYuanhao/DiffusionGS
==================================
For more data science resources:
✓ https://t.iss.one/DataScienceT
#3DGeneration #DiffusionModels #GaussianSplatting #ComputerVision #AIResearch