✨ Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution
📝 Summary:
Orion is a visual agent framework that orchestrates specialized computer vision tools to execute complex visual workflows. It achieves competitive performance on benchmarks and enables autonomous, tool-driven visual reasoning.
🔹 Publication Date: Published on Nov 18
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.14210
• PDF: https://arxiv.org/pdf/2511.14210
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#ComputerVision #AIagents #VisualReasoning #MultimodalAI #DeepLearning
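The summary describes Orion as orchestrating specialized vision tools, but the paper's actual tool interface isn't given here; the snippet below is only a minimal sketch of that general agent pattern, with hypothetical tool names and a planner callable standing in for the underlying model.

```python
# Minimal sketch of a tool-orchestrating visual agent loop (hypothetical tool
# names and planner; not Orion's actual API). The planner either picks a tool
# or returns a final answer; tool outputs are fed back as observations.
def detect_objects(image):
    return [{"label": "person", "box": [12, 30, 140, 220]}]   # placeholder detector

def read_text(image):
    return "EXIT 12"                                           # placeholder OCR tool

TOOLS = {"detect_objects": detect_objects, "read_text": read_text}

def run_visual_agent(image, question, planner, max_steps=5):
    history = []
    for _ in range(max_steps):
        step = planner(question, history)        # e.g. {"tool": "read_text"} or {"answer": "..."}
        if "answer" in step:
            return step["answer"]
        observation = TOOLS[step["tool"]](image)
        history.append({"tool": step["tool"], "observation": observation})
    return None                                   # give up after max_steps
```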
✨ A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
📝 Summary:
CoTyle introduces code-to-style image generation, creating consistent visual styles from numerical codes. It is the first open-source academic method for this task, using a discrete style codebook and a text-to-image diffusion model for diverse, reproducible styles.
🔹 Publication Date: Published on Nov 13
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.10555
• PDF: https://arxiv.org/pdf/2511.10555
• Project Page: https://Kwai-Kolors.github.io/CoTyle/
• Github: https://github.com/Kwai-Kolors/CoTyle
✨ Spaces citing this paper:
• https://huggingface.co/spaces/Kwai-Kolors/CoTyle
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#ImageGeneration #DiffusionModels #NeuralStyle #ComputerVision #DeepLearning
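Since a discrete style codebook conditioning a text-to-image diffusion model is the core idea, here is a hedged sketch of how a numeric code could map to a reusable style embedding; the class, sizes, and names are assumptions for illustration, not CoTyle's released code.

```python
# Hedged sketch of code-to-style conditioning (assumed sizes and names, not
# CoTyle's actual implementation): a numeric code indexes a learned codebook,
# and the resulting embedding conditions the diffusion model alongside the prompt.
import torch
import torch.nn as nn

class StyleCodebook(nn.Module):
    def __init__(self, num_styles=65536, dim=768):
        super().__init__()
        self.table = nn.Embedding(num_styles, dim)   # the discrete style space

    def forward(self, style_code: torch.Tensor) -> torch.Tensor:
        return self.table(style_code)                # (batch, dim) style embedding

codebook = StyleCodebook()
style_emb = codebook(torch.tensor([1234]))           # the same code always yields the same style
# style_emb would then be injected into the text-to-image model's conditioning
# (e.g. alongside the prompt embedding), making styles reproducible by number.
```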
✨ MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
📝 Summary:
MVI-Bench introduces a new benchmark for evaluating the robustness of Large Vision-Language Models (LVLMs) against misleading visual inputs. It uses a hierarchical taxonomy and a novel metric to uncover significant vulnerabilities in state-of-the-art LVLMs.
🔹 Publication Date: Published on Nov 18
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.14159
• PDF: https://arxiv.org/pdf/2511.14159
• Github: https://github.com/chenyil6/MVI-Bench
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#LVLMs #ComputerVision #AIrobustness #MachineLearning #AI
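The summary names a novel metric without defining it; purely as an illustration of how robustness to misleading inputs is often quantified, the sketch below computes a clean-vs-misleading accuracy gap (a generic measure, not necessarily the paper's own metric).

```python
# Generic robustness-gap sketch (not necessarily MVI-Bench's actual metric):
# compare accuracy on clean images with accuracy on their misleading counterparts.
def robustness_gap(results):
    # results: list of {"clean_correct": bool, "misleading_correct": bool}
    clean_acc = sum(r["clean_correct"] for r in results) / len(results)
    misled_acc = sum(r["misleading_correct"] for r in results) / len(results)
    return clean_acc - misled_acc        # smaller gap = more robust LVLM

example = [{"clean_correct": True, "misleading_correct": False}] * 7 + \
          [{"clean_correct": True, "misleading_correct": True}] * 3
print(robustness_gap(example))           # 0.7
```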
✨ REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
📝 Summary:
Text-only self-reflection is insufficient for long-form video understanding. REVISOR is a new framework enabling MLLMs to perform multimodal introspective reflection across text and visual modalities. This significantly enhances reasoning for long videos without extra fine-tuning, achieving stron...
🔹 Publication Date: Published on Nov 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13026
• PDF: https://arxiv.org/pdf/2511.13026
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#MultimodalAI #VideoUnderstanding #MLLMs #AIResearch #ComputerVision
✨ Φeat: Physically-Grounded Feature Representation
📝 Summary:
Φeat is a new self-supervised visual backbone that captures material identity like reflectance and mesostructure. It learns robust features invariant to external physical factors such as shape and lighting, promoting physics-aware perception.
🔹 Publication Date: Published on Nov 14
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11270
• PDF: https://arxiv.org/pdf/2511.11270
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#ComputerVision #SelfSupervisedLearning #DeepLearning #FeatureLearning #PhysicsAwareAI
✨ VIDEOP2R: Video Understanding from Perception to Reasoning
📝 Summary:
VideoP2R is a novel reinforcement fine-tuning framework for video understanding. It separately models perception and reasoning processes, using a new chain-of-thought (CoT) dataset and a process-aware RL algorithm. This approach achieves state-of-the-art results on video reasoning benchmarks.
🔹 Publication Date: Published on Nov 14
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11113v1
• PDF: https://arxiv.org/pdf/2511.11113
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#VideoUnderstanding #ReinforcementLearning #AIResearch #ComputerVision #Reasoning
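The process-aware RL algorithm is only named in the summary; as a hedged sketch of what "process-aware" could mean, the reward below scores the perception output and the final answer separately instead of rewarding the answer alone (the weights and scoring are illustrative assumptions, not the paper's design).

```python
# Illustrative process-aware reward (an assumption, not VideoP2R's actual
# algorithm): score the perception step (e.g. a generated video description)
# and the reasoning step (final answer correctness) separately, then combine.
def process_aware_reward(perception_score: float, answer_correct: bool,
                         w_perception: float = 0.3) -> float:
    # perception_score in [0, 1], e.g. similarity of the description to a reference
    return w_perception * perception_score + (1.0 - w_perception) * float(answer_correct)

print(process_aware_reward(0.8, True))    # 0.94: good perception, correct answer
print(process_aware_reward(0.8, False))   # 0.24: good perception still earns partial credit
```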
✨ Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
📝 Summary:
VR-Bench evaluates video models' spatial reasoning using maze-solving tasks. It demonstrates that video models excel in spatial perception and reasoning, outperforming VLMs, and benefit from diverse sampling during inference. These findings show the strong potential of reasoning via video for spa...
🔹 Publication Date: Published on Nov 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15065
• PDF: https://arxiv.org/pdf/2511.15065
• Project Page: https://imyangc7.github.io/VRBench_Web/
• Github: https://github.com/ImYangC7/VR-Bench
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#VideoModels #AIReasoning #SpatialAI #ComputerVision #MachineLearning
✨ ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
📝 Summary:
ARC-Chapter is a large-scale video chaptering model trained on millions of long video chapters, using a new bilingual and hierarchical dataset. It introduces a novel evaluation metric, GRACE, to better reflect real-world chaptering. The model achieves state-of-the-art performance and demonstrates...
🔹 Publication Date: Published on Nov 18
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.14349
• PDF: https://arxiv.org/pdf/2511.14349
• Project Page: https://arcchapter.github.io/index_en.html
• Github: https://github.com/TencentARC/ARC-Chapter
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#VideoChaptering #AI #MachineLearning #VideoSummarization #ComputerVision
✨ Medal S: Spatio-Textual Prompt Model for Medical Segmentation
📝 Summary:
Medal S is a medical segmentation foundation model using spatio-textual prompts for efficient, high-accuracy multi-class segmentation across diverse modalities. It uniquely aligns volumetric prompts with text embeddings and processes masks in parallel, significantly outperforming prior methods.
🔹 Publication Date: Published on Nov 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13001
• PDF: https://arxiv.org/pdf/2511.13001
• Github: https://github.com/yinghemedical/Medal-S
🔹 Models citing this paper:
• https://huggingface.co/spc819/Medal-S-V1.0
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#MedicalSegmentation #FoundationModels #AI #DeepLearning #ComputerVision
✨ OmniParser for Pure Vision Based GUI Agent
📝 Summary:
OmniParser enhances GPT-4V's ability to act as a GUI agent by improving screen parsing. It identifies interactable icons and understands element semantics using specialized models. This significantly boosts GPT-4V's performance on benchmarks like ScreenSpot, Mind2Web, and AITW.
🔹 Publication Date: Published on Aug 1, 2024
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2408.00203
• PDF: https://arxiv.org/pdf/2408.00203
• Github: https://github.com/microsoft/omniparser
🔹 Models citing this paper:
• https://huggingface.co/microsoft/OmniParser
• https://huggingface.co/microsoft/OmniParser-v2.0
• https://huggingface.co/banao-tech/OmniParser
✨ Datasets citing this paper:
• https://huggingface.co/datasets/mlfoundations/Click-100k
✨ Spaces citing this paper:
• https://huggingface.co/spaces/callmeumer/OmniParser-v2
• https://huggingface.co/spaces/nofl/OmniParser-v2
• https://huggingface.co/spaces/SheldonLe/OmniParser-v2
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#GUIagents #ComputerVision #GPT4V #AIagents #DeepLearning
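Given that the summary describes a detector for interactable icons plus a semantic captioner feeding GPT-4V, the sketch below shows that parse-then-act pattern with placeholder helpers; the function names are hypothetical and not OmniParser's released modules.

```python
# Hedged sketch of the parse-then-act pattern (hypothetical helpers, not
# OmniParser's actual modules): detect interactable elements, caption each one,
# and hand the structured screen description to the agent LLM, which can then
# refer to elements by id instead of raw pixel coordinates.
def detect_interactable_elements(screenshot):
    return [{"id": 0, "box": [10, 10, 60, 40]},      # placeholder detector output
            {"id": 1, "box": [80, 10, 160, 40]}]

def describe_element(screenshot, box):
    return "blue 'Submit' button"                     # placeholder local-semantics captioner

def parse_screen(screenshot):
    elements = detect_interactable_elements(screenshot)
    for el in elements:
        el["description"] = describe_element(screenshot, el["box"])
    return elements

# An agent model consuming parse_screen(...) can output actions such as
# {"action": "click", "element_id": 1}, which an executor maps back to coordinates.
```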
✨ Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
📝 Summary:
VANS is a new model for Video-Next-Event Prediction (VNEP) that generates dynamic, visually and semantically accurate video responses. It uses reinforcement learning to align a Vision-Language Model with a Video Diffusion Model, achieving state-of-the-art performance.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16669
• PDF: https://arxiv.org/pdf/2511.16669
• Project Page: https://video-as-answer.github.io/
• Github: https://github.com/KlingTeam/VANS
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#VideoAI #GenerativeAI #MachineLearning #ComputerVision #DeepLearning
✨ Scaling Spatial Intelligence with Multimodal Foundation Models
📝 Summary:
SenseNova-SI is a new scaled multimodal foundation model that achieves superior spatial intelligence. Trained on 8 million diverse data samples, it achieves unprecedented performance on various spatial benchmarks. The models are publicly released to foster further research.
🔹 Publication Date: Published on Nov 17
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.13719
• PDF: https://arxiv.org/pdf/2511.13719
• Project Page: https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B
• Github: https://github.com/OpenSenseNova/SenseNova-SI
🔹 Models citing this paper:
• https://huggingface.co/sensenova/SenseNova-SI-InternVL3-8B
• https://huggingface.co/sensenova/SenseNova-SI-InternVL3-2B
• https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-2B
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#MultimodalAI #FoundationModels #SpatialIntelligence #ComputerVision #AI
✨ First Frame Is the Place to Go for Video Content Customization
📝 Summary:
The first frame in video generation models functions as a conceptual memory buffer, storing visual elements for later reuse. This enables robust video content customization with minimal training examples, without major model changes.
🔹 Publication Date: Published on Nov 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15700
• PDF: https://arxiv.org/pdf/2511.15700
• Project Page: https://firstframego.github.io
• Github: https://firstframego.github.io
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#VideoGeneration #GenerativeAI #ComputerVision #DeepLearning #AICustomization
✨ SAM 3D: 3Dfy Anything in Images
📝 Summary:
SAM 3D reconstructs 3D objects from single images, predicting geometry, texture, and layout. It uses a multi-stage training framework with synthetic pretraining and real-world alignment, breaking the 3D data barrier and achieving high human preference.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16624
• PDF: https://arxiv.org/pdf/2511.16624
• Project Page: https://ai.meta.com/sam3d/
• Github: https://github.com/facebookresearch/sam-3d-objects
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#3DReconstruction #ComputerVision #AI #DeepLearning #SingleImage3D
✨ Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
📝 Summary:
Thinking-while-Generating (TwiG) interleaves textual reasoning throughout the visual generation process. This on-the-fly multimodal interaction guides and reflects on visual content as it is created, resulting in more context-aware and semantically rich outputs.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16671
• PDF: https://arxiv.org/pdf/2511.16671
• Project Page: https://think-while-gen.github.io/
• Github: https://github.com/ZiyuGuo99/Thinking-while-Generating
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#GenerativeAI #MultimodalAI #ComputerVision #NLP #AIResearch
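To make the interleaving concrete, here is a hedged sketch of a think-then-refine loop under assumed interfaces; the reasoner and generator callables are placeholders, not TwiG's actual components.

```python
# Hedged sketch of interleaved reasoning and generation (assumed interfaces,
# not TwiG's implementation): a textual "thought" is produced between partial
# generation stages and fed back as extra conditioning for the next stage.
def generate_with_interleaved_thinking(prompt, reasoner, generator, num_stages=4):
    canvas = None          # partial image or latent state
    thoughts = []
    for _ in range(num_stages):
        thought = reasoner(prompt, canvas, thoughts)   # reflect on what has been generated so far
        thoughts.append(thought)
        canvas = generator(prompt, canvas, thought)    # refine, conditioned on the new thought
    return canvas, thoughts
```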
✨ SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
📝 Summary:
SAM2S is a foundation model enhancing interactive video object segmentation in surgery. It leverages a new large benchmark, robust memory, and temporal learning to achieve superior accuracy (80.42 J&F) and real-time performance in surgical video analysis.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16618
• PDF: https://arxiv.org/pdf/2511.16618
• Project Page: https://jinlab-imvr.github.io/SAM2S
• Github: https://github.com/jinlab-imvr/SAM2S
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#SurgicalAI #MedicalImaging #ComputerVision #FoundationModels #DeepLearning
✨ NaTex: Seamless Texture Generation as Latent Color Diffusion
📝 Summary:
NaTex directly generates 3D textures using latent color diffusion and geometry-aware models. It predicts texture color in 3D space, outperforming prior methods in coherence and alignment by avoiding 2D multi-view limitations.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16317
• PDF: https://arxiv.org/pdf/2511.16317
• Project Page: https://natex-ldm.github.io/
• Github: https://natex-ldm.github.io/
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#TextureGeneration #DiffusionModels #3DGraphics #ComputerVision #DeepLearning
✨ Draft and Refine with Visual Experts
📝 Summary:
The Draft-and-Refine (DnR) framework improves visual grounding in LVLMs. It uses a novel question-conditioned utilization metric to measure reliance on visual evidence. DnR then refines responses with external visual experts, reducing hallucinations and boosting accuracy.
🔹 Publication Date: Published on Nov 14
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.11005
• PDF: https://arxiv.org/pdf/2511.11005
• Github: https://github.com/EavnJeong/Draft-and-Refine-with-Visual-Experts
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#LVLMs #VisualGrounding #AIHallucinations #ComputerVision #DeepLearning
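The question-conditioned utilization metric is not spelled out in this summary; as one hedged illustration of the general idea, the sketch below compares the model's answer distribution with and without the image, so a near-zero divergence flags a draft answer that ignored the visual evidence.

```python
# Illustrative visual-utilization measure (an assumption, not DnR's exact
# metric): if the answer distribution barely changes when the image is removed
# or ablated, the draft answer likely did not rely on visual evidence.
import math

def kl_divergence(p, q, eps=1e-9):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def visual_utilization(probs_with_image, probs_without_image):
    return kl_divergence(probs_with_image, probs_without_image)   # higher = more image-reliant

print(visual_utilization([0.9, 0.1], [0.5, 0.5]))   # ~0.37: answer depends on the image
print(visual_utilization([0.5, 0.5], [0.5, 0.5]))   # 0.0: answer ignores the image
```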
✨ BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks
📝 Summary:
ImageNet accuracy poorly predicts performance on scientific imagery. BioBench is a new ecology vision benchmark unifying diverse tasks, kingdoms, and modalities with 3.1M images, offering a better evaluation for scientific ML.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16315
• PDF: https://arxiv.org/pdf/2511.16315
• Project Page: https://samuelstevens.me/biobench
• Github: https://github.com/samuelstevens/biobench
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#BioBench #MachineLearning #ComputerVision #ScientificML #Ecology
✨ Boosting Medical Visual Understanding From Multi-Granular Language Learning
📝 Summary:
MGLL enhances visual understanding by improving multi-label and cross-granularity alignment in image-text pretraining, outperforming existing methods in complex domains like medical imaging.
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15943
• PDF: https://arxiv.org/pdf/2511.15943
• Project Page: https://github.com/HUANGLIZI/MGLL
• Github: https://github.com/HUANGLIZI/MGLL
==================================
For more data science resources:
➡️ https://t.iss.one/DataScienceT
#MedicalAI #ComputerVision #DeepLearning #NLP #ImageTextPretraining
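The exact MGLL objective isn't given here; as a hedged sketch of multi-granular image-text alignment, the snippet below sums CLIP-style contrastive losses computed against text embeddings at several granularities (the levels and weighting are assumptions for illustration, not the paper's released code).

```python
# Hedged sketch of multi-granular image-text alignment (assumed design, not
# MGLL's implementation): apply a CLIP-style contrastive loss at several text
# granularities (e.g. report-, sentence-, phrase-level) and sum the losses,
# so one image can align with multiple labels and descriptions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def multi_granular_loss(img_emb, txt_embs_per_level, weights=None):
    weights = weights or [1.0] * len(txt_embs_per_level)
    return sum(w * clip_contrastive_loss(img_emb, t) for w, t in zip(weights, txt_embs_per_level))

# Example with random embeddings: one image batch aligned to two text granularities.
img = torch.randn(4, 256)
loss = multi_granular_loss(img, [torch.randn(4, 256), torch.randn(4, 256)])
```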