Data Science | Machine Learning with Python for Researchers

The Data Science and Python channel is for researchers and advanced programmers.

πŸ€–πŸ§  Skyvern: The Future of Browser Automation Powered by AI and Computer Vision

πŸ—“οΈ 16 Nov 2025
πŸ“š AI News & Trends

In today’s fast-evolving digital landscape, automation plays a crucial role in enhancing productivity, efficiency, and innovation. Yet traditional browser automation tools often struggle with complexity, maintenance, and reliability: they rely heavily on DOM parsing, XPaths, and rigid scripts that break easily when websites change their layout. Enter Skyvern, an open-source browser automation platform powered by AI and computer vision.

#Skyvern #BrowserAutomation #AIDriven #ComputerVision #OpenSource #WebAutomation
✨SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

πŸ“ Summary:
SpatialThinker is a new 3D-aware MLLM that uses RL and dense spatial rewards to significantly improve spatial understanding. It integrates structured spatial grounding and multi-step reasoning, outperforming existing models and GPT-4o on spatial VQA and real-world benchmarks.
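
πŸ”Ή Code sketch: the paper's exact reward is not reproduced here; this is a minimal illustration of a dense spatial reward that combines answer correctness with IoU-based grounding quality. The weights and one-to-one box matching are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def spatial_reward(pred_answer, gold_answer, pred_boxes, gold_boxes,
                   w_answer=1.0, w_ground=0.5):
    """Dense RL reward: answer correctness plus grounding quality.
    Illustrative only -- not the paper's exact formulation."""
    answer_r = 1.0 if pred_answer.strip() == gold_answer.strip() else 0.0
    ground_r = (sum(iou(p, g) for p, g in zip(pred_boxes, gold_boxes))
                / len(gold_boxes)) if gold_boxes else 0.0
    return w_answer * answer_r + w_ground * ground_r

print(spatial_reward("left", "left", [(0, 0, 10, 10)], [(1, 1, 10, 10)]))
```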

πŸ”Ή Publication Date: Published on Nov 10

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.07403
β€’ PDF: https://arxiv.org/pdf/2511.07403
β€’ Github: https://github.com/hunarbatra/SpatialThinker

πŸ”Ή Models citing this paper:
β€’ https://huggingface.co/OX-PIXL/SpatialThinker-3B
β€’ https://huggingface.co/OX-PIXL/SpatialThinker-7B

✨ Datasets citing this paper:
β€’ https://huggingface.co/datasets/OX-PIXL/STVQA-7K

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#MultimodalLLM #3DReasoning #ReinforcementLearning #AIResearch #ComputerVision
✨WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

πŸ“ Summary:
WEAVE introduces a suite comprising a large dataset and a benchmark for assessing multi-turn, context-dependent image generation and editing in multimodal models. It elicits new capabilities such as visual memory while exposing current limitations on these complex tasks.
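
πŸ”Ή Code sketch: a minimal multi-turn evaluation harness in the spirit of WEAVE, where each instruction sees the full image history ("visual memory"). The model call is a stub; WEAVE's real data format and scoring are not shown.

```python
def run_episode(model, turns):
    """Feed each instruction together with the full image history,
    so earlier outputs act as visual memory for later edits."""
    history = []  # interleaved (instruction, image) pairs
    for instruction in turns:
        image = model(instruction=instruction, history=history)
        history.append((instruction, image))
    return history

# Stub model: returns a string standing in for a generated image.
stub = lambda instruction, history: f"img({instruction}|ctx={len(history)})"
episode = run_episode(stub, ["draw a red cube", "move it left", "make it glass"])
for turn, (ins, img) in enumerate(episode):
    print(turn, ins, "->", img)
```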

πŸ”Ή Publication Date: Published on Nov 14

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.11434
β€’ PDF: https://arxiv.org/pdf/2511.11434
β€’ Project Page: https://weichow23.github.io/weave/
β€’ Github: https://github.com/weichow23/weave

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#MultimodalAI #ImageGeneration #GenerativeAI #ComputerVision #AIResearch
✨RF-DETR: Neural Architecture Search for Real-Time Detection Transformers

πŸ“ Summary:
RF-DETR is a lightweight detection transformer that leverages weight-sharing NAS to optimize accuracy-latency tradeoffs across diverse datasets. It significantly outperforms the prior state of the art and is the first real-time detector to surpass 60 AP on COCO.
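
πŸ”Ή Code sketch: an illustrative weight-sharing NAS loop that samples sub-networks and keeps the Pareto front over (latency, accuracy). The search space and evaluators are stand-ins, not RF-DETR's actual ones.

```python
import random

SPACE = {"depth": [2, 4, 6], "width": [128, 256, 384], "queries": [100, 300]}

def sample_subnet():
    return {k: random.choice(v) for k, v in SPACE.items()}

def evaluate(subnet):
    # Stand-in proxies: real NAS would run the shared-weight subnet on a
    # validation set and time it on target hardware.
    latency = subnet["depth"] * subnet["width"] * subnet["queries"] * 1e-5
    accuracy = 40 + 0.002 * subnet["width"] + 2 * subnet["depth"] - 0.005 * latency
    return latency, accuracy

def pareto_front(candidates):
    # Keep candidates not dominated by any other (lower latency AND higher accuracy).
    return [c for c in candidates
            if not any(o["lat"] <= c["lat"] and o["acc"] >= c["acc"] and o is not c
                       for o in candidates)]

cands = []
for _ in range(50):
    net = sample_subnet()
    lat, acc = evaluate(net)
    cands.append({"net": net, "lat": lat, "acc": acc})
for c in pareto_front(cands):
    print(round(c["lat"], 2), round(c["acc"], 1), c["net"])
```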

πŸ”Ή Publication Date: Published on Nov 12

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.09554
β€’ PDF: https://arxiv.org/pdf/2511.09554

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#ObjectDetection #ComputerVision #MachineLearning #NeuralArchitectureSearch #Transformers
✨TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

πŸ“ Summary:
TiViBench is a new benchmark assessing image-to-video models' reasoning across four dimensions and 24 tasks. Commercial models show stronger reasoning potential. VideoTPO, a test-time strategy, significantly enhances performance, advancing reasoning in video generation.
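
πŸ”Ή Code sketch: a hedged guess at a test-time strategy like VideoTPO: generate several candidates, run pairwise preference judgments, and keep the most-preferred one. Generator and judge are stubs, not the paper's components.

```python
from itertools import combinations
import random

def best_of_n(generate, prefer, prompt, n=4):
    """Generate n candidates, tally pairwise preference wins, return the winner."""
    candidates = [generate(prompt) for _ in range(n)]
    wins = [0] * n
    for i, j in combinations(range(n), 2):
        winner = i if prefer(candidates[i], candidates[j]) else j
        wins[winner] += 1
    return candidates[max(range(n), key=wins.__getitem__)]

gen = lambda p: {"prompt": p, "score": random.random()}   # stub video generator
judge = lambda a, b: a["score"] > b["score"]              # stub preference judge
print(best_of_n(gen, judge, "ball rolls through the maze"))
```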

πŸ”Ή Publication Date: Published on Nov 17

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.13704
β€’ PDF: https://arxiv.org/pdf/2511.13704
β€’ Project Page: https://haroldchen19.github.io/TiViBench-Page/

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#VideoGeneration #AIBenchmark #ComputerVision #DeepLearning #AIResearch
✨PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image

πŸ“ Summary:
PhysX-Anything generates simulation-ready physical 3D assets from single images, crucial for embodied AI. It uses a novel VLM-based model and an efficient 3D representation, enabling direct use in robotic policy learning.
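
πŸ”Ή Code sketch: "simulation-ready" implies geometry plus physical metadata. This dataclass is an illustrative guess at such a schema; PhysX-Anything's actual representation may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Joint:
    name: str
    joint_type: str          # e.g. "revolute" or "prismatic"
    limit: tuple             # (lower, upper) in radians or meters

@dataclass
class PhysicalAsset:
    mesh_path: str           # geometry, e.g. exported mesh or URDF parts
    mass_kg: float
    friction: float
    joints: list = field(default_factory=list)

# An articulated object a robot policy could load directly into a simulator.
cabinet = PhysicalAsset("cabinet.obj", mass_kg=12.0, friction=0.6,
                        joints=[Joint("door", "revolute", (0.0, 1.57))])
print(cabinet)
```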

πŸ”Ή Publication Date: Published on Nov 17

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.13648
β€’ PDF: https://arxiv.org/pdf/2511.13648
β€’ Project Page: https://physx-anything.github.io/
β€’ Github: https://github.com/ziangcao0312/PhysX-Anything

✨ Datasets citing this paper:
β€’ https://huggingface.co/datasets/Caoza/PhysX-Mobility

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#EmbodiedAI #3DReconstruction #Robotics #ComputerVision #AIResearch
✨Part-X-MLLM: Part-aware 3D Multimodal Large Language Model

πŸ“ Summary:
Part-X-MLLM is a 3D multimodal large language model that unifies diverse 3D tasks by generating structured programs from RGB point clouds and language prompts. It outputs part-level data and edit commands, enabling state-of-the-art 3D generation and editing through one interface.
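
πŸ”Ή Code sketch: an illustrative part-level "program" of the kind described, i.e. parts plus edit commands expressed as structured data, with a tiny interpreter. Field names are assumptions, not the paper's grammar.

```python
import json

program = {
    "object": "office_chair",
    "parts": [
        {"name": "seat",  "bbox": [0.0, 0.4, 0.0, 0.5, 0.5, 0.5]},
        {"name": "wheel", "bbox": [0.1, 0.0, 0.1, 0.2, 0.1, 0.2]},
    ],
    "edits": [
        {"op": "scale", "part": "seat", "factor": 1.2},
        {"op": "delete", "part": "wheel"},
    ],
}

def apply_edits(prog):
    """Execute part-level edit commands against the part list."""
    parts = {p["name"]: p for p in prog["parts"]}
    for e in prog["edits"]:
        if e["op"] == "delete":
            parts.pop(e["part"], None)
        elif e["op"] == "scale":
            parts[e["part"]]["bbox"] = [v * e["factor"]
                                        for v in parts[e["part"]]["bbox"]]
    return list(parts.values())

print(json.dumps(apply_edits(program), indent=2))
```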

πŸ”Ή Publication Date: Published on Nov 17

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.13647
β€’ PDF: https://arxiv.org/pdf/2511.13647
β€’ Project Page: https://chunshi.wang/Part-X-MLLM/

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#3D #MLLM #GenerativeAI #ComputerVision #AIResearch
✨Back to Basics: Let Denoising Generative Models Denoise

πŸ“ Summary:
Denoising diffusion models should predict clean images directly, not noise, leveraging the data manifold assumption. The paper introduces JiT, a model using simple, large-patch Transformers that achieves competitive generative results on ImageNet.
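
πŸ”Ή Code sketch: the core switch the paper argues for, shown as one training-loss choice: regress the clean image x0 instead of the added noise. The interpolation path and network here are stand-ins, not JiT's large-patch Transformer.

```python
import torch

def diffusion_step(net, x0, loss_on="x0"):
    """One training step; loss_on selects x0-prediction vs noise-prediction."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], 1)          # noise level in [0, 1]
    xt = (1 - t) * x0 + t * noise           # simple linear interpolation path
    pred = net(xt, t)
    target = x0 if loss_on == "x0" else noise
    return torch.mean((pred - target) ** 2)

net = lambda xt, t: xt                       # dummy stand-in "network"
x0 = torch.randn(8, 16)
print(diffusion_step(net, x0, loss_on="x0").item())
```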

πŸ”Ή Publication Date: Published on Nov 17

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.13720
β€’ PDF: https://arxiv.org/pdf/2511.13720
β€’ Github: https://github.com/LTH14/JiT

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#DiffusionModels #GenerativeAI #DeepLearning #ComputerVision #AIResearch
✨UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity

πŸ“ Summary:
UnSAMv2 enables continuous control of segmentation granularity for the SAM model without human annotations. It uses self-supervised learning on unlabeled data to discover mask-granularity pairs and a novel control embedding. UnSAMv2 significantly enhances SAM-2's performance across various segmentation tasks.
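
πŸ”Ή Code sketch: the idea of a granularity control embedding, conditioning a mask decoder on a continuous scalar. Module sizes and the injection point are assumptions, not UnSAMv2's architecture.

```python
import torch
import torch.nn as nn

class GranularityConditionedDecoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(1, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, 1)        # stand-in for a real mask head

    def forward(self, tokens, granularity):
        # tokens: (batch, n, dim); granularity: (batch,) scalars in [0, 1],
        # where 0 ~ coarse whole-object masks and 1 ~ fine part masks.
        g = self.embed(granularity.unsqueeze(-1)).unsqueeze(1)  # (batch, 1, dim)
        return self.head(tokens + g).squeeze(-1)                # per-token mask logits

dec = GranularityConditionedDecoder()
tokens = torch.randn(2, 64, 256)
print(dec(tokens, torch.tensor([0.1, 0.9])).shape)  # torch.Size([2, 64])
```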

πŸ”Ή Publication Date: Published on Nov 17

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.13714
β€’ PDF: https://arxiv.org/pdf/2511.13714
β€’ Project Page: https://yujunwei04.github.io/UnSAMv2-Project-Page/
β€’ Github: https://github.com/yujunwei04/UnSAMv2

✨ Spaces citing this paper:
β€’ https://huggingface.co/spaces/yujunwei04/UnSAMv2

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#AI #ComputerVision #SelfSupervisedLearning #ImageSegmentation #DeepLearning
✨Error-Driven Scene Editing for 3D Grounding in Large Language Models

πŸ“ Summary:
DEER-3D improves 3D LLM grounding by iteratively editing and retraining models. It diagnoses predicate-level errors, then generates targeted 3D scene edits as counterfactuals to enhance spatial understanding and accuracy.
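
πŸ”Ή Code sketch: the diagnose-edit-retrain loop in skeleton form. Every component is a stub standing in for DEER-3D's predicate diagnosis, counterfactual scene editor, and fine-tuning step.

```python
import random

class StubGrounder:
    """Stand-in for a 3D LLM grounder; skill rises as it is fine-tuned."""
    def __init__(self):
        self.skill = 0.5
    def grounds_correctly(self, scene, query):
        return random.random() < self.skill
    def make_counterfactual(self, scene, query):
        # Real system: edit the scene to isolate the failed predicate.
        return {"scene": scene, "edited_for": query}
    def finetune(self, edited_scenes):
        self.skill = min(1.0, self.skill + 0.05 * len(edited_scenes) / 10)

def error_driven_training(model, scenes, rounds=3):
    for r in range(rounds):
        failures = [(s, q) for s in scenes for q in ["left of", "behind"]
                    if not model.grounds_correctly(s, q)]
        model.finetune([model.make_counterfactual(s, q) for s, q in failures])
        print(f"round {r}: {len(failures)} predicate-level failures")

random.seed(0)
error_driven_training(StubGrounder(), scenes=list(range(20)))
```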

πŸ”Ή Publication Date: Published on Nov 18

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.14086
β€’ PDF: https://arxiv.org/pdf/2511.14086
β€’ Github: https://github.com/zhangyuejoslin/Deer-3D

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#LLMs #3DGrounding #ComputerVision #DeepLearning #AIResearch
✨Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

πŸ“ Summary:
Orion is a visual agent framework that orchestrates specialized computer vision tools to execute complex visual workflows. It achieves competitive performance on benchmarks and enables autonomous, tool-driven visual reasoning.
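
πŸ”Ή Code sketch: the orchestration pattern in miniature: a controller routes sub-tasks to specialized vision tools and collects their outputs. Tool names and the fixed plan are illustrative, not Orion's design.

```python
# Mock specialized tools; a real agent would wrap detectors, OCR, captioners.
TOOLS = {
    "detect":  lambda img: [{"label": "cat", "box": (10, 10, 50, 50)}],
    "ocr":     lambda img: "EXIT",
    "caption": lambda img: "a cat near an exit sign",
}

def run_workflow(image, plan):
    """plan: ordered list of tool names produced by a planner (here, fixed)."""
    results = {}
    for step in plan:
        results[step] = TOOLS[step](image)
    return results

print(run_workflow("frame.png", plan=["detect", "ocr", "caption"]))
```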

πŸ”Ή Publication Date: Published on Nov 18

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.14210
β€’ PDF: https://arxiv.org/pdf/2511.14210

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#ComputerVision #AIagents #VisualReasoning #MultimodalAI #DeepLearning
✨A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

πŸ“ Summary:
CoTyle introduces code-to-style image generation, creating consistent visual styles from numerical codes. It is the first open-source academic method for this task, using a discrete style codebook and a text-to-image diffusion model for diverse, reproducible styles.
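
πŸ”Ή Code sketch: the core mechanism as summarized: an integer style code indexes a discrete codebook, and the resulting embedding conditions a text-to-image diffusion model (omitted here). Sizes are illustrative.

```python
import torch
import torch.nn as nn

NUM_STYLES, STYLE_DIM = 1024, 768
style_codebook = nn.Embedding(NUM_STYLES, STYLE_DIM)

def style_condition(style_code: int) -> torch.Tensor:
    """Same code in -> same style embedding out, which is what makes a
    generated style reproducible across different prompts."""
    return style_codebook(torch.tensor([style_code]))

emb = style_condition(42)   # would be injected as diffusion conditioning
print(emb.shape)            # torch.Size([1, 768])
```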

πŸ”Ή Publication Date: Published on Nov 13

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.10555
β€’ PDF: https://arxiv.org/pdf/2511.10555
β€’ Project Page: https://Kwai-Kolors.github.io/CoTyle/
β€’ Github: https://github.com/Kwai-Kolors/CoTyle

✨ Spaces citing this paper:
β€’ https://huggingface.co/spaces/Kwai-Kolors/CoTyle

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#ImageGeneration #DiffusionModels #NeuralStyle #ComputerVision #DeepLearning
✨MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

πŸ“ Summary:
MVI-Bench introduces a new benchmark for evaluating Large Vision-Language Models' robustness to misleading visual inputs. It uses a hierarchical taxonomy and a novel metric to uncover significant vulnerabilities in state-of-the-art LVLMs.
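
πŸ”Ή Code sketch: the paper's metric is not reproduced here; a natural robustness score for such a benchmark is accuracy retention between clean and misleading variants of the same questions.

```python
def accuracy(records):
    return sum(r["correct"] for r in records) / len(records)

def robustness_retention(clean, misleading):
    """Fraction of clean-input accuracy retained under misleading inputs."""
    return accuracy(misleading) / accuracy(clean)

clean = [{"correct": True}] * 9 + [{"correct": False}]
misleading = [{"correct": True}] * 6 + [{"correct": False}] * 4
print(f"retention: {robustness_retention(clean, misleading):.2f}")  # 0.67
```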

πŸ”Ή Publication Date: Published on Nov 18

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.14159
β€’ PDF: https://arxiv.org/pdf/2511.14159
β€’ Github: https://github.com/chenyil6/MVI-Bench

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#LVLMs #ComputerVision #AIrobustness #MachineLearning #AI
✨REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

πŸ“ Summary:
Text-only self-reflection is insufficient for long-form video understanding. REVISOR is a new framework enabling MLLMs to perform multimodal introspective reflection across text and visual modalities. This significantly enhances reasoning over long videos without extra fine-tuning, achieving strong results.

πŸ”Ή Publication Date: Published on Nov 17

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.13026
β€’ PDF: https://arxiv.org/pdf/2511.13026

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#MultimodalAI #VideoUnderstanding #MLLMs #AIResearch #ComputerVision
✨Φeat: Physically-Grounded Feature Representation

πŸ“ Summary:
Φeat is a new self-supervised visual backbone that captures material identity, such as reflectance and mesostructure. It learns robust features invariant to external physical factors such as shape and lighting, promoting physics-aware perception.
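
πŸ”Ή Code sketch: one plausible reading of the invariance objective: a generic InfoNCE loss pulling together two views of the same material under different shape and lighting. Not necessarily Φeat's exact training loss.

```python
import torch
import torch.nn.functional as F

def material_contrastive_loss(feats_a, feats_b, temperature=0.07):
    """feats_a[i] and feats_b[i] come from the same material with nuisance
    factors (shape, lighting) varied; off-diagonal pairs are negatives."""
    a = F.normalize(feats_a, dim=1)
    b = F.normalize(feats_b, dim=1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.shape[0])
    return F.cross_entropy(logits, targets)

feats_a, feats_b = torch.randn(16, 128), torch.randn(16, 128)
print(material_contrastive_loss(feats_a, feats_b).item())
```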

πŸ”Ή Publication Date: Published on Nov 14

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.11270
β€’ PDF: https://arxiv.org/pdf/2511.11270

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#ComputerVision #SelfSupervisedLearning #DeepLearning #FeatureLearning #PhysicsAwareAI
✨VIDEOP2R: Video Understanding from Perception to Reasoning

πŸ“ Summary:
VideoP2R is a novel reinforcement fine-tuning framework for video understanding. It separately models perception and reasoning processes, using a new CoT dataset and a process-aware RL algorithm. This approach achieves state-of-the-art results on video reasoning benchmarks.
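
πŸ”Ή Code sketch: the perception/reasoning split with process-aware rewards, scoring each stage separately rather than only the final answer. All scoring rules are stand-ins for the paper's RL algorithm.

```python
def process_aware_reward(output, gold, w_perc=0.5, w_reason=0.5):
    """output: {"perception": str, "reasoning": str, "answer": str}.
    Reward the perception step and the final answer independently."""
    perc_r = float(gold["key_fact"] in output["perception"])
    reason_r = float(output["answer"] == gold["answer"])
    return w_perc * perc_r + w_reason * reason_r

out = {"perception": "a man picks up a red cup",
       "reasoning": "he lifts it, so he will drink", "answer": "drink"}
gold = {"key_fact": "red cup", "answer": "drink"}
print(process_aware_reward(out, gold))  # 1.0
```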

πŸ”Ή Publication Date: Published on Nov 14

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.11113v1
β€’ PDF: https://arxiv.org/pdf/2511.11113

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#VideoUnderstanding #ReinforcementLearning #AIResearch #ComputerVision #Reasoning
✨Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks

πŸ“ Summary:
VR-Bench evaluates video models' spatial reasoning using maze-solving tasks. It demonstrates that video models excel in spatial perception and reasoning, outperforming VLMs, and benefit from diverse sampling during inference. These findings show the strong potential of reasoning via video for spatial tasks.
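
πŸ”Ή Code sketch: maze-solving evaluation reduces to validating a predicted move sequence against the grid; a checker like this is implied by the task, though VR-Bench's actual format may differ. 0 = free cell, 1 = wall.

```python
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def path_is_valid(grid, start, goal, moves):
    r, c = start
    for m in moves:
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[0])) or grid[r][c]:
            return False                     # left the maze or hit a wall
    return (r, c) == goal

maze = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
print(path_is_valid(maze, (0, 0), (0, 2), "DDRRUU"))  # True
```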

πŸ”Ή Publication Date: Published on Nov 19

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.15065
β€’ PDF: https://arxiv.org/pdf/2511.15065
β€’ Project Page: https://imyangc7.github.io/VRBench_Web/
β€’ Github: https://github.com/ImYangC7/VR-Bench

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#VideoModels #AIReasoning #SpatialAI #ComputerVision #MachineLearning
✨ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries

πŸ“ Summary:
ARC-Chapter is a large-scale video chaptering model trained on millions of long-video chapters, using a new bilingual and hierarchical dataset. It introduces a novel evaluation metric, GRACE, to better reflect real-world chaptering, and achieves state-of-the-art performance.

πŸ”Ή Publication Date: Published on Nov 18

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.14349
β€’ PDF: https://arxiv.org/pdf/2511.14349
β€’ Project Page: https://arcchapter.github.io/index_en.html
β€’ Github: https://github.com/TencentARC/ARC-Chapter

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#VideoChaptering #AI #MachineLearning #VideoSummarization #ComputerVision
✨Medal S: Spatio-Textual Prompt Model for Medical Segmentation

πŸ“ Summary:
Medal S is a medical segmentation foundation model using spatio-textual prompts for efficient, high-accuracy multi-class segmentation across diverse modalities. It uniquely aligns volumetric prompts with text embeddings and processes masks in parallel, significantly outperforming prior methods.

πŸ”Ή Publication Date: Published on Nov 17

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2511.13001
β€’ PDF: https://arxiv.org/pdf/2511.13001
β€’ Github: https://github.com/yinghemedical/Medal-S

πŸ”Ή Models citing this paper:
β€’ https://huggingface.co/spc819/Medal-S-V1.0

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#MedicalSegmentation #FoundationModels #AI #DeepLearning #ComputerVision
✨OmniParser for Pure Vision Based GUI Agent

πŸ“ Summary:
OmniParser enhances GPT-4V's ability to act as a GUI agent by improving screen parsing. It identifies interactable icons and understands element semantics using specialized models. This significantly boosts GPT-4V's performance on benchmarks like ScreenSpot, Mind2Web, and AITW.
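
πŸ”Ή Code sketch: the parse-then-prompt pattern the summary describes: a screen parser yields a structured list of interactable elements that is serialized into the VLM prompt. The parser output is mocked here; see the OmniParser repo for the real models.

```python
def parse_screen(screenshot):
    # Real pipeline: icon detector + caption model over the screenshot.
    return [
        {"id": 0, "role": "button", "text": "Submit", "box": (400, 560, 480, 590)},
        {"id": 1, "role": "textbox", "text": "Email", "box": (400, 500, 640, 530)},
    ]

def build_agent_prompt(task, elements):
    """Serialize parsed elements so the VLM can act by element id."""
    lines = [f"[{e['id']}] {e['role']} '{e['text']}' at {e['box']}" for e in elements]
    return (f"Task: {task}\nInteractable elements:\n" + "\n".join(lines)
            + "\nRespond with the id of the element to click.")

print(build_agent_prompt("sign up with my email", parse_screen("screen.png")))
```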

πŸ”Ή Publication Date: Published on Aug 1, 2024

πŸ”Ή Paper Links:
β€’ arXiv Page: https://arxiv.org/abs/2408.00203
β€’ PDF: https://arxiv.org/pdf/2408.00203
β€’ Github: https://github.com/microsoft/omniparser

πŸ”Ή Models citing this paper:
β€’ https://huggingface.co/microsoft/OmniParser
β€’ https://huggingface.co/microsoft/OmniParser-v2.0
β€’ https://huggingface.co/banao-tech/OmniParser

✨ Datasets citing this paper:
β€’ https://huggingface.co/datasets/mlfoundations/Click-100k

✨ Spaces citing this paper:
β€’ https://huggingface.co/spaces/callmeumer/OmniParser-v2
β€’ https://huggingface.co/spaces/nofl/OmniParser-v2
β€’ https://huggingface.co/spaces/SheldonLe/OmniParser-v2

==================================

For more data science resources:
βœ“ https://t.iss.one/DataScienceT

#GUIagents #ComputerVision #GPT4V #AIagents #DeepLearning