Code-Agentic Education
Show Lab unveils Code2Video: an agentic, code-centric framework that generates HQ educational videos from knowledge points, with clarity, coherence & reproducibility. Repo under MIT. A toy sketch of the code-centric idea follows the links below.
Review https://t.ly/Fv4LJ
Paper https://arxiv.org/pdf/2510.01174
Repo https://github.com/showlab/Code2Video/
Project https://showlab.github.io/Code2Video/
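To make "code-centric" concrete, here is a minimal sketch under stated assumptions: the `scene_code_for` stub stands in for the agent/LLM call, and the Manim scene plus CLI render are only an illustrative backend, not Code2Video's actual pipeline. The point is that the video is an executable artifact, hence reproducible and editable.

```python
# Hypothetical sketch (NOT Code2Video's agent design): an LLM turns a
# knowledge point into executable Manim code, which renders deterministically.
import subprocess
import textwrap

def scene_code_for(knowledge_point: str) -> str:
    # Stand-in for the LLM call; a real agent would generate this body.
    return textwrap.dedent(f'''
        from manim import Scene, Text, Write

        class Lesson(Scene):
            def construct(self):
                title = Text({knowledge_point!r})
                self.play(Write(title))
                self.wait(2)
    ''')

def render(knowledge_point: str, out: str = "lesson.py") -> None:
    with open(out, "w") as f:
        f.write(scene_code_for(knowledge_point))
    # ManimCE CLI: low-quality fast preview of the Lesson scene.
    subprocess.run(["manim", "-ql", out, "Lesson"], check=True)

if __name__ == "__main__":
    render("The Pythagorean theorem: a^2 + b^2 = c^2")
```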
Clink! Chop! Thud!
Sounding Object Detection: while an environment may contain many objects, only a few are directly involved in producing sound during an interaction. This model detects the sounding object in a video. Code/Data announced.
Review https://t.ly/VK_1h
Paper https://lnkd.in/depNjVXm
Project https://lnkd.in/dF63EZFG
Repo TBA
A proof I'm not a bot...
My (short) interview with one of the biggest Italian media outlets: AI in 2016, HPC / Quantum, and how I created my startup: https://www.linkedin.com/posts/visionarynet_ai-itw25-ai-activity-7381215486115643392-t7an
Thanks for the support (and, of course, a new paper is coming in a few hours).
Visual Grounding RVOS
ReferDINO is a strong referring video object segmentation (RVOS) model that inherits region-level vision-language alignment from foundational visual grounding models, further endowed with pixel-level dense perception & cross-modal spatio-temporal reasoning. Code, Demo & checkpoints.
Review https://t.ly/rOdkP
Paper https://lnkd.in/efuAFQdE
Project https://lnkd.in/dK3wMZqv
Repo https://lnkd.in/d3i2PsNF
Pixel-Perfect Depth (SOTA)
Pixel-Perfect Depth is a monocular depth estimation model built on pixel-space diffusion transformers. New SOTA. Repo under Apache 2.0. A generic sketch of what pixel-space diffusion sampling looks like follows the links below.
Review https://t.ly/75PGo
Paper https://lnkd.in/d8wxFpyY
Project https://lnkd.in/dV5HhsqH
Repo https://lnkd.in/d9JKFBJq
Demo https://lnkd.in/d3wBkKJ9
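For intuition, this is the generic DDPM-style ancestral sampling loop operating directly on a full-resolution depth map, conditioned on the RGB image. The `eps_model` is a hypothetical stand-in for the paper's diffusion transformer; schedule and step count are illustrative, not the paper's settings.

```python
# Generic pixel-space diffusion sampling for depth (illustrative only).
import torch

def sample_depth(eps_model, image, steps=50):
    # image: (B, 3, H, W); the depth map lives at the same pixel resolution.
    b, _, h, w = image.shape
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(b, 1, h, w)              # start from pure-noise depth
    for t in reversed(range(steps)):
        eps = eps_model(x, image, t)          # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                  # (B, 1, H, W) depth estimate

# Smoke test with a dummy noise-predictor (the real model is a transformer).
dummy = lambda x, img, t: torch.zeros_like(x)
print(sample_depth(dummy, torch.randn(1, 3, 64, 64), steps=10).shape)
```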
TrackVLA++ Visual Tracking
TrackVLA++ is a novel Vision-Language-Action model that incorporates spatial reasoning and a target-identification memory, enabling SOTA performance in both long-horizon and highly crowded tracking scenarios. Model announced.
Review https://t.ly/ruYzc
Paper https://arxiv.org/pdf/2510.07134
Project pku-epic.github.io/TrackVLA-plus-plus-Web/
Repo TBA
Detect Anything via MLLM
Rex-Omni is a 3B-parameter multimodal model that unifies visual perception tasks, including object detection, OCR, pointing, keypointing & visual prompting, into a single next-point-prediction framework. Impressive results. Repo under IDEA License 1.0. A toy sketch of coordinates-as-tokens follows the links below.
Review https://t.ly/DCTk_
Paper https://lnkd.in/d4VDD-9j
Project https://lnkd.in/d6unEyvq
Repo https://lnkd.in/dkYJFe-x
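The unifying trick behind "everything is next-point prediction" is serializing geometric outputs as discrete coordinate tokens an LLM can emit autoregressively. The bin count and token layout below are illustrative assumptions, not Rex-Omni's actual vocabulary.

```python
# Toy coordinate quantization: boxes, keypoints, and pointing all reduce
# to sequences of discrete point tokens (illustrative, not Rex-Omni's spec).
NUM_BINS = 1000  # quantize x and y into 1000 bins each

def point_to_tokens(x, y, w, h):
    # Normalize to [0, 1), then quantize to integer bins -> two "tokens".
    bx = min(int(x / w * NUM_BINS), NUM_BINS - 1)
    by = min(int(y / h * NUM_BINS), NUM_BINS - 1)
    return [f"<x_{bx}>", f"<y_{by}>"]

def box_to_tokens(x0, y0, x1, y1, w, h):
    # A box is just two points, so detection shares the point vocabulary.
    return point_to_tokens(x0, y0, w, h) + point_to_tokens(x1, y1, w, h)

def tokens_to_point(tokens, w, h):
    bx = int(tokens[0].strip("<>").split("_")[1])
    by = int(tokens[1].strip("<>").split("_")[1])
    # Dequantize to the bin center in pixel coordinates.
    return ((bx + 0.5) / NUM_BINS * w, (by + 0.5) / NUM_BINS * h)

print(box_to_tokens(32, 48, 128, 256, w=640, h=480))
# ['<x_50>', '<y_100>', '<x_200>', '<y_533>']
```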
Universal Feature Up-Sampling
AnyUp is a novel method for feature up-sampling that can be applied to ANY vision feature at ANY resolution, without encoder-specific training: an inference-time, feature-agnostic up-sampling architecture that improves up-sampling quality. Repo CC-4.0. A baseline sketch of the operation follows the links below.
Review https://t.ly/HvEw9
Paper https://arxiv.org/pdf/2510.12764
Project https://wimmerth.github.io/anyup/
Repo https://github.com/wimmerth/anyup
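For context, "feature up-sampling" means lifting low-resolution encoder features (e.g., ViT patch maps) to pixel resolution. The sketch below is the plain bilinear baseline, the operation a learned, feature-agnostic up-sampler like AnyUp improves on; it is not AnyUp's architecture.

```python
# Baseline feature up-sampling: fixed bilinear interpolation over the
# channel grid. A learned up-sampler replaces this fixed kernel.
import torch
import torch.nn.functional as F

def upsample_features(feats: torch.Tensor, size: tuple) -> torch.Tensor:
    """feats: (B, C, h, w) low-res features -> (B, C, H, W) at `size`."""
    return F.interpolate(feats, size=size, mode="bilinear", align_corners=False)

# e.g., 768-dim ViT features on a 14x14 patch grid, lifted to 224x224:
feats = torch.randn(1, 768, 14, 14)
hires = upsample_features(feats, (224, 224))
print(hires.shape)  # torch.Size([1, 768, 224, 224])
```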
City-Tour -> Simulation
UrbanVerse is a novel system that converts real-world urban scenes from city-tour videos into physics-aware, interactive simulation environments, enabling scalable robot learning in urban spaces with real-world generalization. Repo & Data announced.
Review https://t.ly/UvXNS
Paper https://arxiv.org/pdf/2510.15018
Project https://urbanverseproject.github.io/
Repo TBA
All-in-One Dense Keypoints
DeepDetect is a novel all-in-one dense keypoint detector that unifies the strengths of SIFT, ORB, BRISK, FAST, AGAST, Harris, Shi-Tomasi, Canny & Sobel into a single neural net. DAMN ROMANTIC. Repo under MIT. A plausible sketch of fused classical supervision follows the links below.
Review https://t.ly/VKGct
Paper https://arxiv.org/pdf/2510.17422
Repo https://github.com/saktx/DeepDetect
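One plausible way such a network gets its training signal is by fusing classical detector responses into a single dense keypoint map to regress. The sketch below uses standard OpenCV detectors (a subset of those listed) and is our illustration of the idea, not DeepDetect's actual label pipeline.

```python
# Fuse classical detector responses (SIFT, ORB, FAST, Harris, Canny) into
# one dense keypoint map -- a plausible distillation target (illustrative).
import cv2
import numpy as np

def fused_keypoint_map(gray: np.ndarray) -> np.ndarray:
    h, w = gray.shape
    heat = np.zeros((h, w), dtype=np.float32)
    # Sparse detectors: splat each detected keypoint into the map.
    for det in (cv2.SIFT_create(), cv2.ORB_create(),
                cv2.FastFeatureDetector_create()):
        for kp in det.detect(gray, None):
            x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
            heat[min(y, h - 1), min(x, w - 1)] += 1.0
    # Dense responses: Harris corner strength and Canny edges, in [0, 1].
    harris = cv2.cornerHarris(np.float32(gray), 2, 3, 0.04)
    heat += np.clip(harris, 0, None) / (np.abs(harris).max() + 1e-8)
    heat += cv2.Canny(gray, 100, 200).astype(np.float32) / 255.0
    return heat / (heat.max() + 1e-8)   # normalized dense supervision map

# Self-contained demo on a synthetic image with strong corners and edges.
img = np.zeros((240, 320), np.uint8)
cv2.rectangle(img, (80, 60), (240, 180), 255, -1)
print(fused_keypoint_map(img).shape)  # (240, 320)
```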
SAM 2++: Track Anything
SAM 2++ is a novel unified model for tracking at any granularity, including masks, boxes, and points. Impressive results, but no code announced.
Review https://t.ly/I392_
Paper arxiv.org/pdf/2510.18822
Project tracking-any-granularity.github.io/
Repo :(
UrbanVerse update: the repo (pretty empty) is now online: https://github.com/OatmealLiu/UrbanVerse
Omni Driving Models
OmniNWM is a unified panoramic navigation world model that advances autonomous driving by jointly generating multi-modal states (RGB, semantics, depth, 3D occupancy), enabling precise action control & facilitating closed-loop evaluation through occupancy-based dense rewards. Repo under Apache 2.0. An illustrative occupancy-reward sketch follows the links below.
Review https://t.ly/ktXvz
Paper https://lnkd.in/eFKSZnrc
Project https://lnkd.in/eSDfccv8
Repo https://lnkd.in/efCSvjtp
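To unpack "occupancy-based dense rewards": a predicted 3D occupancy grid lets you score every planned ego waypoint, not just the episode outcome. The sketch below is our reading of that idea with made-up grid sizes and resolution, not OmniNWM's exact reward formulation.

```python
# Illustrative occupancy-based dense reward: penalize planned ego waypoints
# that land in voxels the world model predicts as occupied (assumption).
import numpy as np

def occupancy_reward(occ: np.ndarray, waypoints: np.ndarray, res: float = 0.5) -> float:
    """occ: (X, Y, Z) predicted occupancy in {0, 1}; waypoints: (N, 3) meters."""
    idx = np.floor(waypoints / res).astype(int)
    idx = np.clip(idx, 0, np.array(occ.shape) - 1)  # stay inside the grid
    hits = occ[idx[:, 0], idx[:, 1], idx[:, 2]].sum()
    return -float(hits)  # one dense penalty per colliding waypoint

occ = np.zeros((40, 40, 8), dtype=np.uint8)
occ[12, 10, 1] = 1                                   # a predicted obstacle
path = np.array([[5.0, 5.0, 0.5], [6.2, 5.1, 0.5]])
print(occupancy_reward(occ, path))                   # -1.0: 2nd waypoint collides
```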
ITTO: Protocol for Dynamic Tracking
ITTO, by Caltech, is a novel long-range tracking benchmark suite for evaluating and diagnosing tracking methods on complex, long-range motions. Repo under CC BY-NC 4.0.
Review https://t.ly/tN84a
Paper https://arxiv.org/pdf/2510.19819
Project https://glab-caltech.github.io/ITTO/
Repo https://github.com/ilonadem/itto
Character Mixing Generation
MBZUAI unveils the first-ever video-gen system able to preserve character identity, behavior & original style while generating plausible interactions between characters that have never coexisted, from cartoons (We Bare Bears, Tom & Jerry) to realistic humans (Mr. Bean, Young Sheldon).
Review https://t.ly/tN84a
Paper https://lnkd.in/dhKMwukv
Project https://lnkd.in/dBkJs48h
Repo https://lnkd.in/dw_uzgAk
Generative Point Tracking w/ FM
Generative Point Tracker (GenPT) is a novel generative framework for modelling multi-modal point trajectories, able to capture the inherent multi-modality of point tracks. Repo under MIT. A minimal flow-matching sketch follows the links below.
Review https://t.ly/MMFrt
Paper https://arxiv.org/pdf/2510.20951
Project mtesfaldet.net/genpt_projpage/
Repo https://github.com/tesfaldet/genpt
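The "FM" in the title is flow matching. Below is a minimal, self-contained flow-matching training step on toy 2D trajectories, illustrating the objective family only; the tiny MLP, toy data, and lack of video conditioning are our assumptions, not GenPT's architecture.

```python
# Minimal flow-matching step on toy 2D point trajectories (illustrative).
import torch
import torch.nn as nn

T, D = 16, 2                       # a trajectory: 16 timesteps of (x, y)
velocity_net = nn.Sequential(      # predicts velocity from (x_t, t)
    nn.Linear(T * D + 1, 256), nn.ReLU(), nn.Linear(256, T * D)
)
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def fm_step(trajectories: torch.Tensor) -> float:
    """trajectories: (B, T, D) ground-truth point tracks."""
    b = trajectories.shape[0]
    x1 = trajectories.reshape(b, -1)           # data sample
    x0 = torch.randn_like(x1)                  # noise sample
    t = torch.rand(b, 1)                       # random interpolation time
    xt = (1 - t) * x0 + t * x1                 # straight-line probability path
    target_v = x1 - x0                         # constant target velocity
    pred_v = velocity_net(torch.cat([xt, t], dim=1))
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(fm_step(torch.randn(32, T, D)))  # one step on random toy tracks
```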
Unified Region-Level MLLM
PixelRefer is a unified multimodal LLM framework that supports precise, region-specific understanding in both static images and dynamic videos, overcoming the holistic, scene-level bias of prior MLLMs. SOTA results. Demo, Repo & Dataset available. A sketch of generic region-token prompting follows the links below.
Review https://t.ly/WH4dQ
Paper arxiv.org/pdf/2510.23603
Project circleradon.github.io/PixelRefer
Repo https://github.com/alibaba-damo-academy/PixelRefer
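A common recipe for region-level MLLMs is to ROI-pool vision features inside a user-given box into "region tokens" that are spliced into the LLM input alongside the text. The sketch below shows that generic recipe with made-up sizes (336-px input, 24x24 feature grid); it is an assumption, not PixelRefer's actual design.

```python
# Generic region-token prompting sketch (illustrative assumption).
import torch
from torchvision.ops import roi_align

feats = torch.randn(1, 1024, 24, 24)   # vision-encoder feature map
# Box format: (batch_index, x0, y0, x1, y1) in input-image pixels.
box = torch.tensor([[0.0, 64.0, 32.0, 160.0, 128.0]])
region = roi_align(feats, box, output_size=(2, 2), spatial_scale=24 / 336)
region_tokens = region.flatten(2).permute(0, 2, 1)  # (1, 4, 1024)
print(region_tokens.shape)  # these 4 "region tokens" would join the text prompt
```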
PlanarTrack: Large Planar Tracking
PlanarTrack is a large-scale, HQ, challenging benchmark for planar tracking: 1,150 sequences with 733K+ frames, including 1,000 short-term & 150 long-term videos. Repo & Dataset available.
Review https://t.ly/mYNi7
Paper arxiv.org/pdf/2510.23368
Repo https://lnkd.in/edb3GMyT
Project https://lnkd.in/eC-hVB-U
Data https://lnkd.in/eew2j4tM