🔥SOTA Detection w/ DINOv3🔥
👉DEIMv2 is the evolution of the DEIM framework, leveraging DINOv3. Variants range from an ultra-light version up to S, M, L & X, covering a wide range of scenarios, and across all of them DEIMv2 achieves SOTA. Repo under Apache 2.0. A conceptual sketch follows the links.
👉Review https://t.ly/P7jEH
👉Paper arxiv.org/pdf/2509.20787
👉Repo github.com/Intellindust-AI-Lab/DEIMv2
👉Project intellindust-ai-lab.github.io/projects/DEIMv2
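For intuition, a minimal self-contained sketch of the general recipe (NOT the authors' code; every module name and size below is an illustrative assumption): a frozen DINOv3-style backbone feeding a lightweight query-based detection head.

import torch
import torch.nn as nn

class TinyDetHead(nn.Module):
    """Toy query-based head: class logits + normalized (cx, cy, w, h) boxes."""
    def __init__(self, dim: int, num_classes: int, num_queries: int = 100):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cls_head = nn.Linear(dim, num_classes)
        self.box_head = nn.Linear(dim, 4)

    def forward(self, feats):                     # feats: (B, N, dim) tokens
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        dec, _ = self.attn(q, feats, feats)       # queries cross-attend to features
        return self.cls_head(dec), self.box_head(dec).sigmoid()

backbone = nn.Sequential(                         # stand-in for a frozen DINOv3 ViT
    nn.Conv2d(3, 384, kernel_size=16, stride=16), nn.Flatten(2))
for p in backbone.parameters():
    p.requires_grad_(False)                       # foundation features stay frozen

img = torch.randn(1, 3, 224, 224)
feats = backbone(img).transpose(1, 2)             # (1, 196, 384) patch tokens
logits, boxes = TinyDetHead(384, num_classes=80)(feats)
print(logits.shape, boxes.shape)                  # (1, 100, 80), (1, 100, 4)

Only the small head is trainable here; the real variants are trained differently — the sketch just shows the frozen-backbone data flow.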
Real-time Interactive Video
👉LONGLIVE by #Nvidia is a frame-level autoregressive framework for real-time, interactive long-video generation: it accepts sequential user prompts and generates the corresponding video in real time. Repo under a non-commercial license. A toy sketch of the interaction loop follows the links.
👉Review https://t.ly/jJkdY
👉Paper arxiv.org/pdf/2509.22622
👉Project nvlabs.github.io/LongLive/
👉Repo github.com/NVlabs/LongLive
🤗huggingface.co/Efficient-Large-Model/LongLive-1.3B
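The interaction pattern in miniature (a toy sketch, NOT NVIDIA's code; toy_next_frame is a hypothetical stand-in for the learned generator): frames are produced one at a time, and the user's prompt can change mid-stream without restarting generation.

import numpy as np

def toy_next_frame(prev_frame: np.ndarray, prompt: str) -> np.ndarray:
    # A real model would condition on cached temporal context + the prompt;
    # here each prompt just drives a different random drift.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return np.clip(prev_frame + rng.normal(0.0, 0.05, prev_frame.shape), 0.0, 1.0)

frame = np.zeros((64, 64, 3))                      # initial blank frame
script = [("a calm lake at dawn", 30), ("a storm rolls in", 30)]  # (prompt, frames)
video = []
for prompt, n_frames in script:                    # prompts arrive sequentially
    for _ in range(n_frames):
        frame = toy_next_frame(frame, prompt)      # frame-level autoregression
        video.append(frame)
print(len(video), "frames generated")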
Universal Image Restoration
👉LucidFlux by HKUST(GZ) is a universal image restoration framework built on a large-scale diffusion transformer. It delivers photorealistic restorations of real-world low-quality (LQ) images, outperforming SOTA diffusion-based models across diverse degradations. Repo under a custom non-commercial license.
👉Review https://t.ly/Z5cA3
👉Paper https://arxiv.org/pdf/2509.22414
👉Project https://w2genai-lab.github.io/LucidFlux/
👉Repo https://github.com/W2GenAI-Lab/LucidFlux
👩‍🦱Physical-Hair Diffusion👩‍🦱
👉CONTROLHAIR is a novel hybrid framework that integrates a physics simulator with conditional video diffusion to enable controllable dynamic hair rendering. Repo announced.
👉Review https://t.ly/78LHr
👉Paper https://lnkd.in/epm-A9Fq
👉Project https://lnkd.in/evsjz298
👉Repo TBA
Code-Agentic Education
👉Show Lab unveils Code2Video: an agentic, code-centric framework that generates HQ educational videos from knowledge points, with clarity, coherence & reproducibility. Repo under MIT.
👉Review https://t.ly/Fv4LJ
👉Paper https://arxiv.org/pdf/2510.01174
👉Repo https://github.com/showlab/Code2Video/
👉Project https://showlab.github.io/Code2Video/
Clink! Chop! Thud!
👉Sounding Object Detection: while an environment may contain many objects, only a few are directly involved in producing sound during an interaction. This model detects the sounding object in a video. Code/Data announced.
👉Review https://t.ly/VK_1h
👉Paper https://lnkd.in/depNjVXm
👉Project https://lnkd.in/dF63EZFG
👉Repo TBA
A proof I'm not a bot...
My (short) interview with one of the biggest Italian media outlets: AI in 2016, HPC/Quantum, and how I created my startup: https://www.linkedin.com/posts/visionarynet_ai-itw25-ai-activity-7381215486115643392-t7an
Thanks for the support (and of course a new paper coming in a few hours)
Visual Grounding RVOS
👉ReferDINO is a strong RVOS model that inherits region-level vision-language alignment from foundational visual grounding models, and is further endowed with pixel-level dense perception & cross-modal spatio-temporal reasoning. Code, Demo & checkpoints available.
👉Review https://t.ly/rOdkP
👉Paper https://lnkd.in/efuAFQdE
👉Project https://lnkd.in/dK3wMZqv
👉Repo https://lnkd.in/d3i2PsNF
Pixel-Perfect Depth (SOTA)
👉Pixel-Perfect Depth is a monocular depth estimation model built on pixel-space diffusion transformers. New SOTA. Repo under Apache 2.0. A toy sketch follows the links.
👉Review https://t.ly/75PGo
👉Paper https://lnkd.in/d8wxFpyY
👉Project https://lnkd.in/dV5HhsqH
👉Repo https://lnkd.in/d9JKFBJq
👉Demo https://lnkd.in/d3wBkKJ9
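A toy of the core idea, under my reading of the paper (illustrative only): the diffusion process runs directly in pixel space at the output resolution, with no VAE/latent stage; toy_denoiser is a hypothetical stand-in for the pixel-space diffusion transformer.

import numpy as np

def toy_denoiser(x: np.ndarray, t: float) -> np.ndarray:
    # Hypothetical stand-in: pull the noisy map toward a smooth depth ramp,
    # more strongly as the noise level t decreases.
    target = np.linspace(0.0, 1.0, x.shape[1])[None, :].repeat(x.shape[0], 0)
    return x + (target - x) * 0.2 * (1.0 - t)

depth = np.random.randn(32, 32)        # start from pure noise at full pixel res
steps = 50
for step in range(steps):              # reverse diffusion loop
    t = 1.0 - step / steps
    depth = toy_denoiser(depth, t)
print(round(float(depth.min()), 3), round(float(depth.max()), 3))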
TrackVLA++ Visual Tracking
👉TrackVLA++ is a novel Vision-Language-Action model that incorporates spatial reasoning and a target-identification memory, enabling SOTA performance in both long-horizon and highly crowded tracking scenarios. Model announced.
👉Review https://t.ly/ruYzc
👉Paper https://arxiv.org/pdf/2510.07134
👉Project pku-epic.github.io/TrackVLA-plus-plus-Web/
👉Repo TBA
🫧 Detect Anything via MLLM 🫧
👉Rex-Omni is a 3B-parameter multimodal model that unifies visual perception tasks, including object detection, OCR, pointing, keypointing & visual prompting, into a single next-point-prediction framework. Impressive results. Repo under IDEA License 1.0. A toy sketch of the coordinate tokenization follows the links.
👉Review https://t.ly/DCTk_
👉Paper https://lnkd.in/d4VDD-9j
👉Project https://lnkd.in/d6unEyvq
👉Repo https://lnkd.in/dkYJFe-x
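The "next point prediction" idea in miniature: every coordinate is quantized into a discrete bin, so boxes, points and keypoints all live in one token space and can be emitted autoregressively. The 1000-bin scheme below is my assumption (Pix2Seq-style), not necessarily Rex-Omni's exact tokenizer.

NUM_BINS = 1000  # assumed quantization granularity

def coord_to_token(v: float, size: int) -> int:
    """Quantize an absolute coordinate into one of NUM_BINS discrete tokens."""
    return min(int(v / size * NUM_BINS), NUM_BINS - 1)

def token_to_coord(tok: int, size: int) -> float:
    """Decode a token back to its bin-center coordinate."""
    return (tok + 0.5) / NUM_BINS * size

W, H = 1280, 720
box = (100.0, 50.0, 400.0, 300.0)                  # x0, y0, x1, y1 in pixels
tokens = [coord_to_token(v, W if i % 2 == 0 else H) for i, v in enumerate(box)]
decoded = [token_to_coord(t, W if i % 2 == 0 else H) for i, t in enumerate(tokens)]
print(tokens)                                      # [78, 69, 312, 416]
print([round(v, 1) for v in decoded])              # ~= the original box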
Universal Feature Up-Sampling
👉AnyUp is a novel method for feature up-sampling that can be applied to ANY vision feature at ANY resolution, without encoder-specific training: an inference-time, feature-agnostic up-sampling architecture that improves up-sampling quality. Repo under CC 4.0. A toy sketch of the interface follows the links.
👉Review https://t.ly/HvEw9
👉Paper https://arxiv.org/pdf/2510.12764
👉Project https://wimmerth.github.io/anyup/
👉Repo https://github.com/wimmerth/anyup
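The interface AnyUp targets, in toy form: any feature map (any channel count, any resolution) plus the high-res image in, upsampled features out. Plain bilinear interpolation below is only a baseline stand-in for the learned, feature-agnostic upsampler.

import torch
import torch.nn.functional as F

def toy_upsample(feats: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """feats: (B, C, h, w) from ANY encoder; image: (B, 3, H, W) guidance."""
    H, W = image.shape[-2:]
    return F.interpolate(feats, size=(H, W), mode="bilinear", align_corners=False)

img = torch.randn(1, 3, 448, 448)
for C, h in [(384, 32), (768, 14), (64, 112)]:     # feature-agnostic: any C, any h
    up = toy_upsample(torch.randn(1, C, h, h), img)
    print(tuple(up.shape))                         # always (1, C, 448, 448)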
City-Tour -> Simulation
👉UrbanVerse is a novel system that converts real-world urban scenes from city-tour videos into physics-aware, interactive simulation environments, enabling scalable robot learning in urban spaces with real-world generalization. Repo & Data announced.
👉Review https://t.ly/UvXNS
👉Paper https://arxiv.org/pdf/2510.15018
👉Project https://urbanverseproject.github.io/
👉Repo TBA
All-in-One Dense Keypoints
👉DeepDetect is a novel all-in-one dense keypoint detector that unifies the strengths of SIFT, ORB, BRISK, FAST, AGAST, Harris, Shi-Tomasi, Canny & Sobel into a single neural net. DAMN ROMANTIC. Repo under MIT. A pseudo-labeling sketch follows the links.
👉Review https://t.ly/VKGct
👉Paper https://arxiv.org/pdf/2510.17422
👉Repo https://github.com/saktx/DeepDetect
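A sketch of the distillation idea as I read it (not the official pipeline): classic detectors generate keypoint pseudo-labels, rendered as a dense heatmap that a network can be trained to regress. ORB and FAST stand in for the full ensemble here.

import cv2
import numpy as np

img = (np.random.rand(240, 320) * 255).astype(np.uint8)   # stand-in grayscale image

# Two of the named classic detectors; the full set would also add SIFT,
# BRISK, AGAST, Harris, Shi-Tomasi, Canny & Sobel responses.
detectors = [cv2.ORB_create(nfeatures=500),
             cv2.FastFeatureDetector_create(threshold=20)]

heatmap = np.zeros(img.shape, dtype=np.float32)            # dense training target
for det in detectors:
    for kp in det.detect(img, None):
        x = min(int(round(kp.pt[0])), img.shape[1] - 1)
        y = min(int(round(kp.pt[1])), img.shape[0] - 1)
        heatmap[y, x] = 1.0                                # union of classic keypoints

heatmap = cv2.GaussianBlur(heatmap, (5, 5), 1.0)           # soften peaks for regression
print(int((heatmap > 0).sum()), "supervised pixels")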
🔥 SAM 2++: Track Anything 🔥
👉SAM 2++ is a novel unified model for tracking at any granularity, including masks, boxes, and points. Impressive results, but no code announced 😢. An interface sketch follows the links.
👉Review https://t.ly/I392_
👉Paper arxiv.org/pdf/2510.18822
👉Project tracking-any-granularity.github.io/
👉Repo :(
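What "any granularity" might look like as an interface (purely illustrative, since no code is released): masks, boxes and points normalized into a single prompt type that one tracker could consume.

from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class TrackPrompt:
    kind: str                        # "mask" | "box" | "point"
    mask: Optional[np.ndarray] = None
    box: Optional[tuple] = None      # (x0, y0, x1, y1)
    point: Optional[tuple] = None    # (x, y)

def to_box(p: TrackPrompt) -> tuple:
    """Collapse any prompt granularity to a coarse box for a unified tracker."""
    if p.kind == "box":
        return p.box
    if p.kind == "point":
        x, y = p.point
        return (x - 4, y - 4, x + 4, y + 4)   # small box around the click
    ys, xs = np.nonzero(p.mask)
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

mask = np.zeros((64, 64), dtype=bool); mask[10:20, 30:40] = True
for p in (TrackPrompt("mask", mask=mask),
          TrackPrompt("box", box=(5, 5, 25, 25)),
          TrackPrompt("point", point=(32, 32))):
    print(p.kind, to_box(p))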
  
UrbanVerse repo (pretty empty) now online: https://github.com/OatmealLiu/UrbanVerse
  
Omni Driving Models
👉OmniNWM is a unified panoramic navigation world model that advances autonomous driving by jointly generating multi-modal states (RGB, semantics, depth, 3D occupancy), enabling precise action control & facilitating closed-loop evaluation through occupancy-based dense rewards. Repo under Apache 2.0. A toy reward sketch follows the links.
👉Review https://t.ly/ktXvz
👉Paper https://lnkd.in/eFKSZnrc
👉Project https://lnkd.in/eSDfccv8
👉Repo https://lnkd.in/efCSvjtp
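A toy of the occupancy-based dense reward, under my reading of the idea (not the authors' exact formulation): each candidate ego trajectory is penalized per waypoint that lands inside a voxel the world model predicts as occupied.

import numpy as np

occ = np.zeros((50, 50, 5), dtype=bool)        # predicted 3D occupancy grid (x, y, z)
occ[20:30, 24:26, :] = True                    # an obstacle straight ahead

def dense_reward(traj_xyz: np.ndarray, voxel: float = 1.0) -> float:
    """-1 per waypoint inside predicted occupancy; 0 for a collision-free path."""
    idx = np.floor(traj_xyz / voxel).astype(int)
    idx = np.clip(idx, 0, np.array(occ.shape) - 1)
    hits = occ[idx[:, 0], idx[:, 1], idx[:, 2]]
    return float(-hits.sum())

straight = np.stack([np.arange(40.0), np.full(40, 25.0), np.ones(40)], axis=1)
swerve = straight.copy()
swerve[:, 1] += 5.0                            # steer around the obstacle
print(dense_reward(straight), dense_reward(swerve))   # -10.0 vs 0.0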
ITTO: Protocol for Dynamic Tracking
👉ITTO by Caltech is a novel long-range tracking benchmark suite for evaluating and diagnosing tracking methods on complex and long-range motions. Repo under CC BY-NC 4.0.
👉Review https://t.ly/tN84a
👉Paper https://arxiv.org/pdf/2510.19819
👉Project https://glab-caltech.github.io/ITTO/
👉Repo https://github.com/ilonadem/itto
Character Mixing Generation
👉MBZUAI unveils the first-ever video-gen system able to preserve character identity, behavior & original style while generating plausible interactions between characters that have never coexisted, from cartoons (We Bare Bears, Tom & Jerry) to realistic humans (Mr. Bean, Young Sheldon).
👉Review https://t.ly/tN84a
👉Paper https://lnkd.in/dhKMwukv
👉Project https://lnkd.in/dBkJs48h
👉Repo https://lnkd.in/dw_uzgAk
  