#AI #DeepLearning #ComputerVision #YOLO #AttentionMechanism #OpenSource
Paper title:
Audio-Visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
Authors:
Fa-Ting Hong, Zunnan Xu, Zixiang Zhou, Jun Zhou, Xiu Li, Qin Lin, Qinglin Lu, Dan Xu
Description:
This paper introduces ACTalker, an end-to-end video diffusion framework designed for natural talking head generation with both multi-signal and single-signal control capabilities.
The framework employs a parallel Mamba structure with multiple branches, each utilizing a separate driving signal to control specific facial regions.
A gate mechanism is applied across all branches, providing flexible control over video generation.
To ensure natural coordination of the controlled video both temporally and spatially, the Mamba structure enables driving signals to manipulate feature tokens across both dimensions in each branch.
Additionally, a mask-drop strategy is introduced, allowing each driving signal to independently control its corresponding facial region within the Mamba structure, preventing control conflicts.
Experimental results demonstrate that this method produces natural-looking facial videos driven by diverse signals, and that the Mamba layer seamlessly integrates multiple driving modalities without conflict.
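The gist of the gated parallel-branch design can be sketched in a few lines. The snippet below is a hedged illustration, not the authors' implementation: `ControlBranch` stands in for a Mamba/selective-state-space layer with a plain MLP, the driving signals are assumed pre-aligned to the video token grid, and learned gates blend each branch's region-masked contribution.

```python
# Hedged sketch of a gated parallel-branch controller (not the paper's code).
import torch
import torch.nn as nn

class ControlBranch(nn.Module):
    """One control branch: inject one driving signal into region-masked tokens.

    The MLP here is a stand-in for a Mamba/selective-state-space layer.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.proj_signal = nn.Linear(dim, dim)  # embed the driving signal
        self.ssm = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, tokens, signal, region_mask):
        # mask-drop: the branch only sees and edits its own facial region
        masked = tokens * region_mask.unsqueeze(-1)
        cond = masked + self.proj_signal(signal)  # signal injection
        return self.ssm(cond) * region_mask.unsqueeze(-1)

class GatedParallelControl(nn.Module):
    """Parallel branches, one per driving signal, blended by learned gates."""
    def __init__(self, dim: int, n_branches: int = 2):
        super().__init__()
        self.branches = nn.ModuleList([ControlBranch(dim) for _ in range(n_branches)])
        self.gate_logits = nn.Parameter(torch.zeros(n_branches))

    def forward(self, tokens, signals, masks):
        out = tokens
        gates = torch.sigmoid(self.gate_logits)
        for g, branch, sig, m in zip(gates, self.branches, signals, masks):
            out = out + g * branch(tokens, sig, m)  # gated residual control
        return out

# Toy usage: 16 spatio-temporal tokens, dim 64, two signals (e.g. audio + expression)
tokens = torch.randn(1, 16, 64)
signals = [torch.randn(1, 16, 64), torch.randn(1, 16, 64)]
masks = [torch.zeros(1, 16), torch.zeros(1, 16)]
masks[0][:, :8] = 1.0   # e.g. mouth region driven by audio
masks[1][:, 8:] = 1.0   # e.g. upper face driven by expression
out = GatedParallelControl(dim=64)(tokens, signals, masks)
print(out.shape)  # torch.Size([1, 16, 64])
```

In the paper the mask-drop operates within the Mamba structure itself; here it simply wraps a stand-in block, which is enough to show how disjoint masks keep two signals from competing for the same tokens.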
Link to paper abstract:
https://arxiv.org/abs/2504.00000
Link to paper PDF:
https://arxiv.org/pdf/2504.00000.pdf
Code:
https://github.com/harlanhong/actalker
Datasets used in paper:
The paper does not specify the datasets used.
Hugging Face demo:
No Hugging Face demo available.
#ACTalker #TalkingHeadGeneration #VideoDiffusion #MultimodalControl #MambaStructure #DeepLearning #ComputerVision #AI #OpenSource
Adobe unveils HUMOTO, a high-quality #dataset of human-object interactions designed for #motiongeneration, #computervision, and #robotics. It features over 700 sequences (7,875 seconds @ 30FPS) with interactions involving 63 precisely modeled objects and 72 articulated parts, a rich resource for researchers and developers in the field.
#HUMOTO #4DMocap #HumanObjectInteraction #AdobeResearch #AI #MachineLearning #PoseEstimation
The Oxford VGG unveils Geo4D, a breakthrough in #videodiffusion for monocular 4D reconstruction. Trained only on synthetic data, Geo4D still achieves strong generalization to real-world scenarios. It outputs point maps, depth, and ray maps, setting a new #SOTA in dynamic scene reconstruction. Code is now released!
#Geo4D #4DReconstruction #DynamicScenes #OxfordVGG #ComputerVision #MachineLearning #DiffusionModels
General Attention-Based Object Detection
GATE3D is a novel framework designed specifically for generalized monocular 3D object detection via weak supervision. GATE3D effectively bridges domain gaps by employing consistency losses between 2D and 3D predictions (a minimal sketch of such a loss follows the links below).
Review: https://t.ly/O7wqH
Paper: https://lnkd.in/dc5VTUj9
Project: https://lnkd.in/dzrt-qQV
#3DObjectDetection #Monocular3D #DeepLearning #WeakSupervision #ComputerVision #AI #MachineLearning #GATE3D
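To make the consistency idea concrete, here is a minimal sketch of one common form of 2D/3D agreement: project predicted 3D box centers through the camera intrinsics and penalize their distance to the matched 2D detections. This illustrates the general technique, not GATE3D's exact losses; the intrinsics `K` and the 2D/3D matching are assumed given.

```python
# Generic 2D/3D consistency loss sketch (illustrative, not GATE3D's exact loss).
import torch

def project_to_image(centers_3d, K):
    """Project (N, 3) camera-space points to (N, 2) pixels with pinhole K (3, 3)."""
    uvw = centers_3d @ K.T              # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide

def consistency_loss(centers_3d, centers_2d, K):
    """Penalize disagreement between projected 3D centers and 2D detections."""
    return torch.nn.functional.smooth_l1_loss(project_to_image(centers_3d, K), centers_2d)

# Toy check: a point 5 m straight ahead should project to the principal point
K = torch.tensor([[720.0, 0.0, 640.0], [0.0, 720.0, 360.0], [0.0, 0.0, 1.0]])
c3d = torch.tensor([[0.0, 0.0, 5.0]], requires_grad=True)  # predicted 3D center
c2d = torch.tensor([[640.0, 360.0]])                       # matched 2D detection
loss = consistency_loss(c3d, c2d, K)
loss.backward()                        # gradients flow back to the 3D head
print(loss.item())                     # 0.0 for this perfectly consistent pair
```

Because the loss only needs 2D boxes and camera geometry, it can supervise the 3D head without 3D labels, which is the point of the weak-supervision setup.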
NVIDIA introduces the Describe Anything Model (DAM), a new state-of-the-art model designed to generate rich, detailed descriptions for specific regions in images and videos. Users can mark these regions using points, boxes, scribbles, or masks (a toy mask-construction sketch follows the links below).
DAM sets a new benchmark in multimodal understanding, with open-source code under the Apache license, a dedicated dataset, and a live demo available on Hugging Face.
Explore more below:
Paper: https://lnkd.in/dZh82xtV
Project Page: https://lnkd.in/dcv9V2ZF
GitHub Repo: https://lnkd.in/dJB9Ehtb
Hugging Face Demo: https://lnkd.in/dXDb2MWU
Review: https://t.ly/la4JD
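As a concrete picture of the region prompts, the sketch below rasterizes a box prompt and point prompts into the kind of binary mask a region-captioning model consumes. The helper names are illustrative, not DAM's actual API.

```python
# Illustrative region-prompt rasterization (hypothetical helpers, not DAM's API).
import numpy as np

def mask_from_box(h, w, box):
    """box = (x0, y0, x1, y1) in pixels -> (h, w) binary mask."""
    m = np.zeros((h, w), dtype=np.uint8)
    x0, y0, x1, y1 = box
    m[y0:y1, x0:x1] = 1
    return m

def mask_from_points(h, w, points, radius=5):
    """Clicked points -> small disks, a crude stand-in for scribbles."""
    yy, xx = np.mgrid[:h, :w]
    m = np.zeros((h, w), dtype=np.uint8)
    for x, y in points:
        m[(yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2] = 1
    return m

# Either mask would then be fed to the model alongside the image.
mask = mask_from_box(480, 640, (100, 120, 300, 360))
print(mask.sum())  # 48000 pixels in the marked region
```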
#NVIDIA #DescribeAnything #ComputerVision #MultimodalAI #DeepLearning #ArtificialIntelligence #MachineLearning #OpenSource #HuggingFace #GenerativeAI #VisualUnderstanding #Python #AIresearch
SOTA Textured 3D-Guided VTON
#ALIBABA unveils 3DV-TON, a novel diffusion model for high-quality, temporally consistent video try-on. It generates animatable textured 3D meshes as explicit frame-level guidance, alleviating the tendency of models to over-focus on appearance fidelity at the expense of motion coherence (a minimal conditioning sketch follows the links below). Code & benchmark to be released.
Review: https://t.ly/0tjdC
Paper: https://lnkd.in/dFseYSXz
Project: https://lnkd.in/djtqzrzs
Repo: TBA
#AI #3DReconstruction #DiffusionModels #VirtualTryOn #ComputerVision #DeepLearning #VideoSynthesis
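The guidance pattern itself is simple to sketch: render the textured mesh per frame and feed the render to the denoiser alongside the noisy latent. Below is a minimal, generic channel-concatenation version of that conditioning, with assumed shapes; it illustrates the pattern, not 3DV-TON's actual network.

```python
# Generic frame-level guidance by channel concatenation (illustrative only).
import torch
import torch.nn as nn

class GuidedDenoiser(nn.Module):
    def __init__(self, latent_ch=4, guide_ch=3, hidden=64):
        super().__init__()
        # first conv sees the noisy latent and the rendered guidance stacked on channels
        self.net = nn.Sequential(
            nn.Conv2d(latent_ch + guide_ch, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, latent_ch, 3, padding=1),
        )

    def forward(self, noisy_latent, guide_render):
        return self.net(torch.cat([noisy_latent, guide_render], dim=1))

# Toy usage: per-frame latents (T, C, H, W) guided by per-frame mesh renders
latents = torch.randn(8, 4, 32, 32)   # 8 frames of a clip
renders = torch.randn(8, 3, 32, 32)   # textured 3D mesh rendered per frame
eps_pred = GuidedDenoiser()(latents, renders)
print(eps_pred.shape)  # torch.Size([8, 4, 32, 32])
```

Because the render carries pose and garment geometry explicitly, the denoiser can spend its capacity on appearance while motion structure comes from the mesh.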
Saint-Étienne University has introduced a new 3D human body pose estimation pipeline designed specifically for dance analysis.
Check out the project page featuring results and an interactive demo!
#DanceAnalysis #3DPoseEstimation #DeepLearning #HumanPose #AI #MachineLearning #ComputerVisionResearch
NVIDIA introduces GENMO, a unified generalist model for human motion that seamlessly combines motion estimation and generation within a single framework. GENMO supports conditioning on videos, 2D keypoints, text, music, and 3D keyframes, enabling highly versatile motion understanding and synthesis.
Currently, no official code release is available.
Review:
https://t.ly/Q5T_Y
Paper:
https://lnkd.in/ds36BY49
Project Page:
https://lnkd.in/dAYHhuFU
#NVIDIA #GENMO #HumanMotion #DeepLearning #AI #ComputerVision #MotionGeneration #MachineLearning #MultimodalAI #3DReconstruction