X-Portrait 2: SOTA(?) Portrait Animation

ByteDance unveils a preview of X-Portrait 2, a new SOTA expression-encoder model that implicitly encodes even the most minuscule expressions from the input, trained on large-scale datasets. Impressive results, but no paper or code announced.

Review https://t.ly/8Owh9 [UPDATE]
Paper ?
Project byteaigc.github.io/X-Portrait2/
Repo ?
Don't Look Twice: ViT by RLT

CMU unveils RLT (Run-Length Tokenization), which speeds up video transformers by taking inspiration from run-length encoding for data compression. It speeds up training and reduces the token count by up to 80%! Source code announced. A toy sketch of the run-length idea follows the links.

Review https://t.ly/ccSwN
Paper https://lnkd.in/d6VXur_q
Project https://lnkd.in/d4tXwM5T
Repo TBA
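For intuition, here is a minimal numpy sketch of the run-length idea, assuming a pixel-space patch comparison with an illustrative threshold; it is not the authors' implementation. A patch token is dropped whenever it is nearly identical to the same patch in the previous frame, and the surviving token records how many frames its run covers.

```python
import numpy as np

def run_length_tokenize(frames, patch=16, tau=0.05):
    """Toy run-length tokenization of a video (numpy sketch).

    frames: (T, H, W, C) array in [0, 1]. Returns a list of kept tokens
    [t, row, col, run_length]: a patch is dropped when it is nearly identical
    to the same patch in the previous frame, and the surviving token's
    run_length counts how many frames it covers.
    """
    T, H, W, _ = frames.shape
    rows, cols = H // patch, W // patch
    kept = []       # [t, r, c, run_length]
    open_run = {}   # (r, c) -> index of the currently open run in `kept`
    for t in range(T):
        for r in range(rows):
            for c in range(cols):
                tile = frames[t, r*patch:(r+1)*patch, c*patch:(c+1)*patch]
                if t > 0:
                    prev = frames[t-1, r*patch:(r+1)*patch, c*patch:(c+1)*patch]
                    if np.abs(tile - prev).mean() < tau:
                        kept[open_run[(r, c)]][3] += 1  # extend the run, no new token
                        continue
                kept.append([t, r, c, 1])
                open_run[(r, c)] = len(kept) - 1
    return kept

# Usage: a clip whose first half is static keeps far fewer tokens than T * rows * cols.
video = np.repeat(np.random.rand(1, 64, 64, 3), 8, axis=0)
video[4:] = np.random.rand(4, 64, 64, 3)  # second half changes every frame
tokens = run_length_tokenize(video)
print(len(tokens), "tokens instead of", 8 * (64 // 16) ** 2)
```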
SeedEdit: foundational T2I

ByteDance unveils a novel foundational T2I model capable of delivering stable, high-aesthetic image edits that maintain image quality through unlimited rounds of editing instructions. No code announced, but a demo is online.

Review https://t.ly/hPlnN
Paper https://arxiv.org/pdf/2411.06686
Project team.doubao.com/en/special/seededit
Demo https://huggingface.co/spaces/ByteDance/SeedEdit-APP
4-Nanosecond Inference

LogicTreeNet: convolutional differentiable logic gate networks with logic gate tree kernels, bringing computer vision into differentiable LGNs. Reported to be up to roughly 61x smaller than SOTA, with inference in 4 nanoseconds! A toy sketch of a differentiable logic gate follows the links.

Review https://t.ly/GflOW
Paper https://lnkd.in/dAZQr3dW
Full clip https://lnkd.in/dvDJ3j-u
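As background, here is a toy sketch of a single differentiable logic gate of the kind such networks are built from: a softmax mixture over a few relaxed Boolean ops, trainable end to end and hardened to one discrete gate at inference. The gate set and relaxation are simplified assumptions; LogicTreeNet's convolutional tree kernels are not reproduced here.

```python
import numpy as np

def soft_gate(a, b, logits):
    """One differentiable logic gate: a learned mixture over relaxed Boolean ops.

    a, b are soft bits in [0, 1]; logits (length 4) parameterize which gate this
    unit behaves as. Training updates the logits; at inference they can be
    hardened with argmax so the network collapses to plain Boolean gates, which
    is what makes nanosecond-scale hardware inference plausible.
    """
    ops = np.array([
        a * b,              # AND
        a + b - a * b,      # OR
        a + b - 2 * a * b,  # XOR
        1.0 - a * b,        # NAND
    ])
    w = np.exp(logits - logits.max())
    w = w / w.sum()         # softmax over the gate choices (the trainable part)
    return float((w * ops).sum())

# Usage: with these logits the gate mostly behaves like AND.
print(soft_gate(0.9, 0.2, np.array([2.0, 0.0, 0.0, -1.0])))
```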
Global Tracklet Association for MOT

A novel universal, model-agnostic method designed to refine and enhance tracklet association for single-camera MOT, suitable for datasets such as SportsMOT, SoccerNet, and similar. Source code released. A generic association sketch follows the links.

Review https://t.ly/gk-yh
Paper https://lnkd.in/dvXQVKFw
Repo https://lnkd.in/dEJqiyWs
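For intuition, a generic tracklet-association sketch (not the paper's method): tracklets that do not overlap in time and whose appearance embeddings are similar get linked into one global track.

```python
import numpy as np

def merge_tracklets(tracklets, sim_thresh=0.8):
    """Toy single-camera tracklet association.

    tracklets: list of dicts with 'start'/'end' frame indices and a mean
    appearance embedding 'emb'. Two tracklets are linked when they do not
    overlap in time and their embeddings are similar. Returns one track ID
    per input tracklet.
    """
    n = len(tracklets)
    track_id = list(range(n))
    order = sorted(range(n), key=lambda i: tracklets[i]['start'])
    for idx, i in enumerate(order):
        for j in order[idx + 1:]:
            if tracklets[j]['start'] <= tracklets[i]['end']:
                continue  # temporal overlap: cannot be the same object
            a, b = tracklets[i]['emb'], tracklets[j]['emb']
            cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
            if cos > sim_thresh and track_id[j] == j:
                track_id[j] = track_id[i]  # link the later tracklet to the earlier one
    return track_id

# Usage: tracklets 0 and 2 share (almost) the same embedding and never co-occur.
emb = np.random.randn(128)
tracks = [
    {'start': 0,  'end': 50,  'emb': emb},
    {'start': 10, 'end': 80,  'emb': np.random.randn(128)},
    {'start': 60, 'end': 120, 'emb': emb + 0.01},
]
print(merge_tracklets(tracks))  # -> [0, 1, 0]
```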
MagicQuill: Super-Easy Diffusion Editing

MagicQuill is a novel system designed to support users in smart image editing: a robust UI/UX (e.g., inserting/erasing objects, changing colors, etc.) backed by a multimodal LLM that anticipates user intentions in real time. Code and demos released.

Review https://t.ly/hJyLa
Paper https://arxiv.org/pdf/2411.09703
Project https://magicquill.art/demo/
Repo https://github.com/magic-quill/magicquill
Demo https://huggingface.co/spaces/AI4Editing/MagicQuill
EchoMimicV2: Semi-Body Human Animation

Alipay (Ant Group) unveils EchoMimicV2, a novel SOTA half-body human animation model via APD-Harmonization. See the clip with audio (ZH/ENG). Code and demo announced.

Review https://t.ly/enLxJ
Paper arxiv.org/pdf/2411.10061
Project antgroup.github.io/ai/echomimic_v2/
Repo-v2 github.com/antgroup/echomimic_v2
Repo-v1 https://github.com/antgroup/echomimic
SAMURAI: SAM for Tracking

The University of Washington unveils SAMURAI, an enhanced adaptation of SAM 2 designed specifically for visual object tracking. New SOTA! Code under Apache 2.0. A generic motion-prediction sketch follows the links.

Review https://t.ly/yGU0P
Paper https://arxiv.org/pdf/2411.11922
Repo https://github.com/yangchris11/samurai
Project https://yangchris11.github.io/samurai/
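SAMURAI reportedly adds motion-aware memory and mask selection on top of SAM 2; the sketch below is only a generic constant-velocity Kalman box predictor of the kind such a selection step could use to score candidate masks, not the authors' code.

```python
import numpy as np

class BoxKalman:
    """Generic constant-velocity Kalman filter over a box (cx, cy, w, h)."""

    def __init__(self, box, dt=1.0):
        self.x = np.hstack([box, np.zeros(4)])  # state: box + per-component velocity
        self.P = np.eye(8)
        self.F = np.eye(8)
        self.F[:4, 4:] = dt * np.eye(4)         # position += velocity * dt
        self.H = np.eye(4, 8)                   # we only observe the box itself
        self.Q = 1e-2 * np.eye(8)
        self.R = 1e-1 * np.eye(4)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]

    def update(self, box):
        y = box - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(8) - K @ self.H) @ self.P

# Usage: predict where the target should be next; a tracker could then pick the
# candidate mask whose box best overlaps (IoU) the prediction.
kf = BoxKalman(np.array([50.0, 50.0, 20.0, 40.0]))
kf.predict()
kf.update(np.array([55.0, 50.0, 20.0, 40.0]))
print(kf.predict())  # box pulled toward the observed motion
```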
DINO-X: Unified Object-Centric LVM

A unified vision model for open-world detection, segmentation, phrase grounding, visual counting, pose estimation, prompt-free detection/recognition, dense captioning, and more. Demo and API announced.

Review https://t.ly/CSQon
Paper https://lnkd.in/dc44ZM8v
Project https://lnkd.in/dehKJVvC
Repo https://lnkd.in/df8Kb6iz
All Languages Matter: LMMs vs. 100 Languages

ALM-Bench aims to assess the next generation of massively multilingual multimodal models in a standardized way, pushing the boundaries of LMMs toward better cultural understanding and inclusivity. Code and dataset links below.

Review https://t.ly/VsoJB
Paper https://lnkd.in/ddVVZfi2
Project https://lnkd.in/dpssaeRq
Code https://lnkd.in/dnbaJJE4
Dataset https://lnkd.in/drw-_95v
EdgeCape: SOTA Category-Agnostic Pose

EdgeCape: a new SOTA in Category-Agnostic Pose Estimation (CAPE), which localizes keypoints across diverse object categories using only one or a few annotated support images. Source code released. A toy support-to-query matching sketch follows the links.

Review https://t.ly/4TpAs
Paper https://arxiv.org/pdf/2411.16665
Project https://orhir.github.io/edge_cape/
Code https://github.com/orhir/EdgeCape
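As a reminder of what CAPE asks for, here is a toy support-to-query keypoint-matching sketch: each annotated support keypoint is localized in the query image by feature similarity. EdgeCape's actual method goes well beyond this baseline; everything below is illustrative only.

```python
import numpy as np

def match_keypoints(support_feat, support_kpts, query_feat):
    """Toy category-agnostic keypoint transfer by feature matching.

    support_feat, query_feat: (D, H, W) feature maps of the support and query
    images. support_kpts: list of (row, col) keypoint locations on the support
    feature map. Each support keypoint is matched to its most similar query
    location (cosine similarity), giving a crude keypoint prediction.
    """
    D, H, W = query_feat.shape
    q = query_feat.reshape(D, -1)
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    preds = []
    for r, c in support_kpts:
        f = support_feat[:, r, c]
        f = f / (np.linalg.norm(f) + 1e-8)
        idx = int(np.argmax(f @ q))          # best-matching query position
        preds.append((idx // W, idx % W))
    return preds

# Usage: the query is the support shifted 3 columns right, so keypoints shift too.
support = np.random.randn(32, 16, 16)
query = np.roll(support, shift=3, axis=2)
print(match_keypoints(support, [(5, 5), (10, 2)], query))  # -> [(5, 8), (10, 5)]
```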
StableAnimator: ID-Aware Humans

StableAnimator: the first end-to-end ID-preserving diffusion model for HQ human animation videos, with no post-processing required. Input: a single image plus a sequence of poses. Insane results!

Review https://t.ly/JDtL3
Paper https://arxiv.org/pdf/2411.17697
Project francis-rings.github.io/StableAnimator/
Code github.com/Francis-Rings/StableAnimator
SOTA Track-by-Propagation

SambaMOTR is a novel end-to-end model (based on Samba) that leverages long-range dependencies and interactions between tracklets to handle complex motion patterns and occlusions. Code expected in January 2025.

Review https://t.ly/QSQ8L
Paper arxiv.org/pdf/2410.01806
Project sambamotr.github.io/
Repo https://lnkd.in/dRDX6nk2
HiFiVFS: Extreme Face Swapping

HiFiVFS: high-quality face-swapping videos even in extremely challenging scenarios (occlusion, makeup, lighting, extreme poses, etc.). Impressive results, but no code announced.

Review https://t.ly/ea8dU
Paper https://arxiv.org/pdf/2411.18293
Project https://cxcx1996.github.io/HiFiVFS
Video Depth Without Video Models

RollingDepth turns a single-image latent diffusion model (LDM) into a novel SOTA video depth estimator, working better than dedicated video-depth models. Code released under the Apache license. A toy snippet-alignment sketch follows the links.

Review https://t.ly/R4LqS
Paper https://arxiv.org/pdf/2411.19189
Project https://rollingdepth.github.io/
Repo https://github.com/prs-eth/rollingdepth
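To illustrate the general snippet-based recipe (an assumption about the pipeline, not the authors' code): depth is predicted on short overlapping snippets with a single-image model, and the snippets are then stitched by fitting a per-snippet scale and shift on the overlapping frames.

```python
import numpy as np

def align_snippets(depth_snippets, overlap):
    """Toy global alignment of per-snippet depth predictions.

    depth_snippets: list of (T_s, H, W) arrays predicted independently on
    snippets that overlap the previous one by `overlap` frames. Each snippet
    is fitted to the already-aligned track with a least-squares scale and
    shift computed on the overlapping frames, then its new frames are appended.
    """
    aligned = [depth_snippets[0]]
    for snip in depth_snippets[1:]:
        prev_tail = aligned[-1][-overlap:].ravel()
        cur_head = snip[:overlap].ravel()
        A = np.stack([cur_head, np.ones_like(cur_head)], axis=1)
        (s, t), *_ = np.linalg.lstsq(A, prev_tail, rcond=None)
        aligned.append(s * snip[overlap:] + t)  # keep only the new frames
    return np.concatenate(aligned, axis=0)

# Usage: the same scene predicted in three snippets, each with its own scale/shift.
truth = np.random.rand(10, 4, 4) + 1.0
snippets = [2.0 * truth[0:4] + 0.3, 0.5 * truth[2:6] - 0.1, 1.5 * truth[4:8] + 0.7]
video_depth = align_snippets(snippets, overlap=2)
print(video_depth.shape)  # (8, 4, 4), consistent up to the first snippet's scale
```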
Universal Soccer Foundation Model

Towards universal soccer video understanding: SoccerReplay-1988, the largest multi-modal soccer dataset, and MatchVision, the first vision-language foundation model for soccer. Code, dataset, and checkpoints to be released.

Review https://t.ly/-X90B
Paper https://arxiv.org/pdf/2412.01820
Project https://jyrao.github.io/UniSoccer/
Repo https://github.com/jyrao/UniSoccer
Motion Prompting Video Generation

DeepMind unveils Motion Prompting, a novel ControlNet-based video generation model conditioned on spatio-temporally sparse or dense motion trajectories. Amazing results, but no code announced. A toy trajectory-conditioning sketch follows the links.

Review https://t.ly/VyKbv
Paper arxiv.org/pdf/2412.02700
Project motion-prompting.github.io
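For intuition on trajectory conditioning, here is a toy sketch that rasterizes sparse point tracks into a dense control tensor that a ControlNet-style adapter could consume. The post does not describe the exact encoding, so everything below is an illustrative assumption.

```python
import numpy as np

def tracks_to_conditioning(tracks, T, H, W):
    """Rasterize sparse point tracks into a dense (T, H, W, 2) control tensor.

    tracks: (N, T, 2) array of (x, y) positions per point and frame, NaN where
    a point is not visible. Each tracked location stores its displacement to
    the next frame; everything else stays zero.
    """
    cond = np.zeros((T, H, W, 2), dtype=np.float32)
    for n in range(tracks.shape[0]):
        for t in range(T - 1):
            if np.isnan(tracks[n, t]).any() or np.isnan(tracks[n, t + 1]).any():
                continue
            x, y = tracks[n, t]
            dx, dy = tracks[n, t + 1] - tracks[n, t]
            cond[t, int(round(y)) % H, int(round(x)) % W] = (dx, dy)  # wrap to stay in bounds
    return cond

# Usage: one point moving right by 2 px per frame over an 8-frame 64x64 clip.
T, H, W = 8, 64, 64
track = np.stack([np.arange(T) * 2.0 + 10, np.full(T, 32.0)], axis=1)[None]
cond = tracks_to_conditioning(track, T, H, W)
print(cond.shape, cond[0, 32, 10])  # (8, 64, 64, 2) [2. 0.]
```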
AniGS: Animatable Avatar from a Single Picture

#Alibaba unveils AniGS: given a single human image as input, it reconstructs a hi-fi 3D avatar in a canonical pose, usable for both photorealistic rendering and real-time animation. Source code announced, to be released.

Review https://t.ly/4yfzn
Paper arxiv.org/pdf/2412.02684
Project lingtengqiu.github.io/2024/AniGS/
Repo github.com/aigc3d/AniGS
GigaHands: Massive #3D Hands

A novel, massive #3D bimanual-activities dataset: 34 hours of activities, 14k hand-motion clips paired with 84k text annotations, and 183M+ unique hand images.

Review https://t.ly/SA0HG
Paper www.arxiv.org/pdf/2412.04244
Repo github.com/brown-ivl/gigahands
Project ivl.cs.brown.edu/research/gigahands.html
Track4Gen: Diffusion + Tracking

Track4Gen: a spatially aware video generator that combines the video diffusion loss with point tracking across frames, providing enhanced spatial supervision on the diffusion features. GenAI with point-based motion control. Stunning results, but no code announced. A toy combined-loss sketch follows the links.

Review https://t.ly/9ujhc
Paper arxiv.org/pdf/2412.06016
Project hyeonho99.github.io/track4gen/
Gallery hyeonho99.github.io/track4gen/full.html
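A hedged sketch of the kind of combined objective described above: a standard diffusion MSE term plus a feature-similarity term on tracked point correspondences. The cosine form and the weight lam are illustrative assumptions, not Track4Gen's exact loss.

```python
import numpy as np

def track4gen_style_loss(noise_pred, noise, feats, corr, lam=0.5):
    """Toy combination of a diffusion loss with a point-tracking feature loss.

    noise_pred, noise: (T, C, H, W) predicted vs. target noise (standard MSE).
    feats: (T, D, H, W) intermediate diffusion features.
    corr: list of ((t1, y1, x1), (t2, y2, x2)) correspondences from a point
    tracker; matched points should have similar features, which is the extra
    spatial supervision on the diffusion features.
    """
    diff_loss = np.mean((noise_pred - noise) ** 2)
    track_loss = 0.0
    for (t1, y1, x1), (t2, y2, x2) in corr:
        f1, f2 = feats[t1, :, y1, x1], feats[t2, :, y2, x2]
        cos = f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-8)
        track_loss += 1.0 - cos
    track_loss /= max(len(corr), 1)
    return diff_loss + lam * track_loss

# Usage with random tensors and one correspondence between frames 0 and 3.
T, C, D, H, W = 4, 4, 16, 8, 8
print(track4gen_style_loss(
    np.random.randn(T, C, H, W), np.random.randn(T, C, H, W),
    np.random.randn(T, D, H, W), [((0, 2, 2), (3, 2, 4))]))
```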
4D Neural Templates

#Stanford unveils Neural Templates, generating HQ temporal object intrinsics for several natural phenomena, enabling the sampling and controllable rendering of these dynamic objects from any viewpoint, at any time of their lifespan. A novel vision task is born.

Review https://t.ly/ka_Qf
Paper https://arxiv.org/pdf/2412.05278
Project https://chen-geng.com/rose4d#toi