A talk on Theoretical Foundations of Graph Neural Networks by Petar Veličković from DeepMind.
In this talk, Petar derives GNNs from first principles, motivates their use in the sciences, and explains how they emerged along several research lines.
Should be very interesting for those who want to learn about GNNs but could not find a good starting point.
Video: https://youtu.be/uF53xsT7mjc
Slides: https://petar-v.com/talks/GNN-Wednesday.pdf
Guys from RunwayML created an awesome user-friendly demo for our approach "Adaptive Style Transfer".
You can play around with it and easily stylize your own photos. One important thing: the larger the input image, the crisper the stylization.
Run Models for 8 different artists
Run Picasso model
Run Van Gogh model
Method source code on Github: https://github.com/CompVis/adaptive-style-transfer
Stable View Synthesis (by Vladlen Koltun from Intel)
Given a set of source images depicting a scene from arbitrary viewpoints, it synthesizes new views of the scene.
The method operates on a geometric scaffold computed via structure-from-motion and multi-view stereo. Each point on this 3D scaffold is associated with view rays and corresponding feature vectors that encode the appearance of this point in the input images. The core of SVS is view-dependent on-surface feature aggregation, in which directional feature vectors at each 3D point are processed to produce a new feature vector for a ray that maps this point into the new target view. The target view is then rendered by a convolutional network from a tensor of features synthesized in this way for all pixels. The method is trained end-to-end.
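To make the aggregation step more concrete, here is a minimal PyTorch sketch of view-dependent on-surface feature aggregation. It is my own simplification, not the authors' implementation; the MLP layout, tensor names, and dimensions are assumptions.

```python
# Illustrative sketch only: for every 3D surface point we have K feature vectors
# from the source views plus their viewing directions, and we pool them into a
# single feature for the target ray using a learned attention over source views.
import torch
import torch.nn as nn

class ViewDependentAggregation(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        # scores a (feature, source-direction, target-direction) triple
        self.score_mlp = nn.Sequential(
            nn.Linear(feat_dim + 3 + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats, src_dirs, tgt_dir):
        # feats:    (P, K, F)  features of P surface points seen from K source views
        # src_dirs: (P, K, 3)  unit view directions from the source cameras
        # tgt_dir:  (P, 3)     unit direction of the target ray per point
        tgt = tgt_dir.unsqueeze(1).expand_as(src_dirs)                      # (P, K, 3)
        scores = self.score_mlp(torch.cat([feats, src_dirs, tgt], dim=-1))  # (P, K, 1)
        weights = torch.softmax(scores, dim=1)          # attention over source views
        return (weights * feats).sum(dim=1)             # (P, F) target-view feature

# The aggregated per-point features are then mapped into the target image plane
# and decoded by a convolutional renderer (omitted here).
agg = ViewDependentAggregation()
out = agg(torch.randn(1024, 8, 64), torch.randn(1024, 8, 3), torch.randn(1024, 3))
print(out.shape)  # torch.Size([1024, 64])
```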
The results are magnificent!
Source code
Paper
VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation
I continue discussing deep learning approaches for self-driving cars.
Future motion prediction is a task of paramount importance for autonomous driving. For a self-driving car to operate safely, it is crucial to anticipate the actions of other agents on the road.
In this video, I explain VectorNet, one of the methods for future motion prediction based on a vectorized representation of the scene instead of rasterized RGB images.
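If you wonder what "vectorized representation" means in practice, here is a tiny sketch of the core idea (my own simplification, not the paper's reference code; names and shapes are assumed): every lane segment or agent trajectory is a polyline made of small vectors, which a shared MLP encodes and a permutation-invariant pooling turns into one feature per polyline.

```python
# Rough sketch of a VectorNet-style polyline subgraph encoder. A global interaction
# module (e.g. self-attention over the polyline features) would follow in the full model.
import torch
import torch.nn as nn

class PolylineEncoder(nn.Module):
    def __init__(self, vec_dim=8, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vec_dim, hidden), nn.LayerNorm(hidden), nn.ReLU())

    def forward(self, vectors):
        # vectors: (num_polylines, num_vectors, vec_dim), e.g. [x0, y0, x1, y1, attrs...]
        h = self.mlp(vectors)              # per-vector features
        return h.max(dim=1).values         # permutation-invariant polyline feature

encoder = PolylineEncoder()
lane_polylines = torch.randn(16, 10, 8)    # 16 lane segments, 10 vectors each
agent_polylines = torch.randn(4, 20, 8)    # 4 agent trajectories, 20 vectors each
scene = torch.cat([encoder(lane_polylines), encoder(agent_polylines)], dim=0)  # (20, 64)
```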
▶️YouTube Video
📝Paper
GANsformer or StyleGAN2 on steroids
Facebook AI Research
GANsformer is a novel and efficient type of transformer, explored for the task of image generation. The network employs a bipartite structure that enables long-range interactions across the image while maintaining linear computational efficiency, and can readily scale to high-resolution synthesis. The model iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and to encourage the emergence of compositional representations of objects and scenes. In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network.
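Here is a loose sketch of what such bipartite attention with multiplicative modulation could look like (my own simplification with assumed names and dimensions, not the official GANsformer code): a small set of latent variables attends over the image feature grid, and each pixel is then modulated multiplicatively by what it gathered from the latents, so the cost stays linear in the number of pixels.

```python
import torch
import torch.nn as nn

class BipartiteAttention(nn.Module):
    def __init__(self, dim=128, num_latents=16):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_scale = nn.Linear(dim, dim)   # multiplicative (style-like) modulation
        self.to_shift = nn.Linear(dim, dim)

    def forward(self, latents, pixels):
        # latents: (B, L, D) latent variables; pixels: (B, H*W, D) image features
        attn = torch.softmax(self.to_q(pixels) @ self.to_k(latents).transpose(1, 2)
                             / latents.shape[-1] ** 0.5, dim=-1)       # (B, HW, L)
        gathered = attn @ self.to_v(latents)                           # (B, HW, D)
        # region-based multiplicative modulation instead of a plain additive residual
        return pixels * (1 + self.to_scale(gathered)) + self.to_shift(gathered)

layer = BipartiteAttention()
out = layer(torch.randn(2, 16, 128), torch.randn(2, 64 * 64, 128))     # (2, 4096, 128)
```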
The authors promise to release the pretrained models soon.
📓 arxiv.org/abs/2103.01209
⚒github.com/dorarad/gansformer
Hey folks!
I've got some good news for software engineers without an academic background in AI! 🦾
Yesterday Facebook AI launched a new program to draw more diverse talent into AI - Rotational AI Science & Engineering (RAISE).
RAISE is a 24-month full-time research engineering position at Facebook AI. You will work on 4 different AI teams (6 months each), and towards the end of the program you will select a permanent team at Facebook AI.
Application deadline: May 10th, 2021, 5pm PT.
https://ai.facebook.com/blog/facebook-ais-raise-program-an-innovative-new-career-pathway-into-ai
Neural 3D Video Synthesis
Facebook Reality Labs
These guys created a novel time-conditioned Neural Radiance Field (NeRF). The results are impressive. When it gets faster, it will enable mind-blowing applications!
It is essentially an extension of the NeRF model to videos. The model learns to generate video frames conditioned on position, view direction, and a time-variant latent code.
Temporal latent codes are optimized jointly with the other network parameters.
NeRF is notoriously slow and requires a long training time. Training a separate NeRF model for every frame would require ~15K GPU hours, while the proposed model needs only ~1.3K GPU hours.
📝 Paper: https://arxiv.org/abs/2103.02597
🌐 Project page: https://neural-3d-video.github.io
⚙️ Model architecture:
z_t is a time-variant learnable 1024-dimensional latent code; the rest is almost the same as in NeRF (a rough sketch follows the limitations below).
🔪 Limitations:
- Training requires time-synchronized input videos from multiple static cameras with known intrinsic and extrinsic parameters.
- Training for a single 10-second video is still quite slow for any real-life application: it takes about a week with 8 x V100 GPUs (~1300 GPU hours).
- Blur in moving regions of highly dynamic scenes with large and fast motions.
- Apparent artifacts when rendering from viewpoints beyond the bounds of the training views (the baseline NeRF model has the same problem).
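To make the time conditioning concrete, here is a minimal sketch of a time-conditioned NeRF-style MLP. Layer sizes, positional-encoding dimensions, and names are my assumptions, not the paper's exact architecture.

```python
# The per-frame latent codes z_t are ordinary nn.Embedding rows optimized jointly
# with the MLP weights; everything else mirrors a standard NeRF MLP.
import torch
import torch.nn as nn

class DynamicNeRF(nn.Module):
    def __init__(self, num_frames, pos_dim=63, dir_dim=27, z_dim=1024, hidden=256):
        super().__init__()
        self.z = nn.Embedding(num_frames, z_dim)        # time-variant latent codes
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma = nn.Linear(hidden, 1)               # volume density
        self.color = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),    # RGB
        )

    def forward(self, x_enc, d_enc, t):
        # x_enc: (N, pos_dim) encoded 3D points, d_enc: (N, dir_dim) encoded directions,
        # t: (N,) integer frame indices
        h = self.trunk(torch.cat([x_enc, self.z(t)], dim=-1))
        return self.sigma(h), self.color(torch.cat([h, d_enc], dim=-1))

model = DynamicNeRF(num_frames=300)
sigma, rgb = model(torch.randn(4096, 63), torch.randn(4096, 27),
                   torch.randint(0, 300, (4096,)))
```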
VQGAN: Taming Transformers for High-Resolution Image Synthesis
from my lab at Heidelberg University
Paper explained on my YouTube channel!
The authors introduce VQGAN, which combines the efficiency of convolutional approaches with the expressivity of transformers. VQGAN is essentially a GAN that learns a codebook of context-rich visual parts and uses it to quantize the bottleneck representation at every forward pass.
A self-attention model is used to learn a prior distribution over the codewords.
Sampling from this model then produces plausible constellations of codewords, which are fed through a decoder to generate realistic images at arbitrary resolution.
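Here is a tiny sketch of the codebook-quantization step at the bottleneck (my own simplified version, not the repository code): each spatial feature from the encoder is snapped to its nearest codebook entry, and a straight-through estimator lets gradients flow back to the encoder.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                              # z: (B, N, D) encoder features, N = H*W
        flat = z.reshape(-1, z.shape[-1])               # (B*N, D)
        d = torch.cdist(flat, self.codebook.weight)     # distances to all codewords
        idx = d.argmin(dim=-1).view(z.shape[:-1])       # (B, N) nearest-code indices
        z_q = self.codebook(idx)                        # (B, N, D) quantized features
        z_q = z + (z_q - z).detach()                    # straight-through gradient trick
        return z_q, idx

vq = VectorQuantizer()
z_q, idx = vq(torch.randn(2, 16 * 16, 256))
# 'idx' is the sequence of codewords that the transformer later learns a prior over.
```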
📝 Paper
⚙️ Code (with pretrained models)
📓 Colab notebook
📓 Colab notebook to compare the first stage models in VQGAN and in DALL-E
💪🏻🦾🤙🏼
▶️ YouTube Video explanation
Facebook open-sourced a library for state-of-the-art self-supervised learning: VISSL.
+ It contains reproducible reference implementations of SOTA self-supervised approaches (SimCLR, MoCo, PIRL, SwAV, etc.) and their reusable components. It also supports supervised training.
+ It is easy to train models on single-GPU, multi-GPU, and multi-node setups, with seamless scaling to large data and model sizes thanks to FP16, LARC, etc.
Finally somebody unified all recent works in one modular framework. I don't know about you, but I'm very happy 😌!
VISSL: https://vissl.ai/
Blogpost: https://ai.facebook.com/blog/seer-the-start-of-a-more-powerful-flexible-and-accessible-era-for-computer-vision
Tutorials in Google Colab: https://vissl.ai/tutorials/
Self-supervised Pretraining of Visual Features in the Wild
Facebook also published its ultimate SElf-supERvised (SEER) model.
- They pretrained it on 1B random, unlabeled, and uncurated Instagram images 👀.
- SEER outperformed SOTA self-supervised systems, reaching 84.2% top-1 accuracy on ImageNet.
- SEER also outperformed SOTA supervised models on downstream tasks, including low-shot, object detection, segmentation, and image classification.
- When trained with just 10% of the ImageNet, SEER still achieved 77.9% top-1 accuracy on the full data set. When trained with just 1% of the annotated ImageNet examples, SEER achieved 60.5% top-1 accuracy.
- SEER is based on the recent RegNet architecture. Under comparable training settings and FLOPs, RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
📝 Paper
📖 Blogpost
⚙️ I guess the source code will be published as a part of VISSL soon.
A synthesized StyleGAN2 portrait was tuned to match a textual description using the CLIP encoder. A man was transformed into a vampire by navigating the latent space with the query "an image of a man resembling a vampire, with the face of Count Dracula". Video attached.
For me, this looks like sorcery ✨.
➖ Link to the source tweet
📓 Colab notebook
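For those curious how such CLIP-guided latent navigation typically works, here is a hedged, generic sketch (my own simplification, not the linked notebook): freeze a pretrained generator and the CLIP encoders, then optimize the latent code so the rendered image matches the text prompt. DummyGenerator below is only a stand-in for a real pretrained StyleGAN2 generator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP

class DummyGenerator(nn.Module):                       # placeholder for StyleGAN2
    def __init__(self, latent_dim=512):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 3 * 64 * 64)
    def forward(self, w):
        return torch.tanh(self.fc(w)).view(-1, 3, 64, 64)   # image in [-1, 1]

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
tokens = clip.tokenize(["an image of a man resembling a vampire, "
                        "with the face of Count Dracula"]).to(device)
text_feat = clip_model.encode_text(tokens).float().detach()

G = DummyGenerator().to(device).eval()
w = torch.randn(1, 512, device=device, requires_grad=True)   # latent of the portrait
opt = torch.optim.Adam([w], lr=0.02)

for step in range(200):
    img = (G(w) + 1) / 2                                      # to [0, 1]
    img = F.interpolate(img, size=224, mode="bilinear", align_corners=False)
    img_feat = clip_model.encode_image(img).float()           # CLIP normalization omitted for brevity
    loss = 1 - F.cosine_similarity(img_feat, text_feat).mean()
    opt.zero_grad(); loss.backward(); opt.step()              # nudge w toward the prompt
```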
Barlow Twins: Self-Supervised Learning via Redundancy Reduction
Yann LeCun, FAIR
New self-supervised learning loss: compute the cross-correlation matrix between the features of two distorted versions of a sample and push it as close to the identity matrix as possible (see the sketch below).
+ This naturally avoids representation collapse and causes the representation vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors.
+ It is also robust to the training batch size.
+ Comparable to SOTA self-supervised methods (similar results as BYOL), but the method is conceptually simpler.
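Here is a compact sketch of that loss (my own reimplementation of the idea as described above; the trade-off coefficient lambd is an assumption): diagonal terms are pushed to 1 (invariance), off-diagonal terms to 0 (redundancy reduction).

```python
import torch

def barlow_twins_loss(z1, z2, lambd=5e-3):
    # z1, z2: (N, D) embeddings of two distorted views of the same N samples
    z1 = (z1 - z1.mean(0)) / z1.std(0)                  # normalize each feature dimension
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    N, D = z1.shape
    c = z1.T @ z2 / N                                   # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()      # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy-reduction term
    return on_diag + lambd * off_diag

loss = barlow_twins_loss(torch.randn(256, 128), torch.randn(256, 128))
```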
⚙️ My favorite part, training resources: 32x V100 GPUs, approx. 124 hours
📝 Paper
🛠 Code (will be released soon)