PWC-Net, with a Taste of Optical Flow 101

In this blog post, I’ll analyze PWC-Net, a CVPR ’18 oral paper. If you need optical flow as an additional supervision signal for your model, PWC-Net is a common and natural choice at the moment. The implementation can be found at this link.

PWC-Net is an end-to-end, well-engineered CNN model for optical flow. The name PWC encodes the main contributions of the paper: pyramidal processing, warping, and cost volume. I think this is a very didactic paper to read: you can learn the foundations of optical flow and see how they are injected into a deep network.

If you’re reading this blog post, you most probably know that optical flow is the 2D, real-valued motion vector of every pixel from one frame to the next. The goal of optical-flow estimation is to predict this (x, y) motion vector for every pixel in the first frame. Well, this is not an easy task.

Optical flow estimation started with the brightness constancy assumption. Long story short: can we grab a patch from the first image, match it in the next frame, and call the problem solved since we now know the motion? In classical computer vision (e.g. Farnebäck, or Lucas-Kanade from 1981), you can get satisfactory results this way by tuning lots of parameters. To convey this domain knowledge to a deep network, the cost volume in PWC-Net exploits the same idea through an operation called cross-correlation, which scores how well feature patches of the two frames match and makes the network robust to brightness changes in these patches.
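
To make this concrete, here is a minimal PyTorch sketch of a cost volume built by cross-correlating the features of the two frames over a small search window. The function name and the search-window handling are my own illustration, not the paper's actual implementation:

```python
import torch


def cost_volume(f1, f2, max_disp=4):
    """Correlate frame-1 features with (possibly warped) frame-2 features
    over a (2*max_disp+1)^2 search window around each pixel."""
    B, C, H, W = f1.shape
    f2_pad = torch.nn.functional.pad(f2, [max_disp] * 4)
    vols = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            f2_shift = f2_pad[:, :, dy:dy + H, dx:dx + W]
            # cross-correlation = mean over channels of the elementwise product
            vols.append((f1 * f2_shift).mean(dim=1, keepdim=True))
    return torch.cat(vols, dim=1)  # B x (2*max_disp+1)^2 x H x W
```

Each output channel scores one candidate displacement; a small CNN then decodes these matching costs into a motion vector per pixel.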

However, a simple matching algorithm suffers when the scale changes, when the details are not informative enough, and/or when there are occlusions. For example, predicting the optical flow of a car moving towards the camera is a hard problem when the car has a solid color and is occluded by trees and pedestrians. The solution is, in a sense, asking your neighbors. As a remedy, pyramidal processing (remember FPN) enables the network to learn features at multiple scales. The network can use fine-grained features for the smaller details and, as you may guess, handle bigger objects at the coarser scales to capture a global context and aggregate more information.
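
A feature pyramid like this can be sketched as a few strided convolutions, each level halving the spatial resolution so that the coarser levels cover larger motions. The channel sizes and class name below are made up for illustration; PWC-Net's actual feature extractor differs in depth and width:

```python
import torch
import torch.nn as nn


class FeaturePyramid(nn.Module):
    """Tiny learnable feature pyramid: each stage halves the resolution."""
    def __init__(self, channels=(3, 16, 32, 64)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                nn.LeakyReLU(0.1),
            )
            for cin, cout in zip(channels[:-1], channels[1:])
        )

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # fine-to-coarse list of feature maps
```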

Given two consecutive frames, and with pyramidal processing and the cost volume as tools, we can compute features at each scale and then build a cost volume by correlating them; this cost volume can be used directly to predict optical flow. Going from coarser to finer resolutions, PWC-Net warps the CNN feature maps of the second frame with the previously predicted and upsampled flow, then uses these warped features together with the features from the first image to build a new cost volume at each level, which is again used to predict optical flow. Dividing the motion estimation into several levels keeps the search range small; after each warping, the motion information is propagated across the pyramid levels.
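
The warping step can be sketched as backward warping with bilinear sampling. This is a minimal version built on PyTorch's `grid_sample`; the paper's implementation differs in details such as masking of out-of-bounds pixels:

```python
import torch
import torch.nn.functional as F


def warp(feat, flow):
    """Backward-warp frame-2 features towards frame 1 using a flow field
    of shape B x 2 x H x W (x-displacement first, then y)."""
    B, C, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float()   # 2 x H x W pixel coordinates
    grid = grid.unsqueeze(0) + flow               # add per-pixel motion
    # normalize coordinates to [-1, 1] as grid_sample expects
    grid_x = 2.0 * grid[:, 0] / max(W - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=3)   # B x H x W x 2
    return F.grid_sample(feat, grid, align_corners=True)
```

With a zero flow field, the warp is (up to floating-point error) the identity, which is a handy sanity check.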

All of these design choices make PWC-Net not only efficient but also effective. The figure below is my favorite from the paper: PWC-Net achieves the best EPE (end-point error, the average Euclidean distance between the predicted and ground-truth optical flow; lower is better) while keeping the parameter count under control.
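
For reference, EPE is a one-liner; this sketch assumes flow tensors of shape B x 2 x H x W:

```python
import torch


def epe(flow_pred, flow_gt):
    """Average end-point error: mean Euclidean distance between the
    predicted and ground-truth flow vectors."""
    return torch.norm(flow_pred - flow_gt, p=2, dim=1).mean()
```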

Figure 1 from the paper. PWC-Net is 17 times smaller in size and easier to train than the FlowNet2 model.

Figure 1. Left: PWC-Net outperforms all published methods on the MPI Sintel final-pass benchmark in both accuracy and running time. Right: among existing end-to-end CNN models for flow, PWC-Net reaches the best balance between accuracy and size (measured by EPE; smaller is better).

If you are more curious, I can recommend Deqing Sun’s talk at ROB 2018, where he explains each step 🙂

Deqing Sun’s talk for PWC-Net. Highly recommended.

NVAE: A Leap Forward

In this post, I’ll analyze the Nouveau VAE (NVAE) paper by Arash Vahdat and Jan Kautz. It is a Neural Information Processing Systems (NeurIPS) 2020 spotlight paper; you can access it from this link, and the source code is available.


Nouveau Variational Autoencoder (NVAE) is the first VAE capable of generating high-quality images (up to 256 x 256, which is very good) with the standard VAE objective (reconstruction plus KL). The main contribution of this paper is the use of residual conditional distributions: at each level of the latent hierarchy, a new posterior is sampled conditioned on the previously computed latents, which also define the prior for the next level. Moreover, the model is still fast at generating images and stable during training.
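
The residual parametrization can be sketched as the encoder predicting offsets relative to the prior's parameters rather than absolute Gaussian parameters. The function name and the plain reparametrized sampling below are my own illustration of the idea, not the official NVAE code:

```python
import torch


def residual_posterior_sample(mu_p, log_sig_p, d_mu, d_log_sig):
    """Sample z ~ q(z|x) where the posterior is parameterized as an offset
    (d_mu, d_log_sig) from the prior N(mu_p, exp(log_sig_p)^2)."""
    mu_q = mu_p + d_mu
    log_sig_q = log_sig_p + d_log_sig
    eps = torch.randn_like(mu_q)          # reparametrization trick
    return mu_q + eps * torch.exp(log_sig_q)
```

When the offsets are zero, the posterior collapses onto the prior, which keeps the KL term between the two distributions small and bounded at initialization.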

Main takeaways:

  • They make the VAE a competitive model by carefully designing it. They performed extensive experiments (on MNIST, CIFAR-10, CelebA, FFHQ, and ImageNet) and ablated their improvements one by one. NVAE outperforms the state-of-the-art non-autoregressive flow and VAE models everywhere except ImageNet. It is also the first VAE trained on FFHQ.
  • They added depthwise separable convolutions to increase the receptive field; multi-scale processing helps.
  • They introduced batch normalization, the Swish activation function, and squeeze-and-excitation into each residual block to further boost performance, as shown by their experiments.
  • To keep the KL loss bounded, they introduced a new residual parametrization of the approximate posterior, which makes training more robust.
  • The authors also included a stability trick called spectral norm regularization; in short, the aim is to regularize the Lipschitz constant of each layer.
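
Several of these ingredients fit in one residual cell. The block below is a rough PyTorch illustration (the class name, channel counts, and layer ordering are my own assumptions, not the official NVAE architecture):

```python
import torch
import torch.nn as nn


class SEResidualCell(nn.Module):
    """Sketch of a residual cell: BN -> Swish -> depthwise-separable conv
    -> BN -> squeeze-and-excitation gating, with a skip connection."""
    def __init__(self, c, se_ratio=16):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(c)
        self.dw = nn.Conv2d(c, c, 3, padding=1, groups=c)  # depthwise
        self.pw = nn.Conv2d(c, c, 1)                       # pointwise
        self.bn2 = nn.BatchNorm2d(c)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // se_ratio, 1), nn.SiLU(),
            nn.Conv2d(c // se_ratio, c, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = torch.nn.functional.silu(self.bn1(x))  # Swish == SiLU
        h = self.pw(self.dw(h))
        h = self.bn2(h)
        h = h * self.se(h)                         # channel-wise gating
        return x + h
```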

Shortcomings and limitations:

  • Despite the competitive performance, they do not compare their model to GANs, since there is still a gap; in the related-work section, they don’t even mention adversarial training.
  • The work builds on top of Inverse Autoregressive Flows (IAFs), but the differences could be explained in more detail.
  • Moreover, even though their engineering effort deserves admiration, their model is sensitive to small changes in hyperparameters.


Open questions:

  • Even though the generated faces are crisp, the texture on them is too smooth; there are no wrinkles on the faces. Why is that?
  • What can be done to further improve NVAE? Is this the limit of VAEs? We don’t know yet.
  • Should we count using mixed precision as a contribution? I think it’s an ad for NVIDIA’s APEX library 🙂

CheatSheetCeption: Cheat Sheet of Cheat Sheets

I have compiled several cheat sheets for work in this post, covering important tools such as Vim, the command line, Anaconda, Git, NumPy, Matplotlib, PyTorch, Docker, and the AWS CLI. Enjoy your Swiss Army knife!


Vim

We should admit that it is hard to learn Vim. This cheat sheet will be a lifesaver, especially if you forgot how to exit Vim 😀

Command-line Environment

Git-Tower’s cheat sheet covers most things for the command line, but it is missing some handy tricks, such as using aliases and sending signals to processes.

For these tricks, I can recommend The Missing Semester of Your CS Education course notes from MIT, especially the shell tools and command-line environment lectures.

Anaconda – Python

From the official Anaconda website, this is the cheat sheet that I’ve been using almost daily.


Git

If you are new to Git, these two cheat sheets are very useful when handling version-control operations in a project.

NumPy & Matplotlib & PyTorch

Honestly, I’m not using these, but they can be excellent decor for any AI-related office 🙂

Plus, someone should convert this PyTorch documentation page into a poster.


Docker

These are also cool. I’m not a DevOps guy, but they are useful when deploying MVP stuff.


AWS-CLI

Well, this cheat sheet is very, very good. I think I may not need Stack Overflow for AWS-CLI-related issues after discovering this 😀