PWC-Net with an Optical Flow 101 Taste

In this blog post, I’ll analyze PWC-Net, a CVPR ’18 oral paper. If you need an additional supervision signal for your model, PWC-Net is currently a common and natural choice. The implementation can be found in this link.

PWC-Net is an end-to-end, well-engineered CNN model for optical flow. The name PWC encodes the main contributions of the paper: pyramidal processing, warping, and cost volume. I think this is a very didactic paper to read: you can learn the foundations of optical flow and how they are built into a deep network.

If you’re reading this blog post, you most probably know that optical flow is the 2D real-valued motion vector of every pixel from one frame to the next. The goal of optical flow estimation is to predict this 2D (x and y) motion vector for every pixel in the first frame. Well, this is not an easy task.

Optical flow starts with the brightness constancy assumption: a point keeps its brightness as it moves from one frame to the next. Long story short, we can grab a patch from the first image and match it in the next frame, and the problem is solved since we now know the motion, right? In classical computer vision (e.g. Lucas-Kanade, 1981, or Farnebäck), you can get satisfactory results by adjusting lots of parameters. To convey this domain knowledge to a deep network, the cost volume in PWC-Net implements this matching idea with an operation called cross-correlation. Because the correlation is computed on learned features rather than on raw pixel intensities, the matching is less sensitive to brightness changes in these so-called patches.
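To make the matching idea concrete, here is a minimal NumPy sketch of a correlation cost volume. It is not the paper's implementation (which runs as a GPU layer inside the network); the function name cost_volume, the (C, H, W) layout, and the zero padding at the borders are my own assumptions for illustration.

```python
import numpy as np

def cost_volume(feat1, feat2, max_disp=4):
    """Correlate feat1 with shifted copies of feat2.

    feat1, feat2: feature maps of shape (C, H, W).
    Returns an array of shape ((2 * max_disp + 1) ** 2, H, W); each channel
    holds the matching score for one candidate displacement (dy, dx).
    """
    c, h, w = feat1.shape
    d = max_disp
    # Zero-pad so that displacements pointing outside the image score zero.
    padded = np.pad(feat2, ((0, 0), (d, d), (d, d)))
    costs = []
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            # shifted[:, y, x] == feat2[:, y + dy, x + dx]
            shifted = padded[:, d + dy : d + dy + h, d + dx : d + dx + w]
            # Channel-averaged dot product between the two feature vectors.
            costs.append((feat1 * shifted).mean(axis=0))
    return np.stack(costs)

# Toy example: frame 2 is frame 1 moved 1 px down and 2 px right.
rng = np.random.default_rng(0)
f1 = rng.standard_normal((64, 16, 16))
f2 = np.roll(f1, shift=(1, 2), axis=(1, 2))
cv = cost_volume(f1, f2, max_disp=3)
print(cv.shape)  # (49, 16, 16)
```

For each pixel, the channel with the highest score points at the most likely displacement. In PWC-Net the cost volume is not read off with an argmax; it is fed to further convolutional layers that regress the flow.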

However, a simple matching algorithm suffers when the scale changes, when the details are not informative enough, and/or when there are occlusions. For example, predicting the optical flow of a car moving towards the camera is a hard problem when the car has a solid color and is occluded by trees and pedestrians. The solution is in the vein of asking your neighbor. As a remedy, pyramidal processing (remember FPN) lets us learn features at multiple scales. The network can use the high-resolution levels for the smaller details and, as you may guess, the coarser levels for bigger objects, capturing global context and aggregating more information.
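As a rough illustration of the pyramid idea, the sketch below builds a fixed image pyramid by averaging 2x2 blocks. This is a stand-in of my own: PWC-Net learns its feature pyramid with strided convolutions, and the names downsample2x and build_pyramid are hypothetical.

```python
import numpy as np

def downsample2x(img):
    """Halve height and width by averaging each 2x2 block.

    img: array of shape (H, W) or (H, W, C); odd trailing rows/columns are dropped.
    """
    h = img.shape[0] // 2 * 2
    w = img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2]
            + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def build_pyramid(img, levels=4):
    """Return [full resolution, 1/2, 1/4, ...] with `levels` entries."""
    pyramid = [img.astype(np.float64)]
    for _ in range(levels - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid

pyr = build_pyramid(np.ones((64, 48)), levels=4)
print([p.shape for p in pyr])  # [(64, 48), (32, 24), (16, 12), (8, 6)]
```

The useful side effect is that motion shrinks with resolution: a 16-pixel displacement at full resolution is only a 2-pixel displacement three levels up, so a small search window is enough there.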

Given two consecutive frames and the pyramidal processing and cost volume tools, we can compute features and then a cost volume by correlating these features at each scale. This cost volume can then be used directly to predict optical flow. Going from the coarsest (smallest) to the finest (largest) resolution, PWC-Net warps the CNN feature maps of the second image with the upsampled flow predicted at the previous level; it then uses these warped features and the features of the first image to build a new cost volume at that level, which is used to predict the optical flow again. Splitting the motion estimation across several levels keeps the search range at each level small; after each warping step, the motion information is propagated to the next pyramid level.
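The warping step can be sketched in NumPy as backward warping with bilinear interpolation. The function names warp and upsample_flow are hypothetical, the flow layout (channel 0 = x, channel 1 = y) is an assumption, and the nearest-neighbour upsampling is a simplification of mine, not the upsampling used in the network.

```python
import numpy as np

def warp(feat2, flow):
    """Backward-warp feat2 (C, H, W) toward frame 1 with flow (2, H, W).

    flow[0] is the horizontal (x) and flow[1] the vertical (y) displacement.
    Output pixel (y, x) samples feat2 at (y + flow[1], x + flow[0]) with
    bilinear interpolation; samples outside the image contribute zero.
    """
    c, h, w = feat2.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sx = xs + flow[0]
    sy = ys + flow[1]
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    wx = sx - x0  # fractional parts used as bilinear weights
    wy = sy - y0
    out = np.zeros_like(feat2, dtype=np.float64)
    corners = [
        (0, 0, (1 - wy) * (1 - wx)),
        (0, 1, (1 - wy) * wx),
        (1, 0, wy * (1 - wx)),
        (1, 1, wy * wx),
    ]
    for oy, ox, weight in corners:
        yy = y0 + oy
        xx = x0 + ox
        valid = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
        yc = np.clip(yy, 0, h - 1)
        xc = np.clip(xx, 0, w - 1)
        out += feat2[:, yc, xc] * (weight * valid)
    return out

def upsample_flow(flow):
    """Double the resolution of a (2, H, W) flow field (nearest neighbour).

    The values are doubled too, because a 1 px motion at the coarse level
    corresponds to a 2 px motion at the next finer level.
    """
    return 2.0 * flow.repeat(2, axis=1).repeat(2, axis=2)
```

If the flow is correct, the warped second-frame features line up with the first-frame features, so the cost volume at the next level only has to search for the small residual motion.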

All of these design steps make PWC-Net not only efficient but also effective. The figure below is my favorite from the paper; PWC-Net achieves the best EPE (end-point error, the average Euclidean distance between the predicted and ground-truth flow vectors; lower is better) while keeping the parameter count under control.

Figure 1 from the paper. PWC-Net is 17 times smaller in size and easier to train than the FlowNet2 model.

Figure 1. Left: PWC-Net outperforms all published methods on the MPI Sintel final pass benchmark in both accuracy and running time. Right: among existing end-to-end CNN models for flow, PWC-Net reaches the best balance between accuracy and size. The metric shown is EPE, where smaller is better.
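For reference, the average end-point error can be computed in a couple of lines. This is a minimal sketch assuming both flow fields are stored as (2, H, W) arrays; the function name average_epe is my own.

```python
import numpy as np

def average_epe(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and ground-truth flow vectors.

    Both inputs have shape (2, H, W): one (x, y) motion vector per pixel.
    """
    per_pixel_error = np.sqrt(((flow_pred - flow_gt) ** 2).sum(axis=0))
    return per_pixel_error.mean()

pred = np.zeros((2, 4, 4))
gt = np.zeros((2, 4, 4))
gt[0] = 3.0  # every pixel truly moves 3 px in x ...
gt[1] = 4.0  # ... and 4 px in y
print(average_epe(pred, gt))  # 5.0
```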

If you are curious to learn more, I recommend Deqing Sun’s talk at ROB 2018, where he explains each step 🙂

Deqing Sun’s talk for PWC-Net. Highly recommended.
