In this blog post, I’ll analyze PWC-Net, a CVPR ’18 oral paper. If you need an additional supervision signal for your model, PWC-Net is currently a common and natural choice. The implementation can be found in this link.
If you’re reading this blog post, you most probably know that optical flow is the 2D real-valued motion vector of every pixel from one frame to the next. The goal of optical-flow estimation is to predict a 2D (x and y) motion vector for every pixel in the first frame. Well, this is not an easy task.
Optical flow starts from the brightness constancy assumption: a point keeps the same brightness as it moves from one frame to the next. Long story short, we can grab a patch from the first image and match it in the next frame, and the problem is solved since we then know the motion, right? In classical computer vision (e.g. Lucas-Kanade, 1981, or Farnebäck), you can get satisfactory results by adjusting lots of parameters. To convey this domain knowledge to a deep network, PWC-Net builds its cost volume with an operation called cross-correlation, which scores how well the features at a pixel in the first frame match the features at nearby candidate positions in the second. This operation makes the network sensitive to brightness changes in these so-called patches.
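To make the cost volume idea concrete, here is a minimal NumPy sketch of correlating two feature maps over a small search window. The function name `cost_volume`, the `(C, H, W)` layout and the default `max_disp=4` are my own choices for illustration, not code from the official implementation.

```python
import numpy as np

def cost_volume(feat1, feat2, max_disp=4):
    """Correlate feat1 with shifted copies of feat2.

    feat1, feat2: arrays of shape (C, H, W).
    Returns an array of shape ((2 * max_disp + 1) ** 2, H, W); the entry for
    shift (dy, dx) at pixel (y, x) is the channel-averaged dot product of
    feat1[:, y, x] and feat2[:, y + dy, x + dx].
    """
    c, h, w = feat1.shape
    d = max_disp
    # Zero-pad feat2 so every shifted window stays inside the array.
    padded = np.pad(feat2, ((0, 0), (d, d), (d, d)))
    costs = []
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = padded[:, d + dy:d + dy + h, d + dx:d + dx + w]
            # Averaging over channels normalizes by the feature dimension.
            costs.append((feat1 * shifted).mean(axis=0))
    return np.stack(costs)
```

For each pixel, the channel of the cost volume with the highest value points at the most similar location in the second feature map, which is exactly the "match the patch" intuition above.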
However, a simple matching algorithm suffers when the scale changes, when the details are not informative enough, and/or when there are occlusions. For example, predicting the optical flow of a car moving towards the camera is a hard problem when the car has a solid color and is occluded by trees and pedestrians. The solution is in the vein of asking your neighbor. As a remedy, pyramidal processing (remember FPN, the Feature Pyramid Network) lets us learn features at multiple scales. The network can use fine-grained features for the smaller details and, as you may guess, it can handle bigger objects at the coarser scales, where it captures global context and aggregates more information.
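As a rough illustration of the pyramid idea, the sketch below halves the resolution of a feature map at each level. Note the assumption: I use plain 2x2 average pooling as a stand-in, whereas a network like PWC-Net learns its pyramid with convolutional layers. The name `build_pyramid` is hypothetical.

```python
import numpy as np

def build_pyramid(feat, num_levels=3):
    """Return [finest, ..., coarsest] for a (C, H, W) array.

    Each level halves H and W with 2x2 average pooling (a stand-in for
    learned strided convolutions).
    """
    levels = [feat]
    for _ in range(num_levels - 1):
        c, h, w = levels[-1].shape
        # Drop a trailing row/column if the size is odd.
        f = levels[-1][:, :h - h % 2, :w - w % 2]
        pooled = f.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
        levels.append(pooled)
    return levels
```

One pixel at the coarsest level summarizes a large region of the input, which is why large objects and large motions are easier to handle there.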
Given two consecutive frames and the two tools above (pyramidal processing and the cost volume), we can compute features for both frames and then build a cost volume by correlating these features at each scale. This cost volume can then be used directly to predict optical flow. Going from the coarsest level to the finest, PWC-Net upsamples the flow predicted at the previous level and uses it to warp the CNN feature maps of the second image; it then builds a new cost volume from these warped features and the features of the first image, and predicts the optical flow again from that cost volume. Splitting the motion calculation into several levels keeps the search range small at each one; after each warping, the motion information is propagated through the pyramid levels.
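The warping step can be sketched in NumPy as follows. This is a simplified version under my own assumptions: `warp` does bilinear backward warping with zeros outside the image, and `upsample_flow` uses nearest-neighbor upsampling instead of the learned upsampling a real network would use. Both names are hypothetical.

```python
import numpy as np

def upsample_flow(flow):
    """Double the resolution of a (2, H, W) flow field (nearest neighbor).

    The values are doubled too, because a displacement of one coarse pixel
    corresponds to two pixels at the finer level.
    """
    return 2.0 * flow.repeat(2, axis=1).repeat(2, axis=2)

def warp(feat2, flow):
    """Backward-warp feat2 (C, H, W) toward frame 1 with flow (2, H, W).

    flow[0] is the horizontal (x) and flow[1] the vertical (y) displacement
    in pixels. Samples outside the image contribute zero.
    """
    c, h, w = feat2.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = xs + flow[0]
    y = ys + flow[1]
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    wx = x - x0  # fractional parts are the bilinear weights
    wy = y - y0

    def sample(yi, xi):
        valid = (yi >= 0) & (yi < h) & (xi >= 0) & (xi < w)
        vals = feat2[:, np.clip(yi, 0, h - 1), np.clip(xi, 0, w - 1)]
        return vals * valid

    return ((1 - wy) * (1 - wx) * sample(y0, x0)
            + (1 - wy) * wx * sample(y0, x0 + 1)
            + wy * (1 - wx) * sample(y0 + 1, x0)
            + wy * wx * sample(y0 + 1, x0 + 1))
```

If the flow is correct, the warped second-frame features line up with the first-frame features, so the next cost volume only has to search for a small residual motion.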
All of these design steps make PWC-Net not only efficient but also effective. The figure below is my favorite from the paper; PWC-Net achieves the best EPE (end-point error, the average Euclidean distance between the predicted and ground-truth flow vectors; lower is better) while keeping the number of parameters under control.
Figure 1. Left: PWC-Net outperforms all published methods on the MPI Sintel final pass benchmark in both accuracy and running time. Right: among existing end-to-end CNN models for flow, PWC-Net reaches the best balance between accuracy and size. The metric shown is EPE; smaller is better.
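For reference, EPE is simple to compute: take the Euclidean distance between the predicted and ground-truth flow vector at every pixel and average it. A minimal sketch, assuming both flows are stored as `(2, H, W)` arrays (the function name is mine):

```python
import numpy as np

def end_point_error(pred, gt):
    """Average Euclidean distance between two (2, H, W) flow fields."""
    per_pixel = np.sqrt(((pred - gt) ** 2).sum(axis=0))
    return float(per_pixel.mean())
```

For example, if the prediction is zero everywhere and the true motion is 3 pixels in x and 4 pixels in y at every pixel, the EPE is 5 pixels.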
If you are curious to learn more, I recommend Deqing Sun’s talk at ROB 2018, where he explains each step 🙂