Temporal Shift Module:
Zero-Computation, Zero-Parameter Solution for Efficient Video Understanding

Yo-whan Kim
12/09/21
6.S898 Deep Learning Final Project

    Every minute, about 500 hours' worth of video are uploaded to YouTube as of 2021, and with this exponential growth in video streaming, low-cost, high-accuracy video understanding is essential. While 2D Convolutional Neural Networks (CNNs) have shown great promise on image-based problems, they fail to capture the temporal relationships among frames in videos. 3D CNN-based networks utilize this temporal information to outperform their 2D counterparts, but they are expensive to deploy because they require much heavier computation.

A still from Titanic (1997). Are they about to kiss, or have they already kissed? Actions can be ambiguous without temporal information. 

    This blog will focus mainly on the paper TSM: Temporal Shift Module for Efficient Video Understanding by Lin et al. TSM shifts parts of the feature tensors along the temporal dimension, which allows temporal information to be shared among neighboring frames. Due to the nature of the shift operation, TSM can be implemented and inserted into 2D CNNs with essentially zero extra computation and zero extra parameters. As a result, the authors argue that TSM can achieve the performance of 3D CNNs while retaining the complexity of 2D CNNs.

Evolution of Video Understanding Models

2D CNN Models

    Surprisingly, applying a 2D CNN to a randomly selected video frame (or averaging over multiple frames to be fancy) produces tolerable results, even though it does not make use of any temporal information. For many years, this method served as the baseline for performance comparisons.

Predicting using a single frame as an input to 2D CNN
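To make this baseline concrete, here is a rough, hypothetical sketch of single-frame and frame-averaged prediction with a 2D CNN; the backbone choice, tensor shapes, and function names below are illustrative and not taken from any of the papers discussed.

```python
import torch
import torchvision.models as models

# Any off-the-shelf 2D image classifier can serve as the backbone.
backbone = models.resnet50(weights=None)
backbone.eval()

def predict_single_frame(video: torch.Tensor) -> torch.Tensor:
    """Classify a video from one randomly selected frame. video: (T, C, H, W)."""
    t = torch.randint(0, video.shape[0], (1,)).item()
    with torch.no_grad():
        return backbone(video[t : t + 1])           # (1, num_classes)

def predict_frame_average(video: torch.Tensor) -> torch.Tensor:
    """Classify every frame independently and average the predictions."""
    with torch.no_grad():
        logits = backbone(video)                    # (T, num_classes)
    return logits.mean(dim=0, keepdim=True)         # (1, num_classes)

clip = torch.randn(8, 3, 224, 224)                  # a dummy 8-frame clip
print(predict_frame_average(clip).shape)            # torch.Size([1, 1000])
```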

    In 2014, Karpathy et al. proposed and examined multiple 2D CNN architectures for video action recognition in Large-Scale Video Classification with Convolutional Neural Networks. Some of the proposed networks aim to establish temporal connectivity patterns by fusing information over the temporal dimension throughout the network, as shown by the three fusion methods below.

Late Fusion

merges frame information late in the network, after several convolutions  

Early Fusion

merges neighboring frame information early in the network 

Slow Fusion

merges neighboring frames within multiple windows, then slowly merges them higher up the network in a pyramid-like manner

     Although these straightforward fusion methods gave a significant boost in performance and surpassed the frame feature-based baseline models, the proposed 2D CNNs that operate on individual frames still cannot model temporal information adequately.
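As a purely illustrative sketch (the layer sizes and module names here are mine, not Karpathy et al.'s exact architectures), early fusion can be approximated by stacking the frames along the channel axis before the first convolution, while late fusion processes each frame independently and merges the per-frame features near the end.

```python
import torch
import torch.nn as nn

T, C, H, W = 4, 3, 224, 224
clip = torch.randn(T, C, H, W)

# Early fusion (illustrative): stack the T frames along the channel axis so
# that the very first convolution already mixes information across time.
early_conv = nn.Conv2d(T * C, 64, kernel_size=7, stride=2, padding=3)
early_features = early_conv(clip.reshape(1, T * C, H, W))     # (1, 64, 112, 112)

# Late fusion (illustrative): run every frame through the same 2D trunk
# independently, then merge the per-frame features just before the classifier.
trunk = nn.Sequential(nn.Conv2d(C, 64, 7, stride=2, padding=3), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
per_frame_features = trunk(clip)                              # (T, 64)
late_features = per_frame_features.mean(dim=0, keepdim=True)  # (1, 64)
```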

Optical Flow Models

   
    More efforts were made to better exploit the temporal stream using 2D CNNs. In 2014, Simonyan et al. proposed a network that uses optical flow in addition to still-frame information and is competitive with the state of the art on action recognition benchmarks in Two-stream Convolutional Networks for Action Recognition in Videos. The proposed network uses one CNN stream to classify the spatial information of a video from a single frame, while a separate CNN stream classifies a stack of dense optical flow. The two streams are then merged late in the network to make the final classification.
    Although the idea of having multiple streams in a network is still present in many state-of-the-art models (including 3D CNN models), acquiring accurate optical flow can be expensive and impractical, especially for real-time tasks.

Two-stream network incorporating frame and optical flow for spatial and temporal inference
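Below is a minimal sketch of the two-stream structure just described; the tiny layer stacks, the flow-stack length, and the 101 output classes (as in UCF101) are illustrative assumptions rather than the authors' exact network.

```python
import torch
import torch.nn as nn

def tiny_classifier(in_channels: int, num_classes: int = 101) -> nn.Module:
    """A toy stand-in for a full 2D CNN stream."""
    return nn.Sequential(nn.Conv2d(in_channels, 64, 7, stride=2, padding=3),
                         nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(64, num_classes))

flow_len = 10                                    # number of consecutive flow fields
spatial_stream = tiny_classifier(3)              # sees a single RGB frame
temporal_stream = tiny_classifier(2 * flow_len)  # sees stacked x/y optical flow

rgb_frame = torch.randn(1, 3, 224, 224)
flow_stack = torch.randn(1, 2 * flow_len, 224, 224)   # precomputed dense optical flow

# Late fusion: average the class scores of the two streams.
scores = (spatial_stream(rgb_frame) + temporal_stream(flow_stack)) / 2
print(scores.shape)                              # torch.Size([1, 101])
```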

3D CNN Models

   
    3D CNNs learn spatiotemporal features jointly by treating video clips as space-time volumes and convolving over both space and time, and thus directly and effectively utilize the temporal information of videos for downstream tasks. In 2015, Tran et al. proposed a simple 3D CNN named C3D (Convolutional 3D) in Learning Spatiotemporal Features with 3D Convolutional Networks. C3D, whose architecture is based on a VGG model, uses 3×3×3 convolution kernels for all layers and outperformed previous state-of-the-art 2D CNNs on four major benchmarks.
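A minimal sketch of a C3D-style stack of 3×3×3 convolutions is shown below; only the first couple of stages are shown and the layer widths are illustrative, though the first pooling layer really does pool only spatially in C3D so that temporal information is not merged too early.

```python
import torch
import torch.nn as nn

# Illustrative C3D-style block: every convolution uses a 3x3x3 kernel, so
# space and time are convolved jointly.
c3d_block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),      # first pool is spatial-only in C3D
    nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),
)

clip = torch.randn(1, 3, 16, 112, 112)        # (N, C, T, H, W), a 16-frame clip
print(c3d_block(clip).shape)                  # torch.Size([1, 128, 8, 28, 28])
```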
    With improved hardware, the community began to explore deeper 3D networks in recent years. In 2017, Carreira et al. introduced single-stream and two-stream variants of a model based on inflating 2D ConvNets, named I3D (Inflated 3D ConvNet), in Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. I3D inflates the filters and pooling kernels of a very deep 2D image classification ConvNet into 3D, setting new state-of-the-art performance on several action classification benchmarks.
    While these 3D CNNs significantly improved video understanding, their computation cost grew significantly as well, making deployment on commercial devices challenging. In particular, online or real-time video recognition was impractical with such 3D networks due to latency and throughput issues.

Temporal Shift Module (TSM)

    TSM offers a novel solution for efficient video understanding by incorporating temporal shifts into 2D CNN models; the shifts are computationally free yet give the model spatio-temporal modeling ability comparable to 3D CNNs. In this part of the blog, we delve into the different shift strategies as well as the temporal shift module designs described in the paper. Then, we explore the change that must be made to the shift strategy for online, real-time tasks.

Intuition

    TSM performs efficient temporal modeling by simply shifting the feature map along the temporal dimension, i.e. without any “computations”. A portion of the channels is shifted in one direction along the temporal dimension, while another portion is shifted in the opposite direction, as shown in the accompanying animation. The remaining, unshifted channels are left unmodified. As a result, the feature map at each time step now contains information from the original frame as well as from its previous and future frames. Missing elements at the temporal edges are simply zero padded, and elements that overflow past the edges are truncated.
    The modified tensors are fed into a 2D CNN backbone model (e.g. ResNet-50). While TSM introduces almost no extra computation, it performs significantly better than the corresponding vanilla 2D CNN backbone. A detailed performance comparison is discussed later in the blog.

TSM Network Animation
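A minimal PyTorch sketch of the bidirectional shift is shown below, assuming the backbone processes a batch of N videos of T frames as a single (N·T, C, H, W) tensor; the function and argument names (temporal_shift, n_segments, fold_div) are my own labels for the operation described in the paper. With fold_div = 8, one eighth of the channels moves toward the past and one eighth toward the future, i.e. one quarter of the channels is shifted in total, which matches the proportion discussed in the next section.

```python
import torch

def temporal_shift(x: torch.Tensor, n_segments: int, fold_div: int = 8) -> torch.Tensor:
    """Bidirectional (offline) temporal shift, as a minimal sketch.

    x: feature map of shape (N * T, C, H, W) with T = n_segments frames per video.
    """
    nt, c, h, w = x.shape
    n = nt // n_segments
    x = x.view(n, n_segments, c, h, w)

    fold = c // fold_div
    out = torch.zeros_like(x)                              # zero padding at the edges
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # pull channels from the future frame
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # pull channels from the past frame
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # the rest stays in place
    return out.view(nt, c, h, w)

# Example: 2 videos x 8 frames, 64-channel feature maps.
feats = torch.randn(2 * 8, 64, 56, 56)
shifted = temporal_shift(feats, n_segments=8)
print(shifted.shape)                                       # torch.Size([16, 64, 56, 56])
```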

Temporal Partial Shift

    TSM shifts only a portion of the channels and leaves the rest of the elements unchanged, an operation the authors name the temporal partial shift. The authors give two main, empirically motivated reasons for shifting only a portion of the channels. In this section of the blog, we review the empirical results for various shift proportions and analyze the reasoning behind the specific proportion the authors chose.

    The two graphs above show the latency overhead and the accuracy, respectively, as a function of shift proportion. The first graph, which plots the change in GPU latency overhead across shift proportions, shows a clear increase in latency as the shift proportion grows. In fact, a naive full shift incurs roughly a 12.4% increase in latency, resulting in nontrivially slower inference.
    The second graph shows how accuracy changes with the shift proportion. There is a dramatic increase in accuracy over the 2D baseline when ⅛ of the channels are shifted. However, accuracy drops when more than ¼ of the channels are shifted, with a naive full shift performing almost as poorly as the 2D baseline. Balancing latency overhead against accuracy, the authors chose to shift ¼ of the channels (⅛ in each temporal direction), which provides significantly higher accuracy in return for only about a 3% latency overhead.
    Note that the 2D baseline is the Temporal Segment Network (TSN), which averages features extracted from sparsely (strided) sampled frames for efficient video understanding. TSN established new state-of-the-art performance on various action recognition benchmarks upon release.

Module Experiment

In-place TSM

Residual TSM

    Another neat design question is where to place the shift operation so that the model's capacity for spatial and temporal inference stays balanced. An elementary approach is to apply the shift before each convolutional layer, as shown in the diagram on the left (In-place TSM). This design degrades spatial feature learning, especially when a large portion of the channels is shifted, since spatial information needed by the current frame is lost once it is shifted to the neighboring frames.
    By instead placing the shift operation inside the residual branch of the residual block (Residual TSM), as shown in the diagram on the right, all of the original information remains accessible after the temporal shift through the identity mapping. This variant therefore benefits both from the shift operation, which helps temporal inference, and from the original unshifted information, which is useful for spatial inference.
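A sketch of how the residual placement might look in code is given below, reusing the temporal_shift function sketched earlier; the block layout and names are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ResidualTSMBlock(nn.Module):
    """Residual block with the temporal shift inside the residual branch.

    The identity path is untouched, so the original (unshifted) spatial
    features are always preserved; only the convolutional branch sees the
    shifted activations. An in-place variant would instead overwrite the
    block input itself with the shifted tensor, discarding spatial detail.
    """

    def __init__(self, channels: int, n_segments: int):
        super().__init__()
        self.n_segments = n_segments
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shifted = temporal_shift(x, self.n_segments)   # defined in the earlier sketch
        return torch.relu(x + self.branch(shifted))    # identity keeps original info
```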

    The figure above compares the accuracies of In-place TSM and Residual TSM on the Kinetics dataset across shift proportions. Residual TSM consistently achieves higher accuracy than In-place TSM, and this variant is therefore the one the authors select for further experiments.

Uni-directional TSM

    While TSM is well suited for online, real-time applications, one small change has to be made to the shift operation. Since future frames are not available in an online setting, only information from past frames can be mixed into the current frame. In this uni-directional TSM variant, channels are therefore shifted in only one temporal direction.

Online TSM Network Animation
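The sketch below illustrates the uni-directional idea for streaming inference: a slice of the previous frame's features is cached and spliced into the current frame, and nothing is taken from the future. The class name and the caching scheme are my own illustration of the idea, not the authors' exact online implementation.

```python
import torch

class OnlineTemporalShift:
    """Uni-directional temporal shift for online (frame-by-frame) inference."""

    def __init__(self, fold_div: int = 8):
        self.fold_div = fold_div
        self.cache = None                      # shifted-out slice of the previous frame

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # x: features of the current frame only, shape (1, C, H, W).
        fold = x.shape[1] // self.fold_div
        out = x.clone()
        if self.cache is None:
            out[:, :fold] = 0                  # first frame: zero padding
        else:
            out[:, :fold] = self.cache         # inject information from the past frame
        self.cache = x[:, :fold].detach()      # store the current slice for the next frame
        return out

shift = OnlineTemporalShift()
for _ in range(3):                             # simulate a short video stream
    frame_features = torch.randn(1, 64, 56, 56)
    _ = shift(frame_features)
```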

TSM Performance Analysis

TSM vs. 2D Baseline on various datasets 

    Here, we compare the performance of TSM against the 2D baseline model, TSN, on multiple action recognition datasets. On Kinetics, UCF101, and HMDB51, TSM consistently outperforms TSN in both Top-1 and Top-5 accuracy. Refer to the table above for the detailed scores.

    Next, we look at datasets that are more sensitive to temporal relationships. On Something-Something (V1, V2) and Jester, TSM performs significantly better than TSN in both Top-1 and Top-5 accuracy thanks to its temporal modeling ability.

TSM vs. state-of-the-art 3D Models on Something-Something dataset

    We also compare TSM against state-of-the-art 3D CNN models on the Something-Something V1 dataset. TSM (with a ResNet-50 backbone) has 48.6M parameters and achieves 49.7% Top-1 accuracy with just 98 GFLOPs (billions of floating-point operations).
    ECO is a state-of-the-art efficient video understanding framework with an early-2D, late-3D architecture and medium-level temporal fusion. ECO has 150M parameters (3.1× more) and requires 267 GFLOPs (2.7× more computation), yet it falls 3.3% short of TSM in Top-1 accuracy.
    I3D (with GCN augmentation) is the previous state-of-the-art model, which enables all-level temporal fusion. This framework has 62.2M parameters (1.3× more) and requires 606 GFLOPs (6.2× more computation). Yet again, TSM delivers 3.6% higher Top-1 accuracy than the I3D+GCN network.
    The table above summarizes the performance of TSM, ECO, and I3D+GCN.

GPU latency and throughput comparison

    TSM also exhibits lower GPU inference latency and higher throughput than ECO and I3D. Measured on an NVIDIA Tesla P100 GPU, TSM shows an inference latency of 17.4 ms and a throughput of 77.4 V/s (videos per second). Meanwhile, ECO has the highest latency at 30.6 ms, about 175% of TSM's, and a throughput of 45.6 V/s, about 59% of TSM's. I3D has a latency of 25.8 ms, about 150% of TSM's, and a throughput of 42.4 V/s, about 55% of TSM's. So while TSM has the highest Top-1 validation accuracy of the three, it still shows the lowest GPU latency and the highest video throughput.

Online Applications

    Online uni-directional TSM achieves performance comparable to the offline version. On Kinetics, UCF101, and HMDB51, which are less sensitive to temporal information, the online version performs on par, with a negligible accuracy difference. On Something-Something, the online version gives about 1% lower Top-1 validation accuracy, but still scores higher than the 2D and 3D models discussed above.
    Real-time online video understanding is an important application for many fields, such as self-driving vehicles and robotics. Due to TSM's low latency and high throughput, it can be injected into the 2D backbones of existing online models in exchange for higher accuracy. Since TSM is also a comparatively light model (with only 48.6M parameters), it is mobile-device friendly. The authors successfully deployed TSM on mobile devices, such as a Raspberry Pi and a Samsung Galaxy Note8, and obtained competitive results.

    In this blog, we took a journey through the evolution of video understanding models and dissected the Temporal Shift Module (TSM). TSM is a zero-cost module that can be inserted into 2D CNN backbone models to provide spatio-temporal inference. TSM outperforms state-of-the-art 2D and 3D CNN models on various benchmarks, and this efficient module can also be used in online settings or on edge devices for low-latency, real-time video understanding tasks.