Introduction to Video Object Detection


Deep learning has made significant progress in computer vision and, in recent years, has achieved high accuracy on the image object detection task. Video object detection, a similar but more challenging task, has also been proposed and investigated by many researchers and practitioners. Unlike still images, however, video is more complicated because it carries richer contextual and temporal information, and directly applying state-of-the-art still-image object detection frameworks usually results in poor performance. In this blog post, I will introduce two papers and discuss how they address this problem.

  1. T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos
  2. Towards High Performance Video Object Detection

Limitations of Still-image Object Detector

First, simply applying image detectors to videos introduces an unaffordable computational cost, since the model must be run on every single frame of the video.

Accuracy also suffers from motion blur, video defocus, and rare poses that appear in videos but are seldom observed in still images.

We therefore want to design a new framework, or extend existing ones, for the video object detection task.

Problem Formulation

For each video clip, algorithms need to produce a set of annotations (f_i, c_i, s_i, b_i), where f_i is the frame index, c_i the class label, s_i the confidence score, and b_i the bounding box.

The evaluation metric is the same as the object detection task, where the mean average precision (mAP) on all classes is used.
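As a concrete anchor for this formulation, here is a minimal sketch of one annotation tuple and of the intersection-over-union (IoU) overlap used to match detections to ground truth when computing mAP; the names and values are illustrative, not from either paper:

```python
# One annotation: (frame index f_i, class label c_i, score s_i, box b_i).
detection = (17, "car", 0.92, (48, 30, 120, 96))

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes; a detection
    # counts as correct when its IoU with a ground-truth box exceeds a
    # threshold (commonly 0.5).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union
```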


T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos

[Figure: failure cases of a still-image detector applied to video. In (a), the detection scores contain large temporal fluctuations; in (b), the detector generates false positives.]

To address these problems, this paper proposes a framework called T-CNN, where T stands for tubelet, a sequence of bounding boxes. It incorporates temporal information by propagating detection results across adjacent frames locally and adjusting detection confidences globally. It also incorporates contextual information by decreasing detection scores of low-confidence classes.

T-CNN framework

The framework consists of four components:

  1. Still-image Detection generates object region proposals in all the frames in a video and assigns each region proposal with an initial detection score.
  2. Multi-context Suppression first sorts detection scores within a video in descending order; classes whose scores fall below a threshold are treated as low-confidence classes, and their detection scores are reduced by a fixed amount. Then, Motion-guided Propagation propagates detection results to adjacent frames according to the mean optical flow vectors, which describe how objects move between frames.
  3. Tubelet Re-scoring consists of three steps. First, a tracking algorithm produces long bounding box sequences: tracking starts from the most confident detections, proceeds bidirectionally, and stops once the tracking confidence drops below a threshold. Second, for each tubelet box, still-image detections that overlap the box are collected and the one with the maximum detection score is kept. Third, tubelets are classified into positive and negative samples and re-scored to increase the score margins between them.
  4. Model Combination combines different groups of proposals and different models to generate the final results.
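To make step 2 concrete, here is a toy sketch of multi-context suppression and motion-guided propagation, with detections stored as (frame, class, score, box) tuples; the threshold, penalty, and one-frame propagation window are illustrative choices of mine, not the paper's settings:

```python
def multi_context_suppression(dets, high_conf=0.8, penalty=0.4):
    # Classes that never reach a high-confidence detection anywhere in
    # the clip are treated as unlikely for this video; their scores are
    # reduced by a fixed penalty (values here are illustrative).
    confident = {c for _, c, s, _ in dets if s >= high_conf}
    return [(f, c, s if c in confident else s - penalty, b)
            for f, c, s, b in dets]

def motion_guided_propagation(dets, mean_flow):
    # Copy each detection one frame forward, shifting its box by the
    # mean optical flow (dx, dy) of that frame; mean_flow[f] maps
    # frame f to frame f + 1.
    out = list(dets)
    for f, c, s, (x1, y1, x2, y2) in dets:
        if f in mean_flow:
            dx, dy = mean_flow[f]
            out.append((f + 1, c, s, (x1 + dx, y1 + dy, x2 + dx, y2 + dy)))
    return out
```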

By adding temporal and contextual information, this model improves the results by up to 6.7% compared to still-image detectors. It won the object-detection-from-video (VID) task of the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2015.

Towards High Performance Video Object Detection

The second paper builds on two existing feature-level techniques from the authors' prior work:

  1. Sparse Feature Propagation applies the feature network only on sparse key frames (e.g. every 10 frames); feature maps on non-key frames are propagated from the preceding key frame via optical flow.
  2. Dense Feature Aggregation applies the feature network to all frames and aggregates features across neighboring frames, so every frame is treated as a key frame.
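Sparse Feature Propagation (technique 1) can be sketched in a few lines; feature_net, flow_net, and task_net are placeholder callables, and warp is an identity stub standing in for flow-guided bilinear warping:

```python
def warp(feature, flow):
    # Identity stub; a real implementation bilinearly samples `feature`
    # at positions displaced by the flow field.
    return feature

def detect_video(frames, feature_net, flow_net, task_net, key_interval=10):
    # The expensive feature network runs only on key frames; every other
    # frame reuses the preceding key frame's feature map, warped by the
    # much cheaper flow network.
    results, key_frame, key_feat = [], None, None
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            key_frame, key_feat = frame, feature_net(frame)
            feat = key_feat
        else:
            feat = warp(key_feat, flow_net(key_frame, frame))
        results.append(task_net(feat))
    return results
```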

The first new technique, Sparsely Recursive Feature Aggregation, is a recursive version of Dense Feature Aggregation (the second technique above): it evaluates the feature network and applies recursive feature aggregation only on sparse key frames. The aggregated key frame feature accumulates the rich information from all previous key frames and is then propagated to the next key frame. This retains the feature quality gained from aggregation while reducing the computational cost.
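Under the same placeholder conventions, sparsely recursive feature aggregation reduces to a flow-warped running blend over key frames; feature_net, flow_net, and weight_net are placeholders of my own, with weight_net standing in for the paper's learned position-wise weights:

```python
def warp(feature, flow):
    # Identity stub standing in for flow-guided bilinear warping.
    return feature

def aggregate_key_frame(prev_agg, prev_key, cur_key,
                        feature_net, flow_net, weight_net):
    # Warp the running aggregate of all past key frames to the current
    # key frame, then blend it with the freshly computed feature map;
    # the result becomes the history passed on to the next key frame.
    cur_feat = feature_net(cur_key)
    if prev_agg is None:
        return cur_feat                      # first key frame: no history yet
    warped = warp(prev_agg, flow_net(prev_key, cur_key))
    w = weight_net(warped, cur_feat)         # blend weight in [0, 1]
    return w * warped + (1.0 - w) * cur_feat
```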

The second technique, Spatially-adaptive Partial Feature Updating, recomputes features on non-key frames wherever the propagated features are of poor quality. The authors introduce a feature temporal consistency matrix, produced by a sibling branch on the flow network, to decide whether each propagated feature location is reliable.
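A numpy sketch of the per-location update decision; for simplicity the fresh feature map is computed everywhere and a boolean mask selects between the two, whereas the actual method recomputes only the unreliable locations:

```python
import numpy as np

def partial_update(propagated, frame, consistency, feature_net, thresh=0.5):
    # Where the temporal consistency map is low, propagation from the
    # key frame is deemed unreliable and the feature is recomputed;
    # elsewhere the cheap propagated feature is kept.
    fresh = feature_net(frame)           # recomputed feature map
    unreliable = consistency < thresh    # positions where propagation failed
    return np.where(unreliable, fresh, propagated)
```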

Finally, they propose Temporally-adaptive Key Frame Scheduling, a new way to choose key frames. A naive policy picks a key frame at a fixed rate, such as every 10 frames. Instead, key frames are chosen adaptively according to the varying dynamics of the video: using a feature consistency indicator, a frame becomes a key frame when its appearance has changed substantially since the previous key frame.
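The scheduling decision itself reduces to a small area test, sketched here with illustrative thresholds; the consistency values would come from the same sibling branch of the flow network mentioned earlier:

```python
import numpy as np

def is_key_frame(consistency_map, low=0.5, area_thresh=0.2):
    # A frame becomes a new key frame when a large fraction of its
    # feature map is inconsistent with the previous key frame, i.e.
    # its appearance has changed substantially.
    changed_fraction = float((consistency_map < low).mean())
    return changed_fraction > area_thresh
```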

To quantify how much each new technique contributes over their previous work, they also ran an ablation study and investigated the speed-accuracy trade-off.

  • (c1) compares sparsely recursive feature aggregation with dense aggregation: it achieves a 10x speedup with about a 2% accuracy loss.
  • (c2) extends (c1) by adding partial feature updating, which improves the mAP score by almost 2% while keeping the same high speed.
  • (c3) further extends (c2) with temporally-adaptive key frame scheduling, improving mAP by another 2% at all runtime speeds. It achieves the best speed-accuracy trade-off.


Recently, the deep learning community has proposed more and more approaches to this task and keeps improving the state-of-the-art performance.