Introduction to Video Object Detection


Deep learning has made significant progress in computer vision, and in recent years it has achieved high accuracy on image object detection. Video object detection, a similar but more challenging task, has also been investigated by many researchers and practitioners. Unlike still images, video carries richer contextual and temporal information, which makes the task more complicated. Directly applying state-of-the-art still-image object detection frameworks to video usually results in poor performance. In this blog post, I will introduce two papers and discuss how they address this problem.

Limitations of Still-image Object Detectors

Before we dive deep into these papers, we need to first understand what limitations still-image object detectors have and how we can leverage our knowledge to extend current frameworks to videos.

First, simply applying image detectors to videos introduces an unaffordable computational cost, because the model has to run on every frame.

Second, accuracy suffers from motion blur, video defocus, and rare poses, which appear in videos but are seldom observed in still images.

So, we want to either design a new framework or extend current frameworks for the video object detection task.

Problem Formulation

It is important to define the video object detection task precisely first. The ImageNet object detection from video (VID) dataset is the main benchmark used in most papers. It contains 30 classes, which are a subset of the classes in the object detection (DET) task. All classes are fully labeled in every frame of each video. The goal is to produce a set of annotations for each video so that each frame is annotated.

For each video clip, algorithms need to produce a set of annotations (f_i, c_i, s_i, b_i), where f_i is the frame index, c_i the class label, s_i the confidence score, and b_i the bounding box.
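As a minimal sketch, one annotation tuple can be held in a small record type. The field names, the string class label, and the (x_min, y_min, x_max, y_max) box layout are assumptions made for illustration; the task definition above only fixes the four components.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    frame_index: int   # f_i: index of the frame inside the video clip
    class_label: str   # c_i: one of the 30 VID classes
    score: float       # s_i: detection confidence
    box: Tuple[float, float, float, float]  # b_i: assumed (x_min, y_min, x_max, y_max)


def detections_for_frame(detections: List[Detection], frame_index: int) -> List[Detection]:
    """Return the detections of one frame, highest confidence first."""
    in_frame = [d for d in detections if d.frame_index == frame_index]
    return sorted(in_frame, key=lambda d: d.score, reverse=True)
```

A video-level result is then simply a list of `Detection` records, and per-frame views can be derived from it when needed.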

The evaluation metric is the same as for the still-image object detection task: the mean average precision (mAP) over all classes.
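To make the metric concrete, here is a simplified sketch of average precision for one class and its mean over classes. It assumes each detection has already been marked as a true or false positive (the box-overlap matching step is left out) and uses the non-interpolated form of AP; official evaluation toolkits may differ in both respects.

```python
from typing import Dict, List, Tuple


def average_precision(ranked_hits: List[Tuple[float, bool]], num_ground_truth: int) -> float:
    """Non-interpolated AP for one class.

    ranked_hits: (score, is_true_positive) for every detection of the class.
    num_ground_truth: number of ground-truth objects of the class.
    """
    if num_ground_truth == 0:
        return 0.0
    hits = sorted(ranked_hits, key=lambda h: h[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(hits, start=1):
        if is_tp:
            true_positives += 1
            # Precision measured at the rank of each true positive.
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth


def mean_average_precision(per_class: Dict[str, Tuple[List[Tuple[float, bool]], int]]) -> float:
    """Mean of the per-class AP values."""
    aps = [average_precision(hits, n_gt) for hits, n_gt in per_class.values()]
    return sum(aps) / len(aps)
```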


As mentioned above, still-image object detectors have limitations on videos, mainly because they do not incorporate temporal and contextual information. For example, the detection confidence of an object shouldn't change much between adjacent frames, yet a still-image detector scores every frame independently. Also, although still-image detectors do use image context, the information within a single frame is sometimes not enough to rule out false positives, such as background objects. The figure below shows these two scenarios in which still-image detectors don't work well on videos: the top one shows detection confidences that fluctuate a lot, and the bottom one shows some false positives.

(a) contains large temporal fluctuations and (b) generates false positives

To address these problems, the first paper proposes a framework called T-CNN, where T stands for tubelet, a sequence of bounding boxes. It incorporates temporal information by propagating detection results across adjacent frames locally and adjusting detection confidences globally. It also incorporates contextual information by decreasing the detection scores of low-confidence classes.

T-CNN framework

The framework consists of four components:

  1. Still-image Detection generates object region proposals in every frame of a video and assigns each region proposal an initial detection score.
  2. Multi-context Suppression and Motion-guided Propagation. Multi-context Suppression first sorts the detection scores in descending order; classes whose detection scores fall below a threshold are treated as low-confidence classes, and a fixed value is subtracted from their scores. Motion-guided Propagation then propagates detection results to adjacent frames according to the mean optical flow vectors, which describe how objects move between frames.
  3. Tubelet Re-scoring consists of three steps. First, a tracking algorithm produces long bounding box sequences: it starts from the most confident detections, tracks in both directions, and stops when the tracking confidence drops below a threshold. Then, for each tubelet box, it collects the still-image detections that overlap the box and keeps the one with the maximum detection score. Finally, it classifies tubelets into positive and negative samples and re-scores them to increase the score margin between the two groups.
  4. Model Combination combines different groups of proposals and different models to generate the final results.
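The second component can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: a class is treated as high-confidence if at least one of its detections in the video reaches the threshold, the `threshold` and `penalty` values are hypothetical parameters, and the optical flow is assumed to be given as a dense (H, W, 2) array.

```python
import numpy as np


def multi_context_suppression(scores, labels, threshold, penalty):
    """Lower the scores of detections whose class is low-confidence in this video.

    scores: (N,) detection scores collected over the whole video.
    labels: (N,) class id of each detection.
    Assumption: a class is high-confidence if at least one of its detections
    reaches `threshold`; every other class is treated as low-confidence.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    high_conf_classes = np.unique(labels[scores >= threshold])
    is_low_conf = ~np.isin(labels, high_conf_classes)
    return np.where(is_low_conf, scores - penalty, scores)


def propagate_box(box, flow):
    """Shift a box into the adjacent frame by the mean optical flow inside it.

    box: (x_min, y_min, x_max, y_max) in integer pixel coordinates, non-empty.
    flow: (H, W, 2) array holding (dx, dy) per pixel towards the adjacent frame.
    """
    x0, y0, x1, y1 = box
    mean_dx, mean_dy = flow[y0:y1, x0:x1].reshape(-1, 2).mean(axis=0)
    return (x0 + mean_dx, y0 + mean_dy, x1 + mean_dx, y1 + mean_dy)
```

Propagated boxes keep their original detection score in this sketch, so a confident detection in one frame can fill in for a missed detection in a neighbouring frame.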

By adding temporal and contextual information, this model improves the results by up to 6.7% compared to still-image detectors. It won the object detection from video (VID) task of the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2015.

Towards High Performance Video Object Detection

This paper extends the authors' previous work with three new techniques. In the figure below, (a) and (b) are the previous methods:

  1. Sparse Feature Propagation (a) applies the feature network only on sparse key frames (e.g. every 10 frames); the feature maps of non-key frames are propagated from the preceding key frame.
  2. Dense Feature Aggregation (b) applies the feature network on all frames and aggregates features propagated from nearby frames, so every frame is treated as a key frame.
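The control flow of sparse feature propagation can be sketched as below. `feature_net` and `propagate` are hypothetical callables standing in for the expensive feature network and the cheap flow-based propagation; only the fixed key frame interval comes from the description above.

```python
from typing import Any, Callable, List, Sequence


def sparse_feature_propagation(
    frames: Sequence[Any],
    feature_net: Callable[[Any], Any],
    propagate: Callable[[Any, Any, Any], Any],
    key_interval: int = 10,
) -> List[Any]:
    """Compute features on key frames only and propagate them to the other frames."""
    features: List[Any] = []
    key_index = 0
    key_features = None
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            # Key frame: run the expensive feature network.
            key_index, key_features = i, feature_net(frame)
            features.append(key_features)
        else:
            # Non-key frame: reuse the preceding key frame's features.
            features.append(propagate(key_features, frames[key_index], frame))
    return features
```

With a key frame every 10 frames, the feature network runs on one tenth of the frames, which is where the speedup comes from. Dense feature aggregation corresponds to treating every frame as a key frame, so it pays the full cost.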

In this paper, the first new technique, sparsely recursive feature aggregation, is a recursive version of (b). It evaluates the feature network and applies recursive feature aggregation only on sparse key frames. The aggregated key frame feature accumulates the rich information from all previous key frames and is then propagated to the next key frame. This retains the feature quality gained from aggregation while reducing the computational cost.
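One aggregation step might look like the sketch below. The weighting scheme is an assumption chosen for illustration (cosine similarity to the new key frame features followed by a softmax over the two sources); the post does not specify how the weights are obtained. The previous aggregated feature is assumed to be already warped to the new key frame.

```python
import numpy as np


def aggregate_key_features(prev_aggregated_warped, new_key_features, eps=1e-8):
    """Blend the warped previous aggregate with the new key frame features.

    Both inputs have shape (C, H, W). The output is a per-position weighted
    average whose two weights sum to one at every spatial location.
    """
    def cosine(a, b):
        num = (a * b).sum(axis=0)
        den = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + eps
        return num / den

    # Illustrative weighting: how similar each source is to the new features.
    sim_prev = cosine(prev_aggregated_warped, new_key_features)
    sim_new = np.ones_like(sim_prev)  # the new features match themselves
    logits = np.stack([sim_prev, sim_new])            # (2, H, W)
    weights = np.exp(logits - logits.max(axis=0))     # stable softmax
    weights /= weights.sum(axis=0)
    return weights[0] * prev_aggregated_warped + weights[1] * new_key_features
```

Calling this function once per key frame, each time feeding in the previous output after warping, gives the recursion: every aggregate carries information from all earlier key frames without revisiting them.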

The second technique, spatially-adaptive partial feature updating, recomputes features on non-key frames wherever the propagated features are of poor quality. The authors introduce a feature temporal consistency matrix, produced by a sibling branch on the flow network, to decide whether the propagated feature at each location is good enough.
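A minimal sketch of the masking logic, assuming the consistency matrix holds one value per spatial location in [0, 1] with higher meaning more reliable, and assuming a hypothetical `threshold`. For simplicity the recomputed features are passed in as a full array; a real implementation would only compute them where the mask is set.

```python
import numpy as np


def partially_update(propagated, recomputed, consistency, threshold=0.5):
    """Replace propagated features where temporal consistency is low.

    propagated, recomputed: (C, H, W) feature maps.
    consistency: (H, W) map, higher values mean the propagated feature is reliable.
    Returns the updated features and the boolean update mask.
    """
    needs_update = consistency < threshold            # (H, W)
    updated = np.where(needs_update[None], recomputed, propagated)
    return updated, needs_update
```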

Finally, they propose temporally-adaptive key frame scheduling, a new way to choose key frames. A naive approach is to pick key frames at a fixed rate, such as every 10 frames. Here, key frames are instead chosen adaptively according to the varying dynamics in the temporal domain. They design a feature consistency indicator, and a frame is chosen as a key frame if its appearance changes a lot compared to previous frames.
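One plausible form of such an indicator, sketched under assumptions: a frame becomes a key frame when the fraction of locations with low feature consistency exceeds a limit. Both thresholds are hypothetical parameters, and the consistency map is assumed to be the same kind of (H, W) array used for partial updating.

```python
import numpy as np


def is_key_frame(consistency, quality_threshold=0.5, area_threshold=0.2):
    """Decide whether a frame should be scheduled as a new key frame.

    consistency: (H, W) map, higher values mean the propagated feature is reliable.
    The frame is a key frame when too large a share of locations is unreliable.
    """
    low_quality_fraction = float((consistency < quality_threshold).mean())
    return low_quality_fraction > area_threshold
```

Compared with a fixed interval, this spends key frames on fast-changing parts of the video and skips them while the scene is stable.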

To quantify how each new technique contributes on top of their previous work, they also ran an ablation study and investigated the speed-accuracy trade-off.

  • (c1) compares sparsely recursive feature aggregation with dense aggregation. It achieves a 10x speedup with a 2% accuracy loss.
  • (c2) extends (c1) by adding partial feature updating. It improves the mAP score by almost 2% while keeping the same high speed.
  • (c3) further extends (c2) with temporally-adaptive key frame scheduling and improves mAP by a further 2% at all runtime speeds. It achieves the best speed-accuracy trade-off.


These two papers give a brief introduction to video object detection and show different approaches to the problem. What they have in common is that both exploit the contextual and temporal characteristics of videos and carefully design their models around them.

Recently, the deep learning community has proposed more and more approaches to this task and keeps improving the state-of-the-art performance.