TSN Pretrained Models on Kinetics Dataset

Multimedia Laboratory, CUHK


Kinetics Human Action Video Dataset is a large-scale video action recognition dataset released by Google DeepMind. It contains around 300,000 trimmed human action videos covering 400 action classes. This year (2017), it served as the basis of the trimmed video classification track in the ActivityNet challenge. During our participation in the challenge, we confirmed that our TSN framework, published in ECCV 2016, works smoothly on Kinetics. We also verified that TSN models learned on Kinetics provide excellent pretraining for related tasks such as untrimmed video classification and temporal action detection (SSN, ICCV 2017).

Due to the huge volume of Kinetics, training action recognition models on it is computationally intensive for academia. We believe the benefits of Kinetics should not be limited to a few well-resourced labs and companies with many GPUs. We therefore release our action recognition models trained with TSN on the Kinetics dataset. For reference, we also list a performance comparison of Kinetics-pretrained and ImageNet-pretrained models on two action understanding tasks, i.e. untrimmed video classification and temporal action detection using SSN.

Pretrained Models

The TSN pretrained models include one RGB model and one optical flow model for each CNN architecture. We provide pretrained models for two CNN architectures: BNInception, which is used in the original TSN paper, and Inception V3. These models are stored in our modified Caffe format.

The top-1 and top-5 accuracies are measured by extracting 25 snippets from each video (standard TSN testing protocol).
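The 25 snippets are sampled uniformly along the video. A minimal sketch of the index selection (the function name and the exact frame offsets are our assumptions; the released testing scripts define the precise procedure):

```python
def snippet_indices(num_frames, num_snippets=25):
    """Pick one frame index per snippet, uniformly spaced (TSN test protocol).

    Assumed behavior: center of each of `num_snippets` equal segments;
    short videos pad with the last frame.
    """
    if num_frames <= num_snippets:
        return list(range(num_frames)) + [num_frames - 1] * (num_snippets - num_frames)
    step = num_frames / num_snippets
    # Take the (approximate) center frame of each segment.
    return [int(step / 2 + step * i) for i in range(num_snippets)]
```

Each selected snippet is fed through the network, and the per-snippet class scores are averaged to produce the video-level prediction.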

Model         Modality        Download Link                Kinetics Val Top-1   Kinetics Val Top-5
BNInception   RGB             BN RGB Pretrained Weights    69.1%                88.7%
BNInception   Optical Flow    BN Flow Pretrained Weights   62.1%                83.9%
BNInception   RGB+Flow (1:1)  -                            73.9%                91.1%
Inception V3  RGB             V3 RGB Pretrained Weights    72.5%                90.2%
Inception V3  Optical Flow    V3 Flow Pretrained Weights   62.8%                84.2%
Inception V3  RGB+Flow (1:1)  -                            76.6%                92.4%
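The RGB+Flow (1:1) rows are obtained by late fusion: the snippet-averaged class scores of the two streams are combined with equal weights. A minimal sketch, assuming score arrays of shape (num_snippets, num_classes); the function and argument names are ours:

```python
import numpy as np

def fuse_scores(rgb_scores, flow_scores, weights=(1.0, 1.0)):
    """Late fusion of two-stream predictions.

    Average snippet-level class scores within each modality, then combine
    the modality-level scores with the given weights (1:1 in the tables above).
    Returns the index of the predicted class.
    """
    rgb = np.mean(rgb_scores, axis=0)
    flow = np.mean(flow_scores, axis=0)
    fused = weights[0] * rgb + weights[1] * flow
    return int(np.argmax(fused))
```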

Transfer Learning Results

We conducted experiments on trimmed video classification, untrimmed video classification, and temporal action detection using the pretrained models above. The performance is compared against models with only ImageNet pretraining.
Trimmed Video Classification (UCF101)
We finetune the Kinetics pretrained models for trimmed video classification on the UCF101 dataset, using the TSN framework. During testing, we follow the standard TSN protocol of extracting 25 snippets from each video so the results remain comparable. We report the average classification accuracy over the 3 splits of UCF101.
Model         Pretraining          RGB     Flow    RGB+Flow
BNInception   ImageNet only        85.4%   89.4%   94.9%
BNInception   ImageNet + Kinetics  91.1%   95.2%   97.0%
Inception V3  ImageNet only        -       -       -
Inception V3  ImageNet + Kinetics  93.2%   95.3%   97.3%
Untrimmed Video Classification
Untrimmed video classification is the task of classifying long, untrimmed videos collected from the Internet. For this task we use our winning framework [1] from the ActivityNet 2016 challenge. The performance is measured by top-1 classification accuracy.
Pretraining          ActivityNet v1.3 Val Top-1   ActivityNet v1.3 Test Top-1
ImageNet only        85.4%                        88.7%
ImageNet + Kinetics  88.3%                        90.2%
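For untrimmed videos, snippet scores must be aggregated over a much longer timeline than in the trimmed setting. Top-k pooling is one common aggregation strategy for this; the sketch below is our illustration, not necessarily the exact scheme used in [1]:

```python
import numpy as np

def topk_pool(snippet_scores, k=5):
    """Aggregate snippet-level class scores over an untrimmed video.

    For each class, average its k highest snippet responses. This keeps
    the prediction focused on the most action-relevant parts of a long
    video instead of diluting them with background snippets.
    `snippet_scores` has shape (num_snippets, num_classes).
    """
    top_k = np.sort(snippet_scores, axis=0)[-k:]
    return top_k.mean(axis=0)
```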
Temporal Action Detection
Temporal action detection aims to detect and annotate human action instances in untrimmed videos. We use our SSN framework [2] from ICCV 2017 for this task. It outputs the starting time, ending time, and category of each action instance. The performance is measured by mean average precision (mAP). We use the BNInception architecture here.
Pretraining          ActivityNet v1.2 Val Average mAP (0.5:0.05:0.95)   THUMOS14 Test mAP@0.5
ImageNet only        26.81%                                             26.73%
ImageNet + Kinetics  28.57%                                             31.90%
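Detection mAP counts a predicted instance as correct when its temporal IoU with a ground-truth instance exceeds the threshold (0.5 for THUMOS14 above; averaged over thresholds 0.5:0.05:0.95 for ActivityNet). A minimal temporal IoU helper, with intervals given as (start, end) pairs in seconds:

```python
def temporal_iou(a, b):
    """Intersection-over-union of two 1-D time intervals (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```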
[1] "CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016", Yuanjun Xiong, et al., arXiv:1608.00797.
[2] "Temporal Action Detection with Structured Segment Networks", Yue Zhao, et al., ICCV 2017.


Please cite the following paper if you use the provided models in your research.
@inproceedings{wang2016temporal,
  author    = {Limin Wang and Yuanjun Xiong and Zhe Wang
               and Yu Qiao and Dahua Lin and Xiaoou Tang and Luc {Van Gool}},
  title     = {Temporal Segment Networks: Towards Good Practices for Deep Action Recognition},
  booktitle = {ECCV},
  year      = {2016},
}

Related Projects