Kinetics Pretrained Models

Introduction

The Kinetics Human Action Video Dataset is a large-scale video action recognition dataset released by Google DeepMind. It contains around 300,000 trimmed human action videos from 400 action classes. This year (2017), it served as the trimmed video classification track of the ActivityNet challenge. During our participation in the challenge, we confirmed that our TSN framework, published in ECCV 2016, works smoothly on Kinetics. We also verified that TSN models learned on Kinetics provide excellent pretraining for other related tasks such as untrimmed video classification and temporal action detection (SSN in ICCV 2017).

Due to the huge volume of Kinetics, training action recognition models on it is computationally very demanding for academic groups. But we believe the benefit brought by Kinetics should not be limited to a few well-funded labs or companies with lots of GPUs. We therefore release our action recognition models trained with TSN on the Kinetics dataset. For reference, we also compare the performance of Kinetics and ImageNet pretrained models on two action understanding tasks: untrimmed video classification and temporal action detection using SSN.

Pretrained Models

The TSN pretrained models include one RGB model and one optical flow model for each CNN architecture. We provide pretrained models for two CNN architectures: BNInception, which is used in the original TSN paper, and Inception V3. These models are in the format of our modified Caffe.

The top-1 and top-5 accuracies are measured by extracting 25 snippets from each video (standard TSN testing protocol).
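
As a rough illustration of how such numbers can be computed, the sketch below averages per-snippet class scores into one video-level score and then checks whether the true label is among the k best classes. It assumes scores are plain Python lists; the function names `video_scores` and `topk_accuracy` are hypothetical and are not part of the TSN code base, and the toy scores are made up.

```python
def video_scores(snippet_scores):
    """Average per-snippet class scores (one list per snippet) into one video-level list."""
    num_snippets = len(snippet_scores)
    num_classes = len(snippet_scores[0])
    return [sum(s[c] for s in snippet_scores) / num_snippets for c in range(num_classes)]


def topk_accuracy(all_scores, labels, k):
    """Fraction of videos whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, label in zip(all_scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy data: 2 videos, 3 snippets each, 4 classes.
# The protocol above uses 25 snippets per video and Kinetics has 400 classes.
video_a = [[0.1, 0.6, 0.2, 0.1], [0.2, 0.5, 0.2, 0.1], [0.3, 0.4, 0.2, 0.1]]
video_b = [[0.4, 0.1, 0.3, 0.2], [0.5, 0.1, 0.3, 0.1], [0.3, 0.1, 0.4, 0.2]]
scores = [video_scores(video_a), video_scores(video_b)]
labels = [1, 2]

top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
print(top1, top2)
```

In this toy case the second video is misclassified at top-1 but recovered at top-2, which is why top-5 numbers are always at least as high as top-1 numbers.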

Model	Modality	Download Link	Kinetics Val Top-1	Kinetics Val Top-5
BNInception	RGB	BN RGB Pretrained Weights	69.1%	88.7%
BNInception	Optical Flow	BN Flow Pretrained Weights	62.1%	83.9%
BNInception	RGB+Flow (1:1)	-	73.9%	91.1%

Inception V3	RGB	V3 RGB Pretrained Weights	72.5%	90.2%
Inception V3	Optical Flow	V3 Flow Pretrained Weights	62.8%	84.2%
Inception V3	RGB+Flow (1:1)	-	76.6%	92.4%
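
The RGB+Flow rows combine the two streams with equal (1:1) weights. A minimal sketch of such weighted late fusion is given below, assuming each stream produces a video-level score list over the same classes. The helper `fuse_scores` is hypothetical, the example scores are made up, and whether the scores are normalized before fusion is not specified on this page.

```python
def fuse_scores(rgb_scores, flow_scores, rgb_weight=1.0, flow_weight=1.0):
    """Weighted average of two per-class score lists; the defaults give 1:1 fusion."""
    if len(rgb_scores) != len(flow_scores):
        raise ValueError("both streams must score the same set of classes")
    total = rgb_weight + flow_weight
    return [
        (rgb_weight * r + flow_weight * f) / total
        for r, f in zip(rgb_scores, flow_scores)
    ]


# Toy scores over 3 classes: the RGB stream prefers class 0, the flow stream class 1.
rgb = [0.5, 0.3, 0.2]
flow = [0.1, 0.7, 0.2]

fused = fuse_scores(rgb, flow)
predicted = max(range(len(fused)), key=lambda c: fused[c])
print(fused, predicted)
```

Other weightings can be tried by changing `rgb_weight` and `flow_weight`; the table above reports only the 1:1 setting.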

Transfer Learning Results

We conducted experiments on untrimmed video classification and temporal action detection using the pretrained models above. The performance is compared against that of models with only ImageNet pretraining.

Trimmed Video Classification (UCF101)

We finetune the Kinetics pretrained models for trimmed video classification on the UCF101 dataset, using the TSN framework. During testing, we follow the standard TSN protocol and extract only 25 snippets from each video so that the results are comparable. We report the classification accuracy averaged over the 3 splits of UCF101.
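
The evaluation bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not the TSN test script: it assumes the 25 snippets are placed at evenly spaced positions across the video, and the helpers `snippet_starts` and `mean_accuracy` are hypothetical names. The split accuracies in the example are made-up numbers, not results.

```python
def snippet_starts(num_frames, num_snippets=25, snippet_len=1):
    """Evenly spaced 0-based start frames for a fixed number of snippets.

    snippet_len is the number of frames a snippet spans (1 for a single RGB
    frame, larger for a stack of optical flow fields). For very short videos
    the same start frame can appear more than once.
    """
    last_start = max(num_frames - snippet_len, 0)
    if num_snippets == 1:
        return [0]
    return [round(i * last_start / (num_snippets - 1)) for i in range(num_snippets)]


def mean_accuracy(split_accuracies):
    """Average the accuracy obtained on each train/test split."""
    return sum(split_accuracies) / len(split_accuracies)


starts = snippet_starts(num_frames=250)
print(starts[:3], starts[-1])

# Made-up accuracies for three splits, used only to show the averaging step.
average = mean_accuracy([0.5, 0.75, 1.0])
print(average)
```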

Model	Pretraining	RGB	Flow	RGB+Flow
BNInception	ImageNet only	85.4%	89.4%	94.9%
BNInception	ImageNet + Kinetics	91.1%	95.2%	97.0%

Inception V3	ImageNet only	-	-	-
Inception V3	ImageNet + Kinetics	93.2%	95.3%	97.3%

Untrimmed Video Classification

Untrimmed video classification is the task of classifying long, untrimmed videos collected from the Internet. For this task we use our challenge-winning framework [1] from ActivityNet 2016. The performance is measured by top-1 classification accuracy.

Pretraining	ActivityNet v1.3 Val Top-1	ActivityNet v1.3 Test Top-1
ImageNet only	85.4%	88.7%
ImageNet + Kinetics	88.3%	90.2%

Temporal Action Detection

Temporal action detection aims to detect and annotate human action instances in untrimmed videos. We use our SSN framework [2], published in ICCV 2017, for this task. It outputs the starting time, ending time, and category of each action instance. The performance is measured by mean average precision (mAP). We use the BNInception CNN architecture here.

Pretraining	ActivityNet v1.2 Val Average mAP (0.5:0.05:0.95)	THUMOS14 Test mAP@0.5
ImageNet only	26.81%	26.73%
ImageNet + Kinetics	28.57%	31.90%
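
For readers unfamiliar with the metrics in the table header, the sketch below shows the two ingredients: the temporal intersection-over-union (tIoU) between a predicted and a ground-truth segment, and the threshold grid 0.5:0.05:0.95 over which the average mAP is taken. It is a simplified illustration; the function names are hypothetical, and a full mAP computation (ranking detections and matching them to ground truth per class) is omitted.

```python
def temporal_iou(pred, gt):
    """tIoU of two (start, end) segments given in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


# The grid 0.5:0.05:0.95 means thresholds 0.5, 0.55, ..., 0.95 (10 values).
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def average_map(map_per_threshold):
    """Average the mAP values obtained at each tIoU threshold."""
    return sum(map_per_threshold) / len(map_per_threshold)


# A prediction of 2s-8s against a ground truth of 4s-10s overlaps for 4s
# out of a combined span of 8s.
iou = temporal_iou((2.0, 8.0), (4.0, 10.0))
print(iou, thresholds)
```

A detection counts as correct at a given threshold only if its tIoU with a ground-truth instance of the same class reaches that threshold, so mAP@0.5 on THUMOS14 uses the single threshold 0.5, while the ActivityNet number averages over all ten.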

[1] "CUHK & ETHZ & SIAT submission to ActivityNet challenge 2016", Yuanjun Xiong, et al., arXiv:1608.00797.
[2] "Temporal Action Detection with Structured Segment Networks", Yue Zhao, et al., ICCV 2017.

Citation

Please cite the following paper if you use the provided models in your research.

@inproceedings{TSN2016ECCV,
  author    = {Limin Wang and Yuanjun Xiong and Zhe Wang
               and Yu Qiao and Dahua Lin and Xiaoou Tang and Luc {Van Gool}},
  title     = {Temporal Segment Networks: Towards Good Practices for Deep Action Recognition},
  booktitle = {ECCV},
  year      = {2016},
}

Related Projects

Temporal Segment Networks

The framework used to train the provided models.

[Github Link] [ECCV Paper]

Temporal Action Detection with Structured Segment Networks

The state-of-the-art temporal action detection framework proposed in ICCV 2017.

[Github Link] [ICCV Paper]

CES-STAR@ActivityNet 2016

We secured first place in the untrimmed video classification task of the ActivityNet Large Scale Action Recognition Challenge 2016, held in conjunction with CVPR'16. The method and models of our submissions are released for research use.

[Github Link] [Notebook Paper]

Caffe

Our modified version of the Caffe toolbox, featuring MPI-based parallel training and video I/O support. We also introduced the cross-modality training of optical flow networks in this work.

[Github Link] [Tech Report]

Enhanced MV for Real-Time Action Recognition

Enhanced MV-CNN is a real-time action recognition algorithm. It uses motion vectors to achieve real-time processing speed and knowledge transfer techniques to improve recognition performance.

[CVPR16 Paper] [Project Page]

Trajectory-Pooled Deep Descriptors (TDD)

The state-of-the-art approach for action recognition before TSN.

[CVPR15 Paper] [Github Link]

Contact

For questions and inquiries, please contact:
Yuanjun Xiong: bitxiong@gmail.com
Yue Zhao: thuzhaoyue@gmail.com

TSN Pretrained Models on Kinetics Dataset

Multimedia Laboratory, CUHK