Towards Good Practices for Very Deep Two-Stream ConvNets

Limin Wang, Yuanjun Xiong, Zhe Wang, and Yu Qiao

Introduction

We discuss practical strategies for training very deep ConvNets for video-based action recognition.
With these strategies we obtain state-of-the-art results on the UCF101 dataset. The models are released for non-commercial use.
A fork of the well-known Caffe framework with efficient multi-GPU training is also released to foster future research.
The technical report is available on arXiv.

Models

Please see the following link for the models.

Models and config files.

The modified pre-trained VGG-16 models are also provided:

Temporal Initialization Model, Spatial Initialization Model.

Results

Validation accuracy (%) on the UCF101 dataset

Validation Split	Spatial	Temporal	Combined
1	79.8	85.7	90.9
2	77.3	88.2	91.6
3	77.8	87.4	91.6
Average	78.4	87.0	91.4
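
The Combined column reports the accuracy obtained when the spatial and temporal streams are fused. This page does not state the exact fusion rule, so the following is only a minimal sketch of one common choice: a weighted average of per-class softmax scores from the two streams. The function names and the default weights (1.0 for spatial, 2.0 for temporal) are assumptions made for illustration, not values taken from the released code.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_stream(spatial_scores, temporal_scores,
                    spatial_weight=1.0, temporal_weight=2.0):
    """Weighted average of the two streams' class probabilities.

    The default weights are hypothetical; tune them on held-out data.
    """
    if len(spatial_scores) != len(temporal_scores):
        raise ValueError("both streams must score the same set of classes")
    p_spatial = softmax(spatial_scores)
    p_temporal = softmax(temporal_scores)
    weight_sum = spatial_weight + temporal_weight
    return [
        (spatial_weight * ps + temporal_weight * pt) / weight_sum
        for ps, pt in zip(p_spatial, p_temporal)
    ]


def predict(fused_scores):
    """Return the index of the highest fused score."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)


if __name__ == "__main__":
    # Toy scores for three classes: the spatial stream slightly prefers
    # class 0, while the temporal stream strongly prefers class 1.
    fused = fuse_two_stream([2.0, 1.0, 0.0], [0.0, 3.0, 0.0])
    print(predict(fused))
```

Averaging probabilities rather than raw scores keeps the two streams on a comparable scale before the weights are applied.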

Code

Github Repository

Please consult the README files in the repository for features and usage.

Optical Flow

Some users have reported a performance drop when using other video decoders or optical flow algorithms.
For reference, we provide the optical flow images we extracted from the UCF101 dataset.

UCF101 Optical Flow

You are advised to use the same tool to extract optical flow if you plan to directly use the released models.

Dense Optical Flow Extraction Tool
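
One plausible reason the extraction pipeline matters is that flow fields are commonly stored as 8-bit images, so the mapping from flow values to pixel intensities becomes part of the model's input distribution. The sketch below illustrates such a mapping under stated assumptions: a linear rescale of each flow component from [-bound, bound] to [0, 255]. The function names and the bound of 20 are hypothetical and are not taken from the extraction tool; check the tool's own documentation for the values it actually uses.

```python
def flow_to_uint8(value, bound=20.0):
    """Clip a flow component to [-bound, bound] and map it linearly to 0..255.

    `bound` is a hypothetical clipping threshold chosen for illustration.
    """
    clipped = max(-bound, min(bound, value))
    return int(round((clipped + bound) * 255.0 / (2.0 * bound)))


def uint8_to_flow(pixel, bound=20.0):
    """Approximate inverse of flow_to_uint8 (exact only up to quantization)."""
    return pixel * (2.0 * bound) / 255.0 - bound


if __name__ == "__main__":
    # Displacements beyond the bound saturate, and the round trip loses
    # a small amount of precision because of the 8-bit quantization.
    for displacement in (-35.0, -5.0, 5.0, 35.0):
        pixel = flow_to_uint8(displacement)
        print(displacement, pixel, uint8_to_flow(pixel))
```

If a model was trained on images produced with one bound or one flow algorithm, feeding it images produced with another changes the meaning of every pixel value, which is consistent with the advice above to reuse the same tool.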

Citation

@article{DBLP:journals/corr/WangXW015,
  author    = {Limin Wang and Yuanjun Xiong and Zhe Wang and Yu Qiao},
  title     = {Towards Good Practices for Very Deep Two-Stream ConvNets},
  journal   = {CoRR},
  volume    = {abs/1507.02159},
  year      = {2015},
  url       = {http://arxiv.org/abs/1507.02159},
}
