Divide and Conquer for Single-frame Temporal Action Localization


Chen Ju¹, Peisen Zhao², Siheng Chen¹, Ya Zhang¹, Yanfeng Wang¹, Qi Tian²

¹CMIC, Shanghai Jiao Tong University
²PanGu, Huawei Cloud

ICCV 2021



Paper

Bibtex


Abstract

Single-frame temporal action localization (STAL) aims to localize actions in untrimmed videos with only one timestamp annotation per action instance. Existing methods adopt a one-stage framework that couples the counting goal and the localization goal. This paper proposes a novel two-stage framework for the STAL task in the spirit of divide and conquer. The instance counting stage leverages location supervision to determine the number of action instances and divide a whole video into multiple video clips, so that each clip contains exactly one complete action instance; the location estimation stage then leverages category supervision to localize the action instance within each clip. To efficiently represent the action instance in each clip, we introduce a proposal-based representation and design a novel differentiable mask generator that enables end-to-end training supervised by category labels. On the THUMOS14, GTEA, and BEOID datasets, our method outperforms state-of-the-art methods by 3.5%, 2.7%, and 4.8% average mAP, respectively. Extensive experiments further verify the effectiveness of our method.
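To make the differentiable mask generator concrete, below is a minimal sketch of one common way to build a soft temporal mask from continuous proposal parameters (center, width), using the product of two sigmoids so that gradients from a category-level classification loss can flow back to the proposal. This is an illustrative assumption, not necessarily the paper's exact parameterization; the names soft_temporal_mask and sharpness are hypothetical.

```python
import torch

def soft_temporal_mask(center, width, T, sharpness=10.0):
    """Differentiable mask over T snippets for a proposal (center, width).

    A minimal sketch: the mask is the product of a rising and a falling
    sigmoid, so it is smooth in (center, width) and gradients flow back
    to the proposal parameters. `sharpness` (hypothetical) controls how
    close the soft mask is to a hard 0/1 window.
    """
    t = torch.arange(T, dtype=torch.float32) / T              # normalized time axis in [0, 1)
    left = torch.sigmoid(sharpness * (t - (center - width / 2)))   # rises at the left boundary
    right = torch.sigmoid(sharpness * ((center + width / 2) - t))  # falls at the right boundary
    return left * right                                       # soft mask, shape (T,)

# Usage sketch: pool snippet features under the mask, then classify the clip.
T = 100
feats = torch.randn(T, 2048)                                  # snippet-level features (assumed)
center = torch.tensor(0.4, requires_grad=True)                # learnable proposal center
width = torch.tensor(0.2, requires_grad=True)                 # learnable proposal width
mask = soft_temporal_mask(center, width, T)
clip_feat = (mask.unsqueeze(1) * feats).sum(0) / mask.sum()   # mask-weighted average pooling
```

Because the mask is smooth in (center, width), a classifier trained on the pooled feature with only category labels can, in principle, refine the proposal boundaries end-to-end, which is the role the differentiable mask generator plays in the second stage.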


Visualization Demo for Action: Long Jump



Visualization Demo for Action: Clean And Jerk



Visualization Demo for Action: Volleyball Spiking



Visualization Demo for Action: Diving



Acknowledgements

This work is supported by the National Key Research and Development Program of China (No. 2019YFB1804304), the National Natural Science Foundation of China (No. 61771306), SHEITC (No. 2018-RGZN-02046), the 111 Plan (No. BP0719010), STCSM (No. 18DZ2270700), and the State Key Laboratory of UHD Video and Audio Production and Presentation.