Adaptive Mutual Supervision for Weakly-Supervised Temporal Action Localization


Chen Ju¹
Peisen Zhao²
Siheng Chen¹
Ya Zhang¹
Yanfeng Wang¹
Xiaoyun Zhang¹
Qi Tian²

¹CMIC, Shanghai Jiao Tong University
²PanGu, Huawei Cloud

IEEE Transactions on Multimedia



Paper

Bibtex


Abstract

Weakly-supervised temporal action localization aims to localize actions in long untrimmed videos using only video-level category labels. Most previous methods ignore the incompleteness issue of Class Activation Sequences (CAS) and therefore suffer from trivial detection results. To tackle this issue, we propose a novel Adaptive Mutual Supervision (AMS) framework with two branches: the base branch detects the most discriminative action regions, while the supplementary branch localizes the less discriminative action regions through an adaptive sampler. The sampler dynamically updates the inputs of the supplementary branch using a sampling weight sequence that is negatively correlated with the CAS from the base branch, thus encouraging the supplementary branch to localize the action regions underestimated by the base branch. To promote mutual enhancement between the two branches, we further construct mutual location supervision: each branch adopts the location pseudo-labels generated by the other branch as its localization supervision. By alternately optimizing the two branches for multiple iterations, we progressively complete the action regions. Extensive experiments on THUMOS14 and ActivityNet1.2 demonstrate that the proposed AMS method significantly outperforms state-of-the-art methods.
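
To make the adaptive sampler and the location pseudo-labels more concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract only states that the sampling weights are negatively correlated with the base-branch CAS; the min-max inversion used here, the fixed threshold for pseudo-labels, and all function names (`sampling_weights`, `sample_snippets`, `pseudo_labels`) are illustrative assumptions.

```python
import random


def sampling_weights(cas, eps=1e-8):
    """Map a per-snippet CAS to sampling weights that decrease as CAS increases.

    Assumption: weights are proportional to (1 - min-max normalized CAS).
    The paper may use a different negatively correlated mapping.
    """
    lo, hi = min(cas), max(cas)
    span = hi - lo
    if span < eps:
        # A flat CAS carries no preference, so fall back to uniform weights.
        return [1.0 / len(cas)] * len(cas)
    inverted = [1.0 - (s - lo) / span + eps for s in cas]
    total = sum(inverted)
    return [v / total for v in inverted]


def sample_snippets(weights, k, seed=0):
    """Draw k snippet indices with replacement, then sort to keep temporal order."""
    rng = random.Random(seed)
    indices = rng.choices(range(len(weights)), weights=weights, k=k)
    return sorted(indices)


def pseudo_labels(cas, threshold=0.5):
    """Binarize a CAS into location pseudo-labels (hypothetical fixed threshold)."""
    return [1 if s >= threshold else 0 for s in cas]


# Toy base-branch CAS for one action class over five snippets.
base_cas = [0.9, 0.8, 0.3, 0.1, 0.2]
weights = sampling_weights(base_cas)
supp_inputs = sample_snippets(weights, k=8)   # indices fed to the supplementary branch
labels_for_supp = pseudo_labels(base_cas)     # supervision passed to the other branch
```

In this toy case, the snippets with the lowest base-branch scores receive the largest sampling weights, so the supplementary branch mostly sees regions the base branch underestimates. In the mutual scheme described above, the same binarization would be applied to the supplementary branch's CAS to supervise the base branch in the next iteration.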


Visualization Demo for Action: High Jump



Visualization Demo for Action: Tennis Swing



Visualization Demo for Action: Basketball Dunk



Acknowledgements

This work is supported by the National Key Research and Development Program of China (No. 2020YFB1406801), the 111 Plan (No. BP0719010), STCSM (No. 18DZ2270700), and the State Key Laboratory of UHD Video and Audio Production and Presentation.