Abstract

Weakly-supervised temporal action localization aims to localize actions in long untrimmed videos using only video-level category labels. Most previous methods ignore the incompleteness of Class Activation Sequences (CAS) and consequently produce trivial detection results. To tackle this issue, we propose a novel Adaptive Mutual Supervision (AMS) framework with two branches: the base branch detects the most discriminative action regions, while the supplementary branch localizes the less discriminative action regions through an adaptive sampler. The sampler dynamically updates the inputs of the supplementary branch using a sampling weight sequence that is negatively correlated with the CAS from the base branch, thus encouraging the supplementary branch to localize the action regions underestimated by the base branch. To promote mutual enhancement between the two branches, we further construct mutual location supervision, in which each branch adopts the location pseudo-labels generated by the other branch as its localization supervision. By alternately optimizing the two branches over multiple iterations, we progressively complete the action regions. Extensive experiments on THUMOS14 and ActivityNet1.2 demonstrate that the proposed AMS method significantly outperforms state-of-the-art methods.
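The sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the AMS implementation: the min-max inversion used to obtain weights, the deterministic inverse-CDF resampling, the fixed threshold for pseudo-labels, and all function names (`sampling_weights`, `resample_indices`, `pseudo_labels`) are assumptions made here for illustration. The only property taken from the abstract is that the sampling weights are negatively correlated with the base-branch CAS, and that each branch can turn its CAS into location pseudo-labels for the other.

```python
from bisect import bisect_left
from itertools import accumulate


def sampling_weights(cas, eps=1e-8):
    """Map base-branch CAS scores to weights that shrink as the score grows."""
    lo, hi = min(cas), max(cas)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat CAS
    # Assumption: min-max normalise and invert; the paper may use another mapping.
    inverted = [1.0 - (s - lo) / span + eps for s in cas]
    total = sum(inverted)
    return [w / total for w in inverted]


def resample_indices(weights, num_samples):
    """Deterministic inverse-CDF sampling: low-CAS snippets are picked more often."""
    cdf = list(accumulate(weights))
    last = len(weights) - 1
    # Evenly spaced quantiles in (0, 1) stand in for random draws (an assumption).
    quantiles = [(k + 0.5) / num_samples for k in range(num_samples)]
    return [min(bisect_left(cdf, q), last) for q in quantiles]


def pseudo_labels(cas, threshold=0.5):
    """Binary location pseudo-labels: 1 where the CAS exceeds a fixed threshold."""
    return [1 if s > threshold else 0 for s in cas]
```

For a base-branch CAS such as `[0.9, 0.1, 0.8, 0.2]`, the weights concentrate on the second and fourth snippets, so the resampled input for the supplementary branch is dominated by the regions the base branch scored lowest.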

Visualization Demo for Action: High Jump

Visualization Demo for Action: Tennis Swing

Visualization Demo for Action: Basketball Dunk

Acknowledgements

This work is supported by the National Key Research and Development Program of China (No. 2020YFB1406801), the 111 plan (No. BP0719010), STCSM (No. 18DZ2270700), and the State Key Laboratory of UHD Video and Audio Production and Presentation.