Understanding the semantics of videos is a complex but crucial task in video analysis. This paper focuses on localizing category-independent events, actions, or other semantics in an untrimmed video, referred to as salient temporal proposal localization. Traditional methods such as sliding windows have a high computational cost due to the dense sampling of video segments. We propose a reinforcement learning based method that trains a localizer to learn a search policy: instead of exploring every video segment, it finds an optimal search path through a tree structure to locate a salient proposal based on the currently observed video segment, thereby reducing the number of video segments fed into the proposal detector. In each search step, the localizer iteratively selects the next sub-region containing salient proposals to continue the search, and a proposal detector is trained to recognize salient proposals within those sub-regions. Experiments demonstrate that our method precisely detects salient proposals with comparable recall while using far fewer candidate windows.
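To make the tree-structured search concrete, the sketch below illustrates one plausible reading of the abstract: a binary tree over temporal segments in which a localizer policy picks the next sub-region to descend into and a detector scores only the segments actually visited. The feature extractor, detector, policy, and tree depth are all hypothetical placeholders, not the paper's actual (RL-trained) components.

```python
# Illustrative sketch only: every function below is a placeholder standing in
# for components the abstract does not specify.
import numpy as np

def segment_features(video, start, end):
    # Placeholder: mean frame feature over the segment [start, end).
    return video[start:end].mean(axis=0)

def detector_score(features):
    # Placeholder proposal detector: returns a saliency score in [0, 1].
    return float(1.0 / (1.0 + np.exp(-features.sum())))

def localizer_policy(features, rng):
    # Placeholder search policy: chooses LEFT child, RIGHT child, or STOP.
    # In the paper this policy would be trained with reinforcement learning.
    logits = np.array([features[:2].sum(), features[2:4].sum(), 0.0])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(["LEFT", "RIGHT", "STOP"], p=probs)

def tree_search(video, min_len=8, max_steps=10, seed=0):
    """Descend a binary tree over temporal segments instead of sliding a window.

    At each step the localizer picks the sub-region to explore next, and the
    detector scores only the segments actually visited.
    """
    rng = np.random.default_rng(seed)
    start, end = 0, len(video)
    visited = []
    for _ in range(max_steps):
        feats = segment_features(video, start, end)
        visited.append((start, end, detector_score(feats)))
        if end - start <= min_len:
            break
        action = localizer_policy(feats, rng)
        mid = (start + end) // 2
        if action == "LEFT":
            end = mid
        elif action == "RIGHT":
            start = mid
        else:
            break
    # Return the visited segment with the highest detector score.
    return max(visited, key=lambda v: v[2])

if __name__ == "__main__":
    video = np.random.default_rng(1).normal(size=(256, 16))  # 256 frames, 16-d features
    print(tree_search(video))
```

With a tree depth of d, the search visits O(d) segments rather than the O(n) windows a dense sliding-window scan would score, which is the efficiency gain the abstract claims.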
Topic: Human Pose and Action Recognition
Source: Proceedings of the 30th ACM International Conference on Multimedia