Multiple Distilling-based spatial-temporal attention networks for unsupervised human action recognition

Zhang, Cheng; Zhong, Jianqi; Cao, Wenming; Ji, Jianhua

doi:10.3233/IDA-230399

Article type: Research Article

Authors: Zhang, Cheng | Zhong, Jianqi | Cao, Wenming^* | Ji, Jianhua

Affiliations: State Key Laboratory of Radio Frequency Heterogeneous Integration, Shenzhen University, Shenzhen, Guangdong, China

Correspondence: [*] Corresponding author: Wenming Cao, State Key Laboratory of Radio Frequency Heterogeneous Integration, Shenzhen University, Shenzhen, Guangdong, China. E-mail: [email protected].

Abstract: Unsupervised action recognition based on spatiotemporal feature fusion has attracted much attention in recent years. However, existing methods still have several limitations: (1) long-term dependencies are not effectively extracted in the temporal dimension; (2) high-order motion relationships between non-adjacent nodes are not effectively captured in the spatial dimension; (3) model complexity is too high when the input sequence to the cascaded layers is long or contains many key points. To address these problems, this paper proposes the Multiple Distilling-based Spatial-Temporal Attention (MD-STA) network, which extracts temporal and spatial features separately and then fuses them. Specifically, we first propose a Screening Self-attention (SSA) module, which uses a sparsity metric on dot-product pairs to find long-term dependencies between distant frames and high-order motion patterns between non-adjacent nodes within a single frame. Then, we propose the Frames and Keypoint-Distilling (FKD) module, which uses extraction operations to halve the input of each cascaded layer, eliminating uninformative key-point and time-frame features and thus reducing time and memory complexity. Finally, the Dim-reduction Fusion (DRF) module reduces the dimension of the extracted features to further eliminate redundancy. Extensive experiments on three datasets, NTU-60, NTU-120, and UWA3D, show that MD-STA achieves state-of-the-art performance in skeleton-based unsupervised action recognition.
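The abstract does not give the formulas behind the SSA and FKD modules, so the following is only a minimal NumPy sketch of the two ideas it names: screening queries with a sparsity metric on dot-product pairs, and halving the input passed to the next layer. The max-minus-mean score (in the style of ProbSparse attention), the mean-of-values fallback for unselected queries, the stride-2 max pooling, and all function and parameter names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def screening_self_attention(Q, K, V, n_active):
    """Attend fully only for the n_active queries with the highest sparsity score.

    Assumption: the sparsity score of a query is the max minus the mean of its
    scaled dot products with all keys. Unselected queries receive the mean of V.
    For clarity the full score matrix is computed here; an efficient version
    would estimate the score from a sample of keys.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (n_queries, n_keys)
    measure = scores.max(axis=1) - scores.mean(axis=1)  # one score per query
    top = np.argsort(-measure)[:n_active]               # indices of active queries

    # Default output for every query: the mean of the values.
    out = np.tile(V.mean(axis=0), (Q.shape[0], 1))

    # Numerically stable softmax over keys, for the active queries only.
    s = scores[top]
    s = s - s.max(axis=1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(axis=1, keepdims=True)
    out[top] = w @ V
    return out, top


def distill_halve(X):
    """Halve the first axis (frames or key points) with window-2, stride-2 max pooling.

    Assumption: a trailing odd row is dropped.
    """
    n = X.shape[0] - X.shape[0] % 2
    return X[:n].reshape(n // 2, 2, X.shape[1]).max(axis=1)


# Hypothetical input: 8 frames, each with a 4-dimensional feature vector.
rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 4))
attended, active = screening_self_attention(frames, frames, frames, n_active=3)
halved = distill_halve(attended)
print(attended.shape, halved.shape)
```

Under these assumptions the same two functions could be applied along the key-point axis of a single frame instead of the frame axis, which is how the abstract describes the spatial branch.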

Keywords: 3D human motion prediction, distilling, unsupervised, attention

Journal: Intelligent Data Analysis, vol. 28, no. 4, pp. 921-941, 2024

Published: 17 July 2024
