Article type: Research Article
Authors: Zhang, Qian | Bai, Enrui* | Shao, Mingwen | Liang, Hong
Affiliations: College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, China
Correspondence: [*] Corresponding author. Enrui Bai, E-mail: [email protected].
Abstract: Convolutional neural networks (CNNs) and Transformer architectures have traditionally been the preferred models for computer vision tasks. Recently, however, networks based on multi-layer perceptron (MLP) structures, which rely on neither convolution nor attention mechanisms, have surged in popularity. These MLP architectures have demonstrated exceptional performance on image classification, achieving high accuracy at lower time complexity. Video classification, in contrast, involves larger amounts of data and requires more intricate feature extraction, increasing time and resource consumption. To improve computational efficiency and reduce resource utilization, we propose Video-MLP, a convolution-free and Transformer-free architecture for video classification. Video-MLP uses a simple MLP structure to learn video features. Specifically, it comprises two types of layers, the Spatial-Mixer and the Temporal-Mixer, which capture spatial and temporal information respectively. The Spatial-Mixer extracts spatial information from each frame along the height and width dimensions, while the Temporal-Mixer models temporal information at the same spatial positions across frames. To make spatial-temporal modeling more efficient, we use a spatial-temporal information fusion approach to integrate information at different scales. Additionally, we group the input data along the time dimension and design three different grouping schemes for extracting temporal information. Experimental results show that Video-MLP achieves accuracy rates of 87.2% on the Kinetics-400 dataset and 75.3% on the Something-Something V2 dataset, outperforming models of equivalent computational complexity. Notably, Video-MLP achieves these results without convolution or attention mechanisms, and without pre-training on large-scale image or video datasets.
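The Spatial-Mixer/Temporal-Mixer split described in the abstract can be sketched as token-mixing MLPs applied along different tensor axes. The following is a minimal illustration only, assuming plain two-layer MLPs in the style of MLP-Mixer; the weight shapes, hidden size, activation, and omission of normalization, residuals, and the fusion/grouping schemes are all illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_mix(x, axis, hidden):
    """Mix information along `axis` with a two-layer MLP (illustrative only)."""
    n = x.shape[axis]
    w1 = rng.standard_normal((n, hidden)) * 0.02
    w2 = rng.standard_normal((hidden, n)) * 0.02
    y = np.moveaxis(x, axis, -1)        # bring the mixed axis last
    y = np.maximum(y @ w1, 0.0) @ w2    # ReLU MLP along that axis
    return np.moveaxis(y, -1, axis)

# Video tensor of shape (frames T, height H, width W, channels C)
video = rng.standard_normal((8, 14, 14, 32))

# Spatial-Mixer: mix within each frame, along the height and width axes.
x = mlp_mix(video, axis=1, hidden=64)   # height mixing
x = mlp_mix(x, axis=2, hidden=64)       # width mixing

# Temporal-Mixer: mix the same spatial position across frames (the T axis).
x = mlp_mix(x, axis=0, hidden=64)

print(x.shape)  # the tensor shape is preserved: (8, 14, 14, 32)
```

Note how each mixer is just a per-axis MLP, so no convolution or attention is involved; this is what keeps the per-layer time complexity low.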
Keywords: MLP-based-model, video classification, computer vision, deep learning
DOI: 10.3233/JIFS-240310
Journal: Journal of Intelligent & Fuzzy Systems, vol. Pre-press, no. Pre-press, pp. 1-12, 2024
IOS Press, Inc.
6751 Tepper Drive
Clifton, VA 20124
USA
Tel: +1 703 830 6300
Fax: +1 703 830 2300
[email protected]
For editorial issues, like the status of your submitted paper or proposals, write to [email protected]
IOS Press
Nieuwe Hemweg 6B
1013 BG Amsterdam
The Netherlands
Tel: +31 20 688 3355
Fax: +31 20 687 0091
[email protected]
For editorial issues, permissions, book requests, submissions and proceedings, contact the Amsterdam office [email protected]
Inspirees International (China Office)
Ciyunsi Beili 207(CapitaLand), Bld 1, 7-901
100025, Beijing
China
Free service line: 400 661 8717
Fax: +86 10 8446 7947
[email protected]
If you need help with publishing or have any suggestions, please email: [email protected]