2Guangzhou Xinhua University, Guangzhou, P.R. China
*E-mail: kaishixu@hotmail.com
Video human action recognition is an important research problem. Because of challenges such as complex scenes, large variations in spatial scale, and irregular deformation of the targets to be recognised, existing algorithms often suffer from large parameter counts and high computational cost. This paper proposes a new computational framework based on a lightweight deep learning model. A time-domain Fourier transform is used to generate motion salience maps that highlight human motion regions during feature extraction, and intra-segment and inter-segment feature differences at different temporal scales are exploited to capture action information at different temporal granularities, enabling more accurate modelling of human actions with varying spatio-temporal scales. In addition, an action excitation method based on deformable convolution is proposed to address irregular deformation, spatial multi-scale changes, and the loss of low-level information as network depth increases. Experiments verify the effectiveness of the proposed algorithm in terms of both computational efficiency and accuracy.
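To make the motion salience idea concrete, the following is a minimal, illustrative sketch (not the authors' exact formulation) of how a time-domain Fourier transform can produce a motion salience map for one video segment: frequency components above the DC term capture change over time, so their aggregated magnitude highlights moving regions. The tensor layout, the grayscale reduction, and the normalisation step are all assumptions for this sketch.

```python
# Hypothetical sketch of a temporal-FFT motion salience map (PyTorch).
import torch

def motion_salience_map(clip: torch.Tensor) -> torch.Tensor:
    """clip: (T, C, H, W) frames of one video segment -> (H, W) salience map."""
    gray = clip.mean(dim=1)                    # (T, H, W): collapse colour channels
    spec = torch.fft.rfft(gray, dim=0)         # FFT along the temporal axis
    motion_energy = spec[1:].abs().sum(dim=0)  # drop the DC (static) term, sum magnitudes
    return motion_energy / (motion_energy.max() + 1e-8)  # normalise to [0, 1]

# Example: an 8-frame RGB segment of 112x112 pixels
clip = torch.rand(8, 3, 112, 112)
salience = motion_salience_map(clip)  # (112, 112); larger values indicate more temporal change
```

In a framework of this kind, such a map could be broadcast over the backbone's feature maps as a multiplicative weight so that spatial features in moving regions are emphasised during feature extraction.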
Introduction: Deep-learning frameworks for video human action recognition mainly include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformers. Simonyan et al. [1] proposed a two-stream CNN for video human action recognition, and Feichtenhofer et al. [2] implemented a two-stream structure with a residual network (ResNet), though both require additional optical flow images to obtain motion features. RNNs can model time-series data [3] but cannot fully exploit spatio-temporal information. Tran et al. [4] proposed the C3D network, Tran et al. [5] extended ResNet and proposed the Res3D network, and Diba et al. [6] designed the Temporal 3D Convolutional Network (T3D).
In addition, Girdhar et al. [7] combined Fast R-CNN with the Transformer and proposed the Video Action Transformer Network, while Bertasius et al. [8] proposed the TimeSformer network based on a distributed attention