Tianhao Yan - Authorea

The great variety of human emotional expression, together with differences in how people perceive and annotate emotions, makes Speech Emotion Recognition (SER) an ambiguous and challenging task. With the development of deep learning, substantial progress has been made in supervised SER systems. However, existing convolutional neural networks have certain limitations, such as their inability to capture global features well, even though these features carry important emotional information. In addition, because emotion is subjective and continuous, the instance segments into which emotional speech is typically divided do not fully reflect the true labels and cannot describe dynamic temporal changes. As a result, an accurate emotional representation cannot be learnt during feature extraction. To overcome these limitations, we propose a speech-only end-to-end network that maps sequences of different lengths to a fixed number of chunks and strictly preserves the order of the chunks by adaptively adjusting their overlap. It then extracts log-mel spectrogram features from the chunk-level segments and feeds them into the Residual Multi-Scale Convolutional Neural Networks with Transformer (RMSCTx) model framework. Finally, with the order of the chunk-level segments preserved, a temporal-domain mean layer is used to further extract utterance-level feature representations. With this method, we perform multidimensional SER, i.e., the prediction of arousal, valence, and dominance. Experimental results on three popular corpora demonstrate both the superiority of our approach and the robustness of the model for SER, including an improvement in recognition accuracy on the newest version of the public dataset MSP-Podcast (1.9).
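
The chunking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`chunk_starts`, `split_into_chunks`, `temporal_mean`), the evenly spaced start indices, and the zero padding of short inputs are all assumptions made for illustration. The idea shown is that a fixed number of fixed-length windows is spread over the utterance, so that shorter utterances get more overlap between neighbouring chunks while the chunk order is always preserved; a plain mean over the ordered chunk-level vectors then stands in for the temporal-domain mean layer.

```python
def chunk_starts(num_frames, num_chunks, chunk_len):
    """Start indices of `num_chunks` windows of `chunk_len` frames.

    Starts are spread evenly between 0 and the last valid start, so the
    overlap between neighbouring chunks adapts to the utterance length.
    (Assumed scheme; the abstract does not specify the exact rule.)
    """
    if num_chunks < 1 or chunk_len < 1:
        raise ValueError("num_chunks and chunk_len must be positive")
    last_start = max(num_frames - chunk_len, 0)
    if num_chunks == 1:
        return [0]
    step = last_start / (num_chunks - 1)  # shrinks as the utterance gets shorter
    return [round(i * step) for i in range(num_chunks)]


def split_into_chunks(frames, num_chunks, chunk_len, pad_value=0.0):
    """Map a variable-length frame sequence to a fixed number of ordered chunks."""
    frames = list(frames)
    if len(frames) < chunk_len:
        # Assumption: inputs shorter than one chunk are padded at the end.
        frames = frames + [pad_value] * (chunk_len - len(frames))
    starts = chunk_starts(len(frames), num_chunks, chunk_len)
    return [frames[s:s + chunk_len] for s in starts]


def temporal_mean(chunk_embeddings):
    """Average ordered chunk-level vectors into one utterance-level vector."""
    n = len(chunk_embeddings)
    return [sum(column) / n for column in zip(*chunk_embeddings)]


if __name__ == "__main__":
    long_utt = list(range(100))   # 100 frames
    short_utt = list(range(50))   # 50 frames
    print(chunk_starts(len(long_utt), 5, 40))
    print(chunk_starts(len(short_utt), 5, 40))
```

In this sketch every utterance yields the same number of equally sized chunks regardless of its duration, which is what allows a fixed-shape input to a downstream model; in a full pipeline each chunk would be converted to a log-mel spectrogram and passed through the network before the mean is taken.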