(Wydawnictwo Politechniki Łódzkiej, 2023) Duch, Piotr; Wiatrowska, Izabela; Kapusta, Paweł
Speech emotion recognition (SER) is a crucial aspect of human-computer
interaction. In this article, we propose a deep learning approach to SER
that combines convolutional (CNN) and recurrent (RNN) neural networks.
We evaluated the approach on four audio datasets: CREMA-D, RAVDESS, TESS,
and EMOVO. Our experiments tested
various feature sets and extraction settings to determine the most effective
features for SER. Our results show that the proposed approach achieves high
accuracy and outperforms state-of-the-art algorithms.
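
To make "extraction settings" concrete, here is a minimal sketch of how the frame length and hop size change the shape of a spectral feature matrix. The abstract does not name the feature sets or settings that were tested, so the log-magnitude spectrogram, the function name `log_spectrogram`, the parameter values, and the synthetic tone below are all illustrative assumptions rather than the authors' pipeline.

```python
import numpy as np


def log_spectrogram(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Return a (num_frames, num_bins) log-magnitude spectrogram.

    Hypothetical helper for illustration; not taken from the article.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        # Zero-pad very short clips so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_len: i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # The small constant avoids taking log(0) on silent bins.
    return np.log(magnitude + 1e-10)


# A synthetic 1-second, 440 Hz tone stands in for a real utterance.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

# Two assumed extraction settings: (frame length, hop size) in milliseconds.
short_frames = log_spectrogram(tone, sample_rate, frame_ms=25.0, hop_ms=10.0)
long_frames = log_spectrogram(tone, sample_rate, frame_ms=50.0, hop_ms=25.0)
print(short_frames.shape, long_frames.shape)
```

Shorter frames give more time steps with fewer frequency bins, while longer frames give the reverse. A matrix of this kind (time by frequency) is the sort of input a convolutional front end followed by a recurrent layer could consume.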