Tarantino, Lorenzo; Garner, Philip N.; Lazaridis, Alexandros
Published: 2019 (deposited 2020-02-18)
DOI: 10.21437/Interspeech.2019-2822
Handle: https://infoscience.epfl.ch/handle/20.500.14299/166357

Self-attention for Speech Emotion Recognition
Document type: conference paper (conference proceedings)

Abstract: Speech Emotion Recognition (SER) has been shown to benefit from many of the recent advances in deep learning, including recurrent-based and attention-based neural network architectures. Nevertheless, performance still falls short of that of humans. In this work, we investigate whether SER could benefit from the self-attention and global windowing of the transformer model. We show on the IEMOCAP database that this is indeed the case. Finally, we investigate whether using the distribution of possibly conflicting annotations in the training data as soft targets could outperform majority voting. We show that this performance gain increases with the agreement level of the annotators.
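
The abstract's core idea, applying transformer-style self-attention with a global receptive window over the frames of an utterance, can be illustrated with a minimal sketch. This is not the authors' architecture; the feature dimension (40 log-mel bands), model size, and four emotion classes are assumptions chosen to match common IEMOCAP setups.

```python
# Minimal sketch of transformer self-attention for SER (not the paper's exact model).
# Assumptions: 40-dim frame features, 4 emotion classes, mean pooling over time.
import torch
import torch.nn as nn

class SelfAttentionSER(nn.Module):
    def __init__(self, feat_dim=40, d_model=128, n_heads=4, n_classes=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)            # embed each acoustic frame
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x):                                   # x: (batch, time, feat_dim)
        h = self.encoder(self.proj(x))                      # every frame attends to all others
        return self.classifier(h.mean(dim=1))               # pool over time, predict emotion

logits = SelfAttentionSER()(torch.randn(2, 300, 40))        # two utterances of 300 frames
```

Unlike a recurrent encoder, each self-attention layer here connects every frame to every other frame in a single step, which is the "global windowing" the abstract refers to.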
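
The soft-target idea can likewise be sketched: instead of collapsing the annotators' possibly conflicting labels into a single majority vote, the normalized vote counts serve directly as the training target. A hedged sketch, assuming per-utterance vote counts over the emotion classes are available:

```python
# Minimal sketch (assumed data layout): soft targets from annotator vote counts
# versus a majority-vote hard label, for a single utterance.
import torch
import torch.nn.functional as F

votes = torch.tensor([[2., 1., 0., 0.]])                  # 2 annotators chose class 0, 1 chose class 1
soft_target = votes / votes.sum(dim=-1, keepdim=True)     # [0.67, 0.33, 0.00, 0.00]

logits = torch.randn(1, 4, requires_grad=True)            # model output for the utterance

# Soft-target cross-entropy: -sum_c p_c * log q_c over the annotation distribution.
loss_soft = -(soft_target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Majority-vote baseline: the argmax of the votes becomes a hard label.
loss_hard = F.cross_entropy(logits, votes.argmax(dim=-1))
```

The abstract's finding is that the soft-target variant's advantage grows with annotator agreement; the sketch above only shows how the two targets differ, not that result.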