Abstract

We address the problem of automatically predicting group performance on a task from multimodal features of the group's conversation: acoustic features extracted from the speech signal and linguistic features derived from the conversation transcripts. Because much prior work on social signal processing has focused on nonverbal features such as voice prosody and gesture, we explicitly investigate whether features of linguistic content are useful for predicting group performance. We find that the best-performing models use both linguistic and acoustic features, and that linguistic features alone can also yield good performance on this task. Because only a relatively small amount of task data is available, we present experimental approaches using domain adaptation and a simple data augmentation method, both of which yield substantial improvements in predictive performance over a model trained on the target data alone.
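To make the modeling setup concrete, the sketch below fuses acoustic features with transcript-derived linguistic features and fits a regressor to a group performance score. This is a minimal, hypothetical illustration using scikit-learn; the specific features, toy data, and ridge regression model are assumptions for exposition, not the paper's actual pipeline.

```python
# Minimal sketch (assumed setup, not the paper's pipeline): early fusion of
# acoustic and linguistic features to predict a group performance score.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Hypothetical data: one row per group conversation.
transcripts = [
    "okay let's list every use we can think of for a brick",
    "we should rank the survival items by importance first",
    "i think the flashlight matters more than the map",
]
acoustic = np.array([            # e.g., mean pitch, pitch variance, speech rate
    [182.0, 31.5, 4.1],
    [165.3, 22.0, 3.6],
    [201.7, 40.2, 4.8],
])
scores = np.array([0.72, 0.55, 0.81])  # group performance labels (assumed)

# Linguistic features: TF-IDF over each conversation's transcript.
linguistic = TfidfVectorizer().fit_transform(transcripts).toarray()

# Early fusion: concatenate the two modalities into one feature vector.
X = np.hstack([acoustic, linguistic])

model = Ridge(alpha=1.0)
model.fit(X, scores)
print(model.predict(X))  # in-sample predictions for the toy data
```

In practice the fused representation could feed any regressor, and the domain adaptation and data augmentation steps described above would operate on such feature vectors when target-domain data is scarce.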
