Abstract

This Master's thesis investigates the application of Bidirectional Encoder Representations from Transformers (BERT) to podcasts, with the aim of identifying the host and detecting structuring questions within each episode. The research is conducted on an annotated dataset of automatic transcriptions of 38 French podcasts from Radio France and 37 English-language TV shows from France 24. A variety of BERT models with different language orientations are tested and compared on two classification tasks: the detection of host sentences and the classification of structuring questions. The latter is first performed as a three-label classification task; a reduction to a binary classifier is then proposed, with two new configurations. Initially, BERT models are fine-tuned separately on the French and English datasets, as well as on the joint dataset. Subsequently, a multilingual approach is implemented by automatically translating the original dataset into a total of twenty languages. The translated datasets are used for multilingual fine-tuning, and German is included as an evaluation language. BERT models demonstrate adequate performance in host detection, sufficient to pinpoint the actual host of the show within the list of speakers, as does a rule-based method proposed for comparison. For structuring-question detection, the three-label classifier appears too fine-grained, at least given the size of the fine-tuning data, while one binary classification configuration yields promising results. The multilingual experiment shows that automatic translation has potential as a source of fine-tuning data and highlights the need for original test data in these languages.
