Thiran, Jean-Philippe; Motlicek, Petr; Zuluaga Gomez, Juan Pablo
2024-08-13; 2024-08-13; 2024-08-13; 2024-09-02
10.5075/epfl-thesis-10616
https://infoscience.epfl.ch/handle/20.500.14299/240713

Automatic speech recognition (ASR) and spoken language understanding (SLU) are core components of current voice-powered AI assistants such as Siri and Alexa. SLU involves transcribing speech with ASR and comprehending it with natural language understanding (NLU) systems. Traditionally, SLU runs in a cascaded setting, where an in-domain ASR system automatically generates transcripts that carry valuable semantic information, e.g., named entities and intents. These components have generally been based on statistical approaches with hand-crafted features. However, current trends have shifted towards large-scale end-to-end (E2E) deep neural networks (DNN), which have shown superior performance on a wide range of SLU tasks. For example, ASR has seen a rapid transition from traditional hybrid-based modeling to encoder-decoder and Transducer-based modeling. Even though there is an undeniable improvement in performance, other challenges have come into play, such as the pressing need for large-scale supervised datasets; the need for additional modalities, such as contextual knowledge; massive GPU clusters for training large models; or high-performance and robust large models for complex applications. This dissertation explores and proposes solutions to the challenges that arise in these complex settings. Specifically, we address: (1) How to overcome data scarcity in hybrid-based and E2E ASR models, i.e., in low-resource applications? (2) How to properly integrate contextual knowledge at decoding and training time so as to obtain improved models? (3) What is the fastest and best approach to train streaming ASR models from scratch for challenging domains without supervised data?
(4) How do we reduce the computational budget required at training and inference time by modifying the state-of-the-art E2E ASR architectures? Similarly, we target some questions from the SLU perspective, such as analyzing the optimal representations for cascaded SLU and exploring SLU tasks other than intent and slot filling that can be performed in an E2E fashion. The dissertation closes by presenting STAC-ST and TokenVerse, two novel architectures that handle ASR and SLU tasks seamlessly in a single model via special tokens.

en
Automatic Speech Recognition; Spoken Language Understanding; Conversational Speech; Air Traffic Control Communications; End-to-End ASR; Low-Resource ASR
Low-Resource Speech Recognition and Understanding for Challenging Applications
thesis::doctoral thesis