Towards Integrated Processing of Physiological Signals and Speech
Research in speech processing has largely focused on source-system modeling,
where vocal fold vibrations serve as the source and the articulations of the
vocal tract as the system. However, speech production involves a complex
interplay of physiological systems, including the muscular, respiratory,
cognitive, and nervous systems. Variations in these systems can
significantly affect speech. For example, individuals with respiratory or
cardiovascular issues may experience breathlessness that alters their speech.
Parkinson's disease, a neurodegenerative disorder, can impair speech by
disrupting the muscle control required for articulation.
Variations in cognitive load induced by mental stress can also
impair speech. A deeper understanding of the relationship between
speech and physiological signals could enhance existing speech technologies and
lead to new applications, particularly in the healthcare domain. In this thesis,
we move beyond traditional speech processing methods and investigate
physiological signals in relation to speech. More specifically, we estimate
breathing patterns and heart rate from speech signals and integrate them into
speech-related applications.
We developed end-to-end convolutional neural networks to estimate breathing
patterns from raw speech waveforms and compared them with models based on
spectral features.
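As a rough illustration of what such an end-to-end model can look like, the sketch below defines a small 1-D CNN that maps a raw 16 kHz waveform to a downsampled breathing signal. It assumes PyTorch, and the layer counts, kernel widths, and strides are illustrative placeholders rather than the architecture developed in this work.

```python
import torch
import torch.nn as nn

class RawWaveformBreathingNet(nn.Module):
    """Illustrative end-to-end CNN: raw speech in, breathing signal out.

    Hypothetical layer sizes; the architecture used in the thesis may differ.
    """

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            # Strided 1-D convolutions progressively downsample the
            # 16 kHz waveform towards the much lower breathing-signal rate.
            nn.Conv1d(1, 16, kernel_size=160, stride=40), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=10, stride=4), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, stride=2), nn.ReLU(),
        )
        # A 1x1 convolution maps the learned features to a single output
        # channel: the estimated breathing amplitude per output frame.
        self.head = nn.Conv1d(32, 1, kernel_size=1)

    def forward(self, waveform):
        # waveform: (batch, 1, num_samples)
        return self.head(self.encoder(waveform))

model = RawWaveformBreathingNet()
speech = torch.randn(2, 1, 16000 * 4)  # two 4-second utterances
breathing = model(speech)              # (2, 1, downsampled_length)
print(breathing.shape)
```

In practice, such a network would be trained with a regression loss, for example mean squared error, against a reference breathing signal recorded simultaneously with the speech.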
The evaluation employed standard regression metrics as well as
breathing-related parameters such as breathing rate and tidal volume. We
showed that both types of models performed comparably, with the raw-waveform
models requiring a smaller input window. Our single- and cross-database
analyses confirmed the generalizability of the models. We also examined the
limitations of the evaluation metrics employed in our study.
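To make the breathing-related parameters concrete, the sketch below derives a breathing rate by peak counting and a peak-to-trough excursion as a tidal-volume proxy from an estimated breathing signal. It assumes scipy and a hypothetical minimum breath spacing of 1.5 s; the exact parameter-extraction procedure used in the study may differ.

```python
import numpy as np
from scipy.signal import find_peaks

def breathing_parameters(breathing, fs):
    """Breathing rate and a tidal-volume proxy from a breathing signal
    sampled at fs Hz. Hypothetical, illustrative procedure."""
    # Inhalation peaks, assuming successive breaths are >= 1.5 s apart.
    peaks, _ = find_peaks(breathing, distance=int(1.5 * fs))
    duration_min = len(breathing) / fs / 60.0
    rate_bpm = len(peaks) / duration_min  # breaths per minute
    # Peak-to-trough excursion as a unitless tidal-volume proxy; absolute
    # volume would require calibration against a reference sensor.
    troughs, _ = find_peaks(-breathing, distance=int(1.5 * fs))
    depth = (breathing[peaks].mean() - breathing[troughs].mean()
             if len(peaks) and len(troughs) else np.nan)
    return rate_bpm, depth

# Synthetic check: a 0.25 Hz sinusoid (15 breaths/min) for 60 s at 25 Hz.
fs = 25
t = np.arange(0, 60, 1 / fs)
print(breathing_parameters(np.sin(2 * np.pi * 0.25 * t), fs))
```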
Additionally, we analysed the raw-waveform models to understand the
information they capture. Our experiments revealed that they rely on the
low-frequency components of the speech signal for accurate estimation of
breathing patterns.
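A simple, hypothetical way to probe this frequency dependence is to low-pass filter the input speech at increasing cutoffs and observe how the model's estimates change relative to the unfiltered baseline. The sketch below shows the filtering side of such a probe with scipy; the cutoff values and filter order are illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def lowpass(waveform, cutoff_hz, fs=16000, order=8):
    """Zero-phase Butterworth low-pass filter of a speech waveform."""
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, waveform)

# Stand-in for a real utterance; in the actual probe this would be test
# speech fed to the trained breathing-estimation model.
speech = np.random.randn(16000 * 4)

for cutoff in (250, 500, 1000, 2000, 4000):
    filtered = lowpass(speech, cutoff)
    # With a trained model one would compare its predictions on `filtered`
    # against those on `speech`; here we only report retained energy.
    retained = np.sum(filtered ** 2) / np.sum(speech ** 2)
    print(f"cutoff {cutoff:>4} Hz: {retained:.1%} of energy retained")
```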
Furthermore, we studied neural embeddings extracted from the raw-waveform
models in several applications, including COVID-19 detection from speech,
emotion recognition, and the analysis of differences in breathing information
between natural and synthetic speech for presentation attack detection.
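One common way to expose such embeddings from a trained estimator is to attach a forward hook to an intermediate layer, sketched below in PyTorch for the placeholder RawWaveformBreathingNet from the earlier example; the hook mechanism and the mean-pooling are assumptions, not a procedure prescribed by the thesis.

```python
import torch

embeddings = {}

def save_embedding(module, inputs, output):
    # Mean-pool over time to obtain one fixed-size vector per utterance.
    embeddings["encoder"] = output.mean(dim=-1).detach()

# Reuse the RawWaveformBreathingNet sketch from above; hooking its
# encoder makes a forward pass store the intermediate representation.
model = RawWaveformBreathingNet()
model.encoder.register_forward_hook(save_embedding)

with torch.no_grad():
    model(torch.randn(1, 1, 16000 * 4))

print(embeddings["encoder"].shape)  # (1, 32): utterance-level embedding
```

Vectors of this kind can then be fed to downstream classifiers, for instance for COVID-19 detection or presentation attack detection.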
We also developed models to estimate cardiac parameters, such as heart rate,
from speech using acoustic features and neural embeddings derived from
self-supervised learning models. We found significant speaker-dependent
variability in performance. Our approach was validated on two datasets,
producing consistent trends and confirming the generalizability of the models.
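As an illustration of this pipeline, the sketch below extracts utterance-level embeddings from a pre-trained self-supervised model and fits a simple regressor to toy heart-rate targets. The use of wav2vec 2.0 via HuggingFace transformers and of ridge regression are assumptions made for the example, not necessarily the models used in this work.

```python
import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed pipeline: a pre-trained SSL model supplies utterance embeddings,
# and a simple regressor maps them to heart rate in beats per minute.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
ssl_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def utterance_embedding(waveform, sr=16000):
    """Mean-pooled wav2vec 2.0 hidden states for one utterance."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = ssl_model(**inputs).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Toy stand-ins for (speech, heart rate) training pairs.
X = np.stack([utterance_embedding(np.random.randn(16000 * 3))
              for _ in range(8)])
y = np.random.uniform(60, 100, size=8)  # heart rates in bpm
regressor = Ridge(alpha=1.0).fit(X, y)
print(regressor.predict(X[:2]))
```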
Finally, we studied the feasibility of applying the developed methodologies in
a clinical setting by detecting hypoglycemic states in diabetic patients
through speech analysis. For this, we employed neural embeddings from the
breathing pattern estimation networks alongside other neural embeddings and
acoustic features, and we also examined the performance of the heart rate
estimation models in this context. As part of this research, we compiled two
novel datasets with simultaneous recordings of speech and physiological
signals, one of which was collected in a clinical environment. These datasets
were used to evaluate the performance of the developed models.
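Schematically, such a detector can be built by concatenating the different feature streams per utterance and training a standard classifier, as in the scikit-learn sketch below; the feature dimensions, data, and classifier choice are all placeholders rather than the configuration used in the clinical study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical fusion: concatenate breathing-network embeddings, other
# neural embeddings, and acoustic features for each utterance.
rng = np.random.default_rng(0)
n = 40
breathing_emb = rng.normal(size=(n, 32))  # stand-in embedding dimensions
ssl_emb = rng.normal(size=(n, 768))
acoustic = rng.normal(size=(n, 88))       # e.g. an eGeMAPS-sized feature set
X = np.hstack([breathing_emb, ssl_emb, acoustic])
y = rng.integers(0, 2, size=n)            # 1 = hypoglycemic state (toy labels)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.score(X, y))
```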
Prof. Dimitri Nestor Alice Van De Ville (president); Prof. Daniel Gatica-Perez, Dr Mathew Magimai Doss (thesis directors); Prof. Jean-Philippe Thiran, Dr Vikramjit Mitra, Dr Milos Cernak (examiners)
Lausanne, 2025. Thesis no. 10679, defended on 2025-03-28. 160 pages.