Objective perception metrics for audio quality
The ubiquity of modern telecommunication systems has led to growing interest in audio quality assessment. Because human listening tests are cumbersome and expensive, automated objective measures of audio quality have been developed. Most of these systems focus on the assessment of speech, leaving non-speech signals comparatively underexplored. In this thesis we explore methods for audio quality assessment that generalize well across signal types, including speech and non-speech audio. We focus on the development of non-intrusive methods, which do not have access to the clean reference signal. This work investigates two main approaches: impairment representation learning and a novel semi-intrusive method.

Impairment representation learning adapts deep contrastive learning models to focus on the distortions in an audio signal rather than its content, with the goal of improving the generalization of audio quality metrics across diverse signal types. Experiments in which we train simple machine learning methods, such as K-Nearest Neighbors and Support Vector Regression, on these representations show that both impairment-focused and content-focused representations achieve solid performance for speech and non-speech signals alike, even when pre-training is conducted on speech.

The semi-intrusive method, which mimics humans' innate ability to focus on a single signal within a mixture, frames the audio quality assessment task as a multi-modal problem: the model is trained to predict audio quality from both text and audio inputs. Our experiments show that this model outperforms baselines when evaluated across a broad domain; however, its performance on narrow domains lags behind baseline methods.
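The abstract's second approach pairs learned audio representations with simple regressors. As an illustrative sketch only (not the thesis's actual pipeline), the snippet below shows how fixed-length embeddings from a pre-trained encoder could be mapped to MOS-like quality scores with K-Nearest Neighbors and Support Vector Regression; the embeddings and labels here are synthetic stand-ins.

```python
# Hedged sketch: assumes an upstream encoder has already produced a
# fixed-length embedding per audio clip. We fabricate embeddings and
# quality labels purely to demonstrate the regression step.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Stand-in data: 200 clips, 128-dim embeddings, synthetic MOS-like
# targets clipped to the usual 1-5 scale.
X = rng.normal(size=(200, 128))
w = rng.normal(size=128)
y = np.clip(3.0 + 0.1 * (X @ w), 1.0, 5.0)

X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# The two simple regressors mentioned in the abstract.
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
svr = SVR(kernel="rbf", C=1.0).fit(X_train, y_train)

for name, model in [("KNN", knn), ("SVR", svr)]:
    pred = model.predict(X_test)
    # Pearson correlation with ground-truth scores, a common
    # evaluation criterion for objective quality metrics.
    r = np.corrcoef(pred, y_test)[0, 1]
    print(f"{name}: Pearson r = {r:.2f}")
```

In a real setup, `X` would come from the impairment- or content-focused encoder and `y` from human listening-test scores; the point is that once the representation is fixed, the quality predictor itself can stay very simple.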