Evaluation of audio source separation in the context of 3D audio

The emergence and broader availability of 3D audio systems allow for new possibilities in mixing, post-production and playback of audio content. Used in movie post-production for cinemas, as a special effect by disc jockeys, and even for live concerts, 3D rendering immerses the listener more than ever before. When existing audio material is to be employed, Audio Source Separation (ASS) techniques enable the extraction of single sources from a mixture. Modern mixing approaches for 3D audio do not assign individual gains and delays to each source in every channel. Instead, a sound scene is designed, with individual sources treated as objects to be placed within that scene. The hardware layer is mostly irrelevant for mixing in such a setting. ASS is therefore a valuable tool to "disassemble" a more traditional monophonic, stereophonic, or multichannel mix. However, due to the complexity of the ASS problem, extracted sources are subject to degradations. State-of-the-art objective measures for ASS quality build on monaural auditory models and therefore do not take into account binaural listening and the psychoacoustic phenomena involved, such as binaural unmasking. In this thesis, an extension to Perceptive Evaluation Methods for Audio Source Separation (PEASS) [41] is proposed with spatial rendering in mind. Additionally, a new binaural model for ASS evaluation in the context of 3D audio is presented. The performance of the basic and extended versions of PEASS, as well as that of the proposed binaural model, is evaluated in two subjective studies. The first study is conducted with binaural spatialisation presented over headphones, while the second uses a 3D Wave Field Synthesis (WFS) system. A set of artificial ASS degradation algorithms is proposed and used to generate the stimuli of the subjective studies. Results of the studies indicate a monotonic decrease of the perceived quality as a function of the amount of degradation introduced.
The most important degradation is found to be target distortion, followed by onset misallocation and musical noise-type artifacts. Additionally, spatialising the extracted target source away from the residue, or rendering it louder than the residue, lowers the ratings, indicating a perceived quality degradation. In 3D WFS conditions, results show evidence for monaural and binaural unmasking. The performance of the proposed binaural model is consistently superior to that of the basic and extended PEASS versions. In the binaural spatialisation experiment, a correlation coefficient of 0.60 between subjective and objective results is achieved, versus 0.57 and 0.53 with the extended and basic PEASS versions respectively. For the 3D WFS study, the binaural model achieves a prediction accuracy of 0.67, whereas both PEASS versions reach 0.57. The perceptual validity of the WFS formulation is also verified in a localisation experiment. Vertical localisation is found to be nearly as good as the localisation of physical sources for an extended listening area, with a localisation precision of 6° to 9°. The response time is also used as an indicator of localisation performance.
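
As an illustration of how agreement between subjective ratings and objective predictions can be quantified, the following is a minimal sketch that assumes a Pearson correlation coefficient; the abstract does not state which correlation measure is used, so this choice is an assumption. The function name `pearson` and the score lists are hypothetical placeholders, not data from the studies.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT results from the thesis):
# one mean listener rating and one model prediction per stimulus.
subjective = [20.0, 35.0, 50.0, 65.0, 80.0]
objective = [25.0, 30.0, 55.0, 60.0, 85.0]

r = pearson(subjective, objective)
print(f"correlation between subjective and objective scores: {r:.2f}")
```

A value close to 1 would mean that the objective measure ranks and scales the stimuli much as the listeners do; the coefficients reported above (0.53 to 0.67) would then correspond to moderate agreement.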
