An Information Theoretic Approach to Speaker Diarization of Meeting Recordings
In this thesis we investigate a non parametric approach to speaker diarization for meeting recordings based on an information theoretic framework. The problem is formulated using the Information Bottleneck (IB) principle. Unlike other approaches where the distance between speaker segments is arbitrarily introduced, the IB method seeks the partition that maximizes the mutual information between observations and variables relevant for the problem while minimizing the distortion between observations. The distance between speech segments is selected as the Jensen-Shannon divergence as it arises from the IB objective function optimization. In the first part of the thesis, we explore IB based diarization with Mel frequency cepstral coefficients (MFCC) as input features. We study issues related to IB based speaker diarization such as optimizing the IB objective function, criteria for inferring the number of speakers. Furthermore, we benchmark the proposed system against a state-of-the-art systemon the NIST RT06 (Rich Transcription) meeting data for speaker diarization. The IB based system achieves similar speaker error rates (16.8%) as compared to a baseline HMM/GMM system (17.0%). This approach being non parametric clustering, perform diarization six times faster than realtime while the baseline is slower than realtime. The second part of thesis proposes a novel feature combination system in the context of IB diarization. Both speaker clustering and speaker realignment steps are discussed. In contrary to current systems, the proposed method avoids the feature combination by averaging log-likelihood scores. Two different sets of features were considered – (a) combination of MFCC features with time delay of arrival features (b) a four feature stream combination that combines MFCC, TDOA, modulation spectrum and frequency domain linear prediction. Experiments show that the proposed system achieve 5% absolute improvement over the baseline in case of two feature combination, and 7% in case of four feature combination. The increase in algorithm complexity of the IB system is minimal with more features. The system with four feature input performs in real time that is ten times faster than the GMM based system.
EPFL_TH4888.pdf
openaccess
1.09 MB
Adobe PDF
60f040fa675ec5ac908b9f6b655a704a