Robust Audio Segmentation

Audio segmentation, in general, is the task of segmenting a continuous audio stream in terms of acoustically homogenous regions, where the rule of homogeneity depends on the task. This thesis aims at developing and investigating efficient, robust and unsupervised techniques for three important tasks related to audio segmentation, namely speech/music segmentation, speaker change detection and speaker clustering. The speech/music segmentation technique proposed in this thesis is based on the functioning of a HMM/ANN hybrid ASR system where an MLP estimates the posterior probabilities of different phonemes. These probabilities exhibit a particular pattern when the input is a speech signal. This pattern is captured in the form of feature vectors, which are then integrated in a HMM framework. The technique thus segments the audio data in terms of {\it recognizable} and {\it non-recognizable} segments. The efficiency of the proposed technique is demonstrated by a number of experiments conducted on broadcast news data exhibiting real-life scenarios (different speech and music styles, overlapping speech and music, non-speech sounds other than music, etc.). A novel distance metric is proposed in this thesis for the purpose of finding speaker segment boundaries (speaker change detection). The proposed metric can be seen as special case of Log Likelihood Ratio (LLR) or Bayesian Information Criterion (BIC), where the number of parameters in the two models (or hypotheses) is forced to be equal. However, the advantage of the proposed metric over LLR, BIC and other metric based approaches is that it achieves comparable performance without requiring an adjustable threshold/penalty term, hence also eliminating the need for a development dataset. Speaker clustering is the task of unsupervised classification of the audio data in terms of speakers. For this purpose, a novel HMM based agglomerative clustering algorithm is proposed where, starting from a large number of clusters, {\it closest} clusters are merged in an iterative process. A novel merging criterion is proposed for this purpose, which does not require an adjustable threshold value and hence the stopping criterion is also automatically met when there are no more clusters left for merging. The efficiency of the proposed algorithm is demonstrated with various experiments on broadcast news data and it is shown that the proposed criterion outperforms the use of LLR, when LLR is used with an optimal threshold value. These tasks obviously play an important role in the pre-processing stages of ASR. For example, correctly identifying {\it non-recognizable} segments in the audio stream and excluding them from recognition saves computation time in ASR and results in more meaningful transcriptions. Moreover, researchers have clearly shown the positive impact of further clustering of identified speech segments in terms of speakers (speaker clustering) on the transcription accuracy. However, we note that this processing has various other interesting and practical applications. For example, this provides characteristic information about the data (metadata), which is useful for the indexing of audio documents. One such application is investigated in this thesis which extracts this metadata and combines it with the ASR output, resulting in Rich Transcription (RT) which is much easier to understand for an end-user. In a further application, speaker clustering was combined with precise location information available in scenarios like smart meeting rooms to segment the meeting recordings jointly in terms of speakers and their locations in a meeting room. This is useful for automatic meeting summarization as it enables answering of questions like ``who is speaking and where''. This could be used to access, for example, a specific presentation made by a particular speaker or all the speech segments belonging to a particular speaker.

Related material