Using posterior probabilities for speech/music discrimination

Automatic speech/music discrimination has received increasing attention recently, for example when large multimedia documents must be processed by an ASR system, or for indexing and retrieval of such documents. This work uses the outputs of a speech recognition acoustic classifier (a neural network) to determine whether the signal is speech or something else. We describe two posterior probability measures, entropy and dynamism \cite{williams}, and test them on databases of clean speech and music files, as well as on two broadcast news files containing one speech and one music segment. Likelihood ratio classification is performed at the frame level, with entropy and dynamism calculated over N frames. The higher the value of N (i.e., the longer the segment), the lower the error, but classification remains good enough for N up to 40 frames. Acoustic change detection via the Bayesian Information Criterion \cite{chen}, using these entropy and dynamism measures instead of the usual set of acoustic features, is also applied to the same two news files. This approach appears effective, although it still has to be evaluated on more data.
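As a rough illustration of the two measures, the following sketch computes window-level entropy and dynamism from a sequence of frame posteriors. The function name, the window shape, and the exact normalization are our assumptions for illustration, not the paper's definitions: entropy is taken here as the per-frame Shannon entropy of the posterior vector averaged over the N-frame window, and dynamism as the mean squared change of the posteriors between successive frames.

```python
import numpy as np

def entropy_and_dynamism(posteriors, eps=1e-12):
    """Illustrative window-level measures from frame posteriors.

    posteriors: (N, K) array; each row is a posterior distribution
    over K acoustic classes for one frame.
    Returns (entropy, dynamism) for the window.
    """
    p = np.clip(posteriors, eps, 1.0)
    # Per-frame Shannon entropy, averaged over the N-frame window.
    entropy = float(np.mean(-np.sum(p * np.log(p), axis=1)))
    # Dynamism: mean squared change of the posteriors between frames.
    diffs = np.diff(posteriors, axis=0)
    dynamism = float(np.mean(np.sum(diffs ** 2, axis=1)))
    return entropy, dynamism
```

The intuition these measures capture: on speech, a well-trained phone classifier produces peaked posteriors that jump between classes (low entropy, high dynamism); on music, posteriors tend to stay flat and stable (high entropy, low dynamism).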

Related material