On the design of audio features robust to the album-effect for music information retrieval.
Short-term spectral features, most notably Mel-Frequency Cepstral Coefficients (MFCCs), are the most widely used descriptors of audio signals and are deployed in the majority of state-of-the-art Music Information Retrieval (MIR) systems. In speech processing, however, these descriptors have shown their limitations when the training and testing conditions of a system do not match, e.g. in noisy conditions or under a channel mismatch. A related problem has been observed in music processing: it has been hypothesized that MIR algorithms relying on short-term spectral features unexpectedly pick up on similarities in the production and mastering qualities of music albums. This problem is referred to in the literature as the album-effect, though it has never been studied in depth. This thesis shows how the album-effect relates to the problem of channel mismatch. A measure of robustness to the album-effect is proposed, and channel normalization techniques borrowed from the speech processing community are evaluated as a means of improving the robustness of short-term spectral features. Alternatively, longer-term features describing critical-band specialized temporal patterns (TRAPs) are adapted to the context of music processing. It is shown how such features can describe either timbre or rhythm content, depending on the scale considered for analysis, and how robust they are to the album-effect. In contrast to more classic short-term spectral descriptors, TRAP-based features encode some form of prior knowledge of the problem considered through a trained feature extraction chain. The lack of appropriately annotated datasets, however, raises new issues when it comes to training the feature extraction chain.
Advanced unsupervised learning strategies are considered in this thesis and evaluated against more traditional supervised approaches relying on coarse-grained annotations such as music genres. Specialized learning strategies and specialized architectures are also proposed to compensate for some inherent variability of the data due either to album-related factors or to the dependence of music signals on the tempo of the performance.
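
As an illustration of what a channel normalization technique from speech processing can look like, the sketch below applies cepstral mean normalization (CMN) to a small set of cepstral frames. The abstract does not say which techniques the thesis evaluates, so CMN is an assumption chosen because it is a standard example. The function names and the toy values are hypothetical. The sketch relies on the fact that a fixed linear channel (here standing in for an album-wide production or mastering coloration) is approximately an additive constant in the cepstral domain, so subtracting the per-coefficient mean over time removes it.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over time from a list of cepstral frames.

    `frames` is a list of equal-length lists, one list of coefficients per frame.
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Mean of each coefficient across all frames of the excerpt.
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


def apply_channel(frames, offset):
    """Hypothetical helper: simulate a fixed channel as a constant cepstral offset."""
    return [[c + o for c, o in zip(f, offset)] for f in frames]


# Toy cepstral frames (assumed values, not real MFCCs).
clean = [[1.0, 0.5, -0.2], [0.8, 0.1, 0.3], [1.4, -0.3, 0.1]]
# Assumed album-wide coloration, constant over time.
channel_offset = [0.6, -0.4, 0.25]
mastered = apply_channel(clean, channel_offset)

norm_clean = cepstral_mean_normalize(clean)
norm_mastered = cepstral_mean_normalize(mastered)

# Largest difference between the two normalized versions.
max_diff = max(
    abs(a - b)
    for fa, fb in zip(norm_clean, norm_mastered)
    for a, b in zip(fa, fb)
)
```

After normalization the two versions of the excerpt coincide up to floating-point rounding, which is the sense in which the constant channel component has been removed. Real recordings only approximate this additive model, so CMN should be read here as one plausible baseline rather than the method used in the thesis.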