Abstract

Musical signals, and audio signals in general, form a major part of the large volume of data exchanged in our information-based society. Transmission of high-quality audio signals through narrow-band channels, such as the Internet, requires refined methods for modeling and coding sound. The first important step is the development of new analysis techniques able to discriminate between sound components according to effective perceptual criteria. Our ultimate goal is to develop an optimal representation in a psychoacoustical sense, providing minimum rate and minimum "perceptual distortion" at the same time. One of the most challenging aspects of this task is the definition of a good model for the representation of the different components of sound. Musical and speech signals contain both deterministic and stochastic components. In voiced sounds, the deterministic part provides the pitch and the global timbre: it is in a sense the fundamental structure of a sound and can be easily represented by means of a very restricted set of parameters. The stochastic part contains what we might call the "life of a sound", that is, an ensemble of microfluctuations with respect to an electronic-like, non-evolving sound, as well as noise due to the physical excitation system. Reproducing the stochastic part is of fundamental importance for a sound to be perceived as natural. We faced this challenge by developing a new sound analysis/synthesis method called Fractal Additive Synthesis (FAS). The first step was the definition of a new class of wavelet transforms, namely the Harmonic-Band Wavelet Transform (HBWT). This transform is based on a cascade of a Modified Discrete Cosine Transform (MDCT) and Wavelet Transforms (WT). By means of the HBWT, we are able to separate the stochastic from the deterministic components of sound and to treat them separately. The second step was the definition of a model for the stochastic components.
The spectra of voiced musical sounds have non-zero energy in the sidebands of the spectral peaks. These sidebands contain information relative to the stochastic components. The effect of these components is that the waveform of what we call a pseudo-periodic signal, i.e., the stationary part of voiced sounds, changes slightly from period to period. Our work is based on the experimentally verified assumption that the energy distribution of a sideband of a voiced sound spectrum is mostly shaped like a power of the inverse of the distance from the closest partial. The power spectrum of these pseudo-periodic processes is then modeled by means of a superposition of modulated 1/f components, i.e., by means of what we define as a pseudo-periodic 1/f-like process. The time-scale character of the wavelet transform is well adapted to the self-similar behavior of 1/f processes. The wavelet analysis of 1/f noise yields a set of very loosely correlated coefficients that, to a first approximation, can be well modeled by white noise in the synthesis. The fractal properties of the 1/f noise also motivated our choice of the name Fractal Additive Synthesis. The next step was the definition of a model for the deterministic components of voiced sounds, consistent with the HBWT analysis/synthesis method. The model is inspired, in some respects, by sinusoidal models. The two models provide a complete method for the analysis and resynthesis of voiced sounds in the perspective of structured audio (SA) sound representations. For the stationary part of voiced sounds, compression ratios in the range of 10-15:1 are easily achievable. Even better results in terms of data compression can be obtained by taking psychoacoustic criteria into consideration. A psychoacoustically based selection of perceptually relevant parameters was implemented and tested. Compression ratios of 20-30:1, depending on the musical instrument, were achieved.
An extension of the method based on a pitch-synchronous version of the HBWT with perfect reconstruction time-varying cosine-modulated filter banks was also studied. This makes the method able to handle, for instance, the slight pitch deviations or the vibrato of a musical tone, or more significant changes of pitch, as in a glissando. Finally, the method has been successfully extended to non-harmonic sounds by the introduction and definition of an optimization procedure for the design of non-perfect reconstruction cosine-modulated filter banks with inharmonic band subdivisions. These extensions make FAS more flexible and suitable to analyze, encode, process and resynthesize a large class of musical sounds. The final result of this work is the development of a method for modeling in a flexible way both the stochastic and the deterministic parts of sounds at a very refined perceptual level and with a minimal number of parameters controlling the synthesis process. In the context of SA, the method provides a sound analysis/synthesis tool able to encode and to resynthesize sounds at a low rate, while maintaining their natural timbre dynamics for high-quality reproduction.
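
As an illustration of the statement that the wavelet coefficients of 1/f noise can be modeled, to a first approximation, by white noise, the following is a minimal sketch of wavelet-based synthesis of 1/f-like noise. It is not the HBWT or the FAS implementation: it assumes a plain orthonormal Haar wavelet, a single unmodulated 1/f component, and a variance law of 2**(gamma * m) at scale m (m = 1 being the finest scale). The function names and the parameters `levels` and `gamma` are hypothetical and chosen only for this example.

```python
import numpy as np

def haar_synthesis_step(approx, detail):
    """One level of the inverse orthonormal Haar transform (assumed wavelet)."""
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out

def synthesize_one_over_f(levels, gamma, rng):
    """Approximate 1/f**gamma noise from white wavelet coefficients.

    The detail coefficients at each scale are independent white Gaussian
    noise. Only their variance depends on the scale: it grows as
    2**(gamma * m), so coarser scales (lower frequencies) carry more power.
    """
    approx = np.zeros(1)  # coarsest approximation coefficient set to zero
    for m in range(levels, 0, -1):  # from the coarsest scale to the finest
        sigma = 2.0 ** (gamma * m / 2.0)
        detail = sigma * rng.standard_normal(len(approx))
        approx = haar_synthesis_step(approx, detail)
    return approx  # length 2**levels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = synthesize_one_over_f(levels=10, gamma=1.0, rng=rng)
    power = np.abs(np.fft.rfft(x)) ** 2
    print("mean power, low bins :", power[1:17].mean())
    print("mean power, high bins:", power[256:513].mean())
```

In this sketch the only parameters controlling the synthesis are `gamma` and an overall gain, which reflects why a stochastic component of this kind can be described with very few parameters. In a pseudo-periodic setting, one such process would be associated with each sideband of each partial.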
