Real-world phenomena involve complex interactions between multiple signal modalities. As a consequence, humans naturally integrate, at each instant, perceptions from all their senses in order to enrich their understanding of the surrounding world. This paradigm can also be extremely useful in many signal processing and computer vision problems involving sets of mutually related signals, called multi-modal signals. The simultaneous processing of multi-modal data can in fact reveal information that remains hidden when the different modalities are considered independently. This dissertation deals with the modelling and analysis of natural multi-modal signals. The challenge consists in representing sets of data streams of different nature, such as audio-video sequences, that are interrelated in some complex and unknown manner, in such a way that useful information shared by the different data modalities can be extracted and intuitively used. In this sense, signal representations must strive to model the structural properties of the observed phenomenon, so that the data are expressed in terms of a few meaningful elements. Indeed, if information can be represented using only a few components, those components must capture its salient characteristics. In order to efficiently represent multi-modal data, we advocate the use of sparse signal decompositions over redundant sets of functions, called dictionaries. In this thesis we consider both application-related and theoretical aspects of multi-modal signal processing. We propose two models for multi-modal signals that explain multi-modal phenomena in terms of temporally proximal events present in the different modalities. A first, simple model is inspired by human perception of multi-modal stimuli and relies on the representation of the different data streams as sparse sums of dictionary elements.
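A sparse decomposition over a redundant dictionary, as advocated above, can be sketched for a single modality with a greedy matching-pursuit loop. This is a hypothetical minimal illustration, not the method developed in the thesis; the function name, the random dictionary and the test signal are all invented for the example:

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedily approximate `signal` as a sparse sum of `n_atoms` columns
    of `dictionary` (columns are assumed unit-norm)."""
    residual = signal.astype(float).copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_atoms):
        # Pick the atom most correlated with the current residual.
        correlations = dictionary.T @ residual
        k = np.argmax(np.abs(correlations))
        coeffs[k] += correlations[k]
        residual -= correlations[k] * dictionary[:, k]
    return coeffs, residual

# Redundant dictionary: 64-sample signals, 256 atoms (4x overcomplete).
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)          # normalise each atom
x = 2.0 * D[:, 10] - 1.5 * D[:, 100]    # signal = sparse sum of two atoms
coeffs, res = matching_pursuit(x, D, n_atoms=5)
```

The resulting coefficient vector is sparse (at most five non-zero entries here), which is what makes the representation interpretable: each non-zero coefficient designates one meaningful dictionary element.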
This type of representation makes it possible to intuitively define meaningful events in the different modalities and to discover correlated multi-modal patterns. Taking inspiration from this first model, we introduce a representational framework for multi-modal data based on their sparse decomposition over dictionaries of multi-modal functions. Instead of decomposing each modality separately over a dictionary and then seeking correlations between the extracted patterns, we impose correlation between modalities at the model level. Since such correlations are difficult to formalize, we also propose a method to learn dictionaries of synchronous multi-modal basis elements. Concerning the applications presented in this dissertation, we tackle two major audiovisual fusion problems, namely audiovisual source localization and separation. Although many of the ideas developed in this work are completely general, we focus on this field because it offers the broadest range of applications for this research. The theoretical frameworks developed throughout the thesis are used to localize, separate and extract audio-video sources in audiovisual sequences. Algorithms for cross-modal source localization and blind audiovisual source separation are tested on challenging real-world multimedia sequences. Experiments show that the proposed approach yields promising results for several newly designed multi-modal signal processing algorithms, and that careful modelling of the structural properties of the data conveys useful information for understanding complex multi-modal phenomena.
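The idea of relating temporally proximal events across modalities can be illustrated with a toy co-occurrence measure: given event times extracted from the audio track and from candidate video regions, the region whose events are synchronous with the audio is the likely audiovisual source. This is a hypothetical sketch of the temporal-proximity principle only; the function, window size and event lists are invented for the example:

```python
import numpy as np

def cooccurrence_score(audio_events, video_events, window=3):
    """Count audio events that have a video event within +/- `window`
    time samples; higher scores indicate audiovisual synchrony."""
    score = 0
    for t in audio_events:
        if np.any(np.abs(np.asarray(video_events) - t) <= window):
            score += 1
    return score

# Two candidate video regions; region A moves in sync with the audio.
audio = [10, 30, 52, 75]
region_a = [11, 29, 53, 74]   # temporally proximal -> correlated source
region_b = [5, 40, 60, 90]    # unrelated motion

score_a = cooccurrence_score(audio, region_a)  # high
score_b = cooccurrence_score(audio, region_b)  # low
```

Cross-modal localization can then be read off as the region with the highest score; in this toy setting region A matches on all four events while region B matches on none.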