This paper presents a novel framework to learn sparse represen- tations for audiovisual signals. An audiovisual signal is modeled as a sparse sum of audiovisual kernels. The kernels are bimodal functions made of synchronous audio and video components that can be positioned independently and arbitrarily in space and time. We design an algorithm capable of learning sets of such audiovi- sual, synchronous, shift-invariant functions by alternatingly solving a coding and a learning procedure. The proposed methodology is used to learn audiovisual features from a set of bimodal sequences. The basis functions that emerge are audio-video pairs that capture salient data structures.