We propose a novel method to automatically extract the audio-visual objects present in a scene. First, the synchrony between related events in the audio and video channels is exploited to identify the possible locations of the sound sources. Video regions exhibiting high coherence with the soundtrack are automatically labelled as part of the audio-visual object. Next, a graph cut segmentation procedure is used to extract the entire object. The proposed segmentation approach includes a new term that keeps together pixels in regions with high audio-visual synchrony. When longer sequences are analyzed, the video signal is divided into groups of frames that are processed sequentially, propagating information about the source characteristics forward in time. Results show that our method is able to discriminate between audio-visual sources and distracting moving objects, and that it adapts with a short delay when sources switch from active to inactive and vice versa.
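The graph cut segmentation with an audio-visual coherence term can be sketched as an energy minimization of the following general form; the symbols, terms, and weights below are illustrative assumptions, not the definitions used in the paper:

```latex
E(L) \;=\; \sum_{p \in P} D_p(l_p)
      \;+\; \lambda \sum_{(p,q) \in \mathcal{N}} V_{p,q}(l_p, l_q)
      \;+\; \mu \sum_{p \in P} S_p(l_p),
```

where \(L = \{l_p\}\) assigns each pixel \(p\) a foreground/background label, \(D_p\) is a standard appearance data term, \(V_{p,q}\) is a smoothness term over neighboring pixel pairs \(\mathcal{N}\), and \(S_p\) is a hypothetical synchrony term that penalizes labelling pixels with high audio-visual coherence as background, keeping such regions together in the extracted object; \(\lambda\) and \(\mu\) are assumed weighting constants.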