Semi-supervised Extraction of Audio-Visual Sources
This report presents a semi-supervised method for jointly extracting audio-visual sources from a scene. It consists of a supervised method that segments the video signal, followed by an automatic process that separates the audio track. The approach starts with user interaction to select the audio-visual target to be extracted. This selection divides the problem into two parts: the foreground, which is the object we aim to extract, and the background, which is the remaining content. The segmentation method is applied to the video signal to split its visual content into foreground and background. It describes the video as a volume composed of pixels and the relationships among them, which constitute the features of the problem. This 3D structure can be interpreted as a graph, so the segmentation is obtained by applying a Graph Cuts algorithm. In this work the audio separation is automatic, so prior information must be extracted from the available audio-visual data. To this end, a link between the audio and video channels is defined by an audio-visual motion map: a video sequence that indicates the amount of motion synchronous with the audio track. An audio-based video diffusion method has been developed to obtain this signal. Applying the video segmentation labelling to the audio-visual motion map yields the distribution of the amount of motion between foreground and background. From the range of these values it is possible to assess when the corresponding audio source is on or off. At this point, audio samples of each source can be extracted from the intervals in which only one source is active. These samples are the prior information used to learn audio models by means of spectral GMM estimation. Finally, the separation method is applied. It consists of searching for the most suitable pair of model states given the mixture spectrum of the total audio track.
This achieves the audio separation goal. In this work we study in detail every stage of the method and analyze the performance of our approach.
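As an illustration of the final separation step, the following Python sketch searches, for a single frame, for the pair of states that best explains the mixture spectrum. It is not the report's implementation. It assumes that each source model is a GMM whose states are zero-mean Gaussians described by per-frequency variances, that the two sources add independently, and that the selected pair is then used to build a Wiener-style gain. The names best_state_pair, separate_frame, var_fg, var_bg, w_fg and w_bg are hypothetical.

```python
import numpy as np


def best_state_pair(power, var_fg, var_bg, w_fg, w_bg):
    """Return the (foreground, background) state indices that maximize
    the log-likelihood of one mixture frame.

    power  : (F,) power spectrum of the mixture frame
    var_fg : (K1, F) per-state spectral variances, foreground model
    var_bg : (K2, F) per-state spectral variances, background model
    w_fg   : (K1,) state weights, foreground model
    w_bg   : (K2,) state weights, background model
    """
    # Assumption: independent sources, so the variances of a pair add.
    total = var_fg[:, None, :] + var_bg[None, :, :]          # (K1, K2, F)
    # Gaussian log-likelihood of the frame, up to an additive constant.
    loglik = -(np.log(total) + power / total).sum(axis=-1)   # (K1, K2)
    loglik += np.log(w_fg)[:, None] + np.log(w_bg)[None, :]
    k1, k2 = np.unravel_index(np.argmax(loglik), loglik.shape)
    return int(k1), int(k2)


def separate_frame(spectrum, var_fg, var_bg, w_fg, w_bg):
    """Split one complex mixture frame into two source estimates using a
    Wiener-style gain built from the selected pair of states."""
    power = np.abs(spectrum) ** 2
    k1, k2 = best_state_pair(power, var_fg, var_bg, w_fg, w_bg)
    gain = var_fg[k1] / (var_fg[k1] + var_bg[k2])
    return gain * spectrum, (1.0 - gain) * spectrum
```

The exhaustive search costs K1 × K2 likelihood evaluations per frame, which is manageable for small models. The two estimates always sum back to the mixture frame because the gains are complementary.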