In this paper, we propose a new approach to automatic audio-based out-of-scene detection for audio-visual data recorded by different cameras, camcorders, or mobile phones during social events. All recordings are clustered into out-of-scene and in-scene datasets based on a confidence estimate of cepstral pattern matching against a common master track of the event, recorded by a reference camera. The core of the algorithm combines perceptual time-frequency analysis with a confidence measure derived from the variance of the distance distribution. The results show correct clustering in 100% of cases on a real-life dataset, surpassing cross-correlation while keeping lower system requirements.
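The matching-and-confidence idea can be sketched as follows. This is a minimal illustration on synthetic signals: the crude real cepstrum, the sliding-window distance, and the z-score-style confidence below are assumptions for demonstration, not the paper's exact perceptual features or confidence formula.

```python
import numpy as np

def cepstral_features(signal, frame_len=256, n_coeffs=13):
    """Crude real-cepstrum features per frame (stand-in for the
    paper's perceptual time-frequency analysis)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectrum = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-10), axis=1)
    return cepstrum[:, :n_coeffs]

def match_confidence(master_feats, track_feats):
    """Slide the track's cepstral pattern along the master track and
    score each offset by mean frame distance.  The confidence is how far
    the best alignment lies below the bulk of the distance distribution
    (a hypothetical variance-based measure): a true in-scene track
    produces one sharply separated minimum."""
    n = len(track_feats)
    dists = np.array([
        np.mean(np.linalg.norm(master_feats[i:i + n] - track_feats, axis=1))
        for i in range(len(master_feats) - n + 1)
    ])
    return (dists.mean() - dists.min()) / (dists.std() + 1e-10)

rng = np.random.default_rng(0)
master = rng.standard_normal(64 * 256)                              # stand-in master track
in_scene = master[4 * 256:10 * 256] + 0.02 * rng.standard_normal(6 * 256)  # noisy copy of a scene segment
out_scene = rng.standard_normal(6 * 256)                            # unrelated recording

mf = cepstral_features(master)
conf_in = match_confidence(mf, cepstral_features(in_scene))
conf_out = match_confidence(mf, cepstral_features(out_scene))
# clustering reduces to thresholding: in-scene tracks score markedly higher
```

A simple threshold on this confidence then separates the recordings into in-scene and out-of-scene clusters, which is the clustering step the abstract describes.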