In this paper we investigate how many data samples are needed to reliably measure the temporal correspondence between a chosen set of video and audio cues in a given audio-visual sequence. No optimal model connecting the statistics of audio and video signals currently exists, since it is not known which features should be extracted in order to analyze their correlation. Previous approaches have assumed simple parametric or non-parametric models of the joint distribution to capture the complex relationships between the signals. The main difficulty in using standard information-theoretic quantities, such as entropy and mutual information, is accurately estimating the probability density function from a limited amount of data. The main idea here is to project the data onto a statistically sufficient low-dimensional subspace that is suitable for density estimation. Mutual information is then estimated with a simple parametric model based on an assumption of Gaussianity and used as the measure of correspondence. We examine how the choice of sample size affects the reliability of this correspondence measure between selected features of the two modalities, audio and video.
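
As an illustration of the sample-size question, the following minimal sketch estimates mutual information under a joint-Gaussian assumption and reports how the estimate varies with the number of samples. It is not the implementation used in this work. It assumes that each modality has already been projected to a single scalar feature, in which case the Gaussian mutual information reduces to -0.5 ln(1 - r^2), with r the sample correlation coefficient. The function names (`gaussian_mi`, `mi_vs_sample_size`), the synthetic data, and the correlation value 0.8 are hypothetical choices made only for this example.

```python
import math
import random


def gaussian_mi(x, y):
    """Mutual information (in nats) between two scalar features,
    assuming they are jointly Gaussian: I = -0.5 * ln(1 - r^2)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r2 = (sxy * sxy) / (sxx * syy)   # squared sample correlation
    r2 = min(r2, 1.0 - 1e-12)        # guard against log(0)
    return -0.5 * math.log(1.0 - r2)


def mi_vs_sample_size(sample_sizes, rho=0.8, n_trials=200, seed=0):
    """For each sample size, return (mean, standard deviation) of the
    estimate over repeated draws of synthetic features whose true
    correlation is rho (an assumed value, for illustration only)."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(1.0 - rho ** 2)
    results = {}
    for n in sample_sizes:
        estimates = []
        for _ in range(n_trials):
            x = [rng.gauss(0.0, 1.0) for _ in range(n)]
            y = [rho * a + noise_scale * rng.gauss(0.0, 1.0) for a in x]
            estimates.append(gaussian_mi(x, y))
        mean = sum(estimates) / n_trials
        var = sum((e - mean) ** 2 for e in estimates) / (n_trials - 1)
        results[n] = (mean, math.sqrt(var))
    return results


if __name__ == "__main__":
    true_mi = -0.5 * math.log(1.0 - 0.8 ** 2)
    print(f"true MI for rho = 0.8: {true_mi:.4f} nats")
    for n, (mean, std) in mi_vs_sample_size([10, 50, 250, 1000]).items():
        print(f"n = {n:5d}  mean = {mean:.4f}  std = {std:.4f}")
```

In this synthetic setting, the spread of the estimate across trials indicates how reliable a single measurement would be at a given sample size. With real audio and video features, the projection step and any departure from Gaussianity would also affect the result.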