In recent years, many multimedia systems have been developed for videoconferencing. These systems, although steadily improving, still have to be controlled manually, at least in part. In the present work, we introduce an approach to a Smart Videoconferencing System that should be able to automatically find who is speaking, focus on the speaker, and track them. This idea requires an intelligent computer vision system capable of finding the regions of interest (where something or somebody may be worth focusing on), understanding what is in each region, and acting accordingly. We have studied two main subjects here. One concerns finding the region of interest (focus-of-attention finding), and the other concerns low-level image analysis. In the first one, a moving region detection based on a statistical $