This paper presents an effective implementation of detection and localization of multiple speech sources with microphone arrays. In particular, Scaled Conjugate Gradient descent is used for fast and precise localization within a pre-detected volume of space. The approach is suitable for real-time implementation. An unsupervised approach to speech/non-speech discrimination is also proposed. The integrated system is then successfully applied to the segmentation of spontaneous multi-party speech, as found in meetings. Based on this system, the unsupervised speaker clustering task is then investigated, using distant microphones only. This task is challenging due to the poor quality of the signal and the fast-changing speaker turns encountered in spontaneous speech. An extension of the BIC criterion to multiple modalities is proposed, combining the strengths of speaker location information -- useful in the short term -- and acoustic speaker information, i.e. MFCCs -- useful in the longer term. The combined approach yields a dramatic improvement in speaker clustering results over the acoustic-only approach, and results are close to those obtained with close-talking microphones. Finally, an initial investigation into automatic audio-visual calibration is presented.