In this paper, we introduce probabilistic model based architecture for error handling in human–robot spoken dialogue systems under adverse audio conditions. In this architecture, a Bayesian network framework is used for interpretation of multi-modal signals in the spoken dialogue between a tour-guide robot and visitors in mass exhibition conditions. In particular, we report on experiments interpreting speech and laser scanner signals in the dialogue management system of the autonomous tour-guide robot RoboX, successfully deployed at the Swiss National Exhibition (Expo.02). A correct interpretation of a user's (visitor's) goal or intention at each dialogue state is a key issue for successful voice-enabled communication between tour-guide robots and visitors. To infer the visitors' goal under the uncertainty intrinsic to these two modalities, we introduce Bayesian networks for combining noisy speech recognition with data from a laser scanner, which are independent of acoustic noise. Experiments with real-world data, collected during the operation of RoboX at Expo.02 demonstrate the effectiveness of the approach in adverse environment. The proposed architecture makes it possible to model error-handling processes in spoken dialogue systems, which include complex combination of different multi-modal information sources in cases where such information is available.