In this paper we address the problem of interpreting sensory data for human-robot interaction, especially when the data is gathered from several robots at the same time. After describing motion tracking in this context, we introduce a general framework for situation representation and show how it simplifies the extraction of information suitable for complex man-machine dialogs. As a concrete implementation, we create a narrative description of a complex scene at a public exposition. We consider how to interpret sensor data efficiently and discuss how the number of robots affects the results of the scene interpretation, showing that our approach is not only scalable but also benefits from a growing number of robots.