Title: Human-Centered Scene Understanding via Crowd Counting
Type: thesis::doctoral thesis
Authors: Fua, Pascal; Liu, Weizhe
Date: 2021-11-12
DOI: 10.5075/epfl-thesis-8979
URL: https://infoscience.epfl.ch/handle/20.500.14299/183000
Language: en
Keywords: scene understanding; crowd counting; deep neural networks

Abstract:

Human-centered scene understanding is the process of perceiving and analysing a dynamic scene observed through a network of sensors, with an emphasis on human-related activities. It includes the visual perception of human-related activities from either a single image or a video sequence. Scene understanding focused on human-related activities is becoming increasingly popular, which creates a demand for algorithms that can efficiently model crowd activity in diverse real-world scenarios. In this thesis, we address human-centered scene understanding through crowd counting. Counting people is a challenging task due to perspective distortion and occlusion. We tackle these problems by developing algorithms that leverage a variety of data modalities, including single images, video sequences, and scene perspective maps.

First, we introduce an end-to-end trainable deep architecture for crowd counting that combines features obtained using multiple receptive-field sizes and learns the importance of each such feature at every image location. In other words, our approach adaptively encodes the scale of the contextual information required to accurately predict crowd density. This yields an algorithm that outperforms previous crowd counting methods, especially when perspective effects are strong.

Second, we explicitly model scale changes and reason in terms of people per square meter. We show that feeding a perspective model to the network allows us to enforce global scale consistency, and that this model can be obtained on the fly from the drone's sensors. It also enables us to enforce physically inspired temporal consistency constraints that do not have to be learned. This yields an algorithm that outperforms previous methods at inferring crowd density from a moving drone camera, especially when perspective effects are strong.

Third, for video sequences, we advocate estimating people flows across image locations between consecutive frames and inferring people densities from these flows, instead of regressing the densities directly. This enables us to impose much stronger constraints that encode the conservation of the number of people. As a result, it significantly boosts performance without requiring a more complex architecture. Furthermore, it allows us to exploit the correlation between people flow and optical flow to further improve the results. We also show that leveraging people-conservation constraints in both a spatial and a temporal manner makes it possible to train a deep crowd counting model in an active-learning setting with far fewer annotations. This significantly reduces the annotation cost while yielding performance similar to that of full supervision.
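
To make the first contribution's scale-adaptive context idea concrete, the following is a minimal PyTorch sketch, not the thesis architecture: the module name ScaleAdaptiveContext, the pooling scales, and the exact weighting scheme are illustrative assumptions. Features are average-pooled at several grid sizes, each pooled context is contrasted with the local feature, and a learned per-pixel softmax decides how much each scale contributes at each location.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ScaleAdaptiveContext(nn.Module):
        """Combine features pooled at several receptive-field sizes,
        with per-pixel learned scale weights (an illustrative sketch)."""
        def __init__(self, channels, scales=(1, 2, 4, 8)):
            super().__init__()
            self.scales = scales
            # one 1x1 conv per scale to embed the pooled context
            self.embed = nn.ModuleList(
                nn.Conv2d(channels, channels, kernel_size=1) for _ in scales)
            # predicts an importance score per scale at each location
            self.weight = nn.Conv2d(channels, 1, kernel_size=1)

        def forward(self, x):
            h, w = x.shape[2:]
            contexts, weights = [], []
            for pool, conv in zip(self.scales, self.embed):
                # average-pool to a coarse grid, embed, upsample back
                c = F.adaptive_avg_pool2d(x, (pool, pool))
                c = F.interpolate(conv(c), size=(h, w), mode='bilinear',
                                  align_corners=False)
                contexts.append(c)
                # contrast between the local feature and its context
                # drives the importance of that scale
                weights.append(self.weight(x - c))
            w_maps = torch.softmax(torch.cat(weights, dim=1), dim=1)
            out = sum(w_maps[:, i:i + 1] * contexts[i]
                      for i in range(len(self.scales)))
            # fuse adaptively weighted context with the original features
            return torch.cat([x, out], dim=1)

    # hypothetical usage on a VGG-style feature map
    feats = torch.randn(1, 512, 48, 64)
    module = ScaleAdaptiveContext(512)
    print(module(feats).shape)  # torch.Size([1, 1024, 48, 64])

A density-regression head on the fused features would then see context at whichever scale the softmax favors locally, which is one way to realize the "adaptively encodes the scale of the contextual information" behavior the abstract describes.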
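The second contribution's ground-plane reasoning can be summarized by a single conversion; the notation here is ours, not necessarily the thesis'. If the perspective map gives the ground area G(p), in square meters, imaged by pixel p, then an image-space density d(p) in people per pixel corresponds to a physical density

    \[
      \rho(p) \;=\; \frac{d(p)}{G(p)} \quad \big[\text{people}/\mathrm{m}^2\big],
      \qquad
      \text{count} \;=\; \sum_{p} d(p) \;=\; \sum_{p} \rho(p)\, G(p).
    \]

Global scale consistency then amounts to requiring that the physical density rho, rather than the image-space density d, behave uniformly across the image; and since human motion has a bounded physical speed, the same geometry yields temporal consistency constraints that need no learning.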
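The third contribution's people-conservation constraints can likewise be written compactly; again, the notation is ours. Let m_t(j) be the number of people in grid cell j at time t, let N(j) denote j together with its spatial neighbors, and let f_t(i, j) be the number of people moving from cell i to cell j between frames t and t+1. Conservation states that everyone present in a cell arrived from a neighboring cell and will leave toward a neighboring cell:

    \[
      m_t(j) \;=\; \sum_{i \in N(j)} f_{t-1}(i, j)
             \;=\; \sum_{k \in N(j)} f_{t}(j, k).
    \]

Because densities are obtained by summing predicted flows, these constraints hold by construction rather than being learned; the predicted flows can also be compared against optical flow, and, as one reading of the abstract's active-learning claim, the same spatial and temporal conservation relations can supervise unlabeled frames so that far fewer annotations are needed.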