Vision-Based Scene Understanding with Sparsity Promoting Priors

Alahi, Alexandre

doi:10.5075/epfl-thesis-5070

doctoral thesis

2011

Human beings are interested in understanding their environments and the dynamic content that fills their surroundings. For applications ranging from security to marketing, people have installed networks of cameras to capture the dynamic elements of scenes. In this thesis, we propose a complete real-time system to automatically analyze human behavior from any network of cameras. The proposed system leverages mixed networks of fixed and mobile cameras to locate people, track them, and analyze their trajectories. The mathematical frameworks underlying our proposed methods are based on the following claim: the dynamics of a scene are based on a small set of causes, and can therefore be parameterized by a few degrees of freedom. Every processing block of our system is driven by sparsity-promoting priors, i.e., just a few elements are sufficient to capture the scene dynamics. We first present our multi-view people localization algorithm, which is designed for a network of fixed cameras. An inverse problem with a sparsity constraint is formulated to detect people using the degraded foreground silhouettes extracted by the cameras. To solve this sparsity-driven formulation in a manner appropriate for a real-time implementation, we then propose an approach called "Set Covering Occupancy Object Pursuit" (SCOOP) that outperforms the state of the art. Next, we tackle the data association problem of finding correspondences between located people across time. We implement a graph-based greedy approach to reach real-time tracking performance. Unlike the fixed camera networks considered in the first part of this thesis, mobile cameras are uncalibrated and often have fields of view that do not overlap with those of other cameras. We propose a "Cascade of Grids of Image Descriptors" (CaGID) with a sparse search to accurately detect and track objects across uncalibrated cameras with non-overlapping fields of view.
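The idea of explaining foreground silhouettes with only a few occupied ground positions can be illustrated with a small greedy selection. The sketch below is not the SCOOP algorithm from the thesis; it is a generic greedy covering heuristic under simplifying assumptions. The function name `greedy_occupancy`, the dictionary `atoms` (mapping each candidate ground position to the set of pixel ids a person standing there would project onto), and the parameters `max_people` and `min_gain` are all hypothetical names introduced for illustration.

```python
def greedy_occupancy(atoms, foreground, max_people=10, min_gain=1):
    """Pick a small set of ground positions whose silhouettes explain the foreground.

    atoms:      dict mapping a candidate position to the set of pixel ids that a
                person at that position would cover (hypothetical input format).
    foreground: set of pixel ids marked as foreground by the cameras.
    """
    foreground = set(foreground)
    uncovered = set(foreground)   # foreground pixels not yet explained
    occupied = []                 # selected positions, in order of selection

    for _ in range(max_people):
        best_pos, best_gain = None, min_gain - 1
        for pos, pixels in atoms.items():
            if pos in occupied:
                continue
            # Reward newly explained foreground pixels and penalize
            # pixels the candidate predicts outside the foreground.
            gain = len(pixels & uncovered) - len(pixels - foreground)
            if gain > best_gain:
                best_pos, best_gain = pos, gain
        if best_pos is None:
            break                 # no candidate reaches min_gain: stop early
        occupied.append(best_pos)
        uncovered -= atoms[best_pos]

    return occupied


# Toy example: five foreground pixels and four candidate positions.
atoms = {
    "A": {1, 2, 3},
    "B": {3, 4, 5},
    "C": {6, 7},
    "D": {2, 3, 4},
}
foreground = {1, 2, 3, 4, 5}
selected = greedy_occupancy(atoms, foreground)
```

In this toy case two positions are enough to explain every foreground pixel, so the loop stops before considering a third. The early stop is what makes the solution sparse: a candidate is added only if it explains more new foreground than it wrongly predicts.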
We evaluate the ability of such mixed networks of cameras to alert drivers to a potential collision with pedestrians. For this application, a camera mounted in a vehicle collaborates with a network of fixed cameras installed in a city. Finally, the proposed system is evaluated for coaching and marketing purposes. The behavior of people in sports games and stores is analyzed in real time with a graph-based algorithm named "SpotRank". A probability map inspired by the PageRank algorithm is proposed to rank the most salient "hot spots" based upon mutual flows. Several public data sets have been used to quantitatively and qualitatively evaluate the performance of our system. To our knowledge, it is the first system to capture the behavior of people in crowded environments and analyze this behavior in real time with sparsity priors.
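Ranking hot spots from mutual flows can be illustrated with a standard PageRank-style power iteration. The sketch below is not the SpotRank algorithm itself, only the generic PageRank recurrence applied to a hypothetical flow matrix. The function name `rank_spots`, the input `flows` (where `flows[i][j]` counts people moving from spot `i` to spot `j`), and the default `damping` of 0.85 are assumptions made for illustration.

```python
def rank_spots(flows, damping=0.85, iters=100):
    """Return a probability for each spot using PageRank-style power iteration.

    flows: square list of lists; flows[i][j] is the (hypothetical) number of
           people observed moving from spot i to spot j.
    """
    n = len(flows)

    # Row-normalize flows into transition probabilities. A spot with no
    # outgoing flow spreads its mass uniformly over all spots.
    trans = []
    for row in flows:
        total = sum(row)
        if total > 0:
            trans.append([v / total for v in row])
        else:
            trans.append([1.0 / n] * n)

    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [
            (1 - damping) / n
            + damping * sum(rank[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
    return rank


# Toy example: people move 0 -> 1, 1 -> 2, and 2 -> 1.
flows = [
    [0, 5, 0],
    [0, 0, 5],
    [0, 5, 0],
]
scores = rank_spots(flows)
hottest = scores.index(max(scores))
```

Here spot 1 receives flow from both other spots and ends up with the highest score, while spot 0, which nobody walks to, keeps only the small uniform share given by the damping term. The scores sum to one, so they can be read as a probability map over spots.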

Name: EPFL_TH5070.pdf
Access type: restricted
Size: 45.54 MB
Format: Adobe PDF
Checksum (MD5): 94316d388b6589d3e7a034f0628d6466