A large part of computer vision research is devoted to building models
and algorithms aimed at understanding human appearance and behaviour
from images and videos. Ultimately, we want to build automated systems
that are at least as capable as people when it comes to
interpreting humans. Most of the tasks that we want these systems to
solve can be posed as inference problems in probabilistic
models. Although probabilistic inference is, in general, a hard
problem in its own right, there exists a powerful class of inference
algorithms, known as variational inference, that allows us to build
efficient solutions for a wide range of problems.
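To fix notation with a standard example (general background, not a
contribution of this thesis): given observations x and latent
variables z, variational inference approximates the intractable
posterior p(z | x) by a tractable distribution q(z) chosen to
maximize the evidence lower bound

    \log p(x) \ge \mathbb{E}_{q(z)}[\log p(x, z)] + H[q(z)] = \mathcal{L}(q),

where the gap between \log p(x) and \mathcal{L}(q) is exactly
\mathrm{KL}(q(z) \| p(z \mid x)), so maximizing the bound over a
tractable family yields the best approximation to the true posterior
within that family.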
In this thesis, we consider a variety of computer vision problems
targeted at modeling human appearance and behaviour, including
detection, activity recognition, semantic segmentation and facial
geometry modeling. For each of these problems, we develop novel
methods that use variational inference to improve upon the
capabilities of existing systems.
First, we introduce a novel method for detecting multiple potentially
occluded people in depth images, which we call DPOM. Unlike many other
approaches, our method performs probabilistic reasoning jointly,
which allows it to propagate knowledge about one part of the image
evidence when reasoning about the rest. This is particularly
important in crowded scenes involving many people, since it helps to
handle ambiguous situations resulting from severe occlusions. We
demonstrate that our approach outperforms existing methods on multiple
datasets.
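To convey the flavour of this joint reasoning, the following is a
minimal, hypothetical sketch (not the actual DPOM implementation;
the names and the interaction model are illustrative): occupancy
probabilities over candidate ground locations are updated together,
so that evidence explained by one person lowers the belief in
competing hypotheses nearby.

    import numpy as np

    def joint_occupancy_inference(unary, interaction, n_iters=50):
        """Toy mean-field-style inference over binary occupancy variables.

        unary[i]          -- evidence-based log-odds that location i is occupied
        interaction[i, j] -- penalty coupling locations i and j (e.g. two
                             people cannot explain the same depth pixels)
        Returns q[i], the approximate marginal probability of occupancy.
        """
        q = 1.0 / (1.0 + np.exp(-unary))      # initialize from unaries
        for _ in range(n_iters):
            # Each location is updated given the current beliefs about all
            # others, so evidence propagates through the whole scene.
            field = unary - interaction @ q
            q = 1.0 / (1.0 + np.exp(-field))
        return q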
Second, we develop a new algorithm for variational inference that
works for a large class of probabilistic models, which includes, among
others, DPOM and some of the state-of-the-art models for semantic
segmentation. We provide a formal proof that our method converges,
and demonstrate experimentally that it outperforms the state of the
art on several real-world tasks, including semantic segmentation and
people detection. Importantly, we show that
parallel variational inference in discrete random fields can be seen
as a special case of proximal gradient descent, which allows us to
benefit from many of the advances in gradient-based optimization.
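A simplified sketch of this connection, under the assumption of a
pairwise Potts-like model with identity label compatibility
(illustrative, not the thesis's exact derivation): the gradient of
the expected energy with respect to the marginals is linear in them,
and the entropy term of the variational objective acts as the
nonsmooth part whose proximal operator is a softmax, so a fully
parallel mean-field update is one proximal gradient step.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def parallel_meanfield_step(q, unary, pairwise, step=1.0):
        """One parallel update for all variables of a pairwise discrete MRF.

        q        -- (n, k) current marginals over k labels
        unary    -- (n, k) unary potentials (costs)
        pairwise -- (n, n) coupling weights between variables
        """
        grad = unary + pairwise @ q           # gradient of the expected energy
        return softmax(-step * grad, axis=1)  # entropy prox == softmax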
Third, we propose a unified framework for multi-human scene
understanding which simultaneously solves three tasks: multi-person
detection, individual action recognition and collective activity
recognition. Within our framework, we introduce a novel multi-person
detection scheme, which relies on variational inference and
jointly refines detection hypotheses instead of resorting to
suboptimal post-processing. Ultimately, our model takes a frame
sequence as input and produces a comprehensive description of the
scene. Finally, we experimentally demonstrate that our method
outperforms the state of the art.
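As an illustrative contrast with greedy post-processing such as
non-maximum suppression (a hypothetical sketch, not the model from
the thesis): instead of pruning overlapping boxes by a fixed rule,
each hypothesis keeps a soft belief that is iteratively re-estimated
in the context of all the others.

    import numpy as np

    def iou(a, b):
        """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def joint_refinement(boxes, scores, n_iters=20, penalty=4.0):
        """Jointly re-score detection hypotheses instead of greedy NMS.

        Overlapping hypotheses compete through a pairwise penalty, so
        the final beliefs are mutually consistent rather than pruned
        by a fixed post-processing rule.
        """
        n = len(boxes)
        overlap = np.array([[iou(boxes[i], boxes[j]) if i != j else 0.0
                             for j in range(n)] for i in range(n)])
        log_odds = np.log(scores / (1 - scores + 1e-9))
        q = scores.copy()
        for _ in range(n_iters):
            field = log_odds - penalty * overlap @ q
            q = 1.0 / (1.0 + np.exp(-field))
        return q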
Fourth, we propose a new approach for learning facial geometry with
deep probabilistic models and variational methods. Our model is based
on a variational autoencoder with multiple sets of hidden variables,
which capture deformations at multiple levels, ranging from global
ones to local, high-frequency ones. We experimentally demonstrate
the power of the model on a variety of fitting tasks. Our model is
completely data-driven and can be learned from a relatively small
number of individuals.
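A minimal sketch of this architectural idea in PyTorch (layer sizes,
names, and the two-level split are illustrative assumptions, not the
actual model): two groups of latent variables, where one drives a
coarse global deformation and the other adds a local,
high-frequency correction, trained with the usual reparameterization
and KL terms.

    import torch
    import torch.nn as nn

    class MultiScaleVAE(nn.Module):
        """Toy VAE with two sets of latents: global shape and local detail."""

        def __init__(self, n_points=1000, z_global=8, z_local=64):
            super().__init__()
            d = 3 * n_points                   # flattened 3D vertex coordinates
            self.enc = nn.Sequential(nn.Linear(d, 256), nn.ReLU())
            self.to_global = nn.Linear(256, 2 * z_global)  # mean, log-variance
            self.to_local = nn.Linear(256, 2 * z_local)
            self.dec_global = nn.Sequential(nn.Linear(z_global, 256), nn.ReLU(),
                                            nn.Linear(256, d))
            self.dec_local = nn.Sequential(nn.Linear(z_local, 256), nn.ReLU(),
                                           nn.Linear(256, d))

        @staticmethod
        def sample(stats):
            mu, logvar = stats.chunk(2, dim=-1)
            return mu + torch.randn_like(mu) * (0.5 * logvar).exp(), mu, logvar

        def forward(self, x):
            h = self.enc(x)
            zg, mug, lvg = self.sample(self.to_global(h))
            zl, mul, lvl = self.sample(self.to_local(h))
            # Coarse, global deformation plus a local, high-frequency term.
            recon = self.dec_global(zg) + self.dec_local(zl)
            kl = lambda mu, lv: -0.5 * (1 + lv - mu.pow(2) - lv.exp()).sum(-1)
            return recon, (kl(mug, lvg) + kl(mul, lvl)).mean()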