Joint head tracking and pose estimation for visual focus of attention recognition

During the last two decades, computer science has investigated which abilities to provide to machines so that they can understand human behavior. One important key to understanding human behavior is the visual focus of attention (VFOA) of a person, which can be inferred from the gaze direction. The VFOA of a person gives insight into who or what the person is interested in, who the target of the person's speech is, and who the person is listening to. Our interest in this thesis is to study people's VFOA using computer vision techniques. Estimating the VFOA from a computer vision point of view requires tracking the person's gaze. Because tracking eye gaze is impossible in low- or mid-resolution images, head orientation can be used as a surrogate for the gaze direction. Thus, in this thesis, we investigate first the tracking of people's head orientation and second the recognition of their VFOA from their head orientation. For head tracking, we consider probabilistic methods based on sequential Monte Carlo (SMC) techniques. The head pose space is discretized into a finite set of poses, and multi-dimensional Gaussian appearance models are learned for each discrete pose. The discrete head models are embedded into a mixed-state particle filter (MSPF) framework to jointly estimate the head location and pose. The evaluation shows that this approach works better than the traditional paradigm in which the head is first tracked and the head pose is then estimated. An important contribution of this thesis is the head pose tracking evaluation. Because head pose tracking methods are usually evaluated either qualitatively or on private data, we built a head pose video database using a magnetic field 3D location and orientation tracker.
The database was used to evaluate our tracking methods and was made publicly available so that other researchers can evaluate and compare their algorithms. Once the head pose is available, the VFOA can be recognized. Two environments are considered to study the VFOA: a meeting room environment and an outdoor environment. In the meeting room environment, people are static, and their VFOA is studied as a function of their location in the meeting room. The set of VFOAs for a person is discretized into a finite set of targets: the other people attending the meeting, the table, the slide screen, and an additional target called unfocused, denoting that the person is focusing on none of the previously defined targets. The head poses are used as observations and the potential VFOA targets as hidden states in a Gaussian mixture model (GMM) or a hidden Markov model (HMM) framework. The parameters of the emission probability distributions were learned in two ways: the first uses head pose training data, and the second exploits the geometry of the room and the head and eye-in-head rotations. Maximum a posteriori (MAP) adaptation of the VFOA models was applied to the input test data to account for people's personal ways of gazing at VFOA targets. In the outdoor environment, people are moving and there is a single VFOA target. The problem in this study is to track multiple people passing by and estimate whether or not they are focusing on the advertisement. The VFOA is modeled as a GMM taking people's head location and pose as observations.
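
As a rough illustration of the GMM-style recognition step described above, the sketch below assigns a head pose to the most probable VFOA target, using one Gaussian per target. It is not the thesis implementation: the target names, the (pan, tilt) means and standard deviations, the diagonal covariances, the uniform prior, and the function names are all assumptions chosen for illustration, and the unfocused target, the HMM variant, and the MAP adaptation are left out.

```python
import math

# Hypothetical VFOA targets with head-pose means and standard deviations,
# given as (pan, tilt) in degrees. All numbers are illustrative assumptions,
# not values taken from the thesis.
TARGETS = {
    "person_left":  {"mean": (-40.0, 0.0),  "std": (10.0, 8.0)},
    "person_right": {"mean": (40.0, 0.0),   "std": (10.0, 8.0)},
    "table":        {"mean": (0.0, -30.0),  "std": (10.0, 8.0)},
    "slide_screen": {"mean": (0.0, 20.0),   "std": (10.0, 8.0)},
}


def log_gaussian(pose, mean, std):
    """Log density of a Gaussian with diagonal covariance, evaluated at pose."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((x - m) / s) ** 2
        for x, m, s in zip(pose, mean, std)
    )


def vfoa_posterior(pose, priors=None):
    """Posterior probability of each target given one head pose (pan, tilt)."""
    names = list(TARGETS)
    if priors is None:
        # Assumed uniform prior over targets.
        priors = {name: 1.0 / len(names) for name in names}
    log_post = {
        name: math.log(priors[name])
        + log_gaussian(pose, TARGETS[name]["mean"], TARGETS[name]["std"])
        for name in names
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_post.values())
    unnorm = {name: math.exp(value - top) for name, value in log_post.items()}
    total = sum(unnorm.values())
    return {name: value / total for name, value in unnorm.items()}


def recognize_vfoa(pose):
    """Return the name of the most probable target for a head pose."""
    post = vfoa_posterior(pose)
    return max(post, key=post.get)
```

Calling recognize_vfoa((-38.0, 2.0)) picks the target whose assumed mean pose lies closest to the observation under these Gaussians. In an HMM variant, the same per-target densities would serve as emission probabilities, with transition probabilities added between targets.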

Bourlard, Hervé
Odobez, Jean-Marc
Lausanne, EPFL
Other identifiers:
urn: urn:nbn:ch:bel-epfl-thesis3764-2

Note: The status of this file is: EPFL only

 Record created 2007-02-06, last modified 2018-03-17

