Learning Robust Features and Latent Representations for Single View 3D Pose Estimation of Humans and Objects

Estimating the 3D poses of rigid and articulated bodies is one of the fundamental problems of Computer Vision. It has a broad range of applications, including augmented reality, surveillance, animation and human-computer interaction. Despite the ever-growing demand driven by these applications, predicting 3D pose from a 2D image is a challenging and ill-posed problem due to the loss of depth information during projection from 3D to 2D. Although 3D pose estimation has been studied for years, it remains unsolved. In this thesis, we propose a variety of ways to tackle the 3D pose estimation problem for both articulated human bodies and rigid objects by learning robust features and latent representations. First, we present a novel video-based approach that exploits spatiotemporal features for 3D human pose estimation in a discriminative regression scheme. While early approaches typically account for motion information by temporally regularizing noisy pose estimates in individual frames, we demonstrate that taking motion information into account very early in the modeling process with spatiotemporal features yields significant performance improvements. We further propose a CNN-based motion compensation approach that stabilizes and centralizes the human body in the bounding boxes of consecutive frames to increase the reliability of spatiotemporal features. This allows us to effectively overcome ambiguities and improve pose estimation accuracy. Second, we develop a novel Deep Learning framework for structured prediction of 3D human pose. Our approach relies on an auto-encoder to learn a high-dimensional latent pose representation that accounts for joint dependencies. We combine traditional CNNs for supervised learning with auto-encoders for structured learning and demonstrate that our approach outperforms existing ones in terms of both structure preservation and prediction accuracy.
Third, we propose a 3D human pose estimation approach that relies on a two-stream neural network architecture to simultaneously exploit 2D joint location heatmaps and image features. We show that the 2D pose of a person, predicted in terms of heatmaps by a fully convolutional network, provides valuable cues to disambiguate challenging poses and results in increased pose estimation accuracy. We further introduce a novel and generic trainable fusion scheme, which automatically learns where and how to fuse the features extracted from the two input modalities that a two-stream neural network operates on. Our trainable fusion framework selects the optimal network architecture on-the-fly and improves upon standard hard-coded network architectures. Fourth, we propose an efficient approach to estimate the 3D pose of objects from a single RGB image. Existing methods typically detect 2D bounding boxes and then predict the object pose using a pipelined approach. The redundancy in different parts of the architecture makes such methods computationally expensive. Moreover, the final pose estimation accuracy depends on the accuracy of the intermediate 2D object detection step. In our method, the object is classified and its pose is regressed in a single shot from the full image using a single, compact fully convolutional neural network. Our approach achieves state-of-the-art accuracy without requiring any costly pose refinement step and runs in real time at 50 fps on a modern GPU, which is at least 5 times faster than the state of the art.
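
The depth ambiguity that makes single-view 3D pose estimation ill-posed can be illustrated with a minimal sketch. It assumes an ideal pinhole camera with the principal point at the image origin; the function name `project`, the focal length and the sample points are hypothetical choices for illustration and are not taken from the thesis.

```python
import math

# Illustrative sketch only: an ideal pinhole camera with a hypothetical focal
# length and the principal point at the origin. Not the method of the thesis.

def project(point_3d, focal_length=1.0):
    """Perspective-project a 3D point (x, y, z) in camera coordinates to 2D."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)

# Two different 3D points on the same viewing ray through the camera center.
near_point = (0.5, 0.25, 2.0)
far_point = tuple(2.0 * c for c in near_point)  # twice as far along the ray

near_2d = project(near_point)
far_2d = project(far_point)

# Both points land on the same image location, so depth cannot be recovered
# from this single 2D observation alone.
same_pixel = all(math.isclose(a, b) for a, b in zip(near_2d, far_2d))
print(near_2d, far_2d, same_pixel)
```

Any point along the same ray through the camera center yields the same image coordinates, which is why additional cues (motion, learned pose structure, or 2D joint heatmaps, as summarized above) are needed to resolve the ambiguity.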


Advisor(s):
Fua, Pascal
Lepetit, Vincent
Year:
2018
Publisher:
Lausanne, EPFL
Laboratories:
CVLAB




 Record created 2018-09-13, last modified 2019-01-18

