Abstract

Human motion analysis and synthesis is integral to many computer vision applications, from autonomous driving to sports analysis. In this thesis, we address several problems in this domain. First, we consider active viewpoint selection for pose estimation, where we choose the next viewpoint of the camera so as to obtain accurate 3D pose estimates across time. Next, we consider motion prediction, the task of predicting future human motion sequences given past ones. Finally, we address the application-oriented problem of providing automated feedback on physical exercises by analyzing the motion.

Any human motion analysis framework must first obtain the 3D human pose from images. We consider a variant of this problem with a moving camera: within a 3D pose estimation framework, our goal is to choose the next best viewpoint so as to obtain accurate pose estimates. We design an active viewpoint selection algorithm that uses uncertainty as a proxy for the potential estimation error: the camera moves to the candidate viewpoint or trajectory with the least uncertainty in the 3D pose estimate. We compare against naive baselines such as constant rotation and random viewpoint selection, and show that our active policy achieves more accurate results.

To build systems that react to human motion, one must also have a good estimate of its future state. We therefore study the problem of motion prediction from observed past poses, for time horizons of both 1 second and 5 seconds. Our first framework focuses on highly accurate predictions up to 1 second into the future. Whereas existing methods observe past sequences of a fixed length, we design a framework that aggregates features extracted from subsequences of multiple lengths. These features are extracted via Temporal Inception Modules (TIM), in which the convolutional kernel sizes are proportional to the length of the input subsequence.
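The multi-scale idea behind TIM can be sketched as follows. This is a hypothetical NumPy illustration, not the thesis implementation: the function name, the toy averaging filter standing in for learned convolutions, and the specific kernel-size rule are all assumptions; only the principle (larger kernels for longer input subsequences, with the resulting features concatenated) comes from the text.

```python
import numpy as np

def tim_features(past, lengths=(5, 10), base_kernel=2):
    """Illustrative Temporal Inception Module: extract features from
    subsequences of several lengths, with the temporal kernel size
    growing in proportion to the subsequence length."""
    feats = []
    for L in lengths:
        sub = past[-L:]                       # last L frames, shape (L, n_joints)
        k = base_kernel * (L // lengths[0])   # kernel proportional to L (assumed rule)
        kernel = np.ones(k) / k               # toy averaging filter, stand-in for a learned conv
        conv = np.apply_along_axis(
            lambda col: np.convolve(col, kernel, mode="valid"), 0, sub)
        feats.append(conv.mean(axis=0))       # pool over time -> one vector per scale
    return np.concatenate(feats)              # aggregated multi-scale feature

# 20 past frames of a 3-dimensional pose representation
past = np.arange(60, dtype=float).reshape(20, 3)
features = tim_features(past)                 # one feature vector per scale, concatenated
```

In the actual architecture the filters are learned and the features feed a prediction network; the sketch only shows how subsequences of different lengths are paired with proportionally sized kernels and aggregated.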
We demonstrate that our architecture outperforms existing state-of-the-art methods on mean per joint error metrics up to a future time horizon of 1 second.

We then extend our time horizon to 5 seconds and design a framework for long-term prediction. Many existing motion prediction methods fail to synthesize dynamic and realistic human motion over such extended horizons. Our approach relies on the most essential poses in the sequence, which we refer to as "keyposes". Keyposes are extracted automatically from the data as the poses that minimize the reconstruction loss of the original sequence. We design a Gated Recurrent Unit (GRU)-based sequence prediction framework that observes past keyposes and predicts future ones. The resulting method outperforms existing state-of-the-art motion prediction methods for a future time horizon of 5 seconds, and we demonstrate that it produces more dynamic and realistic human motion sequences that are plausible continuations of the observed past.

A highly relevant application of human motion analysis and synthesis is sports. We focus on providing automated feedback to individuals performing physical exercises. Our feedback comes in two forms: we classify the type of mistake the exercise contains, and we provide personalized corrections in the form of synthesized human motion. Our method achieves 90.9% mistake identification accuracy and corrects incorrectly performed exercises with 94.2% success.
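The notion of keyposes that minimize the reconstruction loss of the original sequence can be illustrated with a simple greedy selection. This is a hedged sketch, not the thesis algorithm: the greedy strategy, linear interpolation between keyposes, and all function names are assumptions introduced for illustration; the source only states that keyposes are chosen to minimize the reconstruction error of the sequence.

```python
import numpy as np

def reconstruct(seq, keyframes):
    """Rebuild the full sequence by linear interpolation between the
    poses at the given (sorted) keyframe indices."""
    T, D = seq.shape
    rec = np.empty_like(seq)
    for d in range(D):
        rec[:, d] = np.interp(np.arange(T), keyframes, seq[keyframes, d])
    return rec

def extract_keyposes(seq, n_keyposes=4):
    """Greedy sketch: repeatedly add the frame whose inclusion most
    reduces the reconstruction error of the whole sequence."""
    T = len(seq)
    chosen = [0, T - 1]                          # always keep the endpoints
    while len(chosen) < n_keyposes:
        best_t, best_err = None, np.inf
        for t in range(T):
            if t in chosen:
                continue
            cand = sorted(chosen + [t])
            err = np.mean((reconstruct(seq, cand) - seq) ** 2)
            if err < best_err:                   # candidate with least reconstruction loss
                best_t, best_err = t, err
        chosen.append(best_t)
    return sorted(chosen)

# A piecewise-linear toy "motion" with a kink at frame 5:
x = np.concatenate([np.arange(6.0), np.arange(4.0, -1.0, -1.0)])
seq = np.stack([x, x], axis=1)                   # 11 frames, 2 dimensions
keyposes = extract_keyposes(seq, n_keyposes=3)   # the kink frame is selected as a keypose
```

Once sequences are summarized this way, a sequence model such as the GRU described above can operate on the much shorter keypose sequence instead of the raw frames.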
