Improving Human Pose & Shape Estimation with Explicit and Implicit Priors
Computer vision has made remarkable strides in recent decades, becoming a cornerstone of modern technology with applications ranging from autonomous driving to medical imaging. Despite these advances, several core challenges remain, especially in tasks that involve understanding and modeling complex, real-world scenes. One particularly difficult domain is human-related computer vision, where the goal is to accurately estimate human poses, shapes, and movements from visual data. The field is challenging because of the variability in human appearance, the need for fine-grained detail, the difficulty existing models have in handling occlusions, body symmetries, and natural human dynamics, and the overreliance of these models on large annotated datasets.
The present thesis addresses these challenges by proposing several novel solutions to key issues in human pose and shape estimation. First, it tackles the problem of inconsistencies in skeleton-based models, where left/right symmetries are often poorly maintained. A method is proposed to enforce symmetry constraints, improving the anatomical plausibility of keypoint-based skeleton models. This contribution enhances the accuracy of skeleton estimates, making them more consistent and realistic.
Next, the thesis addresses the generation of implausible poses by developing a generative prior that restricts pose generation to realistic body configurations. This ensures that human pose estimators produce plausible outputs, even when dealing with complex body configurations, and makes pose estimation models more robust.
Another issue we address is uncontrolled mesh interpenetration, which is inherent to volume-based representations and leads to unrealistic body shapes in which parts of the body overlap. To resolve it, we introduce a differentiable flow-based solution. Our technique removes self-intersections while preserving the underlying body shape, ensuring that the estimated meshes are not only accurate but also physically plausible.
Finally, the thesis proposes a solution to the overreliance on large annotated datasets, which are often difficult and costly to collect. By leveraging motion cues such as optical flow, this work demonstrates how models can be trained more effectively in data-scarce environments. This data-efficient supervision approach reduces the dependency on annotated datasets while maintaining high model performance, opening the door to applications in fields where data collection is limited.
While these advancements represent significant progress in human-related computer vision, certain limitations persist. We conclude by discussing the limitations of the proposed methods, suggesting potential avenues for improvement, and speculating on future directions for human-related computer vision research.