Abstract

The human face plays an essential role in social interactions, as it conveys information about a person's identity, state of mind, or mood. People are, by nature, very good at picking up this nonverbal information, and over the last decades scientists have therefore been interested in replicating this skill to improve human-machine interactions. This area of research is usually referred to as facial image analysis. It covers a wide range of domains such as face detection, facial landmark localization, face recognition, and age and gender estimation, to name a few. Even though facial analysis algorithms are successfully used on a daily basis, for instance to unlock our smartphones with face recognition, much remains to be done to make them more robust in environments where head pose and illumination conditions can change drastically.

Two decades ago, scientists started to show an increasing interest in algorithms driven by 3D models to overcome the intrinsic drawbacks of classical 2D algorithms. This transition led to techniques that synthesize 3D human faces from images captured with cameras. The reconstruction task is an ill-posed problem because of the information lost when a camera projects an object onto an image. After reconstruction, the virtual representation of a face gives access to additional information, such as distances, that would typically be lost in images. Recent advances in deep learning have disrupted the 3D face reconstruction field, unlocking new possibilities.

The work in this thesis identifies and focuses on ways to increase the robustness of deep learning-based reconstruction systems and the fidelity of the synthesized faces. We first introduce a reference 3D reconstruction system composed of modules commonly used in the literature. This system is used to establish baseline results against which the proposed methods are compared and validated.

We then investigate ways to increase the robustness of the reconstruction system in the presence of significant head pose variations. We propose to modify the classical training strategy, building on recent advances in contrastive learning, to impose consistent face parameterizations across different viewpoints of the same subject. We validate our approach on two use cases, one with synthetic data and one with real still images. The proposed method achieves performance similar to our baseline while being more consistent across a wide range of head poses.

The resolution of the images used for 3D face reconstruction is not always high; surveillance camera footage is a prime example. We investigate the possibility of learning visual representations of images regardless of their resolution, in order to make reconstruction systems robust and less sensitive to image resolution. The proposed approach performs comparably well across a wide range of image sizes.

The last part aims at increasing the fidelity of the synthesized 3D faces. Because of how the underlying face models are built, the reconstructed geometry is smooth and lacks depth cues linked to expressions, such as wrinkles, even though these cues play an important role in perceiving the correct facial expression. We therefore propose to recover this missing information and apply it on top of the reconstructed geometry as corrections in the form of displacement maps. The proposed method is validated on facial expression images and shows improvement over the standard approach.
