Abstract

Over the last two decades, data-driven machine learning (ML) tools have profoundly transformed numerous scientific fields. In computational chemistry, ML applications have enabled faster predictions of chemical properties and provided powerful analytical tools, facilitating the exploration of chemical space. The original work presented in this thesis builds on this paradigm shift and focuses on bridging the divide between unsupervised and supervised learning, with the overarching objective of improving the predictive power of similarity-based ML algorithms such as kernel regression. Despite their widespread use in chemistry, current implementations of kernel regression suffer from biased definitions of similarity between chemical environments. This problem originates from the rigidity of current numerical approaches for encoding molecular information, which are based on expert-crafted representations. It is amplified by the incorrect (yet widespread) assumption that increasing the amount of information encoded in molecular representations necessarily improves the evaluation of molecular similarity. As a result, the performance of kernel models can be sub-optimal, which limits their broad applicability. To overcome these limitations, we introduce a series of statistical tools and methodologies, based on supervised dimensionality reduction and metric learning, that are capable of filtering and adapting the features of common molecular representations. This allows the notion of "molecular similarity" to be tailored to optimize the prediction of specific chemical targets.
Using examples such as the exploration of the free-energy landscape of oligopeptides and the prediction of subtle properties associated with the outcome of chemical reactions (for example, enantiomeric excess), we demonstrate how the methods proposed in this thesis unlock the optimal performance of kernel regression and, more generally, of any similarity-based algorithm. Overall, this work is part of a broader effort aimed at extending the capabilities of computational modeling to increasingly complex chemical situations by exploiting the latest advances in statistical learning.
