Machine-learning quantum-chemical properties of molecules and chemical reactions

van Gerwen, Puck

doi:10.5075/epfl-thesis-10980

doctoral thesis

2024

"Physics-inspired" or "Quantum" Machine Learning (ML) models, which integrate physics information such as symmetry constraints and inter-atomic interaction forms, have long facilitated the accurate prediction of quantum-chemical molecular properties. Although chemical reactions are central to chemistry, related models for them have lagged behind. This thesis takes some of the first steps in the field to establish physics-inspired reaction representations, whether constructed for use with separate ML models (e.g., kernel ridge regression, KRR) or trained end-to-end. Two resulting models, the Bond-Based Reaction Representation (B2R2) combined with KRR and 3DREACT, are demonstrated for predictions of thermodynamic and/or kinetic properties of chemical reactions. Because benchmark datasets are not as well established for reactions as they are for molecular property prediction, this thesis both develops new reaction datasets and selects existing sets from the literature that best test the models. Our physics-inspired reaction models are compared to those built from string-based representations of reactions, 2D substructures of reactants and products, or 2D graphs that superimpose reactants and products using atom-mapping information. This benchmarking effort illustrates the strengths and weaknesses of the different models and the types of datasets to which each is best suited. We also test the models using challenging scaffold-, molecular size-, and property-based splits, which examine their generalisation capabilities and highlight challenges still present in the field. Altogether, these works contribute to establishing the domain of ML for quantum-chemical properties of chemical reactions.

This thesis also introduces tools inspired by mathematical concepts that had not yet been applied to the domain of physics-inspired machine learning: metric learning for KRR, and integer linear programming (ILP) for training set selection. Metric learning is used to inform the distance metric that compares structures in KRR, in a supervised setting (i.e., making use of a labelled training set). While it is demonstrated only for molecular properties, it could readily be extended to reaction properties as well. ILP is used to revisit the notion of atom mapping that is central to ML for reactions: here, atom maps are instead used to select similar environments from which to construct an optimal training set for molecular ML. These two works emphasise the interesting overlap between optimisation methods in mathematics and the myriad applications in ML for quantum chemistry.

Name: EPFL_TH10980.pdf
Type: Main Document
Version: Not Applicable (or Unknown)
Access type: openaccess
License Condition: N/A
Size: 29.56 MB
Format: Adobe PDF
Checksum (MD5): 8741ba1e7a30b908ad6625ce951a526c