Language-based Sample-efficient and Synthesizable Molecular Generative Design
In recent years, machine learning-based generative models have emerged as promising tools for de novo molecular design, offering a data-driven means of proposing novel small molecules with tailored properties. These models are increasingly applied in drug discovery, and some generated molecules have been experimentally validated and are now in clinical trials. However, two persistent challenges limit translational practicality: the ability to efficiently steer generation toward (computationally predicted) property-optimal molecules (sample efficiency), and the requirement that generated molecules be synthetically accessible using plausible, known chemistry. Focusing on language-based molecular generative models, this thesis makes contributions toward both challenges.
Part I starts by investigating design features that can enhance the sample efficiency of reinforcement learning-based molecular optimization, and the trade-offs they impose. Using these insights, a molecular generative framework is proposed that can perform optimization under highly constrained computational budgets. The model is applied across multi-parameter optimization tasks spanning drug discovery and adjacent fields such as functional materials design. Part II builds on this framework and addresses increasingly granular definitions of synthesizability by coupling it with machine learning-based retrosynthesis models, which output predicted synthesis routes for input molecules. By comparing the synthesizability of generated molecules when guided by heuristic scores versus retrosynthesis models, our analysis highlights out-of-distribution limitations of the former and demonstrates a practical advantage of considering explicit synthesis routes. The framework is subsequently extended to enable steerable and granular synthesizability control. As a result, generated molecules come with associated synthesis routes that incorporate specific chemical reagents, use specific reactions, and avoid other reactions. The ability to control reaction constraints also shows the potential to unify generative design and ultra-large-scale (> billion scale) virtual screening. Specifically, the framework can generate property-optimal molecules with exact matches in so-called "make-on-demand" molecular libraries, i.e., molecules directly purchasable from a vendor, thus performing retrieval. The molecular generative models developed in this thesis lower the barrier to experimental validation and contribute toward enhanced translational practicality.
File: EPFL_TH12012.pdf (main document, Adobe PDF, 62.59 MB, open access; checksum dab6b21d27541018ea382d8d9895101e)