Language-based Sample-efficient and Synthesizable Molecular Generative Design

Guo, Jeff

doi:10.5075/epfl-thesis-12012

doctoral thesis

2025

In recent years, machine learning-based generative models have emerged as promising tools for de novo molecular design, offering a data-driven means of proposing novel small molecules with tailored properties. These models are increasingly applied in drug discovery, with generated molecules that have been experimentally validated and are currently in clinical trials. However, two persistent challenges limit translational practicality: the ability to efficiently steer generation toward molecules that are optimal with respect to (computationally predicted) properties (sample efficiency), and the requirement that generated molecules be synthetically accessible using plausible, known chemistry. Focusing on language-based molecular generative models, this thesis makes contributions toward both challenges.

Part I starts by investigating design features that can enhance the sample efficiency of reinforcement learning-based molecular optimization, and the trade-offs they impose. Using these insights, a molecular generative framework is proposed that can perform optimization under highly constrained computational budgets. The model is applied across multi-parameter optimization tasks spanning drug discovery and adjacent fields such as functional materials design. Part II builds on this framework and addresses increasingly granular definitions of synthesizability by coupling it with machine learning-based retrosynthesis models, which output predicted synthesis routes for input molecules. By comparing the synthesizability of molecules generated under the guidance of heuristic scores with that of molecules generated under the guidance of retrosynthesis models, our analysis highlights out-of-distribution limitations of the former and demonstrates a practical advantage of considering explicit synthesis routes. The framework is subsequently extended to enable steerable and granular synthesizability control, so that generated molecules have associated synthesis routes that incorporate specific chemical reagents and specific reactions while avoiding other reactions. The ability to control reaction constraints also shows the potential to unify generative design and ultra-large-scale (beyond billion-scale) virtual screening. Specifically, the framework can generate property-optimal molecules with exact matches in so-called "make-on-demand" molecular libraries, i.e., molecules that are directly purchasable from a vendor, thus performing retrieval. The molecular generative models developed in this thesis lower the barrier for experimental validation and make contributions toward enhanced translational practicality.

Name

EPFL_TH12012.pdf

Type

Main Document

Version

Not Applicable (or Unknown)

Access type

openaccess

License Condition

N/A

Size

62.59 MB

Format

Adobe PDF

Checksum (MD5)

dab6b21d27541018ea382d8d9895101e