Feature optimization for atomistic machine learning yields a data-driven construction of the periodic table of the elements

Willatt, Michael John; Musil, Félix; Ceriotti, Michele

doi:10.1039/C8CP05921G

research article

2018

Physical Chemistry Chemical Physics

Machine learning of atomic-scale properties amounts to extracting correlations between structure, composition and the quantity that one wants to predict. Representing the input structure in a way that best reflects such correlations makes it possible to improve the accuracy of the model for a given amount of reference data. When using a description of the structures that is transparent and well-principled, optimizing the representation might reveal insights into the chemistry of the data set. Here we show how one can generalize the SOAP kernel to introduce a distance-dependent weight that accounts for the multi-scale nature of the interactions, and a description of correlations between chemical species. We show that this substantially improves the performance of ML models of molecular and materials stability, while making it easier to work with complex, multi-component systems and to extend SOAP to coarse-grained intermolecular potentials. The element correlations that give the best-performing model show striking similarities with the conventional periodic table of the elements, providing an inspiring example of how machine learning can rediscover, and generalize, intuitive concepts that constitute the foundations of chemistry.

In the last few years, statistical regression techniques have gained an important place in the toolbox of atomic-scale modelling. Starting from a small number of reference calculations, they make it possible to approximate effectively the properties of systems computed with accurate but demanding electronic structure methods [1–3]. It is fair to say that most of the recent progress in this field has been associated with the development of representations that encode the fundamental symmetries of the system [1,4–8]. After symmetries have been accounted for, however, there is still considerable freedom in how to define the details of an atomic-scale representation.
Optimizing the input representation can substantially improve the performance of the regression by adapting it to the specific structure–property relations associated with a given problem. What is more, in the process one can often recognize correlations that reflect intuitive knowledge of such structure–property relations. In this paper we consider the smooth overlap of atomic positions (SOAP) representation, a popular representation of atomic structure that has been used successfully to build interatomic potentials [9–11], predict molecular properties [12] and visualize structural motifs [13–15]. We extend it in two ways: by adapting the representation to the intrinsic length scales of atomic interactions, and by considering “alchemical” correlations between chemical species, which make it possible, for instance, to exploit the similar behavior of different elements to accelerate learning in chemically very heterogeneous data sets. These extensions not only improve significantly the performance of SOAP representations, but also offer insights into the chemistry of the system, for instance by providing a data-driven representation of the similarity between elements that is reminiscent of the periodic table of the elements.
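
The two extensions can be illustrated with a toy sketch. This is not the SOAP implementation used in the paper. The function names, the functional form of the distance weight, the parameter values and the two-element feature vectors are all assumptions chosen for illustration. The sketch only shows the general idea: a weight that decays smoothly with distance, and a coupling matrix between elements that lets environments containing different but similar species contribute to the kernel.

```python
def radial_weight(r, r0=2.0, m=4, c=1.0):
    """Hypothetical distance-dependent weight, c / (c + (r / r0) ** m).

    It equals 1 at r = 0 and decays smoothly for r > r0, so that distant
    neighbours contribute less to the representation. The form and the
    default parameters are illustrative assumptions.
    """
    return c / (c + (r / r0) ** m)


def alchemical_kernel(feats_a, feats_b, coupling):
    """Toy kernel between two environments with species-resolved features.

    feats_a, feats_b: lists with one feature vector per chemical element.
    coupling: symmetric matrix of element-element similarities. With the
    identity matrix, different elements are treated as unrelated.
    """
    total = 0.0
    for a, row_a in enumerate(feats_a):
        for b, row_b in enumerate(feats_b):
            dot = sum(x * y for x, y in zip(row_a, row_b))
            total += coupling[a][b] * dot
    return total


# Weights for three illustrative neighbour distances (arbitrary units).
weights = [radial_weight(r) for r in (1.0, 2.0, 4.0)]

# Two environments with the same geometry but different neighbour elements:
# the first has only element 0 around the centre, the second only element 1.
env_first = [[1.0, 2.0], [0.0, 0.0]]
env_second = [[0.0, 0.0], [1.0, 2.0]]

identity = [[1.0, 0.0], [0.0, 1.0]]  # elements treated as unrelated
coupled = [[1.0, 0.9], [0.9, 1.0]]   # hypothetical similarity of 0.9

k_unrelated = alchemical_kernel(env_first, env_second, identity)
k_similar = alchemical_kernel(env_first, env_second, coupled)
```

With the identity coupling the two environments have zero overlap, whereas a nonzero off-diagonal coupling makes them similar. This is the sense in which correlations between species can carry information from one element to another in a chemically heterogeneous data set.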