Extreme Scalability of DFT-Based QM/MM MD Simulations Using MiMiC

We present a highly scalable DFT-based QM/MM implementation developed within MiMiC, a recently introduced multiscale modeling framework that uses a loose-coupling strategy in conjunction with a multiple-program multiple-data (MPMD) approach. The computation of electrostatic QM/MM interactions is parallelized exploiting both distributed- and shared-memory strategies. Here, we use the efficient CPMD and GROMACS programs as QM and MM engines, respectively. The scalability is demonstrated through large-scale benchmark simulations of realistic biomolecular systems employing GGA and hybrid exchange-correlation functionals. We show that the loose-coupling strategy adopted in MiMiC, with its inherent high flexibility, does not carry any significant computational overhead compared to a tight-coupling scheme. Furthermore, we demonstrate that the adopted parallelization strategy enables scaling up to 13,000 CPU cores with an efficiency above 70%, thus making nanosecond-scale DFT-based QM/MM MD simulations using hybrid functionals accessible.


Introduction
The scale spanned by (bio-)chemical processes ranges from the decomposition of small organic molecules due to attosecond laser pulses to the folding of proteins comprising hundreds of thousands of atoms over several micro- to milliseconds. Exploring the dynamics of these processes requires methods of varying precision and computational cost, where a trade-off between the former and the latter usually has to be found in order to describe chemically relevant time scales and system sizes. In particular, while the system sizes that can be afforded with accurate first-principles methods are constantly growing, the requirement of sampling a sufficiently long time scale often limits the system size to a few hundred atoms in Kohn-Sham density functional theory (KS-DFT) based molecular dynamics (MD) and to less than a few dozen for accurate wavefunction-based methods.
One way in which this computational bottleneck can be overcome is the use of multiscale methods that bridge size and time scales by combining methods of different accuracy and computational efficiency. The parts of the system that are relevant to a chemical process are often small with respect to the overall system size. Thus, they can be described using methods of higher accuracy that would not be viable for a description of the entire system.
On the other hand, it is usually sufficient for the large remainder of the system to be described by less precise but computationally much more efficient models. Such a combination of methods ultimately allows one to cover the time scale that is necessary to properly sample the process under investigation.
MiMiC 1 (multiscale modeling in computational chemistry) is a recently introduced framework for multiscale modeling based on a loose-coupling paradigm. Loose coupling allows for maximum flexibility by allowing the framework to be interfaced with different programs with only minimal modifications; however, such an approach should not compromise efficiency.
A tight-coupling strategy, in contrast, is limited to the program it has been designed for, with the immediate advantage of making it easy to minimize the computational overhead of the framework for the very programs it interfaces. This, however, comes at the cost of a loss of flexibility. MiMiC aims at uniting the flexibility of a loose-coupling scheme with the efficiency of a tight-coupling approach by virtue of highly efficient data exchange, thus providing a highly efficient loose-coupling framework.
The MiMiC framework has first been applied in the context of hybrid quantum mechanics/molecular mechanics (QM/MM) simulations, where it not only interfaces the QM and MM drivers, but also computes the electrostatic interaction between the QM and MM subsystems. 1 Highly popular for (but not limited to) biochemical applications, the QM/MM approach allows, for example, a chemically relevant part of a protein (QM) to be simulated using first-principles methods, while the rest of the system (MM) can be conveniently described at the force-field level with sufficient accuracy. As such, the QM/MM method has been implemented in various program-specific contexts, often in a tight-coupling fashion. Here, we present the parallelization schemes used in MiMiC and demonstrate that the use of the loosely coupled MiMiC framework allows for more flexibility without compromising computational efficiency with respect to existing tight-coupling implementations.
Using systems with varying sizes of the QM and MM regions, we provide benchmarks and performance studies for MiMiC interfacing the plane-wave/pseudopotential CPMD 2 program, acting as the main MD driver, with the classical MD package GROMACS, 3 thus enabling both Born-Oppenheimer (BO) and Car-Parrinello (CP) QM/MM MD to be performed. The CPMD program provides a preexisting, tight-coupling QM/MM interface to GROMOS; 4 its choice as the main driver therefore allows for a direct comparison of the performance of loose- and tight-coupling strategies.
The text is organized as follows: we first provide a summary of the QM/MM methodology adopted in MiMiC, 1 followed by a description of the implementation, including aspects related to communication between the coupled programs and the parallelization of the QM/MM interactions. We then describe the test systems used for the benchmarks, which are intended to cover a range of typical system sizes. Following this, we present benchmarks showcasing the parallel performance of the MiMiC-based QM/MM implementation for up to several thousand CPU cores, demonstrating superior scaling performance with respect to the preexisting, tight-coupling QM/MM implementation in the CPMD program. Finally, we show that at large core counts the parallel efficiency is mainly exhausted by serial parts of the main driver in CPMD, demonstrating that the loose-coupling scheme adopted in MiMiC does not lead to any significant overhead associated with communication, while providing a much more flexible, system-independent framework.

Methodology
The MiMiC framework enables multiscale simulations by coupling external programs that independently compute contributions belonging to different parts of a molecular system. We refer the reader to our original presentation of MiMiC in Ref. 1 for a more detailed description of the framework itself. Here, we provide some details relevant to the MiMiC-based QM/MM implementation, in which CPMD and GROMACS are used to enable QM/MM MD simulations. In this implementation, the CPMD program serves as the main MD driver and additionally computes the QM contributions, while GROMACS provides the MM contributions. The MiMiC framework itself is built around a multiple-program multiple-data (MPMD) approach with loosely coupled programs. This allows the external programs to run simultaneously and independently of each other with periodic data exchange, which relies on network-based rather than file-based solutions (further details are given below). Moreover, it enables separate allocations of processor pools to each of the programs, thus allowing each program to utilize its own optimal parallelization scheme, which improves the parallel efficiency of the resulting solution.

QM/MM Theory
In the following, we provide an overview of the additive electrostatic-embedding QM/MM scheme presenting only the key equations. For a more detailed description of the adopted QM/MM methodology, we refer the reader to Ref. 1.
The total energy of the system is split into three contributions (as usual for additive QM/MM schemes): the energy of the QM subsystem, the energy of the MM subsystem, and the QM/MM interaction energy. The QM subsystem is described according to plane-wave/pseudopotential-based KS-DFT, while the MM subsystem is represented by an MM force field. As mentioned before, the QM and MM contributions are computed by CPMD and GROMACS, respectively. Here, we focus on the QM/MM part, which includes all interactions between the electrons and nuclei in the QM subsystem and the atoms in the MM subsystem.
The QM-MM interactions can be divided into two parts, namely bonded and nonbonded interactions, where the former only appear when there are covalent bonds crossing the QM-MM boundary. The bonded interactions are described classically using the parameters of the same force field that is used for the MM subsystem. Open valencies in the QM subsystem are saturated by adding boundary atoms that are described by a monovalent pseudopotential and thus contribute a single electron to the QM subsystem. The nonbonded interactions consist of electrostatic and van der Waals interactions, the latter being described through the MM force field. Since the bonded and van der Waals interactions are purely classical, we delegate them to the external MM engine. The electrostatic interactions, which include interactions between the MM charges (or multipoles) and the QM nuclei and electrons, are computed by MiMiC. The current implementation in MiMiC supports an electrostatic embedding scheme in which the electrostatic potential from the MM charges (or multipoles) is added to the KS-DFT external potential. This involves the electronic density of the QM subsystem, which is mapped onto a grid, typically of the order of $10^6$ points, and thus requires special attention, in particular with respect to computational cost and parallel scalability. Evaluating the full electrostatic part of the nonbonded energy exactly is not only computationally very expensive but also exhibits poor scalability with the size of the system, which can be of the order of $10^5$ atoms or larger, since the full integral over the electronic grid has to be computed for each of the MM charges (see below). To overcome these limitations, we have previously formulated and implemented a generalized version of the electrostatic coupling scheme by Laio et al., 4 which was shown to yield the same accuracy as an exact electrostatic treatment but at a substantially reduced cost. 1
Within the generalized electrostatic coupling scheme, the MM atoms are grouped into two domains according to their distance from the QM subsystem. The MM atoms that are within a specified distance from the QM subsystem (the short-range domain) interact directly with the QM electron density and nuclei, whereas the remaining MM atoms (the long-range domain) interact through a multipole expansion of the electrostatic potential from the QM electrons and nuclei. In multi-index notation, as introduced in Ref. 1, the electrostatic QM-MM interaction energy can thus be written as in eq 1, where $\bar{R}_{\mathrm{QM}}$ is the origin of the multipole expansion, for which we use the centroid of the QM subsystem, i.e. $\bar{R}_{\mathrm{QM}} = \sum_{j=1}^{N_{\mathrm{QM}}} R_j^{\mathrm{QM}} / N_{\mathrm{QM}}$. Finally, eq 1 also involves the unmodified and modified interaction tensors, where $r_{c,a}$ in the latter is the covalent radius of the $a$th MM atom. The modified interaction tensor is introduced in order to mitigate divergences at short distances, thus avoiding the electron spill-out problem, which is pervasive in electrostatic-embedding QM/MM due to the lack of Pauli repulsion between the QM and MM subsystems. The modified form of the interaction potential was introduced and validated in Ref. 4.
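The two-domain partitioning and the damped short-range interaction can be sketched as follows. This is an illustrative Python fragment, not MiMiC code: the function names and cutoff handling are assumptions, and the damped-Coulomb form $(r_c^4 - r^4)/(r_c^5 - r^5)$ follows the modified potential of Laio et al. (Ref. 4).

```python
import numpy as np

def partition_mm_atoms(mm_positions, qm_positions, cutoff):
    """Split MM atoms into short- and long-range domains by their
    minimum distance to any QM atom (illustrative helper)."""
    # Pairwise distances, shape (n_mm, n_qm)
    d = np.linalg.norm(mm_positions[:, None, :] - qm_positions[None, :, :],
                       axis=-1)
    short_mask = d.min(axis=1) <= cutoff
    return np.where(short_mask)[0], np.where(~short_mask)[0]

def damped_coulomb(r, r_c):
    """Modified interaction potential of Laio et al. (Ref. 4):
    behaves as 1/r at large r but stays finite (1/r_c) as r -> 0,
    mitigating electron spill-out. Note the removable 0/0 at r == r_c,
    which a production code would handle explicitly."""
    return (r_c**4 - r**4) / (r_c**5 - r**5)
```

Short-range atoms returned by such a partitioning would interact with the density on the grid through the damped potential, while the long-range remainder is handled via the multipole expansion.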
In order to obtain the electronic density of the QM subsystem, we also need to add the potential induced by the MM atoms to the external KS potential. This potential is obtained as the functional derivative of the electrostatic QM-MM interaction energy (eq 1) with respect to the electronic density, where the first part of the result is the potential from the MM atoms in the short-range domain and the second part is due to the MM atoms in the long-range domain.
The electrostatic QM-MM interactions also give rise to additional forces acting on the QM nuclei and MM atoms. The force on a QM atom due to the MM atoms in the short-range domain is expressed in terms of the Cartesian components $\gamma$, with $\gamma$ being either (1, 0, 0), (0, 1, 0), or (0, 0, 1), and an analogous expression gives the corresponding force acting on an MM atom in the short-range domain. The forces acting on the QM nuclei due to the MM atoms in the long-range domain, as well as the corresponding forces on the MM atoms, are obtained from the derivatives of the multipole expansion of the long-range interaction energy.

QM/MM MD Workflow
The workflow of a MiMiC-based QM/MM BO-MD simulation using CPMD and GROMACS is illustrated in Figure 1. In the present work, the equations of motion are integrated by CPMD, including those of the MM atoms. This implies that bond, angle, and other constraints among MM atoms must be enforced by CPMD. The (serial) RATTLE algorithm 5 implemented in CPMD as the standard constraint solver is suitable for systems of up to a few hundred atoms, which is typical of pure QM systems, but may become a computational bottleneck for larger systems (hundreds of thousands of atoms are not uncommon in MM). To overcome this problem, we implemented an improved version of RATTLE proposed by Weinbach and Elber. 6 This improved version takes advantage of sparse linear algebra to impose the constraints, which greatly reduces the associated computational cost. Even though the method is in principle fully parallel, here we implement a version relying only on OpenMP parallel linear algebra.
MPI and hybrid MPI/OpenMP versions of the method will be implemented in the future to further enhance the scalability.
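To illustrate the sparse linear-algebra core of such a constraint solver, the sketch below performs a single Newton correction for bond-length constraints, solving $(J M^{-1} J^T)\lambda = -c$ with sparse matrices. It is written in the spirit of (but is not identical to) the Weinbach-Elber formulation; all names and the SciPy-based solve are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import lil_matrix, csr_matrix
from scipy.sparse.linalg import spsolve

def constraint_system(positions, bonds, inv_masses):
    """Build the sparse Jacobian J of the constraints |r_i - r_j|^2
    and the matrix A = J M^{-1} J^T. J has only 6 nonzeros per row,
    which is what makes the sparse formulation pay off."""
    J = lil_matrix((len(bonds), 3 * len(positions)))
    for k, (i, j) in enumerate(bonds):
        d = positions[i] - positions[j]
        J[k, 3 * i:3 * i + 3] = 2.0 * d
        J[k, 3 * j:3 * j + 3] = -2.0 * d
    J = csr_matrix(J)
    Minv = np.repeat(inv_masses, 3)          # per-coordinate inverse masses
    A = csr_matrix(J.multiply(Minv).dot(J.T))
    return J, A, Minv

def project_onto_constraints(positions, bonds, lengths, inv_masses):
    """One Newton step correcting positions toward |r_i - r_j| = d_k;
    a real solver iterates this to convergence."""
    J, A, Minv = constraint_system(positions, bonds, inv_masses)
    c = np.array([np.sum((positions[i] - positions[j])**2) - l**2
                  for (i, j), l in zip(bonds, lengths)])
    lam = spsolve(A.tocsc(), -c)             # Lagrange multipliers
    corr = J.T.dot(np.atleast_1d(lam)) * Minv
    return positions + corr.reshape(-1, 3)
```

Because $J M^{-1} J^T$ inherits the sparsity of the bond topology, the solve scales far better with system size than the dense linear algebra of textbook RATTLE.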

Communication Library
The loose-coupling model used in the MiMiC framework requires communication of data, e.g. atomic positions and forces, between the coupled programs, which run independently. To achieve this, MiMiC provides a communication library (CommLib), which has been designed to minimize the coding effort in the external programs used by the MiMiC framework. In fact, CommLib does not interfere with preexisting communication layers within the coupled programs; e.g., their MPI parallelization schemes do not need to be altered. In practice, one simply invokes the send and/or receive routines in CommLib to exchange data between two programs using an ID associated with a specific domain of the multiscale representation (e.g. QM, MM, etc.), which is created during the CommLib initialization.
To avoid interference with the MPI communication layers that may be present in the coupled programs, CommLib uses client-server inter-communication, an MPI-2 feature. This allows direct exchange of data between processes belonging to different communicators, i.e. the COMM_WORLD communicators of the independent programs.
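The client-server mechanism can be illustrated with mpi4py. The MPI calls below (Open_port, Accept, Connect) are the standard MPI-2 primitives, but the surrounding pairing logic is a deliberate simplification and not CommLib's actual interface.

```python
# Sketch of MPI-2 client-server inter-communication; requires two
# independently launched MPI programs to actually run.
try:
    from mpi4py import MPI  # optional at import time, needed only to run
except ImportError:
    MPI = None

def serve():
    """Server side (e.g. the MM program): open a port and accept a
    connection from an independently launched program. Data is then
    exchanged over the resulting inter-communicator, not COMM_WORLD."""
    port = MPI.Open_port()               # system-chosen port name
    inter = MPI.COMM_WORLD.Accept(port)  # blocks until a client connects
    positions = inter.recv(source=0, tag=0)
    inter.send("forces-go-here", dest=0, tag=0)
    inter.Disconnect()
    MPI.Close_port(port)
    return positions

def connect(port_name):
    """Client side (e.g. the QM driver): connect using the server's
    port name, which must be exchanged out of band (e.g. via a file
    or MPI_Publish_name)."""
    inter = MPI.COMM_WORLD.Connect(port_name)
    inter.send([[0.0, 0.0, 0.0]], dest=0, tag=0)
    reply = inter.recv(source=0, tag=0)
    inter.Disconnect()
    return reply
```

Because the inter-communicator is separate from each program's COMM_WORLD, neither program's internal MPI parallelization has to change, which is the key property exploited by CommLib.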

Parallelization Strategies
The MiMiC framework relies on a loose-coupling approach that allows the external programs to run concurrently using their own parallelization schemes, which can be based on, e.g., MPI, OpenMP, or GPGPU. Indeed, as detailed above (see subsection 2.2), the pure QM and MM contributions are computed independently by the external programs. While the present work makes use of a plane-wave implementation due to the use of CPMD to compute the QM contributions, MiMiC is able to support any other type of basis set that allows for a mapping onto real-space grids. A grid-distribution strategy works well as long as the number of MPI processes does not exceed the number of grid planes. To extend the scalability beyond this limit, one can use two complementary approaches, task grouping 7-9 and OpenMP fine-level parallelism, or a combination of the two.
In our task-group approach, the short- and long-range MM atoms are divided among the groups, and the grid planes are distributed between the MPI processes of each group (see illustration in Fig. 2). Thus, each MPI process receives a subset of short- and long-range MM atoms and a subset of grid planes, and computes the corresponding quantities of eqs 3-7.
The results of the MPI processes are then suitably summed up to obtain the final values.
Within the OpenMP fine-level parallelism, the grid plane(s) assigned to an MPI process are divided into pencils, and each thread of the process is in charge of one pencil. In the long-range electrostatics, the OpenMP threads are also used to parallelize over the MM atoms, as is done within the task-group approach.
It is worth remarking that the parallelization approach described above has equivalents in the parallelization strategy adopted in CPMD (low-level distribution over planes, 10 task grouping (known as CP groups), 9 and threading 11 ), and that the task-group and OpenMP approaches can be combined to obtain the best overall parallel efficiency of a MiMiC-based QM/MM simulation.
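A minimal sketch of this two-level decomposition (MM atoms over task groups, grid planes over the ranks within each group) might look as follows; the block distribution and the names are illustrative assumptions, not the actual MiMiC/CPMD data layout.

```python
def decompose(n_ranks, n_groups, n_planes, n_mm_atoms):
    """Assign each MPI rank a task group, a block of MM atoms shared
    by its group, and a block of grid planes private to the rank."""
    assert n_ranks % n_groups == 0, "ranks must divide evenly into groups"
    ranks_per_group = n_ranks // n_groups
    layout = []
    for rank in range(n_ranks):
        group = rank // ranks_per_group
        local = rank % ranks_per_group
        # Block distribution of MM atoms over task groups
        mm_lo = group * n_mm_atoms // n_groups
        mm_hi = (group + 1) * n_mm_atoms // n_groups
        # Block distribution of grid planes over ranks within a group
        pl_lo = local * n_planes // ranks_per_group
        pl_hi = (local + 1) * n_planes // ranks_per_group
        layout.append({"rank": rank, "group": group,
                       "mm_atoms": range(mm_lo, mm_hi),
                       "planes": range(pl_lo, pl_hi)})
    return layout
```

Each rank then evaluates the electrostatic terms only for its own (MM-atom, plane) block, and a reduction over ranks recovers the full sums, matching the description above.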

Computational details
The performance of our MiMiC-based QM/MM implementation was tested in a number of benchmarks. These report the speedup (i.e. $S_N = t_1 / t_N$, where $N$ is the number of compute nodes and $t_1$ is the timing on one node), the strong-scaling efficiency ($PE = S_N / N \times 100\%$), and the absolute timing per MD step. As a rule of thumb, we define the scaling limit as the number of cores at which the parallel efficiency stays above a 70% threshold. The systems differ in their size and complexity (see below). They are treated using either the BLYP 12,13 or B3LYP 14,15 exchange-correlation (XC) functional. The former is a typical example of a generalized-gradient approximation (GGA), which scales linearly in a plane-wave basis, whereas the latter constitutes an example of a hybrid functional, which scales quadratically with respect to the number of orbitals in plane waves. The hybrid-functional implementation in CPMD can exploit thousands of cores; 9 thus, using B3LYP allows us to examine the scaling limits of MiMiC under highly parallel conditions. Benchmarks were performed on clusters hosted by the Jülich Supercomputing Centre and RWTH Aachen, namely JURECA 16 and JUWELS, and the RWTH Aachen cluster. The solvated small-molecule systems are MQAE and butanol (Fig. 4), while the protein systems consist of a zinc-binding GB1 mutant 19,20 (Fig. 5) and the ClC-ec1 transporter.
Based on the low-level, distributed-memory parallelization scheme of CPMD, efficient scalability can only be expected up to a number of CPU cores equal to the number of FFT planes in the real-space mesh. Higher scalability can be achieved by resorting to shared-memory parallelization; thus, the benchmarks using MiMiC were performed using four OpenMP threads per MPI process up to the point where the grid scaling is exhausted. This number of threads per task was chosen because it was found to maximize the performance on the computing clusters used in this work.
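For reference, the scaling metrics used throughout the benchmarks follow directly from the definitions above:

```python
def speedup(t1, tN):
    """S_N = t_1 / t_N, relative to the single-node timing t_1."""
    return t1 / tN

def parallel_efficiency(t1, tN, n_nodes):
    """PE = S_N / N * 100%; by the rule of thumb above, the scaling
    limit is the core count at which PE drops below 70%."""
    return 100.0 * speedup(t1, tN) / n_nodes
```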
Attempts to use a number of cores larger than the number of FFT planes were carried out either by using more OpenMP threads per MPI process or by resorting to the high-level, distributed-memory task-grouping scheme implemented in CPMD (known as CP groups). 9 The number of task groups should generally be an integer divisor of the number of orbitals in order to achieve an even distribution of computations.
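A simple helper following this rule of thumb could pick the largest admissible divisor; this is an illustrative sketch, not CPMD's actual selection logic.

```python
def pick_task_groups(n_orbitals, max_groups):
    """Return the largest divisor of the orbital count that does not
    exceed the desired number of task groups, so that orbitals are
    spread evenly over the groups."""
    for g in range(max_groups, 0, -1):
        if n_orbitals % g == 0:
            return g
    return 1
```

Note that for a prime orbital count (such as the 47-orbital system discussed below) no nontrivial divisor exists, which is exactly the load-imbalance problem mentioned later.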
Benchmarks for the QM/MM implementation in CPMD are presented based on an MPI-only version of CPMD, as the addition of OpenMP threading substantially degraded the performance.

Performance Benchmarks
In this section, we present benchmarks of the MiMiC-based QM/MM implementation using CPMD and GROMACS as the QM and MM engines, respectively, for a set of systems with different QM and MM subsystem sizes. We investigate the scalability using the parallelization strategies presented in Section 2.4, namely the low-level grid distribution using MPI, the higher-level task-grouping scheme, and the fine-level OpenMP threading. It should be noted that when we refer to task groups in CPMD in the following, this corresponds to using the CP-group implementation as presented in Ref. 9. We also include a comparison of timings and scaling performance with the existing tight-coupling QM/MM implementation in the CPMD program.

GB1
We performed a set of benchmarks using the zinc-containing GB1 mutant that were aimed at measuring the scaling performance of both the MiMiC-based QM/MM implementation and the existing tight-coupling QM/MM implementation in CPMD. It is also noticeable that in the BO benchmark (Fig. 7b), we observed superlinear scaling at core counts below 1,000. The reason for this peculiar observation is the scaling performance of the plane-wave FFT implementation, which has been reported to scale with more than 100% for certain system sizes. 24 This has been attributed to an improved CPU cache utilization as the number of cores increases. For instance, the CPUs in JURECA have a 30 MB L3 (level 3) cache shared across all cores, in addition to 32 KB and 256 KB of L1 (level 1) and L2 (level 2) cache, respectively, on each core. The size of the FFT mesh for GB1 is estimated to be roughly 100 MB. With two CPUs on a single node, the grid is split between them into 50 MB portions. This means that the portion assigned to a single CPU does not fit into the cache available on that chip. However, as the number of CPUs increases, the portion sizes per CPU decrease. First, the portions reach a size where they fit into the L3 cache and, once the grid is decomposed even further, the part of the grid treated by a single core ultimately fits into the L2 cache. The speedup obtained through better cache utilization due to the more fine-grained grid distribution can compensate for the slowdown due to the communication overhead up to 40 nodes (960 cores). Beyond this point, we no longer observe superlinear behavior.
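The cache argument can be checked with a quick back-of-the-envelope calculation; the numbers below are the ones quoted in the text (a roughly 100 MB FFT mesh and a 30 MB L3 cache per CPU), and the even-split assumption is of course a simplification.

```python
def cpus_until_cache_fit(mesh_mb=100.0, cache_mb=30.0):
    """Smallest number of CPUs over which an evenly split mesh yields
    per-CPU portions that fit into the given cache size."""
    n = 1
    while mesh_mb / n > cache_mb:
        n += 1
    return n
```

With these numbers, two CPUs (50 MB portions) indeed overflow the L3 cache, and four CPUs (25 MB portions) are the first configuration where each portion fits, consistent with the superlinear-scaling explanation above.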
In Figure 8

MQAE
The MQAE system is relatively small with respect to both the QM and MM sizes. In order to benchmark the performance on systems with small and large QM subsystems combined with a small MM subsystem, we ran two sets of benchmarks: in the first, we included only the solute in the QM part, and in the second, we also added the solvent molecules within 3.0 Å of the solute.
Benchmarks using the small QM system with BLYP show that MiMiC retains a high scaling efficiency up to 1,296 cores within BO-MD, with a total time per time step of approximately 0.8 seconds (see Fig. 9). With four OpenMP threads per MPI process, the scaling via grid decomposition is exhausted at 864 cores. Further scaling of BO-MD is possible by using six threads per process. Using task groups does not extend the scalability any further (last point in Fig. 9b). CP-MD behaves in a similar way, allowing the use of 1,296 cores with 0.28 seconds per time step. A superlinear behavior is observed near 300 cores, which can be attributed to the phenomenon described in subsection 4.1.
The benchmarks based on the B3LYP XC functional show that the use of two task groups extends the scaling to 1,728 cores (see Fig. 10). The scaling is limited by the fact that the number of orbitals in this system is a relatively small prime number (47), which inevitably leads to a load imbalance in the exact-exchange computation that in turn constitutes the dominant cost. It can be expected that simulations using the larger QM subsystem will have a scaling limit that is at least 1.3 times higher, based on the larger dimension of the grid ($288^3$).
However, as seen in Figure 11, the BLYP-based BO-MD simulation can use over 2,300 cores by employing two task groups (roughly 1.8 times more than the scaling limit of the smaller system), thus bringing the total time per time step down to about 3 seconds.

Butanol
This system has a very small QM subsystem (15 atoms) combined with a large MM subsystem (139,495 atoms). Therefore, the QM/MM part is expected to play a major role performance-wise. The time per CP-MD time step can be brought down to approximately 0.45 seconds using 960 cores (see Fig. 12c). Further scaling is stalled by the small size of the FFT grid ($240^3$) and the small number of orbitals (16), which limits the benefits of task groups in CPMD. Moreover, there are issues due to the serial integrator pointed out above and, in addition, the time spent on communication between CPMD and GROMACS is non-negligible at this core count. These bottlenecks appear mostly due to the small size of the QM part compared to the substantial extent of the MM subsystem. A detailed breakdown of the time spent computing each of the contributions at the scaling limit is shown in Figure 13b.
BLYP-based BO-MD exhibits a similar scaling behaviour to the BLYP-based CP-MD and scales up to the same limit of 960 cores (Fig. 12a), pinpointing yet again that the scaling differences in the QM part are less important due to the dominant effect of the QM/MM contributions. Again, the efficient parallelization scheme allows us to substantially reduce the time needed for a BO-MD step (1.29 seconds). Further scaling is mostly hampered by the bottlenecks summarized for CP-MD above. In the B3LYP-based simulations, the use of two task groups extends the scaling limit of BO-MD by a factor of two (Fig. 14), making it possible to use up to 1,920 cores.

ClC-ec1
The ClC-ec1 system has the largest number of MM atoms considered here (150,925). Therefore, QM/MM interactions are of paramount importance for simulations of this system. In order to examine the effects of both small and large QM subsystems combined with a large MM subsystem, we also included a setup with a QM subsystem that consists of the smaller QM subsystem plus all residues within 3 Å of it.
The behavior of the BLYP-based simulations with the small QM subsystem is similar to that of butanol (see Fig. 15). BO-MD scales up to 960 cores, with a single time step taking 6.53 seconds. However, there is a drastic decline in the CP-MD performance. In this particular case, it is not possible to use more than 480 cores with reasonable efficiency. The reason for this lies in the large size of the MM subsystem and the associated large number of constraints. This, together with the serial velocity and coordinate updates in the CPMD integrator, stalls the scaling (see Fig. 16). The picture is rather different when the small QM subsystem is treated at the B3LYP level; these benchmarks have been run on the JUWELS supercomputer due to their significant computational demands. For the larger QM subsystem, whose grid has more points along one dimension, an improvement in the scaling limit is expected due to better scaling in the low-level grid parallelization. The results for this setup can be found in Figure 18. Indeed, BLYP-based BO-MD scales excellently, up to 4,320 cores with an efficiency above 70% (rather than 960 cores as for the smaller QM subsystem). The scaling of CP-MD also shows an impressive improvement, extending the scaling limit observed for the small QM subsystem by as much as a factor of 4.5.

Summary
In this work, we presented the parallelization strategy and performance benchmarks of the MiMiC-based QM/MM implementation 1 using CPMD and GROMACS as the QM and MM engines, respectively. We detailed the technical and algorithmic improvements that allowed our implementation to outperform the existing QM/MM interface in CPMD by roughly an order of magnitude (in terms of wall time per MD time step). Exceptional parallel efficiency is achieved through an efficient parallelization of the QM/MM contributions and by exploiting the highly parallel CPMD and GROMACS packages, both of which run concurrently, as enabled by the loose-coupling MPMD approach used in MiMiC.
We demonstrated efficient utilization of several thousand cores, bringing the average time per time step down to the few-seconds range for BO-MD and to the sub-second range for CP-MD.
A major improvement of the parallel algorithm is achieved via the combined use of an MPI-based task-group approach and OpenMP threading. This paves the way for QM/MM simulations on the nanosecond timescale within the duration of a typical computational project, even when using hybrid XC functionals, and enables investigations of large and complex biological systems within the plane-wave/pseudopotential DFT-based QM/MM approach.