
VAMPnets for deep learning of molecular kinetics

Andreas Mardt¹, Luca Pasquali¹, Hao Wu and Frank Noé*

Freie Universität Berlin, Department of Mathematics and Computer Science, Arnimallee 6, 14195 Berlin, Germany

¹ equal contribution; * correspondence to [email protected]

There is an increasing demand for computing the relevant structures, equilibria and long-timescale kinetics of biomolecular processes, such as protein-drug binding, from high-throughput molecular dynamics simulations. Current methods employ transformation of simulated coordinates into structural features, dimension reduction, clustering of the dimension-reduced data, and estimation of a Markov state model or related model of the interconversion rates between molecular structures. This handcrafted approach demands a substantial amount of modeling expertise, as poor decisions at any step will lead to large modeling errors. Here we employ the variational approach for Markov processes (VAMP) to develop a deep learning framework for molecular kinetics using neural networks, dubbed VAMPnets. A VAMPnet encodes the entire mapping from molecular coordinates to Markov states, thus combining the whole data processing pipeline in a single end-to-end framework. Our method performs as well as or better than state-of-the-art Markov modeling methods and provides easily interpretable few-state kinetic models.

INTRODUCTION

The rapid advances in computing power and simulation technologies for molecular dynamics (MD) of biomolecules and fluids [1–4], and ab initio MD of small molecules and materials [5, 6], allow the generation of extensive simulation data of complex molecular systems. Thus, it is of high interest to automatically extract statistically relevant information, including stationary, kinetic and mechanistic properties.

The Markov modeling approach [7–12] has been a driving force in the development of kinetic modeling techniques from MD mass data, chiefly as it facilitates a divide-and-conquer approach to integrate short, distributed MD simulations into a model of the long-timescale behavior. State-of-the-art analysis approaches and software packages [4, 13, 14] operate by a sequence, or pipeline, of multiple processing steps that has been engineered by practitioners over the last decade. The first step of a typical processing pipeline is featurization, where the MD coordinates are either aligned (removing translation and rotation of the molecule of interest) or transformed into internal coordinates such as residue distances, contact maps or torsion angles [4, 13, 15, 16]. This is followed by a dimension reduction, in which the dimension is reduced to much fewer (typically 2-100) slow collective variables (CVs), often based on the variational approach of conformation dynamics [17, 18], time-lagged independent component analysis (TICA) [19, 20], blind source separation [21–23] or dynamic mode decomposition [24–28] – see [29, 30] for an overview. The resulting coordinates may be scaled in order to embed them in a metric space whose distances correspond to some form of dynamical distance [31, 32]. The resulting metric space is discretized by clustering the projected data using hard or fuzzy data-based clustering methods [11, 13, 33–37], typically resulting in 100-1000 discrete states. A transition matrix or rate matrix describing the transition probabilities or rates between the discrete states at some lag time τ is then estimated [8, 12, 38, 39] (alternatively, a Koopman model can be built after the dimension reduction [27, 28]). The final step towards an easily interpretable kinetic model is coarse-graining of the estimated Markov state model (MSM) down to a few states [40–46].

This sequence of analysis steps has been developed by combining physico-chemical intuition and technical experience gathered in the last ~10 years. Although each of the steps in the above pipeline appears meaningful, there is no fundamental reason why this or any other given analysis pipeline should be optimal. More dramatically, the success of kinetic modeling currently relies on substantial technical expertise of the modeler, as suboptimal decisions in each step may deteriorate the result. As an example, failure to select suitable features in step 1 will almost certainly lead to large modeling errors.

An important step towards selecting optimal models (parameters) and modeling procedures (hyper-parameters) has been the development of the variational approach for conformation dynamics (VAC) [17, 18], which offers a way to define scores that measure the optimality of a given kinetic model compared to the (unknown) MD operator that governs the true kinetics underlying the data. The VAC has recently been generalized to the variational approach for Markov processes (VAMP), which makes it possible to optimize models of arbitrary Markov processes, including nonreversible and non-stationary dynamics [47]. The VAC has been employed using cross-validation in order to make optimal hyper-parameter choices within the analysis pipeline described above while avoiding overfitting [34, 48]. However, a variational score is not only useful to optimize the steps of a given analysis pipeline, but in fact allows us to replace the entire pipeline with a more general learning structure.

Here we develop a deep learning structure that is in principle able to replace the entire analysis pipeline above. Deep learning has been very successful in a broad range of data analysis and learning problems [49–51]. A feedforward deep neural network is a structure that can learn a complex, nonlinear function y = F(x). In order to train the network, a scoring or loss function is needed that is maximized or minimized, respectively. Here we develop VAMPnets, a neural network architecture that can be trained by maximizing a VAMP variational score. VAMPnets contain two network lobes that transform the molecular configurations found at a time delay τ along the simulation trajectories. Compared to previous attempts to include "depth" or "hierarchy" into the analysis method [52, 53], VAMPnets combine the tasks of featurization, dimension reduction, discretization and coarse-grained kinetic modeling into a single end-to-end learning framework. We demonstrate the performance of our networks using a variety of stochastic models and datasets, including a protein folding dataset. The results are competitive with and sometimes surpass the state-of-the-art handcrafted analysis pipeline. Given the rapid improvements of training efficiency and accuracy of deep neural networks seen in a broad range of disciplines, it is likely that follow-up works can lead to superior kinetic models.

RESULTS

A. Variational principle for Markov processes

MD can be theoretically described as a Markov process x_t in the full state space Ω. For a given potential energy function, simulation setup (e.g., periodic boundaries) and time-step integrator used, the dynamics are fully characterized by a transition density p_τ(x, y), i.e., the probability density that an MD trajectory will be found at configuration y given that it was at configuration x a time lag τ before. Markovianity implies that y can be sampled by knowing x alone, without knowledge of previous time steps. While the dynamics might be highly nonlinear in the variables x_t, Koopman theory [24, 54] tells us that there is a transformation of the original variables into some features or latent variables that, on average, evolve according to a linear transformation. In mathematical terms, there exist transformations to features or latent variables, χ_0(x) = (χ_01(x), ..., χ_0m(x))^T and χ_1(x) = (χ_11(x), ..., χ_1m(x))^T, such that the dynamics in these variables are approximately governed by the matrix K:

E[χ_1(x_{t+τ})] ≈ K^T E[χ_0(x_t)].   (1)

This approximation becomes exact in the limit of an infinitely large set of features (m → ∞) χ_0 and χ_1, but for a sufficiently large lag time τ the approximation can be excellent with low-dimensional feature transformations, as we will demonstrate below. The expectation values E account for stochasticity in the dynamics, such as in MD, and thus can be omitted for deterministic dynamical systems [24, 26, 27].

To illustrate the meaning of Eq. (1), consider the example of x_t being a discrete-state Markov chain. If we choose the feature transformations to be indicator functions (χ_0i = 1 when x_t = i and 0 otherwise, and correspondingly for χ_1i and x_{t+τ}), their expectation values are equal to the probabilities of the chain to be in any given state, p_t and p_{t+τ}, and K = P(τ) is equal to the matrix of transition probabilities, i.e., p_{t+τ} = P^T(τ) p_t. Previous papers on MD kinetics have usually employed a propagator or transfer operator formulation instead of Eq. (1) [7, 8]. However, the above formulation is more powerful as it also applies to nonreversible and non-stationary dynamics, as found for MD of molecules subject to external force, such as voltage, flow, or radiation [55, 56].

A central result of the VAMP theory is that the best finite-dimensional linear model, i.e., the best approximation in Eq. (1), is found when the subspaces spanned by χ_0 and χ_1 are identical to those spanned by the top m left and right singular functions, respectively, of the so-called Koopman operator [47]. For an introduction to the Koopman operator, please refer to [24, 30, 54].

How do we choose χ_0, χ_1 and K from data? First, suppose we are given some feature transformations χ_0, χ_1 and define the following covariance matrices:

C_00 = E_t[χ_0(x_t) χ_0(x_t)^T]   (2)

C_01 = E_t[χ_0(x_t) χ_1(x_{t+τ})^T]   (3)

C_11 = E_{t+τ}[χ_1(x_{t+τ}) χ_1(x_{t+τ})^T]   (4)

where E_t[·] and E_{t+τ}[·] denote averages that extend over time points and lagged time points within trajectories, respectively, and across trajectories. Then the optimal K that minimizes the least-squares error E_t[‖χ_1(x_{t+τ}) − K^T χ_0(x_t)‖²] is [27, 47, 57]:

K = C_00^{-1} C_01.   (5)

Now the remaining problem is how to find suitable transformations χ_0, χ_1. This problem cannot be solved by minimizing the least-squares error above, as is illustrated by the following example: suppose we define χ_0(x) = χ_1(x) = (1(x)), i.e., we just map the state space to the constant 1 – in this case the least-squares error is 0 for K = [1], but the model is completely uninformative as all dynamical information is lost.

Instead, in order to seek χ_0 and χ_1 based on available simulation data, we employ the VAMP theorem introduced in Ref. [47], which can be equivalently formulated as the following subspace version.

VAMP variational principle: For any two sets of linearly independent functions χ_0(x) and χ_1(x), let us call

R_2[χ_0, χ_1] = ‖C_00^{-1/2} C_01 C_11^{-1/2}‖²_F

their VAMP-2 score, where C_00, C_01, C_11 are defined by Eqs. (2-4) and ‖A‖²_F = n^{-1} Σ_{i,j} A²_{ij} is the Frobenius norm of the n × n matrix A. The maximum value of the VAMP-2 score is achieved when the top m left and right Koopman singular functions belong to span(χ_01, ..., χ_0m) and span(χ_11, ..., χ_1m), respectively.

This variational theorem shows that the VAMP-2 score measures the consistency between the subspaces of the basis functions and those of the dominant singular functions, and we can therefore optimize χ_0 and χ_1 by maximizing the VAMP-2 score. In the special case where the dynamics are reversible with respect to an equilibrium distribution, the theorem above specializes to the variational principle for reversible Markov processes [17, 18].

B. Learning the feature transformation using VAMPnets

Here we employ neural networks to find an optimal set of basis functions, χ_0(x) and χ_1(x). Neural networks with at least one hidden layer are universal function approximators [58], and deep networks can express strongly nonlinear functions with relatively few neurons per layer [59]. Our networks use VAMP as a guiding principle and are hence called VAMPnets. VAMPnets consist of two parallel lobes, each receiving the coordinates of time-lagged MD configurations x_t and x_{t+τ} as input (Fig. 1). The lobes have m output nodes and are trained to learn the transformations χ_0(x_t) and χ_1(x_{t+τ}), respectively. For a given set of transformations, χ_0 and χ_1, we pass a batch of training data through the network and compute the training VAMP-2 score of our choice. VAMPnets bear similarities with auto-encoders [60, 61] using a time-delay embedding and are closely related to deep canonical correlation analysis (CCA) [62]. VAMPnets are identical to deep CCA with time-delay embedding when using the VAMP-1 score discussed in [47]; however, the VAMP-2 score has easier-to-handle gradients and is more suitable for time series data, due to its direct relation to the Koopman approximation error [47].

Figure 1: Scheme of the neural network architecture used. For each time step t of the simulation trajectory, the coordinates x_t and x_{t+τ} are inputs to two deep networks that conduct a nonlinear dimension reduction. In the present implementation, the output layer consists of a Softmax classifier. The outputs are then merged to compute the variational score that is maximized to optimize the networks. In all present applications the two network lobes are identical clones, but they can also be trained independently.

The first left and right singular functions of the Koopman operator are always equal to the constant function 1(x) ≡ 1 [47]. We can thus add 1 to the basis functions and train the network by maximizing

R_2[(1, χ_0), (1, χ_1)] = ‖C_00^{-1/2} C_01 C_11^{-1/2}‖²_F + 1,   (6)

where C_00, C_01, C_11 are mean-free covariances of the feature-transformed coordinates:

C_00 = (T − 1)^{-1} X̄ X̄^T   (7)
C_01 = (T − 1)^{-1} X̄ Ȳ^T   (8)
C_11 = (T − 1)^{-1} Ȳ Ȳ^T.   (9)

Here we have defined the matrices X = [X_ij] = χ_0i(x_j) ∈ R^{m×T} and Y = [Y_ij] = χ_1i(x_{j+τ}) ∈ R^{m×T}, with (x_j, x_{j+τ})_{j=1}^T representing all available transition pairs, and their mean-free versions X̄ = X − T^{-1} X 1, Ȳ = Y − T^{-1} Y 1 (with 1 the T × T matrix of ones). The gradients of R_2 are given by:

∇_X R_2 = 2 (T − 1)^{-1} C_00^{-1} C_01 C_11^{-1} (Y − C_01^T C_00^{-1} X)   (10)

∇_Y R_2 = 2 (T − 1)^{-1} C_11^{-1} C_01^T C_00^{-1} (X − C_01 C_11^{-1} Y)   (11)

and are back-propagated to train the two network lobes. See Supplementary Note 1 for derivations of Eqs. (6), (10) and (11).

For simplicity of interpretation we may just use a unique basis set χ = χ_0 = χ_1. Even when using two different basis sets would be meaningful, we can unify them by simply defining χ = (χ_0, χ_1)^T. In this case, we clone the lobes of the network and train them using the total gradient ∇R_2 = ∇_X R_2 + ∇_Y R_2.

After training, we assess the quality of the learned features and select hyper-parameters (e.g., network size) while avoiding overfitting using the VAMP-2 validation score

R_2^val = ‖(C_00^val)^{-1/2} C_01^val (C_11^val)^{-1/2}‖²_F + 1,   (12)

where C_00^val, C_01^val, C_11^val are mean-free covariance matrices computed from a validation data set not used during training.
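As a sketch of how the training score of Eq. (6) and the gradients of Eqs. (10)-(11) could be evaluated (our illustration, not the paper's implementation; in practice these operations are expressed in the tensor operations of the deep learning framework so that the gradients flow into the lobes automatically, and whether the mean is subtracted inside or outside the back-propagation path is an implementation detail glossed over here):

import numpy as np

def vamp2_loss_and_grads(X, Y):
    """VAMP-2 training score (Eq. 6) and gradients (Eqs. 10-11).

    X, Y : (m, T) arrays of lobe outputs chi(x_t) and chi(x_{t+tau}).
    Returns the score and the gradients evaluated with the mean-free
    feature matrices; training maximizes the score (minimizes -score).
    """
    T = X.shape[1]
    Xb = X - X.mean(axis=1, keepdims=True)   # mean-free X
    Yb = Y - Y.mean(axis=1, keepdims=True)   # mean-free Y

    C00 = Xb @ Xb.T / (T - 1)                # Eq. (7)
    C01 = Xb @ Yb.T / (T - 1)                # Eq. (8)
    C11 = Yb @ Yb.T / (T - 1)                # Eq. (9)

    def inv_sqrt(C, eps=1e-10):
        s, U = np.linalg.eigh(C)
        return U @ np.diag(np.maximum(s, eps) ** -0.5) @ U.T

    M = inv_sqrt(C00) @ C01 @ inv_sqrt(C11)
    score = np.sum(M ** 2) + 1.0             # Eq. (6)

    # A small diagonal regularization may be needed if the covariances
    # are close to singular.
    C00i, C11i = np.linalg.inv(C00), np.linalg.inv(C11)
    grad_X = 2.0 / (T - 1) * C00i @ C01 @ C11i @ (Yb - C01.T @ C00i @ Xb)  # Eq. (10)
    grad_Y = 2.0 / (T - 1) * C11i @ C01.T @ C00i @ (Xb - C01 @ C11i @ Yb)  # Eq. (11)
    return score, grad_X, grad_Y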

C. Dynamical model and validation

The direct estimate of the time-lagged covariance matrix C_01 is generally nonsymmetric. Hence the Koopman model or Markov state model K given by Eq. (5) is typically not time-reversible [28]. In MD, it is often desirable to obtain a time-reversible kinetic model – see [39] for a detailed discussion. To enforce reversibility, K can be reweighted as described in [28] and implemented in PyEMMA [13]. The present results do not depend on enforcing reversibility, as classical analyses such as PCCA+ [63] are avoided based on the VAMPnet structure itself.

Since K is a Markovian model, it is expected to fulfill the Chapman-Kolmogorov (CK) equation:

K(nτ) = K^n(τ),   (13)

for any value of n ≥ 1, where K(τ) and K(nτ) indicate the models estimated at a lag time of τ and nτ, respectively. However, since any Markovian model of MD can only be approximate [8, 64], Eq. (13) can only be fulfilled approximately, and the relevant test is whether it holds within statistical uncertainty. We construct two tests based on Eq. (13). In order to select a suitable dynamical model, we proceed as for Markov state models by conducting an eigenvalue decomposition of every estimated Koopman matrix, K(τ) r_i = r_i λ_i(τ), and computing the implied timescales [9] as a function of lag time:

t_i(τ) = −τ / ln |λ_i(τ)|.   (14)

We choose a value of τ at which the t_i(τ) are approximately constant in τ. After having chosen τ, we test whether Eq. (13) holds within statistical uncertainty [65]. For both the implied timescales and the CK test we proceed as follows: train the neural network at a fixed lag time τ*, thus obtaining the network transformation χ, and then compute Eq. (13) or Eq. (14) for different values of τ with the fixed transformation χ. Finally, the approximation of the ith eigenfunction is given by

ψ_i^e(x) = Σ_j r_ij χ_j(x).   (15)

If the dynamics are reversible, the singular value decomposition and the eigenvalue decomposition are identical, i.e., σ_i = λ_i and ψ_i = ψ_i^e.
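Both validation diagnostics can be computed directly from the estimated Koopman matrices. A minimal sketch of ours for Eqs. (13) and (14); it omits the bootstrapped error bars reported in the paper, and summarizes the CK comparison by a single matrix norm rather than per-element transition probabilities:

import numpy as np

def koopman_matrix(chi_t, chi_tau):
    """Koopman matrix K(tau) from Eq. (5) for a fixed transformation chi."""
    T = chi_t.shape[0]
    C00 = chi_t.T @ chi_t / T
    C01 = chi_t.T @ chi_tau / T
    return np.linalg.inv(C00) @ C01

def implied_timescales(K, tau):
    """Implied timescales t_i = -tau / ln|lambda_i| (Eq. 14),
    excluding the stationary eigenvalue lambda_0 ~ 1."""
    lambdas = np.sort(np.abs(np.linalg.eigvals(K)))[::-1]
    return -tau / np.log(lambdas[1:])

def ck_residual(K_tau, K_ntau, n):
    """Scalar Chapman-Kolmogorov residual ||K(tau)^n - K(n*tau)|| (Eq. 13)."""
    return np.linalg.norm(np.linalg.matrix_power(K_tau, n) - K_ntau)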

D. Network architecture and training

We use VAMPnets to learn molecular kineticsfrom simulation data of a range of model sys-tems. While any neural network architecture canbe employed inside the VAMPnet lobes, we chosethe following setup for our applications: the twonetwork lobes are identical clones, i.e. χ0 ≡ χ1,and consist of fully connected networks. In mostcases, the networks have less output than inputnodes, i.e. the network conducts a dimension re-duction. In order to divide the work equally be-tween network layers, we reduce the number ofnodes from each layer to the next by a constantfactor. Thus, the network architecture is definedby two parameters: the depth d and the numberof output nodes nout. All hidden layers employRectified Linear Units (ReLU) [66, 67].Here, we build the output layer with Softmaxoutput nodes, i.e. χi(x) ≥ 0 for all i and∑i χi(x) = 1. Therefore, the activation of an

output node can be interpreted as a probabilityto be in state i. As a result, the network effec-tively performs featurization, dimension reduc-tion and finally a fuzzy clustering to metastablestates, and the K(τ) matrix computed from thenetwork-transformed data is the transition ma-trix of a fuzzy MSM [36, 37, 68]. Consequently,

Page 5: VAMPnets for deep learning of molecular kinetics · VAMPnets for deep learning of molecular kinetics Andreas Mardt 1, Luca Pasquali , Hao Wu and Frank Noé Freie Universität Berlin,

5

Eq. (1) propagates probability distributions intime.The networks were trained with pairs of MD con-figurations (xt, xt+τ ) using the Adam stochasticgradient descent method [69]. For each result werepeated 100 training runs, each of which witha randomly chosen 90%/10% division of the datato training and validation data. See Methods sec-tion for details on network architecture, trainingand choice of hyper-parameters.
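As an illustration of the lobe architecture described here, the following is a hedged sketch using tf.keras rather than the paper's actual code; the VAMP-2 loss of Eq. (6) and the weight sharing between the two lobes of Fig. 1 still have to be added on top.

from tensorflow import keras

def build_lobe(n_in, n_out, depth, dropout=0.1):
    """One VAMPnet lobe: fully connected ReLU hidden layers, Softmax output.

    Hidden-layer widths shrink from n_in towards n_out by the constant
    factor of Eq. (16); dropout is applied in the first two hidden layers,
    as described for the alanine dipeptide network (assumes depth >= 2).
    """
    factor = (n_in / n_out) ** (1.0 / depth)
    widths, w = [], n_in
    for _ in range(depth - 1):
        w = int(round(w / factor))
        widths.append(w)

    model = keras.Sequential()
    model.add(keras.layers.Dense(widths[0], activation='relu', input_shape=(n_in,)))
    model.add(keras.layers.Dropout(dropout))
    for i, w in enumerate(widths[1:], start=1):
        model.add(keras.layers.Dense(w, activation='relu'))
        if i < 2:
            model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Dense(n_out, activation='softmax'))
    return model

# With this particular rounding, build_lobe(30, 6, 5) yields the
# 30-22-16-12-9-6 alanine dipeptide lobe of Fig. 3.
lobe = build_lobe(n_in=30, n_out=6, depth=5)

Other roundings of Eq. (16) are equally valid; the constant-factor scheme is the point, not the exact widths.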

E. Asymmetric double well potential

We first model the kinetics of a bistable one-dimensional process, simulated by Brownian dynamics (see Methods) in an asymmetric double-well potential (Fig. 2a). A trajectory of 50,000 time steps is generated. Three-layer VAMPnets are set up with 1-5-10-5 nodes in each lobe. The single input node of each lobe is given the current and time-lagged mean-free x coordinate of the system, i.e., x_t − µ_1 and x_{t+τ} − µ_2, where µ_1 and µ_2 are the respective means, and τ = 1 is used. The network maps to five Softmax output nodes that we will refer to as states, as the network performs a fuzzy discretization by mapping the input configurations to the output activations. The network is trained using the VAMP-2 score with the four largest singular values.

The network learns to place the output states so as to best resolve the transition region (Fig. 2b), which is known to be important for the accuracy of a Markov state model [8, 64]. This placement minimizes the Koopman approximation error, as seen by comparing the dominant Koopman eigenfunction (Eq. 15) with a direct numerical approximation of the true eigenfunction obtained from a transition matrix computed for a direct uniform 200-state discretization of the x axis – see [8] for details. The implied timescale and Chapman-Kolmogorov tests (Eqs. 14 and 13) confirm that the kinetic model learned by the VAMPnet successfully predicts the long-time kinetics (Fig. 2c, d).

Figure 2: Approximation of the slow transition in a bistable potential. (a) Potential energy function U(x) = x⁴ − 6x² + 2x. (b) Eigenvector of the slowest process calculated by direct numerical approximation (black) and approximated by a VAMPnet with five output nodes (red). Activations of the five Softmax output nodes define the state membership probabilities (blue). (c) Relaxation timescales computed from the Koopman model using the VAMPnet transformation. (d) Chapman-Kolmogorov test comparing long-time predictions of the Koopman model estimated at τ = 1 and estimates at longer lag times. Panels (c) and (d) report 95% confidence interval error bars over 100 training runs.

F. Protein folding model

While the first example was one-dimensional, we now test whether VAMPnets are able to learn reaction coordinates that are nonlinear functions of a multi-dimensional configuration space. For this, we simulate a 100,000 time step Brownian dynamics trajectory (Eq. 17) using the simple protein folding model defined by the potential energy function (Supplementary Fig. 1a):

U(r) = −2.5 (r − 3)²,            for r < 3,
U(r) = 0.5 (r − 3)³ − (r − 3)²,  for r ≥ 3.

The system has a five-dimensional configuration space, x ∈ R⁵; however, the energy only depends on the norm of the vector, r = |x|. While small values of r are energetically favorable, large values of r are entropically favorable, as the number of configurations available on a five-dimensional hypersphere grows dramatically with r. Thus, the dynamics are bistable along the reaction coordinate r. Four-layer network lobes with 5-32-16-8-2 nodes each were employed and trained to maximize the VAMP-2 score involving the largest nontrivial singular value.

The two output nodes successfully identify the folded and the unfolded states, and use intermediate memberships for the intersecting transition region (Supplementary Fig. 1b). The network excellently approximates the Koopman eigenfunction of the folding process, as apparent from the comparison of the values of the network eigenfunction computed by Eq. (15) with the eigenvector computed from a high-resolution MSM built on the r coordinate (Supplementary Fig. 1b). This demonstrates that the network can learn the nonlinear reaction coordinate mapping r = |x| based only on maximizing the variational score, Eq. (6). Furthermore, the implied timescales and the CK test indicate that the network model predicts the long-time kinetics almost perfectly (Supplementary Fig. 1c,d).

G. Alanine dipeptide

As a next level, VAMPnets are used to learn the kinetics of alanine dipeptide from simulation data. It is known that the φ and ψ backbone torsion angles are the most important reaction coordinates that separate the metastable states of alanine dipeptide; however, our networks only receive Cartesian coordinates as an input, and are thus forced to learn both the nonlinear transformation to the torsion angle space and an optimal cluster discretization within this space, in order to obtain an accurate kinetic model.

A 250 nanosecond MD trajectory generated in Ref. [70] (MD setup described there) serves as a dataset. The solute coordinates were stored every picosecond, resulting in 250,000 configurations that are all aligned on the first frame using a minimal RMSD fit to remove global translation and rotation. Each network lobe uses the three-dimensional coordinates of the 10 heavy atoms as input, (x_1, y_1, z_1, ..., x_10, y_10, z_10), and the network is trained using a time lag of τ = 40 ps. Different numbers of output states and layer depths are considered, employing the layer sizing scheme described in the Methods section (see Fig. 3 for an example).

Figure 3: Representative structure of one lobe of the VAMPnet used for alanine dipeptide. Here, the five-layer network with six output states used for the results shown in Fig. 4 is shown. Layers are fully connected, have 30-22-16-12-9-6 nodes, and use dropout in the first two hidden layers. All hidden neurons use ReLU activation functions, while the output layer uses a Softmax activation function in order to achieve a fuzzy discretization of state space.

A VAMPnet with six output states learns a discretization into six metastable sets corresponding to the free energy minima of the φ/ψ space (Fig. 4b). The implied timescales indicate that, given the coordinate transformation found by the network, the two slowest timescales are converged at lag time τ = 50 ps or larger (Fig. 4c). Thus we estimated a Koopman model at τ = 50 ps, whose Markov transition probability matrix is depicted in Fig. 4d. Note that transition probabilities between state pairs 1 ↔ 4 and 2 ↔ 3 are important for the correct kinetics at τ = 50 ps, but the actual trajectories typically pass via the directly adjacent intermediate states. The model performs excellently in the CK test (Fig. 4e).
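The input featurization described above (heavy-atom Cartesian coordinates after an RMSD alignment to the first frame) could be prepared, for example, with MDTraj [16]. This is an illustrative sketch of ours, not the paper's code; file names are placeholders and the atom selection may need adjusting to the specific topology.

import mdtraj as md

traj = md.load('alanine_dipeptide.xtc', top='alanine_dipeptide.pdb')  # placeholders

# Keep the heavy atoms of the solute and align all frames to the
# first one by a minimal-RMSD fit.
heavy = traj.topology.select_atom_indices('heavy')
traj = traj.atom_slice(heavy)
traj.superpose(traj, frame=0)

# Flatten to (n_frames, 3 * n_atoms): (x1, y1, z1, ..., xN, yN, zN), the
# input of each lobe; pairs (X[t], X[t + lag]) with lag = 40 frames
# (40 ps at the 1 ps saving interval) form the training data.
X = traj.xyz.reshape(traj.n_frames, -1)
lag = 40
pairs = (X[:-lag], X[lag:])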

H. Choice of lag time, network depth and number of output states

We studied the success probability of optimizing a VAMPnet with six output states as a function of the lag time τ by conducting 200 optimization runs. Success was defined as resolving the three slowest processes by finding the three slowest timescales to be higher than 0.2, 0.4 and 1 ns, respectively. Note that the results shown in Fig. 4 are reported for runs that are successful in this sense. There is a range of τ values from 4 to 32 picoseconds where the training succeeds with a significant probability (Supplementary Fig. 2a). However, even in this range the success rate is still below 40%, which is mainly due to the fact that many runs fail to find the rarely occurring third-slowest process that corresponds to the ψ transition in the positive φ range (Fig. 4b, states 5 and 6).

The breakdown of optimization success for small and large lag times can be most easily explained by the eigenvalue decomposition of Markov propagators [8]. When the lag time exceeds the timescale of a process, the amplitude of this process becomes negligible, making it hard to fit given noisy data. At short lag times, many processes have large eigenvalues, which increases the search space of the neural network and appears to increase the probability of getting stuck in suboptimal maxima of the training score.


Figure 4: VAMPnet kinetic model of alanine dipeptide. (a) Structure of alanine dipeptide. The main coordinates describing the slow transitions are the backbone torsion angles φ and ψ; however, the neural network inputs are only the Cartesian coordinates of heavy atoms. (b) Assignment of all simulated molecular coordinates, plotted as a function of φ and ψ, to the six Softmax output states. Color corresponds to the activation of the respective output neuron, indicating the membership probability to the associated metastable state. (c) Relaxation timescales computed from the Koopman model using the neural network transformation. (d) Representation of the transition probability matrix of the Koopman model; transitions with a probability lower than 0.5% have been omitted. (e) Chapman-Kolmogorov test comparing long-time predictions of the Koopman model estimated at τ = 50 ps and estimates at longer lag times. Panels (c) and (e) report 95% confidence interval error bars over 100 training runs, excluding failed runs (see text).

We have also studied the success probability, as defined above, as a function of network depth. Deeper networks can represent more complex functions. Also, since the networks defined here reduce the input dimension to the output dimension by a constant factor per layer, deeper networks perform a less radical dimension reduction per layer. On the other hand, deeper networks are more difficult to train. As seen in Supplementary Fig. 2b, a high success rate is found for four to seven layers.

Next, we studied the dependency of the network-based discretization on the number of output nodes (Fig. 5a-c). With two output states, the network separates the state space at the slowest transition, between negative and positive values of the φ angle (Fig. 5a). The result with three output nodes keeps the same separation and additionally distinguishes between the α and β regions of the Ramachandran plot, i.e., small and large values of the ψ angle (Fig. 5b). For higher numbers of output states, finer discretizations and smaller interconversion timescales are found, until the network starts discretizing the transition regions, such as the two transition states between the α and β regions along the ψ angle (Fig. 5c). We chose the lag time depending on the number of output nodes of the network, using τ = 200 ps for two output nodes, τ = 60 ps for three output nodes, and τ = 1 ps for eight output nodes.

A network output with k Softmax neurons describes a (k − 1)-dimensional feature space, as the Softmax normalization removes one degree of freedom. Thus, to resolve k − 1 relaxation timescales, at least k output nodes or metastable states are required. However, the network quality can improve when given more degrees of freedom with which to approximate the dominant singular functions accurately. Indeed, the best scores using k = 4 singular values (3 nontrivial singular values) are achieved when using at least six output states, which separate each of the six metastable states in the Ramachandran plane (Fig. 5d-e).

For comparison, we investigated how a standard MSM would perform as a function of the number of states (Fig. 5d). For a fair comparison, the MSMs also used Cartesian coordinates as an input, but then employed a state-of-the-art procedure using a kinetic map transformation that preserves 95% of the cumulative kinetic variance [31], followed by k-means clustering, where the parameter k is varied. It is seen that the MSM VAMP-2 scores obtained by this procedure are significantly worse than those of VAMPnets when fewer than 20 states are employed. Clearly, MSMs will succeed when sufficiently many states are used, but in order to obtain an interpretable model those states must again be coarse-grained onto a fewer-state model, whereas VAMPnets directly produce an accurate model with few states.
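A baseline of this type could be assembled with PyEMMA [13] roughly as follows. This is a sketch under the assumption of PyEMMA's standard API; the lag times, cluster number and scoring call are illustrative and not taken from the paper (the cross-validated score method is available in newer PyEMMA versions).

import pyemma

# X: list of (n_frames, n_features) arrays of aligned Cartesian coordinates;
# lag times are given in frames (1 frame = 1 ps for the alanine data).
tica = pyemma.coordinates.tica(X, lag=40, var_cutoff=0.95, kinetic_map=True)
Y = tica.get_output()

clustering = pyemma.coordinates.cluster_kmeans(Y, k=20, max_iter=100)
dtrajs = clustering.dtrajs

msm = pyemma.msm.estimate_markov_model(dtrajs, lag=50)
# VAMP-2 score of the discrete model; score_k counts the singular values included.
score = msm.score_cv(dtrajs, score_method='VAMP2', score_k=4)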


Figure 5: Kinetic model of alanine dipeptide as a function of the number of output states. (a-c) Assignment of input coordinates, plotted as a function of φ and ψ, to two, three, and eight output states. Color corresponds to the activation of the respective output neuron, indicating the membership probability to this state (see Fig. 4b). (d) Comparison of VAMPnet and MSM performance as a function of the number of output states / MSM states. Mean VAMP-2 score and 95% confidence interval from 100 runs are shown. (e) Mean squared values of the four largest singular values that make up the VAMPnet score plotted in panel (d).

I. VAMPnets learn to transform Cartesian to torsion coordinates

The results above indicate that the VAMPnet has implicitly learned the feature transformation from Cartesian coordinates to backbone torsions. In order to probe this ability more explicitly, we trained a network with 30-10-3-3-2-5 layers, i.e., including a bottleneck of two nodes before the output layer. We find that the activations of the two bottleneck nodes correlate excellently with the φ and ψ torsion angles that were not presented to the network (Pearson correlation coefficients of 0.95 and 0.92, respectively; Supplementary Fig. 3a,b). To visualize the internal representation that the network learns, we color data samples depending on the free energy minima in the φ/ψ space they belong to (Supplementary Fig. 3c), and then show where these samples end up in the space of the bottleneck node activations (Supplementary Fig. 3d). It is apparent that the network learns a representation of the Ramachandran plot – the four free energy minima at small φ values (αR and β areas) are represented as contiguous clusters with the correct connectivity, and are well separated from states with large φ values (αL area). The network fails to separate the two substates in the large-φ region well, which explains the frequent failure to find the corresponding transition process and the third-largest relaxation timescale.

J. NTL9 protein folding dynamics

In order to proceed to a higher-dimensional problem, we analyze the kinetics of an all-atom protein folding simulation of the NTL9 protein generated on the Anton supercomputer [1]. A five-layer VAMPnet was trained at lag time τ = 10 ns using 111,000 time steps, uniformly sampled from a 1.11 ms trajectory. Since NTL9 folds and unfolds, there is no unique reference structure to align the Cartesian coordinates to – hence we use internal coordinates as network input. We computed the nearest-neighbor heavy-atom distances d_ij for all non-redundant pairs of residues i and j and transformed them into contact maps using the definition c_ij = exp(−d_ij), resulting in 666 input nodes.

Again, the network performs a hierarchical decomposition of the molecular configuration space when increasing the number of output nodes. Fig. 6a shows the decomposition of state space for two and five output nodes, and the corresponding mean contact maps and state probabilities. With two output nodes, the network finds the folded and unfolded states, which are separated by the slowest transition process (Fig. 6a, middle row). With five output states, the folded state is decomposed into a stable and well-defined fully folded substate and a less stable, more flexible substate that is missing some of the tertiary contacts compared to the fully folded substate. The unfolded state decomposes into three substates: one largely unstructured, a second one with residual structure, thus forming a folding intermediate, and a misfolded state with an entirely different fold including a non-native β-sheet.

The relaxation timescales found by a five-state VAMPnet model are on par with those found by a 40-state MSM using state-of-the-art estimation methods (Fig. 6b-c). However, the fact that only five states are required in the VAMPnet model makes it easier to interpret and analyze. Additionally, the CK test indicates excellent agreement between long-time predictions and direct estimates (Fig. 6d).

Figure 6: VAMPnet results of NTL9 folding kinetics. (a) Hierarchical decomposition of the NTL9 protein state space by a network with two and five output nodes. Mean contact maps are shown for all MD samples grouped by the network, along with the fraction of samples in that group. 3D structures are shown for the five-state decomposition; residues involved in α-helices or β-sheets in the folded state are colored identically across the different states. (b) Relaxation timescales computed from the Koopman model approximated using the transformation applied by a neural network with five output nodes. (c) Relaxation timescales from a Markov state model computed from a TICA transformation of the contact maps, followed by k-means clustering with k = 40. (d) Chapman-Kolmogorov test comparing long-time predictions of the Koopman model estimated at τ = 320 ns and estimates at longer lag times. Panels (b), (c) and (d) report 95% confidence interval error bars over 100 training runs.
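The contact-map featurization described above (c_ij = exp(−d_ij) over closest heavy-atom inter-residue distances) could be computed, for instance, with MDTraj [16]. This is an illustrative sketch of ours, not the paper's code; file names are placeholders, distances are in nanometers, and the exact residue-pair selection used in the paper may differ.

import numpy as np
import mdtraj as md

traj = md.load('ntl9.xtc', top='ntl9.pdb')   # placeholder file names

# Closest heavy-atom distance d_ij for all residue pairs separated by
# two or more residues along the chain.
distances, residue_pairs = md.compute_contacts(traj, contacts='all',
                                               scheme='closest-heavy')

# Contact-map features c_ij = exp(-d_ij); one row per frame, one column
# per residue pair (the paper reports 666 such features for NTL9).
contact_maps = np.exp(-distances)
print(contact_maps.shape)   # (n_frames, n_pairs)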

DISCUSSION

We have introduced a deep learning framework for molecular kinetics, called VAMPnets. Data-driven learning of molecular kinetics is usually done by shallow learning structures, such as TICA and MSMs. However, the processing pipeline, typically consisting of featurization, dimension reduction, MSM estimation and MSM coarse-graining, is, in principle, a hand-crafted deep learning structure. Here we propose to replace the entire pipeline by a deep neural network that learns optimal feature transformations and dimension reduction and, if desired, maps the MD time steps to a fuzzy clustering. The key to optimizing the network is the VAMP variational approach, which defines scores by which learning structures can be optimized to learn models of both equilibrium and non-equilibrium MD.

Although MSM-based kinetic modeling has been refined over more than a decade, VAMPnets perform competitively or better in our examples. In particular, they perform extremely well in the Chapman-Kolmogorov test that validates the long-time prediction of the model. VAMPnets have a number of advantages over models based on MSM pipelines: (i) They may be overall more optimal, because featurization, dimension reduction and clustering are not explicitly separate processing steps. (ii) When using Softmax output nodes, the VAMPnet performs a fuzzy clustering of the MD structures fed into the network and constructs a fuzzy MSM, which is readily interpretable in terms of transition probabilities between metastable states. In contrast to other MSM coarse-graining techniques, it is thus not necessary to accept a reduction in model quality in order to obtain a few-state MSM; such a coarse-grained model is seamlessly learned within the same learning structure. (iii) VAMPnets require less expertise to train than MSM-based processing pipelines, and the formulation of molecular kinetics as a neural network learning problem enables us to exploit an arsenal of highly developed and optimized tools in learning software such as TensorFlow, Theano or Keras.

Despite their benefits, VAMPnets still miss many of the benefits that come with extensions developed for the MSM approach. These include multi-ensemble Markov models, which are superior to single conventional MSMs in terms of sampling rare events by combining data from multiple ensembles [71–76], augmented Markov models, which combine simulation data with experimental observation data [77], and statistical error estimators developed for MSMs [78–80]. Since these methods explicitly use the MSM likelihood, it is currently unclear how they could be implemented in a deep learning structure such as a VAMPnet. Extending VAMPnets towards these special capabilities is a challenge for future studies.

Finally, a remaining concern is that the optimization of VAMPnets can get stuck in suboptimal local maxima. In other applications of network-based learning, a working knowledge has been established of which types of network implementation and learning algorithm are most suitable for robust and reproducible learning. For example, it is conceivable that the VAMPnet lobes may benefit from convolutional filters [81] or different types of transfer functions. Suitably chosen convolutions, as in [82], may also lead to learned feature transformations that are transferable within a given class of molecules.

METHODS

Neural network structure. Each network lobe in Fig. 1 has a number of input nodes given by the data dimension. According to the VAMP variational principle (Sec. A), the output dimension must be at least equal to the number of Koopman singular functions that we want to approximate, i.e., equal to the k used in the score function R_2. In most applications, the number of input nodes exceeds the number of output nodes, i.e., the network conducts a dimension reduction. Here, we keep the dimension reduction from layer i with n_i nodes to layer i + 1 with n_{i+1} nodes constant:

n_i / n_{i+1} = (n_in / n_out)^{1/d},   (16)

where d is the network depth, i.e., the number of layers excluding the input layer. Thus, the network structure is fixed by n_out and d. We tested values of d ranging from 2 to 11; for alanine dipeptide, Supplementary Fig. 2b reports the results in terms of the training success rate described in the Results section. The networks have a number of parameters ranging between 100 and 400,000, most of which are between the first and second layer due to the rapid dimension reduction of the network. To avoid overfitting, we use dropout during training [83], and select hyper-parameters using the VAMP-2 validation score.
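To see how the parameter count quoted above arises from the layer widths implied by Eq. (16), a small helper of ours (not from the paper):

def count_parameters(widths):
    """Number of weights and biases of a fully connected lobe with the
    given layer widths (input layer first)."""
    return sum(n_prev * n_next + n_next
               for n_prev, n_next in zip(widths[:-1], widths[1:]))

# Alanine dipeptide lobe of Fig. 3 (30-22-16-12-9-6):
print(count_parameters([30, 22, 16, 12, 9, 6]))   # -> 1431

For wide inputs such as the 666-dimensional NTL9 contact maps, the first weight matrix dominates this count, in line with the statement above.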

Neural network hyper-parameters. Hyper-parameters include the regularization factors for the weights of the fully connected and the Softmax layers, the dropout probabilities for each layer, the batch size, and the learning rate for the Adam algorithm. Since a grid search in the joint parameter space would have been too computationally expensive, each hyper-parameter was optimized using the VAMP-2 validation score while keeping the other hyper-parameters constant. We started with the regularization factors due to their large effect on the training performance, and observed optimal performance for a factor of 10⁻⁷ for the fully connected hidden layers and 10⁻⁸ for the output layer; regularization factors higher than 10⁻⁴ frequently led to training failure. Subsequently, we tested dropout probabilities ranging from 0 to 50% and found 10% dropout in the first two hidden layers and no dropout otherwise to perform well. The results did not strongly depend on the training batch size; however, more training iterations are necessary for large batches, while small batches exhibit stronger fluctuations in the training score. We found a batch size of 4000 to be a good compromise, with tested values ranging between 100 and 16000. The optimal learning rate strongly depends on the network topology (e.g., the number of hidden layers and the number of output nodes). In order to adapt the learning rate, we started from an arbitrary rate of 0.05. If no improvement of the validation VAMP-2 score was observed over 10 training iterations, the learning rate was reduced by a factor of 10. This scheme led to better convergence of the training and validation scores and better kinetic model validation compared to using a high learning rate throughout.
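The learning-rate schedule described above (divide by 10 after 10 iterations without validation improvement) corresponds to a standard plateau-based schedule. In Keras it could be expressed roughly as follows; this is a sketch under the assumption that the negative VAMP-2 validation score is exposed as the monitored loss, and neg_vamp2_loss is a placeholder, not code from the paper.

from tensorflow import keras

# Reduce the learning rate by a factor of 10 when the monitored validation
# loss (here: the negative VAMP-2 score) has not improved for 10 epochs.
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                              factor=0.1, patience=10)

optimizer = keras.optimizers.Adam(learning_rate=0.05)
# model.compile(optimizer=optimizer, loss=neg_vamp2_loss)   # placeholder loss
# model.fit(..., validation_split=0.1, callbacks=[reduce_lr])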


The time lag between the input pairs of configurations was selected depending on the number of output nodes of the network: larger lag times are better at isolating the slowest processes, and thus are more suitable for a small number of output nodes. The procedure for choosing the network structure and lag time is thus as follows: first, the number of output nodes n and the number of hidden layers are selected, which determines the network structure as described above. Then, a lag time is chosen at which the largest n singular values (corresponding to the n − 1 slowest processes) can be trained consistently.

VAMPnet training and validation. We pre-trained the network by minimizing the negative VAMP-1 score during the first third of the total number of epochs, and subsequently optimized the network with the VAMP-2 score (Sec. B). In order to ensure robustness of the results, we performed 100 network optimization runs for each problem. In each run, the dataset was shuffled and randomly split into 90%/10% for training and validation, respectively. To exclude outliers, we then discarded the best 5% and the worst 5% of results. Hyper-parameter optimization was done using the validation score averaged over the remaining runs. Figures report training or validation means and 95% confidence intervals.

Brownian dynamics simulations. The asymmetric double well and the protein folding toy model are simulated by over-damped Langevin dynamics in a potential energy function U(x), also known as Brownian dynamics, using a forward Euler integration scheme. The position x_t is propagated by a time step ∆t via:

x_{t+∆t} = x_t − ∆t ∇U(x_t)/(kT) + sqrt(2 ∆t D) w_t,   (17)

where D is the diffusion constant and kT is the thermal energy (Boltzmann constant times temperature). Here, dimensionless units are used and D = 1, kT = 1. The elements of the random vector w_t are sampled from a normal distribution with zero mean and unit variance.
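A minimal sketch of the integrator of Eq. (17), applied to the asymmetric double-well potential of Fig. 2a (our illustration; the integration time step is not stated in the paper and is chosen here for demonstration only):

import numpy as np

def brownian_dynamics(grad_U, x0, n_steps, dt=0.01, D=1.0, kT=1.0, seed=0):
    """Forward-Euler (Euler-Maruyama) integration of Eq. (17).

    grad_U : callable returning the gradient of the potential at x.
    dt is illustrative; dimensionless units with D = kT = 1 as in the text.
    """
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.array(x0, dtype=float))
    traj = np.empty((n_steps, x.size))
    for t in range(n_steps):
        noise = rng.standard_normal(x.size)
        x = x - dt * grad_U(x) / kT + np.sqrt(2.0 * dt * D) * noise
        traj[t] = x
    return traj

# Asymmetric double well of Fig. 2a: U(x) = x^4 - 6 x^2 + 2 x
grad_double_well = lambda x: 4.0 * x**3 - 12.0 * x + 2.0
traj = brownian_dynamics(grad_double_well, x0=[-1.0], n_steps=50000)

The same integrator applies to the five-dimensional protein folding toy model by supplying the gradient of the radial potential U(r) given in the Results section.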

Hardware used and training times. VAMPnets were trained on a single NVIDIA GeForce GTX 1080 GPU, requiring between 20 seconds (for the double-well problem) and 180 seconds (for NTL9) per run.

Code availability. TICA, k-means and MSM analyses were conducted with PyEMMA version 2.4, freely available at pyemma.org. VAMPnets are implemented using the freely available packages Keras [84] with tensorflow-gpu [85] as a backend. The code can be obtained at https://github.com/markovmodel/deeptime.

Data availability. Data for NTL9 can be requested from the authors of [1]. Data for all other examples is available at https://github.com/markovmodel/deeptime.

[1] K. Lindorff-Larsen, S. Piana, R. O. Dror, andD. E. Shaw. How fast-folding proteins fold. Sci-ence, 334:517–520, 2011.

[2] Nuria Plattner, Stefan Doerr, Gianni De Fabri-tiis, and Frank Noé. Protein-protein associationand binding mechanism resolved in atomic de-tail. Nat. Chem., in revision.

[3] K. J. Kohlhoff, D. Shukla, M. Lawrenz, G. R.Bowman, D. E. Konerding, D. Belov, R. B. Alt-man, and V. S. Pande. Cloud-based simulationson google exacycle reveal ligand modulation ofgpcr activation pathways. Nat. Chem., 6:15–21,2014.

[4] S. Doerr, M. J. Harvey, F. Noé, and G. De Fab-ritiis. HTMD: High-Throughput Molecular Dy-namics for Molecular Discovery. J. Chem. The-ory Comput., 12:1845–1852, 2016.

[5] I. S. Ufimtsev and T. J. Martinez. Graphicalprocessing units for quantum chemistry. Comp.Sci. Eng., 10:26–34, 2008.

[6] Dominik Marx and Jürg Hutter. Ab initiomolecular dynamics: Theory and implementa-tion. In J. Grotendorst (Ed.), editor, ModernMethods and Algorithms of Quantum Chemistry,volume 1 of NIC Series, pages 301–449. John vonNeumann Institute for Computing, Jülich, 2000.

[7] C. Schütte, A. Fischer, W. Huisinga, and P. Deu-flhard. A Direct Approach to ConformationalDynamics based on Hybrid Monte Carlo. J.Comput. Phys., 151:146–168, 1999.

[8] J.-H. Prinz, H. Wu, M. Sarich, B. G. Keller,M. Senne, M. Held, J. D. Chodera, C. Schütte,and F. Noé. Markov models of molecular kinet-ics: Generation and validation. J. Chem. Phys.,134:174105, 2011.

[9] W. C. Swope, J. W. Pitera, and F. Suits. De-scribing protein folding kinetics by molecular dy-namics simulations: 1. Theory. J. Phys. Chem.B, 108:6571–6581, 2004.

[10] F. Noé, I. Horenko, C. Schütte, and J. C. Smith.Hierarchical Analysis of Conformational Dynam-ics in Biomolecules: Transition Networks ofMetastable States. J. Chem. Phys., 126:155102,2007.

[11] J. D. Chodera, K. A. Dill, N. Singhal, V. S.Pande, W. C. Swope, and J. W. Pitera. Auto-matic discovery of metastable states for the con-struction of Markov models of macromolecularconformational dynamics. J. Chem. Phys., 126:155101, 2007.

[12] N. V. Buchete and G. Hummer. Coarse Mas-ter Equations for Peptide Folding Dynamics. J.

Page 12: VAMPnets for deep learning of molecular kinetics · VAMPnets for deep learning of molecular kinetics Andreas Mardt 1, Luca Pasquali , Hao Wu and Frank Noé Freie Universität Berlin,

12

Phys. Chem. B, 112:6057–6069, 2008.[13] M. K. Scherer, B. Trendelkamp-Schroer, F. Paul,

G. Perez-Hernandez, M. Hoffmann, N. Plattner,J.-H. Prinz, and F. Noé. PyEMMA 2: A softwarepackage for estimation, validation and analysis ofMarkov models. J. Chem. Theory Comput., 11:5525–5542, 2015.

[14] M. P. Harrigan, M. M. Sultan, C. X. Hernández,B. E. Husic, P. Eastman, C. R. Schwantes, K. A.Beauchamp, R. T. McGibbon, and V. S. Pande.Msmbuilder: Statistical models for biomoleculardynamics. Biophys J., 112:10–15, 2017.

[15] W. Humphrey, A. Dalke, and K. Schulten. Vmd- visual molecular dynamics. J. Molec. Graphics,14:33–38, 1996.

[16] R. T. McGibbon, K. A. Beauchamp, M. P. Har-rigan, C. Klein, J. M. Swails, C. X. Hernández,C. R. Schwantes, L. P. Wang, T. J. Lane, andV. S. Pande. Mdtraj: A modern open library forthe analysis of molecular dynamics trajectories.Biophys J., 109:1528–1532, 2015.

[17] F. Noé and F. Nüske. A variational approach tomodeling slow processes in stochastic dynamicalsystems. Multiscale Model. Simul., 11:635–655,2013.

[18] F. Nüske, B. G. Keller, G. Pérez-Hernández,A. S. J. S. Mey, and F. Noé. Variational ap-proach to molecular kinetics. J. Chem. TheoryComput., 10:1739–1752, 2014.

[19] G. Perez-Hernandez, F. Paul, T. Giorgino, G. DFabritiis, and Frank Noé. Identification of slowmolecular order parameters for markov modelconstruction. J. Chem. Phys., 139:015102, 2013.

[20] C. R. Schwantes and V. S. Pande. Improvementsin markov state model construction reveal manynon-native interactions in the folding of ntl9. J.Chem. Theory Comput., 9:2000–2009, 2013.

[21] L. Molgedey and H. G. Schuster. Separationof a mixture of independent signals using timedelayed correlations. Phys. Rev. Lett., 72:3634–3637, 1994. doi:10.1103/PhysRevLett.72.3634.

[22] A. Ziehe and K.-R. Müller. TDSEP — anefficient algorithm for blind separation usingtime structure. In ICANN 98, pages 675–680.Springer Science and Business Media, 1998. doi:10.1007/978-1-4471-1599-1_103.

[23] S. Harmeling, A. Ziehe, M. Kawanabe, and K.-R. Müller. Kernel-based nonlinear blind sourceseparation. Neur. Comp., 15:1089–1124, 2003.

[24] I. Mezić. Spectral properties of dynamical sys-tems, model reduction and decompositions. Non-linear Dynam., 41:309–325, 2005.

[25] P. J. Schmid and J. Sesterhenn. Dynamic modedecomposition of numerical and experimentaldata. In 61st Annual Meeting of the APS Divi-sion of Fluid Dynamics. American Physical So-ciety, 2008.

[26] J. H. Tu, C. W. Rowley, D. M. Luchtenburg,S. L. Brunton, and J. N. Kutz. On dynamicmode decomposition: Theory and applications.J. Comput. Dyn., 1(2):391–421, dec 2014. doi:10.3934/jcd.2014.1.391.

[27] M. O. Williams, I. G. Kevrekidis, and C. W.Rowley. A data–driven approximation of the

koopman operator: Extending dynamic modedecomposition. J. Nonlinear Sci., 25:1307–1346,2015.

[28] H. Wu, F. Nüske, F. Paul, S. Klus, P. Koltai,and F. Noé. Variational koopman models: slowcollective variables and molecular kinetics fromshort off-equilibrium simulations. J. Chem.Phys., 146:154104, 2017.

[29] F. Noé and C. Clementi. Collective variables forthe study of long-time kinetics from moleculartrajectories: theory and methods. Curr. Opin.Struc. Biol., 43:141–147, 2017.

[30] S. Klus, F. Nüske, P. Koltai, H. Wu,I. Kevrekidis, C. Schütte, and F. Noé. Data-driven model reduction and transfer operator ap-proximation. arXiv:1703.10112, 2017.

[31] F. Noé and C. Clementi. Kinetic distance and ki-netic maps from molecular dynamics simulation.J. Chem. Theory Comput., 11:5002–5011, 2015.

[32] F. Noé, R. Banisch, and C. Clementi. Commutemaps: separating slowly-mixing molecular con-figurations for kinetic modeling. J. Chem. The-ory Comput., 12:5620–5630, 2016.

[33] G. R. Bowman, V. S. Pande, and F. Noé, editors. An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation, volume 797 of Advances in Experimental Medicine and Biology. Springer Heidelberg, 2014.

[34] B. E. Husic and V. S. Pande. Ward clustering improves cross-validated Markov state models of protein folding. J. Chem. Theory Comput., 13:963–967, 2017.

[35] F. K. Sheong, D.-A. Silva, L. Meng, Y. Zhao, and X. Huang. Automatic State Partitioning for Multibody Systems (APM): An efficient algorithm for constructing Markov state models to elucidate conformational dynamics of multibody systems. J. Chem. Theory Comput., 11:17–27, 2015.

[36] H. Wu and F. Noé. Gaussian Markov transition models of molecular kinetics. J. Chem. Phys., 142:084104, 2015.

[37] M. Weber, K. Fackeldey, and C. Schütte. Set-free Markov state model building. J. Chem. Phys., 146:124133, 2017.

[38] G. R. Bowman, K. A. Beauchamp, G. Boxer, and V. S. Pande. Progress and challenges in the automated construction of Markov state models for full protein systems. J. Chem. Phys., 131:124101, 2009.

[39] B. Trendelkamp-Schroer, H. Wu, F. Paul, and F. Noé. Estimation and uncertainty of reversible Markov models. J. Chem. Phys., 143:174101, 2015.

[40] S. Kube and M. Weber. A coarse graining method for the identification of transition rates between molecular conformations. J. Chem. Phys., 126:024103, 2007. doi:10.1063/1.2404953.

[41] Y. Yao, R. Z. Cui, G. R. Bowman, D.-A. Silva, J. Sun, and X. Huang. Hierarchical Nyström methods for constructing Markov state models for conformational dynamics. J. Chem. Phys., 138:174106, 2013.

[42] K. Fackeldey and M. Weber. GenPCCA — Markov state models for non-equilibrium steady states. WIAS Report, 29:70–80, 2017.

[43] S. Gerber and I. Horenko. Toward a direct and scalable identification of reduced models for categorical processes. Proc. Natl. Acad. Sci. USA, 114:4863–4868, 2017.

[44] G. Hummer and A. Szabo. Optimal dimensionality reduction of multistate kinetic and Markov-state models. J. Phys. Chem. B, 119:9029–9037, 2015.

[45] S. Orioli and P. Faccioli. Dimensional reduction of Markov state models from renormalization group theory. J. Chem. Phys., 145:124120, 2016.

[46] F. Noé, H. Wu, J.-H. Prinz, and N. Plattner. Projected and hidden Markov models for calculating kinetics and metastable states of complex molecules. J. Chem. Phys., 139:184114, 2013.

[47] H. Wu and F. Noé. Variational approach for learning Markov processes from time series data. arXiv, 2017.

[48] R. T. McGibbon and V. S. Pande. Variational cross-validation of slow dynamical modes in molecular kinetics. J. Chem. Phys., 142:124105, 2015.

[49] Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[50] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, pages 1097–1105, 2012.

[51] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, A. K. Fidjeland, M. Riedmiller, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

[52] G. Pérez-Hernández and F. Noé. Hierarchical time-lagged independent component analysis: Computing slow modes and reaction coordinates for large molecular systems. J. Chem. Theory Comput., 12:6118–6129, 2016.

[53] F. Nüske, R. Schneider, F. Vitalini, and F. Noé. Variational tensor approach for approximating the rare-event kinetics of macromolecular systems. J. Chem. Phys., 144:054105, 2016.

[54] B. O. Koopman. Hamiltonian systems and transformations in Hilbert space. Proc. Natl. Acad. Sci. USA, 17:315–318, 1931.

[55] F. Knoch and T. Speck. Cycle representatives for the coarse-graining of systems driven into a non-equilibrium steady state. New J. Phys., 17:115004, 2015.

[56] H. Wang and C. Schütte. Building Markov state models for periodically driven non-equilibrium systems. J. Chem. Theory Comput., 11:1819–1831, 2015.

[57] I. Horenko, C. Hartmann, C. Schütte, and F. Noé. Data-based parameter estimation of generalized multidimensional Langevin processes. Phys. Rev. E, 76:016706, 2007.

[58] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst., 2(4):303–314, 1989.

[59] D. Eigen, J. Rolfe, R. Fergus, and Y. LeCun. Understanding deep architectures using a recursive convolutional network. ICLR, 2014.

[60] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In J. Platt et al., editors, Adv. Neural Inf. Process. Syst. (NIPS). MIT Press, 2006.

[61] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Adv. Neural Inf. Process. Syst. (NIPS), volume 19, page 153. MIT Press, 2007.

[62] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. ICML Proc., 28, 2013.

[63] S. Röblitz and M. Weber. Fuzzy spectral clustering by PCCA+: Application to Markov state models and data classification. Adv. Data Anal. Classif., 7:147–179, 2013.

[64] M. Sarich, F. Noé, and C. Schütte. On the approximation quality of Markov state models. Multiscale Model. Simul., 8:1154–1177, 2010.

[65] F. Noé, C. Schütte, E. Vanden-Eijnden, L. Reich, and T. R. Weikl. Constructing the full ensemble of folding pathways from short off-equilibrium simulations. Proc. Natl. Acad. Sci. USA, 106:19011–19016, 2009.

[66] R. L. T. Hahnloser. On the piecewise analysis of networks of linear threshold neurons. Neural Netw., 11(4):691–697, 1998.

[67] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML Proc., volume 27, pages 807–814, 2010.

[68] M. P. Harrigan and V. S. Pande. Landmark kernel tICA for conformational dynamics. bioRxiv, 123752, 2017.

[69] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

[70] F. Nüske, H. Wu, C. Wehmeyer, C. Clementi, and F. Noé. Markov state models from short non-equilibrium simulations – analysis and correction of estimation bias. arXiv:1701.01665, 2017.

[71] H. Wu, F. Paul, C. Wehmeyer, and F. Noé. Multiensemble Markov models of molecular thermodynamics and kinetics. Proc. Natl. Acad. Sci. USA, 113(23):E3221–E3230, 2016. doi:10.1073/pnas.1525092113.

[72] H. Wu, A. S. J. S. Mey, E. Rosta, and F. Noé. Statistically optimal analysis of state-discretized trajectory data from multiple thermodynamic states. J. Chem. Phys., 141:214106, 2014.

[73] J. D. Chodera, W. C. Swope, F. Noé, J.-H. Prinz, and V. S. Pande. Dynamical reweighting: Improved estimates of dynamical properties from simulations at multiple temperatures. J. Chem. Phys., 134:244107, 2011.

[74] J.-H. Prinz, J. D. Chodera, V. S. Pande, W. C. Swope, J. C. Smith, and F. Noé. Optimal use of data in parallel tempering simulations for the construction of discrete-state Markov models of biomolecular dynamics. J. Chem. Phys., 134:244108, 2011.

[75] E. Rosta and G. Hummer. Free energies from dynamic weighted histogram analysis using unbiased Markov state model. J. Chem. Theory Comput., 11:276–285, 2015.

[76] A. S. J. S. Mey, H. Wu, and F. Noé. xTRAM: Estimating equilibrium expectations from time-correlated simulation data at multiple thermodynamic states. Phys. Rev. X, 4:041018, 2014.

[77] S. Olsson, H. Wu, F. Paul, C. Clementi, and F. Noé. Combining experimental and simulation data of molecular processes via augmented Markov models. Proc. Natl. Acad. Sci. USA, in press.

[78] N. S. Hinrichs and V. S. Pande. Calculation of the distribution of eigenvalues and eigenvectors in Markovian state models for molecular dynamics. J. Chem. Phys., 126:244101, 2007.

[79] F. Noé. Probability distributions of molecular observables computed from Markov models. J. Chem. Phys., 128:244103, 2008.

[80] J. D. Chodera and F. Noé. Probability distributions of molecular observables computed from Markov models. II: Uncertainties in observables and their time-evolution. J. Chem. Phys., 133:105102, 2010.

[81] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86:2278–2324, 1998.

[82] K. T. Schütt, P.-J. Kindermans, H. E. Sauceda, S. Chmiela, A. Tkatchenko, and K.-R. Müller. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. arXiv:1706.08566, 2017.

[83] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15:1929–1958, 2014.

[84] F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.

[85] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems. arXiv:1603.04467, 2015.

Author contributions: A.M. and L.P. conducted research and developed software. H.W. and F.N. designed research and developed theory. All authors wrote the paper.

Acknowledgements: We are grateful to Cecilia Clementi, Robert T. McGibbon and Max Welling for valuable discussions. This work was funded by Deutsche Forschungsgemeinschaft (Transregio 186/A12, SFB 1114/A4, NO 825/4-1 as part of research group 2518) and the European Research Council (ERC StG 307494 "pcCell").