
A Generative Angular Model of Protein Structure Evolution

Michael Golden,*,1 Eduardo García-Portugués,2,3,5 Michael Sørensen,3 Kanti V. Mardia,1,4 Thomas Hamelryck,5,6 and Jotun Hein1

1 Department of Statistics, University of Oxford, Oxford, United Kingdom
2 Department of Statistics, Carlos III University of Madrid, Madrid, Spain
3 Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
4 Department of Mathematics, University of Leeds, Leeds, United Kingdom
5 Bioinformatics Centre, Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, Copenhagen, Denmark
6 Image Section, Department of Computer Science, University of Copenhagen, Copenhagen, Denmark

*Corresponding author: E-mail: golden@stats.ox.ac.uk.

Associate editor: Jeffrey Thorne

Protein families were obtained from the HOMSTRAD database.

Abstract

Recently described stochastic models of protein evolution have demonstrated that the inclusion of structural information in addition to amino acid sequences leads to a more reliable estimation of evolutionary parameters. We present a generative, evolutionary model of protein structure and sequence that is valid on a local length scale. The model concerns the local dependencies between sequence and structure evolution in a pair of homologous proteins. The evolutionary trajectory between the two structures in the protein pair is treated as a random walk in dihedral angle space, which is modeled using a novel angular diffusion process on the two-dimensional torus. Coupling sequence and structure evolution in our model allows for modeling both "smooth" conformational changes and "catastrophic" conformational jumps, conditioned on the amino acid changes. The model has interpretable parameters and is comparatively more realistic than previous stochastic models, providing new insights into the relationship between sequence and structure evolution. For example, using the trained model we were able to identify an apparent sequence–structure evolutionary motif present in a large number of homologous protein pairs. The generative nature of our model enables us to evaluate its validity and its ability to simulate aspects of protein evolution conditioned on an amino acid sequence, a related amino acid sequence, a related structure or any combination thereof.

Key words: evolution, protein structure, probabilistic model, directional statistics.

Introduction

Recently, several studies (Challis and Schmidler 2012; Herman et al. 2014) have proposed joint stochastic models of evolution which take into account simultaneous alignment of protein sequence and structure. These studies point out the limitations of earlier non-probabilistic methods, which often rely on heuristic procedures to infer parameters of interest. A major disadvantage of using heuristic procedures is that they typically fail to account for sources of uncertainty. For example, relying on a single fixed alignment, which is highly unlikely to be the true underlying alignment, may bias the inference of the posterior distribution over evolutionary trees.

We present a generative evolutionary model, ETDBN (Evolutionary Torus Dynamic Bayesian Network), for pairs of homologous proteins. ETDBN captures dependencies between sequence and structure evolution, accounts for alignment uncertainty, and models the local dependencies between aligned sites.

A key step in modeling protein structure evolution is selecting a suitable structural representation and corresponding evolutionary model. Early works by Gutin and Badretdinov (1994) and Grishin (1997) represented protein structure using three-dimensional Cartesian coordinates of protein backbone atoms and used diffusion processes to model the relationship between structural distance (measured using RMSD) and sequence similarity. More recent publications by Challis and Schmidler (2012) and Herman et al. (2014) likewise used the three-dimensional Cartesian coordinates of amino acid Cα atoms to represent protein structure, and additionally used Ornstein-Uhlenbeck (OU) processes to construct Bayesian probabilistic models of protein structure evolution. These models emphasize estimation of evolutionary parameters such as the evolutionary time between species, tree topologies and alignment, and attempt to fully account for sources of uncertainty. For the sake of computational tractability, the aforementioned approaches treat the Cartesian coordinates associated with atoms as evolving independently of one another. A non-probabilistic approach by Echave (2008) and Echave and Fernández (2010), referred to as the Linearly Forced Elastic Network Model (LFENM), treats protein structures as a collection of Cα atoms connected by spring forces. The major benefit of LFENMs is that they do not assume independence of atomic coordinates and take into account non-local dependencies due to physical interactions. In their current formulation, LFENMs do not distinguish between the differing chemical nature of different amino acids and therefore do not account for the variable effect of sequence mutation on protein structure evolution.

© The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Mol. Biol. Evol. 34(8):2085–2100 doi:10.1093/molbev/msx137. Advance Access publication April 27, 2017.

Rather than using a Cartesian coordinate representation, our model, ETDBN, uses a dihedral angle representation motivated by the non-evolutionary TorusDBN model (Boomsma et al. 2008, 2014). TorusDBN represents a single protein structure as a sequence of $(\phi, \psi)$ dihedral angle pairs, which are modeled using continuous bivariate angular distributions (Frellsen et al. 2012). Likewise, ETDBN treats protein structure as a random walk in space, again making use of the $\phi$ and $\psi$ dihedral angles (top of fig. 1).

The dihedral angle representation is informed by the chemical nature of peptide bonds. Each amino acid in a protein peptide chain is covalently bonded to the next via a peptide bond. Peptide bonds have a partial double bond nature that results in a planar configuration of atoms in space. This configuration allows the protein backbone structure to be largely described in terms of a series of $\phi$ and $\psi$ dihedral angles that define the relationship between the planes in three-dimensional space. A benefit of this representation is that it bypasses the need for structural alignment, unlike models on Cartesian coordinates, which typically need to additionally superimpose the structures for comparison purposes (Herman et al. 2014). Having to account for superimposition introduces an additional source of uncertainty. A further advantage of the dihedral angle representation is that there are fewer degrees of freedom per amino acid, and therefore typically fewer parameters are required in order to model their evolution.
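As a concrete illustration of the representation described above, the following minimal sketch computes a signed dihedral angle from four atom positions. It is not code from the paper: the function name `dihedral` and the tuple-based coordinates are assumptions made for illustration. For a residue, $\phi$ is conventionally computed from the backbone atoms C(i-1), N, Cα, C and $\psi$ from N, Cα, C, N(i+1).

```python
import math

def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) about the p1-p2 bond.

    Hypothetical helper: uses the common atan2 formulation, so the
    result lies in [-pi, pi] and needs no separate quadrant handling.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)          # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.atan2(y, x)
```

Because the angle depends only on relative atom positions, it is unchanged by rotating or translating the whole structure, which is why no superimposition step is needed.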

The evolution of dihedral angles in ETDBN is modeled using a novel stochastic diffusion process developed in García-Portugués et al. (2017). In addition to this, a coupling is introduced such that an amino acid change can lead to a jump in dihedral angles and a change in diffusion process, allowing the model to capture changes in amino acid that are directionally coupled with changes in dihedral angle or secondary structure. As in Challis and Schmidler (2012) and Herman et al. (2014), the insertion and deletion (indel) evolutionary process is also modeled in order to account for alignment uncertainty (Thorne et al. 1992).

The OU processes used in Challis and Schmidler (2012) and Herman et al. (2014) ignore bond lengths and treat Cα atoms as evolving independently for the sake of computational tractability. Furthermore, the OU process makes Gaussian assumptions. From a generative perspective, these properties will lead to evolved proteins with Cα atoms that are unnaturally dispersed in space. Bond lengths are also ignored in ETDBN, but can be plausibly fixed or modeled. As a result, it is expected that the use of angular diffusions will much more naturally capture the underlying protein structure manifold.

Two or more homologous proteins will share a common ancestor, which leads to underlying tree-like dependencies. These dependencies manifest themselves most noticeably in the degree of amino acid sequence similarity between two homologous proteins. The strength of these dependencies is assumed to be a result of two major factors: the time since the common ancestor and the rate of evolution.

Failing to account for evolutionary dependencies can lead to false conclusions (Felsenstein 1985), whereas accounting for evolutionary dependencies allows information from homologous proteins to be incorporated in a principled manner. This can lead to more accurate inferences, such as the prediction of a protein structure from a homologous protein sequence and structure, known as homology modeling (Arnold et al. 2006). Stochastic models such as ETDBN are not expected to compete with homology modeling software such as SWISS-MODEL (Arnold et al. 2006). However, they allow for estimation of evolutionary parameters and statements about uncertainty to be made in a statistically rigorous manner.

Most models of structural evolution ignore dependencies amongst sites because of the increased computational demand and complexity associated with such models. These dependencies are expected to influence patterns of evolution, specifically patterns of amino acid substitution. The current model deals with local dependencies only, that is, dependencies that are expected to arise due to interactions between neighboring amino acids, for example between amino acids in an α-helix. ETDBN does not account for global dependencies, that is, dependencies that result in the globular nature of proteins (Boomsma et al. 2008). In ETDBN, we model local dependencies by using a Hidden Markov Model (HMM) to capture dependencies amongst neighboring aligned positions. HMMs such as PASSML (Liò et al. 1998) have been successfully used to predict protein secondary structure from aligned sequences; however, these models typically have the disadvantage that they assume a canonical secondary structure shared amongst all the sequences being analyzed. This restricts analysis to closely related sequences, where conservation of secondary structure is a reasonable assumption. ETDBN does not assume a canonical secondary structure, but instead uses a phylogenetic HMM approach similar to Siepel and Haussler (2004) that assumes dependencies between evolutionary processes at neighboring aligned positions.

Parameters of ETDBN were estimated using 1200 homologous protein pairs from the HOMSTRAD database (Mizuguchi et al. 1998). The resulting model provides a realistic prior distribution over proteins and protein structure evolution in comparison to previous stochastic models. Doing so enables biological insights into the relationship between sequence and structure evolution, such as patterns of amino acid change that are informative of patterns of structural change (Grishin 2001). It was with these features in mind that ETDBN was developed.

Evolutionary Model

Overview

ETDBN is a dynamic Bayesian network model of local protein sequence and structure evolution along a pair of aligned homologous proteins $p_a$ and $p_b$. ETDBN can be viewed as an HMM (see fig. 1). Each hidden node of the HMM, corresponding to an aligned position, adopts an evolutionary hidden state specifying a distribution over three different observation pairs: a pair of amino acid characters, a pair of dihedral angles, and a pair of secondary structure classifications. A transition probability matrix specifies neighboring dependencies between adjacent evolutionary states. For example, transitions along the alignment between hidden states encoding predominantly α-helix evolution would be expected to occur more frequently than transitions between an evolutionary hidden state encoding predominantly α-helix evolution and another encoding predominantly β-sheet evolution.

Ideally, the underlying hidden states would not just vary across the length of the alignment, as captured by the HMM in the current model, but also evolve along the branches of the phylogenetic tree. This remains computationally intractable at present. Allowing the hidden states to evolve along the tree would allow capturing large structural changes, even those induced by a single mutation. For now, we model such events using a jump model (see below).

Partially in order to mitigate this, each hidden state specifies a distribution over a pair of site-classes at each aligned position. This gives rise to the possibility of a 'jump event'. A jump event allows a large change in dihedral angle or secondary structure (e.g., helix to sheet) to occur at a given aligned position, and also introduces a directional coupling, so that changes in amino acid are informative of changes in dihedral angle or secondary structure conformation.

Observation Types

The two proteins $p_a$ and $p_b$ in a homologous pair are associated with a pair of observation sequences, $O_a$ and $O_b$ respectively, obtained from experimental data. An $i$th site observation pair $O^i = (O_a^{x(i)}, O_b^{y(i)})$ is associated with every aligned site $i$ in an alignment $M_{ab}$ of $p_a$ and $p_b$, where $M_{ab}^i \in \left\{ \binom{x}{y}, \binom{x}{-}, \binom{-}{y} \right\}$ specifies the homology relationship at position $i$ of the alignment (homologous, deletion with respect to $p_a$, and insertion with respect to $p_a$, respectively). $i$ is taken to run from 1 to $m$, where $m$ is the length of the alignment $M_{ab}$, and $x \in \{1, \ldots, |p_a|\}$ and $y \in \{1, \ldots, |p_b|\}$ specify the indices of the positions in $p_a$ and $p_b$, respectively. $|p_a|$ and $|p_b|$ give the number of sites in $p_a$ and $p_b$, respectively.

Each site observation, $O_a^{x(i)}$ and $O_b^{y(i)}$, contains amino acid and structural information corresponding to the two Cα atoms at aligned site $i$ belonging to each of the two proteins. A site observation corresponding to a particular protein at aligned site $i$, $O_a^{x(i)}$, is comprised of three different data types associated with the Cα atom: an amino acid ($A_a^{x(i)}$; discrete, one of twenty canonical amino acids), $\phi$ and $\psi$ dihedral angles ($X_a^{x(i)} = \langle \phi_a^{x(i)}, \psi_a^{x(i)} \rangle$; continuous, bivariate), and a secondary structure classification ($S_a^{x(i)}$; discrete, one of three classes: helix (H), sheet (S) or coil (C)). Therefore, $O_a^{x(i)} = (A_a^{x(i)}, X_a^{x(i)}, S_a^{x(i)})$ and $O_b^{y(i)} = (A_b^{y(i)}, X_b^{y(i)}, S_b^{y(i)})$.

FIG. 1. Above: dihedral angle representation. A small section of a single protein backbone (three amino acids) with $\phi$ and $\psi$ dihedral angles shown, together with Cα atoms, which attach to the amino acid side-chains. Each amino acid side-chain determines the characteristic nature of each amino acid. Every amino acid position corresponds to a hidden node in the HMM below. Note that we only show a single protein, whereas the model considers a pair. Below: depiction of the HMM architecture of ETDBN, where each H along the horizontal axis represents an evolutionary hidden node. The horizontal edges between evolutionary hidden nodes encode neighboring dependencies between aligned sites. The arrows between the evolutionary hidden nodes and site-class pair nodes encode the conditional independence between the observation pair variables: $A_a^{x_i}, A_b^{y_i}$ (amino acid site pair), $X_a^{x_i} = \langle \phi_a^{x_i}, \psi_a^{x_i} \rangle$, $X_b^{y_i} = \langle \phi_b^{y_i}, \psi_b^{y_i} \rangle$ (dihedral angle site pair), and $S_a^{x_i}, S_b^{y_i}$ (secondary structure class site pair). The circles represent continuous variables and the rectangles represent discrete variables.
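The observation layout described above can be sketched as a small data structure. This is only an illustrative encoding, not the authors' implementation: the names `SiteObservation`, `AlignedColumn` and `column_type` are hypothetical, and `None` is an assumed convention for a missing data type or a gapped position.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SiteObservation:
    """One site observation O = (A, X, S); any field may be missing (None)."""
    amino_acid: Optional[str] = None                # A: one of the 20 canonical letters
    dihedral: Optional[Tuple[float, float]] = None  # X: (phi, psi) in radians
    secondary_structure: Optional[str] = None       # S: 'H', 'S' or 'C'

# One alignment column: the observation from p_a and the one from p_b.
# None on one side marks a gap (assumed encoding).
AlignedColumn = Tuple[Optional[SiteObservation], Optional[SiteObservation]]

def column_type(column: AlignedColumn) -> str:
    """Classify an alignment column by which proteins contribute a site."""
    obs_a, obs_b = column
    if obs_a is not None and obs_b is not None:
        return "homologous"
    if obs_a is not None:
        return "a_only"   # site present in p_a, gap in p_b
    if obs_b is not None:
        return "b_only"   # gap in p_a, site present in p_b
    raise ValueError("an alignment column needs at least one observed site")
```

Keeping every field optional mirrors the model's ability to work with partial data, for example sequences without structures.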

Model Structure

The sequence of hidden nodes in the HMM is written as $H = (H^1, H^2, \ldots, H^m)$. Each hidden node $H^i$ in the HMM corresponds to a site observation pair $O_a^{x(i)}$ and $O_b^{y(i)}$ at an aligned site $i$ in the alignment $M_{ab}$. Initially, we treat the alignment $M_{ab}$ as given a priori, but later modify the HMM to marginalize out an unobserved alignment.

The model is parameterized by $h$ hidden states. Every hidden node $H^i$ corresponding to an aligned site $i$ takes an integer value from 1 to $h$ for the hidden state at node $H^i$. In turn, each hidden state specifies a distribution over a site-class pair $(r_a^i, r_b^i)$ as a function of evolutionary time. A site-class pair consists of two site-classes: $r_a^i$ and $r_b^i$. Each of the two site-classes takes an integer value 1 or 2, that is, $(r_a^i, r_b^i) \in \{(1,1), (1,2), (2,1), (2,2)\}$. We return to the specific role of the site-class pairs in the next section.

The state of $H^i$, together with the site-class pair $(r_a^i, r_b^i)$ and the evolutionary time separating proteins $p_a$ and $p_b$, $t_{ab}$, specify a distribution over three conditionally independent stochastic processes describing each of the three types of site observation pairs: $A^i = (A_a^{x(i)}, A_b^{y(i)})$, $X^i = (X_a^{x(i)}, X_b^{y(i)})$ and $S^i = (S_a^{x(i)}, S_b^{y(i)})$. This conditional independence structure allows the likelihood of a site observation pair at an aligned site $i$ to be written as follows:

$$
p(O^i \mid H^i, r_a^i, r_b^i, t_{ab}) =
\underbrace{p(A^i \mid H^i, r_a^i, r_b^i, t_{ab})}_{\text{amino acid evolution}} \times
\underbrace{p(X^i \mid H^i, r_a^i, r_b^i, t_{ab})}_{\text{dihedral angle evolution}} \times
\underbrace{p(S^i \mid H^i, r_a^i, r_b^i, t_{ab})}_{\text{secondary structure evolution}}
\tag{1}
$$

The assumption of conditional independence provides computational tractability, allowing us to avoid costly marginalization when certain combinations of data are missing (e.g., amino acid sequences present, but secondary structures and dihedral angles missing).

Stochastic Processes Modeling Evolutionary Dependencies

Each site-class couples together three time-reversible stochastic processes that separately describe the evolution of the three pairs of observation types, as in equation (1). Each site-class is intended to capture both physical and evolutionary features pertaining to sequence and structure. Parameters that correspond to a particular site-class are termed site-class specific, whereas parameters that are shared across all site-classes are termed global. The use of site-class specific parameters, such as site-class specific amino acid frequencies and dihedral angle diffusion parameters, as described in the next section, is intended to model site-specific physical–chemical properties (Halpern and Bruno 1998; Koshi and Goldstein 1998; Lartillot and Philippe 2004).

Amino Acid Evolution

As is typical with models of sequence evolution, amino acid evolution, $p(A_a^{x(i)}, A_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$, is described by a Continuous-Time Markov Chain (CTMC). Each amino acid CTMC is parameterized in the following way: the exchangeability of amino acids is described by a $20 \times 20$ symmetric global exchangeability matrix $S$ (190 free parameters; Whelan and Goldman 2001), a site-class specific set of 20 amino acid equilibrium frequencies $\Pi_r^h = \mathrm{diag}\{\pi_1, \pi_2, \ldots, \pi_{20}\}$ (19 free parameters per site-class), and a site-class specific scaling factor $\Lambda_r^h$ (one free parameter per site-class). Together these parameters define a site-class specific time-reversible amino acid rate matrix $Q_r^h = \Lambda_r^h S \Pi_r^h$. The stationary distribution of $Q_r^h$ is given by the amino acid equilibrium frequencies $\Pi_r^h$.

Secondary Structure Evolution

Secondary structure evolution, $p(S_a^{x(i)}, S_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$, is also described by a CTMC. For the sake of simplicity, we use only three discrete classes to describe secondary structure at each position: helix (H), sheet (S) and random coil (C).

The exchangeability of secondary structure classes at a position is described by a $3 \times 3$ symmetric global exchangeability matrix $V$ and a site-class specific set of three secondary structure equilibrium frequencies $\Omega_r^h = \mathrm{diag}\{\pi_1, \pi_2, \pi_3\}$. Together they define a site-class specific time-reversible secondary structure rate matrix $R_r^h = V \Omega_r^h$ with stationary distribution $\Omega_r^h$.
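To make the CTMC construction concrete, here is a minimal sketch for the three-class secondary structure chain; the same recipe applies to the $20 \times 20$ amino acid chain. Everything in it is illustrative rather than the authors' code: the function names are hypothetical, the diagonal of the rate matrix is set so that each row sums to zero (the usual CTMC convention, assumed here because the text only gives the product form), and the matrix exponential uses a simple Taylor series with scaling and squaring instead of a library routine.

```python
def rate_matrix(exch, freqs, scale=1.0):
    """Reversible rate matrix with off-diagonals scale * exch[i][j] * freqs[j].

    The diagonal is filled in so that every row sums to zero (assumed convention).
    """
    n = len(freqs)
    q = [[scale * exch[i][j] * freqs[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    return q

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(rate, t, squarings=10, terms=12):
    """P(t) = exp(rate * t), via a Taylor series on rate * t / 2**squarings."""
    n = len(rate)
    step = t / 2 ** squarings
    small = [[rate[i][j] * step for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for order in range(1, terms + 1):
        term = [[v / order for v in row] for row in matmul(term, small)]
        result = [[r + v for r, v in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):          # undo the scaling by repeated squaring
        result = matmul(result, result)
    return result

def pair_probability(i, j, freqs, rate, t):
    """p(state i in p_a, state j in p_b | t), with state i drawn from equilibrium."""
    return freqs[i] * transition_matrix(rate, t)[i][j]
```

Time-reversibility shows up as `pair_probability(i, j, ...)` being equal to `pair_probability(j, i, ...)`, so either protein can be treated as the starting point.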

Dihedral Angle Evolution

Central to our model is the evolutionary dependence between dihedral angles, $p(X_a^{x(i)}, X_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$. Typically, the continuous-time evolution of continuous-state random variables is modeled by a diffusive process such as the OU process, as in Challis and Schmidler (2012). However, an OU process is not appropriate for dihedral angles, as they have a natural periodicity. For this reason, a bivariate diffusion that captures the periodic nature of dihedral angles, the Wrapped Normal (WN) diffusion, was specifically developed for this paper in García-Portugués et al. (2017).

Topologically, the WN diffusion (see fig. 2 for a pictorial example) can be thought of as the analogue of the OU process on the torus $\mathbb{T}^2 = [-\pi, \pi) \times [-\pi, \pi)$. The WN diffusion arises as the wrapping on $\mathbb{T}^2$ of the following Euclidean diffusion:

$$
dX_t = \underbrace{A \left( \sum_{k \in \mathbb{Z}^2} (\mu - X_t - 2k\pi)\, w_k(X_t) \right)}_{\text{drift coefficient}} dt + \underbrace{\Sigma^{1/2}}_{\text{diffusion coefficient}} dW_t
\tag{2}
$$

where $W_t$ is the two-dimensional Wiener process, $A$ is the drift matrix, $\mu \in \mathbb{T}^2$ is the stationary mean, $\Sigma$ is the infinitesimal covariance matrix, and

$$
w_k(\theta) = \frac{\phi_{\frac{1}{2} A^{-1} \Sigma}(\theta - \mu + 2k\pi)}{\sum_{m \in \mathbb{Z}^2} \phi_{\frac{1}{2} A^{-1} \Sigma}(\theta - \mu + 2m\pi)}, \quad \theta \in \mathbb{T}^2
\tag{3}
$$

is a probability density function (pdf) for $k \in \mathbb{Z}^2$. $\phi_{\Sigma}$ stands for the pdf of a bivariate Gaussian $N(0, \Sigma)$. The pdf (eq. 3) weights the linear drifts of equation (2) such that they become smooth and periodic.
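The weighting in equation (3) and the drift term of equation (2) can be sketched numerically as follows. This is an illustrative reading of the two formulas, not the authors' implementation: the function names are hypothetical, matrices are plain nested tuples, and the infinite sums over $\mathbb{Z}^2$ are truncated to windings with entries between $-K$ and $K$, which is an assumption that is harmless only when the stationary covariance is small relative to $2\pi$.

```python
import math
from itertools import product

TWO_PI = 2.0 * math.pi

def gauss2_pdf(x, cov):
    """Density of a zero-mean bivariate Gaussian N(0, cov) at the point x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    quad = (d * x[0] ** 2 - (b + c) * x[0] * x[1] + a * x[1] ** 2) / det
    return math.exp(-0.5 * quad) / (TWO_PI * math.sqrt(det))

def stationary_cov(A, Sigma):
    """The matrix (1/2) A^{-1} Sigma appearing in equation (3)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    return tuple(tuple(0.5 * sum(inv[i][k] * Sigma[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

def drift_weights(theta, A, mu, Sigma, K=3):
    """Truncated weights w_k(theta) of equation (3), keyed by winding k."""
    cov = stationary_cov(A, Sigma)
    windings = list(product(range(-K, K + 1), repeat=2))
    dens = [gauss2_pdf((theta[0] - mu[0] + TWO_PI * k[0],
                        theta[1] - mu[1] + TWO_PI * k[1]), cov) for k in windings]
    total = sum(dens)
    return {k: v / total for k, v in zip(windings, dens)}

def wn_drift(theta, A, mu, Sigma, K=3):
    """Drift coefficient of equation (2) evaluated at theta."""
    weights = drift_weights(theta, A, mu, Sigma, K)
    s0 = sum((mu[0] - theta[0] - TWO_PI * k[0]) * w for k, w in weights.items())
    s1 = sum((mu[1] - theta[1] - TWO_PI * k[1]) * w for k, w in weights.items())
    return (A[0][0] * s0 + A[0][1] * s1, A[1][0] * s0 + A[1][1] * s1)
```

Evaluating `wn_drift` at `theta` and at `theta` shifted by $2\pi$ in either coordinate gives the same vector up to truncation error, which is the periodicity the weights are designed to produce.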

It is shown in García-Portugués et al. (2017) that the stationary distribution of the WN diffusion is a $WN(\mu, \frac{1}{2} A^{-1} \Sigma)$, where a $WN(\mu, \Sigma)$ distribution has pdf

$$
p_{WN}(\theta \mid \mu, \Sigma) = \sum_{k \in \mathbb{Z}^2} \phi_{\Sigma}(\theta - \mu + 2k\pi)
\tag{4}
$$

Despite involving an infinite sum over $\mathbb{Z}^2$, taking just the first few terms of this sum provides a tractable and accurate approximation to the stationary density for most realistic parameter values.
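The truncation mentioned above can be sketched directly from equation (4). The code is a minimal illustration under assumptions, not the authors' implementation: `wn_pdf` and its truncation parameter `K` (windings from $-K$ to $K$ in each coordinate) are hypothetical names, and covariances are nested tuples.

```python
import math
from itertools import product

TWO_PI = 2.0 * math.pi

def gauss2_pdf(x, cov):
    """Density of a zero-mean bivariate Gaussian N(0, cov) at the point x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    quad = (d * x[0] ** 2 - (b + c) * x[0] * x[1] + a * x[1] ** 2) / det
    return math.exp(-0.5 * quad) / (TWO_PI * math.sqrt(det))

def wn_pdf(theta, mu, cov, K=2):
    """Equation (4) with the sum over Z^2 truncated to |k1|, |k2| <= K."""
    total = 0.0
    for k1, k2 in product(range(-K, K + 1), repeat=2):
        shifted = (theta[0] - mu[0] + TWO_PI * k1,
                   theta[1] - mu[1] + TWO_PI * k2)
        total += gauss2_pdf(shifted, cov)
    return total
```

For a covariance whose standard deviations are well below $\pi$, increasing `K` beyond 1 changes the value only negligibly, and the truncated density still integrates to approximately one over the torus.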

Maximum Likelihood Estimation (MLE) for diffusions is based on the transition probability density (tpd), which only has a tractable analytical form for very few specific processes. A highly tractable and accurate approximation to the tpd is given for the WN diffusion. This approximation results from weighting the tpd of the OU process in the same fashion as the linear drifts are weighted in equation (2), yielding the following multimodal pseudo-tpd:

$$
\tilde{p}(\theta_2 \mid \theta_1, A, \mu, \Sigma, t) = \sum_{m \in \mathbb{Z}^2} p_{WN}(\theta_2 \mid \mu_t^m, \Gamma_t)\, w_m(\theta_1)
\tag{5}
$$

with $\theta_1, \theta_2 \in \mathbb{T}^2$, $\mu_t^m = \mu + e^{-tA}(\theta_1 - \mu + 2\pi m)$ and $\Gamma_t = \int_0^t e^{-sA} \Sigma e^{-sA^T} ds$. The pseudo-tpd provides a good approximation to the true tpd in key circumstances: 1) $t \to 0$, since it collapses in the Dirac delta; 2) $t \to \infty$, since it converges to the stationary distribution; 3) high concentration, since the WN diffusion becomes an OU process. Furthermore, it is shown in García-Portugués et al. (2017) that the pseudo-tpd has a lower Kullback–Leibler divergence with respect to the true tpd than the Euler and Shoji–Ozaki pseudo-tpds for most typical scenarios and discretization times in the diffusion trajectory.

A further desirable property of the pseudo-tpd is that it obeys the time-reversibility equation, which in terms of $(X_a^{x(i)}, X_b^{y(i)})$ is

$$
\tilde{p}(X_b^{y(i)} \mid X_a^{x(i)}, A, \mu, \Sigma, t_{ab})\, p_{WN}(X_a^{x(i)} \mid \mu, \tfrac{1}{2} A^{-1} \Sigma)
= \tilde{p}(X_a^{x(i)} \mid X_b^{y(i)}, A, \mu, \Sigma, t_{ab})\, p_{WN}(X_b^{y(i)} \mid \mu, \tfrac{1}{2} A^{-1} \Sigma)
$$

Indeed, the WN diffusion is the unique time-reversible diffusion with constant diffusion coefficient and stationary pdf (eq. 4), in the same way the OU is with respect to a Gaussian. Time-reversibility is an assumption of the overall model and of many other models of sequence evolution. A benefit of time-reversibility in a pairwise model such as ETDBN is that one of the proteins in a pair may be arbitrarily chosen as the ancestor, thus avoiding a computationally expensive marginalization of an unobserved ancestor.
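The structure of the pseudo-tpd and its time-reversibility can be checked numerically in a simplified setting. The sketch below is a one-dimensional analogue on the circle, not the bivariate model of the paper: a scalar drift `alpha` and variance `sigma2` stand in for $A$ and $\Sigma$, so that $\frac{1}{2} A^{-1} \Sigma$ becomes `sigma2 / (2 * alpha)` and $\Gamma_t$ evaluates to `sigma2 * (1 - exp(-2 * alpha * t)) / (2 * alpha)`. All names and the truncation level `K` are assumptions made for illustration.

```python
import math

TWO_PI = 2.0 * math.pi

def normal_pdf(x, var):
    return math.exp(-0.5 * x * x / var) / math.sqrt(TWO_PI * var)

def wn_pdf_1d(theta, mu, var, K=5):
    """One-dimensional wrapped normal density, truncated to |k| <= K."""
    return sum(normal_pdf(theta - mu + TWO_PI * k, var) for k in range(-K, K + 1))

def pseudo_tpd_1d(theta2, theta1, alpha, mu, sigma2, t, K=5):
    """One-dimensional analogue of equation (5): density of theta2 given theta1."""
    stat_var = sigma2 / (2.0 * alpha)                     # stationary variance
    gamma_t = stat_var * (1.0 - math.exp(-2.0 * alpha * t))
    windings = range(-K, K + 1)
    dens = [normal_pdf(theta1 - mu + TWO_PI * m, stat_var) for m in windings]
    total = sum(dens)
    value = 0.0
    for m, d in zip(windings, dens):
        # OU mean after time t, started from the m-th unwrapped copy of theta1
        mu_t = mu + math.exp(-alpha * t) * (theta1 - mu + TWO_PI * m)
        value += wn_pdf_1d(theta2, mu_t, gamma_t, K) * d / total
    return value

def joint_density_1d(theta1, theta2, alpha, mu, sigma2, t, K=5):
    """Stationary density of theta1 times the pseudo-tpd to theta2."""
    stat_var = sigma2 / (2.0 * alpha)
    return wn_pdf_1d(theta1, mu, stat_var, K) * pseudo_tpd_1d(theta2, theta1, alpha, mu, sigma2, t, K)
```

Swapping `theta1` and `theta2` in `joint_density_1d` leaves the value unchanged, which is the circular counterpart of the time-reversibility equation above; for large `t` the pseudo-tpd approaches the stationary wrapped normal density.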

The likelihood of a dihedral angle observation pair $(X_a^{x(i)}, X_b^{y(i)})$, assuming that $X_a^{x(i)}$ is drawn from the stationary distribution, is given by

$$
p(X_a^{x(i)}, X_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab}) = p(X_a^{x(i)}, X_b^{y(i)} \mid A, \mu, \Sigma, t_{ab})
\approx \tilde{p}(X_b^{y(i)} \mid X_a^{x(i)}, A, \mu, \Sigma, t_{ab})\, p_{WN}(X_a^{x(i)} \mid \mu, \tfrac{1}{2} A^{-1} \Sigma)
\tag{6}
$$

$A$ and $\Sigma$ are constrained to yield a covariance matrix $A^{-1} \Sigma$. A parameterization that achieves this is $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2)$ and

$$
A = \begin{pmatrix} \alpha_1 & \frac{\sigma_1}{\sigma_2} \alpha_3 \\ \frac{\sigma_2}{\sigma_1} \alpha_3 & \alpha_2 \end{pmatrix}, \quad \alpha_1 \alpha_2 > \alpha_3^2.
$$

$\alpha_1$ and $\alpha_2$ are the drift components for the $\phi$ and $\psi$ dihedral angles, respectively. Dependence (correlation) between the dihedral angles is captured by $\alpha_3$. A depiction of a WN diffusion with given drift and diffusion parameters is shown in figure 2.
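The effect of this parameterization can be verified with a short sketch that builds $A$ and $\Sigma$ from the five scalars and forms the stationary covariance $\frac{1}{2} A^{-1} \Sigma$. The function names are hypothetical, and the extra checks that `alpha1`, `sigma1` and `sigma2` are positive are assumptions added here so that the resulting matrix is a valid covariance; the text itself only states $\alpha_1 \alpha_2 > \alpha_3^2$.

```python
def wn_parameters(alpha1, alpha2, alpha3, sigma1, sigma2):
    """Build the drift matrix A and Sigma = diag(sigma1^2, sigma2^2)."""
    if alpha1 * alpha2 <= alpha3 ** 2:
        raise ValueError("need alpha1 * alpha2 > alpha3 ** 2")
    if alpha1 <= 0 or sigma1 <= 0 or sigma2 <= 0:
        # Assumed extra conditions (not stated in the text) for a valid covariance.
        raise ValueError("alpha1, sigma1 and sigma2 must be positive")
    A = ((alpha1, sigma1 / sigma2 * alpha3),
         (sigma2 / sigma1 * alpha3, alpha2))
    Sigma = ((sigma1 ** 2, 0.0), (0.0, sigma2 ** 2))
    return A, Sigma

def stationary_covariance(A, Sigma):
    """Return (1/2) A^{-1} Sigma as a nested tuple."""
    (a, b), (c, d) = A
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    return tuple(tuple(0.5 * sum(inv[i][k] * Sigma[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))
```

Working through the algebra, the off-diagonal entries of the result both equal $-\sigma_1 \sigma_2 \alpha_3 / (2(\alpha_1 \alpha_2 - \alpha_3^2))$, so the matrix is symmetric even though $A$ itself generally is not.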

Site-Classes: Constant Evolution and Jump Events

We now turn to the meaning of the site-class pairs. Two modes of evolution are modeled: constant evolution and jump events. Constant evolution occurs when the site-class starting in protein $p_a$ at aligned site $i$, $r_a^i$, is the same as the site-class ending in protein $p_b$ at aligned site $i$, $r_b^i$, that is, $r_a^i = r_b^i$. Thus, the distribution over observation pairs at a site is specified by a single site-class.

As already stated, a site-class specifies the parameters of the three conditionally independent stochastic processes describing amino acid, dihedral angle and secondary structure evolution. A limitation of "constant evolution" is that the coupling between the three stochastic processes is somewhat weak. This in part stems from the time-reversibility of the stochastic processes: swapping the order of one of the three observation pairs at a homologous site, e.g., (Glycine, Proline) instead of (Proline, Glycine), does not alter the likelihood in equation (1). Restated from a generative perspective, there is no 'directional coupling': the direction of an amino acid interchange does not inform the direction of change in dihedral angle or secondary structure. For example, replacing a glycine in an α-helix in one protein with a proline at the homologous position in a second protein would be expected to break the α-helix in the second protein and to strongly inform the plausible dihedral angle conformations in the second protein.

Ideally, we would consider a model in which the underlying site-classes were not fixed over the evolutionary trajectory separating the two proteins, as in the case of constant evolution described above, but instead were able to "evolve" in time. This would allow occasional switches in the underlying site-class at a particular homologous site, which would create a stronger dependency between amino acid, dihedral angle, and secondary structure evolution and would furthermore capture the directional coupling we desire. Such an approach is considered computationally intractable because of the context-dependence introduced when neighboring dependencies amongst evolutionary trajectories at adjacent sites have to be considered.

In order to approximate this "ideal" model in a computationally efficient manner, we introduce the notion of a jump event. A jump event occurs when $r^i_a \neq r^i_b$. Whereas constant evolution is intended to capture angular drift (changes in dihedral angles localized to a region of the Ramachandran plot), a jump event is intended to create a directional coupling between amino acid and structure evolution and is also expected to capture angular shift (large changes in dihedral angles, possibly between distant regions of the Ramachandran plot).

The hidden state at node $H^i$, together with the evolutionary time $t_{ab}$ separating proteins $p_a$ and $p_b$, specifies a joint distribution over a site-class pair:

$$p(r^i_a, r^i_b \mid H^i, t_{ab}) = p(r^i_a \mid H^i, r^i_b, t_{ab})\, p(r^i_b \mid H^i), \qquad (7)$$

where

$$p(r^i_a \mid H^i, r^i_b, t_{ab}) = \begin{cases} e^{-\gamma_{H^i} t_{ab}} + \pi_{H^i, r^i_a}\left(1 - e^{-\gamma_{H^i} t_{ab}}\right) & \text{if } r^i_a = r^i_b, \\ \pi_{H^i, r^i_a}\left(1 - e^{-\gamma_{H^i} t_{ab}}\right) & \text{if } r^i_a \neq r^i_b, \end{cases}$$

and $p(r^i_a \mid H^i) = \pi_{H^i, r^i_a}$ and $p(r^i_b \mid H^i) = \pi_{H^i, r^i_b}$. $\pi_{H^i, r^i_a}$ and $\pi_{H^i, r^i_b}$ are model parameters specifying the probability of starting in site-class $r^i_a$ or $r^i_b$, respectively, corresponding to the hidden state specified by node $H^i$. $\gamma_{H^i} > 0$ is a model parameter giving the jump rate corresponding to the hidden state given by node $H^i$. With this form, the conditional probabilities sum to one over $r^i_a$.

The site-class jump probabilities have been chosen so that time-reversibility holds; in other words,

$$p(r^i_a \mid H^i, r^i_b, t_{ab})\, p(r^i_b \mid H^i) = p(r^i_b \mid H^i, r^i_a, t_{ab})\, p(r^i_a \mid H^i).$$
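The site-class pair probabilities and their reversibility can be verified with a few lines of Python. This is a minimal sketch rather than the authors' code: the function name is hypothetical, and the conditional probability is written with the frequency of the destination site-class $r^i_a$, which is the form under which the probabilities sum to one and the reversibility identity holds. The values of $\pi$ and $\gamma$ are those reported for hidden state 3 in the caption of figure 9; the evolutionary time 0.5 is an arbitrary choice.

```python
import math


def site_class_pair_probs(pi, gamma, t):
    """Joint probabilities p(r_a, r_b | H, t) for all site-class pairs.

    pi    -- starting probabilities of the site-classes for one hidden state
    gamma -- jump rate of that hidden state (> 0)
    t     -- evolutionary time separating the two proteins
    """
    stay = math.exp(-gamma * t)  # probability that no jump has occurred
    probs = {}
    for b, pi_b in enumerate(pi, start=1):
        for a, pi_a in enumerate(pi, start=1):
            # p(r_a | H, r_b, t): restart in r_a after a jump, or stay put.
            cond = pi_a * (1.0 - stay) + (stay if a == b else 0.0)
            probs[(a, b)] = cond * pi_b  # multiply by p(r_b | H)
    return probs


probs = site_class_pair_probs(pi=(0.691, 0.309), gamma=3.176, t=0.5)
total = sum(probs.values())
reversible = abs(probs[(1, 2)] - probs[(2, 1)]) < 1e-12
```

At $t_{ab} = 0$ all probability mass lies on the pairs (1, 1) and (2, 2); as $t_{ab}$ grows, the pair probabilities approach $\pi_{r_a}\pi_{r_b}$.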

The hidden state at node $H^i$, together with a site-class pair $(r^i_a, r^i_b)$ and the evolutionary time $t_{ab}$, specifies the joint likelihood over site observation pairs:

$$p(O^{x(i)}_a, O^{y(i)}_b \mid H^i, r^i_a, r^i_b, t_{ab}) = \begin{cases} p(O^{x(i)}_a, O^{y(i)}_b \mid H^i, r^i_c, t_{ab}) & \text{if } r^i_a = r^i_b = r^i_c, \\ p(O^{x(i)}_a \mid H^i, r^i_a)\, p(O^{y(i)}_b \mid H^i, r^i_b) & \text{if } r^i_a \neq r^i_b. \end{cases} \qquad (8)$$

In the case of constant evolution, evolution at aligned site $i$ is described in terms of the same site-class, $r^i_c$. Evolution is considered constant because each observation type is drawn from a single stochastic process specified by $H^i$ and $r^i_c$. Note that the strength of the evolutionary dependency within an observation pair is a function of the evolutionary time $t_{ab}$.

In the case of a jump event, the evolutionary processes are, after the evolutionary jump, restarted independently in the stationary distribution of the new site-class. Thus, the site observations $O^{x(i)}_a$ and $O^{y(i)}_b$ are assumed to be drawn from the stationary distributions of two separate stochastic processes corresponding to site-classes $r^i_a$ and $r^i_b$, respectively.

This implies that, conditional on a jump, the likelihood of the observations is no longer dependent on $t_{ab}$. A jump event is an abstraction that captures the end-points of the evolutionary process but ignores the potential evolutionary trajectory linking the two site observations. The advantage of abstracting the evolutionary trajectory is that there is no need to perform a computationally expensive marginalization over all possible trajectories, as might be necessary in a model where the hidden states evolve along a tree. The likelihood of an observation pair is now simply a sum over the four possible site-class pairs:

$$p(O^{x(i)}_a, O^{y(i)}_b \mid H^i, t_{ab}) = \sum_{(r^i_a, r^i_b) \in R} p(O^{x(i)}_a, O^{y(i)}_b \mid H^i, r^i_a, r^i_b, t_{ab})\, p(r^i_a, r^i_b \mid H^i, t_{ab}),$$

where $R = \{(1, 1), (1, 2), (2, 1), (2, 2)\}$ is the set of four site-class pairs, $p(O^{x(i)}_a, O^{y(i)}_b \mid H^i, r^i_a, r^i_b, t_{ab})$ is given by equation (8), and $p(r^i_a, r^i_b \mid H^i, t_{ab})$ is given by equation (7).
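The sum over site-class pairs can be sketched in Python as follows. The observation model here is a deliberately simple stand-in, not the ETDBN processes: observations are one of two letters, each site-class has its own letter frequencies (`FREQS`, made up for illustration), and the constant-evolution joint likelihood uses a reversible two-state substitution process. All function names are hypothetical. The jump weights use the destination-class form of the conditional probability, under which they sum to one.

```python
import math


def site_likelihood(obs_a, obs_b, pi, gamma, t, joint_lik, stationary_lik):
    """p(O_a, O_b | H, t): sum over the four site-class pairs (eqs. 7 and 8)."""
    stay = math.exp(-gamma * t)
    total = 0.0
    for r_a in range(2):
        for r_b in range(2):
            # Equation (7): p(r_a | H, r_b, t) * p(r_b | H).
            cond = pi[r_a] * (1.0 - stay) + (stay if r_a == r_b else 0.0)
            weight = cond * pi[r_b]
            # Equation (8): one shared process, or two independent draws.
            if r_a == r_b:
                lik = joint_lik(obs_a, obs_b, r_a, t)
            else:
                lik = stationary_lik(obs_a, r_a) * stationary_lik(obs_b, r_b)
            total += weight * lik
    return total


# Toy stand-in model (illustrative only): letter frequencies per site-class.
FREQS = [(0.9, 0.1), (0.2, 0.8)]


def toy_stationary(obs, r):
    return FREQS[r][obs]


def toy_joint(obs_a, obs_b, r, t):
    keep = math.exp(-t)
    trans = FREQS[r][obs_b] * (1.0 - keep) + (keep if obs_a == obs_b else 0.0)
    return FREQS[r][obs_a] * trans


# Summing over every possible observation pair should give probability one.
total = sum(
    site_likelihood(x, y, (0.691, 0.309), 3.176, 0.5, toy_joint, toy_stationary)
    for x in range(2) for y in range(2)
)
```

Because only the two end-points enter the jump branch, no integration over intermediate site-class trajectories is needed, which is the computational saving described above.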

Identification of Evolutionary Motifs Encoding Jump Events

In order to identify aligned sites having potential evolutionary motifs encoding jump events, a two-step criterion was developed.

For a particular protein pair, inference was performed under the model conditioned on the amino acid sequences and dihedral angles for both proteins $(A_a, A_b, X_a, X_b)$. Homologous sites corresponding to a single hidden state and with evidence of a jump event ($r^i_a \neq r^i_b$) at posterior probability > 0.90 were identified, that is, the $i$'s such that $p(H^i, r^i_a \neq r^i_b \mid A_a, A_b, X_a, X_b) > 0.90$.

FIG. 2. Drift vector field for the WN diffusion with $A = (1, 0.5; 0.5, 0.5)$, $\mu = (0, 0)$, and $\Sigma = (1.5)^2 I$. The color gradient represents the Euclidean norm of the drift. The contour lines represent the stationary distribution. An example trajectory starting at $x_0 = (0, 0)$ and ending at $x_2$, running in the time interval [0, 2], is depicted using a white-to-red color gradient indicating the progression of time. The periodic nature of the diffusion can be seen by the wrapping of both the stationary distribution and the trajectory at the boundaries of the square plane. The fact that the stationary distribution is not aligned with the horizontal and vertical axes illustrates the dependence (given by $\alpha_3$) between the $\phi$ and $\psi$ dihedral angles.

In a second filtering step, amino acid sequences and a single set of dihedral angles corresponding to one of the proteins were used ($A_a, A_b, X_a$ or $A_a, A_b, X_b$) to infer the posterior probability, this time at a lower threshold: $p(H^i, r^i_a \neq r^i_b \mid A_a, A_b, X_a) > 0.50$ or $p(H^i, r^i_a \neq r^i_b \mid A_a, A_b, X_b) > 0.50$. This second criterion ensured that the evolutionary motif was identifiable under typical conditions where one has limited access to structural information (in this case, a single protein structure in a pair). Only those aligned sites meeting both criteria were selected for downstream analysis.
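The two filtering steps amount to a simple threshold rule on three per-site posterior probabilities. The sketch below assumes these posteriors have already been computed by some inference routine and are supplied as equal-length lists; the function name, argument names, and example values are hypothetical.

```python
def select_motif_sites(post_both, post_a_only, post_b_only,
                       strict=0.90, relaxed=0.50):
    """Return indices of aligned sites that pass both jump criteria.

    post_both   -- p(H, r_a != r_b | A_a, A_b, X_a, X_b) per site
    post_a_only -- p(H, r_a != r_b | A_a, A_b, X_a) per site
    post_b_only -- p(H, r_a != r_b | A_a, A_b, X_b) per site
    """
    if not (len(post_both) == len(post_a_only) == len(post_b_only)):
        raise ValueError("posterior lists must have the same length")
    selected = []
    for i, p_both in enumerate(post_both):
        first = p_both > strict                      # both structures observed
        second = (post_a_only[i] > relaxed           # only one structure observed
                  or post_b_only[i] > relaxed)
        if first and second:
            selected.append(i)
    return selected


# Made-up posteriors for four aligned sites (illustration only).
sites = select_motif_sites(
    post_both=[0.95, 0.99, 0.40, 0.92],
    post_a_only=[0.60, 0.30, 0.80, 0.10],
    post_b_only=[0.20, 0.45, 0.90, 0.55],
)
```

Only sites 0 and 3 survive in this example: site 1 fails the second criterion and site 2 fails the first.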

Statistical Alignment: Modeling Insertions and Deletions

Protein sequences undergo not only amino acid transitions, due to underlying nucleotide mutations in the coding sequence, but also indel events. To account for this, a modified pairwise TKF92 alignment HMM based on Miklós et al. (2004) was implemented. The TKF92 alignment HMM was augmented with the ETDBN evolutionary hidden states in order to capture local sequence and structure evolutionary dependencies. Furthermore, it was modified such that neighboring dependencies amongst hidden states at adjacent alignment sites were modeled. For more details, see 'Statistical alignment' in the Supplementary Material online.

Whilst it is possible to fix the alignment in advance by pre-aligning the sequences using one of the many available alignment methods (Katoh et al. 2002; Edgar 2004) or by using a curated alignment (such as from the HOMSTRAD database), doing so ignores alignment uncertainty.

Training and Test Datasets

A training dataset of 1200 protein pairs (2400 proteins; 417870 site observation pairs) and a test dataset of 38 protein pairs (76 proteins; 14125 site observation pairs) were assembled from 1032 protein families in the HOMSTRAD database. For further details, see 'Construction of test and training datasets' in the Supplementary Material online.

Model Training and Selection

Maximum likelihood estimation of the model parameters $\Psi$ was done using Stochastic Expectation Maximization (StEM; Gilks et al. 1995).

For further details of the E- and M-steps of the StEM algorithm, and for details about model selection, please refer to 'Model training and selection' in the Supplementary Material online.

Results and Discussion

Selecting the Number of Hidden States

Fourteen models were trained (8, 16, 32, 48, 52, 56, 60, 64, 68, 72, 76, 80, 96, and 112 hidden state models). The 64 hidden state model was chosen as the best model, as it had the lowest Bayesian Information Criterion (BIC; fig. 3).
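For readers unfamiliar with BIC-based selection, the following sketch shows the general procedure using the standard definition $\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$, where $k$ is the number of free parameters, $n$ the number of observations, and $\hat{L}$ the maximized likelihood. The candidate models, log-likelihoods, parameter counts, and sample size below are invented for illustration and are not results from this study; which quantity plays the role of $n$ for ETDBN is not specified here and is left as an assumption.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Standard Bayesian Information Criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Invented candidates: hidden-state count -> (max log-likelihood, free params).
candidates = {
    2: (-1200.0, 40),
    4: (-950.0, 80),
    6: (-940.0, 120),
}
n_obs = 500  # invented sample size

scores = {h: bic(ll, k, n_obs) for h, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)  # lowest BIC wins
```

The largest model fits slightly better than the middle one, but its additional parameters are penalized enough that the middle model has the lowest score, which mirrors the trade-off shown in figure 3.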

Stationary Distributions over Dihedral Angles Capture the Empirical Distribution

Figure 4 illustrates the sampled and empirical dihedral angle distributions. There is a good correspondence between dihedral angles sampled under the model (fig. 4, left) and the empirical distribution of dihedral angles in our training dataset (fig. 4, right) for all three cases illustrated (all amino acids, glycine only, and proline only). The correspondence is not surprising, given that ETDBN is effectively a mixture model with a large number of mixture components.

Estimates of Evolutionary Time from Dihedral Angles Are Consistent with Estimates from Sequence

Whilst ETDBN has a general scope with respect to applications (including acting as a proposal distribution or as a building block in a homology modeling application), we envision the primary application being inference of evolutionary parameters.

Figure 5 compares evolutionary times estimated using only pairs of homologous amino acid sequences versus only pairs of homologous dihedral angles. As desired, the two estimates of evolutionary time for each protein pair are similar, as can be seen by the proximity of the points to the identity line.

A paired t-test gave a p-value of 0.578, thus failing to reject the null hypothesis that there is no difference between branch lengths estimated using sequence only versus angles only. This suggests that there is sufficient evolutionary information in the dihedral angles to estimate the evolutionary times and that the model is consistent in its estimates, lacking a significant tendency to underestimate or overestimate the evolutionary times when either sequence or dihedral angles are used.

Interestingly, the variance in the sampled evolutionary times is higher when dihedral angles only are used, as compared with sequence only (see fig. S1, Supplementary Material online).

The Relationship between Evolutionary Time and Angular Distance Is Adequately Modeled

We investigated the relationship between evolutionary time and angular distance for real protein pairs and for protein pairs where the dihedral angles of $p_b$ ($X_b$) were treated as missing and hence sampled (fig. 6).

As expected, for both real and sampled pairs, angular distance tends to increase as a function of evolutionary time. For larger evolutionary times, a plateau begins to emerge, which is expected, as the maximum possible theoretical angular distance is $\sqrt{8} \approx 2.828$.

When the evolutionary time is exactly zero ($t_{ab} = 0$), under our model the angular distance between sampled dihedral angles is exactly zero (not shown in fig. 6). However, this is not expected to be the case for real protein pairs when the two sequences are identical (due to the inherently flexible nature of proteins, different experimental conditions, experimental noise, etc.). It is therefore not surprising that the regression curve for the real protein pairs does not pass through zero.
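The exact definition of angular distance is given in the Supplementary Material and is not reproduced here. As an illustration only, the sketch below uses one common choice for a pair of angles on the torus, $d = \sqrt{(2 - 2\cos\Delta\phi) + (2 - 2\cos\Delta\psi)}$, which is an assumption on our part; it is consistent with the stated maximum of $\sqrt{8}$ and with a distance of zero for identical angles. Angles are in radians and the function name is hypothetical.

```python
import math


def angular_distance(phi1, psi1, phi2, psi2):
    """Chordal distance between two (phi, psi) pairs on the torus (assumed form)."""
    sq = (2.0 - 2.0 * math.cos(phi1 - phi2)) + (2.0 - 2.0 * math.cos(psi1 - psi2))
    return math.sqrt(max(0.0, sq))  # guard against tiny negative rounding errors


d_max = angular_distance(0.0, 0.0, math.pi, math.pi)    # antipodal in both angles
d_zero = angular_distance(1.0, -2.0, 1.0, -2.0)         # identical angles
d_wrap = angular_distance(3.0, 0.0, 3.0 - 2.0 * math.pi, 0.0)  # same point, wrapped
```

Because the distance depends on the angles only through cosines of differences, it respects the periodicity of the torus, and it saturates as the two conformations become antipodal, which explains the plateau at large evolutionary times.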


For small evolutionary times (< 0.2), the curves for the real and sampled protein pairs show a good correspondence; however, for larger evolutionary times the model tends to underestimate angular distances. This may reflect the fact that the tpd of the WN diffusion specified is localized around its mean even when the evolutionary time is large; dihedral angles distant from this mean are therefore unlikely to be sampled. To a certain extent this is mitigated by the jump model, which occasionally allows for large changes in dihedral angle but may still be somewhat limited in its flexibility, as jumps can only occur between two site-classes. The majority of protein pairs in our training dataset represent smaller evolutionary times (81.7% of evolutionary times are smaller than 0.4), and therefore protein pairs with larger evolutionary times and their associated jumps are underrepresented in our dataset, which may also explain the underestimation.

An additional possibility is that ETDBN does not attempt to model global dependencies. Echave and Fernández (2010) use an LFENM model (which does take into account global dependencies) and provide evidence showing that the majority of structural changes are due to collective global deformations rather than local deformations. A local model such as ETDBN, by definition, does not take into account global dependencies and therefore does not fully account for their contribution to structural divergence.

Evaluation of the Model

The conditional independence structure in equation (1) enables computationally efficient sampling from the model under different combinations of observed or missing data. For example, ETDBN can be used to sample (i.e., predict) the dihedral angles of a protein from its corresponding amino acid sequence, a homologous amino acid sequence, a homologous set of dihedral angles, the corresponding secondary structure, a homologous secondary structure, or any combination of them.

Predictive accuracy was measured using the 38 homologous protein pairs in the test dataset. For every protein pair $(p_a, p_b)$, the dihedral angles of $p_b$ were treated as missing, and these missing dihedral angles were sampled under the model given a particular combination of observation types. The average angular distance between the sampled and known dihedral angles was used as the measure of predictive accuracy.

Figure 7 gives an example of predictive accuracy under different combinations of observation types, overlaid on a cartoon representation of the protein structure being predicted, whereas figure 8 provides a representative view of predictive accuracy across 10 different protein pairs in the test dataset for different combinations of observation types. We highlight some of the key patterns identified in figures 7 and 8 as follows.

Combination 1 refers to random sampling from the model, implying that no data observations were conditioned on besides the respective lengths of proteins $p_a$ and $p_b$. The average angular distance between the true and predicted dihedral angles was 1.6. Random sampling acts as a baseline for predictive accuracy. It is apparent from figure 7 that the model has a propensity to predict right-handed α-helices, which occupy the most populated region of the Ramachandran plot.

Under combination 2, only the amino acid sequence corresponding to $p_b$ is observed. As expected, in figures 7 and 8 there is an increase in predictive accuracy with the addition of the amino acid sequence relative to combination 1.

Under combination 3, we add in the amino acid sequence of a homologous protein ($p_a$). In all ten cases, there is an improvement in predictive accuracy. The improvement in predictive accuracy is reasonable, as knowledge of the sequence evolutionary trajectory is expected to encode information about structure evolution and hence will inform the dihedral angle conformational possibilities.

Under combination 4, in addition to the two amino acid sequences, we treat the homologous secondary structure as observed. This results in a substantial improvement in predictive accuracy, as one would expect. Knowledge of the amino acid sequence and a homologous secondary structure strongly informs the regions of the Ramachandran plot that are likely to be occupied.

Under combination 5 (which we consider the canonical combination, the standard homology modeling scenario), we treat both amino acid sequences as observed, as well as the dihedral angles of the homologous protein ($p_a$). In all cases, the predictive accuracy improves over combination 4. This is anticipated, as the homologous dihedral angles are expected to be the best proxy for the missing dihedral angles and are therefore expected to be more informative than secondary structure alone. Note that the availability of a homologous amino acid sequence pair, here and in combination 4, is consequential, as it informs the evolutionary time parameter $t_{ab}$, which will typically constrain the distribution over dihedral angles and reduce the associated uncertainty.

Finally, in combination 6, the same data observations as in combination 5 are used, except that the alignment is treated as given a priori (by the HOMSTRAD alignment) rather than as unobserved. The HOMSTRAD alignment is based on a structural and sequence alignment of $p_a$ and $p_b$ and is therefore expected to encode a higher degree of homology and structural information than combination 5 (where the alignment is treated as unobserved and a marginalization over alignments is therefore performed). On average, there is a slight improvement in predictive accuracy when fixing the alignment, although the magnitude of the improvement is not substantial. This demonstrates the accuracy of the alignment HMM.

FIG. 3. BIC scores (points) and number of free parameters (curve) as a function of the number of hidden states across 14 models (indicated above the dotted vertical lines) trained using the 1200 protein pairs in the training dataset. A 64 hidden state model had the lowest BIC score. Each model represents the best of several attempts.

FIG. 4. Ramachandran plots depicting sampled and empirical dihedral angle distributions. The top row depicts the distributions for all amino acids, the middle for glycine only, and the bottom for proline only. The leftmost plots show dihedral angles sampled under the jump model, whereas the rightmost plots show the empirical distributions of dihedral angles in the training dataset.

The alignment HMM accounts for alignment uncertainty in a principled manner, which is particularly useful when an appropriate alignment is unavailable. However, it should be noted that inference scales as $O(|p_a||p_b|h^2)$ when treating the alignment as unobserved. Inference scales as $O(mh^2)$ when the alignment is fixed a priori, where $m$ is the length of alignment $M_{ab}$ and is typically much smaller than $|p_a||p_b|$.
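To give a feel for the difference between the two scalings, the sketch below computes the ratio of the leading terms, $|p_a||p_b| / m$, in which the common factor $h^2$ cancels. Constant factors are ignored, so this is only an order-of-magnitude illustration; the protein and alignment lengths are hypothetical.

```python
def alignment_cost_ratio(len_a, len_b, aligned_len):
    """Ratio of the leading terms |p_a||p_b|h^2 and m h^2 (h^2 cancels)."""
    if aligned_len <= 0:
        raise ValueError("alignment length must be positive")
    return (len_a * len_b) / aligned_len


# Hypothetical pair: proteins of 200 and 210 residues, alignment of 230 columns.
ratio = alignment_cost_ratio(200, 210, 230)
```

For proteins of this size, marginalizing over alignments costs on the order of a couple of hundred times more than inference with a fixed alignment.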

It should be emphasized that we do not expect ETDBN to compete with structure prediction packages such as Rosetta (Rohl et al. 2004) or homology modeling software such as that described by Arnold et al. (2006) in terms of predictive accuracy. Our current model is a local model of structure evolution; it is not even expected to capture fundamental constraints such as the radius of gyration of a protein or other global features typical of proteins.

Evolutionary Hidden States Reveal a Common Evolutionary Motif

One benefit of ETDBN is that the 64 evolutionary hidden states learned during the training phase are interpretable. We give an example of a hidden state encoding a jump event that was subsequently found to represent an evolutionary motif present in a large number of protein pairs in our test and training datasets.

Evolutionary hidden state 3 (fig. 9) was selected from the 64 hidden states as an example of a hidden state encoding a jump event and capturing angular shift (a large change in dihedral angle). A notable feature of this hidden state is that the change in dihedral angles between site-classes $r_1$ and $r_2$ is associated with specific amino acid changes. In site-class $r_1$, the amino acid frequencies are relatively spread out amongst a number of amino acids, whereas in site-class $r_2$ the frequencies are particularly concentrated in favor of glycine (Gly) and asparagine (Asn), with glycine being significantly more probable in site-class $r_2$ than in $r_1$. This suggests that, conditioned on hidden state 3, an exchange between a glycine and another amino acid is likely indicative of a jump and hence of a corresponding change in dihedral angle. This is consistent with what we find in a subsequent analysis of evolutionary motifs. This particular jump occurs in coil regions.

Having selected hidden state 3, positions in 238 protein pairs were analyzed for evidence of the corresponding evolutionary motif: 38 protein pairs in the test dataset and a further 200 from the training dataset were analyzed using the criteria described in the Methods section. Using the first criterion, 84 protein sites in 59 protein pairs corresponding to $H^i = 3$ (evolutionary hidden state 3) were identified. Of the 84 protein sites, 34 met the second criterion.

We give an example of a homologous protein pair illustrating the identified evolutionary motif. Two histidine-containing phosphocarriers, 1pch (Mycoplasma capricolum) and 1poh (Escherichia coli), were identified as having the evolutionary motif (fig. 10) at homologous site E39/G39. Most positions in the homologous pair have low posterior jump probabilities (≈ 0.0), with the exception of positions N38/N38 and E39/G39, which both have high posterior jump probabilities (≈ 1.0). The exchange between a glutamate (at position 39 in 1poh) and a glycine (at position 39 in 1pch) appears to be responsible for the shift in dihedral angle. This exchange corresponds to a significant jump in dihedral angle, from $\langle \phi_{1poh,E39}, \psi_{1poh,E39} \rangle = \langle 1.63, 0.06 \rangle$ to $\langle \phi_{1pch,G39}, \psi_{1pch,G39} \rangle = \langle 1.40, 0.22 \rangle$. The angular distance between the two dihedral angles is 2.01. This is consistent with the amino acid frequency parameters specified by the two site-classes for hidden state 3 (fig. 9). Site-class $r_1$ indicates that a number of amino acids (alanine, aspartic acid, glycine, histidine, lysine, asparagine, proline, glutamine, arginine, serine, and threonine) other than glutamate plausibly coincide with the particular dihedral angle conformation specified by site-class $r_1$. The involvement of glycine in a jump is not surprising, as it is a small and flexible amino acid, whereas the role of asparagine is less clear. In our analysis of 238 protein pairs, we found that, of the seven positions meeting the criteria for hidden state 3 and involving an exchange with asparagine (Asn), four were an exchange between an asparagine and a glycine, whereas the remaining three were between asparagine and one of lysine, histidine, or serine.

FIG. 5. Scatterplot comparing evolutionary times estimated using pairs of homologous amino acid sequences only versus pairs of homologous sets of dihedral angles only for N = 38 protein pairs in the test dataset. The x-coordinate of each point gives the estimated evolutionary time based only on the amino acid sequence, whereas the y-coordinate gives the estimated evolutionary time based only on the dihedral angles. The diagonal line represents y = x.

FIG. 6. Evolutionary time versus angular distances between real and corresponding sampled protein pairs in the training dataset at 50 representative evolutionary times. Mean angular distances (see 'Calculation of angular distances' in the Supplementary Material online) between the real dihedral angles in a protein pair (red; $X_a$ and $X_b$) and sampled dihedral angles in a sampled protein pair (blue; $X_a$ and $\hat{X}_b$) were compared to test how well the sampled dihedral angles reproduced the real angular distances. The dihedral angles ($\hat{X}_b$) of each sampled protein pair were sampled by conditioning on both amino acid sequences and the homologous dihedral angles ($A_a$, $A_b$, $X_a$) and on the estimated evolutionary time ($t_{ab}$) for the real protein pair. The regression curves were obtained by a quadratic LOcally-weighted regrESSion (LOESS) with smoothing parameter chosen by leave-one-out cross-validation. The 95% confidence intervals for the mean assume error normality.

Using Dihedral Angles for Alignment

A valuable feature of our model is its ability to account for alignment uncertainty by summing over possible pairwise alignments, using the TKF92 model as a prior distribution over indel histories, whilst simultaneously taking into account neighboring dependencies amongst aligned sites. Doing so results in a sample of alignments rather than a single alignment. Nevertheless, a single Maximum A Posteriori (MAP) pairwise alignment may be obtained from the alignment samples and used for downstream analysis.

ETDBN and several other alignment methods (namely StatAlign, BAli-Phy, MUSCLE, and MAFFT) were used to infer pairwise alignments from simulated and real data under various combinations of data observations: for example, an amino acid sequence pair ($A_a$, $A_b$), a secondary structure sequence pair ($S_a$, $S_b$), a dihedral angle sequence pair ($X_a$, $X_b$), and combinations thereof.

In the first set of benchmarks (fig. 11A), pairs of proteins were simulated from the ETDBN model conditioned on 38 different pairwise alignments and corresponding evolutionary times. This resulted in a set of 38 simulated pairwise alignments together with corresponding observations, implying that the true underlying alignments were known for each of the simulated protein pairs. ETDBN and a number of other alignment methods were used to infer pairwise alignments for each. The alignment similarity metric (Schwartz et al. 2005) was used to measure the similarity between the inferred alignments and the true alignments, where higher similarity indicates better predictions. It was found that, when using the simulated amino acid sequences alone, ETDBN (11A5) outperformed all four other methods tested (11A1, MUSCLE; 11A2, MAFFT; 11A3, StatAlign; and 11A4, BAli-Phy). However, the greater performance of ETDBN compared with the other methods cannot be considered a fair comparison, as the data were simulated under the ETDBN model.

FIG. 7. Cartoon representations of the E. coli glyceraldehyde-3-phosphate dehydrogenase structure (PDB 1gad) are depicted in each panel, overlaid with predictive accuracy when using different combinations of observed data to predict missing dihedral angles in 1gad. Thermus aquaticus glyceraldehyde-3-phosphate dehydrogenase (PDB 1cer) was used as a homolog for the purposes of prediction. Predictive accuracy is indicated using a color gradient depicting the mean angular distance between the true dihedral angle ($X^i_{1gad}$) and the predicted (sampled) dihedral angles ($\hat{X}^i_{1gad}$) at each amino acid position. The label at the bottom of each panel indicates the data combination used. In (A), no data were used for prediction. In (B), only the amino acid sequence corresponding to 1gad ($A_{1gad}$) was used. In (C), the amino acid sequence of 1gad ($A_{1gad}$) and the amino acid sequence of the homologous protein ($A_{1cer}$) were used. In (D), both amino acid sequences ($A_{1cer}$ and $A_{1gad}$) and the secondary structure of the homologous protein ($S_{1cer}$) were used. In (E), both amino acid sequences ($A_{1cer}$ and $A_{1gad}$) and the dihedral angles of the homologous protein ($X_{1cer}$) were used. Finally, in panel (F), the same combination of observations was used as in (E), but the alignment was treated as known a priori.

FIG. 8. Benchmarks of predictive accuracy (measured using angular distance; lower is better) on a random subset of ten protein pairs in the test dataset, giving a representative view of predictive accuracy under six different combinations of observations. The dihedral angles $X_b$ of $p_b$ were treated as missing and were sampled under the model, whereas $p_a$ was a homologous protein used for the purposes of prediction. See the legend of figure 7 for a description of each combination (A–F). The final set of bars, denoted 'Mean (N = 38)', are the mean values for the entire test dataset of N = 38 protein pairs. The error bars are the standard errors.

FIG. 9. Depiction of evolutionary hidden state 3. This hidden state was sampled at 0.69% of sites (the average was 1.56%). The equilibrium frequencies of $r_1$ and $r_2$ were $\pi_1 = 0.691$ and $\pi_2 = 0.309$, respectively. The jump rate was $\gamma = 3.176$. The corresponding site-class pair probabilities are depicted to the right as a function of evolutionary time. Note that the dashed red lines depicting the probabilities for (1, 2) and (2, 1) superimpose exactly because the probabilities are equal; this holds for the jump probabilities of all hidden states, as it is required for time-reversibility. In the main figure, the two rows depict the parameters encoded by the two site-classes, respectively. Columns 1 and 3 depict the parameters governing the amino acid and secondary structure stochastic processes, respectively. The secondary structure classes correspond to H = helix, S = sheet, and C = coil. Column 2 depicts the WN diffusions. The stationary distributions of the WN diffusions are shown using black contour lines, the directions of the drifts are indicated by the arrows, and the magnitude of the drift at each position is indicated using the color gradient.

More revealing in figure 11A was the alignment similarity under ETDBN when using different combinations of simulated data observations. It was found that secondary structure alone (11A6) performed the worst, which is unsurprising given that only three states were available to align the proteins. The second worst in terms of alignment similarity was amino acid sequences alone (11A5), followed by amino acid sequences and secondary structures (11A7). Interestingly, using dihedral angles only (11A8) outperformed both 11A5 (sequences only) and 11A6 (secondary structures only). Finally, using amino acid sequences together with dihedral angles (11A9), or all three data types combined (11A10), outperformed all other combinations. This illustrates that, at least under simulation conditions, increasing the number of data observations results in better alignment accuracy.

Following that, the various alignment methods were benchmarked against 38 pairwise alignments consisting of real sequence and structure observations in the test dataset. These pairwise alignments were obtained from the HOMSTRAD alignments. The sequence identity of these pairwise alignments ranged from 10% to 93%, with an average sequence identity of 39%. In addition to methods 1–10 in figure 11A, five structural alignment methods were also used: 11, StatAlign (Herman et al. 2014); 12, TMAlign (Zhang and Skolnick 2005); 13, Mammoth (Ortiz et al. 2002); 14, Dali (Holm and Rosenström 2010); and 15, CE (Shindyalov and Bourne 1998). These methods were not used in 11A due to the lack of an appropriate model for simulating the evolution of three-dimensional protein structures.

When benchmarking the MAP estimated alignments against the HOMSTRAD alignments (fig. 11B) using real sequences alone for inference ($A_a$, $A_b$), ETDBN (11B5) had a similar degree of accuracy when compared with several other sequence-based methods (11B1, StatAlign; 11B2, BAli-Phy; 11B3, MUSCLE; and 11B4, MAFFT). This demonstrates that ETDBN has performance comparable to that of other commonly used sequence alignment methods.

Using ($S_a$, $S_b$) alone, ETDBN (11B6) had substantially lower alignment similarity compared with sequence only, which was expected given that a similar result was obtained for the simulated data (11A6). However, when including the real sequences (11B7), the predictive accuracy was once again comparable to sequence-only inferences (11B1–11B5).

FIG 10 Depiction of two histidine-containing phosphocarriers PDB 1pch and 1poh superimposed On the left is a cartoon representation of thetwo proteins corresponding to regions F29-K49 and F29-A42 respectively with posterior jump probabilities at each position overlaid On the rightis a ball-and-stick representation giving atomic detail for a smaller region (I36-G42 and T36-A42 respectively) The exchange between a glutamate(E39 in 1poh) and a glycine (G39 in 1pch) is associated with a large change in dihedral angle as indicated by the curved arrows

Evolutionary TorusDBN. doi:10.1093/molbev/msx137. MBE

Page 2097. Downloaded from https://academic.oup.com/mbe/article-abstract/34/8/2085/3772139 by Edward Boyle Library user on 08 November 2017

When using ($X_a$, $X_b$) alone (11B8), the alignment similarity was found to be somewhat worse than in the sequence only cases. Furthermore, when introducing the sequences (11B9) and secondary structures (11B10) in addition to the dihedral angles, the similarity remained worse than that of the sequence only methods (11B1–11B5), despite the additional information. These results are in contrast to the results we obtained for simulated data (11A8–11A10).

The non-statistical structural alignment methods (11B12–11B15) fared the best, likely because they use criteria similar to those used to construct the HOMSTRAD alignments. When interpreting these results, it is important to note that the HOMSTRAD alignments should not be considered the true underlying alignments and may even be strongly biased. For example, they may favor the closest structural superimposition of structures or the most parsimonious alignments with the fewest indels. In the evolutionary modeling context, our goal is to distinguish between homologous sites (sites that have evolved via mutation alone) and indels. In practice, it is extremely difficult to obtain the true underlying alignment (sets of homologies and indels), because it would require an experiment where every indel event since the common ancestor is observed, a seemingly impossible task outside of simulation or laboratory conditions.

After further investigation, the trend of lower alignment similarity seen in figure 11B6–11B10 when using ETDBN with structural observations, compared with sequence only or non-statistical structural alignment methods, was found to reverse (11C6–11C10) upon calculating the precision of predicting homologous sites (the fraction of sites which were predicted as homologous and were correctly predicted as such). Therefore, when only dihedral angle observations are used, ETDBN underpredicts the number of homologous sites; however, when a homologous site is predicted, it is correctly predicted more often than when using only amino acid sequences. In particular, ETDBN predicted fewer homologous sites with coiled secondary structure compared with homologous sites with helical or sheet secondary structure. This pattern of results may be in part due to the WN diffusion used to model evolution of dihedral angles. The WN diffusion is suitable for modeling angular drift (small changes in angles localized around a region of the Ramachandran plot), but does not sufficiently capture angular shift (large changes in angles between regions of the Ramachandran plot, which are more likely in coiled regions) due to stationarity. As noted before, the jump model is an abstraction intended to capture the end-points of evolution by allowing a jump between two regions of the Ramachandran plot, abstracting a potential intermediate evolutionary trajectory for the sake of computational tractability. Note that the jump model accurately captures the common cases where a single mutation induces a large conformational shift.
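The precision measure described above (the fraction of predicted homologous site pairs that are also homologous in the reference alignment) can be illustrated with a minimal sketch. The function names and the toy gapped sequences below are hypothetical and are not taken from the ETDBN code; alignments are assumed to be given as two equal-length gapped strings with `-` as the gap character.

```python
def homologous_pairs(row_a, row_b):
    """Return the set of (x, y) residue-index pairs placed in the same column."""
    pairs, x, y = set(), 0, 0
    for ca, cb in zip(row_a, row_b):
        if ca != "-" and cb != "-":
            pairs.add((x, y))      # both residues present: a homologous site pair
        x += ca != "-"             # advance the index in protein a
        y += cb != "-"             # advance the index in protein b
    return pairs


def precision(predicted, reference):
    """Fraction of predicted homologous pairs that are also in the reference."""
    pred = homologous_pairs(*predicted)
    ref = homologous_pairs(*reference)
    return len(pred & ref) / len(pred) if pred else 0.0


# Hypothetical toy alignments (not from HOMSTRAD).
reference = ("ACDE-F", "AC-EGF")
predicted = ("ACDEF", "ACEGF")
print(precision(predicted, reference))
```

In this toy case the gapless prediction proposes five homologous pairs, of which three agree with the reference. A method that predicts fewer homologous pairs can therefore score a higher precision while still having a lower overall alignment similarity, which is the pattern reported for the structure-based ETDBN alignments.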

Concluding Remarks

The main achievement of this work is a computationally tractable, generative, and interpretable probabilistic model of protein sequence and structure evolution on a local scale.

FIG. 11. Alignment benchmarks. (A) Averages of alignment similarity across different methods and combinations of data observations, where the observations were simulated from the ETDBN model conditioned on a single known alignment. (B) Averages of alignment similarity, where the 38 HOMSTRAD alignments in the test dataset were taken as the true alignments. (C) Averages of precision in predicting homologous site pairs in 38 sequence pairs from the test dataset, where homologous site pairs in the HOMSTRAD alignments were taken to be the true homologous site pairs.


Previous stochastic models of protein sequence and structure evolution emphasized estimation of evolutionary parameters (Challis and Schmidler 2012; Herman et al. 2014). ETDBN is somewhat of a departure from these previous models, but is likewise capable of estimating evolutionary parameters. We show that estimates of evolutionary times inferred under ETDBN are consistent regardless of whether amino acid sequence or dihedral angle observations are used. In addition, the relationship between evolutionary time and angular distance in real proteins is adequately recapitulated in protein pairs sampled under the model, although the angular distance is underestimated for larger evolutionary times. This might be explained by the limited flexibility of the jump model and by the fact that global dependencies are not taken into account.

Like previous models, ETDBN is capable of dealing with alignment uncertainty by marginalizing over indel histories; it predicts pairwise MAP consensus alignments with accuracy similar to that of score-based and statistical alignment methods.

The generative nature of ETDBN allows us to demonstrate that the underlying empirical distributions over dihedral angles (depicted using Ramachandran plots) are captured, and that the model is capable of predicting missing observations, such as dihedral angles, from a variety of different data types: for example, an amino acid sequence, a homologous amino acid sequence, a homologous secondary structure, a homologous set of dihedral angles, or any combination thereof.

Based on its local nature, ETDBN does not constitute a homology modeling method in itself. Rather, it can be used as a building block, much like fragment libraries model local structure in protein structure prediction methods. ETDBN places the homology modeling problem on a statistical footing, enabling a number of approaches to later be used, such as multi-level modeling, that is, combining fine-grained distributions (for example, distributions over dihedral angles, such as ETDBN) and coarse-grained distributions (for example, distributions describing the global properties of proteins, such as compactness). In particular, a method referred to as 'the Reference Ratio method' can be used to combine fine-grained and coarse-grained distributions in a statistically principled manner (see Frellsen et al. 2012; Hamelryck et al. 2010). However, we have shown that the current model can already be used for the inference of evolutionary parameters in its present form.

In addition to multi-level modeling, probabilistic models such as ETDBN allow one to account for, and to make statements about, uncertainty (e.g., with respect to evolutionary time, alignment, etc.) in a rigorous manner. In principle, ETDBN, like TorusDBN (Boomsma et al. 2008), could be used as a proposal distribution. In other words, ETDBN could be used to sample protein structures (possibly conditioned on various data observations) in a computationally efficient manner, such that the resulting samples are expected to be located in regions of high probability density with respect to the true underlying distribution.

A final key feature of our evolutionary model is its interpretable nature. This interpretability enables the identification of potential evolutionary motifs: common patterns of sequence–structure evolution. We identify one such evolutionary motif in 34 different homologous protein pairs. A major direction for future research is the further identification of such evolutionary motifs. Understanding these evolutionary motifs may 1) improve homology modeling predictions, 2) provide more accurate estimates of evolutionary parameters, and 3) produce better models of protein evolution that more realistically capture evolutionary trajectories through sequence and structure space, which may help identify functionally relevant positions that are potential drug targets.

Future Challenges

Pairwise to Phylogeny

For reasons of computational tractability, the implemented model is pairwise, but it is theoretically possible to generalize it to a phylogeny, such as in Herman et al. (2014). In practice, for three or more sequences on a phylogeny, it is necessary to marginalize out the unobserved ancestral protein states in order to compute likelihoods. Felsenstein's algorithm can be used to marginalize over discrete ancestral states, such as amino acids, in a computationally efficient manner. However, we do not know whether a similarly efficient algorithm exists for marginalizing the continuous ancestral dihedral angle states under the WN diffusion, thereby necessitating a more expensive MCMC algorithm. A possibly greater computational hindrance to considering a phylogeny is the alignment problem, which scales as $O(l_1 l_2 \cdots l_N)$, where $l_i$ is the length of sequence $i$ and $N$ is the number of sequences, although MCMC approaches are possible (Herman et al. 2014).
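As an illustration of how Felsenstein's algorithm marginalizes over discrete ancestral states, the following is a minimal sketch for a single site on a small rooted binary tree. It assumes a toy equal-rates CTMC on three states (for example the secondary structure classes H, S, and C), chosen only because its transition probabilities have a closed form; this is not one of the rate matrices estimated in the paper, and the tree encoding and all function names are hypothetical.

```python
import math


def transition_matrix(t, n=3, rate=1.0):
    """P(t) for a toy CTMC in which every off-diagonal rate equals `rate`."""
    decay = math.exp(-n * rate * t)
    same = 1.0 / n + (n - 1.0) / n * decay
    diff = 1.0 / n - decay / n
    return [[same if i == j else diff for j in range(n)] for i in range(n)]


def partial_likelihoods(node, n=3):
    """Likelihood of the data below `node` for each possible state at `node`.

    A leaf is an observed state index; an internal node is a tuple
    (left_subtree, left_branch_length, right_subtree, right_branch_length).
    """
    if isinstance(node, int):
        return [1.0 if s == node else 0.0 for s in range(n)]
    left, t_left, right, t_right = node
    l_left, l_right = partial_likelihoods(left, n), partial_likelihoods(right, n)
    p_left, p_right = transition_matrix(t_left, n), transition_matrix(t_right, n)
    return [
        sum(p_left[s][c] * l_left[c] for c in range(n))
        * sum(p_right[s][c] * l_right[c] for c in range(n))
        for s in range(n)
    ]


def tree_likelihood(root, n=3):
    """Sum over root states, weighted by the (uniform) stationary distribution."""
    return sum(partial_likelihoods(root, n)) / n


# Three observed leaves with states 0, 0, 1 and arbitrary branch lengths.
tree = ((0, 0.1, 0, 0.2), 0.3, 1, 0.4)
print(tree_likelihood(tree))
```

The recursion visits each node once and costs on the order of the number of nodes times the squared number of states, rather than enumerating every joint assignment of ancestral states. No comparably simple recursion is given in the text for the continuous dihedral angle states, which is why MCMC is mentioned as the fallback there.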

Context-Dependence

Although we believe our model provides a substantial improvement over current stochastic models of sequence and structural evolution, there is still scope for improvement. The WN diffusions used to model dihedral angle evolution adequately capture angular drift (small local changes in dihedral angle), but are less capable of capturing angular shift (large changes in dihedral angle). This is, to a considerable extent, mitigated by the introduction of jump events, as discussed before. A more realistic model would model the entire evolutionary trajectory, allowing an arbitrary number of switches between site-classes, together with neighboring dependencies amongst adjacent sites along the evolutionary trajectory. Similar context-dependent models are typically computationally expensive and require sophisticated inference procedures (Robinson et al. 2003; Yu and Thorne 2006).

Software Availability

Julia code (tested on both Windows and Linux platforms) is available at http://www.computingforbiology.org/software/etdbn


Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

The authors acknowledge funding from the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN). The second author acknowledges support from project MTM2016-76969-P from the Spanish Ministry of Economy and Competitiveness and ERDF. The authors acknowledge valuable comments from three referees that led to substantial improvements in the manuscript, as well as initial discussions with Christian Havn, Ian Lim, and Mathias Cronjäger.

References

Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2):195–201.

Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T. 2008. A generative, probabilistic model of local protein structure. Proc Natl Acad Sci U S A. 105(26):8932–8937.

Boomsma W, Tian P, Frellsen J, Ferkinghoff-Borg J, Hamelryck T, Lindorff-Larsen K, Vendruscolo M. 2014. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proc Natl Acad Sci U S A. 111(38):13852–13857.

Challis CJ, Schmidler SC. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol. 29(11):3575–3587.

Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett. 457(4):413–416.

Echave J, Fernandez FM. 2010. A perturbative view of protein structural variation. Proteins Struct Funct Bioinform. 78(1):173–180.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.

Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat. 125(1):1–15.

Frellsen J, Mardia KV, Borg M, Ferkinghoff-Borg J, Hamelryck T. 2012. Towards a general probabilistic model of protein structure: the reference ratio method. In: Bayesian methods in structural bioinformatics. New York: Springer. p. 125–134.

García-Portugués E, Sørensen M, Mardia KV, Hamelryck T. 2017. Langevin diffusions on the torus: estimation and applications. arXiv:1705.00296.

Gilks WR, Richardson S, Spiegelhalter D. 1995. Markov chain Monte Carlo in practice. London: CRC Press.

Grishin NV. 1997. Estimation of evolutionary distances from protein spatial structures. J Mol Evol. 45(4):359–369.

Grishin NV. 2001. Fold change in evolution of protein structures. J Struct Biol. 134(2):167–185.

Gutin AM, Badretdinov AY. 1994. Evolution of protein 3D structures as diffusion in multidimensional conformational space. J Mol Evol. 39(2):206–209.

Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 15(7):910–917.

Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. 2010. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE 5(11):e13714.

Herman JL, Challis CJ, Novak A, Hein J, Schmidler SC. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol. 31(9):2251–2266.

Holm L, Rosenstrom P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38(Web Server issue):W545–W549.

Katoh K, et al. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14):3059–3066.

Koshi JM, Goldstein RA. 1998. Models of natural mutations including site heterogeneity. Proteins 32(3):289–295.

Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21(6):1095–1109.

Lio P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14(8):726–733.

Miklos I, Lunter G, Holmes I. 2004. A long indel model for evolutionary sequence alignment. Mol Biol Evol. 21(3):529–540.

Mizuguchi K, Deane CM, Blundell TL, Overington JP. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7(11):2469–2471.

Ortiz AR, Strauss CE, Olmea O. 2002. Mammoth (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 11(11):2606–2621.

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 20(10):1692–1704.

Rohl CA, Strauss CE, Misura KM, Baker D. 2004. Protein structure prediction using Rosetta. Methods Enzymol. 383:66–93.

Schwartz AS, Myers EW, Pachter L. 2005. Alignment metric accuracy. arXiv preprint q-bio/0510052.

Shindyalov IN, Bourne PE. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9):739–747.

Siepel A, Haussler D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol. 11(2–3):413–428.

Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 34(1):3–16.

Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18(5):691–699.

Yu J, Thorne JL. 2006. Dependence among sites in RNA evolution. Mol Biol Evol. 23(8):1525–1537.

Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33(7):2302–2309.


benefit of LFENMs is that they do not assume independence of atomic coordinates and take into account non-local dependencies due to physical interactions. In their current formulation, LFENMs do not distinguish between the differing chemical nature of different amino acids and therefore do not account for the variable effect of sequence mutation on protein structure evolution.

Rather than using a Cartesian coordinate representation, our model, ETDBN, uses a dihedral angle representation motivated by the non-evolutionary TorusDBN model (Boomsma et al. 2008, 2014). TorusDBN represents a single protein structure as a sequence of $(\phi, \psi)$ dihedral angle pairs, which are modeled using continuous bivariate angular distributions (Frellsen et al. 2012). Likewise, ETDBN treats protein structure as a random walk in space, again making use of the $\phi$ and $\psi$ dihedral angles (top of fig. 1).

The dihedral angle representation is informed by the chemical nature of peptide bonds. Each amino acid in a protein peptide chain is covalently bonded to the next via a peptide bond. Peptide bonds have a partial double bond nature that results in a planar configuration of atoms in space. This configuration allows the protein backbone structure to be largely described in terms of a series of $\phi$ and $\psi$ dihedral angles that define the relationship between the planes in three-dimensional space. A benefit of this representation is that it bypasses the need for structural alignment, unlike models on Cartesian coordinates, which typically need to additionally superimpose the structures for comparison purposes (Herman et al. 2014). Having to account for superimposition introduces an additional source of uncertainty. A further advantage of the dihedral angle representation is that there are fewer degrees of freedom per amino acid, and therefore typically fewer parameters are required in order to model their evolution.

The evolution of dihedral angles in ETDBN is modeled using a novel stochastic diffusion process developed in García-Portugués et al. (2017). In addition to this, a coupling is introduced such that an amino acid change can lead to a jump in dihedral angles and a change in diffusion process, allowing the model to capture changes in amino acid that are directionally coupled with changes in dihedral angle or secondary structure. As in Challis and Schmidler (2012) and Herman et al. (2014), the insertion and deletion (indel) evolutionary process is also modeled in order to account for alignment uncertainty (Thorne et al. 1992).

The OU processes used in Challis and Schmidler (2012) and Herman et al. (2014) ignore bond lengths and treat Cα atoms as evolving independently for the sake of computational tractability. Furthermore, the OU process makes Gaussian assumptions. From a generative perspective, these properties will lead to evolved proteins with Cα atoms that are unnaturally dispersed in space. Bond lengths are also ignored in ETDBN, but can be plausibly fixed or modeled. As a result, it is expected that the use of angular diffusions will much more naturally capture the underlying protein structure manifold.

Two or more homologous proteins will share a common ancestor, which leads to underlying tree-like dependencies. These dependencies manifest themselves most noticeably in the degree of amino acid sequence similarity between two homologous proteins. The strength of these dependencies is assumed to be a result of two major factors: the time since the common ancestor and the rate of evolution.

Failing to account for evolutionary dependencies can lead to false conclusions (Felsenstein 1985), whereas accounting for evolutionary dependencies allows information from homologous proteins to be incorporated in a principled manner. This can lead to more accurate inferences, such as the prediction of a protein structure from a homologous protein sequence and structure, known as homology modeling (Arnold et al. 2006). Stochastic models such as ETDBN are not expected to compete with homology modeling software such as SWISS-MODEL (Arnold et al. 2006). However, they allow for estimation of evolutionary parameters and statements about uncertainty to be made in a statistically rigorous manner.

Most models of structural evolution ignore dependencies amongst sites because of the increased computational demand and complexity associated with such models. These dependencies are expected to influence patterns of evolution, specifically patterns of amino acid substitution. The current model deals with local dependencies only, that is, dependencies that are expected to arise due to interactions between neighboring amino acids, for example between amino acids in an α-helix. ETDBN does not account for global dependencies, that is, dependencies that result in the globular nature of proteins (Boomsma et al. 2008). In ETDBN, we attempt to model local dependencies only by using a Hidden Markov Model (HMM) to capture dependencies amongst neighboring aligned positions. HMMs such as PASSML (Lio et al. 1998) have been successfully used to predict protein secondary structure from aligned sequences; however, these models typically have the disadvantage that they assume a canonical secondary structure shared amongst all the sequences being analyzed. This restricts analysis to closely related sequences, where conservation of secondary structure is a reasonable assumption. ETDBN does not assume a canonical secondary structure, but instead uses a phylogenetic HMM approach similar to Siepel and Haussler (2004) that assumes dependencies between evolutionary processes at neighboring aligned positions.

Parameters of ETDBN were estimated using 1200 homologous protein pairs from the HOMSTRAD database (Mizuguchi et al. 1998). The resulting model provides a realistic prior distribution over proteins and protein structure evolution in comparison to previous stochastic models. Doing so enables biological insights into the relationship between sequence and structure evolution, such as patterns of amino acid change that are informative of patterns of structural change (Grishin 2001). It was with these features in mind that ETDBN was developed.

Evolutionary Model

Overview

ETDBN is a dynamic Bayesian network model of local protein sequence and structure evolution along a pair of aligned homologous proteins $p_a$ and $p_b$. ETDBN can be viewed as an HMM (see fig. 1). Each hidden node of the HMM, corresponding to an aligned position, adopts an evolutionary hidden state specifying a distribution over three different observation pairs: a pair of amino acid characters, a pair of dihedral angles, and a pair of secondary structure classifications. A transition probability matrix specifies neighboring dependencies between adjacent evolutionary states. For example, transitions along the alignment between hidden states encoding predominantly α-helix evolution would be expected to occur more frequently than transitions between an evolutionary hidden state encoding predominantly α-helix evolution and another encoding predominantly β-sheet evolution.

Ideally, the underlying hidden states would not just vary across the length of the alignment, as captured by the HMM in the current model, but also evolve along the branches of the phylogenetic tree. This remains computationally intractable at present. Allowing the hidden states to evolve along the tree would allow capturing large structural changes, even those induced by a single mutation. For now, we model such events using a jump model (see below).

Partially in order to mitigate this, each hidden state specifies a distribution over a pair of site-classes at each aligned position. This gives rise to the possibility of a 'jump event'. A jump event allows a large change in dihedral angle or secondary structure (e.g., helix to sheet) to occur at a given aligned position, and also introduces a directional coupling between changes in amino acid that are informative of changes in dihedral angle or secondary structure conformation.

Observation Types

The two proteins $p_a$ and $p_b$ in a homologous pair are associated with a pair of observation sequences, $O_a$ and $O_b$, respectively, obtained from experimental data. An $i$th site observation pair $O^i = (O_a^{x(i)}, O_b^{y(i)})$ is associated with every aligned site $i$ in an alignment $M_{ab}$ of $p_a$ and $p_b$, where $M_{ab}^i \in \left\{ \binom{x}{y}, \binom{x}{-}, \binom{-}{y} \right\}$ specifies the homology relationship at position $i$ of the alignment (homologous, deletion with respect to $p_a$, and insertion with respect to $p_a$, respectively). $i$ is taken to run from 1 to $m$, where $m$ is the length of the alignment $M_{ab}$, and $x \in \{1, \ldots, |p_a|\}$ and $y \in \{1, \ldots, |p_b|\}$ specify the indices of the positions in $p_a$ and $p_b$, respectively. $|p_a|$ and $|p_b|$ give the number of sites in $p_a$ and $p_b$, respectively.

Each site observation, $O_a^{x(i)}$ and $O_b^{y(i)}$, contains amino acid and structural information corresponding to the two Cα atoms at aligned site $i$ belonging to each of the two proteins. A site observation corresponding to a particular protein at aligned site $i$, $O_a^{x(i)}$, is comprised of three different data types associated with the Cα atom: an amino acid ($A_a^{x(i)}$; discrete, one of twenty canonical amino acids), $\phi$ and $\psi$ dihedral angles ($X_a^{x(i)} = \langle \phi_a^{x(i)}, \psi_a^{x(i)} \rangle$; continuous, bivariate), and a secondary structure classification ($S_a^{x(i)}$; discrete, one of three classes: helix (H), sheet (S), or coil (C)). Therefore, $O_a^{x(i)} = (A_a^{x(i)}, X_a^{x(i)}, S_a^{x(i)})$ and $O_b^{y(i)} = (A_b^{y(i)}, X_b^{y(i)}, S_b^{y(i)})$.

FIG. 1. Above: dihedral angle representation. A small section of a single protein backbone (three amino acids) with $\phi$ and $\psi$ dihedral angles shown, together with Cα atoms, which attach to the amino acid side-chains. Each amino acid side-chain determines the characteristic nature of each amino acid. Every amino acid position corresponds to a hidden node in the HMM below. Note that we only show a single protein, whereas the model considers a pair. Below: depiction of the HMM architecture of ETDBN, where each $H$ along the horizontal axis represents an evolutionary hidden node. The horizontal edges between evolutionary hidden nodes encode neighboring dependencies between aligned sites. The arrows between the evolutionary hidden nodes and site-class pair nodes encode the conditional independence between the observation pair variables: $A_a^{x_i}, A_b^{y_i}$ (amino acid site pair); $X_a^{x_i} = \langle \phi_a^{x_i}, \psi_a^{x_i} \rangle$, $X_b^{y_i} = \langle \phi_b^{y_i}, \psi_b^{y_i} \rangle$ (dihedral angle site pair); and $S_a^{x_i}, S_b^{y_i}$ (secondary structure class site pair). The circles represent continuous variables and the rectangles represent discrete variables.

Model Structure

The sequence of hidden nodes in the HMM is written as $H = (H^1, H^2, \ldots, H^m)$. Each hidden node $H^i$ in the HMM corresponds to a site observation pair, $O_a^{x(i)}$ and $O_b^{y(i)}$, at an aligned site $i$ in the alignment $M_{ab}$. Initially, we treat the alignment $M_{ab}$ as given a priori, but later modify the HMM to marginalize out an unobserved alignment.

The model is parameterized by $h$ hidden states. Every hidden node $H^i$, corresponding to an aligned site $i$, takes an integer value from 1 to $h$ for the hidden state at node $H^i$. In turn, each hidden state specifies a distribution over a site-class pair, $(r_a^i, r_b^i)$, as a function of evolutionary time. A site-class pair consists of two site-classes, $r_a^i$ and $r_b^i$. Each of the two site-classes takes an integer value, 1 or 2; that is, $(r_a^i, r_b^i) \in \{(1,1), (1,2), (2,1), (2,2)\}$. We return to the specific role of the site-class pairs in the next section.

The state of $H^i$, together with the site-class pair $(r_a^i, r_b^i)$ and the evolutionary time separating proteins $p_a$ and $p_b$, $t_{ab}$, specifies a distribution over three conditionally independent stochastic processes describing each of the three types of site observation pairs: $A^i = (A_a^{x(i)}, A_b^{y(i)})$, $X^i = (X_a^{x(i)}, X_b^{y(i)})$, and $S^i = (S_a^{x(i)}, S_b^{y(i)})$. This conditional independence structure allows the likelihood of a site observation pair at an aligned site $i$ to be written as follows:

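$$
p(O^i \mid H^i, r_a^i, r_b^i, t_{ab}) =
\underbrace{p(A^i \mid H^i, r_a^i, r_b^i, t_{ab})}_{\text{amino acid evolution}}
\times
\underbrace{p(X^i \mid H^i, r_a^i, r_b^i, t_{ab})}_{\text{dihedral angle evolution}}
\times
\underbrace{p(S^i \mid H^i, r_a^i, r_b^i, t_{ab})}_{\text{secondary structure evolution}}
\tag{1}
$$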

The assumption of conditional independence provides computational tractability, allowing us to avoid costly marginalization when certain combinations of data are missing (e.g., amino acid sequences present, but secondary structures and dihedral angles missing).
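The practical consequence of this factorization can be sketched as follows: when an observation type is missing at a site, its factor integrates (or sums) to one and can simply be dropped from the product. The function name, the dictionary-based interface, and the placeholder factor values below are hypothetical and chosen only for illustration; they are not the densities used in ETDBN.

```python
def site_likelihood(factors, observation):
    """Product of the factors for the observation types that are present.

    factors:     maps a data-type name to a function returning
                 p(pair | hidden state, site-class pair, t_ab).
    observation: maps a data-type name to an observed pair, or None if missing.
    """
    likelihood = 1.0
    for name, density in factors.items():
        pair = observation.get(name)
        if pair is not None:          # a missing data type contributes a factor of 1
            likelihood *= density(pair)
    return likelihood


# Placeholder factors for one fixed hidden state, site-class pair, and t_ab.
SS_CLASSES = "HSC"
toy_factors = {
    "A": lambda pair: 0.01 if pair[0] == pair[1] else 0.002,     # amino acid pair
    "X": lambda pair: 0.3,                                        # dihedral angle pair
    "S": lambda pair: 0.2 if pair[0] == pair[1] else 0.4 / 6.0,  # sums to 1 over 9 pairs
}

full = site_likelihood(toy_factors, {"A": ("G", "E"), "X": ((-1.0, 2.5), (1.2, 0.4)), "S": ("C", "C")})
partial = site_likelihood(toy_factors, {"A": ("G", "E"), "X": None, "S": None})
print(full, partial)
```

Because the secondary structure factor in this toy example is a proper distribution over the nine class pairs, summing the full likelihood over all secondary structure pairs gives the same value as omitting that factor, which is the marginalization the text says is avoided.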

Stochastic Processes Modeling Evolutionary Dependencies

Each site-class couples together three time-reversible stochastic processes that separately describe the evolution of the three pairs of observation types, as in equation (1). Each site-class is intended to capture both physical and evolutionary features pertaining to sequence and structure. Parameters that correspond to a particular site-class are termed site-class specific, whereas parameters that are shared across all site-classes are termed global. The use of site-class specific parameters, such as site-class specific amino acid frequencies and dihedral angle diffusion parameters, as described in the next section, is intended to model site-specific physical–chemical properties (Halpern and Bruno 1998; Koshi and Goldstein 1998; Lartillot and Philippe 2004).

Amino Acid Evolution

As is typical with models of sequence evolution, amino acid evolution, $p(A_a^{x(i)}, A_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$, is described by a Continuous-Time Markov Chain (CTMC). Each amino acid CTMC is parameterized in the following way: the exchangeability of amino acids is described by a $20 \times 20$ symmetric global exchangeability matrix $S$ (190 free parameters; Whelan and Goldman 2001), a site-class specific set of 20 amino acid equilibrium frequencies $\Pi_r^h = \mathrm{diag}\{\pi_1, \pi_2, \ldots, \pi_{20}\}$ (19 free parameters per site-class), and a site-class specific scaling factor $\Lambda_r^h$ (one free parameter per site-class). Together, these parameters define a site-class specific time-reversible amino acid rate matrix $Q_r^h = \Lambda_r^h S \Pi_r^h$. The stationary distribution of $Q_r^h$ is given by the amino acid equilibrium frequencies $\Pi_r^h$.
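The construction of a time-reversible rate matrix from a symmetric exchangeability matrix, equilibrium frequencies, and a scaling factor can be sketched on a three-state toy alphabet (the paper uses 20 amino acids). Two assumptions are made here that the text does not spell out: the diagonal is set so that each row sums to zero, which is the usual CTMC convention, and transition probabilities are obtained with a simple scaled Taylor series for the matrix exponential. All names and numerical values are hypothetical.

```python
def rate_matrix(S, pi, scale=1.0):
    """Off-diagonal entries scale * S[i][j] * pi[j]; rows are made to sum to zero."""
    n = len(pi)
    Q = [[scale * S[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i])
    return Q


def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def mat_exp(M, terms=30, squarings=10):
    """exp(M) via a Taylor series on M / 2**squarings, followed by repeated squaring."""
    n = len(M)
    scaled = [[v / 2 ** squarings for v in row] for row in M]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_probabilities(Q, t):
    """P(t) = exp(Q t): probabilities of each end state given each start state."""
    return mat_exp([[q * t for q in row] for row in Q])


# Toy symmetric exchangeabilities and equilibrium frequencies (illustrative values).
S = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.5], [2.0, 0.5, 0.0]]
pi = [0.5, 0.3, 0.2]
Q = rate_matrix(S, pi, scale=1.5)
P = transition_probabilities(Q, 0.7)
print(P)
```

Since $S$ is symmetric, $\pi_i Q_{ij} = \pi_j Q_{ji}$ holds for every pair of states, which is the detailed balance condition behind the time-reversibility and stationary distribution stated above.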

Secondary Structure Evolution

Secondary structure evolution, $p(S_a^{x(i)}, S_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$, is also described by a CTMC. For the sake of simplicity, we use only three discrete classes to describe secondary structure at each position: helix (H), sheet (S), and random coil (C).

The exchangeability of secondary structure classes at a position is described by a $3 \times 3$ symmetric global exchangeability matrix $V$ and a site-class specific set of three secondary structure equilibrium frequencies $\Omega_r^h = \mathrm{diag}\{\pi_1, \pi_2, \pi_3\}$. Together, they define a site-class specific time-reversible secondary structure rate matrix $R_r^h = V \Omega_r^h$, with stationary distribution $\Omega_r^h$.

Dihedral Angle Evolution

Central to our model is evolutionary dependence between dihedral angles, $p(X_a^{x(i)}, X_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$. Typically, the continuous-time evolution of continuous-state random variables is modeled by a diffusive process, such as the OU process, as in Challis and Schmidler (2012). However, an OU process is not appropriate for dihedral angles, as they have a natural periodicity. For this reason, a bivariate diffusion that captures the periodic nature of dihedral angles, the Wrapped Normal (WN) diffusion, was specifically developed for this paper in García-Portugués et al. (2017).

Topologically, the WN diffusion (see fig. 2 for a pictorial example) can be thought of as the analogue of the OU process on the torus $\mathbb{T}^2 = [-\pi, \pi) \times [-\pi, \pi)$. The WN diffusion arises as the wrapping on $\mathbb{T}^2$ of the following Euclidean diffusion:

$$dX_t = \underbrace{A \Big( \sum_{k \in \mathbb{Z}^2} (\mu - X_t - 2k\pi)\, w_k(X_t) \Big)}_{\text{drift coefficient}}\, dt + \underbrace{\Sigma^{1/2}}_{\text{diffusion coefficient}}\, dW_t, \qquad (2)$$

where $W_t$ is the two-dimensional Wiener process, $A$ is the drift matrix, $\mu \in \mathbb{T}^2$ is the stationary mean, $\Sigma$ is the infinitesimal covariance matrix, and

$$w_k(\theta) = \frac{\phi_{\frac{1}{2}A^{-1}\Sigma}(\theta - \mu + 2k\pi)}{\sum_{m \in \mathbb{Z}^2} \phi_{\frac{1}{2}A^{-1}\Sigma}(\theta - \mu + 2m\pi)}, \quad \theta \in \mathbb{T}^2, \qquad (3)$$

is a probability density function (pdf) for $k \in \mathbb{Z}^2$. $\phi_\Sigma$ stands for the pdf of a bivariate Gaussian $N(0, \Sigma)$. The pdf (eq. 3) weights the linear drifts of equation (2) such that they become smooth and periodic.
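Equations (2) and (3) can be illustrated with a short simulation. The sketch below is an Euler-Maruyama discretization (a generic scheme chosen here for illustration, not necessarily how the authors simulate trajectories), using the parameter values given in the figure 2 caption. The truncation of the sum over $k$ to $\{-3, \ldots, 3\}^2$, the step size, and the random seed are assumptions.

```python
import math
import random

A = [[1.0, 0.5], [0.5, 0.5]]   # drift matrix (values from the fig. 2 caption)
MU = (0.0, 0.0)                # stationary mean
SIGMA = 1.5                    # Sigma = SIGMA**2 * I, so Sigma^(1/2) = SIGMA * I
K = 3                          # truncation of the sum over k (an assumption)
WINDINGS = [(k1, k2) for k1 in range(-K, K + 1) for k2 in range(-K, K + 1)]
# The precision of N(0, (1/2) A^{-1} Sigma) is 2 Sigma^{-1} A = (2 / SIGMA**2) A here.
PREC = [[2.0 * a / SIGMA ** 2 for a in row] for row in A]

def wrap(x):
    """Map an angle to [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi

def drift(theta):
    """Drift of eq. (2): A * sum_k (mu - theta - 2 k pi) * w_k(theta)."""
    shifts, logs = [], []
    for k1, k2 in WINDINGS:
        d0 = theta[0] - MU[0] + 2.0 * math.pi * k1
        d1 = theta[1] - MU[1] + 2.0 * math.pi * k2
        quad = PREC[0][0] * d0 * d0 + 2.0 * PREC[0][1] * d0 * d1 + PREC[1][1] * d1 * d1
        shifts.append((d0, d1))
        logs.append(-0.5 * quad)
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]  # Gaussian constants cancel in w_k
    total = sum(weights)
    v0 = -sum(w * d[0] for w, d in zip(weights, shifts)) / total
    v1 = -sum(w * d[1] for w, d in zip(weights, shifts)) / total
    return (A[0][0] * v0 + A[0][1] * v1, A[1][0] * v0 + A[1][1] * v1)

def simulate(theta0, t_end, n_steps, rng):
    """Euler-Maruyama path on the torus, wrapped after every step."""
    dt = t_end / n_steps
    path = [tuple(theta0)]
    for _ in range(n_steps):
        th = path[-1]
        b = drift(th)
        step = [th[i] + b[i] * dt + SIGMA * math.sqrt(dt) * rng.gauss(0.0, 1.0)
                for i in range(2)]
        path.append((wrap(step[0]), wrap(step[1])))
    return path

path = simulate((0.0, 0.0), t_end=2.0, n_steps=400, rng=random.Random(1))
```

Two properties are worth checking in such a sketch: the drift vanishes at the stationary mean, and it is unchanged when an angle is shifted by $2\pi$, which is the periodicity that the weights $w_k$ are designed to provide.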

It is shown in García-Portugués et al. (2017) that the stationary distribution of the WN diffusion is a $\mathrm{WN}(\mu, \frac{1}{2}A^{-1}\Sigma)$, where a $\mathrm{WN}(\mu, \Sigma)$ has pdf

$$p_{WN}(\theta \mid \mu, \Sigma) = \sum_{k \in \mathbb{Z}^2} \phi_\Sigma(\theta - \mu + 2k\pi). \qquad (4)$$

Despite involving an infinite sum over $\mathbb{Z}^2$, taking just the first few terms of this sum provides a tractable and accurate approximation to the stationary density for most realistic parameter values.
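The effect of truncating the sum in equation (4) can be checked numerically. The sketch below evaluates the truncated WN density for a covariance of $\frac{1}{2}A^{-1}\Sigma$ computed by hand from the figure 2 parameters; the function name `wn_pdf`, the truncation levels, and the grid size are illustrative choices, not taken from the paper.

```python
import math

def wn_pdf(theta, mu, cov, K=3):
    """Bivariate wrapped normal pdf, summing over k in {-K, ..., K}^2."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    norm = 1.0 / (2.0 * math.pi * math.sqrt(det))
    total = 0.0
    for k1 in range(-K, K + 1):
        for k2 in range(-K, K + 1):
            d0 = theta[0] - mu[0] + 2.0 * math.pi * k1
            d1 = theta[1] - mu[1] + 2.0 * math.pi * k2
            quad = inv[0][0] * d0 * d0 + 2.0 * inv[0][1] * d0 * d1 + inv[1][1] * d1 * d1
            total += norm * math.exp(-0.5 * quad)
    return total

# (1/2) A^{-1} Sigma for A = (1, 0.5; 0.5, 0.5) and Sigma = 1.5**2 * I.
COV = [[2.25, -2.25], [-2.25, 4.5]]
MU = (0.0, 0.0)

point = (3.0, -3.0)
few_terms = wn_pdf(point, MU, COV, K=2)
more_terms = wn_pdf(point, MU, COV, K=3)

# Riemann sum over the torus; the density should integrate to about one.
n = 40
h = 2.0 * math.pi / n
integral = sum(
    wn_pdf((-math.pi + i * h, -math.pi + j * h), MU, COV)
    for i in range(n) for j in range(n)
) * h * h
```

Even with a fairly diffuse covariance such as this one, a handful of winding numbers per dimension already agrees with a larger truncation to high precision.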

Maximum Likelihood Estimation (MLE) for diffusions is based on the transition probability density (tpd), which only has a tractable analytical form for very few specific processes. A highly tractable and accurate approximation to the tpd is given for the WN diffusion. This approximation results from weighting the tpd of the OU process in the same fashion as the linear drifts are weighted in equation (2), yielding the following multimodal pseudo-tpd:

$$\tilde{p}(\theta_2 \mid \theta_1, A, \mu, \Sigma, t) = \sum_{m \in \mathbb{Z}^2} p_{WN}(\theta_2 \mid \mu_t^m, \Gamma_t)\, w_m(\theta_1), \qquad (5)$$

with $\theta_1, \theta_2 \in \mathbb{T}^2$, $\mu_t^m = \mu + e^{-tA}(\theta_1 - \mu + 2\pi m)$, and $\Gamma_t = \int_0^t e^{-sA} \Sigma e^{-sA^T}\, ds$. The pseudo-tpd provides a good approximation to the true tpd in key circumstances: 1) $t \to 0$, since it collapses to the Dirac delta; 2) $t \to \infty$, since it converges to the stationary distribution; 3) high concentration, since the WN diffusion becomes an OU process. Furthermore, it is shown in García-Portugués et al. (2017) that the pseudo-tpd has a lower Kullback–Leibler divergence with respect to the true tpd than the Euler and Shoji-Ozaki pseudo-tpds for most typical scenarios and discretization times in the diffusion trajectory.
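As a worked aside that is not part of the original text, the integral defining $\Gamma_t$ admits a closed form whenever $A^{-1}\Sigma$ is symmetric, which is the constraint the model places on $A$ and $\Sigma$. The sketch below additionally assumes that $A$ is invertible with eigenvalues of positive real part, so that the limit $t \to \infty$ exists.

```latex
% Symmetry of A^{-1}\Sigma means A^{-1}\Sigma = \Sigma A^{-T}, i.e. A\Sigma = \Sigma A^{T},
% and therefore \Sigma e^{-sA^{T}} = e^{-sA}\Sigma for every s.
\begin{aligned}
\Gamma_t &= \int_0^t e^{-sA}\,\Sigma\,e^{-sA^{T}}\,ds
          = \int_0^t e^{-2sA}\,\Sigma\,ds \\
         &= \tfrac{1}{2}A^{-1}\left(I - e^{-2tA}\right)\Sigma, \\
\lim_{t \to 0}\Gamma_t &= 0, \qquad
\lim_{t \to \infty}\Gamma_t = \tfrac{1}{2}A^{-1}\Sigma .
\end{aligned}
```

The two limits match circumstances 1) and 2) above: the covariance vanishes as $t \to 0$ (a Dirac delta) and approaches the stationary covariance $\frac{1}{2}A^{-1}\Sigma$ as $t \to \infty$.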

A further desirable property of the pseudo-tpd is that it obeys the time-reversibility equation, which in terms of $(X_a^{x(i)}, X_b^{y(i)})$ is

$$\tilde{p}(X_b^{y(i)} \mid X_a^{x(i)}, A, \mu, \Sigma, t_{ab})\, p_{WN}(X_a^{x(i)} \mid \mu, \tfrac{1}{2}A^{-1}\Sigma) = \tilde{p}(X_a^{x(i)} \mid X_b^{y(i)}, A, \mu, \Sigma, t_{ab})\, p_{WN}(X_b^{y(i)} \mid \mu, \tfrac{1}{2}A^{-1}\Sigma).$$

Indeed, the WN diffusion is the unique time-reversible diffusion with constant diffusion coefficient and stationary pdf (eq. 4), in the same way the OU is with respect to a Gaussian. Time-reversibility is an assumption of the overall model and of many other models of sequence evolution. A benefit of time-reversibility in a pairwise model such as ETDBN is that one of the proteins in a pair may be arbitrarily chosen as the ancestor, thus avoiding a computationally expensive marginalization of an unobserved ancestor.

The likelihood of a dihedral angle observation pair $(X_a^{x(i)}, X_b^{y(i)})$, assuming that $X_a^{x(i)}$ is drawn from the stationary distribution, is given by

$$p(X_a^{x(i)}, X_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab}) = p(X_a^{x(i)}, X_b^{y(i)} \mid A, \mu, \Sigma, t_{ab}) \approx \tilde{p}(X_b^{y(i)} \mid X_a^{x(i)}, A, \mu, \Sigma, t_{ab})\, p_{WN}(X_a^{x(i)} \mid \mu, \tfrac{1}{2}A^{-1}\Sigma). \qquad (6)$$

$A$ and $\Sigma$ are constrained to yield a covariance matrix $A^{-1}\Sigma$. A parameterization that achieves this is $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2)$ and

$$A = \begin{pmatrix} \alpha_1 & \frac{\sigma_1}{\sigma_2}\alpha_3 \\ \frac{\sigma_2}{\sigma_1}\alpha_3 & \alpha_2 \end{pmatrix}, \quad \alpha_1\alpha_2 > \alpha_3^2.$$

$\alpha_1$ and $\alpha_2$ are the drift components for the $\phi$ and $\psi$ dihedral angles, respectively. Dependence (correlation) between the dihedral angles is captured by $\alpha_3$. A depiction of a WN diffusion with given drift and diffusion parameters is shown in figure 2.
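A quick numerical check of this parameterization may help. The sketch below builds $A$ from $(\alpha_1, \alpha_2, \alpha_3, \sigma_1, \sigma_2)$ and computes $\frac{1}{2}A^{-1}\Sigma$ by explicit matrix operations, so that its symmetry can be verified rather than assumed. The function names are hypothetical, and the extra requirement $\alpha_1 > 0$ is an assumption added here to guarantee positive definiteness; the text only states $\alpha_1\alpha_2 > \alpha_3^2$.

```python
def drift_matrix(alpha1, alpha2, alpha3, sigma1, sigma2):
    """Drift matrix A under the parameterization given in the text."""
    return [[alpha1, sigma1 / sigma2 * alpha3],
            [sigma2 / sigma1 * alpha3, alpha2]]

def stationary_covariance(alpha1, alpha2, alpha3, sigma1, sigma2):
    """Return (1/2) * A^{-1} * Sigma, computed by explicit 2 x 2 matrix algebra."""
    # alpha1 > 0 is an extra assumption made in this sketch.
    if not (alpha1 > 0 and alpha1 * alpha2 > alpha3 ** 2):
        raise ValueError("need alpha1 > 0 and alpha1 * alpha2 > alpha3**2")
    A = drift_matrix(alpha1, alpha2, alpha3, sigma1, sigma2)
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    A_inv = [[A[1][1] / det, -A[0][1] / det],
             [-A[1][0] / det, A[0][0] / det]]
    Sigma = [[sigma1 ** 2, 0.0], [0.0, sigma2 ** 2]]
    return [[0.5 * sum(A_inv[i][k] * Sigma[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Parameters matching the fig. 2 caption: A = (1, 0.5; 0.5, 0.5), Sigma = 1.5**2 * I.
cov_fig2 = stationary_covariance(1.0, 0.5, 0.5, 1.5, 1.5)

# Hypothetical parameters with unequal sigmas and negative dependence.
cov_other = stationary_covariance(2.0, 1.0, -0.7, 0.8, 1.7)
```

With unequal $\sigma_1$ and $\sigma_2$ the matrix $A$ itself is not symmetric, yet $A^{-1}\Sigma$ still is, which is the point of scaling the off-diagonal entries by $\sigma_1/\sigma_2$ and $\sigma_2/\sigma_1$.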

Site-Classes: Constant Evolution and Jump Events

We now turn to the meaning of the site-class pairs. Two modes of evolution are modeled: constant evolution and jump events. Constant evolution occurs when the site-class starting in protein $p_a$ at aligned site $i$, $r_a^i$, is the same as the site-class ending in protein $p_b$ at aligned site $i$, $r_b^i$; that is, $r_a^i = r_b^i$. Thus, the distribution over observation pairs at a site is specified by a single site-class.

As already stated, a site-class specifies the parameters of the three conditionally independent stochastic processes describing amino acid, dihedral angle, and secondary structure evolution. A limitation of “constant evolution” is that the coupling between the three stochastic processes is somewhat weak. This in part stems from the time-reversibility of the stochastic processes: swapping the order of one of the three observation pairs at a homologous site, e.g., (Glycine, Proline) instead of (Proline, Glycine), does not alter the likelihood in equation (1). Restated from a generative perspective, there is no ‘directional coupling’: the direction of an amino acid interchange does not inform the direction of change in dihedral angle or secondary structure. For example, replacing a glycine in an α-helix in one protein with a proline at the homologous position in a second protein would be expected to break the α-helix in the second protein and to strongly inform the plausible dihedral angle conformations in the second protein.

Ideally, we would consider a model in which the underlying site-classes were not fixed over the evolutionary trajectory separating the two proteins, as in the case of constant evolution described above, but instead were able to ‘evolve’ in time. This would allow occasional switches in the underlying site-class at a particular homologous site, which would create a stronger dependency between amino acid, dihedral angle, and secondary structure evolution that furthermore captures the directional coupling we desire. Such an approach is considered computationally intractable due to the introduction of context-dependence when having to consider neighboring dependencies amongst evolutionary trajectories at adjacent sites.

In order to approximate this ‘ideal’ model in a computationally efficient manner, we introduce the notion of a jump event. A jump event occurs when $r_a^i \neq r_b^i$. Whereas constant evolution is intended to capture angular drift (changes in dihedral angles localized to a region of the Ramachandran plot), a jump event is intended to create a directional coupling between amino acid and structure evolution, and is also expected to capture angular shift (large changes in dihedral angles, possibly between distant regions of the Ramachandran plot).

The hidden state at node $H^i$, together with the evolutionary time $t_{ab}$ separating proteins $p_a$ and $p_b$, specifies a joint distribution over a site-class pair:

$$p(r_a^i, r_b^i \mid H^i, t_{ab}) = p(r_a^i \mid H^i, r_b^i, t_{ab})\, p(r_b^i \mid H^i), \qquad (7)$$

where

$$p(r_a^i \mid H^i, r_b^i, t_{ab}) = \begin{cases} e^{-\gamma_{H^i} t_{ab}} + \pi_{H^i, r_a^i}\,(1 - e^{-\gamma_{H^i} t_{ab}}), & \text{if } r_a^i = r_b^i, \\ \pi_{H^i, r_a^i}\,(1 - e^{-\gamma_{H^i} t_{ab}}), & \text{if } r_a^i \neq r_b^i, \end{cases}$$

and $p(r_a^i \mid H^i) = \pi_{H^i, r_a^i}$ and $p(r_b^i \mid H^i) = \pi_{H^i, r_b^i}$. $\pi_{H^i, r_a^i}$ and $\pi_{H^i, r_b^i}$ are model parameters specifying the probability of starting in site-class $r_a^i$ or $r_b^i$, respectively, corresponding to the hidden state specified by node $H^i$. $\gamma_{H^i} > 0$ is a model parameter giving the jump rate corresponding to the hidden state given by node $H^i$.

The site-class jump probabilities have been chosen so that time-reversibility holds; in other words,

$$p(r_a^i \mid H^i, r_b^i, t_{ab})\, p(r_b^i \mid H^i) = p(r_b^i \mid H^i, r_a^i, t_{ab})\, p(r_a^i \mid H^i).$$
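The site-class pair distribution and its time-reversibility can be checked with a few lines of code. This is a minimal sketch for a single hidden state with two site-classes; the function names are hypothetical, and the jump rate and evolutionary times are illustrative values rather than estimates from the paper.

```python
import math

def site_class_conditional(r_a, r_b, pi, gamma, t):
    """p(r_a | H, r_b, t): stay with probability exp(-gamma * t), else redraw from pi."""
    stay = math.exp(-gamma * t)
    redraw = pi[r_a] * (1.0 - stay)
    return redraw + stay if r_a == r_b else redraw

def site_class_joint(r_a, r_b, pi, gamma, t):
    """Eq. (7): p(r_a, r_b | H, t) = p(r_a | H, r_b, t) * p(r_b | H)."""
    return site_class_conditional(r_a, r_b, pi, gamma, t) * pi[r_b]

PI = {1: 0.7, 2: 0.3}   # illustrative site-class probabilities for one hidden state
GAMMA = 2.0             # illustrative jump rate

def joint_table(t):
    return {(a, b): site_class_joint(a, b, PI, GAMMA, t) for a in PI for b in PI}

at_zero = joint_table(0.0)     # no time elapsed: a jump is impossible
at_short = joint_table(0.3)
at_long = joint_table(50.0)    # site-classes are effectively independent draws
```

The joint probabilities of the two jump pairs, (1, 2) and (2, 1), are equal at every evolutionary time, which is exactly the time-reversibility condition stated above.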

The hidden state at node $H^i$, together with a site-class pair $(r_a^i, r_b^i)$ and the evolutionary time $t_{ab}$, specifies the joint likelihood over site observation pairs:

$$p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab}) = \begin{cases} p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_c^i, t_{ab}), & \text{if } r_a^i = r_b^i = r_c^i, \\ p(O_a^{x(i)} \mid H^i, r_a^i)\, p(O_b^{y(i)} \mid H^i, r_b^i), & \text{if } r_a^i \neq r_b^i. \end{cases} \qquad (8)$$

In the case of constant evolution, evolution at aligned site $i$ is described in terms of the same site-class $r_c^i$. Evolution is considered constant because each observation type is drawn from a single stochastic process specified by $H^i$ and $r_c^i$. Note that the strength of the evolutionary dependency within an observation pair is a function of the evolutionary time $t_{ab}$.

In the case of a jump event, the evolutionary processes are, after the evolutionary jump, restarted independently in the stationary distribution of the new site-class. Thus, the site observations $O_a^{x(i)}$ and $O_b^{y(i)}$ are assumed to be drawn from the stationary distributions of two separate stochastic processes corresponding to site-classes $r_a^i$ and $r_b^i$, respectively.

This implies that, conditional on a jump, the likelihood of the observations is no longer dependent on $t_{ab}$. A jump event is an abstraction that captures the end-points of the evolutionary process, but ignores the potential evolutionary trajectory linking the two site observations. The advantage of abstracting the evolutionary trajectory is that there is no need to perform a computationally expensive marginalization over all possible trajectories, as might be necessary in a model where the hidden states evolve along a tree. The likelihood of an observation pair is now simply a sum over the four possible site-class pairs:

$$p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, t_{ab}) = \sum_{(r_a^i, r_b^i) \in \mathcal{R}} p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})\, p(r_a^i, r_b^i \mid H^i, t_{ab}),$$

where $\mathcal{R} = \{(1, 1), (1, 2), (2, 1), (2, 2)\}$ is the set of four site-class pairs, $p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$ is given by equation (8), and $p(r_a^i, r_b^i \mid H^i, t_{ab})$ is given by equation (7).
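The sum over the four site-class pairs can be sketched as follows. To keep the example self-contained, the observation model is a hypothetical stand-in: a two-letter alphabet evolving under a simple reversible CTMC, rather than the amino acid, secondary structure, and dihedral angle processes of the actual model. All numerical values and function names are illustrative assumptions.

```python
import math

def site_class_joint(r_a, r_b, pi, gamma, t):
    """Eq. (7): p(r_a, r_b | H, t) = p(r_a | H, r_b, t) * p(r_b | H)."""
    stay = math.exp(-gamma * t)
    cond = pi[r_a] * (1.0 - stay) + (stay if r_a == r_b else 0.0)
    return cond * pi[r_b]

def pair_likelihood(o_a, o_b, t, pi, gamma, joint_obs, stationary_obs):
    """Weight eq. (8) by eq. (7) and sum over all site-class pairs."""
    total = 0.0
    for r_a in pi:
        for r_b in pi:
            if r_a == r_b:   # constant evolution: one process links both observations
                obs = joint_obs(o_a, o_b, r_a, t)
            else:            # jump event: two independent stationary draws
                obs = stationary_obs(o_a, r_a) * stationary_obs(o_b, r_b)
            total += obs * site_class_joint(r_a, r_b, pi, gamma, t)
    return total

# Hypothetical stand-in observation model on the alphabet {0, 1}.
FREQ = {1: (0.9, 0.1), 2: (0.2, 0.8)}   # equilibrium frequencies per site-class

def stationary_obs(o, r):
    return FREQ[r][o]

def joint_obs(o_a, o_b, r, t):
    decay = math.exp(-t)
    trans = FREQ[r][o_b] * (1.0 - decay) + (decay if o_a == o_b else 0.0)
    return FREQ[r][o_a] * trans

PI = {1: 0.7, 2: 0.3}   # illustrative site-class probabilities
GAMMA = 2.0             # illustrative jump rate
T_AB = 0.3              # illustrative evolutionary time

table = {(a, b): pair_likelihood(a, b, T_AB, PI, GAMMA, joint_obs, stationary_obs)
         for a in (0, 1) for b in (0, 1)}
```

Since every component is a proper and reversible distribution, the resulting table sums to one over all observation pairs and is symmetric in the two proteins.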

Identification of Evolutionary Motifs Encoding Jump Events

In order to identify aligned sites having potential evolutionary motifs encoding jump events, a specific criterion was developed.

For a particular protein pair, inference was performed under the model conditioned on the amino acid sequence and dihedral angles for both proteins $(A_a, A_b, X_a, X_b)$. Homologous sites corresponding to a single hidden state and with evidence of a jump event ($r_a^i \neq r_b^i$) at posterior probability > 0.90 were identified; that is, the $i$'s such that $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_a, X_b) > 0.90$.

FIG. 2. Drift vector field for the WN diffusion with $A = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}$, $\mu = (0, 0)$, and $\Sigma = (1.5)^2 I$. The color gradient represents the Euclidean norm of the drift. The contour lines represent the stationary distribution. An example trajectory starting at $x_0 = (0, 0)$ and ending at $x_2$, running in the time interval [0, 2], is depicted using a white to red color gradient indicating the progression of time. The periodic nature of the diffusion can be seen by the wrapping of both the stationary distribution and the trajectory at the boundaries of the square plane. The fact that the stationary distribution is not aligned with the horizontal and vertical axes illustrates the dependence (given by $\alpha_3$) between the $\phi$ and $\psi$ dihedral angles.

In a second filtering step, amino acid sequences and a single set of dihedral angles corresponding to one of the proteins were used ($A_a, A_b, X_a$ or $A_a, A_b, X_b$) to infer the posterior probability, this time at a lower threshold: $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_a) > 0.50$ or $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_b) > 0.50$. This second criterion ensured that the evolutionary motif was identifiable under typical conditions where one has limited access to structural information (in this case, a single protein structure in a pair). Only those aligned sites meeting both criteria were selected for downstream analysis.
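The two-step criterion amounts to a simple filter over per-site posterior jump probabilities. In the sketch below, the function name and the posterior values are hypothetical; in practice the three lists would come from inference under the model with the corresponding observations.

```python
def select_motif_sites(post_full, post_xa, post_xb, first=0.90, second=0.50):
    """Indices of aligned sites passing both filtering steps.

    post_full: jump posteriors given (A_a, A_b, X_a, X_b)
    post_xa:   jump posteriors given (A_a, A_b, X_a)
    post_xb:   jump posteriors given (A_a, A_b, X_b)
    """
    selected = []
    for i, (p_full, p_xa, p_xb) in enumerate(zip(post_full, post_xa, post_xb)):
        passes_first = p_full > first
        passes_second = p_xa > second or p_xb > second
        if passes_first and passes_second:
            selected.append(i)
    return selected

# Hypothetical posterior jump probabilities for four aligned sites.
post_full = [0.95, 0.99, 0.50, 0.91]
post_xa = [0.60, 0.20, 0.90, 0.40]
post_xb = [0.10, 0.30, 0.90, 0.55]
sites = select_motif_sites(post_full, post_xa, post_xb)
```

Site 1 passes the first criterion but fails the second, and site 2 does the reverse, so only sites 0 and 3 are retained.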

Statistical Alignment: Modeling Insertions and Deletions

Protein sequences can undergo not only amino acid transitions due to underlying nucleotide mutations in the coding sequence, but also indel events. To account for this, a modified pairwise TKF92 alignment HMM based on Miklós et al. (2004) was implemented. The TKF92 alignment HMM was augmented with the ETDBN evolutionary hidden states in order to capture local sequence and structure evolutionary dependencies. Furthermore, it was modified such that neighboring dependencies amongst hidden states at adjacent alignment sites were modeled. For more details, see ‘Statistical alignment’ in the Supplementary Material online.

Whilst it is possible to fix the alignment in advance by pre-aligning the sequences using one of the many available alignment methods (Katoh et al. 2002; Edgar 2004) or using a curated alignment (such as from the HOMSTRAD database), doing so ignores alignment uncertainty.

Training and Test Datasets

A training dataset of 1,200 protein pairs (2,400 proteins; 417,870 site observation pairs) and a test dataset of 38 protein pairs (76 proteins; 14,125 site observation pairs) were assembled from 1,032 protein families in the HOMSTRAD database. For further details, see ‘Construction of test and training datasets’ in the Supplementary Material online.

Model Training and Selection

Maximum Likelihood Estimation of the model parameters $\Psi$ was done using Stochastic Expectation Maximization (StEM; Gilks et al. 1995).

For further details of the E- and M-steps of the StEM algorithm, and for details about model selection, please refer to ‘Model training and selection’ in the Supplementary Material online.

Results and Discussion

Selecting the Number of Hidden States

Fourteen models were trained (8, 16, 32, 48, 52, 56, 60, 64, 68, 72, 76, 80, 96, and 112 hidden state models). The 64 hidden state model was chosen as the best model, as it had the lowest Bayesian Information Criterion (BIC; fig. 3).
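Selection by BIC can be sketched as below, using the standard definition $\mathrm{BIC} = k \ln n - 2 \ln L$. The log-likelihoods, parameter counts, and number of observations are made-up numbers for illustration only; they are not the values behind figure 3, and how the authors count observations is not specified in this passage.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L); lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Made-up candidates: hidden states -> (maximized log-likelihood, free parameters).
candidates = {
    32: (-12000.0, 400),
    64: (-10000.0, 800),
    96: (-9900.0, 1200),
}
N_OBS = 5000   # made-up number of observations

scores = {h: bic(ll, k, N_OBS) for h, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
```

In this toy setting, the largest model fits slightly better but pays a heavier parameter penalty, so the intermediate model is selected; the same trade-off underlies the curve and points in figure 3.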

Stationary Distributions over Dihedral Angles Capture the Empirical Distribution

Figure 4 illustrates the sampled and empirical dihedral angle distributions. There is a good correspondence between dihedral angles sampled under the model (fig. 4, left) and the empirical distribution of dihedral angles in our training dataset (fig. 4, right) for all three cases illustrated (all amino acids, glycine only, and proline only). The correspondence is not surprising, given that ETDBN is effectively a mixture model with a large number of mixture components.

Estimates of Evolutionary Time from Dihedral Angles Are Consistent with Estimates from Sequence

Whilst ETDBN has a general scope with respect to applications (including acting as a proposal distribution or as a building block in a homology modeling application), we envision the primary application being inference of evolutionary parameters.

Figure 5 compares evolutionary times estimated using only pairs of homologous amino acid sequences versus only pairs of homologous dihedral angles. As desired, the two estimates of evolutionary time for each protein pair are similar, as can be seen by the proximity of the points to the identity line.

A paired t-test gave a p-value of 0.578, thus failing to reject the null hypothesis that there is no difference between branch lengths estimated using sequence only versus angles only. This indicates that there is sufficient evolutionary information in the dihedral angles to estimate the evolutionary times, and that the model is consistent in its estimates, lacking a significant tendency to underestimate or overestimate the evolutionary times when either sequence or dihedral angles are used.

Interestingly, the variance in the sampled evolutionary times is higher when dihedral angles only are used, as compared with sequence only (see fig. S1, Supplementary Material online).

The Relationship between Evolutionary Time and Angular Distance Is Adequately Modeled

We investigated the relationship between evolutionary time and angular distance between real protein pairs and protein pairs where the dihedral angles of $p_b$ ($X_b$) were treated as missing and hence sampled (fig. 6).

As expected, for both real and sampled pairs, angular distance tends to increase as a function of evolutionary time. For larger evolutionary times, a plateau begins to emerge, which is expected, as the maximum possible theoretical angular distance is $\sqrt{8} \approx 2.828$.
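The exact definition of angular distance is given in the Supplementary Material and is not reproduced in this section. As an illustration only, the sketch below uses a chordal distance on the torus, $\sqrt{2(1 - \cos\Delta\phi) + 2(1 - \cos\Delta\psi)}$, which is an assumption chosen because its maximum is $\sqrt{8}$; it may differ from the authors' definition.

```python
import math

def angular_distance(x, y):
    """Chordal distance between two (phi, psi) pairs; an assumed definition."""
    d_phi = x[0] - y[0]
    d_psi = x[1] - y[1]
    return math.sqrt(2.0 * (1.0 - math.cos(d_phi)) + 2.0 * (1.0 - math.cos(d_psi)))

def mean_angular_distance(xs, ys):
    """Average distance over aligned positions of two dihedral angle sequences."""
    return sum(angular_distance(x, y) for x, y in zip(xs, ys)) / len(xs)

# The largest possible separation: both angles differ by pi.
max_distance = angular_distance((0.0, 0.0), (math.pi, math.pi))

# Hypothetical dihedral angles at three aligned positions.
xs = [(-1.0, -0.7), (-2.1, 2.4), (1.0, 0.3)]
ys = [(-1.1, -0.6), (-2.0, 2.5), (-1.5, 0.1)]
average = mean_angular_distance(xs, ys)
```

Under this definition the distance is periodic in each angle, so pairs close to the $\pm\pi$ boundary on opposite sides are correctly treated as near one another.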

When the evolutionary time is exactly zero ($t_{ab} = 0$), under our model the angular distance between sampled dihedral angles is exactly zero (not shown in fig. 6). However, this is not expected to be the case for real protein pairs when the two sequences are identical (due to the inherently flexible nature of proteins, different experimental conditions, experimental noise, etc.). It is therefore not surprising that the regression curve for the real protein pairs does not pass through zero.


For small evolutionary times (< 0.2), the curves for the real and sampled protein pairs show a good correspondence; however, for larger evolutionary times, the model tends to underestimate angular distances. This may reflect the fact that the tpd of the WN diffusion specified is localized around its mean, even when the evolutionary time is large; therefore, dihedral angles distant from this mean are unlikely to be sampled. To a certain extent this is mitigated by the jump model, which occasionally allows for large changes in dihedral angle, but it may still be somewhat limited in its flexibility, as jumps can only occur between two site-classes. The majority of protein pairs in our training dataset represent smaller evolutionary times (81.7% of evolutionary times are smaller than 0.4), and therefore protein pairs with larger evolutionary times and their associated jumps are underrepresented in our dataset, which may also explain the underestimation.

An additional possibility is that ETDBN does not attempt to model global dependencies. Echave and Fernández (2010) use a LFENM model (which does take into account global dependencies) and provide evidence showing that the majority of structural changes are due to collective global deformations rather than local deformations. A local model such as ETDBN, by definition, does not take into account global dependencies and therefore does not fully account for their contribution to structural divergence.

Evaluation of the Model

The conditional independence structure in equation (1) enables computationally efficient sampling from the model under different combinations of observed or missing data. For example, ETDBN can be used to sample (i.e., predict) the dihedral angles of a protein from its corresponding amino acid sequence, a homologous amino acid sequence, a homologous set of dihedral angles, the corresponding secondary structure, a homologous secondary structure, or any combination of them.

Predictive accuracy was measured using 38 homologous protein pairs in the test dataset. For every protein pair ($p_a$, $p_b$), the dihedral angles of $p_b$ were treated as missing, and these missing dihedral angles were sampled under the model given a particular combination of observation types. The average angular distance between the sampled and known dihedral angles was used as the measure of predictive accuracy.

Figure 7 gives an example of predictive accuracy under different combinations of observation types, overlaid on a cartoon representation of the protein structure being predicted, whereas figure 8 provides a representative view of predictive accuracy across 10 different protein pairs in the test dataset for different combinations of observation types. We highlight some of the key patterns identified in figures 7 and 8 as follows.

Combination 1 refers to random sampling from the model, implying that no data observations were conditioned on besides the respective lengths of proteins $p_a$ and $p_b$. The average angular distance between the true and predicted dihedral angles was 1.6. Random sampling acts as a baseline for predictive accuracy. It is apparent from figure 7 that the model has a propensity to predict right-handed α-helices, which is the most populated region in the Ramachandran plot.

Under combination 2, only the amino acid sequence corresponding to $p_b$ is observed. As expected, in figures 7 and 8 there is an increase in predictive accuracy with the addition of the amino acid sequence relative to combination 1.

Under combination 3, we add in the amino acid sequence of a homologous protein ($p_a$). In all ten cases there is an improvement in predictive accuracy. The improvement in predictive accuracy is reasonable, as knowledge of the sequence evolutionary trajectory is expected to encode information about structure evolution and hence will inform the dihedral angle conformational possibilities.

Under combination 4, in addition to the two amino acid sequences, we treat the homologous secondary structure as observed. This results in a substantial improvement in predictive accuracy, as one would expect. Knowledge of the amino acid sequence and a homologous secondary structure strongly informs regions of the Ramachandran plot that are likely to be occupied.

Under combination 5 (which we consider the canonical combination, i.e., the standard homology modeling scenario), we treat both amino acid sequences as observed, as well as the dihedral angles of the homologous protein ($p_a$); in all cases, the predictive accuracy improves over combination 4. This is anticipated, as the homologous dihedral angles are expected to be the best proxy for missing dihedral angles and are therefore expected to be more informative than secondary structure alone. Note that the availability of a homologous amino acid sequence pair here and in combination 4 is consequential, as it informs the evolutionary time parameter $t_{ab}$, which will typically constrain the distribution over dihedral angles and reduce the associated uncertainty.

Finally, in combination 6, the same data observations as in combination 5 are used, except that the alignment is treated as given a priori (by the HOMSTRAD alignment) rather than as unobserved. The HOMSTRAD alignment is based on a structural and sequence alignment of $p_a$ and $p_b$, and therefore is expected to encode a higher degree of homology and structural information than combination 5 (where the alignment is treated as unobserved, and therefore a marginalization over alignments is performed). On average, there is a slight improvement in predictive accuracy when fixing the alignment, albeit the magnitude of improvement is not substantial. This demonstrates the accuracy of the alignment HMM.

FIG. 3. BIC scores (points) and number of free parameters (curve) as a function of the number of hidden states across 14 models (indicated above the dotted vertical lines) trained using the 1,200 protein pairs in the training dataset. A 64 hidden state model had the lowest BIC score. Each model represents the best of several attempts.

FIG. 4. Ramachandran plots depicting sampled and empirical dihedral angle distributions. The top row depicts the distributions for all amino acids, the middle for glycine only, and the bottom for proline only. The leftmost plots show dihedral angles sampled under the jump model, whereas the rightmost plots show the empirical distributions of dihedral angles in the training dataset.


The alignment HMM accounts for alignment uncertainty in a principled manner, which is particularly useful when an appropriate alignment is unavailable. However, it should be noted that inference scales as $O(|p_a||p_b|h^2)$ when treating the alignment as unobserved. Inference scales as $O(mh^2)$ when the alignment is fixed a priori, where $m$ is the length of alignment $M_{ab}$ and is typically much smaller than $|p_a||p_b|$.

It should be emphasized that we do not expect ETDBN to compete with structure prediction packages such as Rosetta (Rohl et al. 2004) or homology modeling software such as Arnold et al. (2006) in terms of predictive accuracy. Our current model is a local model of structure evolution: it is not even expected to capture fundamental constraints such as the radius of gyration of a protein or other global features typical of proteins.

Evolutionary Hidden States Reveal a Common Evolutionary Motif

One benefit of ETDBN is that the 64 evolutionary hidden states learned during the training phase are interpretable. We give an example of a hidden state encoding a jump event that was subsequently found to represent an evolutionary motif present in a large number of protein pairs in our test and training datasets.

Evolutionary hidden state 3 (fig. 9) was selected from the 64 hidden states as an example of a hidden state encoding a jump event and capturing angular shift (a large change in dihedral angle). A notable feature of this hidden state is that the change in dihedral angles between site-classes $r_1$ and $r_2$ is associated with specific amino acid changes. In site-class $r_1$, the amino acid frequencies are relatively spread out amongst a number of amino acids, whereas in site-class $r_2$ the frequencies are particularly concentrated in favor of glycine (Gly) and asparagine (Asn), with glycine being significantly more probable in site-class $r_2$ than $r_1$. This suggests that, conditioned on hidden state 3, an exchange between a glycine and another amino acid is likely indicative of a jump, and hence of a corresponding change in dihedral angle. This is consistent with what we find in a subsequent analysis of evolutionary motifs. This particular jump occurs in coil regions.

Having selected hidden state 3, positions in 238 protein pairs were analyzed for evidence of the corresponding evolutionary motif: 38 protein pairs in the test dataset and a further 200 from the training dataset were analyzed using the criteria described in the Methods section. Using the first criterion, 84 protein sites in 59 protein pairs corresponding to $H^i = 3$ (evolutionary hidden state 3) were identified. Of the 84 protein sites, 34 protein sites met the second criterion.

We give an example of a homologous protein pair illustrating the identified evolutionary motif. Two histidine-containing phosphocarriers, 1pch (Mycoplasma capricolum) and 1poh (Escherichia coli), were identified as having the evolutionary motif (fig. 10) at homologous site E39/G39.

FIG. 5. Scatterplot comparing evolutionary times estimated using pairs of homologous amino acid sequences only versus pairs of homologous sets of dihedral angles only for N = 38 protein pairs in the test dataset. The x-coordinate of each point gives the estimated evolutionary time based only on the amino acid sequence, whereas the y-coordinate gives the estimated evolutionary time based only on the dihedral angles. The diagonal line represents y = x.

FIG. 6. Evolutionary time versus angular distances between real and corresponding sampled protein pairs in the training dataset at 50 representative evolutionary times. Mean angular distances (see ‘Calculation of angular distances’ in the Supplementary Material online) between the real dihedral angles in a protein pair (red; $X_a$ and $X_b$) and sampled dihedral angles in a sampled protein pair (blue; $X_a$ and $\hat{X}_b$) were compared to test how well the sampled dihedral angles reproduced the real angular distances. The dihedral angles ($\hat{X}_b$) of each sampled protein pair were sampled by conditioning on both amino acid sequences and the homologous dihedral angles ($A_a$, $A_b$, $X_a$) and the estimated evolutionary time ($t_{ab}$) for the real protein pair. The regression curves were obtained by a quadratic LOcally-weighted regrESSion (LOESS) with smoothing parameter chosen by leave-one-out cross-validation. The 95% confidence intervals for the mean assume error normality.


Most positions in the homologous pair have low posterior jump probabilities (≈ 0.0), with the exception of positions N38/N38 and E39/G39, which both have high posterior jump probabilities (≈ 1.0). The exchange between a glutamate (at position 39 in 1poh) and a glycine (at position 39 in 1pch) appears to be responsible for the shift in dihedral angle. This exchange corresponds to a significant jump in dihedral angle: $\langle \phi_{1poh,E39}, \psi_{1poh,E39} \rangle = \langle -1.63, -0.06 \rangle \rightarrow \langle \phi_{1pch,G39}, \psi_{1pch,G39} \rangle = \langle 1.40, 0.22 \rangle$. The angular distance between the two dihedral angles is 2.01. This is consistent with the amino acid frequency parameters specified by the two site-classes for hidden state 3 (fig. 9).

Site-class $r_1$ indicates that a number of amino acids (alanine, aspartic acid, glycine, histidine, lysine, asparagine, proline, glutamine, arginine, serine, and threonine) other than glutamate plausibly coincide with the particular dihedral angle conformation specified by site-class $r_1$. The involvement of glycine in a jump is not surprising, as it is a small and flexible amino acid, whereas the role of asparagine is less clear. In our analysis of 238 protein pairs, we found that, of the seven positions meeting the criteria for hidden state 3 and involving an exchange with asparagine (Asn), four were an exchange between an asparagine and a glycine, whereas the remaining three were between asparagine and one of lysine, histidine, or serine.

Using Dihedral Angles for Alignment

A valuable feature of our model is its ability to account for alignment uncertainty by summing over possible pairwise alignments, using the TKF92 model as a prior distribution over indel histories, whilst simultaneously taking into account neighboring dependencies amongst aligned sites. Doing so results in a sample of alignments rather than a single alignment. Nevertheless, a single Maximum A Posteriori (MAP) pairwise alignment may be obtained from the alignment samples and used for downstream analysis.

ETDBN and several other alignment methods (namely StatAlign, BAli-Phy, MUSCLE, and MAFFT) were used to infer pairwise alignments from simulated and real data under various combinations of data observations, for example, an amino acid sequence pair ($A_a$, $A_b$), a secondary structure sequence pair ($S_a$, $S_b$), a dihedral angle sequence pair ($X_a$, $X_b$), and combinations thereof.

In the first set of benchmarks (fig 11A) pairs of proteinswere simulated from the ETDBN model conditioned on 38

FIG 7 Cartoon structure representations of E coli glyceraldehyde-3-phosphate dehydrogenase structure (PDB 1gad) are depicted in each paneloverlaid with predictive accuracy when using different combinations of observed data to predict missing dihedral angles in 1gad Thermusaquaticus glyceraldehyde-3-phosphate dehydrogenase (PDB 1cer) was used as a homolog for the purposes of prediction Predictive accuracy isindicated using a color gradient depicting the mean angular distance between the true dihedral angle (Xi

1gad) and the predicted (sampled) dihedralangles (X

i

1gad) at each amino acid position The label at the bottom of each panel indicates the data combination used In (A) no data was used forprediction In (B) only the amino acid sequence corresponding to 1gad (A1gad) was used In (C) the amino acid sequence of 1gad (A1gad) and theamino acid sequence of the homologous protein (A1cer) were used In (D) both amino acids sequences (A1cer and A1gad) and the secondarystructure of the homologous protein (S1cer) were used In (E) both the amino acid sequences (A1cer and A1gad) and the dihedral angles of thehomologous protein (X1cer) were used Finally in panel (F) the same combination of observations was used as in (E) but the alignment was treatedas known a priori

Evolutionary TorusDBN doi101093molbevmsx137 MBE

2095Downloaded from httpsacademicoupcommbearticle-abstract34820853772139by Edward Boyle Library useron 08 November 2017

FIG 8 Benchmarks of predictive accuracy (measured using angular distance lower is better) on a random subset of ten protein pairs in the testdataset giving a representative view of predictive accuracy under six different combinations of observations The dihedral angles Xb of pb weretreated as missing and were sampled under the model whereas pa was a homologous protein used for the purposes of prediction See the legend offigure 7 for a description of each combination (AndashF) The final set of bars denoted lsquoMean (Nfrac14 38)rsquo are the mean values for the entire test dataset ofNfrac14 38 protein pairs The error bars are the standard errors

FIG. 9. Depiction of evolutionary hidden state 3. This hidden state was sampled at 0.69% of sites (the average was 1.56%). The equilibrium frequencies of $r_1$ and $r_2$ were $\pi_1 = 0.691$ and $\pi_2 = 0.309$, respectively. The jump rate was $\gamma = 3.176$. The corresponding site-class pair probabilities are depicted to the right as a function of evolutionary time. Note that the dashed red lines depicting the probabilities for (1, 2) and (2, 1) superimpose exactly because the probabilities are equal; this holds for the jump probabilities of all hidden states, as it is required for time-reversibility. In the main figure, the two rows depict the parameters encoded by the two site-classes, respectively. Columns 1 and 3 depict the parameters governing the amino acid and secondary structure stochastic processes, respectively. The secondary structure classes correspond to H = helix, S = sheet, and C = coil. Column 2 depicts the WN diffusions. The stationary distributions of the WN diffusions are shown using black contour lines, the direction of the drifts are indicated by the arrows, and the magnitude of the drifts at each position is indicated using the color gradient.
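The equality of the (1, 2) and (2, 1) site-class pair probabilities can be illustrated with a small sketch. The functional form below is an assumption for illustration and is not given in this section: with probability exp(-gamma * t) no jump occurs and both proteins share one site-class drawn from the equilibrium frequencies; otherwise the two site-classes are drawn independently. The function name and the parameter values are hypothetical.

```python
import math


def site_class_pair_probs(pi1, gamma, t):
    """Joint probabilities of the site-class pair (r_a, r_b) after time t.

    Assumed jump model (illustrative only): no jump with probability
    exp(-gamma * t), in which case r_a == r_b; otherwise r_a and r_b are
    drawn independently from the equilibrium frequencies (pi1, 1 - pi1).
    """
    pi = (pi1, 1.0 - pi1)
    no_jump = math.exp(-gamma * t)
    probs = {}
    for a in (1, 2):
        for b in (1, 2):
            same = pi[a - 1] if a == b else 0.0
            probs[(a, b)] = no_jump * same + (1.0 - no_jump) * pi[a - 1] * pi[b - 1]
    return probs


if __name__ == "__main__":
    # Illustrative parameter values, not taken from the trained model.
    for t in (0.0, 0.1, 0.5, 2.0):
        p = site_class_pair_probs(0.7, 3.0, t)
        print(t, {pair: round(value, 4) for pair, value in p.items()})
```

Under this assumed form the off-diagonal probabilities are both (1 - exp(-gamma * t)) * pi1 * pi2, so they coincide at every evolutionary time, which is the symmetry that time-reversibility requires.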

Golden et al doi101093molbevmsx137 MBE


different pairwise alignments and corresponding evolutionary times. This resulted in a set of 38 simulated pairwise alignments together with corresponding observations, implying that the true underlying alignments were known for each of the simulated protein pairs. ETDBN and a number of other alignment methods were used to infer pairwise alignments for each. The alignment similarity metric (Schwartz et al. 2005) was used to measure the similarity between the inferred alignments and the true alignments, where higher similarity indicates better predictions. It was found that, when using the simulated amino acid sequences alone, ETDBN (11A.5) outperformed all four other methods tested (11A.1: MUSCLE, 11A.2: MAFFT, 11A.3: StatAlign, and 11A.4: BAli-Phy). However, the greater performance of ETDBN compared with other methods cannot be considered a fair comparison, as the data were simulated under the ETDBN model.

More revealing in figure 11A was the alignment similarity under ETDBN when using different combinations of simulated data observations. It was found that secondary structure alone (11A.6) performed the worst, which is unsurprising given that only three states were available to align the proteins. The second worst in terms of alignment similarity was amino acid sequences alone (11A.5), followed by amino acid sequences and secondary structures (11A.7). Interestingly, using dihedral angles only (11A.8) outperformed both 11A.5 (sequences only) and 11A.6 (secondary structures only). Finally, using amino acid sequences together with dihedral angles (11A.9), or all three data types combined (11A.10), outperformed all other combinations. This illustrates that, at least under simulation conditions, increasing the number of data observations results in better alignment accuracy.

Following that, the various alignment methods were benchmarked against 38 pairwise alignments consisting of real sequence and structure observations in the test dataset. These pairwise alignments were obtained from the HOMSTRAD alignments. The sequence identity of these pairwise alignments ranged from 10% to 93%, with an average sequence identity of 39%. In addition to methods 1–10 in figure 11A, five structural alignment methods were also used: 11: StatAlign (Herman et al. 2014), 12: TMAlign (Zhang and Skolnick 2005), 13: Mammoth (Ortiz et al. 2002), 14: Dali (Holm and Rosenström 2010), and 15: CE (Shindyalov and Bourne 1998). These methods were not used in 11A due to the lack of an appropriate model for simulating the evolution of three-dimensional protein structures.

When benchmarking the MAP estimated alignments against the HOMSTRAD alignments (fig. 11B) using real sequences alone for inference $(A_a, A_b)$, ETDBN (11B.5) had a similar degree of accuracy when compared with several other sequence-based methods (11B.1: StatAlign, 11B.2: BAli-Phy, 11B.3: MUSCLE, and 11B.4: MAFFT). This demonstrates that ETDBN has performance comparable to that of other commonly used sequence alignment methods.

Using $(S_a, S_b)$ alone, ETDBN (11B.6) had substantially lower alignment similarity compared with sequence only, which was expected given that a similar result was obtained for the simulated data (11A.6). However, when including the real sequences (11B.7), the predictive accuracy was once again comparable to sequence-only inferences (11B.1–11B.5).

FIG. 10. Depiction of two histidine-containing phosphocarriers, PDB 1pch and 1poh, superimposed. On the left is a cartoon representation of the two proteins, corresponding to regions F29-K49 and F29-A42, respectively, with posterior jump probabilities at each position overlaid. On the right is a ball-and-stick representation giving atomic detail for a smaller region (I36-G42 and T36-A42, respectively). The exchange between a glutamate (E39 in 1poh) and a glycine (G39 in 1pch) is associated with a large change in dihedral angle, as indicated by the curved arrows.


When using $(X_a, X_b)$ alone (11B.8), the alignment similarity was found to be somewhat worse than the sequence-only cases. Furthermore, when introducing the sequences (11B.9) and secondary structures (11B.10) in addition to the dihedral angles, the similarity remained worse than the sequence-only methods (11B.1–11B.5), despite the additional information. These results are in contrast to the results we obtained for simulated data (11A.8–11A.10).

The non-statistical structural alignment methods (11B.12–11B.15) fared the best, likely because they use criteria similar to those used to construct the HOMSTRAD alignments. When interpreting these results, it is important to note that the HOMSTRAD alignments should not be considered the true underlying alignments and may even be strongly biased. For example, they may favor the closest structural superimposition of structures or the most parsimonious alignments with the fewest indels. In the evolutionary modeling context, our goal is to distinguish between homologous sites (sites that have evolved via mutation alone) and indels. In practice, it is extremely difficult to obtain the true underlying alignment (sets of homologies and indels), because it would require an experiment where every indel event since the common ancestor is observed, a seemingly impossible task outside of simulation or laboratory conditions.

After further investigation, the trend of lower alignment similarity seen in figure 11B.6–11B.10 when using ETDBN with structural observations, compared with sequence-only or non-statistical structural alignment methods, was found to reverse (11C.6–11C.10) upon calculating the precision of predicting homologous sites (the fraction of sites predicted as homologous that were correctly predicted as such). Therefore, when only dihedral angle observations are used, ETDBN underpredicts the number of homologous sites; however, when a homologous site is predicted, it is correctly predicted more often than when using only amino acid sequences. In particular, ETDBN predicted fewer homologous sites with coiled secondary structure compared with homologous sites with helical or sheet secondary structure. This pattern of results may be in part due to the WN diffusion used to model evolution of dihedral angles. The WN diffusion is suitable for modeling angular drift (small changes in angles localized around a region of the Ramachandran plot), but does not sufficiently capture angular shift (large changes in angles between regions of the Ramachandran plot, which are more likely in coiled regions) due to stationarity. As noted before, the jump model is an abstraction intended to capture the end-points of evolution by allowing a jump between two regions of the Ramachandran plot, abstracting a potential intermediate evolutionary trajectory for the sake of computational tractability. Note that the jump model accurately captures the common cases where a single mutation induces a large conformational shift.
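The precision measure described above can be sketched as follows. This is a minimal illustration, assuming an alignment is stored as a list of columns (x, y) of 1-based residue indices with None marking a gap; the representation and function names are hypothetical, and the recall function is an addition (not used in the benchmark above) included only to show underprediction.

```python
def homologous_pairs(alignment):
    """Return the set of (x, y) residue-index pairs aligned as homologous."""
    return {(x, y) for x, y in alignment if x is not None and y is not None}


def precision(predicted, reference):
    """Fraction of predicted homologous pairs that are also in the reference."""
    pred = homologous_pairs(predicted)
    ref = homologous_pairs(reference)
    if not pred:
        return float("nan")  # undefined when nothing is predicted homologous
    return len(pred & ref) / len(pred)


def recall(predicted, reference):
    """Fraction of reference homologous pairs that were recovered."""
    pred = homologous_pairs(predicted)
    ref = homologous_pairs(reference)
    if not ref:
        return float("nan")
    return len(pred & ref) / len(ref)


if __name__ == "__main__":
    reference = [(1, 1), (2, 2), (3, 3), (4, 4)]
    # Residues 2 and 4 are modeled as indels instead of homologous sites.
    predicted = [(1, 1), (2, None), (None, 2), (3, 3), (4, None), (None, 4)]
    print(precision(predicted, reference), recall(predicted, reference))
```

In the toy example every predicted homologous pair is correct, so precision is high, while half of the reference pairs are missed. This is the pattern described for ETDBN with dihedral angle observations.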

Concluding Remarks

The main achievement of this work is a computationally tractable, generative, and interpretable probabilistic model of protein sequence and structure evolution on a local scale.

FIG. 11. Alignment benchmarks. (A) Averages of alignment similarity across different methods and combinations of data observations, where the observations were simulated from the ETDBN model conditioned on a single known alignment. (B) Averages of alignment similarity, where the 38 HOMSTRAD alignments in the test dataset were taken as the true alignments. (C) Averages of precision in predicting homologous site pairs in 38 sequence pairs from the test dataset, where homologous site pairs in the HOMSTRAD alignments were taken to be the true homologous site pairs.


Previous stochastic models of protein sequence and structure evolution emphasized estimation of evolutionary parameters (Challis and Schmidler 2012; Herman et al. 2014). ETDBN is somewhat of a departure from these previous models, but is likewise capable of estimating evolutionary parameters. We show that estimates of evolutionary times inferred under ETDBN are consistent, regardless of whether amino acid sequence or dihedral angle observations are used. In addition, the relationship between evolutionary time and angular distance in real proteins is adequately recapitulated in protein pairs sampled under the model, although the angular distance is underestimated for larger evolutionary times, which might be explained by the limited flexibility of the jump model and the lack of global dependencies in the model.

Like previous models, ETDBN is capable of dealing with alignment uncertainty by marginalizing over indel histories; it predicts pairwise MAP consensus alignments with accuracy similar to that of score-based and statistical alignment methods.

The generative nature of ETDBN allows us to demonstrate that the underlying empirical distributions over dihedral angles (depicted using Ramachandran plots) are captured, and that the model is capable of predicting missing observations, such as dihedral angles, from a variety of different data types: for example, an amino acid sequence, a homologous amino acid sequence, a homologous secondary structure, a homologous set of dihedral angles, or any combination thereof.

Based on its local nature, ETDBN does not constitute a homology modeling method in itself. Rather, it can be used as a building block, much like fragment libraries model local structure in protein structure prediction methods. ETDBN places the homology modeling problem on a statistical footing, enabling a number of approaches to later be used, such as multi-level modeling, that is, combining fine-grained distributions (for example, distributions over dihedral angles, such as ETDBN) and coarse-grained distributions (for example, distributions describing the global properties of proteins, such as compactness). In particular, a method referred to as 'the Reference Ratio method' can be used to combine fine-grained and coarse-grained distributions in a statistically principled manner (see Frellsen et al. 2012; Hamelryck et al. 2010). However, we have shown that the current model can already be used for the inference of evolutionary parameters in its present form.

In addition to multi-level modeling, probabilistic models such as ETDBN allow one to account for and to make statements about uncertainty (e.g., with respect to evolutionary time, alignment, etc.) in a rigorous manner. In principle, ETDBN, like TorusDBN (Boomsma et al. 2008), could be used as a proposal distribution. In other words, ETDBN could be used to sample protein structures (possibly conditioned on various data observations) in a computationally efficient manner, such that the resulting samples are expected to be located in regions of high probability density with respect to the true underlying distribution.

A final key feature of our evolutionary model is its interpretable nature. This interpretability enables the identification of potential evolutionary motifs: common patterns of sequence–structure evolution. We identify one such evolutionary motif in 34 different homologous protein pairs. A major direction for future research is the further identification of such evolutionary motifs. Understanding these evolutionary motifs may 1) improve homology modeling predictions, 2) provide more accurate estimates of evolutionary parameters, and 3) produce better models of protein evolution that more realistically capture evolutionary trajectories through sequence and structure space, which may help identify functionally relevant positions that are potential drug targets.

Future Challenges

Pairwise to Phylogeny

For reasons of computational tractability, the implemented model is pairwise, but it is theoretically possible to generalize it to a phylogeny, such as in Herman et al. (2014). In practice, for three or more sequences on a phylogeny, it is necessary to marginalize out the unobserved ancestral protein states in order to compute likelihoods. Felsenstein's algorithm can be used to marginalize over discrete ancestral states, such as amino acids, in a computationally efficient manner. However, we do not know whether a similarly efficient algorithm exists for marginalizing the continuous ancestral dihedral angle states under the WN diffusion, thereby necessitating a more expensive MCMC algorithm. A possibly greater computational hindrance to considering a phylogeny is the alignment problem, which scales as $O(l_1 l_2 \cdots l_N)$, where $l_i$ is the length of sequence $i$ and $N$ is the number of sequences, although MCMC approaches are possible (Herman et al. 2014).
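For discrete states, the marginalization over ancestral states mentioned above can be sketched with Felsenstein's pruning recursion. The sketch below is illustrative only: the tree encoding (a leaf is an integer state, an internal node is a pair of (child, branch length) tuples) is hypothetical, and an F81-style transition probability is used as a simple closed-form stand-in for the exponential of a general rate matrix.

```python
import math


def transition_prob(i, j, t, pi):
    """P(state j after time t | state i) under an F81-style stand-in model."""
    decay = math.exp(-t)
    return decay * (1.0 if i == j else 0.0) + (1.0 - decay) * pi[j]


def partial_likelihoods(node, pi):
    """Return L[s] = P(observed leaves below node | node is in state s)."""
    n = len(pi)
    if isinstance(node, int):  # leaf: the observed state index
        return [1.0 if s == node else 0.0 for s in range(n)]
    (left, t_left), (right, t_right) = node
    l_left = partial_likelihoods(left, pi)
    l_right = partial_likelihoods(right, pi)
    out = []
    for s in range(n):
        a = sum(transition_prob(s, k, t_left, pi) * l_left[k] for k in range(n))
        b = sum(transition_prob(s, k, t_right, pi) * l_right[k] for k in range(n))
        out.append(a * b)
    return out


def site_likelihood(root, pi):
    """Likelihood of one site, with the root state drawn from pi."""
    return sum(p * l for p, l in zip(pi, partial_likelihoods(root, pi)))


if __name__ == "__main__":
    pi = [0.5, 0.3, 0.2]  # illustrative equilibrium frequencies
    tree = ((0, 0.3), (((1, 0.2), (0, 0.1)), 0.4))  # three leaves
    print(site_likelihood(tree, pi))
```

The recursion visits each node once and sums over states locally, so the cost grows linearly in the number of nodes instead of exponentially in the number of ancestors. No comparable closed-form recursion is claimed here for the continuous dihedral angle states.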

Context-Dependence

Although we believe our model provides a substantial improvement over current stochastic models of sequence and structural evolution, there is still scope for improvement. The WN diffusions used to model dihedral angle evolution adequately capture angular drift (small local changes in dihedral angle), but are less capable of capturing angular shift (large changes in dihedral angle). This is to a considerable extent mitigated by the introduction of jump events, as discussed before. A more realistic model would model the entire evolutionary trajectory, allowing an arbitrary number of switches between site-classes, together with neighboring dependencies amongst adjacent sites along the evolutionary trajectory. Similar context-dependent models are typically computationally expensive and require sophisticated inference procedures (Robinson et al. 2003; Yu and Thorne 2006).

Software Availability

Julia code (tested on both Windows and Linux platforms) is available at http://www.computingforbiology.org/software/etdbn


Supplementary Material

Supplementary materials are available at Molecular Biology and Evolution online.

Acknowledgments

The authors acknowledge funding from the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN). The second author acknowledges support from project MTM2016-76969-P from the Spanish Ministry of Economy and Competitiveness, and ERDF. The authors acknowledge valuable comments from three referees that led to substantial improvements in the manuscript, as well as initial discussions with Christian Havn, Ian Lim, and Mathias Cronjäger.

References

Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics. 22(2):195–201.

Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T. 2008. A generative probabilistic model of local protein structure. Proc Natl Acad Sci U S A. 105(26):8932–8937.

Boomsma W, Tian P, Frellsen J, Ferkinghoff-Borg J, Hamelryck T, Lindorff-Larsen K, Vendruscolo M. 2014. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proc Natl Acad Sci U S A. 111(38):13852–13857.

Challis CJ, Schmidler SC. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol. 29(11):3575–3587.

Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett. 457(4):413–416.

Echave J, Fernández FM. 2010. A perturbative view of protein structural variation. Proteins Struct Funct Bioinform. 78(1):173–180.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.

Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat. 125(1):1–15.

Frellsen J, Mardia KV, Borg M, Ferkinghoff-Borg J, Hamelryck T. 2012. Towards a general probabilistic model of protein structure: the reference ratio method. In: Bayesian methods in structural bioinformatics. New York: Springer. p. 125–134.

García-Portugués E, Sørensen M, Mardia KV, Hamelryck T. 2017. Langevin diffusions on the torus: estimation and applications. arXiv:1705.00296.

Gilks WR, Richardson S, Spiegelhalter D. 1995. Markov chain Monte Carlo in practice. London: CRC Press.

Grishin NV. 1997. Estimation of evolutionary distances from protein spatial structures. J Mol Evol. 45(4):359–369.

Grishin NV. 2001. Fold change in evolution of protein structures. J Struct Biol. 134(2):167–185.

Gutin AM, Badretdinov AY. 1994. Evolution of protein 3D structures as diffusion in multidimensional conformational space. J Mol Evol. 39(2):206–209.

Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 15(7):910–917.

Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. 2010. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE. 5(11):e13714.

Herman JL, Challis CJ, Novák Á, Hein J, Schmidler SC. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol. 31(9):2251–2266.

Holm L, Rosenström P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38(Web Server issue):W545–W549.

Katoh K, et al. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14):3059–3066.

Koshi JM, Goldstein RA. 1998. Models of natural mutations including site heterogeneity. Proteins. 32(3):289–295.

Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21(6):1095–1109.

Liò P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics. 14(8):726–733.

Miklós I, Lunter G, Holmes I. 2004. A long indel model for evolutionary sequence alignment. Mol Biol Evol. 21(3):529–540.

Mizuguchi K, Deane CM, Blundell TL, Overington JP. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7(11):2469–2471.

Ortiz AR, Strauss CE, Olmea O. 2002. Mammoth (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 11(11):2606–2621.

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 20(10):1692–1704.

Rohl CA, Strauss CE, Misura KM, Baker D. 2004. Protein structure prediction using Rosetta. Methods Enzymol. 383:66–93.

Schwartz AS, Myers EW, Pachter L. 2005. Alignment metric accuracy. arXiv preprint q-bio/0510052.

Shindyalov IN, Bourne PE. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9):739–747.

Siepel A, Haussler D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol. 11(2–3):413–428.

Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 34(1):3–16.

Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18(5):691–699.

Yu J, Thorne JL. 2006. Dependence among sites in RNA evolution. Mol Biol Evol. 23(8):1525–1537.

Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33(7):2302–2309.


corresponding to an aligned position adopts an evolutionary hidden state specifying a distribution over three different observation pairs: a pair of amino acid characters, a pair of dihedral angles, and a pair of secondary structure classifications. A transition probability matrix specifies neighboring dependencies between adjacent evolutionary states. For example, transitions along the alignment between hidden states encoding predominantly α-helix evolution would be expected to occur more frequently than transitions between an evolutionary hidden state encoding predominantly α-helix evolution and another encoding predominantly β-sheet evolution.

Ideally, the underlying hidden states would not just vary across the length of the alignment, as captured by the HMM in the current model, but also evolve along the branches of the phylogenetic tree. This remains computationally intractable at present. Allowing the hidden states to evolve along the tree would allow capturing large structural changes, even those induced by a single mutation. For now, we model such events using a jump model (see below).

Partially in order to mitigate this, each hidden state specifies a distribution over a pair of site-classes at each aligned position. This gives rise to the possibility of a 'jump event'. A jump event allows a large change in dihedral angle or secondary structure (e.g., helix to sheet) to occur at a given aligned position, and also introduces a directional coupling between changes in amino acid that are informative of changes in dihedral angle or secondary structure conformation.

Observation Types

The two proteins $p_a$ and $p_b$ in a homologous pair are associated with a pair of observation sequences, $O_a$ and $O_b$ respectively, obtained from experimental data. An $i$th site observation pair $O^i = (O_a^{x(i)}, O_b^{y(i)})$ is associated with every aligned site $i$ in an alignment $M_{ab}$ of $p_a$ and $p_b$, where $M_{ab}^i \in \left\{ \binom{x}{y}, \binom{x}{-}, \binom{-}{y} \right\}$ specifies the homology relationship at position $i$ of the alignment (homologous, deletion with respect to $p_a$, and insertion with respect to $p_a$, respectively). $i$ is taken to run from 1 to $m$, where $m$ is the length of the alignment $M_{ab}$, and $x \in \{1, \ldots, |p_a|\}$ and $y \in \{1, \ldots, |p_b|\}$ specify the indices of the positions in $p_a$ and $p_b$, respectively. $|p_a|$ and $|p_b|$ give the number of sites in $p_a$ and $p_b$, respectively.

Each site observation, $O_a^{x(i)}$ and $O_b^{y(i)}$, contains amino acid and structural information corresponding to the two C$\alpha$ atoms at aligned site $i$ belonging to each of the two proteins. A site observation corresponding to a particular protein at aligned site $i$, $O_a^{x(i)}$, is comprised of three different data types

FIG. 1. Above: dihedral angle representation. A small section of a single protein backbone (three amino acids) with $\phi$ and $\psi$ dihedral angles shown, together with C$\alpha$ atoms, which attach to the amino acid side-chains. Each amino acid side-chain determines the characteristic nature of each amino acid. Every amino acid position corresponds to a hidden node in the HMM below. Note that we only show a single protein, whereas the model considers a pair. Below: depiction of the HMM architecture of ETDBN, where each H along the horizontal axis represents an evolutionary hidden node. The horizontal edges between evolutionary hidden nodes encode neighboring dependencies between aligned sites. The arrows between the evolutionary hidden nodes and site-class pair nodes encode the conditional independence between the observation pair variables: $A_a^{x_i}, A_b^{y_i}$ (amino acid site pair); $X_a^{x_i} = \langle \phi_a^{x_i}, \psi_a^{x_i} \rangle$, $X_b^{y_i} = \langle \phi_b^{y_i}, \psi_b^{y_i} \rangle$ (dihedral angle site pair); and $S_a^{x_i}, S_b^{y_i}$ (secondary structure class site pair). The circles represent continuous variables and the rectangles represent discrete variables.


associated with the C$\alpha$ atom: an amino acid ($A_a^{x(i)}$; discrete, one of twenty canonical amino acids), $\phi$ and $\psi$ dihedral angles ($X_a^{x(i)} = \langle \phi_a^{x(i)}, \psi_a^{x(i)} \rangle$; continuous, bivariate), and a secondary structure classification ($S_a^{x(i)}$; discrete, one of three classes: helix (H), sheet (S), or coil (C)). Therefore, $O_a^{x(i)} = (A_a^{x(i)}, X_a^{x(i)}, S_a^{x(i)})$ and $O_b^{y(i)} = (A_b^{y(i)}, X_b^{y(i)}, S_b^{y(i)})$.
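The observation types and the alignment indexing above can be sketched as a small data structure. This is an illustration only: the class name, field names, and the use of None for gaps and for missing data types are hypothetical choices, not the representation used in the authors' software.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SiteObservation:
    """One site observation; any data type may be missing (None)."""
    amino_acid: Optional[str] = None                # one of 20 canonical amino acids
    dihedral: Optional[Tuple[float, float]] = None  # (phi, psi) in radians
    secondary_structure: Optional[str] = None       # "H", "S" or "C"


# An alignment column (x, y): 1-based positions in p_a and p_b, None = gap.
Column = Tuple[Optional[int], Optional[int]]


def column_type(column: Column) -> str:
    """Classify a column as homologous, deletion or insertion (w.r.t. p_a)."""
    x, y = column
    if x is not None and y is not None:
        return "homologous"
    if x is not None:
        return "deletion"
    if y is not None:
        return "insertion"
    raise ValueError("an alignment column cannot be empty")


def site_observation_pairs(alignment: List[Column],
                           obs_a: List[SiteObservation],
                           obs_b: List[SiteObservation]):
    """Pair up the observations of p_a and p_b along the alignment."""
    pairs = []
    for x, y in alignment:
        o_a = obs_a[x - 1] if x is not None else None
        o_b = obs_b[y - 1] if y is not None else None
        pairs.append((o_a, o_b))
    return pairs


if __name__ == "__main__":
    obs_a = [SiteObservation("A", (-1.0, -0.8), "H"), SiteObservation("G")]
    obs_b = [SiteObservation("A", (-1.1, -0.7), "H")]
    alignment = [(1, 1), (2, None)]
    for column, pair in zip(alignment, site_observation_pairs(alignment, obs_a, obs_b)):
        print(column_type(column), pair)
```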

Model Structure

The sequence of hidden nodes in the HMM is written as $H = (H^1, H^2, \ldots, H^m)$. Each hidden node $H^i$ in the HMM corresponds to a site observation pair $O_a^{x(i)}$ and $O_b^{y(i)}$ at an aligned site $i$ in the alignment $M_{ab}$. Initially, we treat the alignment $M_{ab}$ as given a priori, but later modify the HMM to marginalize out an unobserved alignment.
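With the alignment treated as given, the likelihood of the observations can be obtained by summing over all hidden state sequences with the standard HMM forward recursion. The sketch below is a generic illustration, not the authors' implementation: it assumes that the per-site emission likelihoods p(O^i | H^i = h) have already been computed, and the argument names are hypothetical.

```python
def forward_likelihood(initial, transition, emission):
    """Likelihood of an observation sequence under an HMM.

    initial[h]       : probability that the first hidden node is in state h
    transition[g][h] : probability of moving from state g to state h
    emission[i][h]   : likelihood of the site observation pair at aligned
                       site i given hidden state h (assumed precomputed)
    Assumes at least one aligned site.
    """
    n_states = len(initial)
    alpha = [initial[h] * emission[0][h] for h in range(n_states)]
    for i in range(1, len(emission)):
        alpha = [
            emission[i][h] * sum(alpha[g] * transition[g][h] for g in range(n_states))
            for h in range(n_states)
        ]
    return sum(alpha)


if __name__ == "__main__":
    # Toy example with two hidden states and three aligned sites.
    initial = [0.6, 0.4]
    transition = [[0.9, 0.1], [0.2, 0.8]]
    emission = [[0.5, 0.1], [0.3, 0.7], [0.2, 0.6]]
    print(forward_likelihood(initial, transition, emission))
```

For long alignments the products underflow, so a practical version would rescale the forward vector at each site or work in log space.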

The model is parameterized by $h$ hidden states. Every hidden node $H^i$ corresponding to an aligned site $i$ takes an integer value from 1 to $h$ for the hidden state at node $H^i$. In turn, each hidden state specifies a distribution over a site-class pair $(r_a^i, r_b^i)$ as a function of evolutionary time. A site-class pair consists of two site-classes, $r_a^i$ and $r_b^i$. Each of the two site-classes takes an integer value 1 or 2, that is, $(r_a^i, r_b^i) \in \{(1,1), (1,2), (2,1), (2,2)\}$. We return to the specific role of the site-class pairs in the next section.

The state of $H^i$, together with the site-class pair $(r_a^i, r_b^i)$ and the evolutionary time separating proteins $p_a$ and $p_b$, $t_{ab}$, specifies a distribution over three conditionally independent stochastic processes describing each of the three types of site observation pairs: $A^i = (A_a^{x(i)}, A_b^{y(i)})$, $X^i = (X_a^{x(i)}, X_b^{y(i)})$, and $S^i = (S_a^{x(i)}, S_b^{y(i)})$. This conditional independence structure allows the likelihood of a site observation pair at an aligned site $i$ to be written as follows:

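$$p(O^i \mid H^i, r_a^i, r_b^i, t_{ab}) = \underbrace{p(A^i \mid H^i, r_a^i, r_b^i, t_{ab})}_{\text{amino acid evolution}} \times \underbrace{p(X^i \mid H^i, r_a^i, r_b^i, t_{ab})}_{\text{dihedral angle evolution}} \times \underbrace{p(S^i \mid H^i, r_a^i, r_b^i, t_{ab})}_{\text{secondary structure evolution}} \tag{1}$$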

The assumption of conditional independence provides computational tractability, allowing us to avoid costly marginalization when certain combinations of data are missing (e.g., amino acid sequences present, but secondary structures and dihedral angles missing).
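The practical consequence of the factorization can be shown in a few lines. Because the three factors in equation (1) are conditionally independent, a missing data type integrates to one and its factor can simply be dropped. The sketch below assumes the three log-likelihood terms are computed elsewhere; the function and argument names are hypothetical.

```python
import math


def site_log_likelihood(log_p_amino=None, log_p_dihedral=None, log_p_secondary=None):
    """Log of equation (1) for one aligned site.

    Each argument is the log-likelihood of one observation pair type given
    the hidden state, site-class pair and evolutionary time, or None when
    that data type is missing. Missing terms contribute a factor of one.
    """
    terms = (log_p_amino, log_p_dihedral, log_p_secondary)
    return sum(term for term in terms if term is not None)


if __name__ == "__main__":
    # Amino acids and secondary structure observed, dihedral angles missing.
    print(site_log_likelihood(math.log(0.1), None, math.log(0.5)))
```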

Stochastic Processes Modeling Evolutionary Dependencies

Each site-class couples together three time-reversible stochastic processes that separately describe the evolution of the three pairs of observation types, as in equation (1). Each site-class is intended to capture both physical and evolutionary features pertaining to sequence and structure. Parameters that correspond to a particular site-class are termed site-class specific, whereas parameters that are shared across all site-classes are termed global. The use of site-class specific parameters, such as site-class specific amino acid frequencies and dihedral angle diffusion parameters, as described in the next section, is intended to model site-specific physical–chemical properties (Halpern and Bruno 1998; Koshi and Goldstein 1998; Lartillot and Philippe 2004).

Amino Acid Evolution

As is typical with models of sequence evolution, amino acid evolution, $p(A_a^{x(i)}, A_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$, is described by a Continuous-Time Markov Chain (CTMC). Each amino acid CTMC is parameterized in the following way: the exchangeability of amino acids is described by a $20 \times 20$ symmetric global exchangeability matrix $S$ (190 free parameters; Whelan and Goldman 2001), a site-class specific set of 20 amino acid equilibrium frequencies $\Pi_r^h = \mathrm{diag}\{\pi_1, \pi_2, \ldots, \pi_{20}\}$ (19 free parameters per site-class), and a site-class specific scaling factor $\Lambda_r^h$ (one free parameter per site-class). Together, these parameters define a site-class specific time-reversible amino acid rate matrix $Q_r^h = \Lambda_r^h S \Pi_r^h$. The stationary distribution of $Q_r^h$ is given by the amino acid equilibrium frequencies $\Pi_r^h$.

Secondary Structure Evolution

Secondary structure evolution, $p(S_a^{x(i)}, S_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$, is also described by a CTMC. For the sake of simplicity, we use only three discrete classes to describe secondary structure at each position: helix (H), sheet (S), and random coil (C).

The exchangeability of secondary structure classes at a position is described by a $3 \times 3$ symmetric global exchangeability matrix $V$ and a site-class specific set of three secondary structure equilibrium frequencies $\Omega_r^h = \mathrm{diag}\{\pi_1, \pi_2, \pi_3\}$. Together, they define a site-class specific time-reversible secondary structure rate matrix $R_r^h = V \Omega_r^h$ with stationary distribution $\Omega_r^h$.
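Both rate matrices above have the same form: a symmetric exchangeability matrix multiplied by a diagonal matrix of equilibrium frequencies, optionally scaled. A minimal sketch of this construction for the three secondary structure classes follows. Two points are assumptions not stated in the text: the diagonal is set so that each row sums to zero, and the transition probabilities are obtained with a simple scaling-and-squaring Taylor series. All numerical values are illustrative.

```python
def rate_matrix(exchangeability, frequencies, scale=1.0):
    """Build scale * V * diag(frequencies) with rows summing to zero."""
    n = len(frequencies)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = scale * exchangeability[i][j] * frequencies[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so this is the row sum
    return q


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(rate, t, squarings=10, terms=12):
    """Approximate exp(rate * t) by a Taylor series with scaling and squaring."""
    n = len(rate)
    step = [[rate[i][j] * t / 2 ** squarings for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, step)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


if __name__ == "__main__":
    v = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.5], [2.0, 0.5, 0.0]]  # classes H, S, C
    omega = [0.4, 0.25, 0.35]
    r = rate_matrix(v, omega)
    for row in transition_matrix(r, 0.7):
        print([round(x, 4) for x in row])
```

Symmetry of the exchangeability matrix gives detailed balance, omega[i] * R[i][j] == omega[j] * R[j][i], which is why the equilibrium frequencies are the stationary distribution and the process is time-reversible. The same function applies to the amino acid case with a 20 by 20 matrix and a scaling factor.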

Dihedral Angle Evolution

Central to our model is evolutionary dependence between dihedral angles, $p(X_a^{x(i)}, X_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$. Typically, the continuous-time evolution of continuous-state random variables is modeled by a diffusive process, such as the OU process, as in Challis and Schmidler (2012). However, an OU process is not appropriate for dihedral angles, as they have a natural periodicity. For this reason, a bivariate diffusion that captures the periodic nature of dihedral angles, the Wrapped Normal (WN) diffusion, was specifically developed for this paper in García-Portugués et al. (2017).

Topologically, the WN diffusion (see fig. 2 for a pictorial example) can be thought of as the analogue of the OU process on the torus $\mathbb{T}^2 = [-\pi, \pi) \times [-\pi, \pi)$. The WN diffusion arises as the wrapping on $\mathbb{T}^2$ of the following Euclidean diffusion:

$$dX_t = \underbrace{A \left( \sum_{k \in \mathbb{Z}^2} (\mu - X_t - 2k\pi)\, w_k(X_t) \right)}_{\text{drift coefficient}} dt + \underbrace{\Sigma^{1/2}}_{\text{diffusion coefficient}} dW_t, \tag{2}$$

where $W_t$ is the two-dimensional Wiener process, $A$ is the drift matrix, $\mu \in \mathbb{T}^2$ is the stationary mean, $\Sigma$ is the infinitesimal covariance matrix, and


$$w_k(\theta) = \frac{\phi_{\frac{1}{2}A^{-1}\Sigma}(\theta - \mu + 2k\pi)}{\sum_{m \in \mathbb{Z}^2} \phi_{\frac{1}{2}A^{-1}\Sigma}(\theta - \mu + 2m\pi)}, \quad \theta \in \mathbb{T}^2, \tag{3}$$

is a probability density function (pdf) for $k \in \mathbb{Z}^2$. $\phi_\Sigma$ stands for the pdf of a bivariate Gaussian $N(0, \Sigma)$. The pdf (eq. 3) weights the linear drifts of equation (2) such that they become smooth and periodic.
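The weighting in equation (3) can be made concrete with a short sketch of the drift coefficient of equation (2). This is an illustration under stated assumptions, not the authors' implementation: the infinite sum over winding numbers is truncated to k in {-K, ..., K}^2, the function names are hypothetical, and the parameter values in the example are arbitrary (chosen so that A^{-1} Sigma is symmetric and positive definite).

```python
import math
from itertools import product

TWO_PI = 2.0 * math.pi


def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(m, n):
    return [[sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def gauss2(v, cov):
    """Density of a bivariate N(0, cov) at the point v."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    quad = (d * v[0] * v[0] - (b + c) * v[0] * v[1] + a * v[1] * v[1]) / det
    return math.exp(-0.5 * quad) / (TWO_PI * math.sqrt(det))


def wn_drift(theta, A, mu, Sigma, K=2):
    """Drift of equation (2) at theta, with the sum truncated at |k_j| <= K."""
    stat_cov = [[0.5 * x for x in row] for row in matmul2(inv2(A), Sigma)]
    windings = list(product(range(-K, K + 1), repeat=2))
    dens = [gauss2((theta[0] - mu[0] + TWO_PI * k1,
                    theta[1] - mu[1] + TWO_PI * k2), stat_cov)
            for k1, k2 in windings]
    total = sum(dens)
    pull = [0.0, 0.0]
    for (k1, k2), d in zip(windings, dens):
        w = d / total  # weight w_k(theta) of equation (3)
        pull[0] += w * (mu[0] - theta[0] - TWO_PI * k1)
        pull[1] += w * (mu[1] - theta[1] - TWO_PI * k2)
    return (A[0][0] * pull[0] + A[0][1] * pull[1],
            A[1][0] * pull[0] + A[1][1] * pull[1])


if __name__ == "__main__":
    A = [[1.0, 0.5], [0.5, 1.0]]
    Sigma = [[1.0, 0.0], [0.0, 1.0]]
    mu = (0.5, -1.0)
    print(wn_drift((0.6, -1.2), A, mu, Sigma))
    print(wn_drift((0.6 + TWO_PI, -1.2), A, mu, Sigma, K=3))
```

Near the stationary mean the weight of the k = 0 term dominates and the drift is close to the linear OU drift A(mu - theta); shifting an angle by a full turn leaves the drift essentially unchanged, which is the periodicity the weights are designed to provide.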

It is shown in García-Portugués et al. (2017) that the stationary distribution of the WN diffusion is a $\mathrm{WN}(\mu, \tfrac{1}{2}A^{-1}\Sigma)$, where a $\mathrm{WN}(\mu, \Sigma)$ has pdf

$$p_{\mathrm{WN}}(\theta \mid \mu, \Sigma) = \sum_{k \in \mathbb{Z}^2} \phi_\Sigma(\theta - \mu + 2k\pi). \tag{4}$$

Despite involving an infinite sum over $\mathbb{Z}^2$, taking just the first few terms of this sum provides a tractable and accurate approximation to the stationary density for most realistic parameter values.
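The truncation of the sum in equation (4) can be sketched directly. The function below is an illustration with a hypothetical name and arbitrary example parameters; it keeps the terms with winding numbers in {-K, ..., K}^2.

```python
import math
from itertools import product

TWO_PI = 2.0 * math.pi


def wrapped_normal_pdf(theta, mu, cov, K=2):
    """Truncated version of equation (4) on the two-dimensional torus.

    theta, mu : pairs of angles in radians
    cov       : 2x2 covariance matrix of the unwrapped Gaussian
    K         : terms with |k_1|, |k_2| <= K are kept
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    norm = 1.0 / (TWO_PI * math.sqrt(det))
    total = 0.0
    for k1, k2 in product(range(-K, K + 1), repeat=2):
        v0 = theta[0] - mu[0] + TWO_PI * k1
        v1 = theta[1] - mu[1] + TWO_PI * k2
        quad = (d * v0 * v0 - (b + c) * v0 * v1 + a * v1 * v1) / det
        total += math.exp(-0.5 * quad)
    return norm * total


if __name__ == "__main__":
    cov = [[0.5, 0.1], [0.1, 0.3]]
    mu = (1.0, -2.0)
    for K in (0, 1, 2, 4):
        print(K, wrapped_normal_pdf((3.0, -3.0), mu, cov, K))
```

For moderately concentrated distributions the value stabilizes after the first one or two winding numbers, because the omitted Gaussian terms are centered several standard deviations away.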

Maximum Likelihood Estimation (MLE) for diffusions is based on the transition probability density (tpd), which only has a tractable analytical form for very few specific processes. A highly tractable and accurate approximation to the tpd is given for the WN diffusion. This approximation results from weighting the tpd of the OU process in the same fashion as the linear drifts are weighted in equation (2), yielding the following multimodal pseudo-tpd:

$$\tilde{p}(\theta_2 \mid \theta_1, A, \mu, \Sigma, t) = \sum_{m \in \mathbb{Z}^2} p_{WN}(\theta_2 \mid \mu_t^m, \Gamma_t)\, w_m(\theta_1), \qquad (5)$$

with $\theta_1, \theta_2 \in \mathbb{T}^2$, $\mu_t^m = \mu + e^{-tA}(\theta_1 - \mu + 2\pi m)$, and $\Gamma_t = \int_0^t e^{-sA}\, \Sigma\, e^{-sA^T} ds$. The pseudo-tpd provides a good approximation to the true tpd in key circumstances: 1) $t \to 0$, since it collapses to a Dirac delta; 2) $t \to \infty$, since it converges to the stationary distribution; 3) high concentration, since the WN diffusion becomes an OU process. Furthermore, it is shown in García-Portugués et al. (2017) that the pseudo-tpd has a lower Kullback–Leibler divergence with respect to the true tpd than the Euler and Shoji-Ozaki pseudo-tpds, for most typical scenarios and discretization times in the diffusion trajectory.
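The structure of the pseudo-tpd in equation (5) is easiest to see in a one-dimensional analogue, sketched below. This is an assumption-laden simplification, not the bivariate model of the paper: the drift matrix is replaced by a scalar `alpha`, the covariance by a scalar `sigma2`, so that the stationary variance is `sigma2 / (2 * alpha)` and the integral defining $\Gamma_t$ has the closed form `stat_var * (1 - exp(-2 * alpha * t))`. All function names and `k_max` are hypothetical.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def wn1_pdf(theta, mu, var, k_max=3):
    """Truncated one-dimensional wrapped normal density."""
    return sum(normal_pdf(theta - mu + 2.0 * math.pi * k, var)
               for k in range(-k_max, k_max + 1))

def pseudo_tpd_1d(theta2, theta1, alpha, mu, sigma2, t, k_max=3):
    """One-dimensional analogue of eq. (5); requires alpha > 0 and t > 0."""
    stat_var = sigma2 / (2.0 * alpha)                      # 1D version of (1/2) A^{-1} Sigma
    gamma_t = stat_var * (1.0 - math.exp(-2.0 * alpha * t))  # 1D version of Gamma_t
    windings = range(-k_max, k_max + 1)
    # Unnormalized weights w_m(theta1), as in eq. (3).
    raw = [normal_pdf(theta1 - mu + 2.0 * math.pi * m, stat_var) for m in windings]
    norm = sum(raw)
    total = 0.0
    for m, r in zip(windings, raw):
        # OU mean after time t, started from the m-th winding of theta1.
        mu_tm = mu + math.exp(-alpha * t) * (theta1 - mu + 2.0 * math.pi * m)
        total += wn1_pdf(theta2, mu_tm, gamma_t, k_max) * (r / norm)
    return total
```

In this sketch, for large `t` the value approaches `wn1_pdf(theta2, mu, sigma2 / (2 * alpha))`, mirroring circumstance 2) above, and multiplying by the stationary density of `theta1` gives a quantity that is symmetric in the two angles.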

A further desirable property of the pseudo-tpd is that it obeys the time-reversibility equation, which in terms of $(X_a^{x(i)}, X_b^{y(i)})$ is

$$\tilde{p}(X_b^{y(i)} \mid X_a^{x(i)}, A, \mu, \Sigma, t_{ab})\, p_{WN}(X_a^{x(i)} \mid \mu, \tfrac{1}{2}A^{-1}\Sigma) = \tilde{p}(X_a^{x(i)} \mid X_b^{y(i)}, A, \mu, \Sigma, t_{ab})\, p_{WN}(X_b^{y(i)} \mid \mu, \tfrac{1}{2}A^{-1}\Sigma).$$

Indeed, the WN diffusion is the unique time-reversible diffusion with constant diffusion coefficient and stationary pdf (eq. 4), in the same way the OU is with respect to a Gaussian. Time-reversibility is an assumption of the overall model, and of many other models of sequence evolution. A benefit of time-reversibility in a pairwise model such as ETDBN is that one of the proteins in a pair may be arbitrarily chosen as the ancestor, thus avoiding a computationally expensive marginalization of an unobserved ancestor.

The likelihood of a dihedral angle observation pair $(X_a^{x(i)}, X_b^{y(i)})$, assuming that $X_a^{x(i)}$ is drawn from the stationary distribution, is given by

$$\begin{aligned} p(X_a^{x(i)}, X_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab}) &= p(X_a^{x(i)}, X_b^{y(i)} \mid A, \mu, \Sigma, t_{ab}) \\ &\approx \tilde{p}(X_b^{y(i)} \mid X_a^{x(i)}, A, \mu, \Sigma, t_{ab})\, p_{WN}(X_a^{x(i)} \mid \mu, \tfrac{1}{2}A^{-1}\Sigma). \end{aligned} \qquad (6)$$

$A$ and $\Sigma$ are constrained to yield a covariance matrix $A^{-1}\Sigma$. A parameterization that achieves this is $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2)$ and

$$A = \begin{pmatrix} \alpha_1 & \frac{\sigma_1}{\sigma_2}\alpha_3 \\ \frac{\sigma_2}{\sigma_1}\alpha_3 & \alpha_2 \end{pmatrix}, \quad \alpha_1\alpha_2 > \alpha_3^2.$$

$\alpha_1$ and $\alpha_2$ are the drift components for the $\phi$ and $\psi$ dihedral angles, respectively. Dependence (correlation) between the dihedral angles is captured by $\alpha_3$. A depiction of a WN diffusion with given drift and diffusion parameters is shown in figure 2.
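The effect of this parameterization can be checked numerically. The sketch below builds the drift matrix from the five scalars and computes the stationary covariance $\frac{1}{2}A^{-1}\Sigma$; the function names are hypothetical, and the additional positivity check on `alpha1` and `alpha2` is an assumption added here so that the resulting matrix is positive definite. The example values are those quoted in the caption of figure 2.

```python
def build_drift_matrix(alpha1, alpha2, alpha3, sigma1, sigma2):
    """Drift matrix A for Sigma = diag(sigma1^2, sigma2^2)."""
    # alpha1 * alpha2 > alpha3^2 is the constraint stated in the text;
    # alpha1 > 0 and alpha2 > 0 are assumed here for positive definiteness.
    if not (alpha1 > 0 and alpha2 > 0 and alpha1 * alpha2 > alpha3 ** 2):
        raise ValueError("need alpha1, alpha2 > 0 and alpha1 * alpha2 > alpha3^2")
    return [[alpha1, sigma1 / sigma2 * alpha3],
            [sigma2 / sigma1 * alpha3, alpha2]]

def stationary_covariance(A, sigma1, sigma2):
    """Compute (1/2) * A^{-1} * Sigma using the closed-form 2x2 inverse."""
    (a, b), (c, d) = A
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diag = [sigma1 ** 2, sigma2 ** 2]
    return [[0.5 * inv[i][j] * diag[j] for j in range(2)] for i in range(2)]

# Values from the figure 2 caption: A = (1, 0.5; 0.5, 0.5), Sigma = 1.5^2 * I.
A_fig2 = build_drift_matrix(1.0, 0.5, 0.5, 1.5, 1.5)
cov_fig2 = stationary_covariance(A_fig2, 1.5, 1.5)
```

The off-diagonal entries of the result coincide for any admissible inputs, which is the symmetry that the $\sigma_1/\sigma_2$ scaling of $\alpha_3$ is designed to guarantee.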

Site-Classes: Constant Evolution and Jump Events

We now turn to the meaning of the site-class pairs. Two modes of evolution are modeled: constant evolution and jump events. Constant evolution occurs when the site-class starting in protein $p_a$ at aligned site $i$, $r_a^i$, is the same as the site-class ending in protein $p_b$ at aligned site $i$, $r_b^i$; that is, $r_a^i = r_b^i$. Thus, the distribution over observation pairs at a site is specified by a single site-class.

As already stated, a site-class specifies the parameters of the three conditionally independent stochastic processes describing amino acid, dihedral angle, and secondary structure evolution. A limitation of “constant evolution” is that the coupling between the three stochastic processes is somewhat weak. This in part stems from the time-reversibility of the stochastic processes: swapping the order of one of the three observation pairs at a homologous site, e.g., (Glycine, Proline) instead of (Proline, Glycine), does not alter the likelihood in equation (1). Alternatively, restated from a generative perspective, a ‘directional coupling’ of an amino acid interchange does not inform the direction of change in dihedral angle or secondary structure. For example, replacing a glycine in an α-helix in one protein with a proline at the homologous position in a second protein would be expected to break the α-helix in the second protein, and to strongly inform the plausible dihedral angle conformations in the second protein.

Ideally, we would consider a model in which the underlying site-classes were not fixed over the evolutionary trajectory separating the two proteins, as in the case of constant evolution described above, but instead were able to ‘evolve’ in time. This would allow occasional switches in the underlying site-class at a particular homologous site, which would create a stronger dependency between amino acid, dihedral angle, and secondary structure evolution that furthermore captures the directional coupling we desire. Such an approach is considered computationally intractable, due to the introduction of context-dependence when having to consider neighboring dependencies amongst evolutionary trajectories at adjacent sites.

In order to approximate this ‘ideal’ model in a computationally efficient manner, we introduce the notion of a jump event. A jump event occurs when $r_a^i \neq r_b^i$. Whereas constant evolution is intended to capture angular drift (changes in dihedral angles localized to a region of the Ramachandran plot), a jump event is intended to create a directional coupling between amino acid and structure evolution, and is also expected to capture angular shift (large changes in dihedral angles, possibly between distant regions of the Ramachandran plot).

The hidden state at node $H^i$, together with the evolutionary time $t_{ab}$ separating proteins $p_a$ and $p_b$, specifies a joint distribution over a site-class pair:

$$p(r_a^i, r_b^i \mid H^i, t_{ab}) = p(r_a^i \mid H^i, r_b^i, t_{ab})\, p(r_b^i \mid H^i), \qquad (7)$$

where

$$p(r_a^i \mid H^i, r_b^i, t_{ab}) = \begin{cases} e^{-\gamma_{H^i} t_{ab}} + \pi_{H^i, r_a^i}\,(1 - e^{-\gamma_{H^i} t_{ab}}), & \text{if } r_a^i = r_b^i, \\ \pi_{H^i, r_a^i}\,(1 - e^{-\gamma_{H^i} t_{ab}}), & \text{if } r_a^i \neq r_b^i, \end{cases}$$

and $p(r_a^i \mid H^i) = \pi_{H^i, r_a^i}$ and $p(r_b^i \mid H^i) = \pi_{H^i, r_b^i}$. $\pi_{H^i, r_a^i}$ and $\pi_{H^i, r_b^i}$ are model parameters specifying the probability of starting in site-class $r_a^i$ or $r_b^i$, respectively, corresponding to the hidden state specified by node $H^i$. $\gamma_{H^i} > 0$ is a model parameter giving the jump rate corresponding to the hidden state given by node $H^i$.

The site-class jump probabilities have been chosen so that time-reversibility holds; in other words,

$$p(r_a^i \mid H^i, r_b^i, t_{ab})\, p(r_b^i \mid H^i) = p(r_b^i \mid H^i, r_a^i, t_{ab})\, p(r_a^i \mid H^i).$$
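As a quick numerical illustration of the site-class pair distribution in equation (7) and of the reversibility identity, consider the sketch below for a single hidden state with two site-classes. The function names and the parameter values (`pi`, `gamma`, `t_ab`) are illustrative assumptions rather than fitted values. The conditional probability is written with the stationary probability of the destination class `r_a`, which is the form under which the four pair probabilities sum to one and the identity above holds.

```python
import math

def conditional_site_class_prob(r_a, r_b, pi, gamma, t_ab):
    """p(r_a | H, r_b, t_ab) for one hidden state H.

    pi maps each site-class to its stationary probability; gamma is the jump rate.
    """
    stay = math.exp(-gamma * t_ab)       # probability that no jump event occurred
    prob = pi[r_a] * (1.0 - stay)        # a jump occurred and landed in r_a
    if r_a == r_b:
        prob += stay                     # no jump: the class is unchanged
    return prob

def joint_site_class_prob(r_a, r_b, pi, gamma, t_ab):
    """Eq. (7): p(r_a, r_b | H, t_ab) = p(r_a | H, r_b, t_ab) * p(r_b | H)."""
    return conditional_site_class_prob(r_a, r_b, pi, gamma, t_ab) * pi[r_b]

# Hypothetical parameters for a single hidden state.
pi_example = {1: 0.7, 2: 0.3}
pairs = {(r_a, r_b): joint_site_class_prob(r_a, r_b, pi_example, gamma=2.0, t_ab=0.4)
         for r_a in pi_example for r_b in pi_example}
```

In this sketch the entries for (1, 2) and (2, 1) are equal for any evolutionary time, and both vanish as `t_ab` approaches zero, so that only constant evolution remains.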

The hidden state at node $H^i$, together with a site-class pair $(r_a^i, r_b^i)$ and the evolutionary time $t_{ab}$, specifies the joint likelihood over site observation pairs:

$$p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab}) = \begin{cases} p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_c^i, t_{ab}), & \text{if } r_a^i = r_b^i = r_c^i, \\ p(O_a^{x(i)} \mid H^i, r_a^i)\, p(O_b^{y(i)} \mid H^i, r_b^i), & \text{if } r_a^i \neq r_b^i. \end{cases} \qquad (8)$$

In the case of constant evolution, evolution at aligned site $i$ is described in terms of the same site-class, $r_c^i$. Evolution is considered constant because each observation type is drawn from a single stochastic process specified by $H^i$ and $r_c^i$. Note that the strength of the evolutionary dependency within an observation pair is a function of the evolutionary time $t_{ab}$.

In the case of a jump event, the evolutionary processes are, after the evolutionary jump, restarted independently in the stationary distribution of the new site-class. Thus, the site observations $O_a^{x(i)}$ and $O_b^{y(i)}$ are assumed to be drawn from the stationary distributions of two separate stochastic processes corresponding to site-classes $r_a^i$ and $r_b^i$, respectively.

This implies that, conditional on a jump, the likelihood of the observations is no longer dependent on $t_{ab}$. A jump event is an abstraction that captures the end-points of the evolutionary process, but ignores the potential evolutionary trajectory linking the two site observations. The advantage of abstracting the evolutionary trajectory is that there is no need to perform a computationally expensive marginalization over all possible trajectories, as might be necessary in a model where the hidden states evolve along a tree. The likelihood of an observation pair is now simply a sum over the four possible site-class pairs:

$$p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, t_{ab}) = \sum_{(r_a^i, r_b^i) \in \mathcal{R}} p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})\, p(r_a^i, r_b^i \mid H^i, t_{ab}),$$

where $\mathcal{R} = \{(1, 1), (1, 2), (2, 1), (2, 2)\}$ is the set of four site-class pairs, $p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$ is given by equation (8), and $p(r_a^i, r_b^i \mid H^i, t_{ab})$ is given by equation (7).
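The sum over the four site-class pairs can be sketched as follows. This is an illustrative toy, not the authors' implementation: the emission model is a made-up binary observation with a simple reversible substitution process per site-class (`FREQ`, `toy_stationary`, and `toy_joint` are all hypothetical), standing in for the amino acid, dihedral angle, and secondary structure processes.

```python
import math

def site_class_pair_prob(r_a, r_b, pi, gamma, t_ab):
    """Eq. (7): joint probability of the site-class pair (r_a, r_b)."""
    stay = math.exp(-gamma * t_ab)
    cond = pi[r_a] * (1.0 - stay) + (stay if r_a == r_b else 0.0)
    return cond * pi[r_b]

def pair_likelihood(obs_a, obs_b, pi, gamma, t_ab, joint_lik, stationary_lik):
    """Sum of eq. (8) weighted by eq. (7) over all site-class pairs."""
    total = 0.0
    for r_a in pi:
        for r_b in pi:
            if r_a == r_b:
                # Constant evolution: one process links both observations.
                lik = joint_lik(obs_a, obs_b, r_a, t_ab)
            else:
                # Jump event: independent draws from two stationary distributions.
                lik = stationary_lik(obs_a, r_a) * stationary_lik(obs_b, r_b)
            total += lik * site_class_pair_prob(r_a, r_b, pi, gamma, t_ab)
    return total

# Hypothetical toy emission model: one binary observation per site.
FREQ = {1: {"x": 0.9, "y": 0.1}, 2: {"x": 0.2, "y": 0.8}}

def toy_stationary(obs, r):
    return FREQ[r][obs]

def toy_joint(obs_a, obs_b, r, t_ab, rate=1.0):
    stay = math.exp(-rate * t_ab)
    move = FREQ[r][obs_b] * (1.0 - stay) + (stay if obs_a == obs_b else 0.0)
    return FREQ[r][obs_a] * move

example = pair_likelihood("x", "y", {1: 0.7, 2: 0.3}, 2.0, 0.4, toy_joint, toy_stationary)
```

Because the jump branch does not depend on `t_ab`, the evolutionary time enters the jump contribution only through the pair probabilities of equation (7).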

Identification of Evolutionary Motifs Encoding Jump Events

In order to identify aligned sites having potential evolutionary motifs encoding jump events, a specific criterion was developed.

For a particular protein pair, inference was performed under the model, conditioned on the amino acid sequence and dihedral angles for both proteins $(A_a, A_b, X_a, X_b)$. Homologous sites corresponding to a single hidden state and with evidence of a jump event ($r_a^i \neq r_b^i$) at posterior probability > 0.90 were identified; that is, the $i$'s such that $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_a, X_b) > 0.90$.

FIG. 2. Drift vector field for the WN diffusion with $A = (1, 0.5; 0.5, 0.5)$, $\mu = (0, 0)$, and $\Sigma = (1.5)^2 I$. The color gradient represents the Euclidean norm of the drift. The contour lines represent the stationary distribution. An example trajectory starting at $x_0 = (0, 0)$ and ending at $x_2$, running in the time interval [0, 2], is depicted using a white to red color gradient indicating the progression of time. The periodic nature of the diffusion can be seen by the wrapping of both the stationary diffusion and the trajectory at the boundaries of the square plane. The fact that the stationary distribution is not aligned with the horizontal and vertical axes illustrates the dependence (given by $\alpha_3$) between the $\phi$ and $\psi$ dihedral angles.

In a second filtering step, amino acid sequences and a single set of dihedral angles corresponding to one of the proteins were used ($A_a, A_b, X_a$ or $A_a, A_b, X_b$) to infer the posterior probability, this time at a lower threshold: $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_a) > 0.50$ or $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_b) > 0.50$. This second criterion ensured that the evolutionary motif was identifiable under typical conditions where one has limited access to structural information (in this case, a single protein structure in a pair). Only those aligned sites meeting both criteria were selected for downstream analysis.
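The two-step filter can be expressed compactly. The sketch below assumes that the three sets of posterior jump probabilities for a given hidden state have already been computed per aligned site; the function name, argument names, and the example numbers are hypothetical, while the 0.90 and 0.50 thresholds are those stated in the text.

```python
def select_motif_sites(post_both, post_a_only, post_b_only,
                       strict=0.90, relaxed=0.50):
    """Return indices of aligned sites meeting both criteria.

    post_both[i]   : posterior given both sequences and both sets of angles
    post_a_only[i] : posterior given both sequences and the angles of p_a only
    post_b_only[i] : posterior given both sequences and the angles of p_b only
    """
    selected = []
    for i, p_both in enumerate(post_both):
        meets_first = p_both > strict
        meets_second = post_a_only[i] > relaxed or post_b_only[i] > relaxed
        if meets_first and meets_second:
            selected.append(i)
    return selected

# Made-up posterior probabilities for five aligned sites.
sites = select_motif_sites(
    post_both=[0.95, 0.99, 0.40, 0.91, 0.97],
    post_a_only=[0.60, 0.10, 0.80, 0.30, 0.45],
    post_b_only=[0.20, 0.30, 0.90, 0.55, 0.49],
)
```

With these made-up inputs, sites 0 and 3 pass both steps; site 1 and site 4 pass only the first criterion, and site 2 fails it.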

Statistical Alignment: Modeling Insertions and Deletions

Protein sequences can undergo not only amino acid substitutions due to underlying nucleotide mutations in the coding sequence, but also indel events. To account for this, a modified pairwise TKF92 alignment HMM based on Miklós et al. (2004) was implemented. The TKF92 alignment HMM was augmented with the ETDBN evolutionary hidden states in order to capture local sequence and structure evolutionary dependencies. Furthermore, it was modified such that neighboring dependencies amongst hidden states at adjacent alignment sites were modeled. For more details, see ‘Statistical alignment’ in Supplementary Material online.

Whilst it is possible to fix the alignment in advance by pre-aligning the sequences using one of the many available alignment methods (Katoh et al. 2002; Edgar 2004), or using a curated alignment (such as from the HOMSTRAD database), doing so ignores alignment uncertainty.

Training and Test Datasets

A training dataset of 1200 protein pairs (2400 proteins, 417870 site observation pairs) and a test dataset of 38 protein pairs (76 proteins, 14125 site observation pairs) were assembled from 1032 protein families in the HOMSTRAD database. For further details, see ‘Construction of test and training datasets’ in the Supplementary Material online.

Model Training and Selection

Maximum Likelihood Estimation of the model parameters $\Psi$ was done using Stochastic Expectation Maximization (StEM; Gilks et al. 1995).

For further details of the E- and M-steps of the StEM algorithm, and for details about model selection, please refer to ‘Model training and selection’ in the Supplementary Material online.

Results and Discussion

Selecting the Number of Hidden States

Fourteen models were trained (8, 16, 32, 48, 52, 56, 60, 64, 68, 72, 76, 80, 96, and 112 hidden state models). The 64 hidden state model was chosen as the best model, as it had the lowest Bayesian Information Criterion (BIC; fig. 3).
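Selection by BIC amounts to penalizing each model's maximized log-likelihood by its number of free parameters and keeping the minimum. The sketch below uses the standard definition BIC = k ln(n) - 2 ln(L); the log-likelihoods, parameter counts, and number of observations are made-up placeholders, since the actual values are only reported graphically in figure 3, and the function names are hypothetical.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L); lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def select_by_bic(candidates, n_obs):
    """candidates maps a model label to (log_likelihood, n_params)."""
    return min(candidates, key=lambda label: bic(*candidates[label], n_obs))

# Placeholder numbers for illustration only (not results from the paper).
toy_candidates = {
    8: (-5000.0, 50),
    16: (-4800.0, 100),
    32: (-4750.0, 200),
}
best = select_by_bic(toy_candidates, n_obs=1000)
```

In this toy example the 16-state model is selected: moving to 32 states improves the log-likelihood too little to offset the doubled parameter count.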

Stationary Distributions over Dihedral Angles Capture the Empirical Distribution

Figure 4 illustrates the sampled and empirical dihedral angle distributions. There is a good correspondence between dihedral angles sampled under the model (fig. 4, left) and the empirical distribution of dihedral angles in our training dataset (fig. 4, right) for all three cases illustrated (all amino acids, glycine only, and proline only). The correspondence is not surprising, given that ETDBN is effectively a mixture model with a large number of mixture components.

Estimates of Evolutionary Time from Dihedral Angles Are Consistent with Estimates from Sequence

Whilst ETDBN has a general scope with respect to applications (including acting as a proposal distribution or as a building block in a homology modeling application), we envision the primary application being inference of evolutionary parameters.

Figure 5 compares evolutionary times estimated using only pairs of homologous amino acid sequences versus only pairs of homologous dihedral angles. As desired, the two estimates of evolutionary time for each protein pair are similar, as can be seen by the proximity of the points to the identity line.

A paired t-test gave a p-value of 0.578, thus failing to reject the null hypothesis that there is no difference between branch lengths estimated using sequence only versus angles only. This indicates that there is sufficient evolutionary information in the dihedral angles to estimate the evolutionary times, and that the model is consistent in its estimates, lacking a significant tendency to underestimate or overestimate the evolutionary times when either sequence or dihedral angles are used.

Interestingly, the variance in the sampled evolutionary times is higher when dihedral angles only are used, as compared with sequence only (see fig. S1, Supplementary Material online).

The Relationship between Evolutionary Time and Angular Distance Is Adequately Modeled

We investigated the relationship between evolutionary time and angular distance between real protein pairs, and protein pairs where the dihedral angles of $p_b$ ($X_b$) were treated as missing and hence sampled (fig. 6).

As expected, for both real and sampled pairs, angular distance tends to increase as a function of evolutionary time. For larger evolutionary times a plateau begins to emerge, which is expected, as the maximum possible theoretical angular distance is $\sqrt{8} \approx 2.828$.

When the evolutionary time is exactly zero ($t_{ab} = 0$), under our model the angular distance between sampled dihedral angles is exactly zero (not shown in fig. 6). However, this is not expected to be the case for real protein pairs when the two sequences are identical (due to the inherently flexible nature of proteins, different experimental conditions, experimental noise, etc.). It is therefore not surprising that the regression curve for the real protein pairs does not pass through zero.
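The exact definition of the angular distance is given in the supplementary material and is not reproduced here. As a hedged illustration, the sketch below assumes a chordal distance on the torus, $\sqrt{(2 - 2\cos\Delta\phi) + (2 - 2\cos\Delta\psi)}$; this particular formula is an assumption chosen because it is zero for identical angle pairs and attains the stated maximum of $\sqrt{8}$ when both angles differ by $\pi$. The function name is hypothetical.

```python
import math

def angular_distance(angles_1, angles_2):
    """Assumed chordal distance between two (phi, psi) pairs, in radians."""
    phi1, psi1 = angles_1
    phi2, psi2 = angles_2
    return math.sqrt((2.0 - 2.0 * math.cos(phi1 - phi2))
                     + (2.0 - 2.0 * math.cos(psi1 - psi2)))

# Antipodal angle pairs give the largest distance under this definition.
largest = angular_distance((0.0, 0.0), (math.pi, math.pi))
# Identical angle pairs give zero, matching the t_ab = 0 case of the model.
smallest = angular_distance((1.0, -2.0), (1.0, -2.0))
```

A distance of this form is unaffected by adding multiples of $2\pi$ to either angle, so it respects the periodicity of dihedral angles.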


For small evolutionary times (< 0.2), the curves for the real and sampled protein pairs show a good correspondence; however, for larger evolutionary times, the model tends to underestimate angular distances. This may reflect the fact that the tpd of the WN diffusion specified is localized around its mean even when the evolutionary time is large; therefore, dihedral angles distant from this mean are unlikely to be sampled. To a certain extent this is mitigated by the jump model, which occasionally allows for large changes in dihedral angle, but may still be somewhat limited in its flexibility, as jumps can only occur between two site-classes. The majority of protein pairs in our training dataset represent smaller evolutionary times (81.7% of evolutionary times are smaller than 0.4), and therefore protein pairs with larger evolutionary times and their associated jumps are underrepresented in our dataset, which may also explain the underestimation.

An additional possibility is that ETDBN does not attempt to model global dependencies. Echave and Fernández (2010) use an LFENM model (which does take into account global dependencies) and provide evidence showing that the majority of structural changes are due to collective global deformations rather than local deformations. A local model such as ETDBN, by definition, does not take into account global dependencies, and therefore does not fully account for their contribution to structural divergence.

Evaluation of the Model

The conditional independence structure in (eq. 1) enables computationally efficient sampling from the model under different combinations of observed or missing data. For example, ETDBN can be used to sample (i.e., predict) the dihedral angles of a protein from its corresponding amino acid sequence, a homologous amino acid sequence, a homologous set of dihedral angles, the corresponding secondary structure, a homologous secondary structure, or any combination of them.

Predictive accuracy was measured using 38 homologous protein pairs in the test dataset. For every protein pair ($p_a$, $p_b$), the dihedral angles of $p_b$ were treated as missing, and these missing dihedral angles were sampled under the model given a particular combination of observation types. The average angular distance between the sampled and known dihedral angles was used as the measure of predictive accuracy.

Figure 7 gives an example of predictive accuracy under different combinations of observation types, overlaid on a cartoon representation of the protein structure being predicted, whereas figure 8 provides a representative view of predictive accuracy across 10 different protein pairs in the test dataset for different combinations of observation types. We highlight some of the key patterns identified in figures 7 and 8 as follows.

Combination 1 refers to random sampling from the model, implying no data observations were conditioned on besides the respective lengths of proteins $p_a$ and $p_b$. The average angular distance between the true and predicted dihedral angles was 1.6. Random sampling acts as a baseline for predictive accuracy. It is apparent from figure 7 that the model has a propensity to predict right-handed α-helices, which is the most populated region in the Ramachandran plot.

Under combination 2, only the amino acid sequence corresponding to $p_b$ is observed. As expected, in figures 7 and 8 there is an increase in predictive accuracy with the addition of the amino acid sequence, relative to combination 1.

Under combination 3, we add in the amino acid sequence of a homologous protein ($p_a$). In all ten cases there is an improvement in predictive accuracy. The improvement in predictive accuracy is reasonable, as knowledge of the sequence evolutionary trajectory is expected to encode information about structure evolution, and hence will inform the dihedral angle conformational possibilities.

Under combination 4, in addition to the two amino acid sequences, we treat the homologous secondary structure as observed. This results in a substantial improvement in predictive accuracy, as one would expect. Knowledge of the amino acid sequence and a homologous secondary structure strongly informs regions of the Ramachandran plot that are likely to be occupied.

Under combination 5 (which we consider the canonical combination, i.e., the standard homology modeling scenario), we treat both amino acid sequences as observed, as well as the dihedral angles of the homologous protein ($p_a$); in all cases the predictive accuracy improves over combination 4. This is anticipated, as the homologous dihedral angles are expected to be the best proxy for missing dihedral angles, and are therefore expected to be more informative than secondary structure alone. Note that the availability of a homologous amino acid sequence pair here and in combination 4 is consequential, as it informs the evolutionary time parameter $t_{ab}$, which will typically constrain the distribution over dihedral angles and reduce the associated uncertainty.

Finally, in combination 6, the same data observations as in combination 5 are used, except the alignment is treated as given a priori (by the HOMSTRAD alignment), rather than as unobserved. The HOMSTRAD alignment is based on a structural and sequence alignment of $p_a$ and $p_b$, and therefore is expected to encode a higher degree of homology and structural information than combination 5 (where the alignment is treated as unobserved and therefore a marginalization over alignments is performed). On average, there is a slight improvement in predictive accuracy when fixing the alignment, albeit the magnitude of improvement is not substantial. This demonstrates the accuracy of the alignment HMM.

FIG. 3. BIC scores (points) and number of free parameters (curve) as a function of the number of hidden states across 14 models (indicated above the dotted vertical lines) trained using the 1200 protein pairs in the training dataset. A 64 hidden state model had the lowest BIC score. Each model represents the best of several attempts.

FIG. 4. Ramachandran plots depicting sampled and empirical dihedral angle distributions. The top row depicts the distributions for all amino acids, the middle for glycine only, and the bottom for proline only. The leftmost plots show dihedral angles sampled under the jump model, whereas the rightmost plots show the empirical distributions of dihedral angles in the training dataset.


The alignment HMM accounts for alignment uncertainty in a principled manner, which is particularly useful when an appropriate alignment is unavailable. However, it should be noted that inference scales as $O(|p_a||p_b|h^2)$ when treating the alignment as unobserved. Inference scales as $O(mh^2)$ when the alignment is fixed a priori, where $m$ is the length of alignment $M_{ab}$ and is typically much smaller than $|p_a||p_b|$.
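A back-of-the-envelope comparison shows how large the gap between the two regimes can be. The sketch below simply evaluates the two leading terms; it assumes that $h$ denotes the number of hidden states (64 in the selected model), and the protein and alignment lengths are hypothetical. Constant factors are ignored, so the numbers are only indicative.

```python
def unaligned_cost(len_a, len_b, h):
    """Leading term when the alignment is marginalized: |p_a| * |p_b| * h^2."""
    return len_a * len_b * h ** 2

def aligned_cost(m, h):
    """Leading term when the alignment is fixed a priori: m * h^2."""
    return m * h ** 2

# Hypothetical sizes: two proteins of 300 residues, an alignment of 320 columns.
marginalized = unaligned_cost(300, 300, 64)
fixed = aligned_cost(320, 64)
ratio = marginalized / fixed
```

For these made-up sizes the marginalized case involves roughly 280 times more elementary operations than the fixed-alignment case, and the ratio grows linearly with protein length.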

It should be emphasized that we do not expect ETDBN to compete with structure prediction packages such as Rosetta (Rohl et al. 2004), or homology modeling software such as Arnold et al. (2006), in terms of predictive accuracy. Our current model is a local model of structure evolution; it is not even expected to capture fundamental constraints such as the radius of gyration of a protein, or other global features typical of proteins.

Evolutionary Hidden States Reveal a Common Evolutionary Motif

One benefit of ETDBN is that the 64 evolutionary hidden states learned during the training phase are interpretable. We give an example of a hidden state encoding a jump event that was subsequently found to represent an evolutionary motif present in a large number of protein pairs in our test and training datasets.

Evolutionary hidden state 3 (fig. 9) was selected from the 64 hidden states as an example of a hidden state encoding a jump event and capturing angular shift (a large change in dihedral angle). A notable feature of this hidden state is that the change in dihedral angles between site-classes $r_1$ and $r_2$ is associated with specific amino acid changes. In site-class $r_1$, the amino acid frequencies are relatively spread out amongst a number of amino acids, whereas in site-class $r_2$, the frequencies are particularly concentrated in favor of glycine (Gly) and asparagine (Asn), with glycine being significantly more probable in site-class $r_2$ than $r_1$. This suggests that, conditioned on hidden state 3, an exchange between a glycine and another amino acid is likely indicative of a jump, and hence a corresponding change in dihedral angle. This is consistent with what we find in a subsequent analysis of evolutionary motifs. This particular jump occurs in coil regions.

Having selected hidden state 3, positions in 238 protein pairs were analyzed for evidence of the corresponding evolutionary motif: 38 protein pairs in the test dataset and a further 200 from the training dataset were analyzed using the criteria described in the Methods section. Using the first criterion, 84 protein sites in 59 protein pairs corresponding to $H^i = 3$ (evolutionary hidden state 3) were identified. Of the 84 protein sites, 34 protein sites met the second criterion.

We give an example of a homologous protein pair illustrating the identified evolutionary motif. Two histidine-containing phosphocarriers, 1pch (Mycoplasma capricolum) and 1poh (Escherichia coli), were identified as having the evolutionary motif (fig. 10) at homologous site E39/G39.

FIG. 5. Scatterplot comparing evolutionary times estimated using pairs of homologous amino acid sequences only versus pairs of homologous sets of dihedral angles only, for N = 38 protein pairs in the test dataset. The x-coordinate of each point gives the estimated evolutionary time based only on the amino acid sequence, whereas the y-coordinate gives the estimated evolutionary time based only on the dihedral angles. The diagonal line represents y = x.

FIG. 6. Evolutionary time versus angular distances between real and corresponding sampled protein pairs in the training dataset at 50 representative evolutionary times. Mean angular distances (see ‘Calculation of angular distances’ in the Supplementary Material online) between the real dihedral angles in a protein pair (red; $X_a$ and $X_b$) and sampled dihedral angles in a sampled protein pair (blue; $X_a$ and $X_b$) were compared to test how well the sampled dihedral angles reproduced the real angular distances. The dihedral angles ($X_b$) of each sampled protein pair were sampled by conditioning on both amino acid sequences and the homologous dihedral angles ($A_a$, $A_b$, $X_a$), and the estimated evolutionary time ($t_{ab}$) for the real protein pair. The regression curves were obtained by a quadratic LOcally-weighted regrESSion (LOESS) with smoothing parameter chosen by leave-one-out cross-validation. The 95% confidence intervals for the mean assume error normality.


Most positions in the homologous pair have low posterior jump probabilities (≈ 0.0), with the exception of positions N38/N38 and E39/G39, which both have high posterior jump probabilities (≈ 1.0). The exchange between a glutamate (at position 39 in 1poh) and a glycine (at position 39 in 1pch) appears to be responsible for the shift in dihedral angle. This exchange corresponds to a significant jump in dihedral angle, with $\langle |\phi|, |\psi| \rangle = \langle 1.63, 0.06 \rangle$ at E39 and $\langle 1.40, 0.22 \rangle$ at G39. The angular distance between the two dihedral angles is 2.01. This is consistent with the amino acid frequency parameters specified by the two site-classes for hidden state 3 (fig. 9).

Site-class $r_1$ indicates that a number of amino acids (alanine, aspartic acid, glycine, histidine, lysine, asparagine, proline, glutamine, arginine, serine, and threonine) other than glutamate plausibly coincide with the particular dihedral angle conformation specified by site-class $r_1$. The involvement of glycine in a jump is not surprising, as it is a small and flexible amino acid, whereas the role of asparagine is less clear. In our analysis of 238 protein pairs, we found that of the seven positions meeting the criteria for hidden state 3 and involving an exchange with asparagine (Asn), four were an exchange between an asparagine and a glycine, whereas the remaining three were between asparagine and one of lysine, histidine, or serine.

Using Dihedral Angles for Alignment

A valuable feature of our model is its ability to account for alignment uncertainty by summing over possible pairwise alignments, using the TKF92 model as a prior distribution over indel histories, whilst simultaneously taking into account neighboring dependencies amongst aligned sites. Doing so results in a sample of alignments, rather than a single alignment. Nevertheless, a single Maximum A Posteriori (MAP) pairwise alignment may be obtained from the alignment samples and used for downstream analysis.

ETDBN and several other alignment methods (namely StatAlign, BAli-Phy, MUSCLE, and MAFFT) were used to infer pairwise alignments from simulated and real data under various combinations of data observations, for example, an amino acid sequence pair ($A_a$, $A_b$), a secondary structure sequence pair ($S_a$, $S_b$), a dihedral angle sequence pair ($X_a$, $X_b$), and combinations thereof.

In the first set of benchmarks (fig 11A) pairs of proteinswere simulated from the ETDBN model conditioned on 38

FIG 7 Cartoon structure representations of E coli glyceraldehyde-3-phosphate dehydrogenase structure (PDB 1gad) are depicted in each paneloverlaid with predictive accuracy when using different combinations of observed data to predict missing dihedral angles in 1gad Thermusaquaticus glyceraldehyde-3-phosphate dehydrogenase (PDB 1cer) was used as a homolog for the purposes of prediction Predictive accuracy isindicated using a color gradient depicting the mean angular distance between the true dihedral angle (Xi

1gad) and the predicted (sampled) dihedralangles (X

i

1gad) at each amino acid position The label at the bottom of each panel indicates the data combination used In (A) no data was used forprediction In (B) only the amino acid sequence corresponding to 1gad (A1gad) was used In (C) the amino acid sequence of 1gad (A1gad) and theamino acid sequence of the homologous protein (A1cer) were used In (D) both amino acids sequences (A1cer and A1gad) and the secondarystructure of the homologous protein (S1cer) were used In (E) both the amino acid sequences (A1cer and A1gad) and the dihedral angles of thehomologous protein (X1cer) were used Finally in panel (F) the same combination of observations was used as in (E) but the alignment was treatedas known a priori

Evolutionary TorusDBN doi101093molbevmsx137 MBE

2095Downloaded from httpsacademicoupcommbearticle-abstract34820853772139by Edward Boyle Library useron 08 November 2017

FIG 8 Benchmarks of predictive accuracy (measured using angular distance lower is better) on a random subset of ten protein pairs in the testdataset giving a representative view of predictive accuracy under six different combinations of observations The dihedral angles Xb of pb weretreated as missing and were sampled under the model whereas pa was a homologous protein used for the purposes of prediction See the legend offigure 7 for a description of each combination (AndashF) The final set of bars denoted lsquoMean (Nfrac14 38)rsquo are the mean values for the entire test dataset ofNfrac14 38 protein pairs The error bars are the standard errors

FIG 9 Depiction of evolutionary hidden state 3 This hidden state was sampled at 069 of sites (the average was 156) The equilibriumfrequencies of r1 and r2 were p1 frac14 0691 and p2 frac14 0309 respectively The jump rate was cfrac14 3176 The corresponding site-class pair probabilitiesare depicted to the right as a function of evolutionary time Note that the dashed red lines depicting the probabilities for (1 2) and (2 1)superimpose exactly because the probabilities are equalmdashthis holds for the jump probabilities of all hidden states as it is required for time-reversibility In the main figure the two rows depict the parameters encoded by the two site-classes respectively Columns 1 and 3 depict theparameters governing the amino acid and secondary structure stochastic processes respectively The secondary structure classes correspond toHfrac14 helix Sfrac14 sheet and Cfrac14 coil Column 2 depicts the WN diffusions The stationary distributions of the WN diffusions are shown using blackcontour lines the direction of the drifts are indicated by the arrows and the magnitude of the drifts at each position indicated using the colorgradient


different pairwise alignments and corresponding evolutionary times. This resulted in a set of 38 simulated pairwise alignments together with corresponding observations, implying that the true underlying alignments were known for each of the simulated protein pairs. ETDBN and a number of other alignment methods were used to infer pairwise alignments for each. The alignment similarity metric (Schwartz et al. 2005) was used to measure the similarity between the inferred alignments and the true alignments, where higher similarity indicates better predictions. It was found that when using the simulated amino acid sequences alone, ETDBN (11A5) outperformed all four other methods tested (11A1: MUSCLE, 11A2: MAFFT, 11A3: StatAlign, and 11A4: BAli-Phy). However, the greater performance of ETDBN compared with other methods cannot be considered a fair comparison, as the data were simulated under the ETDBN model.

More revealing in figure 11A was the alignment similarity under ETDBN when using different combinations of simulated data observations. It was found that secondary structure alone (11A6) performed the worst, which is unsurprising given that only three states were available to align the proteins. The second worst in terms of alignment similarity was amino acid sequences alone (11A5), followed by amino acid sequences and secondary structures (11A7). Interestingly, using dihedral angles only (11A8) outperformed both 11A5 (sequences only) and 11A6 (secondary structures only). Finally, using amino acid sequence together with dihedral angles (11A9), or all three data types combined (11A10), outperformed all other combinations. This illustrates that, at least under simulation conditions, increasing the number of data observations results in better alignment accuracy.

Following that, the various alignment methods were benchmarked against 38 pairwise alignments consisting of real sequence and structure observations in the test dataset. These pairwise alignments were obtained from the HOMSTRAD alignments. The sequence identity of these pairwise alignments ranged from 10% to 93%, with an average sequence identity of 39%. In addition to methods 1–10 in figure 11A, five structural alignment methods were also used: 11) StatAlign (Herman et al. 2014), 12) TM-align (Zhang and Skolnick 2005), 13) Mammoth (Ortiz et al. 2002), 14) Dali (Holm and Rosenström 2010), and 15) CE (Shindyalov and Bourne 1998). These methods were not used in 11A due to the lack of an appropriate model for simulating the evolution of three-dimensional protein structures.

When benchmarking the MAP estimated alignments against the HOMSTRAD alignments (fig. 11B) using real sequences alone for inference $(A_a, A_b)$, ETDBN (11B5) had a similar degree of accuracy when compared with several other sequence-based methods (11B1: StatAlign, 11B2: BAli-Phy, 11B3: MUSCLE, and 11B4: MAFFT). This demonstrates that ETDBN has performance comparable to that of other commonly used sequence alignment methods.

Using $(S_a, S_b)$ alone, ETDBN (11B6) had substantially lower alignment similarity compared with sequence only, which was expected given that a similar result was obtained for the simulated data (11A6). However, when including the real sequences (11B7), the predictive accuracy was once again comparable to sequence-only inferences (11B1–11B5).

FIG. 10. Depiction of two histidine-containing phosphocarriers, PDB 1pch and 1poh, superimposed. On the left is a cartoon representation of the two proteins, corresponding to regions F29-K49 and F29-A42, respectively, with posterior jump probabilities at each position overlaid. On the right is a ball-and-stick representation giving atomic detail for a smaller region (I36-G42 and T36-A42, respectively). The exchange between a glutamate (E39 in 1poh) and a glycine (G39 in 1pch) is associated with a large change in dihedral angle, as indicated by the curved arrows.


When using $(X_a, X_b)$ alone (11B8), the alignment similarity was found to be somewhat worse than in the sequence-only cases. Furthermore, when introducing the sequences (11B9) and secondary structures (11B10) in addition to the dihedral angles, the similarity remained worse than for the sequence-only methods (11B1–11B5), despite the additional information. These results are in contrast to the results we obtained for simulated data (11A8–11A10).

The non-statistical structural alignment methods (11B12–11B15) fared the best, likely because they use criteria similar to those used to construct the HOMSTRAD alignments. When interpreting these results, it is important to note that the HOMSTRAD alignments should not be considered the true underlying alignments and may even be strongly biased. For example, they may favor the closest structural superimposition of structures or the most parsimonious alignments with the fewest indels. In the evolutionary modeling context, our goal is to distinguish between homologous sites (sites that have evolved via mutation alone) and indels. In practice, it is extremely difficult to obtain the true underlying alignment (sets of homologies and indels) because it would require an experiment where every indel event since the common ancestor is observed, a seemingly impossible task outside of simulation or laboratory conditions.

After further investigation, the trend of lower alignment similarity seen in figure 11B6–11B10 when using ETDBN with structural observations, compared with sequence-only or non-statistical structural alignment methods, was found to reverse (11C6–11C10) upon calculating the precision of predicting homologous sites (the fraction of sites which were predicted as homologous and were correctly predicted as such). Therefore, when only dihedral angle observations are used, ETDBN underpredicts the number of homologous sites; however, when a homologous site is predicted, it is correctly predicted more often than when using only amino acid sequences. In particular, ETDBN predicted fewer homologous sites with coiled secondary structure compared with homologous sites with helical or sheet secondary structure. This pattern of results may be in part due to the WN diffusion used to model evolution of dihedral angles. The WN diffusion is suitable for modeling angular drift (small changes in angles localized around a region of the Ramachandran plot), but does not sufficiently capture angular shift (large changes in angles between regions of the Ramachandran plot, which are more likely in coiled regions) due to stationarity. As noted before, the jump model is an abstraction intended to capture the end-points of evolution by allowing a jump between two regions of the Ramachandran plot, abstracting a potential intermediate evolutionary trajectory for the sake of computational tractability. Note that the jump model accurately captures the common cases where a single mutation induces a large conformational shift.
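The precision measure described above can be illustrated with a minimal sketch. The alignment encoding (a list of index pairs with `None` for a gap), the function names, and the toy alignments are assumptions made for illustration; they are not taken from the ETDBN software. A `recall` helper is added only to show how a method can underpredict homologous sites while still having high precision.

```python
def homologous_pairs(alignment):
    """Return the set of (i, j) site pairs aligned as homologous.

    `alignment` is a list of columns (i, j); None marks a gap (indel).
    """
    return {(i, j) for i, j in alignment if i is not None and j is not None}


def precision(predicted, reference):
    """Fraction of predicted homologous pairs that are in the reference."""
    pred = homologous_pairs(predicted)
    ref = homologous_pairs(reference)
    return len(pred & ref) / len(pred) if pred else 0.0


def recall(predicted, reference):
    """Fraction of reference homologous pairs that were predicted."""
    pred = homologous_pairs(predicted)
    ref = homologous_pairs(reference)
    return len(pred & ref) / len(ref) if ref else 0.0


# Hypothetical toy alignments: the prediction replaces one homologous
# column of the reference, (2, 2), with two gap columns.
reference = [(0, 0), (1, 1), (2, 2), (3, None), (4, 3)]
predicted = [(0, 0), (1, 1), (2, None), (3, None), (None, 2), (4, 3)]

print(precision(predicted, reference), recall(predicted, reference))
```

In this toy case every predicted homologous pair is correct, yet one reference pair is missed, which mirrors the pattern of high precision with underprediction.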

Concluding Remarks

The main achievement of this work is a computationally tractable, generative and interpretable probabilistic model of protein sequence and structure evolution on a local scale.

FIG. 11. Alignment benchmarks: (A) averages of alignment similarity across different methods and combinations of data observations, where the observations were simulated from the ETDBN model conditioned on a single known alignment; (B) averages of alignment similarity where the 38 HOMSTRAD alignments in the test dataset were taken as the true alignments; (C) averages of precision in predicting homologous site pairs in 38 sequence pairs from the test dataset, where homologous site pairs in the HOMSTRAD alignments were taken to be the true homologous site pairs.


Previous stochastic models of protein sequence and structure evolution emphasized estimation of evolutionary parameters (Challis and Schmidler 2012; Herman et al. 2014). ETDBN is somewhat of a departure from these previous models, but is likewise capable of estimating evolutionary parameters. We show that estimates of evolutionary times inferred under ETDBN are consistent regardless of whether amino acid sequence or dihedral angle observations are used. In addition, the relationship between evolutionary time and angular distance in real proteins is adequately recapitulated in protein pairs sampled under the model, although the angular distance is underestimated for larger evolutionary times, which might be explained by the limited flexibility of the jump model and by the fact that global dependencies are not taken into account.

Like previous models, ETDBN is capable of dealing with alignment uncertainty by marginalizing over indel histories; it predicts pairwise MAP consensus alignments with accuracy similar to that of score-based and statistical alignment methods.

The generative nature of ETDBN allows us to demonstrate that the underlying empirical distributions over dihedral angles (depicted using Ramachandran plots) are captured, and that the model is capable of predicting missing observations, such as dihedral angles, from a variety of different data types: for example, an amino acid sequence, a homologous amino acid sequence, a homologous secondary structure, a homologous set of dihedral angles, or any combination thereof.

Based on its local nature, ETDBN does not constitute a homology modeling method in itself. Rather, it can be used as a building block, much like fragment libraries model local structure in protein structure prediction methods. ETDBN places the homology modeling problem on a statistical footing, enabling a number of approaches to later be used, such as multi-level modeling, that is, combining fine-grained distributions (for example, distributions over dihedral angles such as ETDBN) and coarse-grained distributions (for example, distributions describing the global properties of proteins, such as compactness). In particular, a method referred to as 'the Reference Ratio method' can be used to combine fine-grained and coarse-grained distributions in a statistically principled manner (see Frellsen et al. 2012; Hamelryck et al. 2010). However, we have shown that the current model can already be used for the inference of evolutionary parameters in its present form.

In addition to multi-level modeling, probabilistic models such as ETDBN allow one to account for and to make statements about uncertainty (e.g., with respect to evolutionary time, alignment, etc.) in a rigorous manner. In principle, ETDBN, like TorusDBN (Boomsma et al. 2008), could be used as a proposal distribution. In other words, ETDBN could be used to sample protein structures (possibly conditioned on various data observations) in a computationally efficient manner, such that the resulting samples are expected to be located in regions of high probability density with respect to the true underlying distribution.

A final key feature of our evolutionary model is its interpretable nature. This interpretability enables the identification of potential evolutionary motifs, that is, common patterns of sequence–structure evolution. We identify one such evolutionary motif in 34 different homologous protein pairs. A major direction for future research is the further identification of such evolutionary motifs. Understanding these evolutionary motifs may 1) improve homology modeling predictions, 2) provide more accurate estimates of evolutionary parameters, and 3) produce better models of protein evolution that more realistically capture evolutionary trajectories through sequence and structure space, which may help identify functionally relevant positions that are potential drug targets.

Future Challenges

Pairwise to Phylogeny

For reasons of computational tractability, the implemented model is pairwise, but it is theoretically possible to generalize it to a phylogeny, such as in Herman et al. (2014). In practice, for three or more sequences on a phylogeny, it is necessary to marginalize out the unobserved ancestral protein states in order to compute likelihoods. Felsenstein's algorithm can be used to marginalize over discrete ancestral states, such as amino acids, in a computationally efficient manner. However, we do not know whether a similar efficient algorithm exists for marginalizing the continuous ancestral dihedral angle states under the WN diffusion, thereby necessitating a more expensive MCMC algorithm. A possibly greater computational hindrance to considering a phylogeny is the alignment problem, which scales as $O(l_1 l_2 \cdots l_N)$, where $l_i$ is the length of sequence $i$ and $N$ is the number of sequences, although MCMC approaches are possible (Herman et al. 2014).

Context-Dependence

Although we believe our model provides a substantial improvement over current stochastic models of sequence and structural evolution, there is still scope for improvement. The WN diffusions used to model dihedral angle evolution adequately capture angular drift (small local changes in dihedral angle), but are less capable of capturing angular shift (large changes in dihedral angle). This is to a considerable extent mitigated by the introduction of jump events, as discussed before. A more realistic model would model the entire evolutionary trajectory, allowing an arbitrary number of switches between site-classes, together with neighboring dependencies amongst adjacent sites along the evolutionary trajectory. Similar context-dependent models are typically computationally expensive and require sophisticated inference procedures (Robinson et al. 2003; Yu and Thorne 2006).

Software Availability

Julia code (tested on both Windows and Linux platforms) is available at http://www.computingforbiology.org/software/etdbn.


Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

The authors acknowledge funding from the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN). The second author acknowledges support from project MTM2016-76969-P from the Spanish Ministry of Economy and Competitiveness, and ERDF. The authors acknowledge valuable comments from three referees that led to substantial improvements in the manuscript, as well as initial discussions with Christian Havn, Ian Lim, and Mathias Cronjäger.

References

Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2):195–201.

Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T. 2008. A generative, probabilistic model of local protein structure. Proc Natl Acad Sci U S A. 105(26):8932–8937.

Boomsma W, Tian P, Frellsen J, Ferkinghoff-Borg J, Hamelryck T, Lindorff-Larsen K, Vendruscolo M. 2014. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proc Natl Acad Sci U S A. 111(38):13852–13857.

Challis CJ, Schmidler SC. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol. 29(11):3575–3587.

Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett. 457(4):413–416.

Echave J, Fernández FM. 2010. A perturbative view of protein structural variation. Proteins Struct Funct Bioinform. 78(1):173–180.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.

Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat. 125(1):1–15.

Frellsen J, Mardia KV, Borg M, Ferkinghoff-Borg J, Hamelryck T. 2012. Towards a general probabilistic model of protein structure: the reference ratio method. In: Bayesian methods in structural bioinformatics. New York: Springer. p. 125–134.

García-Portugués E, Sørensen M, Mardia KV, Hamelryck T. 2017. Langevin diffusions on the torus: estimation and applications. arXiv:1705.00296.

Gilks WR, Richardson S, Spiegelhalter D. 1995. Markov chain Monte Carlo in practice. London: CRC Press.

Grishin NV. 1997. Estimation of evolutionary distances from protein spatial structures. J Mol Evol. 45(4):359–369.

Grishin NV. 2001. Fold change in evolution of protein structures. J Struct Biol. 134(2):167–185.

Gutin AM, Badretdinov AY. 1994. Evolution of protein 3D structures as diffusion in multidimensional conformational space. J Mol Evol. 39(2):206–209.

Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 15(7):910–917.

Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. 2010. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE 5(11):e13714.

Herman JL, Challis CJ, Novák Á, Hein J, Schmidler SC. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol. 31(9):2251–2266.

Holm L, Rosenström P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38(Web Server issue):W545–W549.

Katoh K, et al. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14):3059–3066.

Koshi JM, Goldstein RA. 1998. Models of natural mutations including site heterogeneity. Proteins 32(3):289–295.

Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21(6):1095–1109.

Liò P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14(8):726–733.

Miklós I, Lunter G, Holmes I. 2004. A long indel model for evolutionary sequence alignment. Mol Biol Evol. 21(3):529–540.

Mizuguchi K, Deane CM, Blundell TL, Overington JP. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7(11):2469–2471.

Ortiz AR, Strauss CE, Olmea O. 2002. Mammoth (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 11(11):2606–2621.

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 20(10):1692–1704.

Rohl CA, Strauss CE, Misura KM, Baker D. 2004. Protein structure prediction using Rosetta. Methods Enzymol. 383:66–93.

Schwartz AS, Myers EW, Pachter L. 2005. Alignment metric accuracy. arXiv preprint q-bio/0510052.

Shindyalov IN, Bourne PE. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9):739–747.

Siepel A, Haussler D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol. 11(2–3):413–428.

Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 34(1):3–16.

Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18(5):691–699.

Yu J, Thorne JL. 2006. Dependence among sites in RNA evolution. Mol Biol Evol. 23(8):1525–1537.

Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33(7):2302–2309.


associated with the C$\alpha$ atom: an amino acid ($A_a^{x(i)}$; discrete, one of twenty canonical amino acids), $\phi$ and $\psi$ dihedral angles ($X_a^{x(i)} = \langle \phi_a^{x(i)}, \psi_a^{x(i)} \rangle$; continuous, bivariate), and a secondary structure classification ($S_a^{x(i)}$; discrete, one of three classes: helix (H), sheet (S) or coil (C)). Therefore, $O_a^{x(i)} = (A_a^{x(i)}, X_a^{x(i)}, S_a^{x(i)})$ and $O_b^{y(i)} = (A_b^{y(i)}, X_b^{y(i)}, S_b^{y(i)})$.

Model Structure

The sequence of hidden nodes in the HMM is written as $H = (H^1, H^2, \ldots, H^m)$. Each hidden node $H^i$ in the HMM corresponds to a site observation pair, $O_a^{x(i)}$ and $O_b^{y(i)}$, at an aligned site $i$ in the alignment $M_{ab}$. Initially, we treat the alignment $M_{ab}$ as given a priori, but later modify the HMM to marginalize out an unobserved alignment.

The model is parameterized by $h$ hidden states. Every hidden node $H^i$ corresponding to an aligned site $i$ takes an integer value from 1 to $h$ for the hidden state at node $H^i$. In turn, each hidden state specifies a distribution over a site-class pair $(r_a^i, r_b^i)$ as a function of evolutionary time. A site-class pair consists of two site-classes, $r_a^i$ and $r_b^i$. Each of the two site-classes takes an integer value, 1 or 2; that is, $(r_a^i, r_b^i) \in \{(1,1), (1,2), (2,1), (2,2)\}$. We return to the specific role of the site-class pairs in the next section.

The state of $H^i$, together with the site-class pair $(r_a^i, r_b^i)$ and the evolutionary time $t_{ab}$ separating proteins $p_a$ and $p_b$, specifies a distribution over three conditionally independent stochastic processes describing each of the three types of site observation pairs: $A^i = (A_a^{x(i)}, A_b^{y(i)})$, $X^i = (X_a^{x(i)}, X_b^{y(i)})$, and $S^i = (S_a^{x(i)}, S_b^{y(i)})$. This conditional independence structure allows the likelihood of a site observation pair at an aligned site $i$ to be written as follows:

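$$
p(O^i \mid H^i, r_a^i, r_b^i, t_{ab}) =
\underbrace{p(A^i \mid H^i, r_a^i, r_b^i, t_{ab})}_{\text{amino acid evolution}}
\times
\underbrace{p(X^i \mid H^i, r_a^i, r_b^i, t_{ab})}_{\text{dihedral angle evolution}}
\times
\underbrace{p(S^i \mid H^i, r_a^i, r_b^i, t_{ab})}_{\text{secondary structure evolution}}
\tag{1}
$$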

The assumption of conditional independence provides computational tractability, allowing us to avoid costly marginalization when certain combinations of data are missing (e.g., amino acid sequences present, but secondary structures and dihedral angles missing).
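The practical consequence of conditional independence can be sketched as follows: a missing data type integrates to one, so its factor is simply left out of the product in equation (1). The function name, the dictionary layout and the numerical values below are hypothetical and serve only as an illustration.

```python
import math


def site_log_likelihood(factors, observed):
    """Sum the log-likelihood factors of the data types that were observed.

    factors:  dict mapping a data type name to its likelihood
              p(. | H, r_a, r_b, t_ab) at one aligned site
    observed: set of data type names actually observed at that site
    """
    return sum(math.log(p) for name, p in factors.items() if name in observed)


# Hypothetical per-type likelihoods for one site, one hidden state and
# one site-class pair.
factors = {"amino_acid": 0.02, "dihedral": 0.35, "secondary_structure": 0.6}

full = site_log_likelihood(
    factors, {"amino_acid", "dihedral", "secondary_structure"}
)
# Only sequences observed: the other two factors are dropped, with no
# explicit integration over the missing angles or structure classes.
seq_only = site_log_likelihood(factors, {"amino_acid"})

print(math.exp(full), math.exp(seq_only))
```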

Stochastic Processes: Modeling Evolutionary Dependencies

Each site-class couples together three time-reversible stochastic processes that separately describe the evolution of the three pairs of observation types, as in equation (1). Each site-class is intended to capture both physical and evolutionary features pertaining to sequence and structure. Parameters that correspond to a particular site-class are termed site-class specific, whereas parameters that are shared across all site-classes are termed global. The use of site-class specific parameters, such as site-class specific amino acid frequencies and dihedral angle diffusion parameters, as described in the next section, is intended to model site-specific physical–chemical properties (Halpern and Bruno 1998; Koshi and Goldstein 1998; Lartillot and Philippe 2004).

Amino Acid Evolution

As is typical with models of sequence evolution, amino acid evolution, $p(A_a^{x(i)}, A_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$, is described by a Continuous-Time Markov Chain (CTMC). Each amino acid CTMC is parameterized in the following way: the exchangeability of amino acids is described by a $20 \times 20$ symmetric global exchangeability matrix $S$ (190 free parameters; Whelan and Goldman 2001), a site-class specific set of 20 amino acid equilibrium frequencies $\Pi_r^h = \mathrm{diag}\{\pi_1, \pi_2, \ldots, \pi_{20}\}$ (19 free parameters per site-class), and a site-class specific scaling factor $\Lambda_r^h$ (one free parameter per site-class). Together, these parameters define a site-class specific time-reversible amino acid rate matrix $Q_r^h = \Lambda_r^h S \Pi_r^h$. The stationary distribution of $Q_r^h$ is given by the amino acid equilibrium frequencies $\Pi_r^h$.

Secondary Structure Evolution

Secondary structure evolution, $p(S_a^{x(i)}, S_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$, is also described by a CTMC. For the sake of simplicity, we use only three discrete classes to describe secondary structure at each position: helix (H), sheet (S) and random coil (C).

The exchangeability of secondary structure classes at a position is described by a $3 \times 3$ symmetric global exchangeability matrix $V$ and a site-class specific set of three secondary structure equilibrium frequencies $\Omega_r^h = \mathrm{diag}\{\pi_1, \pi_2, \pi_3\}$. Together, they define a site-class specific time-reversible secondary structure rate matrix $R_r^h = V \Omega_r^h$ with stationary distribution $\Omega_r^h$.
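As a rough illustration of how an exchangeability matrix and equilibrium frequencies combine into a time-reversible CTMC, the sketch below builds a three-state rate matrix and checks detailed balance of the resulting transition probabilities. The numerical values are hypothetical, the diagonal is assumed to be set so that each row sums to zero (the usual CTMC convention), and the small matrix exponential is a plain scaling-and-squaring helper written for this example rather than anything from the ETDBN code.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, squarings=10, terms=12):
    """exp(m) via scaling and squaring with a truncated Taylor series."""
    n = len(m)
    scaled = [[x / 2 ** squarings for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def rate_matrix(exchangeability, frequencies):
    """Off-diagonals V[i][j] * freq[j]; diagonal makes each row sum to zero."""
    n = len(frequencies)
    r = [[exchangeability[i][j] * frequencies[j] if i != j else 0.0
          for j in range(n)] for i in range(n)]
    for i in range(n):
        r[i][i] = -sum(r[i])
    return r


# Hypothetical values for the classes H, S, C.
V = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.5], [2.0, 0.5, 0.0]]
omega = [0.5, 0.2, 0.3]
t_ab = 0.7

R = rate_matrix(V, omega)
P = mat_exp([[x * t_ab for x in row] for row in R])  # P[i][j] = p(j | i, t_ab)

# Joint probability of a secondary structure pair: omega[i] * P[i][j].
joint = [[omega[i] * P[i][j] for j in range(3)] for i in range(3)]
print(joint)
```

The same construction applies to the 20-state amino acid chain, with the extra scaling factor multiplying the rate matrix.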

Dihedral Angle Evolution

Central to our model is evolutionary dependence between dihedral angles, $p(X_a^{x(i)}, X_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$. Typically, the continuous-time evolution of continuous-state random variables is modeled by a diffusive process, such as the OU process, as in Challis and Schmidler (2012). However, an OU process is not appropriate for dihedral angles, as they have a natural periodicity. For this reason, a bivariate diffusion that captures the periodic nature of dihedral angles, the Wrapped Normal (WN) diffusion, was specifically developed for this paper in García-Portugués et al. (2017).

Topologically, the WN diffusion (see fig. 2 for a pictorial example) can be thought of as the analogue of the OU process on the torus $\mathbb{T}^2 = [-\pi, \pi) \times [-\pi, \pi)$. The WN diffusion arises as the wrapping on $\mathbb{T}^2$ of the following Euclidean diffusion:

$$
dX_t = \underbrace{A \sum_{k \in \mathbb{Z}^2} (\mu - X_t - 2k\pi)\, w_k(X_t)}_{\text{drift coefficient}}\, dt + \underbrace{\Sigma^{1/2}}_{\text{diffusion coefficient}}\, dW_t,
\tag{2}
$$

where $W_t$ is the two-dimensional Wiener process, $A$ is the drift matrix, $\mu \in \mathbb{T}^2$ is the stationary mean, $\Sigma$ is the infinitesimal covariance matrix, and


$$
w_k(\theta) = \frac{\phi_{\frac{1}{2}A^{-1}\Sigma}(\theta - \mu + 2k\pi)}{\sum_{m \in \mathbb{Z}^2} \phi_{\frac{1}{2}A^{-1}\Sigma}(\theta - \mu + 2m\pi)}, \quad \theta \in \mathbb{T}^2,
\tag{3}
$$

is a probability density function (pdf) for $k \in \mathbb{Z}^2$. $\phi_\Sigma$ stands for the pdf of a bivariate Gaussian $N(0, \Sigma)$. The pdf (eq. 3) weights the linear drifts of equation (2) such that they become smooth and periodic.

It is shown in García-Portugués et al. (2017) that the stationary distribution of the WN diffusion is a $\mathrm{WN}(\mu, \Sigma)$, which has pdf

$$
p_{\mathrm{WN}}(\theta \mid \mu, \Sigma) = \sum_{k \in \mathbb{Z}^2} \phi_\Sigma(\theta - \mu + 2k\pi).
\tag{4}
$$

Despite involving an infinite sum over $\mathbb{Z}^2$, taking just the first few terms of this sum provides a tractable and accurate approximation to the stationary density for most realistic parameter values.
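The truncation of the wrapped normal density can be sketched in a few lines. The function names, the truncation level `k_max` and the parameter values are assumptions chosen for illustration; the sketch only evaluates the generic bivariate wrapped normal density with a given mean and covariance.

```python
import math

TWO_PI = 2.0 * math.pi


def gaussian_pdf2(x, y, cov):
    """pdf at (x, y) of a zero-mean bivariate Gaussian with 2x2 covariance."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    quad = (d * x * x - 2.0 * b * x * y + a * y * y) / det
    return math.exp(-0.5 * quad) / (TWO_PI * math.sqrt(det))


def wn_pdf(theta, mu, cov, k_max=2):
    """Wrapped normal density, truncating the sum over Z^2 to |k1|, |k2| <= k_max."""
    total = 0.0
    for k1 in range(-k_max, k_max + 1):
        for k2 in range(-k_max, k_max + 1):
            total += gaussian_pdf2(theta[0] - mu[0] + TWO_PI * k1,
                                   theta[1] - mu[1] + TWO_PI * k2, cov)
    return total


# Hypothetical parameters.
mu = (-1.0, 2.5)
cov = [[0.5, 0.1], [0.1, 0.3]]

# Midpoint rule on the torus [-pi, pi) x [-pi, pi): should be close to 1.
n = 60
step = TWO_PI / n
grid = [-math.pi + (i + 0.5) * step for i in range(n)]
mass = sum(wn_pdf((u, v), mu, cov) for u in grid for v in grid) * step * step
print(mass)
```

For concentrated distributions such as this one, truncating at `k_max=1` and at `k_max=3` gives practically the same density values, consistent with the remark that the first few terms suffice.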

Maximum Likelihood Estimation (MLE) for diffusions is based on the transition probability density (tpd), which only has a tractable analytical form for very few specific processes. A highly tractable and accurate approximation to the tpd is given for the WN diffusion. This approximation results from weighting the tpd of the OU process in the same fashion as the linear drifts are weighted in equation (2), yielding the following multimodal pseudo-tpd:

$$
\tilde{p}(\theta_2 \mid \theta_1, A, \mu, \Sigma, t) = \sum_{m \in \mathbb{Z}^2} p_{\mathrm{WN}}(\theta_2 \mid \mu_t^m, \Gamma_t)\, w_m(\theta_1),
\tag{5}
$$

with $\theta_1, \theta_2 \in \mathbb{T}^2$, $\mu_t^m = \mu + e^{-tA}(\theta_1 - \mu + 2\pi m)$ and $\Gamma_t = \int_0^t e^{-sA} \Sigma e^{-sA^T}\, ds$. The pseudo-tpd provides a good approximation to the true tpd in key circumstances: 1) $t \to 0$, since it collapses in the Dirac delta; 2) $t \to \infty$, since it converges to the stationary distribution; 3) high concentration, since the WN diffusion becomes an OU process. Furthermore, it is shown in García-Portugués et al. (2017) that the pseudo-tpd has a lower Kullback–Leibler divergence with respect to the true tpd than the Euler and Shoji-Ozaki pseudo-tpds for most typical scenarios and discretization times in the diffusion trajectory.

A further desirable property of the pseudo-tpd is that it obeys the time-reversibility equation, which in terms of $(X_a^{x(i)}, X_b^{y(i)})$ is

$$
\tilde{p}(X_b^{y(i)} \mid X_a^{x(i)}, A, \mu, \Sigma, t_{ab})\, p_{\mathrm{WN}}(X_a^{x(i)} \mid \mu, \tfrac{1}{2}A^{-1}\Sigma)
= \tilde{p}(X_a^{x(i)} \mid X_b^{y(i)}, A, \mu, \Sigma, t_{ab})\, p_{\mathrm{WN}}(X_b^{y(i)} \mid \mu, \tfrac{1}{2}A^{-1}\Sigma).
$$

Indeed, the WN diffusion is the unique time-reversible diffusion with constant diffusion coefficient and stationary pdf (eq. 4), in the same way the OU is with respect to a Gaussian. Time-reversibility is an assumption of the overall model and of many other models of sequence evolution. A benefit of time-reversibility in a pairwise model such as ETDBN is that one of the proteins in a pair may be arbitrarily chosen as the ancestor, thus avoiding a computationally expensive marginalization of an unobserved ancestor.
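The structure of the pseudo-tpd and its time-reversibility can be checked numerically in a simplified setting. The sketch below is a one-dimensional analogue (a single angle with scalar drift `alpha` and diffusion variance `sigma2`), which is an assumption made to keep the example short; it is not the bivariate implementation used for ETDBN. In one dimension the stationary variance is `sigma2 / (2 * alpha)` and the integral defining the transition covariance has the closed form used in the code. All names and values are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def normal_pdf(x, var):
    """pdf of a zero-mean univariate Gaussian with variance `var`."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(TWO_PI * var)


def wn_pdf(theta, mu, var, k_max=3):
    """Univariate wrapped normal density with a truncated sum over k."""
    return sum(normal_pdf(theta - mu + TWO_PI * k, var)
               for k in range(-k_max, k_max + 1))


def pseudo_tpd(theta2, theta1, alpha, mu, sigma2, t, k_max=3):
    """One-dimensional analogue of the weighted-OU pseudo-tpd."""
    stat_var = sigma2 / (2.0 * alpha)                          # stationary variance
    gamma_t = stat_var * (1.0 - math.exp(-2.0 * alpha * t))    # OU transition variance
    ms = range(-k_max, k_max + 1)
    weights = [normal_pdf(theta1 - mu + TWO_PI * m, stat_var) for m in ms]
    norm = sum(weights)
    total = 0.0
    for m, w in zip(ms, weights):
        # Mean of the OU transition started at the m-th winding of theta1.
        mu_t = mu + math.exp(-alpha * t) * (theta1 - mu + TWO_PI * m)
        total += wn_pdf(theta2, mu_t, gamma_t, k_max) * (w / norm)
    return total


# Hypothetical parameter values and angles.
alpha, mu, sigma2, t = 1.0, 0.5, 1.5, 0.4
a, b = -2.0, 1.2
stat_var = sigma2 / (2.0 * alpha)

forward = pseudo_tpd(b, a, alpha, mu, sigma2, t) * wn_pdf(a, mu, stat_var)
backward = pseudo_tpd(a, b, alpha, mu, sigma2, t) * wn_pdf(b, mu, stat_var)
print(forward, backward)  # the two joint densities should agree
```

For large `t` the same function approaches the stationary wrapped normal density, in line with circumstance 2) listed for the pseudo-tpd.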

The likelihood of a dihedral angle observation pair $(X_a^{x(i)}, X_b^{y(i)})$, assuming that $X_a^{x(i)}$ is drawn from the stationary distribution, is given by

$$
\begin{aligned}
p(X_a^{x(i)}, X_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})
&= p(X_a^{x(i)}, X_b^{y(i)} \mid A, \mu, \Sigma, t_{ab}) \\
&\approx \tilde{p}(X_b^{y(i)} \mid X_a^{x(i)}, A, \mu, \Sigma, t_{ab})\, p_{\mathrm{WN}}(X_a^{x(i)} \mid \mu, \tfrac{1}{2}A^{-1}\Sigma).
\end{aligned}
\tag{6}
$$

$A$ and $\Sigma$ are constrained to yield a covariance matrix $A^{-1}\Sigma$. A parameterization that achieves this is $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2)$ and

$$
A = \begin{pmatrix} \alpha_1 & \frac{\sigma_1}{\sigma_2}\alpha_3 \\ \frac{\sigma_2}{\sigma_1}\alpha_3 & \alpha_2 \end{pmatrix}, \quad \alpha_1 \alpha_2 > \alpha_3^2.
$$

$\alpha_1$ and $\alpha_2$ are the drift components for the $\phi$ and $\psi$ dihedral angles, respectively. Dependence (correlation) between the dihedral angles is captured by $\alpha_3$. A depiction of a WN diffusion with given drift and diffusion parameters is shown in figure 2.
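A small sketch can confirm that this parameterization yields a symmetric, positive definite matrix for one half of the product of the inverse drift matrix and the diffusion covariance. Function names and values are hypothetical, and the additional requirement that the first drift component be positive is an assumption added here so that the result is a valid covariance.

```python
def wn_parameters(alpha1, alpha2, alpha3, sigma1, sigma2):
    """Build the drift matrix A and diffusion covariance Sigma (2x2 lists)."""
    if alpha1 <= 0 or alpha1 * alpha2 <= alpha3 ** 2:
        raise ValueError("require alpha1 > 0 and alpha1 * alpha2 > alpha3 ** 2")
    a = [[alpha1, alpha3 * sigma1 / sigma2],
         [alpha3 * sigma2 / sigma1, alpha2]]
    sigma = [[sigma1 ** 2, 0.0], [0.0, sigma2 ** 2]]
    return a, sigma


def half_inv_a_sigma(a, sigma):
    """Return (1/2) * inverse(A) * Sigma for 2x2 matrices."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    inv = [[a[1][1] / det, -a[0][1] / det],
           [-a[1][0] / det, a[0][0] / det]]
    return [[0.5 * sum(inv[i][k] * sigma[k][j] for k in range(2))
             for j in range(2)] for i in range(2)]


# Hypothetical values.
A, Sigma = wn_parameters(alpha1=1.0, alpha2=2.0, alpha3=0.5,
                         sigma1=1.0, sigma2=2.0)
cov = half_inv_a_sigma(A, Sigma)
print(cov)
```

The off-diagonal scaling by the ratio of the two diffusion standard deviations is what makes the product symmetric even though the drift matrix itself is not.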

Site-Classes: Constant Evolution and Jump Events

We now turn to the meaning of the site-class pairs. Two modes of evolution are modeled: constant evolution and jump events. Constant evolution occurs when the site-class starting in protein $p_a$ at aligned site $i$, $r_a^i$, is the same as the site-class ending in protein $p_b$ at aligned site $i$, $r_b^i$, that is, $r_a^i = r_b^i$. Thus, the distribution over observation pairs at a site is specified by a single site-class.

As already stated, a site-class specifies the parameters of the three conditionally independent stochastic processes describing amino acid, dihedral angle and secondary structure evolution. A limitation of "constant evolution" is that the coupling between the three stochastic processes is somewhat weak. This in part stems from the time-reversibility of the stochastic processes: swapping the order of one of the three observation pairs at a homologous site, e.g., (Glycine, Proline) instead of (Proline, Glycine), does not alter the likelihood in equation (1). Restated from a generative perspective, there is no 'directional coupling': the direction of an amino acid interchange does not inform the direction of change in dihedral angle or secondary structure. Such a coupling is desirable. For example, replacing a glycine in an α-helix in one protein with a proline at the homologous position in a second protein would be expected to break the α-helix in the second protein, and to strongly inform the plausible dihedral angle conformations in the second protein.

Ideally, we would consider a model in which the underlying site-classes were not fixed over the evolutionary trajectory separating the two proteins, as in the case of constant evolution described above, but instead were able to 'evolve' in time. This would allow occasional switches in the underlying site-class at a particular homologous site, which would create a stronger dependency between amino acid, dihedral angle and secondary structure evolution that furthermore captures the directional coupling we desire. Such an approach is considered computationally intractable due to the introduction of context-dependence when having to consider neighboring dependencies amongst evolutionary trajectories at adjacent sites.

In order to approximate this 'ideal' model in a computationally efficient manner, we introduce the notion of a jump event. A jump event occurs when $r_a^i \neq r_b^i$. Whereas constant evolution is intended to capture angular drift (changes in dihedral angles localized to a region of the Ramachandran plot), a jump event is intended to create a directional coupling between amino acid and structure evolution, and is also expected to capture angular shift (large changes in dihedral angles, possibly between distant regions of the Ramachandran plot).

The hidden state at node $H^i$, together with the evolutionary time $t_{ab}$ separating proteins $p_a$ and $p_b$, specifies a joint distribution over a site-class pair:

$$p(r_a^i, r_b^i \mid H^i, t_{ab}) = p(r_a^i \mid H^i, r_b^i, t_{ab})\, p(r_b^i \mid H^i), \qquad (7)$$

where

$$p(r_a^i \mid H^i, r_b^i, t_{ab}) = \begin{cases} e^{-\gamma_{H^i} t_{ab}} + \pi_{H^i, r_a^i}\,(1 - e^{-\gamma_{H^i} t_{ab}}) & \text{if } r_a^i = r_b^i, \\ \pi_{H^i, r_a^i}\,(1 - e^{-\gamma_{H^i} t_{ab}}) & \text{if } r_a^i \neq r_b^i, \end{cases}$$

and $p(r_a^i \mid H^i) = \pi_{H^i, r_a^i}$ and $p(r_b^i \mid H^i) = \pi_{H^i, r_b^i}$. Here $\pi_{H^i, r_a^i}$ and $\pi_{H^i, r_b^i}$ are model parameters specifying the probability of starting in site-class $r_a^i$ or $r_b^i$, respectively, corresponding to the hidden state specified by node $H^i$. $\gamma_{H^i} > 0$ is a model parameter giving the jump rate corresponding to the hidden state given by node $H^i$.

The site-class jump probabilities have been chosen so that time-reversibility holds; in other words,

$$p(r_a^i \mid H^i, r_b^i, t_{ab})\, p(r_b^i \mid H^i) = p(r_b^i \mid H^i, r_a^i, t_{ab})\, p(r_a^i \mid H^i).$$
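The site-class pair distribution in equation (7) can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: it tabulates the four pair probabilities and confirms that they sum to one and are symmetric, as time-reversibility requires. The function name `site_class_pair_probs` is hypothetical. The parameter values are those reported for hidden state 3 in the caption of figure 9, read with decimal points assumed ($\pi_1 = 0.691$, $\pi_2 = 0.309$, $\gamma = 3.176$), and the evolutionary time 0.3 is an arbitrary illustrative choice.

```python
import math


def site_class_pair_probs(pi, gamma, t):
    """Return joint[a][b] = p(r_a = a, r_b = b | H, t) following equation (7).

    pi    -- starting probabilities of the two site-classes (assumed to sum to 1)
    gamma -- jump rate of the hidden state (assumed > 0)
    t     -- evolutionary time separating the two proteins
    """
    stay = math.exp(-gamma * t)  # weight of the no-jump term e^{-gamma t}
    joint = [[0.0, 0.0], [0.0, 0.0]]
    for a in range(2):
        for b in range(2):
            # conditional p(r_a | H, r_b, t): jump term, plus no-jump term if a == b
            cond = pi[a] * (1.0 - stay)
            if a == b:
                cond += stay
            joint[a][b] = cond * pi[b]  # multiply by p(r_b | H)
    return joint


# Values read from the figure 9 caption (decimal points assumed).
joint = site_class_pair_probs(pi=(0.691, 0.309), gamma=3.176, t=0.3)
total = sum(sum(row) for row in joint)
print("sum over the four pairs:", total)
print("p(1,2) and p(2,1):", joint[0][1], joint[1][0])
```

The two off-diagonal entries both equal $\pi_1 \pi_2 (1 - e^{-\gamma t})$, which is why the (1, 2) and (2, 1) curves coincide for every hidden state.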

The hidden state at node $H^i$, together with a site-class pair $(r_a^i, r_b^i)$ and the evolutionary time $t_{ab}$, specifies the joint likelihood over site observation pairs:

$$p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab}) = \begin{cases} p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_c^i, t_{ab}) & \text{if } r_a^i = r_b^i = r_c^i, \\ p(O_a^{x(i)} \mid H^i, r_a^i)\, p(O_b^{y(i)} \mid H^i, r_b^i) & \text{if } r_a^i \neq r_b^i. \end{cases} \qquad (8)$$

In the case of constant evolution, evolution at aligned site $i$ is described in terms of the same site-class $r_c^i$. Evolution is considered constant because each observation type is drawn from a single stochastic process specified by $H^i$ and $r_c^i$. Note that the strength of the evolutionary dependency within an observation pair is a function of the evolutionary time $t_{ab}$.

In the case of a jump event, the evolutionary processes are, after the evolutionary jump, restarted independently in the stationary distribution of the new site-class. Thus, the site observations $O_a^{x(i)}$ and $O_b^{y(i)}$ are assumed to be drawn from the stationary distributions of two separate stochastic processes corresponding to site-classes $r_a^i$ and $r_b^i$, respectively.

This implies that, conditional on a jump, the likelihood of the observations is no longer dependent on $t_{ab}$. A jump event is an abstraction that captures the end-points of the evolutionary process but ignores the potential evolutionary trajectory linking the two site observations. The advantage of abstracting the evolutionary trajectory is that there is no need to perform a computationally expensive marginalization over all possible trajectories, as might be necessary in a model where the hidden states evolve along a tree. The likelihood of an observation pair is now simply a sum over the four possible site-class pairs:

$$p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, t_{ab}) = \sum_{(r_a^i, r_b^i) \in R} p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})\, p(r_a^i, r_b^i \mid H^i, t_{ab}),$$

where $R = \{(1, 1), (1, 2), (2, 1), (2, 2)\}$ is the set of four site-class pairs, $p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$ is given by equation (8) and $p(r_a^i, r_b^i \mid H^i, t_{ab})$ is given by equation (7).

Identification of Evolutionary Motifs Encoding Jump Events
In order to identify aligned sites having potential evolutionary motifs encoding jump events, a specific criterion was developed.

For a particular protein pair, inference was performed under the model conditioned on the amino acid sequences and dihedral angles of both proteins, $(A_a, A_b, X_a, X_b)$. Homologous sites corresponding to a single hidden state and with evidence of a jump event ($r_a^i \neq r_b^i$) at posterior probability > 0.90 were identified, that is, the $i$'s such that $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_a, X_b) > 0.90$.

FIG. 2. Drift vector field for the WN diffusion with $A = (1, 0.5; 0.5, 0.5)$, $\mu = (0, 0)$ and $\Sigma = (1.5)^2 I$. The color gradient represents the Euclidean norm of the drift. The contour lines represent the stationary distribution. An example trajectory starting at $x_0 = (0, 0)$ and ending at $x_2$, running in the time interval [0, 2], is depicted using a white to red color gradient indicating the progression of time. The periodic nature of the diffusion can be seen in the wrapping of both the stationary distribution and the trajectory at the boundaries of the square plane. The fact that the stationary distribution is not aligned with the horizontal and vertical axes illustrates the dependence (given by $\alpha_3$) between the $\phi$ and $\psi$ dihedral angles.

In a second filtering step, the amino acid sequences and a single set of dihedral angles corresponding to one of the proteins were used ($A_a, A_b, X_a$ or $A_a, A_b, X_b$) to infer the posterior probability, this time at a lower threshold: $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_a) > 0.50$ or $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_b) > 0.50$. This second criterion ensured that the evolutionary motif was identifiable under typical conditions where one has limited access to structural information (in this case, a single protein structure in a pair). Only those aligned sites meeting both criteria were selected for downstream analysis.
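The two-step filter amounts to a simple thresholding rule on posterior jump probabilities. The sketch below assumes the three sets of posteriors have already been computed by inference under the model and are supplied as plain lists indexed by aligned site; the function name `select_motif_sites` and the example values are hypothetical.

```python
def select_motif_sites(post_both, post_a_only, post_b_only,
                       strict=0.90, relaxed=0.50):
    """Return indices of aligned sites that pass both filtering criteria.

    post_both   -- posterior jump probability given (A_a, A_b, X_a, X_b)
    post_a_only -- posterior jump probability given (A_a, A_b, X_a)
    post_b_only -- posterior jump probability given (A_a, A_b, X_b)
    """
    selected = []
    for i, p_both in enumerate(post_both):
        passes_first = p_both > strict
        # second criterion: identifiable with only one structure observed
        passes_second = post_a_only[i] > relaxed or post_b_only[i] > relaxed
        if passes_first and passes_second:
            selected.append(i)
    return selected


# Hypothetical posteriors for four aligned sites.
sites = select_motif_sites(
    post_both=[0.95, 0.99, 0.40, 0.91],
    post_a_only=[0.60, 0.30, 0.90, 0.50],
    post_b_only=[0.10, 0.45, 0.95, 0.51],
)
print(sites)
```

In this toy input the second site passes the first criterion but is dropped by the second, which mirrors the reduction in candidate sites between the two filtering steps.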

Statistical Alignment: Modeling Insertions and Deletions
Protein sequences can undergo not only amino acid substitutions, due to underlying nucleotide mutations in the coding sequence, but also indel events. To account for this, a modified pairwise TKF92 alignment HMM based on Miklós et al. (2004) was implemented. The TKF92 alignment HMM was augmented with the ETDBN evolutionary hidden states in order to capture local sequence and structure evolutionary dependencies. Furthermore, it was modified such that neighboring dependencies amongst hidden states at adjacent alignment sites were modeled. For more details, see ‘Statistical alignment’ in the Supplementary Material online.

Whilst it is possible to fix the alignment in advance by pre-aligning the sequences using one of the many available alignment methods (Katoh et al. 2002; Edgar 2004), or by using a curated alignment (such as from the HOMSTRAD database), doing so ignores alignment uncertainty.

Training and Test Datasets
A training dataset of 1,200 protein pairs (2,400 proteins; 417,870 site observation pairs) and a test dataset of 38 protein pairs (76 proteins; 14,125 site observation pairs) were assembled from 1,032 protein families in the HOMSTRAD database. For further details, see ‘Construction of test and training datasets’ in the Supplementary Material online.

Model Training and Selection
Maximum Likelihood Estimation of the model parameters $\Psi$ was done using Stochastic Expectation Maximization (StEM; Gilks et al. 1995).

For further details of the E- and M-steps of the StEM algorithm, and for details about model selection, please refer to ‘Model training and selection’ in the Supplementary Material online.

Results and Discussion

Selecting the Number of Hidden States
Fourteen models were trained (8, 16, 32, 48, 52, 56, 60, 64, 68, 72, 76, 80, 96 and 112 hidden state models). The 64 hidden state model was chosen as the best model, as it had the lowest Bayesian Information Criterion (BIC; fig. 3).
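Model selection by BIC can be sketched as follows, using the standard definition $\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$, where $k$ is the number of free parameters, $n$ the number of observations and $\hat{L}$ the maximized likelihood. The log-likelihoods, parameter counts and sample size below are invented placeholders, not values from the trained ETDBN models, and which quantity the authors used for $n$ is not stated in this section.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def select_by_bic(models, n_obs):
    """models maps a label to (maximized log-likelihood, number of free parameters)."""
    scores = {label: bic(ll, k, n_obs) for label, (ll, k) in models.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical candidates: label -> (log-likelihood, free parameters).
candidates = {8: (-1000.0, 10), 16: (-900.0, 30), 32: (-895.0, 80)}
best, scores = select_by_bic(candidates, n_obs=1000)
print("selected model:", best)
```

The penalty term $k \ln n$ grows with the number of hidden states, so a larger model is selected only when its gain in log-likelihood outweighs the added parameters.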

Stationary Distributions over Dihedral Angles Capture the Empirical Distribution
Figure 4 illustrates the sampled and empirical dihedral angle distributions. There is a good correspondence between dihedral angles sampled under the model (fig. 4, left) and the empirical distribution of dihedral angles in our training dataset (fig. 4, right) for all three cases illustrated (all amino acids, glycine only and proline only). The correspondence is not surprising, given that ETDBN is effectively a mixture model with a large number of mixture components.

Estimates of Evolutionary Time from Dihedral Angles Are Consistent with Estimates from Sequence
Whilst ETDBN has a general scope with respect to applications (including acting as a proposal distribution or as a building block in a homology modeling application), we envision the primary application being inference of evolutionary parameters.

Figure 5 compares evolutionary times estimated using only pairs of homologous amino acid sequences versus only pairs of homologous dihedral angles. As desired, the two estimates of evolutionary time for each protein pair are similar, as can be seen by the proximity of the points to the identity line.

A paired t-test gave a p-value of 0.578, thus failing to reject the null hypothesis that there is no difference between branch lengths estimated using sequence only versus angles only. This indicates that there is sufficient evolutionary information in the dihedral angles to estimate the evolutionary times, and that the model is consistent in its estimates, lacking a significant tendency to underestimate or overestimate the evolutionary times when either sequence or dihedral angles are used.

Interestingly, the variance in the sampled evolutionary times is higher when dihedral angles only are used, as compared with sequence only (see fig. S1, Supplementary Material online).

The Relationship between Evolutionary Time and Angular Distance Is Adequately Modeled
We investigated the relationship between evolutionary time and angular distance for real protein pairs and for protein pairs where the dihedral angles of $p_b$ ($X_b$) were treated as missing and hence sampled (fig. 6).

As expected, for both real and sampled pairs, angular distance tends to increase as a function of evolutionary time. For larger evolutionary times a plateau begins to emerge, which is expected, as the maximum possible theoretical angular distance is $\sqrt{8} \approx 2.828$.
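The definition of angular distance is given in the Supplementary Material and is not reproduced in this section. One definition that is consistent with the stated maximum of $\sqrt{8}$ is the chordal distance obtained by embedding each angle on the unit circle; the sketch below assumes that definition purely for illustration, and the function name `angular_distance` is hypothetical.

```python
import math


def angular_distance(x, y):
    """Chordal distance between two (phi, psi) pairs given in radians.

    Each angle is mapped to (cos, sin) on the unit circle, so the distance
    respects the periodicity of the torus. This definition is an assumption.
    """
    squared = 0.0
    for a, b in zip(x, y):
        squared += (math.cos(a) - math.cos(b)) ** 2
        squared += (math.sin(a) - math.sin(b)) ** 2
    return math.sqrt(squared)


# Antipodal points in both angles give the largest possible distance.
print(angular_distance((0.0, 0.0), (math.pi, math.pi)))
# Angles on either side of the +/- pi boundary are close, not far apart.
print(angular_distance((-math.pi + 0.01, 0.0), (math.pi - 0.01, 0.0)))
```

Under this assumed definition each angle contributes at most $2^2 = 4$ to the squared distance, so the bound for a $(\phi, \psi)$ pair is $\sqrt{4 + 4} = \sqrt{8}$.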

When the evolutionary time is exactly zero ($t_{ab} = 0$), under our model the angular distance between sampled dihedral angles is exactly zero (not shown in fig. 6). However, this is not expected to be the case for real protein pairs when the two sequences are identical (due to the inherently flexible nature of proteins, different experimental conditions, experimental noise, etc.). It is therefore not surprising that the regression curve for the real protein pairs does not pass through zero.


For small evolutionary times (< 0.2) the curves for the real and sampled protein pairs show a good correspondence; however, for larger evolutionary times the model tends to underestimate angular distances. This may reflect the fact that the tpd of the WN diffusion specified is localized around its mean even when the evolutionary time is large; therefore, dihedral angles distant from this mean are unlikely to be sampled. To a certain extent this is mitigated by the jump model, which occasionally allows for large changes in dihedral angle, but which may still be somewhat limited in its flexibility, as jumps can only occur between two site-classes. The majority of protein pairs in our training dataset represent smaller evolutionary times (81.7% of evolutionary times are smaller than 0.4), and therefore protein pairs with larger evolutionary times and their associated jumps are underrepresented in our dataset, which may also explain the underestimation.

An additional possibility is that ETDBN does not attempt to model global dependencies. Echave and Fernández (2010) use an LFENM model (which does take into account global dependencies) and provide evidence showing that the majority of structural changes are due to collective global deformations rather than local deformations. A local model such as ETDBN, by definition, does not take into account global dependencies and therefore does not fully account for their contribution to structural divergence.

Evaluation of the Model
The conditional independence structure in equation (1) enables computationally efficient sampling from the model under different combinations of observed or missing data. For example, ETDBN can be used to sample (i.e., predict) the dihedral angles of a protein from its corresponding amino acid sequence, a homologous amino acid sequence, a homologous set of dihedral angles, the corresponding secondary structure, a homologous secondary structure, or any combination of them.

Predictive accuracy was measured using the 38 homologous protein pairs in the test dataset. For every protein pair ($p_a$, $p_b$), the dihedral angles of $p_b$ were treated as missing, and these missing dihedral angles were sampled under the model given a particular combination of observation types. The average angular distance between the sampled and known dihedral angles was used as the measure of predictive accuracy.

Figure 7 gives an example of predictive accuracy under different combinations of observation types, overlaid on a cartoon representation of the protein structure being predicted, whereas figure 8 provides a representative view of predictive accuracy across 10 different protein pairs in the test dataset for different combinations of observation types. We highlight some of the key patterns identified in figures 7 and 8 as follows.

Combination 1 refers to random sampling from the model, implying that no data observations were conditioned on besides the respective lengths of proteins $p_a$ and $p_b$. The average angular distance between the true and predicted dihedral angles was 1.6. Random sampling acts as a baseline for predictive accuracy. It is apparent from figure 7 that the model has a propensity to predict right-handed α-helices, which occupy the most populated region of the Ramachandran plot.

Under combination 2, only the amino acid sequence corresponding to $p_b$ is observed. As expected, in figures 7 and 8 there is an increase in predictive accuracy with the addition of the amino acid sequence, relative to combination 1.

Under combination 3, we add in the amino acid sequence of a homologous protein ($p_a$). In all ten cases there is an improvement in predictive accuracy. The improvement in predictive accuracy is reasonable, as knowledge of the sequence evolutionary trajectory is expected to encode information about structure evolution and hence will inform the dihedral angle conformational possibilities.

Under combination 4, in addition to the two amino acid sequences, we treat the homologous secondary structure as observed. This results in a substantial improvement in predictive accuracy, as one would expect. Knowledge of the amino acid sequence and a homologous secondary structure strongly informs the regions of the Ramachandran plot that are likely to be occupied.

Under combination 5 (which we consider the canonical combination, that is, the standard homology modeling scenario) we treat both amino acid sequences as observed, as well as the dihedral angles of the homologous protein ($p_a$). In all cases the predictive accuracy improves over combination 4. This is anticipated, as the homologous dihedral angles are expected to be the best proxy for the missing dihedral angles and are therefore expected to be more informative than secondary structure alone. Note that the availability of a homologous amino acid sequence pair, here and in combination 4, is consequential, as it informs the evolutionary time parameter $t_{ab}$, which will typically constrain the distribution over dihedral angles and reduce the associated uncertainty.

Finally, in combination 6, the same data observations as in combination 5 are used, except that the alignment is treated as given a priori (by the HOMSTRAD alignment) rather than as unobserved. The HOMSTRAD alignment is based on a structural and sequence alignment of $p_a$ and $p_b$ and is therefore expected to encode a higher degree of homology and structural information than combination 5 (where the alignment is treated as unobserved and a marginalization over alignments is therefore performed). On average, there is a slight improvement in predictive accuracy when fixing the alignment, although the magnitude of the improvement is not substantial. This suggests that the alignment HMM is accurate.

FIG. 3. BIC scores (points) and number of free parameters (curve) as a function of the number of hidden states across 14 models (indicated above the dotted vertical lines) trained using the 1,200 protein pairs in the training dataset. A 64 hidden state model had the lowest BIC score. Each model represents the best of several attempts.

FIG. 4. Ramachandran plots depicting sampled and empirical dihedral angle distributions. The top row depicts the distributions for all amino acids, the middle for glycine only and the bottom for proline only. The leftmost plots show dihedral angles sampled under the jump model, whereas the rightmost plots show the empirical distributions of dihedral angles in the training dataset.


The alignment HMM accounts for alignment uncertainty in a principled manner, which is particularly useful when an appropriate alignment is unavailable. However, it should be noted that inference scales as $O(|p_a||p_b|h^2)$ when treating the alignment as unobserved. Inference scales as $O(mh^2)$ when the alignment is fixed a priori, where $m$ is the length of alignment $M_{ab}$ and is typically much smaller than $|p_a||p_b|$.
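A rough feel for the difference between the two scaling regimes can be obtained by counting the leading-order terms. The sketch below assumes that $h$ denotes the number of hidden states, ignores constant factors, and uses invented protein and alignment lengths; the function name `inference_cost` is hypothetical.

```python
def inference_cost(len_a, len_b, h, alignment_length=None):
    """Leading-order operation count for inference (constant factors ignored).

    With alignment_length=None the alignment is treated as unobserved,
    giving |p_a| * |p_b| * h^2; otherwise the alignment is fixed, giving m * h^2.
    """
    if alignment_length is None:
        return len_a * len_b * h * h
    return alignment_length * h * h


# Hypothetical pair of 300-residue proteins, 64 hidden states, alignment of 320 columns.
unobserved = inference_cost(300, 300, 64)
fixed = inference_cost(300, 300, 64, alignment_length=320)
print(unobserved, fixed, unobserved / fixed)
```

For these illustrative sizes the count with an unobserved alignment is a few hundred times larger, since the ratio is simply $|p_a||p_b| / m$.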

It should be emphasized that we do not expect ETDBN to compete with structure prediction packages such as Rosetta (Rohl et al. 2004) or homology modeling software such as that of Arnold et al. (2006) in terms of predictive accuracy. Our current model is a local model of structure evolution: it is not even expected to capture fundamental constraints such as the radius of gyration of a protein, or other global features typical of proteins.

Evolutionary Hidden States Reveal a Common Evolutionary Motif
One benefit of ETDBN is that the 64 evolutionary hidden states learned during the training phase are interpretable. We give an example of a hidden state encoding a jump event that was subsequently found to represent an evolutionary motif present in a large number of protein pairs in our test and training datasets.

Evolutionary hidden state 3 (fig. 9) was selected from the 64 hidden states as an example of a hidden state encoding a jump event and capturing angular shift (a large change in dihedral angle). A notable feature of this hidden state is that the change in dihedral angles between site-classes $r_1$ and $r_2$ is associated with specific amino acid changes. In site-class $r_1$ the amino acid frequencies are relatively spread out amongst a number of amino acids, whereas in site-class $r_2$ the frequencies are particularly concentrated in favor of glycine (Gly) and asparagine (Asn), with glycine being significantly more probable in site-class $r_2$ than in $r_1$. This suggests that, conditioned on hidden state 3, an exchange between a glycine and another amino acid is likely indicative of a jump and hence of a corresponding change in dihedral angle. This is consistent with what we find in a subsequent analysis of evolutionary motifs. This particular jump occurs in coil regions.

Having selected hidden state 3, we analyzed positions in 238 protein pairs for evidence of the corresponding evolutionary motif: the 38 protein pairs in the test dataset and a further 200 from the training dataset were analyzed using the criteria described in the Methods section. Using the first criterion, 84 protein sites in 59 protein pairs corresponding to $H^i = 3$ (evolutionary hidden state 3) were identified. Of these 84 protein sites, 34 met the second criterion.

We give an example of a homologous protein pair illustrating the identified evolutionary motif. Two histidine-containing phosphocarriers, 1pch (Mycoplasma capricolum) and 1poh (Escherichia coli), were identified as having the evolutionary motif (fig. 10) at homologous site E39/G39.

FIG. 5. Scatterplot comparing evolutionary times estimated using pairs of homologous amino acid sequences only versus pairs of homologous sets of dihedral angles only, for N = 38 protein pairs in the test dataset. The x-coordinate of each point gives the estimated evolutionary time based only on the amino acid sequences, whereas the y-coordinate gives the estimated evolutionary time based only on the dihedral angles. The diagonal line represents y = x.

FIG. 6. Evolutionary time versus angular distance for real and corresponding sampled protein pairs in the training dataset at 50 representative evolutionary times. Mean angular distances (see ‘Calculation of angular distances’ in the Supplementary Material online) between the real dihedral angles in a protein pair (red; $X_a$ and $X_b$) and the sampled dihedral angles in a sampled protein pair (blue; $X_a$ and $X_b$) were compared to test how well the sampled dihedral angles reproduced the real angular distances. The dihedral angles ($X_b$) of each sampled protein pair were sampled by conditioning on both amino acid sequences and the homologous dihedral angles ($A_a$, $A_b$, $X_a$) and on the estimated evolutionary time ($t_{ab}$) for the real protein pair. The regression curves were obtained by a quadratic LOcally-weighted regrESSion (LOESS) with smoothing parameter chosen by leave-one-out cross-validation. The 95% confidence intervals for the mean assume error normality.


Most positions in the homologous pair have low posterior jump probabilities (≈ 0.0), with the exception of positions N38/N38 and E39/G39, which both have high posterior jump probabilities (≈ 1.0). The exchange between a glutamate (at position 39 in 1poh) and a glycine (at position 39 in 1pch) appears to be responsible for the shift in dihedral angle. This exchange corresponds to a significant jump in dihedral angle: $\langle \phi_{1poh,E39}, \psi_{1poh,E39} \rangle = \langle -1.63, 0.06 \rangle \rightarrow \langle \phi_{1pch,G39}, \psi_{1pch,G39} \rangle = \langle 1.40, 0.22 \rangle$. The angular distance between the two dihedral angle pairs is 2.01. This is consistent with the amino acid frequency parameters specified by the two site-classes for hidden state 3 (fig. 9).

Site-class $r_1$ indicates that a number of amino acids (alanine, aspartic acid, glycine, histidine, lysine, asparagine, proline, glutamine, arginine, serine and threonine) other than glutamate plausibly coincide with the particular dihedral angle conformation specified by site-class $r_1$. The involvement of glycine in a jump is not surprising, as it is a small and flexible amino acid, whereas the role of asparagine is less clear. In our analysis of 238 protein pairs, we found that, of the seven positions meeting the criteria for hidden state 3 and involving an exchange with asparagine (Asn), four were an exchange between an asparagine and a glycine, whereas the remaining three were between asparagine and one of lysine, histidine or serine.

Using Dihedral Angles for Alignment
A valuable feature of our model is its ability to account for alignment uncertainty by summing over possible pairwise alignments, using the TKF92 model as a prior distribution over indel histories, whilst simultaneously taking into account neighboring dependencies amongst aligned sites. Doing so results in a sample of alignments rather than a single alignment. Nevertheless, a single Maximum A Posteriori (MAP) pairwise alignment may be obtained from the alignment samples and used for downstream analysis.

ETDBN and several other alignment methods (namely StatAlign, BAli-Phy, MUSCLE and MAFFT) were used to infer pairwise alignments from simulated and real data under various combinations of data observations, for example, an amino acid sequence pair ($A_a$, $A_b$), a secondary structure sequence pair ($S_a$, $S_b$), a dihedral angle sequence pair ($X_a$, $X_b$), and combinations thereof.

In the first set of benchmarks (fig. 11A), pairs of proteins were simulated from the ETDBN model conditioned on 38 different pairwise alignments and corresponding evolutionary times. This resulted in a set of 38 simulated pairwise alignments together with corresponding observations, implying that the true underlying alignments were known for each of the simulated protein pairs. ETDBN and a number of other alignment methods were used to infer pairwise alignments for each. The alignment similarity metric (Schwartz et al. 2005) was used to measure the similarity between the inferred alignments and the true alignments, where higher similarity indicates better predictions. It was found that, when using the simulated amino acid sequences alone, ETDBN (11A5) outperformed all four other methods tested (11A1: MUSCLE, 11A2: MAFFT, 11A3: StatAlign and 11A4: BAli-Phy). However, the greater performance of ETDBN compared with other methods cannot be considered a fair comparison, as the data were simulated under the ETDBN model.

FIG. 7. Cartoon representations of the E. coli glyceraldehyde-3-phosphate dehydrogenase structure (PDB 1gad) are depicted in each panel, overlaid with predictive accuracy when using different combinations of observed data to predict missing dihedral angles in 1gad. Thermus aquaticus glyceraldehyde-3-phosphate dehydrogenase (PDB 1cer) was used as a homolog for the purposes of prediction. Predictive accuracy is indicated using a color gradient depicting the mean angular distance between the true dihedral angles ($X^i_{1gad}$) and the predicted (sampled) dihedral angles ($\hat{X}^i_{1gad}$) at each amino acid position. The label at the bottom of each panel indicates the data combination used. In (A) no data were used for prediction. In (B) only the amino acid sequence corresponding to 1gad ($A_{1gad}$) was used. In (C) the amino acid sequence of 1gad ($A_{1gad}$) and the amino acid sequence of the homologous protein ($A_{1cer}$) were used. In (D) both amino acid sequences ($A_{1cer}$ and $A_{1gad}$) and the secondary structure of the homologous protein ($S_{1cer}$) were used. In (E) both amino acid sequences ($A_{1cer}$ and $A_{1gad}$) and the dihedral angles of the homologous protein ($X_{1cer}$) were used. Finally, in (F) the same combination of observations was used as in (E), but the alignment was treated as known a priori.

FIG. 8. Benchmarks of predictive accuracy (measured using angular distance; lower is better) on a random subset of ten protein pairs in the test dataset, giving a representative view of predictive accuracy under six different combinations of observations. The dihedral angles $X_b$ of $p_b$ were treated as missing and were sampled under the model, whereas $p_a$ was a homologous protein used for the purposes of prediction. See the legend of figure 7 for a description of each combination (A–F). The final set of bars, denoted ‘Mean (N = 38)’, gives the mean values for the entire test dataset of N = 38 protein pairs. The error bars are the standard errors.

FIG. 9. Depiction of evolutionary hidden state 3. This hidden state was sampled at 0.69% of sites (the average was 1.56%). The equilibrium frequencies of $r_1$ and $r_2$ were $\pi_1 = 0.691$ and $\pi_2 = 0.309$, respectively. The jump rate was $\gamma = 3.176$. The corresponding site-class pair probabilities are depicted to the right as a function of evolutionary time. Note that the dashed red lines depicting the probabilities for (1, 2) and (2, 1) superimpose exactly because the probabilities are equal; this holds for the jump probabilities of all hidden states, as it is required for time-reversibility. In the main figure, the two rows depict the parameters encoded by the two site-classes, respectively. Columns 1 and 3 depict the parameters governing the amino acid and secondary structure stochastic processes, respectively. The secondary structure classes correspond to H = helix, S = sheet and C = coil. Column 2 depicts the WN diffusions. The stationary distributions of the WN diffusions are shown using black contour lines, the directions of the drifts are indicated by the arrows, and the magnitude of the drift at each position is indicated using the color gradient.

More revealing in figure 11A was the alignment similarity under ETDBN when using different combinations of simulated data observations. It was found that secondary structure alone (11A6) performed the worst, which is unsurprising given that only three states were available to align the proteins. The second worst in terms of alignment similarity was amino acid sequences alone (11A5), followed by amino acid sequences and secondary structures (11A7). Interestingly, using dihedral angles only (11A8) outperformed both 11A5 (sequences only) and 11A6 (secondary structures only). Finally, using amino acid sequences together with dihedral angles (11A9), or all three data types combined (11A10), outperformed all other combinations. This illustrates that, at least under simulation conditions, increasing the number of data observations results in better alignment accuracy.

Following that, the various alignment methods were benchmarked against 38 pairwise alignments consisting of real sequence and structure observations in the test dataset. These pairwise alignments were obtained from the HOMSTRAD alignments. The sequence identity of these pairwise alignments ranged from 10% to 93%, with an average sequence identity of 39%. In addition to methods 1–10 in figure 11A, five structural alignment methods were also used: 11, StatAlign (Herman et al. 2014); 12, TMAlign (Zhang and Skolnick 2005); 13, Mammoth (Ortiz et al. 2002); 14, Dali (Holm and Rosenström 2010); and 15, CE (Shindyalov and Bourne 1998). These methods were not used in 11A due to the lack of an appropriate model for simulating the evolution of three-dimensional protein structures.

When benchmarking the MAP estimated alignments against the HOMSTRAD alignments (fig. 11B) using real sequences alone for inference ($A_a$, $A_b$), ETDBN (11B5) had a similar degree of accuracy when compared with several other sequence-based methods (11B1: StatAlign, 11B2: BAli-Phy, 11B3: MUSCLE and 11B4: MAFFT). This demonstrates that ETDBN has performance comparable to that of other commonly used sequence alignment methods.

Using ($S_a$, $S_b$) alone, ETDBN (11B6) had substantially lower alignment similarity compared with sequence only, which was expected given that a similar result was obtained for the simulated data (11A6). However, when including the real sequences (11B7), the predictive accuracy was once again comparable to sequence-only inferences (11B1–11B5).

FIG. 10. Depiction of two histidine-containing phosphocarriers, PDB 1pch and 1poh, superimposed. On the left is a cartoon representation of the two proteins, corresponding to regions F29-K49 and F29-A42, respectively, with posterior jump probabilities at each position overlaid. On the right is a ball-and-stick representation giving atomic detail for a smaller region (I36-G42 and T36-A42, respectively). The exchange between a glutamate (E39 in 1poh) and a glycine (G39 in 1pch) is associated with a large change in dihedral angle, as indicated by the curved arrows.


When using ($X_a$, $X_b$) alone (11B8), the alignment similarity was found to be somewhat worse than in the sequence-only cases. Furthermore, when introducing the sequences (11B9) and secondary structures (11B10) in addition to the dihedral angles, the similarity remained worse than for the sequence-only methods (11B1–11B5), despite the additional information. These results are in contrast to the results we obtained for simulated data (11A8–11A10).

The non-statistical structural alignment methods (11B12–11B15) fared the best, likely because they use criteria similar to those used to construct the HOMSTRAD alignments. When interpreting these results, it is important to note that the HOMSTRAD alignments should not be considered the true underlying alignments and may even be strongly biased. For example, they may favor the closest structural superimposition of structures or the most parsimonious alignments with the fewest indels. In the evolutionary modeling context, our goal is to distinguish between homologous sites (sites that have evolved via mutation alone) and indels. In practice, it is extremely difficult to obtain the true underlying alignment (sets of homologies and indels), because it would require an experiment where every indel event since the common ancestor is observed, a seemingly impossible task outside of simulation or laboratory conditions.

After further investigation, the trend of lower alignment similarity seen in figure 11B6–11B10 when using ETDBN with structural observations, compared with sequence-only or non-statistical structural alignment methods, was found to reverse (11C6–11C10) upon calculating the precision of predicting homologous sites (the fraction of sites predicted as homologous that were correctly predicted as such). Therefore, when only dihedral angle observations are used, ETDBN underpredicts the number of homologous sites; however, when a homologous site is predicted, it is correctly predicted more often than when using only amino acid sequences. In particular, ETDBN predicted fewer homologous sites with coiled secondary structure compared with homologous sites with helical or sheet secondary structure. This pattern of results may be in part due to the WN diffusion used to model evolution of dihedral angles. The WN diffusion is suitable for modeling angular drift (small changes in angles localized around a region of the Ramachandran plot), but does not sufficiently capture angular shift (large changes in angles between regions of the Ramachandran plot, which are more likely in coiled regions) due to stationarity. As noted before, the jump model is an abstraction intended to capture the end-points of evolution by allowing a jump between two regions of the Ramachandran plot, abstracting a potential intermediate evolutionary trajectory for the sake of computational tractability. Note that the jump model accurately captures the common cases where a single mutation induces a large conformational shift.
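The distinction drawn above between predicting fewer homologous sites and predicting them correctly can be made concrete with a small sketch. The alignment representation (a list of index pairs with `None` marking a gap) and the function names are assumptions made for this illustration; they are not taken from the ETDBN software.

```python
def homologous_pairs(alignment):
    """Return the set of (i, j) residue index pairs aligned as homologous.

    `alignment` is a list of columns (i, j); None marks a gap (an indel).
    """
    return {(i, j) for i, j in alignment if i is not None and j is not None}


def precision(predicted, reference):
    """Fraction of predicted homologous pairs that also occur in the reference."""
    pred = homologous_pairs(predicted)
    ref = homologous_pairs(reference)
    return len(pred & ref) / len(pred) if pred else 0.0


def recall(predicted, reference):
    """Fraction of reference homologous pairs recovered by the prediction."""
    pred = homologous_pairs(predicted)
    ref = homologous_pairs(reference)
    return len(pred & ref) / len(ref) if ref else 0.0


# Hypothetical toy alignments: the prediction replaces one homologous
# column of the reference by two indel columns.
reference = [(0, 0), (1, 1), (2, 2), (3, 3)]
predicted = [(0, 0), (1, 1), (2, None), (None, 2), (3, 3)]

prec = precision(predicted, reference)
rec = recall(predicted, reference)
```

In this toy case every predicted homologous pair is correct, yet one reference pair is missed, which mirrors the pattern of underprediction with high precision described above.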

Concluding Remarks

The main achievement of this work is a computationally tractable, generative, and interpretable probabilistic model of protein sequence and structure evolution on a local scale.

FIG. 11. Alignment benchmarks: (A) averages of alignment similarity across different methods and combinations of data observations, where the observations were simulated from the ETDBN model conditioned on a single known alignment; (B) averages of alignment similarity, where the 38 HOMSTRAD alignments in the test dataset were taken as the true alignments; (C) averages of precision in predicting homologous site pairs in 38 sequence pairs from the test dataset, where homologous site pairs in the HOMSTRAD alignments were taken to be the true homologous site pairs.

Golden et al doi101093molbevmsx137 MBE


Previous stochastic models of protein sequence and structure evolution emphasized estimation of evolutionary parameters (Challis and Schmidler 2012; Herman et al. 2014). ETDBN is somewhat of a departure from these previous models, but is likewise capable of estimating evolutionary parameters. We show that estimates of evolutionary times inferred under ETDBN are consistent regardless of whether amino acid sequence or dihedral angle observations are used. In addition, the relationship between evolutionary time and angular distance in real proteins is adequately recapitulated in protein pairs sampled under the model, although the angular distance is underestimated for larger evolutionary times, which might be explained by the limited flexibility of the jump model and by the fact that global dependencies are not taken into account.

Like previous models, ETDBN is capable of dealing with alignment uncertainty by marginalizing over indel histories; it predicts pairwise MAP consensus alignments with accuracy similar to that of score-based and statistical alignment methods.

The generative nature of ETDBN allows us to demonstrate that the underlying empirical distributions over dihedral angles (depicted using Ramachandran plots) are captured, and that the model is capable of predicting missing observations, such as dihedral angles, from a variety of different data types: for example, an amino acid sequence, a homologous amino acid sequence, a homologous secondary structure, a homologous set of dihedral angles, or any combination thereof.

Based on its local nature, ETDBN does not constitute a homology modeling method in itself. Rather, it can be used as a building block, much like fragment libraries model local structure in protein structure prediction methods. ETDBN places the homology modeling problem on a statistical footing, enabling a number of approaches to later be used, such as multi-level modeling, that is, combining fine-grained distributions (for example, distributions over dihedral angles, such as ETDBN) and coarse-grained distributions (for example, distributions describing the global properties of proteins, such as compactness). In particular, a method referred to as 'the Reference Ratio method' can be used to combine fine-grained and coarse-grained distributions in a statistically principled manner (see Frellsen et al. 2012; Hamelryck et al. 2010). However, we have shown that the current model can already be used for the inference of evolutionary parameters in its present form.

In addition to multi-level modeling, probabilistic models such as ETDBN allow one to account for and to make statements about uncertainty (e.g., with respect to evolutionary time, alignment, etc.) in a rigorous manner. In principle, ETDBN, like TorusDBN (Boomsma et al. 2008), could be used as a proposal distribution. In other words, ETDBN could be used to sample protein structures (possibly conditioned on various data observations) in a computationally efficient manner, such that the resulting samples are expected to be located in regions of high probability density with respect to the true underlying distribution.

A final key feature of our evolutionary model is its interpretable nature. This interpretability enables the identification of potential evolutionary motifs: common patterns of sequence–structure evolution. We identify one such evolutionary motif in 34 different homologous protein pairs. A major direction for future research is the further identification of such evolutionary motifs. Understanding these evolutionary motifs may 1) improve homology modeling predictions, 2) provide more accurate estimates of evolutionary parameters, and 3) produce better models of protein evolution that more realistically capture evolutionary trajectories through sequence and structure space, which may help identify functionally relevant positions that are potential drug targets.

Future Challenges

Pairwise to Phylogeny

For reasons of computational tractability, the implemented model is pairwise, but it is theoretically possible to generalize it to a phylogeny, such as in Herman et al. (2014). In practice, for three or more sequences on a phylogeny, it is necessary to marginalize out the unobserved ancestral protein states in order to compute likelihoods. Felsenstein's algorithm can be used to marginalize over discrete ancestral states, such as amino acids, in a computationally efficient manner. However, we do not know whether a similarly efficient algorithm exists for marginalizing the continuous ancestral dihedral angle states under the WN diffusion, thereby necessitating a more expensive MCMC algorithm. A possibly greater computational hindrance to considering a phylogeny is the alignment problem, which scales as $O(l_1 l_2 \cdots l_N)$, where $l_i$ is the length of sequence $i$ and $N$ is the number of sequences, although MCMC approaches are possible (Herman et al. 2014).
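To give a feeling for the alignment scaling mentioned above, the following minimal sketch counts the cells of a full dynamic-programming table over $N$ sequences. The function name and the example length of 200 residues are illustrative assumptions, not figures from the paper.

```python
from math import prod


def dp_cells(lengths):
    """Number of cells l_1 * l_2 * ... * l_N in a full alignment table."""
    return prod(lengths)


pairwise_cells = dp_cells([200, 200])   # two sequences of 200 residues
four_way_cells = dp_cells([200] * 4)    # four such sequences
```

Under these assumed lengths, moving from two to four sequences multiplies the table size by 40,000, which is why exhaustive marginalization over alignments quickly becomes impractical and sampling approaches are considered instead.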

Context-Dependence

Although we believe our model provides a substantial improvement over current stochastic models of sequence and structural evolution, there is still scope for improvement. The WN diffusions used to model dihedral angle evolution adequately capture angular drift (small local changes in dihedral angle), but are less capable of capturing angular shift (large changes in dihedral angle). This is to a considerable extent mitigated by the introduction of jump events, as discussed before. A more realistic model would model the entire evolutionary trajectory, allowing an arbitrary number of switches between site-classes, together with neighboring dependencies amongst adjacent sites along the evolutionary trajectory. Similar context-dependent models are typically computationally expensive and require sophisticated inference procedures (Robinson et al. 2003; Yu and Thorne 2006).

Software Availability

Julia code (tested on both Windows and Linux platforms) is available at httpwwwcomputingforbiologyorgsoftwareetdbn


Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

The authors acknowledge funding from the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN). The second author acknowledges support from project MTM2016-76969-P from the Spanish Ministry of Economy and Competitiveness and ERDF. The authors acknowledge valuable comments from three referees that led to substantial improvements in the manuscript, as well as initial discussions with Christian Havn, Ian Lim, and Mathias Cronjäger.

References

Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2):195–201.

Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T. 2008. A generative, probabilistic model of local protein structure. Proc Natl Acad Sci U S A. 105(26):8932–8937.

Boomsma W, Tian P, Frellsen J, Ferkinghoff-Borg J, Hamelryck T, Lindorff-Larsen K, Vendruscolo M. 2014. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proc Natl Acad Sci U S A. 111(38):13852–13857.

Challis CJ, Schmidler SC. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol. 29(11):3575–3587.

Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett. 457(4):413–416.

Echave J, Fernández FM. 2010. A perturbative view of protein structural variation. Proteins Struct Funct Bioinform. 78(1):173–180.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.

Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat. 125(1):1–15.

Frellsen J, Mardia KV, Borg M, Ferkinghoff-Borg J, Hamelryck T. 2012. Towards a general probabilistic model of protein structure: the reference ratio method. In: Bayesian methods in structural bioinformatics. New York: Springer. p. 125–134.

García-Portugués E, Sørensen M, Mardia KV, Hamelryck T. 2017. Langevin diffusions on the torus: estimation and applications. arXiv:1705.00296.

Gilks WR, Richardson S, Spiegelhalter D. 1995. Markov chain Monte Carlo in practice. London: CRC Press.

Grishin NV. 1997. Estimation of evolutionary distances from protein spatial structures. J Mol Evol. 45(4):359–369.

Grishin NV. 2001. Fold change in evolution of protein structures. J Struct Biol. 134(2):167–185.

Gutin AM, Badretdinov AY. 1994. Evolution of protein 3D structures as diffusion in multidimensional conformational space. J Mol Evol. 39(2):206–209.

Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 15(7):910–917.

Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. 2010. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE 5(11):e13714.

Herman JL, Challis CJ, Novák Á, Hein J, Schmidler SC. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol. 31(9):2251–2266.

Holm L, Rosenström P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38(Web Server issue):W545–W549.

Katoh K, et al. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14):3059–3066.

Koshi JM, Goldstein RA. 1998. Models of natural mutations including site heterogeneity. Proteins 32(3):289–295.

Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21(6):1095–1109.

Lio P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14(8):726–733.

Miklós I, Lunter G, Holmes I. 2004. A "long indel" model for evolutionary sequence alignment. Mol Biol Evol. 21(3):529–540.

Mizuguchi K, Deane CM, Blundell TL, Overington JP. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7(11):2469–2471.

Ortiz AR, Strauss CE, Olmea O. 2002. MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 11(11):2606–2621.

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 20(10):1692–1704.

Rohl CA, Strauss CE, Misura KM, Baker D. 2004. Protein structure prediction using Rosetta. Methods Enzymol. 383:66–93.

Schwartz AS, Myers EW, Pachter L. 2005. Alignment metric accuracy. arXiv preprint q-bio/0510052.

Shindyalov IN, Bourne PE. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9):739–747.

Siepel A, Haussler D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol. 11(2–3):413–428.

Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 34(1):3–16.

Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18(5):691–699.

Yu J, Thorne JL. 2006. Dependence among sites in RNA evolution. Mol Biol Evol. 23(8):1525–1537.

Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33(7):2302–2309.


$$
w_k(\theta) = \frac{\phi_{\frac{1}{2}A^{-1}\Sigma}(\theta - \mu + 2k\pi)}{\sum_{m \in \mathbb{Z}^2} \phi_{\frac{1}{2}A^{-1}\Sigma}(\theta - \mu + 2m\pi)}, \quad \theta \in \mathbb{T}^2, \tag{3}
$$

is a probability density function (pdf) for $k \in \mathbb{Z}^2$. $\phi_\Sigma$ stands for the pdf of a bivariate Gaussian $N(0, \Sigma)$. The pdf (eq. 3) weights the linear drifts of equation (2) such that they become smooth and periodic.

It is shown in García-Portugués et al. (2017) that the stationary distribution of the WN diffusion is a wrapped normal, $WN(\mu, \tfrac{1}{2}A^{-1}\Sigma)$, where a $WN(\mu, \Sigma)$ distribution has pdf

$$
p_{WN}(\theta \mid \mu, \Sigma) = \sum_{k \in \mathbb{Z}^2} \phi_{\Sigma}(\theta - \mu + 2k\pi). \tag{4}
$$

Despite involving an infinite sum over $\mathbb{Z}^2$, taking just the first few terms of this sum provides a tractable and accurate approximation to the stationary density for most realistic parameter values.
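The truncation of the infinite sum in equation (4) can be sketched in a few lines. The function names, the truncation level `K`, and the covariance used in the usage lines are assumptions chosen for illustration; this is not the authors' Julia implementation.

```python
import math


def gauss2(x, cov):
    """pdf of a bivariate Gaussian N(0, cov) at x, with cov = ((a, b), (b, c))."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    # Quadratic form x^T cov^{-1} x written out for the 2 x 2 case.
    q = (c * x[0] ** 2 - 2.0 * b * x[0] * x[1] + a * x[1] ** 2) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


def wn_pdf(theta, mu, cov, K=2):
    """Wrapped normal pdf of eq. (4), truncated to k in {-K, ..., K}^2."""
    total = 0.0
    for k1 in range(-K, K + 1):
        for k2 in range(-K, K + 1):
            x = (theta[0] - mu[0] + 2.0 * math.pi * k1,
                 theta[1] - mu[1] + 2.0 * math.pi * k2)
            total += gauss2(x, cov)
    return total


# Hypothetical parameter values, for illustration only.
mu = (0.5, -1.0)
cov = ((0.5, 0.1), (0.1, 0.3))
density = wn_pdf((1.0, -0.5), mu, cov)
shifted = wn_pdf((1.0 + 2.0 * math.pi, -0.5), mu, cov)  # same point on the torus
```

For covariances of this size the terms beyond `K = 2` are negligible, so the truncated density is periodic to within rounding and integrates to approximately one over $[-\pi, \pi)^2$.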

Maximum Likelihood Estimation (MLE) for diffusions is based on the transition probability density (tpd), which only has a tractable analytical form for very few specific processes. A highly tractable and accurate approximation to the tpd is given for the WN diffusion. This approximation results from weighting the tpd of the OU process in the same fashion as the linear drifts are weighted in equation (2), yielding the following multimodal pseudo-tpd:

$$
\tilde{p}(\theta_2 \mid \theta_1, A, \mu, \Sigma, t) = \sum_{m \in \mathbb{Z}^2} p_{WN}(\theta_2 \mid \mu_t^m, \Gamma_t)\, w_m(\theta_1), \tag{5}
$$

with $\theta_1, \theta_2 \in \mathbb{T}^2$, $\mu_t^m = \mu + e^{-tA}(\theta_1 - \mu + 2\pi m)$ and $\Gamma_t = \int_0^t e^{-sA} \Sigma e^{-sA^T} ds$. The pseudo-tpd provides a good approximation to the true tpd in key circumstances: 1) $t \to 0$, since it collapses to the Dirac delta; 2) $t \to \infty$, since it converges to the stationary distribution; 3) high concentration, since the WN diffusion becomes an OU process. Furthermore, it is shown in García-Portugués et al. (2017) that the pseudo-tpd has a lower Kullback–Leibler divergence with respect to the true tpd than the Euler and Shoji-Ozaki pseudo-tpds for most typical scenarios and discretization times in the diffusion trajectory.
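The pseudo-tpd can be sketched compactly under a simplifying assumption: when the two dihedral angles evolve independently (a diagonal drift matrix, i.e. no cross-dependence term), $e^{-tA}$ and $\Gamma_t$ are diagonal and the two-dimensional pseudo-tpd factorizes into a product of one-angle terms. The function names, the truncation level `K`, and the parameter values below are illustrative assumptions, not the authors' implementation.

```python
import math


def phi(x, var):
    """pdf of a univariate Gaussian N(0, var)."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def wn1(theta, mu, var, K=3):
    """Truncated one-dimensional wrapped normal pdf."""
    return sum(phi(theta - mu + 2.0 * math.pi * k, var) for k in range(-K, K + 1))


def pseudo_tpd1(theta2, theta1, alpha, mu, sigma2, t, K=3):
    """One-angle pseudo-tpd with scalar drift alpha and diffusion variance sigma2."""
    stat_var = sigma2 / (2.0 * alpha)                        # (1/2) A^{-1} Sigma
    gamma_t = stat_var * (1.0 - math.exp(-2.0 * alpha * t))  # Gamma_t in closed form
    norm = wn1(theta1, mu, stat_var, K)
    total = 0.0
    for m in range(-K, K + 1):
        w_m = phi(theta1 - mu + 2.0 * math.pi * m, stat_var) / norm   # weight of winding m
        mu_tm = mu + math.exp(-alpha * t) * (theta1 - mu + 2.0 * math.pi * m)
        total += wn1(theta2, mu_tm, gamma_t, K) * w_m
    return total


def pseudo_tpd(theta2, theta1, alphas, mu, sigma2s, t):
    """Two-angle pseudo-tpd; valid only for independent angles (diagonal A)."""
    return math.prod(
        pseudo_tpd1(theta2[j], theta1[j], alphas[j], mu[j], sigma2s[j], t)
        for j in range(2)
    )


# Hypothetical values for a single angle.
a, b = 0.3, -2.0
alpha, mu, sigma2, t = 1.0, -1.0, 2.25, 0.5
stat_var = sigma2 / (2.0 * alpha)
forward = pseudo_tpd1(b, a, alpha, mu, sigma2, t) * wn1(a, mu, stat_var)
backward = pseudo_tpd1(a, b, alpha, mu, sigma2, t) * wn1(b, mu, stat_var)
```

In this sketch `forward` and `backward` agree, and for large `t` the pseudo-tpd approaches the stationary wrapped normal, matching the second limiting case listed above. The general case with dependent angles needs a matrix exponential and the matrix integral for $\Gamma_t$, which are omitted here.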

A further desirable property of the pseudo-tpd is that it obeys the time-reversibility equation, which in terms of $(X_a^{x(i)}, X_b^{y(i)})$ is

$$
\tilde{p}(X_b^{y(i)} \mid X_a^{x(i)}, A, \mu, \Sigma, t_{ab})\, p_{WN}(X_a^{x(i)} \mid \mu, \tfrac{1}{2}A^{-1}\Sigma) = \tilde{p}(X_a^{x(i)} \mid X_b^{y(i)}, A, \mu, \Sigma, t_{ab})\, p_{WN}(X_b^{y(i)} \mid \mu, \tfrac{1}{2}A^{-1}\Sigma).
$$

Indeed, the WN diffusion is the unique time-reversible diffusion with constant diffusion coefficient and stationary pdf (eq. 4), in the same way the OU is with respect to a Gaussian. Time-reversibility is an assumption of the overall model and of many other models of sequence evolution. A benefit of time-reversibility in a pairwise model such as ETDBN is that one of the proteins in a pair may be arbitrarily chosen as the ancestor, thus avoiding a computationally expensive marginalization of an unobserved ancestor.

The likelihood of a dihedral angle observation pair $(X_a^{x(i)}, X_b^{y(i)})$, assuming that $X_a^{x(i)}$ is drawn from the stationary distribution, is given by

$$
\begin{aligned}
p(X_a^{x(i)}, X_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab}) &= p(X_a^{x(i)}, X_b^{y(i)} \mid A, \mu, \Sigma, t_{ab}) \\
&\approx \tilde{p}(X_b^{y(i)} \mid X_a^{x(i)}, A, \mu, \Sigma, t_{ab})\, p_{WN}(X_a^{x(i)} \mid \mu, \tfrac{1}{2}A^{-1}\Sigma).
\end{aligned} \tag{6}
$$

$A$ and $\Sigma$ are constrained to yield a covariance matrix $A^{-1}\Sigma$. A parameterization that achieves this is $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2)$ and

$$
A = \begin{pmatrix} \alpha_1 & \frac{\sigma_1}{\sigma_2}\alpha_3 \\ \frac{\sigma_2}{\sigma_1}\alpha_3 & \alpha_2 \end{pmatrix}, \quad \alpha_1 \alpha_2 > \alpha_3^2.
$$

$\alpha_1$ and $\alpha_2$ are the drift components for the $\phi$ and $\psi$ dihedral angles, respectively. Dependence (correlation) between the dihedral angles is captured by $\alpha_3$. A depiction of a WN diffusion with given drift and diffusion parameters is shown in figure 2.
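As a quick check of this parameterization, the sketch below builds $A$ from $(\alpha_1, \alpha_2, \alpha_3, \sigma_1, \sigma_2)$ and computes $\tfrac{1}{2}A^{-1}\Sigma$ for the 2 x 2 case. The function names are hypothetical, and the extra requirement $\alpha_1 > 0$ is an assumption added here so that the resulting matrix is positive definite. The example values are those quoted in the caption of figure 2.

```python
def drift_matrix(alpha1, alpha2, alpha3, sigma1, sigma2):
    """Drift matrix A for Sigma = diag(sigma1^2, sigma2^2)."""
    # alpha1 > 0 is an assumption of this sketch (ensures positive definiteness).
    if alpha1 <= 0 or alpha1 * alpha2 <= alpha3 ** 2:
        raise ValueError("need alpha1 > 0 and alpha1 * alpha2 > alpha3^2")
    return ((alpha1, sigma1 / sigma2 * alpha3),
            (sigma2 / sigma1 * alpha3, alpha2))


def stationary_cov(alpha1, alpha2, alpha3, sigma1, sigma2):
    """Stationary covariance (1/2) A^{-1} Sigma."""
    (a, b), (c, d) = drift_matrix(alpha1, alpha2, alpha3, sigma1, sigma2)
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))   # inverse of a 2 x 2 matrix
    s = (sigma1 ** 2, sigma2 ** 2)                     # diagonal of Sigma
    return tuple(tuple(0.5 * inv[i][j] * s[j] for j in range(2)) for i in range(2))


# Values from the figure 2 caption: A = (1, 0.5; 0.5, 0.5), Sigma = (1.5)^2 I.
cov = stationary_cov(1.0, 0.5, 0.5, 1.5, 1.5)
```

The off-diagonal entries of `cov` coincide, which is the symmetry the constraint on $A$ is designed to guarantee; a nonzero $\alpha_3$ produces a stationary density whose contours are tilted with respect to the axes.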

Site-Classes: Constant Evolution and Jump Events

We now turn to the meaning of the site-class pairs. Two modes of evolution are modeled: constant evolution and jump events. Constant evolution occurs when the site-class starting in protein $p_a$ at aligned site $i$, $r_a^i$, is the same as the site-class ending in protein $p_b$ at aligned site $i$, $r_b^i$; that is, $r_a^i = r_b^i$. Thus, the distribution over observation pairs at a site is specified by a single site-class.

As already stated, a site-class specifies the parameters of the three conditionally independent stochastic processes describing amino acid, dihedral angle, and secondary structure evolution. A limitation of "constant evolution" is that the coupling between the three stochastic processes is somewhat weak. This in part stems from the time-reversibility of the stochastic processes: swapping the order of one of the three observation pairs at a homologous site, e.g., (Glycine, Proline) instead of (Proline, Glycine), does not alter the likelihood in equation (1). Restated from a generative perspective, the direction of an amino acid interchange does not inform the direction of change in dihedral angle or secondary structure; there is no 'directional coupling'. For example, replacing a glycine in an α-helix in one protein with a proline at the homologous position in a second protein would be expected to break the α-helix in the second protein and to strongly inform the plausible dihedral angle conformations in the second protein.

Ideally, we would consider a model in which the underlying site-classes were not fixed over the evolutionary trajectory separating the two proteins, as in the case of constant evolution described above, but instead were able to 'evolve' in time. This would allow occasional switches in the underlying site-class at a particular homologous site, which would create a stronger dependency between amino acid, dihedral angle, and secondary structure evolution that furthermore captures the directional coupling we desire. Such an approach is considered computationally intractable due to the introduction of context-dependence when having to consider neighboring dependencies amongst evolutionary trajectories at adjacent sites.

In order to approximate this 'ideal' model in a computationally efficient manner, we introduce the notion of a jump event. A jump event occurs when $r_a^i \neq r_b^i$. Whereas constant evolution is intended to capture angular drift (changes in dihedral angles localized to a region of the Ramachandran plot), a jump event is intended to create a directional coupling between amino acid and structure evolution, and is also expected to capture angular shift (large changes in dihedral angles, possibly between distant regions of the Ramachandran plot).

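The hidden state at node $H^i$, together with the evolutionary time $t_{ab}$ separating proteins $p_a$ and $p_b$, specifies a joint distribution over a site-class pair:

$$
p(r_a^i, r_b^i \mid H^i, t_{ab}) = p(r_a^i \mid H^i, r_b^i, t_{ab})\, p(r_b^i \mid H^i), \tag{7}
$$

where

$$
p(r_a^i \mid H^i, r_b^i, t_{ab}) =
\begin{cases}
e^{-\gamma_{H^i} t_{ab}} + \pi_{H^i, r_a^i}\,(1 - e^{-\gamma_{H^i} t_{ab}}) & \text{if } r_a^i = r_b^i, \\
\pi_{H^i, r_a^i}\,(1 - e^{-\gamma_{H^i} t_{ab}}) & \text{if } r_a^i \neq r_b^i,
\end{cases}
$$

and $p(r_a^i \mid H^i) = \pi_{H^i, r_a^i}$ and $p(r_b^i \mid H^i) = \pi_{H^i, r_b^i}$. $\pi_{H^i, r_a^i}$ and $\pi_{H^i, r_b^i}$ are model parameters specifying the probability of starting in site-class $r_a^i$ or $r_b^i$, respectively, corresponding to the hidden state specified by node $H^i$. $\gamma_{H^i} > 0$ is a model parameter giving the jump rate corresponding to the hidden state given by node $H^i$. Because a jump lands in site-class $r_a^i$ with probability $\pi_{H^i, r_a^i}$, the conditional probabilities sum to one over $r_a^i$.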

The site-class jump probabilities have been chosen so that time-reversibility holds; in other words,

$$
p(r_a^i \mid H^i, r_b^i, t_{ab})\, p(r_b^i \mid H^i) = p(r_b^i \mid H^i, r_a^i, t_{ab})\, p(r_a^i \mid H^i).
$$
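The site-class probabilities for a single hidden state can be sketched as follows. The sketch assumes that a jump lands in site-class $r_a$ with probability $\pi_{H, r_a}$, which is what normalization and time-reversibility require; the function names and parameter values are hypothetical.

```python
import math


def p_ra_given_rb(ra, rb, pi, gamma, t):
    """p(r_a | H, r_b, t) for one hidden state.

    `pi` maps each site-class to its stationary probability and
    `gamma` > 0 is the jump rate of the hidden state.
    """
    stay = math.exp(-gamma * t)   # probability that no jump occurs in time t
    return (stay if ra == rb else 0.0) + pi[ra] * (1.0 - stay)


def p_pair(ra, rb, pi, gamma, t):
    """Joint p(r_a, r_b | H, t) as in eq. (7)."""
    return p_ra_given_rb(ra, rb, pi, gamma, t) * pi[rb]


# Hypothetical parameters for one hidden state with two site-classes.
pi = {1: 0.7, 2: 0.3}
gamma, t = 2.0, 0.4
p_jump = p_pair(1, 2, pi, gamma, t) + p_pair(2, 1, pi, gamma, t)
```

Here `p_jump` is the prior probability of a jump event at a site. It grows with the evolutionary time `t` and vanishes at `t = 0`, where only constant evolution is possible.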

The hidden state at node $H^i$, together with a site-class pair $(r_a^i, r_b^i)$ and the evolutionary time $t_{ab}$, specifies the joint likelihood over site observation pairs:

$$
p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab}) =
\begin{cases}
p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_c^i, t_{ab}) & \text{if } r_a^i = r_b^i = r_c^i, \\
p(O_a^{x(i)} \mid H^i, r_a^i)\, p(O_b^{y(i)} \mid H^i, r_b^i) & \text{if } r_a^i \neq r_b^i.
\end{cases} \tag{8}
$$

In the case of constant evolution, evolution at aligned site $i$ is described in terms of the same site-class $r_c^i$. Evolution is considered constant because each observation type is drawn from a single stochastic process specified by $H^i$ and $r_c^i$. Note that the strength of the evolutionary dependency within an observation pair is a function of the evolutionary time $t_{ab}$.

In the case of a jump event, the evolutionary processes are, after the evolutionary jump, restarted independently in the stationary distribution of the new site-class. Thus, the site observations $O_a^{x(i)}$ and $O_b^{y(i)}$ are assumed to be drawn from the stationary distributions of two separate stochastic processes corresponding to site-classes $r_a^i$ and $r_b^i$, respectively.

This implies that, conditional on a jump, the likelihood of the observations is no longer dependent on $t_{ab}$. A jump event is an abstraction that captures the end-points of the evolutionary process, but ignores the potential evolutionary trajectory linking the two site observations. The advantage of abstracting the evolutionary trajectory is that there is no need to perform a computationally expensive marginalization over all possible trajectories, as might be necessary in a model where the hidden states evolve along a tree. The likelihood of an observation pair is now simply a sum over the four possible site-class pairs:

$$
p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, t_{ab}) = \sum_{(r_a^i, r_b^i) \in \mathcal{R}} p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})\, p(r_a^i, r_b^i \mid H^i, t_{ab}),
$$

where $\mathcal{R} = \{(1, 1), (1, 2), (2, 1), (2, 2)\}$ is the set of four site-class pairs, $p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$ is given by equation (8), and $p(r_a^i, r_b^i \mid H^i, t_{ab})$ is given by equation (7).

Identification of Evolutionary Motifs Encoding Jump Events

In order to identify aligned sites having potential evolutionary motifs encoding jump events, a specific criterion was developed.

For a particular protein pair, inference was performed under the model conditioned on the amino acid sequence and dihedral angles for both proteins, $(A_a, A_b, X_a, X_b)$. Homologous sites corresponding to a single hidden state and with evidence of a jump event ($r_a^i \neq r_b^i$) at posterior probability > 0.90 were identified, that is, the $i$'s such that $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_a, X_b) > 0.90$.

FIG. 2. Drift vector field for the WN diffusion with $A = (1, 0.5; 0.5, 0.5)$, $\mu = (0, 0)$ and $\Sigma = (1.5)^2 I$. The color gradient represents the Euclidean norm of the drift. The contour lines represent the stationary distribution. An example trajectory starting at $x_0 = (0, 0)$ and ending at $x_2$, running in the time interval [0, 2], is depicted using a white to red color gradient indicating the progression of time. The periodic nature of the diffusion can be seen by the wrapping of both the stationary diffusion and the trajectory at the boundaries of the square plane. The fact that the stationary distribution is not aligned with the horizontal and vertical axes illustrates the dependence (given by $\alpha_3$) between the $\phi$ and $\psi$ dihedral angles.

In a second filtering step, amino acid sequences and a single set of dihedral angles corresponding to one of the proteins were used ($A_a, A_b, X_a$ or $A_a, A_b, X_b$) to infer the posterior probability, this time at a lower threshold: $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_a) > 0.50$ or $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_b) > 0.50$. This second criterion ensured that the evolutionary motif was identifiable under typical conditions where one has limited access to structural information (in this case, a single protein structure in a pair). Only those aligned sites meeting both criteria were selected for downstream analysis.
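The two-step filter can be sketched as a simple selection over per-site posterior jump probabilities. The input lists and the function name are hypothetical; in practice the posteriors would come from inference under the model, which is not shown here.

```python
def select_motif_sites(post_both, post_a_only, post_b_only,
                       strict=0.90, relaxed=0.50):
    """Return indices of aligned sites passing both filtering criteria.

    post_both:   posterior jump probability given both structures.
    post_a_only: posterior jump probability given sequences and X_a only.
    post_b_only: posterior jump probability given sequences and X_b only.
    """
    selected = []
    for i, p in enumerate(post_both):
        passes_strict = p > strict
        passes_relaxed = post_a_only[i] > relaxed or post_b_only[i] > relaxed
        if passes_strict and passes_relaxed:
            selected.append(i)
    return selected


# Made-up posterior probabilities for four aligned sites.
sites = select_motif_sites(
    post_both=[0.95, 0.99, 0.40, 0.93],
    post_a_only=[0.60, 0.30, 0.90, 0.20],
    post_b_only=[0.10, 0.45, 0.80, 0.70],
)
```

With these made-up values, the second site is rejected despite strong evidence from both structures, because neither single-structure posterior exceeds the relaxed threshold.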

Statistical Alignment: Modeling Insertions and Deletions

Protein sequences undergo not only amino acid transitions due to underlying nucleotide mutations in the coding sequence, but also indel events. To account for this, a modified pairwise TKF92 alignment HMM based on Miklós et al. (2004) was implemented. The TKF92 alignment HMM was augmented with the ETDBN evolutionary hidden states in order to capture local sequence and structure evolutionary dependencies. Furthermore, it was modified such that neighboring dependencies amongst hidden states at adjacent alignment sites were modeled. For more details, see 'Statistical alignment' in the Supplementary Material online.

Whilst it is possible to fix the alignment in advance by pre-aligning the sequences using one of the many available alignment methods (Katoh et al. 2002; Edgar 2004) or using a curated alignment (such as from the HOMSTRAD database), doing so ignores alignment uncertainty.

Training and Test Datasets

A training dataset of 1,200 protein pairs (2,400 proteins; 417,870 site observation pairs) and a test dataset of 38 protein pairs (76 proteins; 14,125 site observation pairs) were assembled from 1,032 protein families in the HOMSTRAD database. For further details, see 'Construction of test and training datasets' in the Supplementary Material online.

Model Training and Selection

Maximum Likelihood Estimation of the model parameters $\Psi$ was done using Stochastic Expectation Maximization (StEM; Gilks et al. 1995).

For further details of the E- and M-steps of the StEM algorithm, and for details about model selection, please refer to 'Model training and selection' in the Supplementary Material online.

Results and Discussion

Selecting the Number of Hidden States

Fourteen models were trained (8, 16, 32, 48, 52, 56, 60, 64, 68, 72, 76, 80, 96, and 112 hidden state models). The 64 hidden state model was chosen as the best model, as it had the lowest Bayesian Information Criterion (BIC; fig. 3).
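Model selection by BIC can be sketched as follows, using the standard definition $\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$. The log-likelihoods, parameter counts, and number of observations in the usage lines are invented purely for illustration and are not the values obtained for ETDBN.

```python
import math


def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


def select_model(candidates, n_obs):
    """Return the label of the candidate with the lowest BIC.

    `candidates` maps a label (here, a number of hidden states) to a tuple
    (maximized log-likelihood, number of free parameters).
    """
    return min(candidates, key=lambda label: bic(*candidates[label], n_obs))


# Invented numbers: likelihood gains shrink as the model grows.
toy_candidates = {
    8: (-2_000_000.0, 3000),
    16: (-1_950_000.0, 6000),
    32: (-1_940_000.0, 9000),
}
best = select_model(toy_candidates, n_obs=400_000)
```

In this toy example the intermediate model wins: going from 16 to 32 hidden states improves the log-likelihood by less than the penalty paid for the additional parameters.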

Stationary Distributions over Dihedral Angles Capture the Empirical Distribution

Figure 4 illustrates the sampled and empirical dihedral angle distributions. There is a good correspondence between dihedral angles sampled under the model (fig. 4, left) and the empirical distribution of dihedral angles in our training dataset (fig. 4, right) for all three cases illustrated (all amino acids, glycine only, and proline only). The correspondence is not surprising given that ETDBN is effectively a mixture model with a large number of mixture components.

Estimates of Evolutionary Time from Dihedral Angles Are Consistent with Estimates from Sequence

Whilst ETDBN has a general scope with respect to applications (including acting as a proposal distribution or as a building block in a homology modeling application), we envision the primary application being inference of evolutionary parameters.

Figure 5 compares evolutionary times estimated using only pairs of homologous amino acid sequences versus only pairs of homologous dihedral angles. As desired, the two estimates of evolutionary time for each protein pair are similar, as can be seen by the proximity of the points to the identity line.

A paired t-test gave a p-value of 0.578, thus failing to reject the null hypothesis that there is no difference between branch lengths estimated using sequence only versus angles only. This indicates that there is sufficient evolutionary information in the dihedral angles to estimate the evolutionary times and that the model is consistent in its estimates, lacking a significant tendency to underestimate or overestimate the evolutionary times when either sequence or dihedral angles are used.

Interestingly, the variance in the sampled evolutionary times is higher when only dihedral angles are used as compared with sequence only (see fig. S1, Supplementary Material online).

The Relationship between Evolutionary Time and Angular Distance Is Adequately Modeled

We investigated the relationship between evolutionary time and angular distance between real protein pairs and protein pairs where the dihedral angles of $p_b$ ($X_b$) were treated as missing and hence sampled (fig. 6).

As expected, for both real and sampled pairs, angular distance tends to increase as a function of evolutionary time. For larger evolutionary times a plateau begins to emerge, which is expected, as the maximum possible theoretical angular distance is $\sqrt{8} \approx 2.828$.

When the evolutionary time is exactly zero ($t_{ab} = 0$), under our model the angular distance between sampled dihedral angles is exactly zero (not shown in fig. 6). However, this is not expected to be the case for real protein pairs when the two sequences are identical (due to the inherently flexible nature of proteins, different experimental conditions, experimental noise, etc.). It is therefore not surprising that the regression curve for the real protein pairs does not pass through zero.
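The stated maximum of $\sqrt{8}$ is consistent with measuring the distance between two $(\phi, \psi)$ pairs as the Euclidean distance between their unit-circle embeddings, that is, $d = \sqrt{2(1 - \cos\Delta\phi) + 2(1 - \cos\Delta\psi)}$. That definition is assumed in the sketch below; the exact definition used for the figures is given elsewhere in the paper and may differ in detail.

```python
import math


def angular_distance(a, b):
    """Distance between two (phi, psi) pairs under the assumed definition."""
    return math.sqrt(sum(2.0 * (1.0 - math.cos(x - y)) for x, y in zip(a, b)))


def mean_angular_distance(angles_a, angles_b):
    """Average distance over aligned positions of two equally long lists."""
    total = sum(angular_distance(a, b) for a, b in zip(angles_a, angles_b))
    return total / len(angles_a)


d_max = angular_distance((0.0, 0.0), (math.pi, math.pi))   # both angles flipped
d_zero = angular_distance((0.3, -1.0), (0.3, -1.0))        # identical angles
```

Under this assumed definition the distance is invariant to shifting either angle by a multiple of $2\pi$, is zero for identical conformations, and reaches its maximum when both dihedral angles differ by $\pi$.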


For small evolutionary times (< 0.2), the curves for the real and sampled protein pairs show a good correspondence; however, for larger evolutionary times the model tends to underestimate angular distances. This may reflect the fact that the tpd of the WN diffusion specified is localized around its mean even when the evolutionary time is large; therefore, dihedral angles distant from this mean are unlikely to be sampled. To a certain extent this is mitigated by the jump model, which occasionally allows for large changes in dihedral angle, but may still be somewhat limited in its flexibility, as jumps can only occur between two site-classes. The majority of protein pairs in our training dataset represent smaller evolutionary times (81.7% of evolutionary times are smaller than 0.4), and therefore protein pairs with larger evolutionary times and their associated jumps are underrepresented in our dataset, which may also explain the underestimation.

An additional possibility is that ETDBN does not attempt to model global dependencies. Echave and Fernández (2010) use an LFENM (linearly forced elastic network model), which does take into account global dependencies, and provide evidence showing that the majority of structural changes are due to collective global deformations rather than local deformations. A local model such as ETDBN, by definition, does not take into account global dependencies and therefore does not fully account for their contribution to structural divergence.

Evaluation of the Model

The conditional independence structure in (eq. 1) enables computationally efficient sampling from the model under different combinations of observed or missing data. For example, ETDBN can be used to sample (i.e., predict) the dihedral angles of a protein from its corresponding amino acid sequence, a homologous amino acid sequence, a homologous set of dihedral angles, the corresponding secondary structure, a homologous secondary structure, or any combination of them.

Predictive accuracy was measured using 38 homologous protein pairs in the test dataset. For every protein pair $(p_a, p_b)$, the dihedral angles of $p_b$ were treated as missing, and these missing dihedral angles were sampled under the model given a particular combination of observation types. The average angular distance between the sampled and known dihedral angles was used as the measure of predictive accuracy.
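The accuracy measure described above can be sketched in a few lines. The exact definition of angular distance is given in the Supplementary Material and may differ from what is shown here; as an assumption for illustration only, the sketch wraps each dihedral difference onto $[-\pi, \pi)$ and combines the $\phi$ and $\psi$ differences as a Euclidean norm. All function names are hypothetical.

```python
import math

def wrap(angle):
    """Wrap an angle in radians onto [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def angular_distance(x, y):
    """Distance between two (phi, psi) pairs on the torus.

    Assumed definition: Euclidean norm of the wrapped differences.
    """
    dphi = wrap(x[0] - y[0])
    dpsi = wrap(x[1] - y[1])
    return math.hypot(dphi, dpsi)

def mean_angular_distance(true_angles, sampled_angles):
    """Average per-site distance between known and sampled dihedral angles."""
    if not true_angles or len(true_angles) != len(sampled_angles):
        raise ValueError("need two non-empty angle lists of equal length")
    dists = [angular_distance(t, s) for t, s in zip(true_angles, sampled_angles)]
    return sum(dists) / len(dists)
```

Wrapping matters because angles such as 3.0 and -3.0 radians lie close together on the circle even though their raw difference is large; lower mean distances correspond to better predictions.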

Figure 7 gives an example of predictive accuracy under different combinations of observation types, overlaid on a cartoon representation of the protein structure being predicted, whereas figure 8 provides a representative view of predictive accuracy across 10 different protein pairs in the test dataset for different combinations of observation types. We highlight some of the key patterns identified in figures 7 and 8 as follows.

Combination 1 refers to random sampling from the model, meaning that no observations were conditioned on other than the respective lengths of proteins $p_a$ and $p_b$. The average angular distance between the true and predicted dihedral angles was 1.6. Random sampling acts as a baseline for predictive accuracy. It is apparent from figure 7 that the model has a propensity to predict right-handed α-helices, which occupy the most populated region in the Ramachandran plot.

Under combination 2, only the amino acid sequence corresponding to $p_b$ is observed. As expected, in figures 7 and 8 there is an increase in predictive accuracy with the addition of the amino acid sequence relative to combination 1.

Under combination 3, we add in the amino acid sequence of a homologous protein ($p_a$). In all ten cases there is an improvement in predictive accuracy. The improvement in predictive accuracy is reasonable, as knowledge of the sequence evolutionary trajectory is expected to encode information about structure evolution and hence will inform the dihedral angle conformational possibilities.

Under combination 4, in addition to the two amino acid sequences, we treat the homologous secondary structure as observed. This results in a substantial improvement in predictive accuracy, as one would expect. Knowledge of the amino acid sequence and a homologous secondary structure strongly informs regions of the Ramachandran plot that are likely to be occupied.

Under combination 5 (which we consider the canonical combination: the standard homology modeling scenario), we treat both amino acid sequences as observed, as well as the dihedral angles of the homologous protein ($p_a$). In all cases the predictive accuracy improves over combination 4. This is anticipated, as the homologous dihedral angles are expected to be the best proxy for missing dihedral angles and are therefore expected to be more informative than secondary structure alone. Note that the availability of a homologous amino acid sequence pair here and in combination 4 is consequential, as it informs the evolutionary time parameter $t_{ab}$, which will typically constrain the distribution over dihedral angles and reduce the associated uncertainty.

Finally, in combination 6, the same data observations as in combination 5 are used, except the alignment is treated as given a priori (by the HOMSTRAD alignment) rather than as unobserved. The HOMSTRAD alignment is based on a structural and sequence alignment of $p_a$ and $p_b$ and therefore is expected to encode a higher degree of homology and structural information than combination 5 (where the alignment is treated as unobserved and therefore a marginalization over alignments is performed). On average there is a slight improvement in predictive accuracy when fixing the alignment, albeit the magnitude of improvement is not substantial. This demonstrates the accuracy of the alignment HMM.

FIG. 3. BIC scores (points) and number of free parameters (curve) as a function of the number of hidden states across 14 models (indicated above the dotted vertical lines), trained using the 1200 protein pairs in the training dataset. A 64 hidden state model had the lowest BIC score. Each model represents the best of several attempts.
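As background for the model selection summarized in figure 3, the sketch below applies the standard definition $\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$, where $k$ is the number of free parameters, $n$ the number of observations, and $\hat{L}$ the maximized likelihood. The log-likelihoods, parameter counts, and sample size in the example are invented purely for illustration and are not values from this study; how $n$ is counted here is likewise an assumption.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical candidates: hidden states -> (log-likelihood, free parameters).
candidates = {
    16: (-130000.0, 900),
    64: (-100000.0, 3600),
    112: (-99000.0, 6300),
}
n_obs = 200000  # hypothetical number of observations

scores = {h: bic(ll, k, n_obs) for h, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print("preferred number of hidden states:", best)
```

The penalty term grows with the parameter count, so a larger model is preferred only when its gain in log-likelihood outweighs the added complexity.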

FIG. 4. Ramachandran plots depicting sampled and empirical dihedral angle distributions. The top row depicts the distributions for all amino acids, the middle for glycine only, and the bottom for proline only. The leftmost plots show dihedral angles sampled under the jump model, whereas the rightmost plots show the empirical distributions of dihedral angles in the training dataset.


The alignment HMM accounts for alignment uncertainty in a principled manner, which is particularly useful when an appropriate alignment is unavailable. However, it should be noted that inference scales as $\mathcal{O}(|p_a||p_b|h^2)$ when treating the alignment as unobserved. Inference scales as $\mathcal{O}(mh^2)$ when the alignment is fixed a priori, where $m$ is the length of alignment $M_{ab}$ and is typically much smaller than $|p_a||p_b|$.
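To give a feel for the gap between the two scaling regimes, the following back-of-the-envelope sketch counts only the dominant terms and ignores constant factors. The protein lengths and alignment length are hypothetical, $h$ is taken here to be the number of hidden states (64 in the selected model), and the function names are illustrative only.

```python
def cost_unobserved_alignment(len_a, len_b, h):
    """Dominant term when marginalizing over alignments: |p_a| * |p_b| * h^2."""
    return len_a * len_b * h ** 2

def cost_fixed_alignment(m, h):
    """Dominant term when the alignment of length m is given: m * h^2."""
    return m * h ** 2

# Hypothetical sizes: two proteins of 150 and 160 residues, alignment length 170.
len_a, len_b, m, h = 150, 160, 170, 64

ratio = cost_unobserved_alignment(len_a, len_b, h) / cost_fixed_alignment(m, h)
print(f"marginalizing over alignments costs roughly {ratio:.0f} times more")
```

Because the $h^2$ factor cancels, the ratio reduces to $|p_a||p_b| / m$, which grows roughly linearly with protein length.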

It should be emphasized that we do not expect ETDBN to compete with structure prediction packages such as Rosetta (Rohl et al. 2004) or homology modeling software such as SWISS-MODEL (Arnold et al. 2006) in terms of predictive accuracy. Our current model is a local model of structure evolution; it is not even expected to capture fundamental constraints such as the radius of gyration of a protein or other global features typical of proteins.

Evolutionary Hidden States Reveal a Common Evolutionary Motif

One benefit of ETDBN is that the 64 evolutionary hidden states learned during the training phase are interpretable. We give an example of a hidden state encoding a jump event that was subsequently found to represent an evolutionary motif present in a large number of protein pairs in our test and training datasets.

Evolutionary hidden state 3 (fig. 9) was selected from the 64 hidden states as an example of a hidden state encoding a jump event and capturing angular shift (a large change in dihedral angle). A notable feature of this hidden state is that the change in dihedral angles between site-classes $r_1$ and $r_2$ is associated with specific amino acid changes. In site-class $r_1$ the amino acid frequencies are relatively spread out amongst a number of amino acids, whereas in site-class $r_2$ the frequencies are particularly concentrated in favor of glycine (Gly) and asparagine (Asn), with glycine being significantly more probable in site-class $r_2$ than $r_1$. This suggests that, conditioned on hidden state 3, an exchange between a glycine and another amino acid is likely indicative of a jump and hence of a corresponding change in dihedral angle. This is consistent with what we find in a subsequent analysis of evolutionary motifs. This particular jump occurs in coil regions.

After selecting hidden state 3, we analyzed positions in 238 protein pairs for evidence of the corresponding evolutionary motif: 38 protein pairs in the test dataset and a further 200 from the training dataset were analyzed using the criteria described in the Methods section. Using the first criterion, 84 protein sites in 59 protein pairs corresponding to $H^i = 3$ (evolutionary hidden state 3) were identified. Of the 84 protein sites, 34 met the second criterion.

We give an example of a homologous protein pair illustrating the identified evolutionary motif. Two histidine-containing phosphocarriers, 1pch (Mycoplasma capricolum) and 1poh (Escherichia coli), were identified as having the evolutionary motif (fig. 10) at homologous site E39/G39.

FIG. 5. Scatterplot comparing evolutionary times estimated using pairs of homologous amino acid sequences only versus pairs of homologous sets of dihedral angles only, for $N = 38$ protein pairs in the test dataset. The x-coordinate of each point gives the estimated evolutionary time based only on the amino acid sequence, whereas the y-coordinate gives the estimated evolutionary time based only on the dihedral angles. The diagonal line represents $y = x$.

FIG. 6. Evolutionary time versus angular distances between real and corresponding sampled protein pairs in the training dataset at 50 representative evolutionary times. Mean angular distances (see 'Calculation of angular distances' in the Supplementary Material online) between the real dihedral angles in a protein pair (red; $X_a$ and $X_b$) and sampled dihedral angles in a sampled protein pair (blue; $X_a$ and $X_b$) were compared to test how well the sampled dihedral angles reproduced the real angular distances. The dihedral angles ($X_b$) of each sampled protein pair were sampled by conditioning on both amino acid sequences and the homologous dihedral angles ($A_a$, $A_b$, $X_a$) and the estimated evolutionary time ($t_{ab}$) for the real protein pair. The regression curves were obtained by a quadratic LOcally-weighted regrESSion (LOESS) with smoothing parameter chosen by leave-one-out cross-validation. The 95% confidence intervals for the mean assume error normality.


Most positions in the homologous pair have low posterior jump probabilities ($\approx 0.0$), with the exception of positions N38/N38 and E39/G39, which both have high posterior jump probabilities ($\approx 1.0$). The exchange between a glutamate (at position 39 in 1poh) and a glycine (at position 39 in 1pch) appears to be responsible for the shift in dihedral angle. This exchange corresponds to a significant jump in dihedral angle: $\langle \phi^{\mathrm{1poh}}_{\mathrm{E39}}, \psi^{\mathrm{1poh}}_{\mathrm{E39}} \rangle = \langle -1.63, -0.06 \rangle$ and $\langle \phi^{\mathrm{1pch}}_{\mathrm{G39}}, \psi^{\mathrm{1pch}}_{\mathrm{G39}} \rangle = \langle 1.40, 0.22 \rangle$. The angular distance between the two dihedral angles is 2.01. This is consistent with the amino acid frequency parameters specified by the two site-classes for hidden state 3 (fig. 9).

Site-class $r_1$ indicates that a number of amino acids (alanine, aspartic acid, glycine, histidine, lysine, asparagine, proline, glutamine, arginine, serine, and threonine) other than glutamate plausibly coincide with the particular dihedral angle conformation specified by site-class $r_1$. The involvement of glycine in a jump is not surprising, as it is a small and flexible amino acid, whereas the role of asparagine is less clear. In our analysis of 238 protein pairs, we found that of the seven positions meeting the criteria for hidden state 3 and involving an exchange with asparagine (Asn), four were an exchange between an asparagine and a glycine, whereas the remaining three were between asparagine and one of lysine, histidine, or serine.

Using Dihedral Angles for Alignment

A valuable feature of our model is its ability to account for alignment uncertainty by summing over possible pairwise alignments, using the TKF92 model as a prior distribution over indel histories, whilst simultaneously taking into account neighboring dependencies amongst aligned sites. Doing so results in a sample of alignments rather than a single alignment. Nevertheless, a single Maximum A Posteriori (MAP) pairwise alignment may be obtained from the alignment samples and used for downstream analysis.

ETDBN and several other alignment methods (namely StatAlign, BAli-Phy, MUSCLE, and MAFFT) were used to infer pairwise alignments from simulated and real data under various combinations of data observations: for example, an amino acid sequence pair ($A_a$, $A_b$), a secondary structure sequence pair ($S_a$, $S_b$), a dihedral angle sequence pair ($X_a$, $X_b$), and combinations thereof.

In the first set of benchmarks (fig. 11A), pairs of proteins were simulated from the ETDBN model conditioned on 38 different pairwise alignments and corresponding evolutionary times. This resulted in a set of 38 simulated pairwise alignments together with corresponding observations, implying that the true underlying alignments were known for each of the simulated protein pairs. ETDBN and a number of other alignment methods were used to infer pairwise alignments for each. The alignment similarity metric (Schwartz et al. 2005) was used to measure the similarity between the inferred alignments and the true alignments, where higher similarity indicates better predictions. It was found that when using the simulated amino acid sequences alone, ETDBN (11A5) outperformed all four other methods tested (11A1: MUSCLE, 11A2: MAFFT, 11A3: StatAlign, and 11A4: BAli-Phy). However, this comparison cannot be considered fair, as the data were simulated under the ETDBN model.

FIG. 7. Cartoon structure representations of E. coli glyceraldehyde-3-phosphate dehydrogenase (PDB 1gad) are depicted in each panel, overlaid with predictive accuracy when using different combinations of observed data to predict missing dihedral angles in 1gad. Thermus aquaticus glyceraldehyde-3-phosphate dehydrogenase (PDB 1cer) was used as a homolog for the purposes of prediction. Predictive accuracy is indicated using a color gradient depicting the mean angular distance between the true dihedral angle ($X^{i}_{\mathrm{1gad}}$) and the predicted (sampled) dihedral angles ($\hat{X}^{i}_{\mathrm{1gad}}$) at each amino acid position. The label at the bottom of each panel indicates the data combination used. In (A) no data was used for prediction. In (B) only the amino acid sequence corresponding to 1gad ($A_{\mathrm{1gad}}$) was used. In (C) the amino acid sequence of 1gad ($A_{\mathrm{1gad}}$) and the amino acid sequence of the homologous protein ($A_{\mathrm{1cer}}$) were used. In (D) both amino acid sequences ($A_{\mathrm{1cer}}$ and $A_{\mathrm{1gad}}$) and the secondary structure of the homologous protein ($S_{\mathrm{1cer}}$) were used. In (E) both amino acid sequences ($A_{\mathrm{1cer}}$ and $A_{\mathrm{1gad}}$) and the dihedral angles of the homologous protein ($X_{\mathrm{1cer}}$) were used. Finally, in panel (F) the same combination of observations was used as in (E), but the alignment was treated as known a priori.

FIG. 8. Benchmarks of predictive accuracy (measured using angular distance; lower is better) on a random subset of ten protein pairs in the test dataset, giving a representative view of predictive accuracy under six different combinations of observations. The dihedral angles $X_b$ of $p_b$ were treated as missing and were sampled under the model, whereas $p_a$ was a homologous protein used for the purposes of prediction. See the legend of figure 7 for a description of each combination (A–F). The final set of bars, denoted 'Mean ($N = 38$)', are the mean values for the entire test dataset of $N = 38$ protein pairs. The error bars are the standard errors.

FIG. 9. Depiction of evolutionary hidden state 3. This hidden state was sampled at 0.69% of sites (the average was 1.56%). The equilibrium frequencies of $r_1$ and $r_2$ were $\pi_1 = 0.691$ and $\pi_2 = 0.309$, respectively. The jump rate was $\gamma = 3.176$. The corresponding site-class pair probabilities are depicted to the right as a function of evolutionary time. Note that the dashed red lines depicting the probabilities for (1, 2) and (2, 1) superimpose exactly because the probabilities are equal; this holds for the jump probabilities of all hidden states, as it is required for time-reversibility. In the main figure, the two rows depict the parameters encoded by the two site-classes, respectively. Columns 1 and 3 depict the parameters governing the amino acid and secondary structure stochastic processes, respectively. The secondary structure classes correspond to H = helix, S = sheet, and C = coil. Column 2 depicts the WN diffusions. The stationary distributions of the WN diffusions are shown using black contour lines, the directions of the drifts are indicated by the arrows, and the magnitude of the drifts at each position is indicated using the color gradient.
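The symmetry of the (1, 2) and (2, 1) site-class pair probabilities noted in the legend of figure 9 can be illustrated with a minimal two-state sketch. It assumes, for illustration only, that switching between the two site-classes follows a reversible two-state continuous-time Markov chain with equilibrium frequencies $(\pi_1, \pi_2)$ and rate matrix $Q = \gamma(\Pi - I)$; the model's actual parametrization is the one given in the Methods and may differ. The value $\pi_1 = 0.691$ is taken from figure 9, while the rate used in the example is a placeholder.

```python
import math

def site_class_pair_probs(pi1, gamma, t):
    """Joint probabilities of the site-class pair (class in a, class in b).

    Assumes a reversible two-state chain started at equilibrium, so that
    P_ij(t) = pi_j * (1 - exp(-gamma * t)) for i != j.
    """
    pi2 = 1.0 - pi1
    decay = math.exp(-gamma * t)
    return {
        (1, 1): pi1 * (pi1 + pi2 * decay),
        (1, 2): pi1 * (pi2 * (1.0 - decay)),
        (2, 1): pi2 * (pi1 * (1.0 - decay)),
        (2, 2): pi2 * (pi2 + pi1 * decay),
    }

# pi1 from figure 9; the jump rate here is an illustrative placeholder.
for t in (0.0, 0.2, 1.0):
    probs = site_class_pair_probs(pi1=0.691, gamma=3.0, t=t)
    print(t, round(probs[(1, 2)], 4), round(probs[(2, 1)], 4))
```

Under this assumption both off-diagonal entries equal $\pi_1 \pi_2 (1 - e^{-\gamma t})$, which is zero at $t = 0$ and approaches $\pi_1 \pi_2$ for large evolutionary times.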

More revealing in figure 11A was the alignment similarity under ETDBN when using different combinations of simulated data observations. It was found that secondary structure alone (11A6) performed the worst, which is unsurprising given that only three states were available to align the proteins. The second worst in terms of alignment similarity was amino acid sequences alone (11A5), followed by amino acid sequences and secondary structures (11A7). Interestingly, using dihedral angles only (11A8) outperformed both 11A5 (sequences only) and 11A6 (secondary structures only). Finally, using amino acid sequences together with dihedral angles (11A9) or all three data types combined (11A10) outperformed all other combinations. This illustrates that, at least under simulation conditions, increasing the number of data observations results in better alignment accuracy.
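A similarity score of the kind used in these benchmarks can be sketched as follows. The sketch is written in the spirit of the alignment metric accuracy of Schwartz et al. (2005): it reports the fraction of residues, across both sequences, that receive the same partner (or the same gap status) in the inferred and reference alignments. The details of the metric as actually applied may differ, and the gapped-string representation and function names are assumptions made for illustration.

```python
def partners(row_a, row_b, gap="-"):
    """For each residue, the index of its aligned partner, or None if gapped."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    part_a, part_b = [], []
    i = j = 0  # residue counters for sequences a and b
    for ca, cb in zip(row_a, row_b):
        if ca != gap and cb != gap:
            part_a.append(j)
            part_b.append(i)
            i += 1
            j += 1
        elif ca != gap:
            part_a.append(None)
            i += 1
        elif cb != gap:
            part_b.append(None)
            j += 1
    return part_a, part_b

def alignment_similarity(inferred, reference):
    """Fraction of residues aligned identically in the two alignments."""
    inf_a, inf_b = partners(*inferred)
    ref_a, ref_b = partners(*reference)
    if len(inf_a) != len(ref_a) or len(inf_b) != len(ref_b):
        raise ValueError("alignments must be over the same two sequences")
    same = sum(x == y for x, y in zip(inf_a, ref_a))
    same += sum(x == y for x, y in zip(inf_b, ref_b))
    return same / (len(inf_a) + len(inf_b))

reference = ("AC-GT", "ACTGT")
inferred = ("ACG-T", "ACTGT")
print(alignment_similarity(inferred, reference))
```

A score of 1 means the two alignments agree for every residue; misplacing a single gap lowers the score for every residue whose partner changes as a result.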

Following that, the various alignment methods were benchmarked against 38 pairwise alignments consisting of real sequence and structure observations in the test dataset. These pairwise alignments were obtained from the HOMSTRAD alignments. The sequence identity of these pairwise alignments ranged from 10% to 93%, with an average sequence identity of 39%. In addition to methods 1–10 in figure 11A, five structural alignment methods were also used: 11: StatAlign (Herman et al. 2014), 12: TMAlign (Zhang and Skolnick 2005), 13: Mammoth (Ortiz et al. 2002), 14: Dali (Holm and Rosenström 2010), and 15: CE (Shindyalov and Bourne 1998). These methods were not used in 11A due to the lack of an appropriate model for simulating the evolution of three-dimensional protein structures.

When benchmarking the MAP estimated alignments against the HOMSTRAD alignments (fig. 11B) using real sequences alone for inference ($A_a$, $A_b$), ETDBN (11B5) had a similar degree of accuracy when compared with several other sequence-based methods (11B1: StatAlign, 11B2: BAli-Phy, 11B3: MUSCLE, and 11B4: MAFFT). This demonstrates that ETDBN has performance comparable to that of other commonly used sequence alignment methods.

Using ($S_a$, $S_b$) alone, ETDBN (11B6) had substantially lower alignment similarity compared with sequence only, which was expected given that a similar result was obtained for the simulated data (11A6). However, when including the real sequences (11B7), the predictive accuracy was once again comparable to sequence-only inferences (11B1–11B5).

FIG. 10. Depiction of two histidine-containing phosphocarriers, PDB 1pch and 1poh, superimposed. On the left is a cartoon representation of the two proteins corresponding to regions F29-K49 and F29-A42, respectively, with posterior jump probabilities at each position overlaid. On the right is a ball-and-stick representation giving atomic detail for a smaller region (I36-G42 and T36-A42, respectively). The exchange between a glutamate (E39 in 1poh) and a glycine (G39 in 1pch) is associated with a large change in dihedral angle, as indicated by the curved arrows.


When using ($X_a$, $X_b$) alone (11B8), the alignment similarity was found to be somewhat worse than in the sequence-only cases. Furthermore, when introducing the sequences (11B9) and secondary structures (11B10) in addition to the dihedral angles, the similarity remained worse than for the sequence-only methods (11B1–11B5), despite the additional information. These results are in contrast to the results we obtained for simulated data (11A8–11A10).

The non-statistical structural alignment methods (11B12–11B15) fared the best, likely because they use criteria similar to those used to construct the HOMSTRAD alignments. When interpreting these results, it is important to note that the HOMSTRAD alignments should not be considered the true underlying alignments and may even be strongly biased. For example, they may favor the closest structural superimposition of structures or the most parsimonious alignments with the fewest number of indels. In the evolutionary modeling context, our goal is to distinguish between homologous sites (sites that have evolved via mutation alone) and indels. In practice, it is extremely difficult to obtain the true underlying alignment (sets of homologies and indels), because it would require an experiment where every indel event since the common ancestor is observed, a seemingly impossible task outside of simulation or laboratory conditions.

After further investigation, the trend of lower alignment similarity seen in figure 11B6–11B10 when using ETDBN with structural observations, compared with sequence-only or non-statistical structural alignment methods, was found to reverse (11C6–11C10) upon calculating the precision of predicting homologous sites (the fraction of sites predicted as homologous that were correctly predicted as such). Therefore, when only dihedral angle observations are used, ETDBN underpredicts the number of homologous sites; however, when a homologous site is predicted, it is correctly predicted more often than when using only amino acid sequences. In particular, ETDBN predicted fewer homologous sites with coiled secondary structure compared with homologous sites with helical or sheet secondary structure. This pattern of results may be in part due to the WN diffusion used to model evolution of dihedral angles. The WN diffusion is suitable for modeling angular drift (small changes in angles localized around a region of the Ramachandran plot) but does not sufficiently capture angular shift (large changes in angles between regions of the Ramachandran plot, which are more likely in coiled regions) due to stationarity. As noted before, the jump model is an abstraction intended to capture the end-points of evolution by allowing a jump between two regions of the Ramachandran plot, abstracting a potential intermediate evolutionary trajectory for the sake of computational tractability. Note that the jump model accurately captures the common cases where a single mutation induces a large conformational shift.
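The distinction drawn above between underpredicting homologous sites and predicting them precisely can be made concrete with a short sketch. Alignments are represented here as pairs of gapped strings, which is an assumption made for illustration, as are the function names; recall is computed alongside precision only to show how the two can diverge.

```python
def homologous_pairs(row_a, row_b, gap="-"):
    """Set of (i, j) residue index pairs placed in the same alignment column."""
    pairs = set()
    i = j = 0  # residue counters for sequences a and b
    for ca, cb in zip(row_a, row_b):
        if ca != gap and cb != gap:
            pairs.add((i, j))
        if ca != gap:
            i += 1
        if cb != gap:
            j += 1
    return pairs

def precision_and_recall(inferred, reference):
    """Precision and recall of predicted homologous site pairs."""
    predicted = homologous_pairs(*inferred)
    true = homologous_pairs(*reference)
    hits = len(predicted & true)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(true) if true else 0.0
    return precision, recall

reference = ("ACGT", "ACGT")
inferred = ("ACGT--", "AC--GT")  # predicts only two homologous pairs
print(precision_and_recall(inferred, reference))
```

In this toy case every predicted pair is correct, so precision is high, yet half of the reference pairs are missed, which mirrors the pattern of underprediction described for the dihedral-angle-only alignments.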

Concluding Remarks

The main achievement of this work is a computationally tractable, generative, and interpretable probabilistic model of protein sequence and structure evolution on a local scale.

FIG. 11. Alignment benchmarks. (A) Averages of alignment similarity across different methods and combinations of data observations, where the observations were simulated from the ETDBN model conditioned on a single known alignment. (B) Averages of alignment similarity, where the 38 HOMSTRAD alignments in the test dataset were taken as the true alignments. (C) Averages of precision in predicting homologous site pairs in 38 sequence pairs from the test dataset, where homologous site pairs in the HOMSTRAD alignments were taken to be the true homologous site pairs.


Previous stochastic models of protein sequence and structure evolution emphasized estimation of evolutionary parameters (Challis and Schmidler 2012; Herman et al. 2014). ETDBN is somewhat of a departure from these previous models, but is likewise capable of estimating evolutionary parameters. We show that estimates of evolutionary times inferred under ETDBN are consistent regardless of whether amino acid sequence or dihedral angle observations are used. In addition, the relationship between evolutionary time and angular distance in real proteins is adequately recapitulated in protein pairs sampled under the model, albeit the angular distance is underestimated for larger evolutionary times, which might be explained by the limited flexibility of the jump model and the lack of taking into account global dependencies.

Like previous models, ETDBN is capable of dealing with alignment uncertainty by marginalizing over indel histories; it predicts pairwise MAP consensus alignments with accuracy similar to that of score-based and statistical alignment methods.

The generative nature of ETDBN allows us to demonstrate that the underlying empirical distributions over dihedral angles (depicted using Ramachandran plots) are captured, and that the model is capable of predicting missing observations, such as dihedral angles, from a variety of different data types: for example, an amino acid sequence, a homologous amino acid sequence, a homologous secondary structure, a homologous set of dihedral angles, or any combination thereof.

Based on its local nature, ETDBN does not constitute a homology modeling method in itself. Rather, it can be used as a building block, much like fragment libraries model local structure in protein structure prediction methods. ETDBN places the homology modeling problem on a statistical footing, enabling a number of approaches to later be used, such as multi-level modeling, that is, combining fine-grained distributions (for example, distributions over dihedral angles such as ETDBN) and coarse-grained distributions (for example, distributions describing the global properties of proteins, such as compactness). In particular, a method referred to as 'the Reference Ratio method' can be used to combine fine-grained and coarse-grained distributions in a statistically principled manner (see Frellsen et al. 2012; Hamelryck et al. 2010). However, we have shown that the current model can already be used for the inference of evolutionary parameters in its present form.

In addition to multi-level modeling, probabilistic models such as ETDBN allow one to account for and to make statements about uncertainty (e.g., with respect to evolutionary time, alignment, etc.) in a rigorous manner. In principle, ETDBN, like TorusDBN (Boomsma et al. 2008), could be used as a proposal distribution. In other words, ETDBN could be used to sample protein structures (possibly conditioned on various data observations) in a computationally efficient manner, such that the resulting samples are expected to be located in regions of high probability density with respect to the true underlying distribution.

A final key feature of our evolutionary model is its interpretable nature. This interpretability enables the identification of potential evolutionary motifs: common patterns of sequence–structure evolution. We identify one such evolutionary motif in 34 different homologous protein pairs. A major direction for future research is the further identification of such evolutionary motifs. Understanding these evolutionary motifs may (1) improve homology modeling predictions, (2) provide more accurate estimates of evolutionary parameters, and (3) produce better models of protein evolution that more realistically capture evolutionary trajectories through sequence and structure space, which may help identify functionally relevant positions that are potential drug targets.

Future Challenges

Pairwise to Phylogeny

For reasons of computational tractability, the implemented model is pairwise, but it is theoretically possible to generalize it to a phylogeny, such as in Herman et al. (2014). In practice, for three or more sequences on a phylogeny, it is necessary to marginalize out the unobserved ancestral protein states in order to compute likelihoods. Felsenstein's algorithm can be used to marginalize over discrete ancestral states, such as amino acids, in a computationally efficient manner. However, we do not know whether a similar efficient algorithm exists for marginalizing the continuous ancestral dihedral angle states under the WN diffusion, thereby necessitating a more expensive MCMC algorithm. A possibly greater computational hindrance to considering a phylogeny is the alignment problem, which scales as $\mathcal{O}(l_1 l_2 \cdots l_N)$, where $l_i$ is the length of sequence $i$ and $N$ is the number of sequences, although MCMC approaches are possible (Herman et al. 2014).
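The marginalization over discrete ancestral states mentioned above can be sketched with a small recursive version of Felsenstein's pruning algorithm. The substitution model below, in which every change draws the new state from the equilibrium frequencies so that $P(t) = e^{-rt} I + (1 - e^{-rt})\Pi$, is a simple stand-in chosen to keep the example self-contained; it is not the amino acid model used in ETDBN, and the tree encoding and function names are hypothetical.

```python
import math

def transition(pi, rate, t):
    """Transition matrix P(t) = exp(-rate*t) * I + (1 - exp(-rate*t)) * Pi."""
    d = math.exp(-rate * t)
    n = len(pi)
    return [[(d if i == j else 0.0) + (1.0 - d) * pi[j] for j in range(n)]
            for i in range(n)]

def partial_likelihoods(node, pi, rate):
    """Likelihood of the data below `node`, for each possible state at `node`.

    A node is ("leaf", observed_state) or ("node", [(child, branch_length), ...]).
    """
    kind, payload = node
    n = len(pi)
    if kind == "leaf":
        return [1.0 if s == payload else 0.0 for s in range(n)]
    result = [1.0] * n
    for child, t in payload:
        below = partial_likelihoods(child, pi, rate)
        p = transition(pi, rate, t)
        for s in range(n):
            # Sum over the unobserved state at the child end of this branch.
            result[s] *= sum(p[s][c] * below[c] for c in range(n))
    return result

def site_likelihood(root, pi, rate):
    """Marginal likelihood of one site, summing over the root state."""
    partial = partial_likelihoods(root, pi, rate)
    return sum(pi[s] * partial[s] for s in range(len(pi)))

pi = [0.5, 0.3, 0.2]  # hypothetical equilibrium frequencies
inner = ("node", [(("leaf", 0), 0.1), (("leaf", 0), 0.2)])
tree = ("node", [(inner, 0.3), (("leaf", 2), 0.4)])
print(site_likelihood(tree, pi, rate=1.0))
```

Each internal node is visited once and combines its children's partial likelihoods, so the cost grows linearly with the number of nodes rather than exponentially with the number of ancestral state combinations. No comparably simple recursion is assumed here for continuous dihedral angle states.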

Context-Dependence

Although we believe our model provides a substantial improvement over current stochastic models of sequence and structural evolution, there is still scope for improvement. The WN diffusions used to model dihedral angle evolution adequately capture angular drift (small local changes in dihedral angle) but are less capable of capturing angular shift (large changes in dihedral angle). This is to a considerable extent mitigated by the introduction of jump events, as discussed before. A more realistic model would model the entire evolutionary trajectory, allowing an arbitrary number of switches between site-classes together with neighboring dependencies amongst adjacent sites along the evolutionary trajectory. Similar context-dependent models are typically computationally expensive and require sophisticated inference procedures (Robinson et al. 2003; Yu and Thorne 2006).

Software Availability

Julia code (tested on both Windows and Linux platforms) is available at http://www.computingforbiology.org/software/etdbn.


Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

The authors acknowledge funding from the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN). The second author acknowledges support from project MTM2016-76969-P from the Spanish Ministry of Economy and Competitiveness and ERDF. The authors acknowledge valuable comments from three referees that led to substantial improvements in the manuscript, as well as initial discussions with Christian Havn, Ian Lim, and Mathias Cronjäger.

ReferencesArnold K Bordoli L Kopp J Schwede T 2006 The SWISS-MODEL work-

space a web-based environment for protein structure homologymodelling Bioinformatics 22(2)195ndash201

Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T. 2008. A generative probabilistic model of local protein structure. Proc Natl Acad Sci U S A. 105(26):8932–8937.

Boomsma W, Tian P, Frellsen J, Ferkinghoff-Borg J, Hamelryck T, Lindorff-Larsen K, Vendruscolo M. 2014. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proc Natl Acad Sci U S A. 111(38):13852–13857.

Challis CJ, Schmidler SC. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol. 29(11):3575–3587.

Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett. 457(4):413–416.

Echave J, Fernandez FM. 2010. A perturbative view of protein structural variation. Proteins Struct Funct Bioinform. 78(1):173–180.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.

Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat. 125(1):1–15.

Frellsen J, Mardia KV, Borg M, Ferkinghoff-Borg J, Hamelryck T. 2012. Towards a general probabilistic model of protein structure: the reference ratio method. In: Bayesian methods in structural bioinformatics. New York: Springer. p. 125–134.

García-Portugués E, Sørensen M, Mardia KV, Hamelryck T. 2017. Langevin diffusions on the torus: estimation and applications. arXiv:1705.00296.

Gilks WR, Richardson S, Spiegelhalter D. 1995. Markov chain Monte Carlo in practice. London: CRC Press.

Grishin NV. 1997. Estimation of evolutionary distances from protein spatial structures. J Mol Evol. 45(4):359–369.

Grishin NV. 2001. Fold change in evolution of protein structures. J Struct Biol. 134(2):167–185.

Gutin AM, Badretdinov AY. 1994. Evolution of protein 3D structures as diffusion in multidimensional conformational space. J Mol Evol. 39(2):206–209.

Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 15(7):910–917.

Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. 2010. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE 5(11):e13714.

Herman JL, Challis CJ, Novak A, Hein J, Schmidler SC. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol. 31(9):2251–2266.

Holm L, Rosenstrom P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38(Web Server issue):W545–W549.

Katoh K, et al. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14):3059–3066.

Koshi JM, Goldstein RA. 1998. Models of natural mutations including site heterogeneity. Proteins 32(3):289–295.

Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21(6):1095–1109.

Lio P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14(8):726–733.

Miklos I, Lunter G, Holmes I. 2004. A long indel model for evolutionary sequence alignment. Mol Biol Evol. 21(3):529–540.

Mizuguchi K, Deane CM, Blundell TL, Overington JP. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7(11):2469–2471.

Ortiz AR, Strauss CE, Olmea O. 2002. Mammoth (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 11(11):2606–2621.

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 20(10):1692–1704.

Rohl CA, Strauss CE, Misura KM, Baker D. 2004. Protein structure prediction using Rosetta. Methods Enzymol. 383:66–93.

Schwartz AS, Myers EW, Pachter L. 2005. Alignment metric accuracy. arXiv preprint q-bio/0510052.

Shindyalov IN, Bourne PE. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9):739–747.

Siepel A, Haussler D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol. 11(2–3):413–428.

Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 34(1):3–16.

Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18(5):691–699.

Yu J, Thorne JL. 2006. Dependence among sites in RNA evolution. Mol Biol Evol. 23(8):1525–1537.

Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33(7):2302–2309.

Golden et al. · doi:10.1093/molbev/msx137 · MBE

event. A jump event occurs when $r_a^i \neq r_b^i$. Whereas constant evolution is intended to capture angular drift (changes in dihedral angles localized to a region of the Ramachandran plot), a jump event is intended to create a directional coupling between amino acid and structure evolution and is also expected to capture angular shift (large changes in dihedral angles, possibly between distant regions of the Ramachandran plot).

The hidden state at node $H^i$, together with the evolutionary time $t_{ab}$ separating proteins $p_a$ and $p_b$, specifies a joint distribution over a site-class pair:

\[
p(r_a^i, r_b^i \mid H^i, t_{ab}) = p(r_a^i \mid H^i, r_b^i, t_{ab}) \, p(r_b^i \mid H^i), \tag{7}
\]

where

\[
p(r_a^i \mid H^i, r_b^i, t_{ab}) =
\begin{cases}
e^{-\gamma_{H^i} t_{ab}} + \pi_{H^i, r_a^i} \left(1 - e^{-\gamma_{H^i} t_{ab}}\right) & \text{if } r_a^i = r_b^i, \\
\pi_{H^i, r_a^i} \left(1 - e^{-\gamma_{H^i} t_{ab}}\right) & \text{if } r_a^i \neq r_b^i,
\end{cases}
\]

and $p(r_a^i \mid H^i) = \pi_{H^i, r_a^i}$ and $p(r_b^i \mid H^i) = \pi_{H^i, r_b^i}$. $\pi_{H^i, r_a^i}$ and $\pi_{H^i, r_b^i}$ are model parameters specifying the probability of starting in site-class $r_a^i$ or $r_b^i$, respectively, corresponding to the hidden state specified by node $H^i$. $\gamma_{H^i} > 0$ is a model parameter giving the jump rate corresponding to the hidden state given by node $H^i$.

The site-class jump probabilities have been chosen so that time-reversibility holds; in other words,

\[
p(r_a^i \mid H^i, r_b^i, t_{ab}) \, p(r_b^i \mid H^i) = p(r_b^i \mid H^i, r_a^i, t_{ab}) \, p(r_a^i \mid H^i).
\]
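The site-class pair distribution of equation (7) can be checked numerically. The following is a minimal Python sketch, assuming two site-classes per hidden state; the function name and the parameter values are illustrative and are not taken from the trained model.

```python
import math

def site_class_pair_probs(pi, gamma, t):
    """Joint probabilities p(r_a, r_b | H, t) for one hidden state (equation 7).

    pi    -- stationary site-class probabilities (must sum to one)
    gamma -- jump rate of the hidden state (> 0)
    t     -- evolutionary time t_ab (>= 0)
    """
    decay = math.exp(-gamma * t)  # the e^{-gamma t} term of equation (7)
    joint = {}
    for ra in range(len(pi)):
        for rb in range(len(pi)):
            if ra == rb:
                cond = decay + pi[ra] * (1.0 - decay)  # same site-class
            else:
                cond = pi[ra] * (1.0 - decay)          # jump event
            # p(r_a, r_b | H, t) = p(r_a | H, r_b, t) * p(r_b | H)
            joint[(ra + 1, rb + 1)] = cond * pi[rb]
    return joint

# Illustrative parameter values (assumptions, not estimates from the paper).
probs = site_class_pair_probs(pi=(0.7, 0.3), gamma=3.0, t=0.4)
probs_at_zero = site_class_pair_probs(pi=(0.7, 0.3), gamma=3.0, t=0.0)
```

Under these assumptions the four probabilities sum to one, the pairs (1, 2) and (2, 1) receive equal probability (which is the time-reversibility property), and no jump is possible at $t_{ab} = 0$.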

The hidden state at node $H^i$, together with a site-class pair $(r_a^i, r_b^i)$ and the evolutionary time $t_{ab}$, specifies the joint likelihood over site observation pairs:

\[
p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab}) =
\begin{cases}
p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_c^i, t_{ab}) & \text{if } r_a^i = r_b^i = r_c^i, \\
p(O_a^{x(i)} \mid H^i, r_a^i) \, p(O_b^{y(i)} \mid H^i, r_b^i) & \text{if } r_a^i \neq r_b^i.
\end{cases} \tag{8}
\]

In the case of constant evolution, evolution at aligned site $i$ is described in terms of the same site-class $r_c^i$. Evolution is considered constant because each observation type is drawn from a single stochastic process specified by $H^i$ and $r_c^i$. Note that the strength of the evolutionary dependency within an observation pair is a function of the evolutionary time $t_{ab}$.

In the case of a jump event, the evolutionary processes are, after the evolutionary jump, restarted independently in the stationary distribution of the new site-class. Thus, the site observations $O_a^{x(i)}$ and $O_b^{y(i)}$ are assumed to be drawn from the stationary distributions of two separate stochastic processes corresponding to site-classes $r_a^i$ and $r_b^i$, respectively.

This implies that, conditional on a jump, the likelihood of the observations is no longer dependent on $t_{ab}$. A jump event is an abstraction that captures the end-points of the evolutionary process but ignores the potential evolutionary trajectory linking the two site observations. The advantage of abstracting the evolutionary trajectory is that there is no need to perform a computationally expensive marginalization over all possible trajectories, as might be necessary in a model where the hidden states evolve along a tree. The likelihood of an observation pair is now simply a sum over the four possible site-class pairs:

\[
p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, t_{ab}) = \sum_{(r_a^i, r_b^i) \in \mathcal{R}} p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab}) \, p(r_a^i, r_b^i \mid H^i, t_{ab}),
\]

where $\mathcal{R} = \{(1, 1), (1, 2), (2, 1), (2, 2)\}$ is the set of four site-class pairs, $p(O_a^{x(i)}, O_b^{y(i)} \mid H^i, r_a^i, r_b^i, t_{ab})$ is given by equation (8), and $p(r_a^i, r_b^i \mid H^i, t_{ab})$ is given by equation (7).
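The sum over the four site-class pairs can be sketched as follows. This is an illustrative Python sketch rather than the authors' implementation: the observation likelihoods are passed in as precomputed numbers, and the argument names (`lik_constant`, `lik_stationary_a`, `lik_stationary_b`) are hypothetical.

```python
import math

def pair_likelihood(pi, gamma, t, lik_constant, lik_stationary_a, lik_stationary_b):
    """p(O_a, O_b | H, t): sum over the four site-class pairs.

    lik_constant[r]     -- p(O_a, O_b | H, r_c = r, t), constant evolution
    lik_stationary_a[r] -- p(O_a | H, r_a = r), stationary likelihood
    lik_stationary_b[r] -- p(O_b | H, r_b = r), stationary likelihood
    """
    decay = math.exp(-gamma * t)
    total = 0.0
    for ra in range(2):
        for rb in range(2):
            if ra == rb:
                # Constant evolution: both observations share one process.
                prior = (decay + pi[ra] * (1.0 - decay)) * pi[rb]
                total += prior * lik_constant[ra]
            else:
                # Jump event: observations are independent draws from the
                # stationary distributions of the two site-classes.
                prior = pi[ra] * (1.0 - decay) * pi[rb]
                total += prior * lik_stationary_a[ra] * lik_stationary_b[rb]
    return total

# Hypothetical likelihood values, chosen only to exercise the function.
at_zero = pair_likelihood((0.7, 0.3), 3.0, 0.0, (2.0, 0.5), (1.0, 1.0), (1.0, 1.0))
all_ones = pair_likelihood((0.7, 0.3), 3.0, 0.4, (1.0, 1.0), (1.0, 1.0), (1.0, 1.0))
```

At $t_{ab} = 0$ only the constant-evolution terms contribute, so the result reduces to a mixture of the two constant-evolution likelihoods weighted by the site-class probabilities.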

Identification of Evolutionary Motifs Encoding Jump Events

In order to identify aligned sites having potential evolutionary motifs encoding jump events, a specific criterion was developed.

For a particular protein pair, inference was performed under the model conditioned on the amino acid sequence and dihedral angles for both proteins $(A_a, A_b, X_a, X_b)$. Homologous sites corresponding to a single hidden state and with evidence of a jump event ($r_a^i \neq r_b^i$) at posterior probability > 0.90 were identified, that is, the $i$'s such that $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_a, X_b) > 0.90$.

FIG. 2. Drift vector field for the WN diffusion with $A = (1, 0.5; 0.5, 0.5)$, $\mu = (0, 0)$ and $\Sigma = (1.5)^2 I$. The color gradient represents the Euclidean norm of the drift. The contour lines represent the stationary distribution. An example trajectory starting at $x_0 = (0, 0)$ and ending at $x_2$, running in the time interval [0, 2], is depicted using a white to red color gradient indicating the progression of time. The periodic nature of the diffusion can be seen by the wrapping of both the stationary diffusion and the trajectory at the boundaries of the square plane. The fact that the stationary distribution is not aligned with the horizontal and vertical axes illustrates the dependence (given by $\alpha_3$) between the $\phi$ and $\psi$ dihedral angles.

In a second filtering step, amino acid sequences and a single set of dihedral angles corresponding to one of the proteins were used ($A_a, A_b, X_a$ or $A_a, A_b, X_b$) to infer the posterior probability, this time at a lower threshold: $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_a) > 0.50$ or $p(H^i, r_a^i \neq r_b^i \mid A_a, A_b, X_b) > 0.50$. This second criterion ensured that the evolutionary motif was identifiable under typical conditions where one has limited access to structural information (in this case, a single protein structure in a pair). Only those aligned sites meeting both criteria were selected for downstream analysis.
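The two filtering steps amount to a pair of thresholds on per-site posterior probabilities. A minimal Python sketch follows, assuming the three posteriors have already been computed for each aligned site of a given hidden state; the function and argument names are hypothetical.

```python
def select_motif_sites(post_both, post_a_only, post_b_only,
                       strict=0.90, relaxed=0.50):
    """Return indices of aligned sites passing both filtering steps.

    post_both   -- p(H^i, r_a != r_b | A_a, A_b, X_a, X_b) per site
    post_a_only -- p(H^i, r_a != r_b | A_a, A_b, X_a) per site
    post_b_only -- p(H^i, r_a != r_b | A_a, A_b, X_b) per site
    """
    selected = []
    for i, (both, a_only, b_only) in enumerate(
            zip(post_both, post_a_only, post_b_only)):
        passes_first = both > strict                         # both structures
        passes_second = a_only > relaxed or b_only > relaxed  # one structure
        if passes_first and passes_second:
            selected.append(i)
    return selected

# Made-up posteriors for three sites (illustration only).
sites = select_motif_sites([0.95, 0.99, 0.40],
                           [0.60, 0.30, 0.90],
                           [0.20, 0.10, 0.95])
```

In this made-up example only the first site is retained: the second site fails the single-structure criterion and the third fails the stricter criterion that uses both structures.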

Statistical Alignment: Modeling Insertions and Deletions

Protein sequences can undergo not only amino acid substitutions due to underlying nucleotide mutations in the coding sequence, but also indel events. To account for this, a modified pairwise TKF92 alignment HMM based on Miklos et al. (2004) was implemented. The TKF92 alignment HMM was augmented with the ETDBN evolutionary hidden states in order to capture local sequence and structure evolutionary dependencies. Furthermore, it was modified such that neighboring dependencies amongst hidden states at adjacent alignment sites were modeled. For more details, see 'Statistical alignment' in the Supplementary Material online.

Whilst it is possible to fix the alignment in advance by pre-aligning the sequences using one of the many available alignment methods (Katoh et al. 2002; Edgar 2004) or by using a curated alignment (such as from the HOMSTRAD database), doing so ignores alignment uncertainty.

Training and Test Datasets

A training dataset of 1,200 protein pairs (2,400 proteins; 417,870 site observation pairs) and a test dataset of 38 protein pairs (76 proteins; 14,125 site observation pairs) were assembled from 1,032 protein families in the HOMSTRAD database. For further details, see 'Construction of test and training datasets' in the Supplementary Material online.

Model Training and Selection

Maximum likelihood estimation of the model parameters $\Psi$ was done using Stochastic Expectation Maximization (StEM; Gilks et al. 1995).

For further details of the E- and M-steps of the StEM algorithm, and for details about model selection, please refer to 'Model training and selection' in the Supplementary Material online.

Results and Discussion

Selecting the Number of Hidden States

Fourteen models were trained (8, 16, 32, 48, 52, 56, 60, 64, 68, 72, 76, 80, 96, and 112 hidden state models). The 64 hidden state model was chosen as the best model, as it had the lowest Bayesian Information Criterion (BIC; fig. 3).
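Model selection by BIC can be illustrated with a short Python sketch. It uses the standard definition, BIC = k ln(n) - 2 ln(L). The log-likelihoods and parameter counts below are hypothetical placeholders, not values from the paper, and taking n to be the number of site observation pairs is an assumption.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L); lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical values: hidden states -> (log-likelihood, free parameters).
candidates = {
    32: (-120000.0, 1500),
    64: (-100000.0, 3000),
    112: (-99000.0, 5200),
}
n_obs = 417870  # assumed sample size: site observation pairs in training set

scores = {h: bic(ll, k, n_obs) for h, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
```

With these placeholder numbers the 64-state model wins: the 112-state model fits slightly better but pays a larger penalty for its additional free parameters.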

Stationary Distributions over Dihedral Angles Capture the Empirical Distribution

Figure 4 illustrates the sampled and empirical dihedral angle distributions. There is a good correspondence between dihedral angles sampled under the model (fig. 4, left) and the empirical distribution of dihedral angles in our training dataset (fig. 4, right) for all three cases illustrated (all amino acids, glycine only, and proline only). The correspondence is not surprising given that ETDBN is effectively a mixture model with a large number of mixture components.

Estimates of Evolutionary Time from Dihedral Angles Are Consistent with Estimates from Sequence

Whilst ETDBN has a general scope with respect to applications (including acting as a proposal distribution or as a building block in a homology modeling application), we envision the primary application being inference of evolutionary parameters.

Figure 5 compares evolutionary times estimated using only pairs of homologous amino acid sequences versus only pairs of homologous dihedral angles. As desired, the two estimates of evolutionary time for each protein pair are similar, as can be seen by the proximity of the points to the identity line.

A paired t-test gave a p-value of 0.578, thus failing to reject the null hypothesis that there is no difference between branch lengths estimated using sequence only versus angles only. This indicates that there is sufficient evolutionary information in the dihedral angles to estimate the evolutionary times, and that the model is consistent in its estimates, lacking a significant tendency to underestimate or overestimate the evolutionary times when either sequence or dihedral angles are used.
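For readers unfamiliar with the test, the paired t statistic is computed from the per-pair differences between the two estimates. The sketch below uses only the Python standard library and made-up evolutionary times; it stops at the test statistic, since converting it to a p-value requires the t distribution with n - 1 degrees of freedom, which a statistics package would supply.

```python
import math
import statistics

def paired_t_statistic(x, y):
    """t statistic for a paired t-test of mean(x - y) == 0; df = len(x) - 1."""
    diffs = [a - b for a, b in zip(x, y)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err

# Made-up evolutionary times for four protein pairs (illustration only).
times_from_sequence = [0.10, 0.25, 0.40, 0.55]
times_from_angles = [0.12, 0.22, 0.43, 0.52]
t_stat = paired_t_statistic(times_from_sequence, times_from_angles)
```

A statistic close to zero, as here, indicates that neither data type yields systematically larger estimates than the other.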

Interestingly, the variance in the sampled evolutionary times is higher when only dihedral angles are used than when only sequence is used (see fig. S1, Supplementary Material online).

The Relationship between Evolutionary Time and Angular Distance Is Adequately Modeled

We investigated the relationship between evolutionary time and angular distance between real protein pairs and protein pairs where the dihedral angles of $p_b$ ($X_b$) were treated as missing and hence sampled (fig. 6).

As expected, for both real and sampled pairs, angular distance tends to increase as a function of evolutionary time. For larger evolutionary times a plateau begins to emerge, which is expected, as the maximum possible theoretical angular distance is $\sqrt{8} \approx 2.828$.

When the evolutionary time is exactly zero ($t_{ab} = 0$), under our model the angular distance between sampled dihedral angles is exactly zero (not shown in fig. 6). However, this is not expected to be the case for real protein pairs when the two sequences are identical (due to the inherently flexible nature of proteins, different experimental conditions, experimental noise, etc.). It is therefore not surprising that the regression curve for the real protein pairs does not pass through zero.
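The paper's exact definition of angular distance is given in its supplementary material and is not reproduced here. As an assumption, the Python sketch below uses a chordal distance on the torus, which is one definition consistent with the stated maximum of $\sqrt{8}$: each angle contributes $2 - 2\cos(\Delta)$, so two antipodal angle pairs give $\sqrt{4 + 4}$.

```python
import math

def angular_distance(angles_1, angles_2):
    """Chordal distance between two (phi, psi) pairs given in radians.

    Assumed definition (not confirmed by the source): the square root of
    the summed terms 2 - 2*cos(difference), one term per dihedral angle.
    """
    return math.sqrt(sum(2.0 - 2.0 * math.cos(a - b)
                         for a, b in zip(angles_1, angles_2)))

same = angular_distance((-1.0, 2.0), (-1.0, 2.0))            # identical angles
opposite = angular_distance((0.0, 0.0), (math.pi, math.pi))  # antipodal angles
wrapped = angular_distance((3.1, 0.0), (-3.1, 0.0))          # near the boundary
```

Because the distance depends on the angles only through cosines of differences, it respects periodicity: angles of 3.1 and -3.1 radians are close on the circle and give a small distance, even though their numerical difference is large.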

For small evolutionary times (< 0.2) the curves for the real and sampled protein pairs show a good correspondence; however, for larger evolutionary times the model tends to underestimate angular distances. This may reflect the fact that the tpd of the WN diffusion specified is localized around its mean even when the evolutionary time is large; therefore, dihedral angles distant from this mean are unlikely to be sampled. To a certain extent this is mitigated by the jump model, which occasionally allows for large changes in dihedral angle, but may still be somewhat limited in its flexibility, as jumps can only occur between two site-classes. The majority of protein pairs in our training dataset represent smaller evolutionary times (81.7% of evolutionary times are smaller than 0.4), and therefore protein pairs with larger evolutionary times and their associated jumps are underrepresented in our dataset, which may also explain the underestimation.

An additional possibility is that ETDBN does not attempt to model global dependencies. Echave and Fernandez (2010) use a LFENM model (which does take into account global dependencies) and provide evidence showing that the majority of structural changes is due to collective global deformations rather than local deformations. A local model such as ETDBN, by definition, does not take into account global dependencies and therefore does not fully account for their contribution to structural divergence.

Evaluation of the Model

The conditional independence structure in (eq. 1) enables computationally efficient sampling from the model under different combinations of observed or missing data. For example, ETDBN can be used to sample (i.e., predict) the dihedral angles of a protein from its corresponding amino acid sequence, a homologous amino acid sequence, a homologous set of dihedral angles, the corresponding secondary structure, a homologous secondary structure, or any combination of them.

Predictive accuracy was measured using 38 homologous protein pairs in the test dataset. For every protein pair ($p_a$, $p_b$), the dihedral angles of $p_b$ were treated as missing, and these missing dihedral angles were sampled under the model given a particular combination of observation types. The average angular distance between the sampled and known dihedral angles was used as the measure of predictive accuracy.

Figure 7 gives an example of predictive accuracy under different combinations of observation types, overlaid on a cartoon structure of the protein being predicted, whereas figure 8 provides a representative view of predictive accuracy across 10 different protein pairs in the test dataset for different combinations of observation types. We highlight some of the key patterns identified in figures 7 and 8 as follows.

Combination 1 refers to random sampling from the model, implying that no data observations were conditioned on besides the respective lengths of proteins $p_a$ and $p_b$. The average angular distance between the true and predicted dihedral angles was 1.6. Random sampling acts as a baseline for predictive accuracy. It is apparent from figure 7 that the model has a propensity to predict right-handed $\alpha$-helices, which occupy the most populated region in the Ramachandran plot.

Under combination 2, only the amino acid sequence corresponding to $p_b$ is observed. As expected, in figures 7 and 8 there is an increase in predictive accuracy with the addition of the amino acid sequence relative to combination 1.

Under combination 3, we add in the amino acid sequence of a homologous protein ($p_a$). In all ten cases there is an improvement in predictive accuracy. The improvement in predictive accuracy is reasonable, as knowledge of the sequence evolutionary trajectory is expected to encode information about structure evolution and hence will inform the dihedral angle conformational possibilities.

Under combination 4, in addition to the two amino acid sequences, we treat the homologous secondary structure as observed. This results in a substantial improvement in predictive accuracy, as one would expect. Knowledge of the amino acid sequence and a homologous secondary structure strongly informs regions of the Ramachandran plot that are likely to be occupied.

Under combination 5 (which we consider the canonical combination, that is, the standard homology modeling scenario), we treat both amino acid sequences as observed, as well as the dihedral angles of the homologous protein ($p_a$). In all cases the predictive accuracy improves over combination 4. This is anticipated, as the homologous dihedral angles are expected to be the best proxy for missing dihedral angles and are therefore expected to be more informative than secondary structure alone. Note that the availability of a homologous amino acid sequence pair, here and in combination 4, is consequential, as it informs the evolutionary time parameter $t_{ab}$, which will typically constrain the distribution over dihedral angles and reduce the associated uncertainty.

Finally, in combination 6 the same data observations as in combination 5 are used, except that the alignment is treated as given a priori (by the HOMSTRAD alignment) rather than as unobserved. The HOMSTRAD alignment is based on a structural and sequence alignment of $p_a$ and $p_b$, and is therefore expected to encode a higher degree of homology and structural information than combination 5 (where the alignment is treated as unobserved and therefore a marginalization over alignments is performed). On average there is a slight improvement in predictive accuracy when fixing the alignment, albeit the magnitude of improvement is not substantial. This demonstrates the accuracy of the alignment HMM.

FIG. 3. BIC scores (points) and number of free parameters (curve) as a function of the number of hidden states across 14 models (indicated above the dotted vertical lines) trained using the 1,200 protein pairs in the training dataset. A 64 hidden state model had the lowest BIC score. Each model represents the best of several attempts.

FIG. 4. Ramachandran plots depicting sampled and empirical dihedral angle distributions. The top row depicts the distributions for all amino acids, the middle for glycine only, and the bottom for proline only. The leftmost plots show dihedral angles sampled under the jump model, whereas the rightmost plots show the empirical distributions of dihedral angles in the training dataset.

The alignment HMM accounts for alignment uncertainty in a principled manner, which is particularly useful when an appropriate alignment is unavailable. However, it should be noted that inference scales as $O(|p_a||p_b|h^2)$ when treating the alignment as unobserved. Inference scales as $O(mh^2)$ when the alignment is fixed a priori, where $m$ is the length of alignment $M_{ab}$ and is typically much smaller than $|p_a||p_b|$.
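To give a sense of the gap between the two regimes, the following Python sketch counts elementary operations up to a constant factor. It assumes that $h$ denotes the number of hidden states; the protein lengths and alignment length are hypothetical.

```python
def inference_cost(len_a, len_b, n_hidden, alignment_length=None):
    """Rough operation count for inference, up to a constant factor."""
    if alignment_length is None:
        # Alignment unobserved: every pair of positions is considered.
        return len_a * len_b * n_hidden ** 2
    # Alignment fixed a priori: one step per alignment column.
    return alignment_length * n_hidden ** 2

# Hypothetical protein lengths and alignment length (illustration only).
unobserved = inference_cost(300, 320, 64)
fixed = inference_cost(300, 320, 64, alignment_length=340)
ratio = unobserved / fixed
```

For these illustrative lengths, marginalizing over alignments costs roughly 280 times more than conditioning on a fixed alignment, which is the price paid for accounting for alignment uncertainty.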

It should be emphasized that we do not expect ETDBN to compete with structure prediction packages such as Rosetta (Rohl et al. 2004) or homology modeling software such as that of Arnold et al. (2006) in terms of predictive accuracy. Our current model is a local model of structure evolution; it is not even expected to capture fundamental constraints such as the radius of gyration of a protein or other global features typical of proteins.

Evolutionary Hidden States Reveal a Common Evolutionary Motif

One benefit of ETDBN is that the 64 evolutionary hidden states learned during the training phase are interpretable. We give an example of a hidden state encoding a jump event that was subsequently found to represent an evolutionary motif present in a large number of protein pairs in our test and training datasets.

Evolutionary hidden state 3 (fig. 9) was selected from the 64 hidden states as an example of a hidden state encoding a jump event and capturing angular shift (a large change in dihedral angle). A notable feature of this hidden state is that the change in dihedral angles between site-classes $r_1$ and $r_2$ is associated with specific amino acid changes. In site-class $r_1$ the amino acid frequencies are relatively spread out amongst a number of amino acids, whereas in site-class $r_2$ the frequencies are particularly concentrated in favor of glycine (Gly) and asparagine (Asn), with glycine being significantly more probable in site-class $r_2$ than $r_1$. This suggests that, conditioned on hidden state 3, an exchange between glycine and another amino acid is likely indicative of a jump and hence a corresponding change in dihedral angle. This is consistent with what we find in a subsequent analysis of evolutionary motifs. This particular jump occurs in coil regions.

Having selected hidden state 3, positions in 238 protein pairs were analyzed for evidence of the corresponding evolutionary motif: 38 protein pairs in the test dataset and a further 200 from the training dataset were analyzed using the criteria described in the Methods section. Using the first criterion, 84 protein sites in 59 protein pairs corresponding to $H^i = 3$ (evolutionary hidden state 3) were identified. Of the 84 protein sites, 34 met the second criterion.

We give an example of a homologous protein pair illustrating the identified evolutionary motif. Two histidine-containing phosphocarriers, 1pch (Mycoplasma capricolum) and 1poh (Escherichia coli), were identified as having the evolutionary motif (fig. 10) at homologous site E39/G39.

FIG. 5. Scatterplot comparing evolutionary times estimated using pairs of homologous amino acid sequences only versus pairs of homologous sets of dihedral angles only for N = 38 protein pairs in the test dataset. The x-coordinate of each point gives the estimated evolutionary time based only on the amino acid sequence, whereas the y-coordinate gives the estimated evolutionary time based only on the dihedral angles. The diagonal line represents y = x.

FIG. 6. Evolutionary time versus angular distances between real and corresponding sampled protein pairs in the training dataset at 50 representative evolutionary times. Mean angular distances (see 'Calculation of angular distances' in the Supplementary Material online) between the real dihedral angles in a protein pair (red; $X_a$ and $X_b$) and sampled dihedral angles in a sampled protein pair (blue; $X_a$ and $X_b$) were compared to test how well the sampled dihedral angles reproduced the real angular distances. The dihedral angles ($X_b$) of each sampled protein pair were sampled by conditioning on both amino acid sequences and the homologous dihedral angles ($A_a$, $A_b$, $X_a$) and the estimated evolutionary time ($t_{ab}$) for the real protein pair. The regression curves were obtained by a quadratic LOcally-weighted regrESSion (LOESS) with smoothing parameter chosen by leave-one-out cross-validation. The 95% confidence intervals for the mean assume error normality.

Most positions in the homologous pair have low posterior jump probabilities (≈ 0.0), with the exception of positions N38/N38 and E39/G39, which both have high posterior jump probabilities (≈ 1.0). The exchange between a glutamate (at position 39 in 1poh) and a glycine (at position 39 in 1pch) appears to be responsible for the shift in dihedral angle. This exchange corresponds to a significant jump in dihedral angle, from $\langle \phi_{1poh,E39}, \psi_{1poh,E39} \rangle = \langle -1.63, 0.06 \rangle$ to $\langle \phi_{1pch,G39}, \psi_{1pch,G39} \rangle = \langle 1.40, 0.22 \rangle$. The angular distance between the two dihedral angles is 2.01. This is consistent with the amino acid frequency parameters specified by the two site-classes for hidden state 3 (fig. 9).

Site-class $r_1$ indicates that a number of amino acids (alanine, aspartic acid, glycine, histidine, lysine, asparagine, proline, glutamine, arginine, serine, and threonine) other than glutamate plausibly coincide with the particular dihedral angle conformation specified by site-class $r_1$. The involvement of glycine in a jump is not surprising, as it is a small and flexible amino acid, whereas the role of asparagine is less clear. In our analysis of 238 protein pairs, we found that of the seven positions meeting the criteria for hidden state 3 and involving an exchange with asparagine (Asn), four were an exchange between an asparagine and a glycine, whereas the remaining three were between asparagine and one of lysine, histidine, or serine.

Using Dihedral Angles for Alignment

A valuable feature of our model is its ability to account for alignment uncertainty by summing over possible pairwise alignments, using the TKF92 model as a prior distribution over indel histories, whilst simultaneously taking into account neighboring dependencies amongst aligned sites. Doing so results in a sample of alignments rather than a single alignment. Nevertheless, a single Maximum A Posteriori (MAP) pairwise alignment may be obtained from the alignment samples and used for downstream analysis.

ETDBN and several other alignment methods (namely StatAlign, BAli-Phy, MUSCLE, and MAFFT) were used to infer pairwise alignments from simulated and real data under various combinations of data observations, for example, an amino acid sequence pair ($A_a$, $A_b$), a secondary structure sequence pair ($S_a$, $S_b$), a dihedral angle sequence pair ($X_a$, $X_b$), and combinations thereof.

In the first set of benchmarks (fig. 11A), pairs of proteins were simulated from the ETDBN model conditioned on 38 different pairwise alignments and corresponding evolutionary times. This resulted in a set of 38 simulated pairwise alignments together with corresponding observations, implying that the true underlying alignments were known for each of the simulated protein pairs. ETDBN and a number of other alignment methods were used to infer pairwise alignments for each. The alignment similarity metric (Schwartz et al. 2005) was used to measure the similarity between the inferred alignments and the true alignments, where higher similarity indicates better predictions. It was found that, when using the simulated amino acid sequences alone, ETDBN (11A5) outperformed all four other methods tested (11A1: MUSCLE, 11A2: MAFFT, 11A3: StatAlign, and 11A4: BAli-Phy). However, the greater performance of ETDBN compared with other methods cannot be considered a fair comparison, as the data were simulated under the ETDBN model.

FIG. 7. Cartoon structure representations of E. coli glyceraldehyde-3-phosphate dehydrogenase (PDB 1gad) are depicted in each panel, overlaid with predictive accuracy when using different combinations of observed data to predict missing dihedral angles in 1gad. Thermus aquaticus glyceraldehyde-3-phosphate dehydrogenase (PDB 1cer) was used as a homolog for the purposes of prediction. Predictive accuracy is indicated using a color gradient depicting the mean angular distance between the true dihedral angle ($X^i_{1gad}$) and the predicted (sampled) dihedral angles ($\hat{X}^i_{1gad}$) at each amino acid position. The label at the bottom of each panel indicates the data combination used. In (A), no data was used for prediction. In (B), only the amino acid sequence corresponding to 1gad ($A_{1gad}$) was used. In (C), the amino acid sequence of 1gad ($A_{1gad}$) and the amino acid sequence of the homologous protein ($A_{1cer}$) were used. In (D), both amino acid sequences ($A_{1cer}$ and $A_{1gad}$) and the secondary structure of the homologous protein ($S_{1cer}$) were used. In (E), both amino acid sequences ($A_{1cer}$ and $A_{1gad}$) and the dihedral angles of the homologous protein ($X_{1cer}$) were used. Finally, in panel (F), the same combination of observations was used as in (E), but the alignment was treated as known a priori.

FIG. 8. Benchmarks of predictive accuracy (measured using angular distance; lower is better) on a random subset of ten protein pairs in the test dataset, giving a representative view of predictive accuracy under six different combinations of observations. The dihedral angles $X_b$ of $p_b$ were treated as missing and were sampled under the model, whereas $p_a$ was a homologous protein used for the purposes of prediction. See the legend of figure 7 for a description of each combination (A–F). The final set of bars, denoted 'Mean (N = 38)', are the mean values for the entire test dataset of N = 38 protein pairs. The error bars are the standard errors.

FIG. 9. Depiction of evolutionary hidden state 3. This hidden state was sampled at 0.69% of sites (the average was 1.56%). The equilibrium frequencies of $r_1$ and $r_2$ were $\pi_1 = 0.691$ and $\pi_2 = 0.309$, respectively. The jump rate was $\gamma = 3.176$. The corresponding site-class pair probabilities are depicted to the right as a function of evolutionary time. Note that the dashed red lines depicting the probabilities for (1, 2) and (2, 1) superimpose exactly because the probabilities are equal; this holds for the jump probabilities of all hidden states, as it is required for time-reversibility. In the main figure, the two rows depict the parameters encoded by the two site-classes, respectively. Columns 1 and 3 depict the parameters governing the amino acid and secondary structure stochastic processes, respectively. The secondary structure classes correspond to H = helix, S = sheet, and C = coil. Column 2 depicts the WN diffusions. The stationary distributions of the WN diffusions are shown using black contour lines, the directions of the drifts are indicated by the arrows, and the magnitude of the drifts at each position is indicated using the color gradient.

More revealing in figure 11A was the alignment similarity under ETDBN when using different combinations of simulated data observations. It was found that secondary structure alone (11A6) performed the worst, which is unsurprising given that only three states were available to align the proteins. The second worst in terms of alignment similarity was amino acid sequences alone (11A5), followed by amino acid sequences and secondary structures (11A7). Interestingly, using dihedral angles only (11A8) outperformed both 11A5 (sequences only) and 11A6 (secondary structures only). Finally, using amino acid sequence together with dihedral angles (11A9) or all three data types combined (11A10) outperformed all other combinations. This illustrates that, at least under simulation conditions, increasing the number of data observations results in better alignment accuracy.

Following that, the various alignment methods were benchmarked against 38 pairwise alignments consisting of real sequence and structure observations in the test dataset. These pairwise alignments were obtained from the HOMSTRAD alignments. The sequence identity of these pairwise alignments ranged from 10% to 93%, with an average sequence identity of 39%. In addition to methods 1–10 in figure 11A, five structural alignment methods were also used: 11, StatAlign (Herman et al. 2014); 12, TMAlign (Zhang and Skolnick 2005); 13, Mammoth (Ortiz et al. 2002); 14, Dali (Holm and Rosenstrom 2010); and 15, CE (Shindyalov and Bourne 1998). These methods were not used in 11A due to the lack of an appropriate model for simulating the evolution of three-dimensional protein structures.

When benchmarking the MAP estimated alignments against the HOMSTRAD alignments (fig. 11B), using real sequences alone for inference $(A_a, A_b)$, ETDBN (11B.5) had a similar degree of accuracy when compared with several other sequence-based methods (11B.1: StatAlign, 11B.2: BAli-Phy, 11B.3: MUSCLE, and 11B.4: MAFFT). This demonstrates that ETDBN has performance comparable to that of other commonly used sequence alignment methods.

Using $(S_a, S_b)$ alone, ETDBN (11B.6) had substantially lower alignment similarity compared with sequence only, which was expected given that a similar result was obtained for the simulated data (11A.6). However, when including the real sequences (11B.7), the predictive accuracy was once again comparable to sequence-only inferences (11B.1–11B.5).

FIG. 10. Depiction of two histidine-containing phosphocarriers (PDB: 1pch and 1poh) superimposed. On the left is a cartoon representation of the two proteins, corresponding to regions F29–K49 and F29–A42, respectively, with posterior jump probabilities at each position overlaid. On the right is a ball-and-stick representation giving atomic detail for a smaller region (I36–G42 and T36–A42, respectively). The exchange between a glutamate (E39 in 1poh) and a glycine (G39 in 1pch) is associated with a large change in dihedral angle, as indicated by the curved arrows.


When using $(X_a, X_b)$ alone (11B.8), the alignment similarity was found to be somewhat worse than in the sequence-only cases. Furthermore, when introducing the sequences (11B.9) and secondary structures (11B.10) in addition to the dihedral angles, the similarity remained worse than for the sequence-only methods (11B.1–11B.5), despite the additional information. These results are in contrast to the results we obtained for simulated data (11A.8–11A.10).

The non-statistical structural alignment methods (11B.12–11B.15) fared the best, likely because they use criteria similar to those used to construct the HOMSTRAD alignments. When interpreting these results, it is important to note that the HOMSTRAD alignments should not be considered the true underlying alignments and may even be strongly biased. For example, they may favor the closest structural superimposition of structures or the most parsimonious alignments with the fewest indels. In the evolutionary modeling context, our goal is to distinguish between homologous sites (sites that have evolved via mutation alone) and indels. In practice, it is extremely difficult to obtain the true underlying alignment (sets of homologies and indels), because it would require an experiment where every indel event since the common ancestor is observed, a seemingly impossible task outside of simulation or laboratory conditions.

After further investigation, the trend of lower alignment similarity seen in figure 11B.6–11B.10 when using ETDBN with structural observations, compared with sequence-only or non-statistical structural alignment methods, was found to reverse (11C.6–11C.10) upon calculating the precision of predicting homologous sites (the fraction of sites predicted as homologous that were correctly predicted as such). Therefore, when only dihedral angle observations are used, ETDBN underpredicts the number of homologous sites; however, when a homologous site is predicted, it is correctly predicted more often than when using only amino acid sequences. In particular, ETDBN predicted fewer homologous sites with coiled secondary structure compared with homologous sites with helical or sheet secondary structure. This pattern of results may be in part due to the WN diffusion used to model the evolution of dihedral angles. The WN diffusion is suitable for modeling angular drift (small changes in angles localized around a region of the Ramachandran plot), but does not sufficiently capture angular shift (large changes in angles between regions of the Ramachandran plot, which are more likely in coiled regions) due to stationarity. As noted before, the jump model is an abstraction intended to capture the end-points of evolution by allowing a jump between two regions of the Ramachandran plot, abstracting a potential intermediate evolutionary trajectory for the sake of computational tractability. Note that the jump model accurately captures the common cases where a single mutation induces a large conformational shift.
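The distinction between underpredicting homologous sites and predicting them precisely can be made concrete with a short sketch. It assumes a pairwise alignment is represented as a list of columns `(i, j)`, with `None` marking a gap; this representation and the function names are hypothetical and are not taken from the ETDBN software.

```python
def homologous_pairs(alignment):
    """Return the set of aligned (homologous) site pairs in an alignment.

    `alignment` is a list of columns (i, j), where i and j are site indices
    in the two proteins and None marks a gap (an indel column).
    """
    return {(i, j) for i, j in alignment if i is not None and j is not None}

def precision(predicted, reference):
    """Fraction of predicted homologous pairs that are also in the reference."""
    pred, ref = homologous_pairs(predicted), homologous_pairs(reference)
    return len(pred & ref) / len(pred) if pred else float("nan")

def recall(predicted, reference):
    """Fraction of reference homologous pairs that were recovered."""
    pred, ref = homologous_pairs(predicted), homologous_pairs(reference)
    return len(pred & ref) / len(ref) if ref else float("nan")

# Toy example: the prediction replaces one homologous column by two indels.
reference = [(0, 0), (1, 1), (2, 2), (3, 3)]
predicted = [(0, 0), (1, 1), (2, None), (None, 2), (3, 3)]
```

In this toy case the prediction has a precision of 1.0 but recovers only three of the four reference pairs, mirroring the pattern described above: fewer homologous sites are predicted, but those that are predicted tend to be correct.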

Concluding Remarks

The main achievement of this work is a computationally tractable, generative, and interpretable probabilistic model of protein sequence and structure evolution on a local scale.

FIG. 11. Alignment benchmarks. (A) Averages of alignment similarity across different methods and combinations of data observations, where the observations were simulated from the ETDBN model conditioned on a single known alignment. (B) Averages of alignment similarity, where the 38 HOMSTRAD alignments in the test dataset were taken as the true alignments. (C) Averages of precision in predicting homologous site pairs in 38 sequence pairs from the test dataset, where homologous site pairs in the HOMSTRAD alignments were taken to be the true homologous site pairs.


Previous stochastic models of protein sequence and structure evolution emphasized estimation of evolutionary parameters (Challis and Schmidler 2012; Herman et al. 2014). ETDBN is somewhat of a departure from these previous models, but is likewise capable of estimating evolutionary parameters. We show that estimates of evolutionary times inferred under ETDBN are consistent regardless of whether amino acid sequence or dihedral angle observations are used. In addition, the relationship between evolutionary time and angular distance in real proteins is adequately recapitulated in protein pairs sampled under the model, albeit the angular distance is underestimated for larger evolutionary times, which might be explained by the limited flexibility of the jump model and the lack of global dependencies in the model.

Like previous models, ETDBN is capable of dealing with alignment uncertainty by marginalizing over indel histories; it predicts pairwise MAP consensus alignments with accuracy similar to that of score-based and statistical alignment methods.

The generative nature of ETDBN allows us to demonstrate that the underlying empirical distributions over dihedral angles (depicted using Ramachandran plots) are captured and that the model is capable of predicting missing observations, such as dihedral angles, from a variety of different data types: for example, an amino acid sequence, a homologous amino acid sequence, a homologous secondary structure, a homologous set of dihedral angles, or any combination thereof.

Based on its local nature, ETDBN does not constitute a homology modeling method in itself. Rather, it can be used as a building block, much like fragment libraries model local structure in protein structure prediction methods. ETDBN places the homology modeling problem on a statistical footing, enabling a number of approaches to later be used, such as multi-level modeling, that is, combining fine-grained distributions (for example, distributions over dihedral angles, such as ETDBN) and coarse-grained distributions (for example, distributions describing the global properties of proteins, such as compactness). In particular, a method referred to as 'the Reference Ratio method' can be used to combine fine-grained and coarse-grained distributions in a statistically principled manner (see Frellsen et al. 2012; Hamelryck et al. 2010). However, we have shown that the current model can already be used for the inference of evolutionary parameters in its present form.

In addition to multi-level modeling, probabilistic models such as ETDBN allow one to account for and to make statements about uncertainty (e.g., with respect to evolutionary time, alignment, etc.) in a rigorous manner. In principle, ETDBN, like TorusDBN (Boomsma et al. 2008), could be used as a proposal distribution. In other words, ETDBN could be used to sample protein structures (possibly conditioned on various data observations) in a computationally efficient manner, such that the resulting samples are expected to be located in regions of high probability density with respect to the true underlying distribution.

A final key feature of our evolutionary model is its interpretable nature. This interpretability enables the identification of potential evolutionary motifs, that is, common patterns of sequence–structure evolution. We identify one such evolutionary motif in 34 different homologous protein pairs. A major direction for future research is the further identification of such evolutionary motifs. Understanding these evolutionary motifs may (1) improve homology modeling predictions, (2) provide more accurate estimates of evolutionary parameters, and (3) produce better models of protein evolution that more realistically capture evolutionary trajectories through sequence and structure space, which may help identify functionally relevant positions that are potential drug targets.

Future Challenges

Pairwise to Phylogeny

For reasons of computational tractability, the implemented model is pairwise, but it is theoretically possible to generalize it to a phylogeny, such as in Herman et al. (2014). In practice, for three or more sequences on a phylogeny, it is necessary to marginalize out the unobserved ancestral protein states in order to compute likelihoods. Felsenstein's algorithm can be used to marginalize over discrete ancestral states, such as amino acids, in a computationally efficient manner. However, we do not know whether a similarly efficient algorithm exists for marginalizing the continuous ancestral dihedral angle states under the WN diffusion, thereby necessitating a more expensive MCMC algorithm. A possibly greater computational hindrance to considering a phylogeny is the alignment problem, which scales as $O(l_1 l_2 \cdots l_N)$, where $l_i$ is the length of sequence $i$ and $N$ is the number of sequences, although MCMC approaches are possible (Herman et al. 2014).
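To give a sense of how quickly the alignment problem grows with the number of sequences, the following sketch simply evaluates the product of sequence lengths from the scaling stated above. The function name and the example lengths are illustrative assumptions.

```python
from math import prod

def alignment_state_space(lengths):
    """Number of cells in a full dynamic-programming alignment table.

    For sequences of lengths l_1, ..., l_N the table has l_1 * l_2 * ... * l_N
    cells, matching the O(l_1 l_2 ... l_N) scaling of exact alignment.
    """
    return prod(lengths)

# Hypothetical example: proteins of 150 residues each.
pairwise_cells = alignment_state_space([150, 150])
three_way_cells = alignment_state_space([150, 150, 150])
```

Moving from two to three sequences of this length multiplies the table size by a further factor of 150, which is why exact marginalization over alignments becomes impractical on larger phylogenies.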

Context-Dependence

Although we believe our model provides a substantial improvement over current stochastic models of sequence and structural evolution, there is still scope for improvement. The WN diffusions used to model dihedral angle evolution adequately capture angular drift (small local changes in dihedral angle), but are less capable of capturing angular shift (large changes in dihedral angle). This is to a considerable extent mitigated by the introduction of jump events, as discussed before. A more realistic model would model the entire evolutionary trajectory, allowing an arbitrary number of switches between site-classes, together with neighboring dependencies amongst adjacent sites along the evolutionary trajectory. Similar context-dependent models are typically computationally expensive and require sophisticated inference procedures (Robinson et al. 2003; Yu and Thorne 2006).

Software Availability

Julia code (tested on both Windows and Linux platforms) is available at http://www.computingforbiology.org/software/etdbn


Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

The authors acknowledge funding from the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN). The second author acknowledges support from project MTM2016-76969-P from the Spanish Ministry of Economy and Competitiveness and ERDF. The authors acknowledge valuable comments from three referees that led to substantial improvements in the manuscript, as well as initial discussions with Christian Havn, Ian Lim, and Mathias Cronjäger.

References

Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2):195–201.

Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T. 2008. A generative, probabilistic model of local protein structure. Proc Natl Acad Sci U S A 105(26):8932–8937.

Boomsma W, Tian P, Frellsen J, Ferkinghoff-Borg J, Hamelryck T, Lindorff-Larsen K, Vendruscolo M. 2014. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proc Natl Acad Sci U S A 111(38):13852–13857.

Challis CJ, Schmidler SC. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol 29(11):3575–3587.

Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett 457(4):413–416.

Echave J, Fernandez FM. 2010. A perturbative view of protein structural variation. Proteins Struct Funct Bioinform 78(1):173–180.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797.

Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat 125(1):1–15.

Frellsen J, Mardia KV, Borg M, Ferkinghoff-Borg J, Hamelryck T. 2012. Towards a general probabilistic model of protein structure: the reference ratio method. In: Bayesian methods in structural bioinformatics. New York: Springer. p. 125–134.

García-Portugués E, Sørensen M, Mardia KV, Hamelryck T. 2017. Langevin diffusions on the torus: estimation and applications. arXiv:1705.00296.

Gilks WR, Richardson S, Spiegelhalter D. 1995. Markov chain Monte Carlo in practice. London: CRC Press.

Grishin NV. 1997. Estimation of evolutionary distances from protein spatial structures. J Mol Evol 45(4):359–369.

Grishin NV. 2001. Fold change in evolution of protein structures. J Struct Biol 134(2):167–185.

Gutin AM, Badretdinov AY. 1994. Evolution of protein 3D structures as diffusion in multidimensional conformational space. J Mol Evol 39(2):206–209.

Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol 15(7):910–917.

Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. 2010. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE 5(11):e13714.

Herman JL, Challis CJ, Novak A, Hein J, Schmidler SC. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol 31(9):2251–2266.

Holm L, Rosenstrom P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res 38(Web Server issue):W545–W549.

Katoh K, et al. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30(14):3059–3066.

Koshi JM, Goldstein RA. 1998. Models of natural mutations including site heterogeneity. Proteins 32(3):289–295.

Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol 21(6):1095–1109.

Lio P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14(8):726–733.

Miklos I, Lunter G, Holmes I. 2004. A long indel model for evolutionary sequence alignment. Mol Biol Evol 21(3):529–540.

Mizuguchi K, Deane CM, Blundell TL, Overington JP. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci 7(11):2469–2471.

Ortiz AR, Strauss CE, Olmea O. 2002. Mammoth (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 11(11):2606–2621.

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol 20(10):1692–1704.

Rohl CA, Strauss CE, Misura KM, Baker D. 2004. Protein structure prediction using Rosetta. Methods Enzymol 383:66–93.

Schwartz AS, Myers EW, Pachter L. 2005. Alignment metric accuracy. arXiv preprint q-bio/0510052.

Shindyalov IN, Bourne PE. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 11(9):739–747.

Siepel A, Haussler D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol 11(2–3):413–428.

Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol 34(1):3–16.

Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18(5):691–699.

Yu J, Thorne JL. 2006. Dependence among sites in RNA evolution. Mol Biol Evol 23(8):1525–1537.

Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33(7):2302–2309.



proteins were used ($A_a, A_b, X_a$ or $A_a, A_b, X_b$) to infer the posterior probability, this time at a lower threshold: $p(H^i, r^i_a \neq r^i_b \mid A_a, A_b, X_a) > 0.50$ or $p(H^i, r^i_a \neq r^i_b \mid A_a, A_b, X_b) > 0.50$. This second criterion ensured that the evolutionary motif was identifiable under typical conditions where one has limited access to structural information (in this case, a single protein structure in a pair). Only those aligned sites meeting both criteria were selected for downstream analysis.

Statistical Alignment: Modeling Insertions and Deletions

Protein sequences can undergo not only amino acid transitions due to underlying nucleotide mutations in the coding sequence, but also indel events. To account for this, a modified pairwise TKF92 alignment HMM based on Miklos et al. (2004) was implemented. The TKF92 alignment HMM was augmented with the ETDBN evolutionary hidden states in order to capture local sequence and structure evolutionary dependencies. Furthermore, it was modified such that neighboring dependencies amongst hidden states at adjacent alignment sites were modeled. For more details, see 'Statistical alignment' in the Supplementary Material online.

Whilst it is possible to fix the alignment in advance by pre-aligning the sequences using one of the many available alignment methods (Katoh et al. 2002; Edgar 2004) or by using a curated alignment (such as from the HOMSTRAD database), doing so ignores alignment uncertainty.

Training and Test Datasets

A training dataset of 1,200 protein pairs (2,400 proteins; 417,870 site observation pairs) and a test dataset of 38 protein pairs (76 proteins; 14,125 site observation pairs) were assembled from 1,032 protein families in the HOMSTRAD database. For further details, see 'Construction of test and training datasets' in the Supplementary Material online.

Model Training and Selection

Maximum likelihood estimation of the model parameters W was done using Stochastic Expectation Maximization (StEM; Gilks et al. 1995).

For further details of the E- and M-steps of the StEM algorithm and for details about model selection, please refer to 'Model training and selection' in the Supplementary Material online.

Results and Discussion

Selecting the Number of Hidden States

Fourteen models were trained (8, 16, 32, 48, 52, 56, 60, 64, 68, 72, 76, 80, 96, and 112 hidden state models). The 64 hidden state model was chosen as the best model, as it had the lowest Bayesian Information Criterion (BIC; fig. 3).
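Model selection by BIC can be sketched as follows, using the standard definition BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of observations, and L the maximized likelihood. The log-likelihoods and parameter counts below are made-up toy values for illustration only; they are not the values obtained for the ETDBN models.

```python
import math

def bic(log_likelihood, num_params, num_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return num_params * math.log(num_obs) - 2.0 * log_likelihood

def select_by_bic(candidates, num_obs):
    """Return the candidate with the lowest BIC, plus all scores.

    `candidates` maps a model label (here, a number of hidden states) to a
    tuple (maximized log-likelihood, number of free parameters).
    """
    scores = {name: bic(ll, k, num_obs) for name, (ll, k) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Toy values (hypothetical): richer models fit better but pay a larger penalty.
toy_candidates = {8: (-1000.0, 10), 16: (-900.0, 30), 32: (-890.0, 80)}
best_model, bic_scores = select_by_bic(toy_candidates, num_obs=1000)
```

In the toy example the intermediate model wins: its improvement in log-likelihood over the smallest model outweighs the extra parameter penalty, whereas the largest model gains too little likelihood to justify its additional parameters.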

Stationary Distributions over Dihedral Angles Capture the Empirical Distribution

Figure 4 illustrates the sampled and empirical dihedral angle distributions. There is a good correspondence between dihedral angles sampled under the model (fig. 4, left) and the empirical distribution of dihedral angles in our training dataset (fig. 4, right) for all three cases illustrated (all amino acids, glycine only, and proline only). The correspondence is not surprising, given that ETDBN is effectively a mixture model with a large number of mixture components.

Estimates of Evolutionary Time from Dihedral Angles Are Consistent with Estimates from Sequence

Whilst ETDBN has a general scope with respect to applications (including acting as a proposal distribution or as a building block in a homology modeling application), we envision the primary application being inference of evolutionary parameters.

Figure 5 compares evolutionary times estimated using only pairs of homologous amino acid sequences versus only pairs of homologous dihedral angles. As desired, the two estimates of evolutionary time for each protein pair are similar, as can be seen by the proximity of the points to the identity line.

A paired t-test gave a p-value of 0.578, thus failing to reject the null hypothesis that there is no difference between branch lengths estimated using sequence only versus angles only. This indicates that there is sufficient evolutionary information in the dihedral angles to estimate the evolutionary times and that the model is consistent in its estimates, lacking a significant tendency to underestimate or overestimate the evolutionary times when either sequence or dihedral angles are used.
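For readers unfamiliar with the test, the paired t-statistic can be sketched with the Python standard library alone. The function name and the example values are hypothetical and are not the estimates from the paper; converting the statistic into a p-value additionally requires the t distribution with n - 1 degrees of freedom (for instance via `scipy.stats.ttest_rel`, which is not used here to keep the sketch dependency-free).

```python
import math
import statistics

def paired_t_statistic(x, y):
    """Paired t-statistic and degrees of freedom for matched samples x and y.

    The test works on the per-pair differences d_i = x_i - y_i:
        t = mean(d) / (stdev(d) / sqrt(n)),  with n - 1 degrees of freedom.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical evolutionary times estimated from sequences and from angles.
times_from_sequence = [0.10, 0.25, 0.40, 0.55]
times_from_angles = [0.15, 0.20, 0.45, 0.50]
t_stat, dof = paired_t_statistic(times_from_sequence, times_from_angles)
```

A t-statistic close to zero, as in this toy example where the differences cancel out, corresponds to a large p-value and hence no evidence of a systematic difference between the two sets of estimates.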

Interestingly, the variance in the sampled evolutionary times is higher when dihedral angles only are used as compared with sequence only (see fig. S1, Supplementary Material online).

The Relationship between Evolutionary Time and Angular Distance Is Adequately Modeled

We investigated the relationship between evolutionary time and angular distance between real protein pairs and protein pairs where the dihedral angles of $p_b$ ($X_b$) were treated as missing and hence sampled (fig. 6).

As expected, for both real and sampled pairs, angular distance tends to increase as a function of evolutionary time. For larger evolutionary times a plateau begins to emerge, which is expected, as the maximum possible theoretical angular distance is $\sqrt{8} \approx 2.828$.

When the evolutionary time is exactly zero ($t_{ab} = 0$), under our model the angular distance between sampled dihedral angles is exactly zero (not shown in fig. 6). However, this is not expected to be the case for real protein pairs when the two sequences are identical (due to the inherently flexible nature of proteins, different experimental conditions, experimental noise, etc.). It is therefore not surprising that the regression curve for the real protein pairs does not pass through zero.
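One distance on the torus that is consistent with the stated maximum of $\sqrt{8}$ is the chord-based distance sketched below, in which each angle is embedded on the unit circle. This is an assumption made for illustration; the definition actually used is given in 'Calculation of angular distances' in the Supplementary Material and may differ.

```python
import math

def angular_distance(angles_a, angles_b):
    """Chord-based distance between two (phi, psi) dihedral angle pairs.

    Each angle is mapped to a point on the unit circle, so the squared
    distance contributed by one angle is 2 * (1 - cos(difference)), which
    lies between 0 and 4. Summing over phi and psi gives a maximum distance
    of sqrt(8). Angles are in radians.
    """
    phi_a, psi_a = angles_a
    phi_b, psi_b = angles_b
    squared = 2.0 * (1.0 - math.cos(phi_a - phi_b)) \
            + 2.0 * (1.0 - math.cos(psi_a - psi_b))
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative rounding

# Identical angles give zero; angles differing by pi in both phi and psi
# give the maximum possible distance.
same = angular_distance((-1.0, 2.5), (-1.0, 2.5))
farthest = angular_distance((0.0, 0.0), (math.pi, math.pi))
```

Because the distance depends only on the cosine of the angle differences, it respects the periodicity of dihedral angles: a difference of 2π is treated as no difference at all.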


For small evolutionary times (< 0.2), the curves for the real and sampled protein pairs show a good correspondence; however, for larger evolutionary times the model tends to underestimate angular distances. This may reflect the fact that the tpd of the WN diffusion specified is localized around its mean even when the evolutionary time is large; therefore, dihedral angles distant from this mean are unlikely to be sampled. To a certain extent this is mitigated by the jump model, which occasionally allows for large changes in dihedral angle, but may still be somewhat limited in its flexibility, as jumps can only occur between two site-classes. The majority of protein pairs in our training dataset represent smaller evolutionary times (81.7% of evolutionary times are smaller than 0.4), and therefore protein pairs with larger evolutionary times and their associated jumps are underrepresented in our dataset, which may also explain the underestimation.

An additional possible explanation is that ETDBN does not attempt to model global dependencies. Echave and Fernandez (2010) use an LFENM model (which does take into account global dependencies) and provide evidence that the majority of structural change is due to collective global deformations rather than local deformations. A local model such as ETDBN by definition does not take into account global dependencies and therefore does not fully account for their contribution to structural divergence.

Evaluation of the Model

The conditional independence structure in (eq. 1) enables computationally efficient sampling from the model under different combinations of observed or missing data. For example, ETDBN can be used to sample (i.e., predict) the dihedral angles of a protein from its corresponding amino acid sequence, a homologous amino acid sequence, a homologous set of dihedral angles, the corresponding secondary structure, a homologous secondary structure, or any combination of them.

Predictive accuracy was measured using 38 homologous protein pairs in the test dataset. For every protein pair ($p_a$, $p_b$), the dihedral angles of $p_b$ were treated as missing, and these missing dihedral angles were sampled under the model given a particular combination of observation types. The average angular distance between the sampled and known dihedral angles was used as the measure of predictive accuracy.

Figure 7 gives an example of predictive accuracy under different combinations of observation types overlaid on a cartoon of the protein structure being predicted, whereas figure 8 provides a representative view of predictive accuracy across 10 different protein pairs in the test dataset for different combinations of observation types. We highlight some of the key patterns identified in figures 7 and 8 as follows.

Combination 1 refers to random sampling from the model, implying that no data observations were conditioned on besides the respective lengths of proteins $p_a$ and $p_b$. The average angular distance between the true and predicted dihedral angles was 1.6. Random sampling acts as a baseline for predictive accuracy. It is apparent from figure 7 that the model has a propensity to predict right-handed α-helices, which is the most populated region in the Ramachandran plot.

Under combination 2, only the amino acid sequence corresponding to $p_b$ is observed. As expected, in figures 7 and 8 there is an increase in predictive accuracy with the addition of the amino acid sequence relative to combination 1.

Under combination 3, we add in the amino acid sequence of a homologous protein ($p_a$). In all ten cases there is an improvement in predictive accuracy. The improvement in predictive accuracy is reasonable, as knowledge of the sequence evolutionary trajectory is expected to encode information about structure evolution and hence will inform the dihedral angle conformational possibilities.

Under combination 4, in addition to the two amino acid sequences, we treat the homologous secondary structure as observed. This results in a substantial improvement in predictive accuracy, as one would expect. Knowledge of the amino acid sequence and a homologous secondary structure strongly informs regions of the Ramachandran plot that are likely to be occupied.

Under combination 5 (which we consider the canonical combination: the standard homology modeling scenario), we treat both amino acid sequences as observed, as well as the dihedral angles of the homologous protein ($p_a$); in all cases the predictive accuracy improves over combination 4. This is anticipated, as the homologous dihedral angles are expected to be the best proxy for missing dihedral angles and are therefore expected to be more informative than secondary structure alone. Note that the availability of a homologous amino acid sequence pair here and in combination 4 is consequential, as it informs the evolutionary time parameter $t_{ab}$, which will typically constrain the distribution over dihedral angles and reduce the associated uncertainty.

Finally, in combination 6 the same data observations as in combination 5 are used, except that the alignment is treated as given a priori (by the HOMSTRAD alignment) rather than as unobserved. The HOMSTRAD alignment is based on a structural and sequence alignment of $p_a$ and $p_b$, and therefore is expected to encode a higher degree of homology and structural information than combination 5 (where the alignment is treated as unobserved and therefore a marginalization over alignments is performed). On average there is a slight improvement in predictive accuracy when fixing the alignment, although the magnitude of the improvement is not substantial. This demonstrates the accuracy of the alignment HMM.

FIG. 3. BIC scores (points) and number of free parameters (curve) as a function of the number of hidden states across 14 models (indicated above the dotted vertical lines) trained using the 1,200 protein pairs in the training dataset. A 64 hidden state model had the lowest BIC score. Each model represents the best of several attempts.

FIG. 4. Ramachandran plots depicting sampled and empirical dihedral angle distributions. The top row depicts the distributions for all amino acids, the middle for glycine only, and the bottom for proline only. The leftmost plots show dihedral angles sampled under the jump model, whereas the rightmost plots show the empirical distributions of dihedral angles in the training dataset.


The alignment HMM accounts for alignment uncertainty in a principled manner, which is particularly useful when an appropriate alignment is unavailable. However, it should be noted that inference scales as $O(|p_a||p_b|h^2)$ when treating the alignment as unobserved. Inference scales as $O(mh^2)$ when the alignment is fixed a priori, where $m$ is the length of alignment $M_{ab}$ and is typically much smaller than $|p_a||p_b|$.
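The practical difference between the two scalings can be illustrated with a rough operation count. The sketch assumes that $h$ denotes the number of hidden states and ignores constant factors; the function name and the protein and alignment lengths are hypothetical.

```python
def inference_cost(len_a, len_b, num_hidden, alignment_length=None):
    """Rough operation count for inference, ignoring constant factors.

    With the alignment unobserved the cost grows as |p_a| * |p_b| * h^2;
    with a fixed alignment of length m it grows as m * h^2.
    """
    h_squared = num_hidden ** 2
    if alignment_length is None:
        return len_a * len_b * h_squared
    return alignment_length * h_squared

# Hypothetical pair of proteins (150 and 160 residues) under a 64-state model,
# with a fixed alignment of 170 columns for comparison.
unobserved = inference_cost(150, 160, 64)
fixed = inference_cost(150, 160, 64, alignment_length=170)
speedup = unobserved / fixed
```

For these illustrative lengths, fixing the alignment reduces the count by a factor of roughly 140, which is the price paid for marginalizing over alignments rather than conditioning on a single one.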

It should be emphasized that we do not expect ETDBN to compete with structure prediction packages such as Rosetta (Rohl et al. 2004) or homology modeling software such as Arnold et al. (2006) in terms of predictive accuracy. Our current model is a local model of structure evolution; it is not even expected to capture fundamental constraints such as the radius of gyration of a protein or other global features typical of proteins.

Evolutionary Hidden States Reveal a Common Evolutionary Motif

One benefit of ETDBN is that the 64 evolutionary hidden states learned during the training phase are interpretable. We give an example of a hidden state encoding a jump event that was subsequently found to represent an evolutionary motif present in a large number of protein pairs in our test and training datasets.

Evolutionary hidden state 3 (fig. 9) was selected from the 64 hidden states as an example of a hidden state encoding a jump event and capturing angular shift (a large change in dihedral angle). A notable feature of this hidden state is that the change in dihedral angles between site-classes $r_1$ and $r_2$ is associated with specific amino acid changes. In site-class $r_1$ the amino acid frequencies are relatively spread out amongst a number of amino acids, whereas in site-class $r_2$ the frequencies are particularly concentrated in favor of glycine (Gly) and asparagine (Asp), with glycine being significantly more probable in site-class $r_2$ than $r_1$. This suggests that, conditioned on hidden state 3, an exchange between a glycine and another amino acid is likely indicative of a jump and hence a corresponding change in dihedral angle. This is consistent with what we find in a subsequent analysis of evolutionary motifs. This particular jump occurs in coil regions.

Having selected hidden state 3, positions in 238 protein pairs were analyzed for evidence of the corresponding evolutionary motif. 38 protein pairs in the test dataset and a further 200 from the training dataset were analyzed using the criteria described in the Methods section. Using the first criterion, 84 protein sites in 59 protein pairs corresponding to $H^i = 3$ (evolutionary hidden state 3) were identified. Of the 84 protein sites, 34 met the second criterion.

We give an example of a homologous protein pair illustrating the identified evolutionary motif. Two histidine-containing phosphocarriers, 1pch (Mycoplasma capricolum) and 1poh (Escherichia coli), were identified as having the evolutionary motif (fig. 10) at homologous site E39G39.

FIG. 5. Scatterplot comparing evolutionary times estimated using pairs of homologous amino acid sequences only versus pairs of homologous sets of dihedral angles only, for N = 38 protein pairs in the test dataset. The x-coordinate of each point gives the estimated evolutionary time based only on the amino acid sequence, whereas the y-coordinate gives the estimated evolutionary time based only on the dihedral angles. The diagonal line represents y = x.

FIG. 6. Evolutionary time versus angular distances between real and corresponding sampled protein pairs in the training dataset at 50 representative evolutionary times. Mean angular distances (see 'Calculation of angular distances' in the Supplementary Material online) between the real dihedral angles in a protein pair (red; $X_a$ and $X_b$) and sampled dihedral angles in a sampled protein pair (blue; $X_a$ and $X_b$) were compared to test how well the sampled dihedral angles reproduced the real angular distances. The dihedral angles ($X_b$) of each sampled protein pair were sampled by conditioning on both amino acid sequences and the homologous dihedral angles ($A_a$, $A_b$, $X_a$) and the estimated evolutionary time ($t_{ab}$) for the real protein pair. The regression curves were obtained by a quadratic LOcally-weighted regrESSion (LOESS) with smoothing parameter chosen by leave-one-out cross-validation. The 95% confidence intervals for the mean assume error normality.

Golden et al. doi:10.1093/molbev/msx137 MBE

Downloaded from https://academic.oup.com/mbe/article-abstract/34/8/2085/3772139 by Edward Boyle Library user on 08 November 2017

Most positions in the homologous pair have low posterior jump probabilities (≈0.0), with the exception of positions N38N38 and E39G39, which both have high posterior jump probabilities (≈1.0). The exchange between a glutamate (at position 39 in 1poh) and a glycine (at position 39 in 1pch) appears to be responsible for the shift in dihedral angle. This exchange corresponds to a significant jump in dihedral angle: $\langle \phi^{E39}_{1poh}, \psi^{E39}_{1poh} \rangle = \langle 1.63, 0.06 \rangle \rightarrow \langle \phi^{G39}_{1pch}, \psi^{G39}_{1pch} \rangle = \langle 1.40, 0.22 \rangle$. The angular distance between the two dihedral angles is 2.01. This is consistent with the amino acid frequency parameters specified by the two site-classes for hidden state 3 (fig. 9).

Site-class $r_1$ indicates that a number of amino acids (alanine, aspartic acid, glycine, histidine, lysine, asparagine, proline, glutamine, arginine, serine, and threonine) other than glutamate plausibly coincide with the particular dihedral angle conformation specified by site-class $r_1$. The involvement of glycine in a jump is not surprising, as it is a small and flexible amino acid, whereas the role of asparagine is less clear. In our analysis of 238 protein pairs, we found that, of the seven positions meeting the criteria for hidden state 3 and involving an exchange with asparagine (Asn), four were an exchange between an asparagine and a glycine, whereas the remaining three were between asparagine and one of lysine, histidine, or serine.

Using Dihedral Angles for Alignment

A valuable feature of our model is its ability to account for alignment uncertainty by summing over possible pairwise alignments, using the TKF92 model as a prior distribution over indel histories, whilst simultaneously taking into account neighboring dependencies amongst aligned sites. Doing so results in a sample of alignments rather than a single alignment. Nevertheless, a single Maximum A Posteriori (MAP) pairwise alignment may be obtained from the alignment samples and used for downstream analysis.
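The text does not state how the MAP alignment is extracted from the samples. One simple possibility, shown here purely as an assumption, is to take the most frequently sampled alignment and use its sample frequency as an empirical posterior estimate. The alignment encoding (a tuple of index pairs, with `None` marking a gap) and the function name are hypothetical.

```python
from collections import Counter


def map_alignment(samples):
    """Return the most frequently sampled alignment and its sample frequency."""
    counts = Counter(samples)
    alignment, n = counts.most_common(1)[0]
    return alignment, n / len(samples)


# Two hypothetical alignments of a 4-residue protein against a 3-residue protein.
# Each column is (index in p_a, index in p_b); None marks a gap.
aln_x = ((0, 0), (1, 1), (2, None), (3, 2))
aln_y = ((0, 0), (1, None), (2, 1), (3, 2))

samples = [aln_x, aln_y, aln_x, aln_x, aln_y]
best, freq = map_alignment(samples)
print(best, freq)
```

For long proteins, individual sampled alignments rarely repeat exactly, so a consensus built from per-column posterior frequencies may be more practical than counting whole alignments.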

ETDBN and several other alignment methods (namely StatAlign, BAli-Phy, MUSCLE, and MAFFT) were used to infer pairwise alignments from simulated and real data under various combinations of data observations: for example, an amino acid sequence pair ($A_a$, $A_b$), a secondary structure sequence pair ($S_a$, $S_b$), a dihedral angle sequence pair ($X_a$, $X_b$), and combinations thereof.

In the first set of benchmarks (fig. 11A), pairs of proteins were simulated from the ETDBN model conditioned on 38

FIG. 7. Cartoon structure representations of E. coli glyceraldehyde-3-phosphate dehydrogenase (PDB 1gad) are depicted in each panel, overlaid with predictive accuracy when using different combinations of observed data to predict missing dihedral angles in 1gad. Thermus aquaticus glyceraldehyde-3-phosphate dehydrogenase (PDB 1cer) was used as a homolog for the purposes of prediction. Predictive accuracy is indicated using a color gradient depicting the mean angular distance between the true dihedral angle ($X^i_{1gad}$) and the predicted (sampled) dihedral angles ($\hat{X}^i_{1gad}$) at each amino acid position. The label at the bottom of each panel indicates the data combination used. In (A), no data was used for prediction. In (B), only the amino acid sequence corresponding to 1gad ($A_{1gad}$) was used. In (C), the amino acid sequence of 1gad ($A_{1gad}$) and the amino acid sequence of the homologous protein ($A_{1cer}$) were used. In (D), both amino acid sequences ($A_{1cer}$ and $A_{1gad}$) and the secondary structure of the homologous protein ($S_{1cer}$) were used. In (E), both amino acid sequences ($A_{1cer}$ and $A_{1gad}$) and the dihedral angles of the homologous protein ($X_{1cer}$) were used. Finally, in panel (F), the same combination of observations was used as in (E), but the alignment was treated as known a priori.


FIG. 8. Benchmarks of predictive accuracy (measured using angular distance; lower is better) on a random subset of ten protein pairs in the test dataset, giving a representative view of predictive accuracy under six different combinations of observations. The dihedral angles $X_b$ of $p_b$ were treated as missing and were sampled under the model, whereas $p_a$ was a homologous protein used for the purposes of prediction. See the legend of figure 7 for a description of each combination (A–F). The final set of bars, denoted 'Mean (N = 38)', are the mean values for the entire test dataset of N = 38 protein pairs. The error bars are the standard errors.

FIG. 9. Depiction of evolutionary hidden state 3. This hidden state was sampled at 0.69% of sites (the average was 1.56%). The equilibrium frequencies of $r_1$ and $r_2$ were $\pi_1 = 0.691$ and $\pi_2 = 0.309$, respectively. The jump rate was $\gamma = 3.176$. The corresponding site-class pair probabilities are depicted to the right as a function of evolutionary time. Note that the dashed red lines depicting the probabilities for (1, 2) and (2, 1) superimpose exactly because the probabilities are equal; this holds for the jump probabilities of all hidden states, as it is required for time-reversibility. In the main figure, the two rows depict the parameters encoded by the two site-classes, respectively. Columns 1 and 3 depict the parameters governing the amino acid and secondary structure stochastic processes, respectively. The secondary structure classes correspond to H = helix, S = sheet, and C = coil. Column 2 depicts the WN diffusions. The stationary distributions of the WN diffusions are shown using black contour lines, the directions of the drifts are indicated by the arrows, and the magnitude of the drifts at each position is indicated using the color gradient.
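The equality of the (1, 2) and (2, 1) site-class pair probabilities can be illustrated with a minimal sketch. It assumes, without confirmation from the text here, that site-class switching behaves as a two-state continuous-time Markov chain whose off-diagonal transition probability is $\pi_j (1 - e^{-\gamma t})$; the function name is hypothetical, $\pi_1 = 0.691$ is taken from the caption, and the rate and time used below are illustrative values only.

```python
import math


def site_class_pair_probs(pi1: float, gamma: float, t: float) -> dict:
    """Joint probabilities of site-class pairs (i, j) at evolutionary time t.

    Assumed model: P_ij(t) = exp(-gamma*t) * [i == j] + (1 - exp(-gamma*t)) * pi_j,
    with the joint probability given by pi_i * P_ij(t).
    """
    pi = (pi1, 1.0 - pi1)
    decay = math.exp(-gamma * t)
    probs = {}
    for i in range(2):
        for j in range(2):
            stay = decay if i == j else 0.0
            transition = stay + (1.0 - decay) * pi[j]
            probs[(i + 1, j + 1)] = pi[i] * transition
    return probs


# pi1 from the caption; gamma and t are hypothetical illustrative values.
probs = site_class_pair_probs(pi1=0.691, gamma=3.0, t=0.5)
for pair, p in sorted(probs.items()):
    print(pair, round(p, 4))
```

Under this assumed form, both off-diagonal joint probabilities reduce to $\pi_1 \pi_2 (1 - e^{-\gamma t})$, which is symmetric in the two site-classes, as time-reversibility requires.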


different pairwise alignments and corresponding evolutionary times. This resulted in a set of 38 simulated pairwise alignments together with corresponding observations, implying that the true underlying alignments were known for each of the simulated protein pairs. ETDBN and a number of other alignment methods were used to infer pairwise alignments for each. The alignment similarity metric (Schwartz et al. 2005) was used to measure the similarity between the inferred alignments and the true alignments, where higher similarity indicates better predictions. It was found that, when using the simulated amino acid sequences alone, ETDBN (11A5) outperformed all four other methods tested (11A1, MUSCLE; 11A2, MAFFT; 11A3, StatAlign; and 11A4, BAli-Phy). However, the greater performance of ETDBN compared with other methods cannot be considered a fair comparison, as the data were simulated under the ETDBN model.

More revealing in figure 11A was the alignment similarity under ETDBN when using different combinations of simulated data observations. It was found that secondary structure alone (11A6) performed the worst, which is unsurprising given that only three states were available to align the proteins. The second worst in terms of alignment similarity was amino acid sequences alone (11A5), followed by amino acid sequences and secondary structures (11A7). Interestingly, using dihedral angles only (11A8) outperformed both 11A5 (sequences only) and 11A6 (secondary structures only). Finally, using amino acid sequence together with dihedral angles (11A9), or all three data types combined (11A10), outperformed all other combinations. This illustrates that, at least under simulation conditions, increasing the number of data observations results in better alignment accuracy.

Following that, the various alignment methods were benchmarked against 38 pairwise alignments consisting of real sequence and structure observations in the test dataset. These pairwise alignments were obtained from the HOMSTRAD alignments. The sequence identity of these pairwise alignments ranged from 10% to 93%, with an average sequence identity of 39%. In addition to methods 1–10 in figure 11A, five structural alignment methods were also used: 11, StatAlign (Herman et al. 2014); 12, TMAlign (Zhang and Skolnick 2005); 13, Mammoth (Ortiz et al. 2002); 14, Dali (Holm and Rosenstrom 2010); and 15, CE (Shindyalov and Bourne 1998). These methods were not used in 11A due to the lack of an appropriate model for simulating the evolution of three-dimensional protein structures.

When benchmarking the MAP estimated alignments against the HOMSTRAD alignments (fig. 11B) using real sequences alone for inference ($A_a$, $A_b$), ETDBN (11B5) had a similar degree of accuracy when compared with several other sequence-based methods (11B1, StatAlign; 11B2, BAli-Phy; 11B3, MUSCLE; and 11B4, MAFFT). This demonstrates that ETDBN has performance comparable to that of other commonly used sequence alignment methods.

Using ($S_a$, $S_b$) alone, ETDBN (11B6) had substantially lower alignment similarity compared with sequence only, which was expected given that a similar result was obtained for the simulated data (11A6). However, when including the real sequences (11B7), the predictive accuracy was once again comparable to sequence-only inferences (11B1–11B5).

FIG. 10. Depiction of two histidine-containing phosphocarriers, PDB 1pch and 1poh, superimposed. On the left is a cartoon representation of the two proteins, corresponding to regions F29-K49 and F29-A42, respectively, with posterior jump probabilities at each position overlaid. On the right is a ball-and-stick representation giving atomic detail for a smaller region (I36-G42 and T36-A42, respectively). The exchange between a glutamate (E39 in 1poh) and a glycine (G39 in 1pch) is associated with a large change in dihedral angle, as indicated by the curved arrows.


When using ($X_a$, $X_b$) alone (11B8), the alignment similarity was found to be somewhat worse than the sequence-only cases. Furthermore, when introducing the sequences (11B9) and secondary structures (11B10) in addition to the dihedral angles, the similarity remained worse than the sequence-only methods (11B1–11B5), despite the additional information. These results are in contrast to the results we obtained for simulated data (11A8–11A10).

The non-statistical structural alignment methods (11B12–11B15) fared the best, likely because they use criteria similar to those used to construct the HOMSTRAD alignments. When interpreting these results, it is important to note that the HOMSTRAD alignments should not be considered the true underlying alignments and may even be strongly biased. For example, they may favor the closest structural superimposition of structures or the most parsimonious alignments with the fewest indels. In the evolutionary modeling context, our goal is to distinguish between homologous sites (sites that have evolved via mutation alone) and indels. In practice, it is extremely difficult to obtain the true underlying alignment (sets of homologies and indels), because it would require an experiment where every indel event since the common ancestor is observed, a seemingly impossible task outside of simulation or laboratory conditions.

After further investigation, the trend of lower alignment similarity seen in figure 11B6–11B10 when using ETDBN with structural observations, compared with sequence-only or non-statistical structural alignment methods, was found to reverse (11C6–11C10) upon calculating the precision of predicting homologous sites (the fraction of sites predicted as homologous that were correctly predicted as such). Therefore, when only dihedral angle observations are used, ETDBN underpredicts the number of homologous sites; however, when a homologous site is predicted, it is correctly predicted more often than when using only amino acid sequences. In particular, ETDBN predicted fewer homologous sites with coiled secondary structure compared with homologous sites with helical or sheet secondary structure. This pattern of results may be in part due to the WN diffusion used to model evolution of dihedral angles. The WN diffusion is suitable for modeling angular drift (small changes in angles localized around a region of the Ramachandran plot), but, due to stationarity, does not sufficiently capture angular shift (large changes in angles between regions of the Ramachandran plot, which are more likely in coiled regions). As noted before, the jump model is an abstraction intended to capture the end-points of evolution by allowing a jump between two regions of the Ramachandran plot, abstracting a potential intermediate evolutionary trajectory for the sake of computational tractability. Note that the jump model accurately captures the common cases where a single mutation induces a large conformational shift.
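The distinction between underpredicting homologous sites and predicting them precisely can be made concrete with a small sketch. Precision follows the definition given above; recall is added here only for contrast and is not a quantity reported in the text. The alignment encoding (columns as index pairs, with `None` marking a gap), the function names, and the toy alignments are all hypothetical.

```python
def homologous_pairs(alignment):
    """Set of (i, j) residue index pairs aligned to each other (no gaps)."""
    return {(i, j) for i, j in alignment if i is not None and j is not None}


def precision(predicted, reference):
    """Fraction of predicted homologous pairs that are also in the reference."""
    pred = homologous_pairs(predicted)
    ref = homologous_pairs(reference)
    return len(pred & ref) / len(pred) if pred else 0.0


def recall(predicted, reference):
    """Fraction of reference homologous pairs that were predicted."""
    pred = homologous_pairs(predicted)
    ref = homologous_pairs(reference)
    return len(pred & ref) / len(ref) if ref else 0.0


# Toy reference: four homologous columns.
reference = [(0, 0), (1, 1), (2, 2), (3, 3)]
# Toy prediction: two correct homologous columns, the rest placed as indels.
predicted = [(0, 0), (1, 1), (2, None), (None, 2), (3, None), (None, 3)]

print(precision(predicted, reference), recall(predicted, reference))
```

In this toy case every predicted homologous pair is correct, yet half of the reference pairs are missed, which mirrors the qualitative pattern described for dihedral-angle-only inference.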

Concluding Remarks

The main achievement of this work is a computationally tractable, generative, and interpretable probabilistic model of protein sequence and structure evolution on a local scale.

FIG. 11. Alignment benchmarks. (A) Averages of alignment similarity across different methods and combinations of data observations, where the observations were simulated from the ETDBN model conditioned on a single known alignment. (B) Averages of alignment similarity, where the 38 HOMSTRAD alignments in the test dataset were taken as the true alignments. (C) Averages of precision in predicting homologous site pairs in 38 sequence pairs from the test dataset, where homologous site pairs in the HOMSTRAD alignments were taken to be the true homologous site pairs.


Previous stochastic models of protein sequence and structure evolution emphasized estimation of evolutionary parameters (Challis and Schmidler 2012; Herman et al. 2014). ETDBN is somewhat of a departure from these previous models, but is likewise capable of estimating evolutionary parameters. We show that estimates of evolutionary times inferred under ETDBN are consistent regardless of whether amino acid sequence or dihedral angle observations are used. In addition, the relationship between evolutionary time and angular distance in real proteins is adequately recapitulated in protein pairs sampled under the model, albeit the angular distance is underestimated for larger evolutionary times, which might be explained by the limited flexibility of the jump model and by the fact that global dependencies are not taken into account.

Like previous models, ETDBN is capable of dealing with alignment uncertainty. By marginalizing over indel histories, it predicts pairwise MAP consensus alignments with accuracy similar to that of score-based and statistical alignment methods.

The generative nature of ETDBN allows us to demonstrate that the underlying empirical distributions over dihedral angles (depicted using Ramachandran plots) are captured, and that the model is capable of predicting missing observations, such as dihedral angles, from a variety of different data types: for example, an amino acid sequence, a homologous amino acid sequence, a homologous secondary structure, a homologous set of dihedral angles, or any combination thereof.

Given its local nature, ETDBN does not constitute a homology modeling method in itself. Rather, it can be used as a building block, much like fragment libraries model local structure in protein structure prediction methods. ETDBN places the homology modeling problem on a statistical footing, enabling a number of approaches to be used later, such as multi-level modeling, that is, combining fine-grained distributions (for example, distributions over dihedral angles such as ETDBN) and coarse-grained distributions (for example, distributions describing the global properties of proteins, such as compactness). In particular, a method referred to as 'the Reference Ratio method' can be used to combine fine-grained and coarse-grained distributions in a statistically principled manner (see Frellsen et al. 2012; Hamelryck et al. 2010). However, we have shown that the current model can already be used for the inference of evolutionary parameters in its present form.

In addition to multi-level modeling, probabilistic models such as ETDBN allow one to account for and to make statements about uncertainty (e.g., with respect to evolutionary time, alignment, etc.) in a rigorous manner. In principle, ETDBN, like TorusDBN (Boomsma et al. 2008), could be used as a proposal distribution. In other words, ETDBN could be used to sample protein structures (possibly conditioned on various data observations) in a computationally efficient manner, such that the resulting samples are expected to be located in regions of high probability density with respect to the true underlying distribution.

A final key feature of our evolutionary model is its interpretable nature. This interpretability enables the identification of potential evolutionary motifs, that is, common patterns of sequence–structure evolution. We identify one such evolutionary motif in 34 different homologous protein pairs. A major direction for future research is the further identification of such evolutionary motifs. Understanding these evolutionary motifs may (1) improve homology modeling predictions; (2) provide more accurate estimates of evolutionary parameters; and (3) produce better models of protein evolution that more realistically capture evolutionary trajectories through sequence and structure space, which may help identify functionally relevant positions that are potential drug targets.

Future Challenges

Pairwise to Phylogeny

For reasons of computational tractability, the implemented model is pairwise, but it is theoretically possible to generalize it to a phylogeny, such as in Herman et al. (2014). In practice, for three or more sequences on a phylogeny, it is necessary to marginalize out the unobserved ancestral protein states in order to compute likelihoods. Felsenstein's algorithm can be used to marginalize over discrete ancestral states, such as amino acids, in a computationally efficient manner. However, we do not know whether a similarly efficient algorithm exists for marginalizing the continuous ancestral dihedral angle states under the WN diffusion, thereby necessitating a more expensive MCMC algorithm. A possibly greater computational hindrance to considering a phylogeny is the alignment problem, which scales as $\mathcal{O}(l_1 l_2 \cdots l_N)$, where $l_i$ is the length of sequence $i$ and $N$ is the number of sequences, although MCMC approaches are possible (Herman et al. 2014).
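For discrete states, the marginalization over ancestral states mentioned above can be sketched as a recursive pruning computation. The example below is only an illustration: it uses a two-state, F81-style substitution model as a stand-in for an amino acid model, and the tree encoding, branch lengths, frequencies, and function names are all hypothetical.

```python
import math
from itertools import product


def transition(t, pi, rate=1.0):
    """F81-style matrix: P[i][j] = e^{-rate*t} * [i == j] + (1 - e^{-rate*t}) * pi[j]."""
    decay = math.exp(-rate * t)
    n = len(pi)
    return [[(decay if i == j else 0.0) + (1.0 - decay) * pi[j] for j in range(n)]
            for i in range(n)]


def partial(node, pi):
    """Likelihood of the data below `node`, conditional on each state at `node`.

    A leaf is an int (the observed state); an internal node is a list of
    (child, branch_length) tuples.
    """
    n = len(pi)
    if isinstance(node, int):
        return [1.0 if s == node else 0.0 for s in range(n)]
    result = [1.0] * n
    for child, branch_length in node:
        p = transition(branch_length, pi)
        below = partial(child, pi)
        for s in range(n):
            # Sum over the child's state: this is the marginalization step.
            result[s] *= sum(p[s][k] * below[k] for k in range(n))
    return result


def likelihood(tree, pi):
    """Average the root partial likelihoods over the stationary distribution."""
    return sum(w * l for w, l in zip(pi, partial(tree, pi)))


def three_leaf_tree(a, b, c):
    """Hypothetical tree ((a:0.1, b:0.2):0.3, c:0.4) with observed leaf states."""
    return [([(a, 0.1), (b, 0.2)], 0.3), (c, 0.4)]


pi = [0.3, 0.7]
total = sum(likelihood(three_leaf_tree(a, b, c), pi)
            for a, b, c in product(range(2), repeat=3))
print(total)
```

Summing the likelihood over all possible leaf patterns should give one, which is a convenient sanity check. The recursion relies on the ancestral states being discrete and enumerable, which is exactly what is unavailable for continuous dihedral angles.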

Context-Dependence

Although we believe our model provides a substantial improvement over current stochastic models of sequence and structural evolution, there is still scope for improvement. The WN diffusions used to model dihedral angle evolution adequately capture angular drift (small local changes in dihedral angle), but are less capable of capturing angular shift (large changes in dihedral angle). This is, to a considerable extent, mitigated by the introduction of jump events, as discussed before. A more realistic model would model the entire evolutionary trajectory, allowing an arbitrary number of switches between site-classes, together with neighboring dependencies amongst adjacent sites along the evolutionary trajectory. Similar context-dependent models are typically computationally expensive and require sophisticated inference procedures (Robinson et al. 2003; Yu and Thorne 2006).

Software Availability

Julia code (tested on both Windows and Linux platforms) is available at http://www.computingforbiology.org/software/etdbn


Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

The authors acknowledge funding from the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN). The second author acknowledges support from project MTM2016-76969-P from the Spanish Ministry of Economy and Competitiveness and ERDF. The authors acknowledge valuable comments from three referees that led to substantial improvements in the manuscript, as well as initial discussions with Christian Havn, Ian Lim, and Mathias Cronjäger.

References

Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2):195–201.

Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T. 2008. A generative, probabilistic model of local protein structure. Proc Natl Acad Sci U S A. 105(26):8932–8937.

Boomsma W, Tian P, Frellsen J, Ferkinghoff-Borg J, Hamelryck T, Lindorff-Larsen K, Vendruscolo M. 2014. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proc Natl Acad Sci U S A. 111(38):13852–13857.

Challis CJ, Schmidler SC. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol. 29(11):3575–3587.

Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett. 457(4):413–416.

Echave J, Fernandez FM. 2010. A perturbative view of protein structural variation. Proteins Struct Funct Bioinform. 78(1):173–180.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.

Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat. 125(1):1–15.

Frellsen J, Mardia KV, Borg M, Ferkinghoff-Borg J, Hamelryck T. 2012. Towards a general probabilistic model of protein structure: the reference ratio method. In: Bayesian methods in structural bioinformatics. New York: Springer. p. 125–134.

García-Portugués E, Sørensen M, Mardia KV, Hamelryck T. 2017. Langevin diffusions on the torus: estimation and applications. arXiv:1705.00296.

Gilks WR, Richardson S, Spiegelhalter D. 1995. Markov chain Monte Carlo in practice. London: CRC Press.

Grishin NV. 1997. Estimation of evolutionary distances from protein spatial structures. J Mol Evol. 45(4):359–369.

Grishin NV. 2001. Fold change in evolution of protein structures. J Struct Biol. 134(2):167–185.

Gutin AM, Badretdinov AY. 1994. Evolution of protein 3D structures as diffusion in multidimensional conformational space. J Mol Evol. 39(2):206–209.

Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 15(7):910–917.

Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. 2010. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE 5(11):e13714.

Herman JL, Challis CJ, Novak A, Hein J, Schmidler SC. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol. 31(9):2251–2266.

Holm L, Rosenstrom P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38(Web Server issue):W545–W549.

Katoh K, et al. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14):3059–3066.

Koshi JM, Goldstein RA. 1998. Models of natural mutations including site heterogeneity. Proteins 32(3):289–295.

Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21(6):1095–1109.

Lio P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14(8):726–733.

Miklos I, Lunter G, Holmes I. 2004. A long indel model for evolutionary sequence alignment. Mol Biol Evol. 21(3):529–540.

Mizuguchi K, Deane CM, Blundell TL, Overington JP. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7(11):2469–2471.

Ortiz AR, Strauss CE, Olmea O. 2002. Mammoth (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 11(11):2606–2621.

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 20(10):1692–1704.

Rohl CA, Strauss CE, Misura KM, Baker D. 2004. Protein structure prediction using Rosetta. Methods Enzymol. 383:66–93.

Schwartz AS, Myers EW, Pachter L. 2005. Alignment metric accuracy. arXiv preprint q-bio/0510052.

Shindyalov IN, Bourne PE. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9):739–747.

Siepel A, Haussler D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol. 11(2–3):413–428.

Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 34(1):3–16.

Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18(5):691–699.

Yu J, Thorne JL. 2006. Dependence among sites in RNA evolution. Mol Biol Evol. 23(8):1525–1537.

Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33(7):2302–2309.


For small evolutionary times (< 0.2), the curves for the real and sampled protein pairs show a good correspondence; however, for larger evolutionary times the model tends to underestimate angular distances. This may reflect the fact that the tpd of the WN diffusion specified is localized around its mean even when the evolutionary time is large; therefore, dihedral angles distant from this mean are unlikely to be sampled. To a certain extent, this is mitigated by the jump model, which occasionally allows for large changes in dihedral angle, but may still be somewhat limited in its flexibility, as jumps can only occur between two site-classes. The majority of protein pairs in our training dataset represent smaller evolutionary times (81.7% of evolutionary times are smaller than 0.4), and therefore protein pairs with larger evolutionary times and their associated jumps are underrepresented in our dataset, which may also explain the underestimation.

An additional possibility is that ETDBN does not attempt to model global dependencies. Echave and Fernandez (2010) use an LFENM model (which does take into account global dependencies) and provide evidence showing that the majority of structural changes are due to collective global deformations rather than local deformations. A local model such as ETDBN, by definition, does not take into account global dependencies and therefore does not fully account for their contribution to structural divergence.

Evaluation of the Model

The conditional independence structure in (eq. 1) enables computationally efficient sampling from the model under different combinations of observed or missing data. For example, ETDBN can be used to sample (i.e., predict) the dihedral angles of a protein from its corresponding amino acid sequence, a homologous amino acid sequence, a homologous set of dihedral angles, the corresponding secondary structure, a homologous secondary structure, or any combination of them.

Predictive accuracy was measured using 38 homologous protein pairs in the test dataset. For every protein pair ($p_a$, $p_b$), the dihedral angles of $p_b$ were treated as missing, and these missing dihedral angles were sampled under the model given a particular combination of observation types. The average angular distance between the sampled and known dihedral angles was used as the measure of predictive accuracy.
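The accuracy measure described above can be sketched as follows. The definition of angular distance used in the paper is given in its Supplementary Material and is not reproduced here, so this sketch substitutes an assumed definition, namely the Euclidean distance on the torus after wrapping each angular difference to $[-\pi, \pi)$; the resulting values are therefore not directly comparable with those reported in the text. All function names and the example angles are hypothetical.

```python
import math


def wrap(angle):
    """Map an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def angular_distance(a, b):
    """Assumed definition: Euclidean distance on the torus between (phi, psi) pairs."""
    dphi = wrap(a[0] - b[0])
    dpsi = wrap(a[1] - b[1])
    return math.hypot(dphi, dpsi)


def mean_angular_distance(true_angles, sampled_angles):
    """Average per-site distance between known and sampled dihedral angles."""
    distances = [angular_distance(t, s) for t, s in zip(true_angles, sampled_angles)]
    return sum(distances) / len(distances)


# Hypothetical (phi, psi) pairs in radians for a three-residue fragment.
true_angles = [(-1.1, -0.8), (-1.2, -0.7), (3.1, 0.0)]
sampled_angles = [(-1.0, -0.8), (-1.2, -0.5), (-3.1, 0.0)]

print(mean_angular_distance(true_angles, sampled_angles))
```

The third residue shows why wrapping matters: 3.1 and -3.1 radians are close on the circle even though their raw difference is large.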

Figure 7 gives an example of predictive accuracy under different combinations of observation types, overlaid on a cartoon of the protein structure being predicted, whereas figure 8 provides a representative view of predictive accuracy across 10 different protein pairs in the test dataset for different combinations of observation types. We highlight some of the key patterns identified in figures 7 and 8 as follows.

Combination 1 refers to random sampling from the model, implying that no data observations were conditioned on besides the respective lengths of proteins $p_a$ and $p_b$. The average angular distance between the true and predicted dihedral angles was 1.6. Random sampling acts as a baseline for predictive accuracy. It is apparent from figure 7 that the model has a propensity to predict right-handed α-helices, which correspond to the most populated region in the Ramachandran plot.

Under combination 2, only the amino acid sequence corresponding to $p_b$ is observed. As expected, in figures 7 and 8 there is an increase in predictive accuracy with the addition of the amino acid sequence relative to combination 1.

Under combination 3, we add in the amino acid sequence of a homologous protein ($p_a$). In all ten cases, there is an improvement in predictive accuracy. The improvement in predictive accuracy is reasonable, as knowledge of the sequence evolutionary trajectory is expected to encode information about structure evolution and hence will inform the dihedral angle conformational possibilities.

Under combination 4, in addition to the two amino acid sequences, we treat the homologous secondary structure as observed. This results in a substantial improvement in predictive accuracy, as one would expect. Knowledge of the amino acid sequence and a homologous secondary structure strongly informs regions of the Ramachandran plot that are likely to be occupied.

Under combination 5 (which we consider the canonical combination, that is, the standard homology modeling scenario), we treat both amino acid sequences as observed, as well as the dihedral angles of the homologous protein ($p_a$). In all cases, the predictive accuracy improves over combination 4. This is anticipated, as the homologous dihedral angles are expected to be the best proxy for missing dihedral angles and are therefore expected to be more informative than secondary structure alone. Note that the availability of a homologous amino acid sequence pair, here and in combination 4, is consequential, as it informs the evolutionary time parameter $t_{ab}$, which will typically constrain the distribution over dihedral angles and reduce the associated uncertainty.

Finally, in combination 6 the same data observations as in combination 5 are used, except that the alignment is treated as given a priori (by the HOMSTRAD alignment) rather than as unobserved. The HOMSTRAD alignment is based on a structural and sequence alignment of p_a and p_b and is therefore expected to encode a higher degree of homology and structural information than combination 5 (where the alignment is treated as unobserved and a marginalization over alignments is therefore performed). On average, there is a slight improvement in predictive accuracy when fixing the alignment, albeit not a substantial one. This demonstrates the accuracy of the alignment HMM.

FIG. 3. BIC scores (points) and number of free parameters (curve) as a function of the number of hidden states across 14 models (indicated above the dotted vertical lines) trained using the 1200 protein pairs in the training dataset. A 64 hidden state model had the lowest BIC score. Each model represents the best of several attempts.

Golden et al. doi:10.1093/molbev/msx137 MBE

2092 Downloaded from https://academic.oup.com/mbe/article-abstract/34/8/2085/3772139 by Edward Boyle Library user on 08 November 2017

FIG. 4. Ramachandran plots depicting sampled and empirical dihedral angle distributions. The top row depicts the distributions for all amino acids, the middle for glycine only, and the bottom for proline only. The leftmost plots show dihedral angles sampled under the jump model, whereas the rightmost plots show the empirical distributions of dihedral angles in the training dataset.


The alignment HMM accounts for alignment uncertainty in a principled manner, which is particularly useful when an appropriate alignment is unavailable. However, it should be noted that inference scales as O(|p_a||p_b|h^2) when treating the alignment as unobserved. Inference scales as O(mh^2) when the alignment is fixed a priori, where m is the length of alignment M_ab and is typically much smaller than |p_a||p_b|.
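To make the difference between the two scaling regimes concrete, the following minimal sketch evaluates both expressions. It assumes that h denotes the number of hidden states (64 in the selected model); the protein lengths and the alignment length are hypothetical values chosen only for illustration, and the function name is not taken from the ETDBN software.

```python
def inference_cost(len_a, len_b, n_hidden, alignment_len=None):
    """Return the leading-order operation count for inference.

    Alignment unobserved: |p_a| * |p_b| * h^2 (all pairs of positions).
    Alignment fixed a priori: m * h^2, with m the alignment length.
    """
    if alignment_len is None:
        return len_a * len_b * n_hidden ** 2
    return alignment_len * n_hidden ** 2


# Hypothetical lengths for p_a and p_b, and a hypothetical alignment length m.
unobserved = inference_cost(150, 160, 64)
fixed = inference_cost(150, 160, 64, alignment_len=170)
print(f"unobserved: {unobserved}, fixed: {fixed}, ratio: {unobserved / fixed:.1f}")
```

Under these assumed lengths, the ratio of the two costs is simply |p_a||p_b| / m, which is why fixing the alignment is so much cheaper for proteins of typical size.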

It should be emphasized that we do not expect ETDBN to compete with structure prediction packages such as Rosetta (Rohl et al. 2004) or homology modeling software such as SWISS-MODEL (Arnold et al. 2006) in terms of predictive accuracy. Our current model is a local model of structure evolution; it is not even expected to capture fundamental constraints such as the radius of gyration of a protein or other global features typical of proteins.

Evolutionary Hidden States Reveal a Common Evolutionary Motif

One benefit of ETDBN is that the 64 evolutionary hidden states learned during the training phase are interpretable. We give an example of a hidden state encoding a jump event that was subsequently found to represent an evolutionary motif present in a large number of protein pairs in our test and training datasets.

Evolutionary hidden state 3 (fig. 9) was selected from the 64 hidden states as an example of a hidden state encoding a jump event and capturing angular shift (a large change in dihedral angle). A notable feature of this hidden state is that the change in dihedral angles between site-classes r1 and r2 is associated with specific amino acid changes. In site-class r1, the amino acid frequencies are relatively spread out amongst a number of amino acids, whereas in site-class r2 the frequencies are particularly concentrated in favor of glycine (Gly) and asparagine (Asn), with glycine being significantly more probable in site-class r2 than in r1. This suggests that, conditioned on hidden state 3, an exchange between a glycine and another amino acid is likely indicative of a jump and hence of a corresponding change in dihedral angle. This is consistent with what we find in a subsequent analysis of evolutionary motifs. This particular jump occurs in coil regions.

Having selected hidden state 3, positions in 238 protein pairs were analyzed for evidence of the corresponding evolutionary motif: 38 protein pairs in the test dataset and a further 200 from the training dataset were analyzed using the criteria described in the Methods section. Using the first criterion, 84 protein sites in 59 protein pairs corresponding to H_i = 3 (evolutionary hidden state 3) were identified. Of the 84 protein sites, 34 met the second criterion.

We give an example of a homologous protein pair illustrating the identified evolutionary motif. Two histidine-containing phosphocarriers, 1pch (Mycoplasma capricolum) and 1poh (Escherichia coli), were identified as having the evolutionary motif (fig. 10) at homologous site E39/G39.

FIG. 5. Scatterplot comparing evolutionary times estimated using pairs of homologous amino acid sequences only versus pairs of homologous sets of dihedral angles only for N = 38 protein pairs in the test dataset. The x-coordinate of each point gives the estimated evolutionary time based only on the amino acid sequences, whereas the y-coordinate gives the estimated evolutionary time based only on the dihedral angles. The diagonal line represents y = x.

FIG. 6. Evolutionary time versus angular distances between real and corresponding sampled protein pairs in the training dataset at 50 representative evolutionary times. Mean angular distances (see 'Calculation of angular distances' in the Supplementary Material online) between the real dihedral angles in a protein pair (red; X_a and X_b) and sampled dihedral angles in a sampled protein pair (blue; X_a and X_b) were compared to test how well the sampled dihedral angles reproduced the real angular distances. The dihedral angles (X_b) of each sampled protein pair were sampled by conditioning on both amino acid sequences and the homologous dihedral angles (A_a, A_b, X_a) and on the estimated evolutionary time (t_ab) for the real protein pair. The regression curves were obtained by quadratic LOcally-weighted regrESSion (LOESS) with the smoothing parameter chosen by leave-one-out cross-validation. The 95% confidence intervals for the mean assume error normality.


Most positions in the homologous pair have low posterior jump probabilities (approximately 0.0), with the exception of positions N38/N38 and E39/G39, which both have high posterior jump probabilities (approximately 1.0). The exchange between a glutamate (at position 39 in 1poh) and a glycine (at position 39 in 1pch) appears to be responsible for the shift in dihedral angle. This exchange corresponds to a significant jump in dihedral angle: ⟨φ, ψ⟩ = ⟨1.63, 0.06⟩ at E39 in 1poh and ⟨φ, ψ⟩ = ⟨1.40, 0.22⟩ at G39 in 1pch. The angular distance between the two dihedral angles is 2.01. This is consistent with the amino acid frequency parameters specified by the two site-classes for hidden state 3 (fig. 9).

Site-class r1 indicates that a number of amino acids (alanine, aspartic acid, glycine, histidine, lysine, asparagine, proline, glutamine, arginine, serine, and threonine) other than glutamate plausibly coincide with the particular dihedral angle conformation specified by site-class r1. The involvement of glycine in a jump is not surprising, as it is a small and flexible amino acid, whereas the role of asparagine is less clear. In our analysis of 238 protein pairs, we found that, of the seven positions meeting the criteria for hidden state 3 and involving an exchange with asparagine (Asn), four were an exchange between an asparagine and a glycine, whereas the remaining three were between asparagine and one of lysine, histidine, or serine.

Using Dihedral Angles for Alignment

A valuable feature of our model is its ability to account for alignment uncertainty by summing over possible pairwise alignments, using the TKF92 model as a prior distribution over indel histories, whilst simultaneously taking into account neighboring dependencies amongst aligned sites. Doing so results in a sample of alignments rather than a single alignment. Nevertheless, a single Maximum A Posteriori (MAP) pairwise alignment may be obtained from the alignment samples and used for downstream analysis.
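One simple way to reduce a set of sampled alignments to a single summary alignment is to take the most frequently sampled one as an estimate of the MAP alignment. The sketch below illustrates this under stated assumptions: each alignment is represented as a tuple of columns (i, j), with None marking a gap, and the function name is hypothetical. It is not necessarily the procedure used in ETDBN, which may derive its MAP alignment from the samples in a different way.

```python
from collections import Counter


def map_alignment(samples):
    """Return the most frequently sampled alignment and its sample frequency.

    Each alignment is a tuple of columns (i, j); None in a column marks a gap.
    The sample mode is used here as a simple estimate of the MAP alignment.
    """
    if not samples:
        raise ValueError("need at least one sampled alignment")
    counts = Counter(samples)
    best, n_best = counts.most_common(1)[0]
    return best, n_best / len(samples)


# Toy example: two candidate alignments of two three-residue proteins.
ungapped = ((0, 0), (1, 1), (2, 2))
gapped = ((0, 0), (1, None), (None, 1), (2, 2))
samples = [ungapped, ungapped, gapped, ungapped]

best, frequency = map_alignment(samples)
print(best, frequency)
```

The sample frequency returned alongside the alignment gives a rough indication of how strongly the samples support that alignment.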

ETDBN and several other alignment methods (namely StatAlign, BAli-Phy, MUSCLE, and MAFFT) were used to infer pairwise alignments from simulated and real data under various combinations of data observations: for example, an amino acid sequence pair (A_a, A_b), a secondary structure sequence pair (S_a, S_b), a dihedral angle sequence pair (X_a, X_b), and combinations thereof.

In the first set of benchmarks (fig. 11A), pairs of proteins were simulated from the ETDBN model conditioned on 38 different pairwise alignments and corresponding evolutionary times. This resulted in a set of 38 simulated pairwise alignments together with corresponding observations, implying that the true underlying alignments were known for each of the simulated protein pairs. ETDBN and a number of other alignment methods were used to infer pairwise alignments for each. The alignment similarity metric (Schwartz et al. 2005) was used to measure the similarity between the inferred alignments and the true alignments, where higher similarity indicates better predictions. It was found that, when using the simulated amino acid sequences alone, ETDBN (11A5) outperformed all four other methods tested (11A1: MUSCLE, 11A2: MAFFT, 11A3: StatAlign, and 11A4: BAli-Phy). However, the greater performance of ETDBN compared with the other methods cannot be considered a fair comparison, as the data were simulated under the ETDBN model.

FIG. 7. Cartoon structure representations of E. coli glyceraldehyde-3-phosphate dehydrogenase (PDB 1gad) are depicted in each panel, overlaid with predictive accuracy when using different combinations of observed data to predict missing dihedral angles in 1gad. Thermus aquaticus glyceraldehyde-3-phosphate dehydrogenase (PDB 1cer) was used as a homolog for the purposes of prediction. Predictive accuracy is indicated using a color gradient depicting the mean angular distance between the true dihedral angle (X^i_1gad) and the predicted (sampled) dihedral angles (X^i_1gad) at each amino acid position. The label at the bottom of each panel indicates the data combination used. In (A), no data were used for prediction. In (B), only the amino acid sequence corresponding to 1gad (A_1gad) was used. In (C), the amino acid sequence of 1gad (A_1gad) and the amino acid sequence of the homologous protein (A_1cer) were used. In (D), both amino acid sequences (A_1cer and A_1gad) and the secondary structure of the homologous protein (S_1cer) were used. In (E), both amino acid sequences (A_1cer and A_1gad) and the dihedral angles of the homologous protein (X_1cer) were used. Finally, in (F), the same combination of observations was used as in (E), but the alignment was treated as known a priori.

FIG. 8. Benchmarks of predictive accuracy (measured using angular distance; lower is better) on a random subset of ten protein pairs in the test dataset, giving a representative view of predictive accuracy under six different combinations of observations. The dihedral angles X_b of p_b were treated as missing and were sampled under the model, whereas p_a was a homologous protein used for the purposes of prediction. See the legend of figure 7 for a description of each combination (A–F). The final set of bars, denoted 'Mean (N = 38)', gives the mean values for the entire test dataset of N = 38 protein pairs. The error bars are the standard errors.

FIG. 9. Depiction of evolutionary hidden state 3. This hidden state was sampled at 0.69% of sites (the average was 1.56%). The equilibrium frequencies of r1 and r2 were π_1 = 0.691 and π_2 = 0.309, respectively. The jump rate was γ = 3176. The corresponding site-class pair probabilities are depicted to the right as a function of evolutionary time. Note that the dashed red lines depicting the probabilities for (1, 2) and (2, 1) superimpose exactly because the probabilities are equal; this holds for the jump probabilities of all hidden states, as it is required for time-reversibility. In the main figure, the two rows depict the parameters encoded by the two site-classes, respectively. Columns 1 and 3 depict the parameters governing the amino acid and secondary structure stochastic processes, respectively. The secondary structure classes correspond to H = helix, S = sheet, and C = coil. Column 2 depicts the WN diffusions. The stationary distributions of the WN diffusions are shown using black contour lines, the directions of the drifts are indicated by the arrows, and the magnitude of the drift at each position is indicated using the color gradient.

More revealing in figure 11A was the alignment similarity under ETDBN when using different combinations of simulated data observations. It was found that secondary structure alone (11A6) performed the worst, which is unsurprising given that only three states were available to align the proteins. The second worst in terms of alignment similarity was amino acid sequences alone (11A5), followed by amino acid sequences and secondary structures (11A7). Interestingly, using dihedral angles only (11A8) outperformed both 11A5 (sequences only) and 11A6 (secondary structures only). Finally, using amino acid sequences together with dihedral angles (11A9), or all three data types combined (11A10), outperformed all other combinations. This illustrates that, at least under simulation conditions, increasing the number of data observations results in better alignment accuracy.

Following that, the various alignment methods were benchmarked against 38 pairwise alignments consisting of real sequence and structure observations in the test dataset. These pairwise alignments were obtained from the HOMSTRAD alignments. The sequence identity of these pairwise alignments ranged from 10% to 93%, with an average sequence identity of 39%. In addition to methods 1–10 in figure 11A, five structural alignment methods were also used: 11, StatAlign (Herman et al. 2014); 12, TMAlign (Zhang and Skolnick 2005); 13, Mammoth (Ortiz et al. 2002); 14, Dali (Holm and Rosenstrom 2010); and 15, CE (Shindyalov and Bourne 1998). These methods were not used in 11A due to the lack of an appropriate model for simulating the evolution of three-dimensional protein structures.

When benchmarking the MAP estimated alignments against the HOMSTRAD alignments (fig. 11B) using real sequences alone for inference (A_a, A_b), ETDBN (11B5) had a similar degree of accuracy to several other sequence-based methods (11B1: StatAlign, 11B2: BAli-Phy, 11B3: MUSCLE, and 11B4: MAFFT). This demonstrates that ETDBN has performance comparable to that of other commonly used sequence alignment methods.

Using (S_a, S_b) alone, ETDBN (11B6) had substantially lower alignment similarity compared with sequence only, which was expected given that a similar result was obtained for the simulated data (11A6). However, when including the real sequences (11B7), the predictive accuracy was once again comparable to the sequence-only inferences (11B1–11B5).

FIG. 10. Depiction of two histidine-containing phosphocarriers, PDB 1pch and 1poh, superimposed. On the left is a cartoon representation of the two proteins, corresponding to regions F29-K49 and F29-A42, respectively, with posterior jump probabilities at each position overlaid. On the right is a ball-and-stick representation giving atomic detail for a smaller region (I36-G42 and T36-A42, respectively). The exchange between a glutamate (E39 in 1poh) and a glycine (G39 in 1pch) is associated with a large change in dihedral angle, as indicated by the curved arrows.


When using (X_a, X_b) alone (11B8), the alignment similarity was found to be somewhat worse than in the sequence-only cases. Furthermore, when introducing the sequences (11B9) and secondary structures (11B10) in addition to the dihedral angles, the similarity remained worse than for the sequence-only methods (11B1–11B5), despite the additional information. These results are in contrast to the results we obtained for simulated data (11A8–11A10).

The non-statistical structural alignment methods (11B12–11B15) fared the best, likely because they use criteria similar to those used to construct the HOMSTRAD alignments. When interpreting these results, it is important to note that the HOMSTRAD alignments should not be considered the true underlying alignments and may even be strongly biased. For example, they may favor the closest structural superimposition of structures or the most parsimonious alignments with the fewest indels. In the evolutionary modeling context, our goal is to distinguish between homologous sites (sites that have evolved via mutation alone) and indels. In practice, it is extremely difficult to obtain the true underlying alignment (sets of homologies and indels), because doing so would require an experiment in which every indel event since the common ancestor is observed, a seemingly impossible task outside of simulation or laboratory conditions.

After further investigation, the trend of lower alignment similarity seen in figure 11B6–11B10 when using ETDBN with structural observations, compared with sequence-only or non-statistical structural alignment methods, was found to reverse (11C6–11C10) upon calculating the precision of predicting homologous sites (the fraction of sites predicted as homologous that were correctly predicted as such). Therefore, when only dihedral angle observations are used, ETDBN underpredicts the number of homologous sites; however, when a homologous site is predicted, it is correctly predicted more often than when using only amino acid sequences. In particular, ETDBN predicted fewer homologous sites with coiled secondary structure compared with homologous sites with helical or sheet secondary structure. This pattern of results may be in part due to the WN diffusion used to model the evolution of dihedral angles. The WN diffusion is suitable for modeling angular drift (small changes in angles localized around a region of the Ramachandran plot) but does not sufficiently capture angular shift (large changes in angles between regions of the Ramachandran plot, which are more likely in coiled regions) due to stationarity. As noted before, the jump model is an abstraction intended to capture the end-points of evolution by allowing a jump between two regions of the Ramachandran plot, abstracting a potential intermediate evolutionary trajectory for the sake of computational tractability. Note that the jump model accurately captures the common cases where a single mutation induces a large conformational shift.
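The precision measure described above, and the way it can be high even when homologous sites are underpredicted, can be sketched as follows. The representation of an alignment as a list of columns (i, j), with None marking a gap, and the function names are assumptions made for illustration; the recall function is added only to make the underprediction visible and is not a quantity reported in the text.

```python
def homologous_pairs(alignment):
    """Extract the set of aligned (homologous) site pairs from an alignment.

    An alignment is a list of columns (i, j); None in a column marks a gap.
    """
    return {(i, j) for i, j in alignment if i is not None and j is not None}


def precision(predicted, reference):
    """Fraction of predicted homologous pairs that are also in the reference."""
    pred = homologous_pairs(predicted)
    if not pred:
        return 0.0
    return len(pred & homologous_pairs(reference)) / len(pred)


def recall(predicted, reference):
    """Fraction of reference homologous pairs that were predicted."""
    ref = homologous_pairs(reference)
    if not ref:
        return 0.0
    return len(homologous_pairs(predicted) & ref) / len(ref)


# Toy example: the prediction calls only two of four true homologous pairs.
reference = [(0, 0), (1, 1), (2, 2), (3, 3)]
predicted = [(0, 0), (1, 1), (2, None), (None, 2), (3, None), (None, 3)]

print(precision(predicted, reference), recall(predicted, reference))
```

In this toy case every predicted pair is correct, yet half of the reference pairs are missed, which mirrors the pattern of high precision with underprediction described above.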

Concluding Remarks

The main achievement of this work is a computationally tractable, generative, and interpretable probabilistic model of protein sequence and structure evolution on a local scale.

FIG. 11. Alignment benchmarks. (A) Averages of alignment similarity across different methods and combinations of data observations, where the observations were simulated from the ETDBN model conditioned on a single known alignment. (B) Averages of alignment similarity, where the 38 HOMSTRAD alignments in the test dataset were taken as the true alignments. (C) Averages of precision in predicting homologous site pairs in 38 sequence pairs from the test dataset, where homologous site pairs in the HOMSTRAD alignments were taken to be the true homologous site pairs.


Previous stochastic models of protein sequence and structure evolution emphasized estimation of evolutionary parameters (Challis and Schmidler 2012; Herman et al. 2014). ETDBN is somewhat of a departure from these previous models but is likewise capable of estimating evolutionary parameters. We show that estimates of evolutionary times inferred under ETDBN are consistent regardless of whether amino acid sequence or dihedral angle observations are used. In addition, the relationship between evolutionary time and angular distance in real proteins is adequately recapitulated in protein pairs sampled under the model, although the angular distance is underestimated for larger evolutionary times, which might be explained by the limited flexibility of the jump model and by the fact that global dependencies are not taken into account.

Like previous models, ETDBN is capable of dealing with alignment uncertainty by marginalizing over indel histories; it predicts pairwise MAP consensus alignments with accuracy similar to that of score-based and statistical alignment methods.

The generative nature of ETDBN allows us to demonstrate that the underlying empirical distributions over dihedral angles (depicted using Ramachandran plots) are captured and that the model is capable of predicting missing observations, such as dihedral angles, from a variety of different data types: for example, an amino acid sequence, a homologous amino acid sequence, a homologous secondary structure, a homologous set of dihedral angles, or any combination thereof.

Given its local nature, ETDBN does not constitute a homology modeling method in itself. Rather, it can be used as a building block, much as fragment libraries model local structure in protein structure prediction methods. ETDBN places the homology modeling problem on a statistical footing, enabling a number of approaches to be used later, such as multi-level modeling, that is, combining fine-grained distributions (for example, distributions over dihedral angles, such as ETDBN) and coarse-grained distributions (for example, distributions describing the global properties of proteins, such as compactness). In particular, a method referred to as 'the Reference Ratio method' can be used to combine fine-grained and coarse-grained distributions in a statistically principled manner (see Frellsen et al. 2012; Hamelryck et al. 2010). However, we have shown that the current model can already be used for the inference of evolutionary parameters in its present form.

In addition to multi-level modeling, probabilistic models such as ETDBN allow one to account for, and to make statements about, uncertainty (e.g., with respect to evolutionary time, alignment, etc.) in a rigorous manner. In principle, ETDBN, like TorusDBN (Boomsma et al. 2008), could be used as a proposal distribution. In other words, ETDBN could be used to sample protein structures (possibly conditioned on various data observations) in a computationally efficient manner, such that the resulting samples are expected to be located in regions of high probability density with respect to the true underlying distribution.

A final key feature of our evolutionary model is its interpretable nature. This interpretability enables the identification of potential evolutionary motifs: common patterns of sequence–structure evolution. We identify one such evolutionary motif in 34 different homologous protein pairs. A major direction for future research is the further identification of such evolutionary motifs. Understanding these evolutionary motifs may 1) improve homology modeling predictions, 2) provide more accurate estimates of evolutionary parameters, and 3) produce better models of protein evolution that more realistically capture evolutionary trajectories through sequence and structure space, which may help identify functionally relevant positions that are potential drug targets.

Future Challenges

Pairwise to Phylogeny

For reasons of computational tractability, the implemented model is pairwise, but it is theoretically possible to generalize it to a phylogeny, as in Herman et al. (2014). In practice, for three or more sequences on a phylogeny it is necessary to marginalize out the unobserved ancestral protein states in order to compute likelihoods. Felsenstein's algorithm can be used to marginalize over discrete ancestral states, such as amino acids, in a computationally efficient manner. However, we do not know whether a similarly efficient algorithm exists for marginalizing the continuous ancestral dihedral angle states under the WN diffusion, which necessitates a more expensive MCMC algorithm. A possibly greater computational hindrance to considering a phylogeny is the alignment problem, which scales as O(l_1 l_2 ⋯ l_N), where l_i is the length of sequence i and N is the number of sequences, although MCMC approaches are possible (Herman et al. 2014).
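The marginalization over discrete ancestral states by Felsenstein's algorithm can be illustrated with a minimal sketch. To keep it short, it assumes a toy two-state symmetric substitution model with unit rate in place of a 20-state amino acid model; the tree encoding (a leaf is an observed state, an internal node is a list of (child, branch length) pairs) and all function names are hypothetical and are not taken from the ETDBN software.

```python
import math

STATES = (0, 1)
STATIONARY = (0.5, 0.5)  # stationary distribution of the symmetric model


def transition_prob(x, y, t):
    """P(x -> y | t) for a two-state symmetric model with unit rate."""
    same = 0.5 + 0.5 * math.exp(-2.0 * t)
    return same if x == y else 1.0 - same


def partial_likelihoods(node):
    """Probability of the leaves below `node`, for each state at `node`.

    A leaf is an observed state (int); an internal node is a list of
    (child, branch_length) pairs. Ancestral states are summed out
    recursively, child by child, as in Felsenstein's pruning algorithm.
    """
    if isinstance(node, int):
        return [1.0 if s == node else 0.0 for s in STATES]
    result = [1.0 for _ in STATES]
    for child, t in node:
        child_l = partial_likelihoods(child)
        for x in STATES:
            result[x] *= sum(transition_prob(x, y, t) * child_l[y] for y in STATES)
    return result


def tree_likelihood(root):
    """Likelihood of the observed leaves, marginalizing the root state."""
    return sum(p * l for p, l in zip(STATIONARY, partial_likelihoods(root)))


# A pair of leaves, and a three-leaf tree with one internal ancestor.
pair = [(0, 0.3), (1, 0.5)]
three_leaves = [([(0, 0.1), (0, 0.2)], 0.4), (1, 0.5)]
print(tree_likelihood(pair), tree_likelihood(three_leaves))
```

For the pair, time-reversibility means the result equals the stationary probability of one leaf state times the transition probability over the summed branch lengths, which provides a convenient check. The cost of the recursion grows linearly in the number of nodes, which is what makes the discrete case tractable; no analogous closed-form recursion is assumed here for continuous dihedral angle states.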

Context-Dependence

Although we believe our model provides a substantial improvement over current stochastic models of sequence and structural evolution, there is still scope for improvement. The WN diffusions used to model dihedral angle evolution adequately capture angular drift (small local changes in dihedral angle) but are less capable of capturing angular shift (large changes in dihedral angle). This is, to a considerable extent, mitigated by the introduction of jump events, as discussed before. A more realistic model would model the entire evolutionary trajectory, allowing an arbitrary number of switches between site-classes together with neighboring dependencies amongst adjacent sites along the evolutionary trajectory. Similar context-dependent models are typically computationally expensive and require sophisticated inference procedures (Robinson et al. 2003; Yu and Thorne 2006).
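As a rough illustration of how jump events between two site-classes behave as a function of evolutionary time, the sketch below computes the joint probabilities of the site-class pairs at the two end-points. It assumes, purely for illustration, a two-state continuous-time Markov chain in which a jump occurs at rate γ and the new site-class is drawn from the equilibrium frequencies; the exact parameterization used in ETDBN may differ. The equilibrium frequency 0.691 is the value reported for hidden state 3, whereas the jump rate and times used here are hypothetical.

```python
import math


def site_class_pair_probs(pi1, gamma, t):
    """Joint probabilities of the site-class pair (i, j) at the two end-points.

    Assumed model: P(i -> j | t) = exp(-gamma * t) * [i == j]
                                   + (1 - exp(-gamma * t)) * pi_j,
    and the joint probability is pi_i * P(i -> j | t).
    """
    pi = (pi1, 1.0 - pi1)
    no_jump = math.exp(-gamma * t)
    probs = {}
    for i in (0, 1):
        for j in (0, 1):
            transition = (1.0 - no_jump) * pi[j] + (no_jump if i == j else 0.0)
            probs[(i + 1, j + 1)] = pi[i] * transition
    return probs


# pi1 as reported for hidden state 3; gamma and the times are hypothetical.
for t in (0.1, 1.0, 10.0):
    probs = site_class_pair_probs(0.691, 1.0, t)
    print(t, probs[(1, 2)], probs[(2, 1)])
```

Under this assumed parameterization, the probabilities of the pairs (1, 2) and (2, 1) are identical at every time, which is the symmetry that time-reversibility requires.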

Software Availability

Julia code (tested on both Windows and Linux platforms) is available at http://www.computingforbiology.org/software/etdbn


Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

The authors acknowledge funding from the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN). The second author acknowledges support from project MTM2016-76969-P from the Spanish Ministry of Economy and Competitiveness and ERDF. The authors acknowledge valuable comments from three referees that led to substantial improvements in the manuscript, as well as initial discussions with Christian Havn, Ian Lim, and Mathias Cronjäger.

References

Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2):195–201.

Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T. 2008. A generative, probabilistic model of local protein structure. Proc Natl Acad Sci U S A 105(26):8932–8937.

Boomsma W, Tian P, Frellsen J, Ferkinghoff-Borg J, Hamelryck T, Lindorff-Larsen K, Vendruscolo M. 2014. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proc Natl Acad Sci U S A 111(38):13852–13857.

Challis CJ, Schmidler SC. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol 29(11):3575–3587.

Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett 457(4):413–416.

Echave J, Fernandez FM. 2010. A perturbative view of protein structural variation. Proteins Struct Funct Bioinform 78(1):173–180.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797.

Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat 125(1):1–15.

Frellsen J, Mardia KV, Borg M, Ferkinghoff-Borg J, Hamelryck T. 2012. Towards a general probabilistic model of protein structure: the reference ratio method. In: Bayesian methods in structural bioinformatics. New York: Springer. p. 125–134.

García-Portugués E, Sørensen M, Mardia KV, Hamelryck T. 2017. Langevin diffusions on the torus: estimation and applications. arXiv:1705.00296.

Gilks WR, Richardson S, Spiegelhalter D. 1995. Markov chain Monte Carlo in practice. London: CRC Press.

Grishin NV. 1997. Estimation of evolutionary distances from protein spatial structures. J Mol Evol 45(4):359–369.

Grishin NV. 2001. Fold change in evolution of protein structures. J Struct Biol 134(2):167–185.

Gutin AM, Badretdinov AY. 1994. Evolution of protein 3D structures as diffusion in multidimensional conformational space. J Mol Evol 39(2):206–209.

Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol 15(7):910–917.

Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. 2010. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE 5(11):e13714.

Herman JL, Challis CJ, Novak A, Hein J, Schmidler SC. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol 31(9):2251–2266.

Holm L, Rosenstrom P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res 38(Web Server issue):W545–W549.

Katoh K, et al. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30(14):3059–3066.

Koshi JM, Goldstein RA. 1998. Models of natural mutations including site heterogeneity. Proteins 32(3):289–295.

Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol 21(6):1095–1109.

Lio P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14(8):726–733.

Miklos I, Lunter G, Holmes I. 2004. A long indel model for evolutionary sequence alignment. Mol Biol Evol 21(3):529–540.

Mizuguchi K, Deane CM, Blundell TL, Overington JP. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci 7(11):2469–2471.

Ortiz AR, Strauss CE, Olmea O. 2002. Mammoth (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 11(11):2606–2621.

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol 20(10):1692–1704.

Rohl CA, Strauss CE, Misura KM, Baker D. 2004. Protein structure prediction using Rosetta. Methods Enzymol 383:66–93.

Schwartz AS, Myers EW, Pachter L. 2005. Alignment metric accuracy. arXiv preprint q-bio/0510052.

Shindyalov IN, Bourne PE. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 11(9):739–747.

Siepel A, Haussler D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol 11(2–3):413–428.

Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol 34(1):3–16.

Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18(5):691–699.

Yu J, Thorne JL. 2006. Dependence among sites in RNA evolution. Mol Biol Evol 23(8):1525–1537.

Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33(7):2302–2309.

Golden et al doi101093molbevmsx137 MBE

2100Downloaded from httpsacademicoupcommbearticle-abstract34820853772139by Edward Boyle Library useron 08 November 2017

Page 9: A Generative Angular Model of Protein Structure Evolutioneprints.whiterose.ac.uk/123598/7/msx137.pdf · motif present in a large number of homologous protein pairs. The generative

rather than as unobserved The HOMSTRAD alignment isbased on a structural and sequence alignment of pa andpb and therefore is expected to encode a higher degree ofhomology and structural information than combination 5(where the alignment is treated as unobserved and

therefore a marginalization over alignments is per-formed) On average there is a slight improvement inpredictive accuracy when fixing the alignment albeitthe magnitude of improvement is not substantial Thisdemonstrates the accuracy of the alignment HMM

FIG. 4. Ramachandran plots depicting sampled and empirical dihedral angle distributions. The top row depicts the distributions for all amino acids, the middle for glycine only, and the bottom for proline only. The leftmost plots show dihedral angles sampled under the jump model, whereas the rightmost plots show the empirical distributions of dihedral angles in the training dataset.


The alignment HMM accounts for alignment uncertainty in a principled manner, which is particularly useful when an appropriate alignment is unavailable. However, it should be noted that inference scales as $O(|p_a||p_b|h^2)$ when treating the alignment as unobserved. Inference scales as $O(mh^2)$ when the alignment is fixed a priori, where m is the length of the alignment Mab and is typically much smaller than $|p_a||p_b|$.
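To make the difference between the two scaling regimes concrete, the following sketch counts the dominant number of operations in each case. The protein lengths and the alignment length are hypothetical values chosen only for illustration, h = 64 is taken as the number of hidden states of the trained model, and constant factors are ignored.

```python
def cost_unobserved_alignment(len_a: int, len_b: int, h: int) -> int:
    """Dominant operation count when alignments are marginalized: |pa| * |pb| * h^2."""
    return len_a * len_b * h ** 2


def cost_fixed_alignment(m: int, h: int) -> int:
    """Dominant operation count when the alignment is fixed a priori: m * h^2."""
    return m * h ** 2


# Hypothetical sizes, not taken from the paper.
len_a, len_b = 150, 160   # lengths of the two proteins (assumed)
m = 170                   # length of the fixed alignment (assumed)
h = 64                    # number of hidden states

unobserved = cost_unobserved_alignment(len_a, len_b, h)
fixed = cost_fixed_alignment(m, h)
ratio = unobserved / fixed

print(f"alignment unobserved: {unobserved} operations")
print(f"alignment fixed:      {fixed} operations")
print(f"ratio:                {ratio:.1f}")
```

Under these assumed sizes, fixing the alignment reduces the dominant cost by roughly two orders of magnitude, because the ratio of the two costs is |pa||pb| / m and does not depend on h.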

It should be emphasized that we do not expect ETDBN to compete with structure prediction packages such as Rosetta (Rohl et al. 2004) or homology modeling software such as SWISS-MODEL (Arnold et al. 2006) in terms of predictive accuracy. Our current model is a local model of structure evolution; it is not even expected to capture fundamental constraints such as the radius of gyration of a protein, or other global features typical of proteins.

Evolutionary Hidden States Reveal a Common Evolutionary Motif

One benefit of ETDBN is that the 64 evolutionary hidden states learned during the training phase are interpretable. We give an example of a hidden state encoding a jump event that was subsequently found to represent an evolutionary motif present in a large number of protein pairs in our test and training datasets.

Evolutionary hidden state 3 (fig. 9) was selected from the 64 hidden states as an example of a hidden state encoding a jump event and capturing angular shift (a large change in dihedral angle). A notable feature of this hidden state is that the change in dihedral angles between site-classes r1 and r2 is associated with specific amino acid changes. In site-class r1, the amino acid frequencies are relatively spread out amongst a number of amino acids, whereas in site-class r2 the frequencies are particularly concentrated in favor of glycine (Gly) and asparagine (Asn), with glycine being significantly more probable in site-class r2 than in r1. This suggests that, conditioned on hidden state 3, an exchange between a glycine and another amino acid is likely indicative of a jump, and hence of a corresponding change in dihedral angle. This is consistent with what we find in a subsequent analysis of evolutionary motifs. This particular jump occurs in coil regions.

Having selected hidden state 3, positions in 238 protein pairs were analyzed for evidence of the corresponding evolutionary motif: 38 protein pairs in the test dataset and a further 200 from the training dataset were analyzed using the criteria described in the Methods section. Using the first criterion, 84 protein sites in 59 protein pairs corresponding to $H^i = 3$ (evolutionary hidden state 3) were identified. Of the 84 protein sites, 34 met the second criterion.

We give an example of a homologous protein pair illustrating the identified evolutionary motif. Two histidine-containing phosphocarriers, 1pch (Mycoplasma capricolum) and 1poh (Escherichia coli), were identified as having the evolutionary motif (fig. 10) at homologous site E39/G39.

FIG. 5. Scatterplot comparing evolutionary times estimated using pairs of homologous amino acid sequences only versus pairs of homologous sets of dihedral angles only, for N = 38 protein pairs in the test dataset. The x-coordinate of each point gives the estimated evolutionary time based only on the amino acid sequence, whereas the y-coordinate gives the estimated evolutionary time based only on the dihedral angles. The diagonal line represents y = x.

FIG. 6. Evolutionary time versus angular distances between real and corresponding sampled protein pairs in the training dataset at 50 representative evolutionary times. Mean angular distances (see 'Calculation of angular distances' in the Supplementary Material online) between the real dihedral angles in a protein pair (red; Xa and Xb) and the sampled dihedral angles in a sampled protein pair (blue; Xa and Xb) were compared to test how well the sampled dihedral angles reproduced the real angular distances. The dihedral angles (Xb) of each sampled protein pair were sampled by conditioning on both amino acid sequences and the homologous dihedral angles (Aa, Ab, Xa), and on the estimated evolutionary time (tab) for the real protein pair. The regression curves were obtained by a quadratic LOcally-weighted regrESSion (LOESS) with smoothing parameter chosen by leave-one-out cross-validation. The 95% confidence intervals for the mean assume error normality.


Most positions in the homologous pair have low posterior jump probabilities (approximately 0.0), with the exception of positions N38/N38 and E39/G39, which both have high posterior jump probabilities (approximately 1.0). The exchange between a glutamate (at position 39 in 1poh) and a glycine (at position 39 in 1pch) appears to be responsible for the shift in dihedral angle. This exchange corresponds to a significant jump in dihedral angle: $\langle \phi^{E39}_{1poh}, \psi^{E39}_{1poh} \rangle = \langle -1.63, -0.06 \rangle$ and $\langle \phi^{G39}_{1pch}, \psi^{G39}_{1pch} \rangle = \langle 1.40, 0.22 \rangle$. The angular distance between the two dihedral angles is 2.01. This is consistent with the amino acid frequency parameters specified by the two site-classes for hidden state 3 (fig. 9).

Site-class r1 indicates that a number of amino acids (alanine, aspartic acid, glycine, histidine, lysine, asparagine, proline, glutamine, arginine, serine, and threonine) other than glutamate plausibly coincide with the particular dihedral angle conformation specified by site-class r1. The involvement of glycine in a jump is not surprising, as it is a small and flexible amino acid, whereas the role of asparagine is less clear. In our analysis of 238 protein pairs, we found that, of the seven positions meeting the criteria for hidden state 3 and involving an exchange with asparagine (Asn), four were an exchange between an asparagine and a glycine, whereas the remaining three were between asparagine and one of lysine, histidine, or serine.

Using Dihedral Angles for Alignment

A valuable feature of our model is its ability to account for alignment uncertainty by summing over possible pairwise alignments, using the TKF92 model as a prior distribution over indel histories, whilst simultaneously taking into account neighboring dependencies amongst aligned sites. Doing so results in a sample of alignments rather than a single alignment. Nevertheless, a single Maximum A Posteriori (MAP) pairwise alignment may be obtained from the alignment samples and used for downstream analysis.

ETDBN and several other alignment methods (namely StatAlign, BAli-Phy, MUSCLE, and MAFFT) were used to infer pairwise alignments from simulated and real data under various combinations of data observations: for example, an amino acid sequence pair (Aa, Ab), a secondary structure sequence pair (Sa, Sb), a dihedral angle sequence pair (Xa, Xb), and combinations thereof.

FIG. 7. Cartoon representations of the structure of E. coli glyceraldehyde-3-phosphate dehydrogenase (PDB 1gad) are depicted in each panel, overlaid with predictive accuracy when using different combinations of observed data to predict missing dihedral angles in 1gad. Thermus aquaticus glyceraldehyde-3-phosphate dehydrogenase (PDB 1cer) was used as a homolog for the purposes of prediction. Predictive accuracy is indicated using a color gradient depicting the mean angular distance between the true dihedral angle ($X^i_{1gad}$) and the predicted (sampled) dihedral angles ($\hat{X}^i_{1gad}$) at each amino acid position. The label at the bottom of each panel indicates the data combination used. In (A), no data were used for prediction. In (B), only the amino acid sequence corresponding to 1gad ($A_{1gad}$) was used. In (C), the amino acid sequence of 1gad ($A_{1gad}$) and the amino acid sequence of the homologous protein ($A_{1cer}$) were used. In (D), both amino acid sequences ($A_{1cer}$ and $A_{1gad}$) and the secondary structure of the homologous protein ($S_{1cer}$) were used. In (E), both amino acid sequences ($A_{1cer}$ and $A_{1gad}$) and the dihedral angles of the homologous protein ($X_{1cer}$) were used. Finally, in panel (F), the same combination of observations was used as in (E), but the alignment was treated as known a priori.

FIG. 8. Benchmarks of predictive accuracy (measured using angular distance; lower is better) on a random subset of ten protein pairs in the test dataset, giving a representative view of predictive accuracy under six different combinations of observations. The dihedral angles Xb of pb were treated as missing and were sampled under the model, whereas pa was a homologous protein used for the purposes of prediction. See the legend of figure 7 for a description of each combination (A–F). The final set of bars, denoted 'Mean (N = 38)', gives the mean values for the entire test dataset of N = 38 protein pairs. The error bars are the standard errors.

FIG. 9. Depiction of evolutionary hidden state 3. This hidden state was sampled at 0.69% of sites (the average was 1.56%). The equilibrium frequencies of r1 and r2 were p1 = 0.691 and p2 = 0.309, respectively. The jump rate was c = 3.176. The corresponding site-class pair probabilities are depicted to the right as a function of evolutionary time. Note that the dashed red lines depicting the probabilities for (1, 2) and (2, 1) superimpose exactly because the probabilities are equal; this holds for the jump probabilities of all hidden states, as it is required for time-reversibility. In the main figure, the two rows depict the parameters encoded by the two site-classes, respectively. Columns 1 and 3 depict the parameters governing the amino acid and secondary structure stochastic processes, respectively. The secondary structure classes correspond to H = helix, S = sheet, and C = coil. Column 2 depicts the WN diffusions. The stationary distributions of the WN diffusions are shown using black contour lines, the directions of the drifts are indicated by the arrows, and the magnitude of the drifts at each position is indicated using the color gradient.

In the first set of benchmarks (fig. 11A), pairs of proteins were simulated from the ETDBN model conditioned on 38 different pairwise alignments and corresponding evolutionary times. This resulted in a set of 38 simulated pairwise alignments together with corresponding observations, implying that the true underlying alignments were known for each of the simulated protein pairs. ETDBN and a number of other alignment methods were used to infer pairwise alignments for each. The alignment similarity metric (Schwartz et al. 2005) was used to measure the similarity between the inferred alignments and the true alignments, where higher similarity indicates better predictions. It was found that, when using the simulated amino acid sequences alone, ETDBN (11A5) outperformed all four other methods tested (11A1 MUSCLE, 11A2 MAFFT, 11A3 StatAlign, and 11A4 BAli-Phy). However, the greater performance of ETDBN compared with other methods cannot be considered a fair comparison, as the data were simulated under the ETDBN model.
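As an illustration of how an inferred pairwise alignment can be scored against a reference, the sketch below computes a residue-level agreement score in the spirit of the alignment metric of Schwartz et al. (2005): a residue counts as correct if it is aligned to the same partner, or left unaligned, in both alignments. The column representation, the function names, and the toy alignments are assumptions made for this example; the exact definition used in the benchmarks may differ.

```python
from typing import Dict, List, Optional, Tuple

# One alignment column: (index in protein a, index in protein b); None marks a gap.
Column = Tuple[Optional[int], Optional[int]]
PartnerMap = Dict[int, Optional[int]]


def partner_maps(columns: List[Column]) -> Tuple[PartnerMap, PartnerMap]:
    """Map every residue of a (and of b) to its aligned partner, or to None."""
    a_to_b: PartnerMap = {}
    b_to_a: PartnerMap = {}
    for i, j in columns:
        if i is not None:
            a_to_b[i] = j
        if j is not None:
            b_to_a[j] = i
    return a_to_b, b_to_a


def alignment_similarity(predicted: List[Column], reference: List[Column]) -> float:
    """Fraction of residues (from both proteins) treated identically in both alignments."""
    pred_a, pred_b = partner_maps(predicted)
    ref_a, ref_b = partner_maps(reference)
    if pred_a.keys() != ref_a.keys() or pred_b.keys() != ref_b.keys():
        raise ValueError("both alignments must cover the same residues")
    agree = sum(pred_a[i] == ref_a[i] for i in ref_a)
    agree += sum(pred_b[j] == ref_b[j] for j in ref_b)
    total = len(ref_a) + len(ref_b)
    return agree / total if total else 1.0


# Toy example: protein a has 4 residues, protein b has 3.
reference = [(0, 0), (1, 1), (2, None), (3, 2)]
predicted = [(0, 0), (1, None), (2, 1), (3, 2)]

similarity = alignment_similarity(predicted, reference)
print(f"alignment similarity: {similarity:.3f}")
```

In the toy example, four of the seven residues are treated identically in both alignments, so the score is 4/7. A score of 1 is obtained only when the two alignments agree on every residue.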

More revealing in figure 11A was the alignment similarity under ETDBN when using different combinations of simulated data observations. It was found that secondary structure alone (11A6) performed the worst, which is unsurprising given that only three states were available to align the proteins. The second worst in terms of alignment similarity was amino acid sequences alone (11A5), followed by amino acid sequences and secondary structures (11A7). Interestingly, using dihedral angles only (11A8) outperformed both 11A5 (sequences only) and 11A6 (secondary structures only). Finally, using amino acid sequences together with dihedral angles (11A9), or all three data types combined (11A10), outperformed all other combinations. This illustrates that, at least under simulation conditions, increasing the number of data observations results in better alignment accuracy.

Following that, the various alignment methods were benchmarked against 38 pairwise alignments consisting of real sequence and structure observations in the test dataset. These pairwise alignments were obtained from the HOMSTRAD alignments. The sequence identity of these pairwise alignments ranged from 10% to 93%, with an average sequence identity of 39%. In addition to methods 1–10 in figure 11A, five structural alignment methods were also used: 11 StatAlign (Herman et al. 2014), 12 TM-align (Zhang and Skolnick 2005), 13 Mammoth (Ortiz et al. 2002), 14 Dali (Holm and Rosenstrom 2010), and 15 CE (Shindyalov and Bourne 1998). These methods were not used in 11A due to the lack of an appropriate model for simulating the evolution of three-dimensional protein structures.

When benchmarking the MAP estimated alignments against the HOMSTRAD alignments (fig. 11B), using real sequences alone for inference (Aa, Ab), ETDBN (11B5) had a similar degree of accuracy when compared with several other sequence-based methods (11B1 StatAlign, 11B2 BAli-Phy, 11B3 MUSCLE, and 11B4 MAFFT). This demonstrates that ETDBN has performance comparable to that of other commonly used sequence alignment methods.

Using (Sa, Sb) alone, ETDBN (11B6) had substantially lower alignment similarity compared with sequence only, which was expected given that a similar result was obtained for the simulated data (11A6). However, when including the real sequences (11B7), the predictive accuracy was once again comparable to sequence-only inferences (11B1–11B5).

FIG. 10. Depiction of two histidine-containing phosphocarriers, PDB 1pch and 1poh, superimposed. On the left is a cartoon representation of the two proteins, corresponding to regions F29-K49 and F29-A42, respectively, with posterior jump probabilities at each position overlaid. On the right is a ball-and-stick representation giving atomic detail for a smaller region (I36-G42 and T36-A42, respectively). The exchange between a glutamate (E39 in 1poh) and a glycine (G39 in 1pch) is associated with a large change in dihedral angle, as indicated by the curved arrows.


When using (Xa, Xb) alone (11B8), the alignment similarity was found to be somewhat worse than in the sequence-only cases. Furthermore, when introducing the sequences (11B9) and secondary structures (11B10) in addition to the dihedral angles, the similarity remained worse than for the sequence-only methods (11B1–11B5), despite the additional information. These results are in contrast to the results we obtained for simulated data (11A8–11A10).

The non-statistical structural alignment methods (11B12–11B15) fared the best, likely because they use criteria similar to those used to construct the HOMSTRAD alignments. When interpreting these results, it is important to note that the HOMSTRAD alignments should not be considered the true underlying alignments, and may even be strongly biased. For example, they may favor the closest structural superimposition of structures, or the most parsimonious alignments with the fewest indels. In the evolutionary modeling context, our goal is to distinguish between homologous sites (sites that have evolved via mutation alone) and indels. In practice, it is extremely difficult to obtain the true underlying alignment (sets of homologies and indels), because it would require an experiment in which every indel event since the common ancestor is observed, a seemingly impossible task outside of simulation or laboratory conditions.

After further investigation, the trend of lower alignment similarity seen in figure 11B6–11B10 when using ETDBN with structural observations, compared with sequence only or non-statistical structural alignment methods, was found to reverse (11C6–11C10) upon calculating the precision of predicting homologous sites (the fraction of sites which were predicted as homologous and were correctly predicted as such). Therefore, when only dihedral angle observations are used, ETDBN underpredicts the number of homologous sites; however, when a homologous site is predicted, it is correctly predicted more often than when using only amino acid sequences. In particular, ETDBN predicted fewer homologous sites with coiled secondary structure compared with homologous sites with helical or sheet secondary structure. This pattern of results may be in part due to the WN diffusion used to model the evolution of dihedral angles. The WN diffusion is suitable for modeling angular drift (small changes in angles localized around a region of the Ramachandran plot), but does not sufficiently capture angular shift (large changes in angles between regions of the Ramachandran plot, which are more likely in coiled regions) due to stationarity. As noted before, the jump model is an abstraction intended to capture the end-points of evolution by allowing a jump between two regions of the Ramachandran plot, abstracting a potential intermediate evolutionary trajectory for the sake of computational tractability. Note that the jump model accurately captures the common cases where a single mutation induces a large conformational shift.
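The distinction between underpredicting homologous sites and predicting them correctly can be sketched as follows. Precision is computed as defined above (the fraction of predicted homologous site pairs that are also homologous in the reference alignment). The fraction of reference pairs recovered is an addition made here purely for contrast, and the column representation and toy alignments are assumptions of this example.

```python
from typing import List, Optional, Set, Tuple

# One alignment column: (index in protein a, index in protein b); None marks a gap.
Column = Tuple[Optional[int], Optional[int]]


def homologous_pairs(columns: List[Column]) -> Set[Tuple[int, int]]:
    """Site pairs that the alignment declares homologous (no gap on either side)."""
    return {(i, j) for i, j in columns if i is not None and j is not None}


def precision(predicted: List[Column], reference: List[Column]) -> float:
    """Fraction of predicted homologous pairs that are homologous in the reference."""
    pred = homologous_pairs(predicted)
    if not pred:
        return 0.0  # convention assumed here for an alignment with no homologous pairs
    return len(pred & homologous_pairs(reference)) / len(pred)


def fraction_recovered(predicted: List[Column], reference: List[Column]) -> float:
    """Fraction of reference homologous pairs that were also predicted (for contrast)."""
    ref = homologous_pairs(reference)
    if not ref:
        return 0.0
    return len(homologous_pairs(predicted) & ref) / len(ref)


# Toy example: the prediction is conservative, declaring only two homologous pairs.
reference = [(0, 0), (1, 1), (2, 2), (3, 3)]
predicted = [(0, 0), (1, 1), (2, None), (None, 2), (3, None), (None, 3)]

prec = precision(predicted, reference)
recovered = fraction_recovered(predicted, reference)
print(f"precision: {prec:.2f}, fraction of reference pairs recovered: {recovered:.2f}")
```

In this toy case every predicted homologous pair is correct, so precision is 1, yet only half of the reference pairs are recovered. This is the qualitative pattern described above: fewer homologous sites predicted, but those predicted being correct more often.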

Concluding Remarks

The main achievement of this work is a computationally tractable, generative, and interpretable probabilistic model of protein sequence and structure evolution on a local scale.

FIG. 11. Alignment benchmarks. (A) Averages of alignment similarity across different methods and combinations of data observations, where the observations were simulated from the ETDBN model conditioned on a single known alignment. (B) Averages of alignment similarity, where the 38 HOMSTRAD alignments in the test dataset were taken as the true alignments. (C) Averages of precision in predicting homologous site pairs in 38 sequence pairs from the test dataset, where homologous site pairs in the HOMSTRAD alignments were taken to be the true homologous site pairs.


Previous stochastic models of protein sequence and structure evolution emphasized estimation of evolutionary parameters (Challis and Schmidler 2012; Herman et al. 2014). ETDBN is somewhat of a departure from these previous models, but is likewise capable of estimating evolutionary parameters. We show that estimates of evolutionary times inferred under ETDBN are consistent regardless of whether amino acid sequence or dihedral angle observations are used. In addition, the relationship between evolutionary time and angular distance in real proteins is adequately recapitulated in protein pairs sampled under the model, although the angular distance is underestimated for larger evolutionary times, which might be explained by the limited flexibility of the jump model and the failure to take global dependencies into account.

Like previous models, ETDBN is capable of dealing with alignment uncertainty by marginalizing over indel histories; it predicts pairwise MAP consensus alignments with accuracy similar to that of score-based and statistical alignment methods.

The generative nature of ETDBN allows us to demonstrate that the underlying empirical distributions over dihedral angles (depicted using Ramachandran plots) are captured, and that the model is capable of predicting missing observations, such as dihedral angles, from a variety of different data types: for example, an amino acid sequence, a homologous amino acid sequence, a homologous secondary structure, a homologous set of dihedral angles, or any combination thereof.

Based on its local nature, ETDBN does not constitute a homology modeling method in itself. Rather, it can be used as a building block, much like fragment libraries model local structure in protein structure prediction methods. ETDBN places the homology modeling problem on a statistical footing, enabling a number of approaches to be used later, such as multi-level modeling, that is, combining fine-grained distributions (for example, distributions over dihedral angles, such as ETDBN) and coarse-grained distributions (for example, distributions describing the global properties of proteins, such as compactness). In particular, a method referred to as 'the Reference Ratio method' can be used to combine fine-grained and coarse-grained distributions in a statistically principled manner (see Frellsen et al. 2012; Hamelryck et al. 2010). However, we have shown that the current model can already be used for the inference of evolutionary parameters in its present form.

In addition to multi-level modeling, probabilistic models such as ETDBN allow one to account for, and to make statements about, uncertainty (e.g., with respect to evolutionary time, alignment, etc.) in a rigorous manner. In principle, ETDBN, like TorusDBN (Boomsma et al. 2008), could be used as a proposal distribution. In other words, ETDBN could be used to sample protein structures (possibly conditioned on various data observations) in a computationally efficient manner, such that the resulting samples are expected to be located in regions of high probability density with respect to the true underlying distribution.

A final key feature of our evolutionary model is its interpretable nature. This interpretability enables the identification of potential evolutionary motifs: common patterns of sequence–structure evolution. We identify one such evolutionary motif in 34 different homologous protein pairs. A major direction for future research is the further identification of such evolutionary motifs. Understanding these evolutionary motifs may 1) improve homology modeling predictions, 2) provide more accurate estimates of evolutionary parameters, and 3) produce better models of protein evolution that more realistically capture evolutionary trajectories through sequence and structure space, which may help identify functionally relevant positions that are potential drug targets.

Future Challenges

Pairwise to Phylogeny

For reasons of computational tractability, the implemented model is pairwise, but it is theoretically possible to generalize it to a phylogeny, as in Herman et al. (2014). In practice, for three or more sequences on a phylogeny, it is necessary to marginalize out the unobserved ancestral protein states in order to compute likelihoods. Felsenstein's algorithm can be used to marginalize over discrete ancestral states, such as amino acids, in a computationally efficient manner. However, we do not know whether a similarly efficient algorithm exists for marginalizing the continuous ancestral dihedral angle states under the WN diffusion, thereby necessitating a more expensive MCMC algorithm. A possibly greater computational hindrance to considering a phylogeny is the alignment problem, which scales as $O(l_1 l_2 \cdots l_N)$, where $l_i$ is the length of sequence i and N is the number of sequences, although MCMC approaches are possible (Herman et al. 2014).
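For discrete states, the marginalization over ancestral states mentioned above can be sketched with a minimal version of the pruning recursion. The symmetric K-state substitution model, the `Node` class, and the toy tree used here are stand-ins chosen for brevity; they are assumptions of this example and not the amino acid model or data structures used in ETDBN.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

K = 4  # number of discrete states (assumed; it would be 20 for amino acids)


def transition_prob(i: int, j: int, t: float) -> float:
    """Symmetric K-state model with one expected substitution per unit time."""
    decay = math.exp(-t * K / (K - 1))
    if i == j:
        return 1.0 / K + (1.0 - 1.0 / K) * decay
    return 1.0 / K - decay / K


@dataclass
class Node:
    branch_length: float = 0.0       # length of the branch leading to this node
    state: Optional[int] = None      # observed state at a leaf; None for internal nodes
    children: List["Node"] = field(default_factory=list)


def partial_likelihoods(node: Node) -> List[float]:
    """P(observed leaves below node | node is in state s), for each state s."""
    if not node.children:
        return [1.0 if s == node.state else 0.0 for s in range(K)]
    result = [1.0] * K
    for child in node.children:
        child_partial = partial_likelihoods(child)
        for s in range(K):
            # Sum over the unobserved state c at the child end of the branch.
            result[s] *= sum(
                transition_prob(s, c, child.branch_length) * child_partial[c]
                for c in range(K)
            )
    return result


def site_likelihood(root: Node) -> float:
    """Likelihood of one site, with a uniform stationary distribution at the root."""
    return sum(p / K for p in partial_likelihoods(root))


def example_tree(a: int, b: int, c: int) -> Node:
    """Hypothetical three-leaf tree ((a, b), c) with arbitrary branch lengths."""
    inner = Node(branch_length=0.2, children=[Node(0.1, a), Node(0.3, b)])
    return Node(children=[inner, Node(0.4, c)])


print(site_likelihood(example_tree(0, 0, 1)))
```

Each internal node is visited once and the work per node is quadratic in the number of states, so the cost grows linearly with the number of sequences rather than exponentially with the number of ancestral nodes. No analogous closed-form recursion is assumed here for continuous states such as dihedral angles.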

Context-Dependence

Although we believe our model provides a substantial improvement over current stochastic models of sequence and structural evolution, there is still scope for improvement. The WN diffusions used to model dihedral angle evolution adequately capture angular drift (small local changes in dihedral angle), but are less capable of capturing angular shift (large changes in dihedral angle). This is, to a considerable extent, mitigated by the introduction of jump events, as discussed before. A more realistic model would model the entire evolutionary trajectory, allowing an arbitrary number of switches between site-classes, together with neighboring dependencies amongst adjacent sites along the evolutionary trajectory. Similar context-dependent models are typically computationally expensive and require sophisticated inference procedures (Robinson et al. 2003; Yu and Thorne 2006).

Software Availability

Julia code (tested on both Windows and Linux platforms) is available at http://www.computingforbiology.org/software/etdbn


Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

The authors acknowledge funding from the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN). The second author acknowledges support from project MTM2016-76969-P from the Spanish Ministry of Economy and Competitiveness, and ERDF. The authors acknowledge valuable comments from three referees that led to substantial improvements in the manuscript, as well as initial discussions with Christian Havn, Ian Lim, and Mathias Cronjäger.

References

Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2):195–201.

Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T. 2008. A generative, probabilistic model of local protein structure. Proc Natl Acad Sci U S A. 105(26):8932–8937.

Boomsma W, Tian P, Frellsen J, Ferkinghoff-Borg J, Hamelryck T, Lindorff-Larsen K, Vendruscolo M. 2014. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proc Natl Acad Sci U S A. 111(38):13852–13857.

Challis CJ, Schmidler SC. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol. 29(11):3575–3587.

Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett. 457(4):413–416.

Echave J, Fernandez FM. 2010. A perturbative view of protein structural variation. Proteins Struct Funct Bioinform. 78(1):173–180.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.

Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat. 125(1):1–15.

Frellsen J, Mardia KV, Borg M, Ferkinghoff-Borg J, Hamelryck T. 2012. Towards a general probabilistic model of protein structure: the reference ratio method. In: Bayesian methods in structural bioinformatics. New York: Springer. p. 125–134.

García-Portugués E, Sørensen M, Mardia KV, Hamelryck T. 2017. Langevin diffusions on the torus: estimation and applications. arXiv:1705.00296.

Gilks WR, Richardson S, Spiegelhalter D. 1995. Markov chain Monte Carlo in practice. London: CRC Press.

Grishin NV. 1997. Estimation of evolutionary distances from protein spatial structures. J Mol Evol. 45(4):359–369.

Grishin NV. 2001. Fold change in evolution of protein structures. J Struct Biol. 134(2):167–185.

Gutin AM, Badretdinov AY. 1994. Evolution of protein 3D structures as diffusion in multidimensional conformational space. J Mol Evol. 39(2):206–209.

Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 15(7):910–917.

Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. 2010. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE 5(11):e13714.

Herman JL, Challis CJ, Novak A, Hein J, Schmidler SC. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol. 31(9):2251–2266.

Holm L, Rosenstrom P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38(Web Server issue):W545–W549.

Katoh K, et al. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14):3059–3066.

Koshi JM, Goldstein RA. 1998. Models of natural mutations including site heterogeneity. Proteins 32(3):289–295.

Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21(6):1095–1109.

Lio P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14(8):726–733.

Miklos I, Lunter G, Holmes I. 2004. A long indel model for evolutionary sequence alignment. Mol Biol Evol. 21(3):529–540.

Mizuguchi K, Deane CM, Blundell TL, Overington JP. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7(11):2469–2471.

Ortiz AR, Strauss CE, Olmea O. 2002. Mammoth (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 11(11):2606–2621.

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 20(10):1692–1704.

Rohl CA, Strauss CE, Misura KM, Baker D. 2004. Protein structure prediction using Rosetta. Methods Enzymol. 383:66–93.

Schwartz AS, Myers EW, Pachter L. 2005. Alignment metric accuracy. arXiv preprint q-bio/0510052.

Shindyalov IN, Bourne PE. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9):739–747.

Siepel A, Haussler D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol. 11(2–3):413–428.

Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 34(1):3–16.

Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18(5):691–699.

Yu J, Thorne JL. 2006. Dependence among sites in RNA evolution. Mol Biol Evol. 23(8):1525–1537.

Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33(7):2302–2309.

Golden et al doi101093molbevmsx137 MBE

2100Downloaded from httpsacademicoupcommbearticle-abstract34820853772139by Edward Boyle Library useron 08 November 2017

Page 10: A Generative Angular Model of Protein Structure Evolutioneprints.whiterose.ac.uk/123598/7/msx137.pdf · motif present in a large number of homologous protein pairs. The generative

The alignment HMM accounts for alignment uncertaintyin a principled manner which is particularly useful when anappropriate alignment is unavailable However it should benoted that inference scales Oethjpajjpbjh2THORN when treating thealignment as unobserved Inference scalesOethmh2THORN when thealignment is fixed a priori where m is the length of alignmentMab and is typically much smaller than jpajjpbj

It should be emphasized that we do not expect ETDBN tocompete with structure prediction packages such as Rosetta(Rohl et al 2004) or homology modeling software such asArnold et al (2006) in terms of predictive accuracy Our cur-rent model is a local model of structure evolutionmdashit is noteven expected capture fundamental constraints such as theradius of gyration of a protein or other global features typicalof proteins

Evolutionary Hidden States Reveal a CommonEvolutionary MotifOne benefit of ETDBN is that the 64 evolutionary hiddenstates learned during the training phase are interpretableWe give an example of a hidden state encoding a jump eventthat was subsequently found to represent an evolutionarymotif present in a large number of protein pairs in our testand training datasets

Evolutionary hidden state 3 (fig. 9) was selected from the 64 hidden states as an example of a hidden state encoding a jump event and capturing angular shift (a large change in dihedral angle). A notable feature of this hidden state is that the change in dihedral angles between site-classes $r_1$ and $r_2$ is associated with specific amino acid changes. In site-class $r_1$, the amino acid frequencies are relatively spread out amongst a number of amino acids, whereas in site-class $r_2$ the frequencies are particularly concentrated in favor of glycine (Gly) and asparagine (Asn), with glycine being significantly more probable in site-class $r_2$ than in $r_1$. This suggests that, conditioned on hidden state 3, an exchange between a glycine and another amino acid is likely indicative of a jump, and hence of a corresponding change in dihedral angle. This is consistent with what we find in a subsequent analysis of evolutionary motifs. This particular jump occurs in coil regions.

Having selected hidden state 3, positions in 238 protein pairs were analyzed for evidence of the corresponding evolutionary motif: 38 protein pairs in the test dataset and a further 200 from the training dataset were analyzed using the criteria described in the Methods section. Using the first criterion, 84 protein sites in 59 protein pairs corresponding to $H^i = 3$ (evolutionary hidden state 3) were identified. Of these 84 protein sites, 34 met the second criterion.

We give an example of a homologous protein pair illustrating the identified evolutionary motif. Two histidine-containing phosphocarriers, 1pch (Mycoplasma capricolum) and 1poh (Escherichia coli), were identified as having the evolutionary motif (fig. 10) at homologous site E39/G39.

FIG. 5. Scatterplot comparing evolutionary times estimated using pairs of homologous amino acid sequences only versus pairs of homologous sets of dihedral angles only, for $N = 38$ protein pairs in the test dataset. The x-coordinate of each point gives the estimated evolutionary time based only on the amino acid sequence, whereas the y-coordinate gives the estimated evolutionary time based only on the dihedral angles. The diagonal line represents $y = x$.

FIG. 6. Evolutionary time versus angular distances between real and corresponding sampled protein pairs in the training dataset at 50 representative evolutionary times. Mean angular distances (see 'Calculation of angular distances' in the Supplementary Material online) between the real dihedral angles in a protein pair (red; $X_a$ and $X_b$) and the sampled dihedral angles in a sampled protein pair (blue; $X_a$ and $X_b$) were compared to test how well the sampled dihedral angles reproduced the real angular distances. The dihedral angles ($X_b$) of each sampled protein pair were sampled by conditioning on both amino acid sequences and the homologous dihedral angles ($A_a$, $A_b$, $X_a$) and on the estimated evolutionary time ($t_{ab}$) for the real protein pair. The regression curves were obtained by a quadratic LOcally-weighted regrESSion (LOESS) with the smoothing parameter chosen by leave-one-out cross-validation. The 95% confidence intervals for the mean assume error normality.


Most positions in the homologous pair have low posterior jump probabilities (≈0.0), with the exception of positions N38/N38 and E39/G39, which both have high posterior jump probabilities (≈1.0). The exchange between a glutamate (at position 39 in 1poh) and a glycine (at position 39 in 1pch) appears to be responsible for the shift in dihedral angle. This exchange corresponds to a significant jump in dihedral angle: $\langle \phi^{E39}_{1poh}, \psi^{E39}_{1poh} \rangle = \langle 1.63, 0.06 \rangle \rightarrow \langle \phi^{G39}_{1pch}, \psi^{G39}_{1pch} \rangle = \langle 1.40, 0.22 \rangle$. The angular distance between the two dihedral angles is 2.01. This is consistent with the amino acid frequency parameters specified by the two site-classes for hidden state 3 (fig. 9).
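Because dihedral angles live on a torus, a distance between two ($\phi$, $\psi$) pairs has to account for wrap-around at $\pm\pi$. The sketch below shows one common way of doing this, namely the Euclidean norm of the wrapped differences. This definition is an assumption made for illustration; the definition actually used in the paper is given in its Supplementary Material and may differ. The function names and the example angles are hypothetical.

```python
import math


def wrap_to_pi(angle):
    """Map an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def torus_distance(pair_a, pair_b):
    """Distance between two (phi, psi) pairs using wrapped differences.

    Assumed definition: Euclidean norm of the per-angle wrapped differences.
    """
    d_phi = wrap_to_pi(pair_a[0] - pair_b[0])
    d_psi = wrap_to_pi(pair_a[1] - pair_b[1])
    return math.hypot(d_phi, d_psi)


# Hypothetical angles: 3.0 and -3.0 radians are close on the circle
# even though their naive difference is 6.0.
near = torus_distance((3.0, 0.0), (-3.0, 0.0))
```

The example illustrates why the wrapping step matters: without it, two conformations on either side of the $\pm\pi$ boundary would appear far apart.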

Site-class $r_1$ indicates that a number of amino acids (alanine, aspartic acid, glycine, histidine, lysine, asparagine, proline, glutamine, arginine, serine, and threonine) other than glutamate plausibly coincide with the particular dihedral angle conformation specified by site-class $r_1$. The involvement of glycine in a jump is not surprising, as it is a small and flexible amino acid, whereas the role of asparagine is less clear. In our analysis of 238 protein pairs, we found that of the seven positions meeting the criteria for hidden state 3 and involving an exchange with asparagine (Asn), four were an exchange between an asparagine and a glycine, whereas the remaining three were between asparagine and one of lysine, histidine, or serine.

Using Dihedral Angles for Alignment
A valuable feature of our model is its ability to account for alignment uncertainty by summing over possible pairwise alignments, using the TKF92 model as a prior distribution over indel histories, whilst simultaneously taking into account neighboring dependencies amongst aligned sites. Doing so results in a sample of alignments rather than a single alignment. Nevertheless, a single Maximum A Posteriori (MAP) pairwise alignment may be obtained from the alignment samples and used for downstream analysis.
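One simple way to turn a set of sampled alignments into a single point estimate is to take the most frequently sampled alignment as a Monte Carlo approximation of the MAP alignment. The sketch below assumes this approach, which is not necessarily the procedure used in ETDBN; the representation of an alignment as a tuple of `(i, j)` index pairs with `None` marking a gap, and the function name, are hypothetical.

```python
from collections import Counter


def map_alignment(samples):
    """Return the most frequently sampled alignment and its sample frequency.

    Each alignment is a hashable tuple of (i, j) columns, where i and j index
    residues in the two proteins and None marks a gap.
    """
    counts = Counter(samples)
    alignment, count = counts.most_common(1)[0]
    return alignment, count / len(samples)


# Three hypothetical posterior samples for two short proteins.
sample_1 = ((0, 0), (1, 1), (2, None))
sample_2 = ((0, 0), (1, None), (2, 1))
samples = [sample_1, sample_2, sample_1]
best, frequency = map_alignment(samples)
```

The returned frequency is a rough estimate of the posterior probability of the chosen alignment, and a low value signals substantial alignment uncertainty.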

ETDBN and several other alignment methods (namely StatAlign, BAli-Phy, MUSCLE, and MAFFT) were used to infer pairwise alignments from simulated and real data under various combinations of data observations, for example, an amino acid sequence pair ($A_a$, $A_b$), a secondary structure sequence pair ($S_a$, $S_b$), a dihedral angle sequence pair ($X_a$, $X_b$), and combinations thereof.

In the first set of benchmarks (fig. 11A), pairs of proteins were simulated from the ETDBN model conditioned on 38

FIG. 7. Cartoon structure representations of E. coli glyceraldehyde-3-phosphate dehydrogenase (PDB 1gad) are depicted in each panel, overlaid with predictive accuracy when using different combinations of observed data to predict missing dihedral angles in 1gad. Thermus aquaticus glyceraldehyde-3-phosphate dehydrogenase (PDB 1cer) was used as a homolog for the purposes of prediction. Predictive accuracy is indicated using a color gradient depicting the mean angular distance between the true dihedral angle ($X^i_{1gad}$) and the predicted (sampled) dihedral angles ($X^i_{1gad}$) at each amino acid position. The label at the bottom of each panel indicates the data combination used. In (A), no data were used for prediction. In (B), only the amino acid sequence corresponding to 1gad ($A_{1gad}$) was used. In (C), the amino acid sequence of 1gad ($A_{1gad}$) and the amino acid sequence of the homologous protein ($A_{1cer}$) were used. In (D), both amino acid sequences ($A_{1cer}$ and $A_{1gad}$) and the secondary structure of the homologous protein ($S_{1cer}$) were used. In (E), both amino acid sequences ($A_{1cer}$ and $A_{1gad}$) and the dihedral angles of the homologous protein ($X_{1cer}$) were used. Finally, in panel (F), the same combination of observations was used as in (E), but the alignment was treated as known a priori.


FIG. 8. Benchmarks of predictive accuracy (measured using angular distance; lower is better) on a random subset of ten protein pairs in the test dataset, giving a representative view of predictive accuracy under six different combinations of observations. The dihedral angles $X_b$ of $p_b$ were treated as missing and were sampled under the model, whereas $p_a$ was a homologous protein used for the purposes of prediction. See the legend of figure 7 for a description of each combination (A–F). The final set of bars, denoted 'Mean ($N = 38$)', gives the mean values for the entire test dataset of $N = 38$ protein pairs. The error bars are the standard errors.

FIG. 9. Depiction of evolutionary hidden state 3. This hidden state was sampled at 0.69% of sites (the average was 1.56%). The equilibrium frequencies of $r_1$ and $r_2$ were $\pi_1 = 0.691$ and $\pi_2 = 0.309$, respectively. The jump rate was $\gamma = 3.176$. The corresponding site-class pair probabilities are depicted to the right as a function of evolutionary time. Note that the dashed red lines depicting the probabilities for (1, 2) and (2, 1) superimpose exactly because the probabilities are equal; this holds for the jump probabilities of all hidden states, as it is required for time-reversibility. In the main figure, the two rows depict the parameters encoded by the two site-classes, respectively. Columns 1 and 3 depict the parameters governing the amino acid and secondary structure stochastic processes, respectively. The secondary structure classes correspond to H = helix, S = sheet, and C = coil. Column 2 depicts the WN diffusions. The stationary distributions of the WN diffusions are shown using black contour lines, the directions of the drifts are indicated by the arrows, and the magnitude of the drift at each position is indicated using the color gradient.
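The equality of the (1, 2) and (2, 1) site-class pair probabilities can be checked numerically for a generic two-state reversible jump process. The sketch below assumes a rate matrix of the form $\gamma \begin{pmatrix} -\pi_2 & \pi_2 \\ \pi_1 & -\pi_1 \end{pmatrix}$, which is a standard parameterization but is not confirmed to be the one used in ETDBN. The function name, the jump rate, and the evolutionary time used in the example are illustrative; the equilibrium frequencies are those quoted in the caption.

```python
import math


def site_class_pair_probabilities(pi1, pi2, gamma, t):
    """Joint probabilities of the site-classes at the two ends of a branch.

    Assumes a two-state reversible process with equilibrium frequencies
    (pi1, pi2) and overall jump rate gamma; t is the evolutionary time.
    """
    decay = math.exp(-gamma * t)
    return {
        (1, 1): pi1 * (pi1 + pi2 * decay),
        (1, 2): pi1 * pi2 * (1.0 - decay),
        (2, 1): pi2 * pi1 * (1.0 - decay),
        (2, 2): pi2 * (pi2 + pi1 * decay),
    }


# Equilibrium frequencies from the caption; gamma and t are illustrative.
probs = site_class_pair_probabilities(0.691, 0.309, gamma=1.0, t=0.5)
```

Under this parameterization the (1, 2) and (2, 1) entries are identical for every $t$, and both tend to $\pi_1 \pi_2$ as $t$ grows, which mirrors the superimposed curves described in the caption.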


different pairwise alignments and corresponding evolutionary times. This resulted in a set of 38 simulated pairwise alignments together with corresponding observations, implying that the true underlying alignments were known for each of the simulated protein pairs. ETDBN and a number of other alignment methods were used to infer pairwise alignments for each. The alignment similarity metric (Schwartz et al. 2005) was used to measure the similarity between the inferred alignments and the true alignments, where higher similarity indicates better predictions. It was found that when using the simulated amino acid sequences alone, ETDBN (11A5) outperformed all four other methods tested (11A1, MUSCLE; 11A2, MAFFT; 11A3, StatAlign; and 11A4, BAli-Phy). However, the greater performance of ETDBN compared with other methods cannot be considered a fair comparison, as the data were simulated under the ETDBN model.
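To give a sense of how an alignment similarity score of this kind can be computed, the following is a simplified sketch in the spirit of the metric of Schwartz et al. (2005): it counts the residues whose partner (a residue in the other protein, or a gap) is the same in the inferred and reference alignments, and divides by the total number of residues. It is an illustration under stated assumptions rather than the exact published metric. The alignment representation (a list of `(i, j)` columns with `None` for gaps) and the function names are hypothetical, and both alignments are assumed to cover the same two sequences.

```python
def partner_maps(alignment):
    """Map each residue index to its partner in the other protein (or None)."""
    a_to_b, b_to_a = {}, {}
    for i, j in alignment:
        if i is not None:
            a_to_b[i] = j
        if j is not None:
            b_to_a[j] = i
    return a_to_b, b_to_a


def alignment_similarity(inferred, reference):
    """Fraction of residues with the same partner in both alignments."""
    inf_a, inf_b = partner_maps(inferred)
    ref_a, ref_b = partner_maps(reference)
    agree = sum(inf_a[i] == ref_a[i] for i in ref_a)
    agree += sum(inf_b[j] == ref_b[j] for j in ref_b)
    return agree / (len(ref_a) + len(ref_b))


# Hypothetical three-residue proteins: the reference has one indel pair,
# whereas the inferred alignment treats the last residues as homologous.
reference = [(0, 0), (1, 1), (2, None), (None, 2)]
inferred = [(0, 0), (1, 1), (2, 2)]
score = alignment_similarity(inferred, reference)
```

In this example four of the six residues have the same partner in both alignments, so the score is 2/3; identical alignments score 1.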

More revealing in figure 11A was the alignment similarity under ETDBN when using different combinations of simulated data observations. It was found that secondary structure alone (11A6) performed the worst, which is unsurprising given that only three states were available to align the proteins. The second worst in terms of alignment similarity was amino acid sequences alone (11A5), followed by amino acid sequences and secondary structures (11A7). Interestingly, using dihedral angles only (11A8) outperformed both 11A5 (sequences only) and 11A6 (secondary structures only). Finally, using amino acid sequences together with dihedral angles (11A9) or all three data types combined (11A10) outperformed all other combinations. This illustrates that, at least under simulation conditions, increasing the number of data observations results in better alignment accuracy.

Following that, the various alignment methods were benchmarked against 38 pairwise alignments consisting of real sequence and structure observations in the test dataset. These pairwise alignments were obtained from the HOMSTRAD alignments. The sequence identity of these pairwise alignments ranged from 10% to 93%, with an average sequence identity of 39%. In addition to methods 1–10 in figure 11A, five structural alignment methods were also used: 11, StatAlign (Herman et al. 2014); 12, TM-align (Zhang and Skolnick 2005); 13, Mammoth (Ortiz et al. 2002); 14, Dali (Holm and Rosenström 2010); and 15, CE (Shindyalov and Bourne 1998). These methods were not used in 11A due to the lack of an appropriate model for simulating the evolution of three-dimensional protein structures.

When benchmarking the MAP estimated alignments against the HOMSTRAD alignments (fig. 11B) using real sequences alone for inference ($A_a$, $A_b$), ETDBN (11B5) had a similar degree of accuracy when compared with several other sequence-based methods (11B1, StatAlign; 11B2, BAli-Phy; 11B3, MUSCLE; and 11B4, MAFFT). This demonstrates that ETDBN has performance comparable to that of other commonly used sequence alignment methods.

Using ($S_a$, $S_b$) alone, ETDBN (11B6) had substantially lower alignment similarity compared with sequence only, which was expected given that a similar result was obtained for the simulated data (11A6). However, when including the real sequences (11B7), the predictive accuracy was once again comparable to sequence-only inferences (11B1–11B5).

FIG. 10. Depiction of two histidine-containing phosphocarriers, PDB 1pch and 1poh, superimposed. On the left is a cartoon representation of the two proteins, corresponding to regions F29-K49 and F29-A42, respectively, with posterior jump probabilities at each position overlaid. On the right is a ball-and-stick representation giving atomic detail for a smaller region (I36-G42 and T36-A42, respectively). The exchange between a glutamate (E39 in 1poh) and a glycine (G39 in 1pch) is associated with a large change in dihedral angle, as indicated by the curved arrows.


When using ($X_a$, $X_b$) alone (11B8), the alignment similarity was found to be somewhat worse than in the sequence-only cases. Furthermore, when introducing the sequences (11B9) and secondary structures (11B10) in addition to the dihedral angles, the similarity remained worse than for the sequence-only methods (11B1–11B5), despite the additional information. These results are in contrast to the results we obtained for simulated data (11A8–11A10).

The non-statistical structural alignment methods (11B12–11B15) fared the best, likely because they use criteria similar to those used to construct the HOMSTRAD alignments. When interpreting these results, it is important to note that the HOMSTRAD alignments should not be considered the true underlying alignments and may even be strongly biased. For example, they may favor the closest structural superimposition of structures or the most parsimonious alignments with the fewest indels. In the evolutionary modeling context, our goal is to distinguish between homologous sites (sites that have evolved via mutation alone) and indels. In practice, it is extremely difficult to obtain the true underlying alignment (sets of homologies and indels), because it would require an experiment where every indel event since the common ancestor is observed, a seemingly impossible task outside of simulation or laboratory conditions.

After further investigation, the trend of lower alignment similarity seen in figure 11B6–11B10 when using ETDBN with structural observations, compared with sequence-only or non-statistical structural alignment methods, was found to reverse (11C6–11C10) upon calculating the precision of predicting homologous sites (the fraction of sites predicted as homologous that were correctly predicted as such). Therefore, when only dihedral angle observations are used, ETDBN underpredicts the number of homologous sites; however, when a homologous site is predicted, it is correctly predicted more often than when using only amino acid sequences. In particular, ETDBN predicted fewer homologous sites with coiled secondary structure compared with homologous sites with helical or sheet secondary structure. This pattern of results may be in part due to the WN diffusion used to model the evolution of dihedral angles. The WN diffusion is suitable for modeling angular drift (small changes in angles localized around a region of the Ramachandran plot) but, due to stationarity, does not sufficiently capture angular shift (large changes in angles between regions of the Ramachandran plot, which are more likely in coiled regions). As noted before, the jump model is an abstraction intended to capture the end-points of evolution by allowing a jump between two regions of the Ramachandran plot, abstracting a potential intermediate evolutionary trajectory for the sake of computational tractability. Note that the jump model accurately captures the common cases where a single mutation induces a large conformational shift.
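The distinction between underpredicting homologous sites and predicting them precisely can be illustrated with a small calculation. The sketch below computes precision as defined in the text (the fraction of predicted homologous site pairs that also appear in the reference alignment). The alignment representation (a list of `(i, j)` columns with `None` for gaps), the function names, and the example alignments are hypothetical.

```python
def homologous_pairs(alignment):
    """Set of aligned (i, j) residue pairs, ignoring columns with gaps."""
    return {(i, j) for i, j in alignment if i is not None and j is not None}


def homology_precision(inferred, reference):
    """Fraction of predicted homologous pairs that are in the reference."""
    predicted = homologous_pairs(inferred)
    if not predicted:
        return float("nan")  # precision is undefined with no predictions
    return len(predicted & homologous_pairs(reference)) / len(predicted)


# Hypothetical case: the inferred alignment calls only two of the three
# reference homologies, but both calls are correct.
reference = [(0, 0), (1, 1), (2, 2)]
inferred = [(0, 0), (1, None), (None, 1), (2, 2)]
precision = homology_precision(inferred, reference)
recovered = len(homologous_pairs(inferred)) / len(homologous_pairs(reference))
```

Here the precision is 1.0 even though only two-thirds of the reference homologies are recovered, which is the qualitative pattern described above for dihedral-angle-only inference.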

Concluding Remarks
The main achievement of this work is a computationally tractable, generative, and interpretable probabilistic model of protein sequence and structure evolution on a local scale.

FIG. 11. Alignment benchmarks. (A) Averages of alignment similarity across different methods and combinations of data observations, where the observations were simulated from the ETDBN model conditioned on a single known alignment. (B) Averages of alignment similarity, where the 38 HOMSTRAD alignments in the test dataset were taken as the true alignments. (C) Averages of precision in predicting homologous site pairs in 38 sequence pairs from the test dataset, where homologous site pairs in the HOMSTRAD alignments were taken to be the true homologous site pairs.


Previous stochastic models of protein sequence and structure evolution emphasized estimation of evolutionary parameters (Challis and Schmidler 2012; Herman et al. 2014). ETDBN is somewhat of a departure from these previous models, but is likewise capable of estimating evolutionary parameters. We show that estimates of evolutionary times inferred under ETDBN are consistent regardless of whether amino acid sequence or dihedral angle observations are used. In addition, the relationship between evolutionary time and angular distance in real proteins is adequately recapitulated in protein pairs sampled under the model, although the angular distance is underestimated for larger evolutionary times. This underestimation might be explained by the limited flexibility of the jump model and by the fact that global dependencies are not taken into account.

Like previous models, ETDBN is capable of dealing with alignment uncertainty by marginalizing over indel histories; it predicts pairwise MAP consensus alignments with accuracy similar to that of score-based and statistical alignment methods.

The generative nature of ETDBN allows us to demonstrate that the underlying empirical distributions over dihedral angles (depicted using Ramachandran plots) are captured, and that the model is capable of predicting missing observations such as dihedral angles from a variety of different data types, for example, an amino acid sequence, a homologous amino acid sequence, a homologous secondary structure, a homologous set of dihedral angles, or any combination thereof.

Owing to its local nature, ETDBN does not constitute a homology modeling method in itself. Rather, it can be used as a building block, much like fragment libraries model local structure in protein structure prediction methods. ETDBN places the homology modeling problem on a statistical footing, enabling a number of approaches to be used later, such as multi-level modeling, that is, combining fine-grained distributions (for example, distributions over dihedral angles such as ETDBN) and coarse-grained distributions (for example, distributions describing the global properties of proteins, such as compactness). In particular, a method referred to as 'the Reference Ratio method' can be used to combine fine-grained and coarse-grained distributions in a statistically principled manner (see Frellsen et al. 2012; Hamelryck et al. 2010). However, we have shown that the current model can already be used for the inference of evolutionary parameters in its present form.

In addition to multi-level modeling, probabilistic models such as ETDBN allow one to account for and to make statements about uncertainty (e.g., with respect to evolutionary time, alignment, etc.) in a rigorous manner. In principle, ETDBN, like TorusDBN (Boomsma et al. 2008), could be used as a proposal distribution. In other words, ETDBN could be used to sample protein structures (possibly conditioned on various data observations) in a computationally efficient manner, such that the resulting samples are expected to be located in regions of high probability density with respect to the true underlying distribution.

A final key feature of our evolutionary model is its interpretable nature. This interpretability enables the identification of potential evolutionary motifs, that is, common patterns of sequence–structure evolution. We identify one such evolutionary motif in 34 different homologous protein pairs. A major direction for future research is the further identification of such evolutionary motifs. Understanding these evolutionary motifs may 1) improve homology modeling predictions, 2) provide more accurate estimates of evolutionary parameters, and 3) produce better models of protein evolution that more realistically capture evolutionary trajectories through sequence and structure space, which may help identify functionally relevant positions that are potential drug targets.

Future Challenges

Pairwise to Phylogeny
For reasons of computational tractability, the implemented model is pairwise, but it is theoretically possible to generalize it to a phylogeny, as in Herman et al. (2014). In practice, for three or more sequences on a phylogeny, it is necessary to marginalize out the unobserved ancestral protein states in order to compute likelihoods. Felsenstein's algorithm can be used to marginalize over discrete ancestral states, such as amino acids, in a computationally efficient manner. However, we do not know whether a similarly efficient algorithm exists for marginalizing the continuous ancestral dihedral angle states under the WN diffusion; in its absence, a more expensive MCMC algorithm is necessary. A possibly greater computational hindrance to considering a phylogeny is the alignment problem, which scales as $O(l_1 l_2 \cdots l_N)$, where $l_i$ is the length of sequence $i$ and $N$ is the number of sequences, although MCMC approaches are possible (Herman et al. 2014).
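The marginalization over discrete ancestral states can be illustrated on the smallest non-trivial case: a star tree with a single unobserved ancestor and three observed leaves. The sketch below is an illustration only and is not part of ETDBN. It assumes a symmetric two-state substitution model in place of an amino acid model, and the function names, rate, and branch lengths are hypothetical. On larger trees, Felsenstein's algorithm applies the same sum recursively at every internal node.

```python
import math
from itertools import product


def transition_matrix(rate, t):
    """Symmetric two-state model: P[x][y] is the probability of y given x."""
    stay = 0.5 * (1.0 + math.exp(-2.0 * rate * t))
    return [[stay, 1.0 - stay], [1.0 - stay, stay]]


def star_tree_likelihood(leaf_states, branch_lengths, rate=1.0,
                         root_freqs=(0.5, 0.5)):
    """Likelihood of the leaf states, summing over the unobserved root state."""
    likelihood = 0.0
    for root_state, freq in enumerate(root_freqs):
        partial = freq
        for state, t in zip(leaf_states, branch_lengths):
            partial *= transition_matrix(rate, t)[root_state][state]
        likelihood += partial
    return likelihood


# Hypothetical branch lengths for three leaves.
branches = (0.1, 0.2, 0.3)
example = star_tree_likelihood((0, 0, 1), branches)
# Sanity check: likelihoods over all leaf configurations sum to one.
total = sum(star_tree_likelihood(states, branches)
            for states in product((0, 1), repeat=3))
```

The cost of this sum is linear in the number of leaves and quadratic in the number of discrete states, which is what makes the discrete case efficient; no analogous closed-form sum is assumed here for continuous dihedral angle states.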

Context-Dependence
Although we believe our model provides a substantial improvement over current stochastic models of sequence and structural evolution, there is still scope for improvement. The WN diffusions used to model dihedral angle evolution adequately capture angular drift (small local changes in dihedral angle) but are less capable of capturing angular shift (large changes in dihedral angle). This is to a considerable extent mitigated by the introduction of jump events, as discussed before. A more realistic model would model the entire evolutionary trajectory, allowing an arbitrary number of switches between site-classes, together with neighboring dependencies amongst adjacent sites along the evolutionary trajectory. Similar context-dependent models are typically computationally expensive and require sophisticated inference procedures (Robinson et al. 2003; Yu and Thorne 2006).

Software Availability
Julia code (tested on both Windows and Linux platforms) is available at http://www.computingforbiology.org/software/etdbn.


Supplementary Material
Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments
The authors acknowledge funding from the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN). The second author acknowledges support from project MTM2016-76969-P from the Spanish Ministry of Economy and Competitiveness and ERDF. The authors acknowledge valuable comments from three referees that led to substantial improvements in the manuscript, as well as initial discussions with Christian Havn, Ian Lim, and Mathias Cronjäger.

References
Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2):195–201.

Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T. 2008. A generative, probabilistic model of local protein structure. Proc Natl Acad Sci U S A. 105(26):8932–8937.

Boomsma W, Tian P, Frellsen J, Ferkinghoff-Borg J, Hamelryck T, Lindorff-Larsen K, Vendruscolo M. 2014. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proc Natl Acad Sci U S A. 111(38):13852–13857.

Challis CJ, Schmidler SC. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol. 29(11):3575–3587.

Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett. 457(4):413–416.

Echave J, Fernández FM. 2010. A perturbative view of protein structural variation. Proteins Struct Funct Bioinform. 78(1):173–180.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.

Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat. 125(1):1–15.

Frellsen J, Mardia KV, Borg M, Ferkinghoff-Borg J, Hamelryck T. 2012. Towards a general probabilistic model of protein structure: the reference ratio method. In: Bayesian methods in structural bioinformatics. New York: Springer. p. 125–134.

García-Portugués E, Sørensen M, Mardia KV, Hamelryck T. 2017. Langevin diffusions on the torus: estimation and applications. arXiv:1705.00296.

Gilks WR, Richardson S, Spiegelhalter D. 1995. Markov chain Monte Carlo in practice. London: CRC Press.

Grishin NV. 1997. Estimation of evolutionary distances from protein spatial structures. J Mol Evol. 45(4):359–369.

Grishin NV. 2001. Fold change in evolution of protein structures. J Struct Biol. 134(2):167–185.

Gutin AM, Badretdinov AY. 1994. Evolution of protein 3D structures as diffusion in multidimensional conformational space. J Mol Evol. 39(2):206–209.

Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 15(7):910–917.

Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. 2010. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE 5(11):e13714.

Herman JL, Challis CJ, Novák Á, Hein J, Schmidler SC. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol. 31(9):2251–2266.

Holm L, Rosenström P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38(Web Server issue):W545–W549.

Katoh K, et al. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14):3059–3066.

Koshi JM, Goldstein RA. 1998. Models of natural mutations including site heterogeneity. Proteins 32(3):289–295.

Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21(6):1095–1109.

Liò P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14(8):726–733.

Miklós I, Lunter G, Holmes I. 2004. A "long indel" model for evolutionary sequence alignment. Mol Biol Evol. 21(3):529–540.

Mizuguchi K, Deane CM, Blundell TL, Overington JP. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7(11):2469–2471.

Ortiz AR, Strauss CE, Olmea O. 2002. Mammoth (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 11(11):2606–2621.

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 20(10):1692–1704.

Rohl CA, Strauss CE, Misura KM, Baker D. 2004. Protein structure prediction using Rosetta. Methods Enzymol. 383:66–93.

Schwartz AS, Myers EW, Pachter L. 2005. Alignment metric accuracy. arXiv preprint q-bio/0510052.

Shindyalov IN, Bourne PE. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9):739–747.

Siepel A, Haussler D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol. 11(2–3):413–428.

Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 34(1):3–16.

Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18(5):691–699.

Yu J, Thorne JL. 2006. Dependence among sites in RNA evolution. Mol Biol Evol. 23(8):1525–1537.

Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33(7):2302–2309.

Golden et al doi101093molbevmsx137 MBE

2100Downloaded from httpsacademicoupcommbearticle-abstract34820853772139by Edward Boyle Library useron 08 November 2017

Page 11: A Generative Angular Model of Protein Structure Evolutioneprints.whiterose.ac.uk/123598/7/msx137.pdf · motif present in a large number of homologous protein pairs. The generative

Most positions in the homologous pair have low posteriorjump probabilities ( 00) with the exception of positionsN38N38 and E39G39 which both have high posteriorjump probabilities ( 10) The exchange between a gluta-mate (at position 39 in 1poh) and a glycine (at position 39in 1pch) appears to be responsible for the shift in dihedralangle This exchange corresponds to a significant jump in di-hedral angle h1pohE39 w1pchE39i frac14 h163006i h1pohG39w1pchG39i frac14 h140 022i The angular dis-tance between the two dihedral angles is 201 This isconsistent with the amino acid frequency parametersspecified by the two site-classes for hidden state 3 (fig 9)

Site-class r1 indicates that a number of amino acids (alanineaspartic acid glycine histidine lysine asparagine proline glu-tamine arginine serine and theorine) other than glutamateplausibly coincide with the particular dihedral angle confor-mation specified by site-class r1 The involvement of glycine ina jump is not surprising as it is a small and flexible amino acidwhereas the role of asparagine is less clear In our analysis of238 protein pairs we found that of the seven positions meetingthe criteria for hidden state 3 and involving an exchange withasparagine (Asn) four were an exchange between an

asparagine and a glycine whereas the remaining three werebetween asparagine and one of lysine histidine or serine

Using Dihedral Angles for AlignmentA valuable feature of our model is its ability to account foralignment uncertainty by summing over possible pairwisealignments using the TKF92 model as a prior distributionover indel histories whilst simultaneously taking into accountneighboring dependencies amongst aligned sites Doing soresults in a sample of alignments rather than a single align-ment Nevertheless a single Maximum A Posteriori (MAP)pairwise alignment may be obtained from the alignment sam-ples and used for downstream analysis

ETDBN and several other alignment methods (namelyStatAlign BAli-Phy MUSCLE and MAFFT) were used to inferpairwise alignments from simulated and real data under var-ious combinations of data observations for example anamino acid sequence pair (Aa Ab) a secondary structure se-quence pair (Sa Sb) a dihedral angle sequence pair (Xa Xb)and combinations thereof

In the first set of benchmarks (fig 11A) pairs of proteinswere simulated from the ETDBN model conditioned on 38

FIG 7 Cartoon structure representations of E coli glyceraldehyde-3-phosphate dehydrogenase structure (PDB 1gad) are depicted in each paneloverlaid with predictive accuracy when using different combinations of observed data to predict missing dihedral angles in 1gad Thermusaquaticus glyceraldehyde-3-phosphate dehydrogenase (PDB 1cer) was used as a homolog for the purposes of prediction Predictive accuracy isindicated using a color gradient depicting the mean angular distance between the true dihedral angle (Xi

1gad) and the predicted (sampled) dihedralangles (X

i

1gad) at each amino acid position The label at the bottom of each panel indicates the data combination used In (A) no data was used forprediction In (B) only the amino acid sequence corresponding to 1gad (A1gad) was used In (C) the amino acid sequence of 1gad (A1gad) and theamino acid sequence of the homologous protein (A1cer) were used In (D) both amino acids sequences (A1cer and A1gad) and the secondarystructure of the homologous protein (S1cer) were used In (E) both the amino acid sequences (A1cer and A1gad) and the dihedral angles of thehomologous protein (X1cer) were used Finally in panel (F) the same combination of observations was used as in (E) but the alignment was treatedas known a priori

Evolutionary TorusDBN doi101093molbevmsx137 MBE

2095Downloaded from httpsacademicoupcommbearticle-abstract34820853772139by Edward Boyle Library useron 08 November 2017

FIG. 8. Benchmarks of predictive accuracy (measured using angular distance; lower is better) on a random subset of ten protein pairs in the test dataset, giving a representative view of predictive accuracy under six different combinations of observations. The dihedral angles Xb of pb were treated as missing and were sampled under the model, whereas pa was a homologous protein used for the purposes of prediction. See the legend of figure 7 for a description of each combination (A–F). The final set of bars, denoted 'Mean (N = 38)', are the mean values for the entire test dataset of N = 38 protein pairs. The error bars are the standard errors.

FIG. 9. Depiction of evolutionary hidden state 3. This hidden state was sampled at 0.69% of sites (the average was 1.56%). The equilibrium frequencies of r1 and r2 were π1 = 0.691 and π2 = 0.309, respectively. The jump rate was γ = 3.176. The corresponding site-class pair probabilities are depicted to the right as a function of evolutionary time. Note that the dashed red lines depicting the probabilities for (1, 2) and (2, 1) superimpose exactly because the probabilities are equal; this holds for the jump probabilities of all hidden states, as it is required for time-reversibility. In the main figure, the two rows depict the parameters encoded by the two site-classes, respectively. Columns 1 and 3 depict the parameters governing the amino acid and secondary structure stochastic processes, respectively. The secondary structure classes correspond to H = helix, S = sheet, and C = coil. Column 2 depicts the WN diffusions. The stationary distributions of the WN diffusions are shown using black contour lines, the directions of the drifts are indicated by the arrows, and the magnitude of the drift at each position is indicated using the color gradient.
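The symmetry between the (1, 2) and (2, 1) site-class pair probabilities noted in the caption can be illustrated with a small sketch. The parameterization below is an assumption made for illustration and is not necessarily the exact form used by ETDBN: with probability exp(-rate × t) no jump has occurred and both proteins share a single site-class drawn from the equilibrium frequencies; otherwise the two site-classes are drawn independently from those frequencies. The function name `site_class_pair_probs` and the jump rate used in the usage line are hypothetical.

```python
import math


def site_class_pair_probs(pi1, jump_rate, t):
    """Joint probabilities of the site-class pair (r_a, r_b) at time t.

    Assumed two-class jump model (illustrative only):
      * no jump, probability exp(-jump_rate * t): both proteins share one
        class drawn from the equilibrium frequencies (pi1, 1 - pi1);
      * jump, probability 1 - exp(-jump_rate * t): the two classes are
        drawn independently from the equilibrium frequencies.
    """
    pi = (pi1, 1.0 - pi1)
    no_jump = math.exp(-jump_rate * t)
    probs = {}
    for i in (0, 1):
        for j in (0, 1):
            p = pi[i] * pi[j] * (1.0 - no_jump)  # contribution of a jump
            if i == j:
                p += pi[i] * no_jump             # contribution of no jump
            probs[(i + 1, j + 1)] = p
    return probs


# Equilibrium frequency taken from the caption; the jump rate is illustrative.
for t in (0.0, 0.5, 2.0):
    print(t, site_class_pair_probs(pi1=0.691, jump_rate=1.0, t=t))
```

Under this assumed form the off-diagonal entries are both π1 π2 (1 − exp(−rate × t)), so they coincide for every evolutionary time, which is the behavior the dashed lines in the figure display.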

Golden et al doi101093molbevmsx137 MBE


different pairwise alignments and corresponding evolutionary times. This resulted in a set of 38 simulated pairwise alignments together with corresponding observations, implying that the true underlying alignments were known for each of the simulated protein pairs. ETDBN and a number of other alignment methods were used to infer pairwise alignments for each. The alignment similarity metric (Schwartz et al. 2005) was used to measure the similarity between the inferred alignments and the true alignments, where higher similarity indicates better predictions. It was found that, when using the simulated amino acid sequences alone, ETDBN (11A.5) outperformed all four other methods tested (11A.1: MUSCLE, 11A.2: MAFFT, 11A.3: StatAlign, and 11A.4: BAli-Phy). However, the greater performance of ETDBN compared with other methods cannot be considered a fair comparison, as the data were simulated under the ETDBN model.
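To make the notion of alignment similarity concrete, the following is a minimal sketch in the spirit of the metric of Schwartz et al. (2005), read here as the fraction of residues, counted over both sequences, whose fate (the partner residue, or a gap) is the same in the inferred and the reference alignment. This reading, the gapped-string representation, and the function names are assumptions for illustration; the benchmark code used in the study may differ.

```python
def residue_partners(row_a, row_b, gap="-"):
    """For each residue of each sequence, record the index of its aligned
    partner in the other sequence, or None if it is aligned to a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    partners_a, partners_b = [], []
    i = j = 0  # residue counters for sequences a and b
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            partners_a.append(j)
            partners_b.append(i)
        elif a != gap:
            partners_a.append(None)
        elif b != gap:
            partners_b.append(None)
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return partners_a, partners_b


def alignment_similarity(pred, ref, gap="-"):
    """Fraction of residues (over both sequences) with the same fate in the
    predicted and reference alignments; 1.0 means identical alignments."""
    pred_a, pred_b = residue_partners(*pred, gap=gap)
    ref_a, ref_b = residue_partners(*ref, gap=gap)
    if len(pred_a) != len(ref_a) or len(pred_b) != len(ref_b):
        raise ValueError("alignments must be over the same two sequences")
    total = len(ref_a) + len(ref_b)
    agree = sum(x == y for x, y in zip(pred_a + pred_b, ref_a + ref_b))
    return agree / total if total else 1.0


# Toy example: the reference places an indel where the prediction aligns D/F.
reference = ("ACD-E", "AC-FE")
predicted = ("ACDE", "ACFE")
print(alignment_similarity(predicted, reference))
```

A score defined this way penalizes both spurious homologies and spurious indels, which is why it is sensitive to the choice of reference alignment.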

More revealing in figure 11A was the alignment similarity under ETDBN when using different combinations of simulated data observations. It was found that secondary structure alone (11A.6) performed the worst, which is unsurprising given that only three states were available to align the proteins. The second worst in terms of alignment similarity was amino acid sequences alone (11A.5), followed by amino acid sequences and secondary structures (11A.7). Interestingly, using dihedral angles only (11A.8) outperformed both 11A.5 (sequences only) and 11A.6 (secondary structures only). Finally, using amino acid sequences together with dihedral angles (11A.9), or all three data types combined (11A.10), outperformed all other combinations. This illustrates that, at least under simulation conditions, increasing the number of data observations results in better alignment accuracy.

Following that, the various alignment methods were benchmarked against 38 pairwise alignments consisting of real sequence and structure observations in the test dataset. These pairwise alignments were obtained from the HOMSTRAD alignments. The sequence identity of these pairwise alignments ranged from 10% to 93%, with an average sequence identity of 39%. In addition to methods 1–10 in figure 11A, five structural alignment methods were also used: 11: StatAlign (Herman et al. 2014), 12: TMAlign (Zhang and Skolnick 2005), 13: Mammoth (Ortiz et al. 2002), 14: Dali (Holm and Rosenstrom 2010), and 15: CE (Shindyalov and Bourne 1998). These methods were not used in 11A due to the lack of an appropriate model for simulating the evolution of three-dimensional protein structures.

When benchmarking the MAP estimated alignments against the HOMSTRAD alignments (fig. 11B) using real sequences alone for inference (Aa, Ab), ETDBN (11B.5) had a similar degree of accuracy when compared with several other sequence-based methods (11B.1: StatAlign, 11B.2: BAli-Phy, 11B.3: MUSCLE, and 11B.4: MAFFT). This demonstrates that ETDBN has performance comparable to that of other commonly used sequence alignment methods.

Using (Sa, Sb) alone, ETDBN (11B.6) had substantially lower alignment similarity compared with sequence only, which was expected given that a similar result was obtained for the simulated data (11A.6). However, when including the real sequences (11B.7), the predictive accuracy was once again comparable to the sequence-only inferences (11B.1–11B.5).

FIG. 10. Depiction of two histidine-containing phosphocarriers, PDB 1pch and 1poh, superimposed. On the left is a cartoon representation of the two proteins, corresponding to regions F29-K49 and F29-A42, respectively, with posterior jump probabilities at each position overlaid. On the right is a ball-and-stick representation giving atomic detail for a smaller region (I36-G42 and T36-A42, respectively). The exchange between a glutamate (E39 in 1poh) and a glycine (G39 in 1pch) is associated with a large change in dihedral angle, as indicated by the curved arrows.


When using (Xa, Xb) alone (11B.8), the alignment similarity was found to be somewhat worse than in the sequence-only cases. Furthermore, when introducing the sequences (11B.9) and secondary structures (11B.10) in addition to the dihedral angles, the similarity remained worse than for the sequence-only methods (11B.1–11B.5), despite the additional information. These results are in contrast to the results we obtained for simulated data (11A.8–11A.10).

The non-statistical structural alignment methods (11B.12–11B.15) fared the best, likely because they use criteria similar to those used to construct the HOMSTRAD alignments. When interpreting these results, it is important to note that the HOMSTRAD alignments should not be considered the true underlying alignments and may even be strongly biased. For example, they may favor the closest structural superimposition of structures or the most parsimonious alignments with the fewest indels. In the evolutionary modeling context, our goal is to distinguish between homologous sites (sites that have evolved via mutation alone) and indels. In practice, it is extremely difficult to obtain the true underlying alignment (sets of homologies and indels), because it would require an experiment where every indel event since the common ancestor is observed, a seemingly impossible task outside of simulation or laboratory conditions.

After further investigation, the trend of lower alignment similarity seen in figure 11B.6–11B.10 when using ETDBN with structural observations, compared with sequence-only or non-statistical structural alignment methods, was found to reverse (11C.6–11C.10) upon calculating the precision of predicting homologous sites (the fraction of sites predicted as homologous that were correctly predicted as such). Therefore, when only dihedral angle observations are used, ETDBN underpredicts the number of homologous sites; however, when a homologous site is predicted, it is correctly predicted more often than when using only amino acid sequences. In particular, ETDBN predicted fewer homologous sites with coiled secondary structure compared with homologous sites with helical or sheet secondary structure. This pattern of results may be in part due to the WN diffusion used to model the evolution of dihedral angles. The WN diffusion is suitable for modeling angular drift (small changes in angles localized around a region of the Ramachandran plot), but does not sufficiently capture angular shift (large changes in angles between regions of the Ramachandran plot, which are more likely in coiled regions) due to stationarity. As noted before, the jump model is an abstraction intended to capture the end-points of evolution by allowing a jump between two regions of the Ramachandran plot, abstracting a potential intermediate evolutionary trajectory for the sake of computational tractability. Note that the jump model accurately captures the common cases where a single mutation induces a large conformational shift.
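The distinction drawn above between underpredicting homologous sites and predicting them precisely can be sketched as follows. This is a minimal illustration, assuming alignments are given as pairs of gapped strings and that a homologous site pair is any column in which neither row is a gap; the function names are hypothetical, and recall is included only to show how underprediction and high precision can coexist.

```python
def homologous_pairs(row_a, row_b, gap="-"):
    """Set of (i, j) residue index pairs aligned to each other."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def homology_precision(pred, ref):
    """Fraction of predicted homologous pairs that are also in the reference."""
    predicted = homologous_pairs(*pred)
    true = homologous_pairs(*ref)
    if not predicted:
        return 0.0  # undefined when nothing is predicted; 0.0 by convention here
    return len(predicted & true) / len(predicted)


def homology_recall(pred, ref):
    """Fraction of reference homologous pairs that were predicted."""
    predicted = homologous_pairs(*pred)
    true = homologous_pairs(*ref)
    if not true:
        return 0.0  # undefined when the reference has no pairs
    return len(predicted & true) / len(true)


# Toy example: the prediction replaces one reference homology with an indel.
reference = ("ACDE", "ACFE")
predicted = ("ACD-E", "AC-FE")
print(homology_precision(predicted, reference))  # 1.0
print(homology_recall(predicted, reference))     # 0.75
```

In the toy example every predicted homology is correct, yet one reference homology is missed, mirroring the pattern described for ETDBN with dihedral angle observations.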

Concluding Remarks

The main achievement of this work is a computationally tractable, generative, and interpretable probabilistic model of protein sequence and structure evolution on a local scale.

FIG. 11. Alignment benchmarks. (A) Averages of alignment similarity across different methods and combinations of data observations, where the observations were simulated from the ETDBN model conditioned on a single known alignment. (B) Averages of alignment similarity, where the 38 HOMSTRAD alignments in the test dataset were taken as the true alignments. (C) Averages of precision in predicting homologous site pairs in 38 sequence pairs from the test dataset, where homologous site pairs in the HOMSTRAD alignments were taken to be the true homologous site pairs.


Previous stochastic models of protein sequence and structure evolution emphasized estimation of evolutionary parameters (Challis and Schmidler 2012; Herman et al. 2014). ETDBN is somewhat of a departure from these previous models, but is likewise capable of estimating evolutionary parameters. We show that estimates of evolutionary times inferred under ETDBN are consistent regardless of whether amino acid sequence or dihedral angle observations are used. In addition, the relationship between evolutionary time and angular distance in real proteins is adequately recapitulated in protein pairs sampled under the model, although the angular distance is underestimated for larger evolutionary times, which might be explained by the limited flexibility of the jump model and by the fact that global dependencies are not taken into account.
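As an illustration of how an angular distance between two sets of dihedral angles might be computed, the sketch below treats each (φ, ψ) pair as a point on the torus, wraps each angular difference into [-π, π), and averages the resulting Euclidean distances over aligned positions. The precise definition of angular distance is not restated in this section, so this particular choice, along with the function names, is an assumption.

```python
import math


def wrap(angle):
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def torus_distance(x, y):
    """Distance between two (phi, psi) pairs, respecting periodicity."""
    dphi = wrap(x[0] - y[0])
    dpsi = wrap(x[1] - y[1])
    return math.hypot(dphi, dpsi)


def mean_angular_distance(xs, ys):
    """Mean torus distance over paired positions of two angle sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(torus_distance(x, y) for x, y in zip(xs, ys)) / len(xs)


# Angles near +pi and -pi are close on the torus even though their
# raw numerical difference is large.
true_angles = [(3.0, 0.5), (-1.0, 2.0)]
sampled_angles = [(-3.0, 0.5), (-1.2, 2.1)]
print(mean_angular_distance(true_angles, sampled_angles))
```

The wrapping step is the essential design choice: without it, conformations on either side of the ±π boundary would be reported as maximally distant.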

Like previous models, ETDBN is capable of dealing with alignment uncertainty by marginalizing over indel histories; it predicts pairwise MAP consensus alignments with accuracy similar to that of score-based and statistical alignment methods.

The generative nature of ETDBN allows us to demonstrate that the underlying empirical distributions over dihedral angles (depicted using Ramachandran plots) are captured, and that the model is capable of predicting missing observations, such as dihedral angles, from a variety of different data types: for example, an amino acid sequence, a homologous amino acid sequence, a homologous secondary structure, a homologous set of dihedral angles, or any combination thereof.

Based on its local nature, ETDBN does not constitute a homology modeling method in itself. Rather, it can be used as a building block, much like fragment libraries model local structure in protein structure prediction methods. ETDBN places the homology modeling problem on a statistical footing, enabling a number of approaches to later be used, such as multi-level modeling, that is, combining fine-grained distributions (for example, distributions over dihedral angles, such as ETDBN) and coarse-grained distributions (for example, distributions describing the global properties of proteins, such as compactness). In particular, a method referred to as 'the Reference Ratio method' can be used to combine fine-grained and coarse-grained distributions in a statistically principled manner (see Frellsen et al. 2012; Hamelryck et al. 2010). However, we have shown that the current model can already be used for the inference of evolutionary parameters in its present form.

In addition to multi-level modeling, probabilistic models such as ETDBN allow one to account for, and to make statements about, uncertainty (e.g., with respect to evolutionary time, alignment, etc.) in a rigorous manner. In principle, ETDBN, like TorusDBN (Boomsma et al. 2008), could be used as a proposal distribution. In other words, ETDBN could be used to sample protein structures (possibly conditioned on various data observations) in a computationally efficient manner, such that the resulting samples are expected to be located in regions of high probability density with respect to the true underlying distribution.

A final key feature of our evolutionary model is its interpretable nature. This interpretability enables the identification of potential evolutionary motifs: common patterns of sequence–structure evolution. We identify one such evolutionary motif in 34 different homologous protein pairs. A major direction for future research is the further identification of such evolutionary motifs. Understanding these evolutionary motifs may 1) improve homology modeling predictions, 2) provide more accurate estimates of evolutionary parameters, and 3) produce better models of protein evolution that more realistically capture evolutionary trajectories through sequence and structure space, which may help identify functionally relevant positions that are potential drug targets.

Future Challenges

Pairwise to Phylogeny

For reasons of computational tractability, the implemented model is pairwise, but it is theoretically possible to generalize it to a phylogeny, such as in Herman et al. (2014). In practice, for three or more sequences on a phylogeny, it is necessary to marginalize out the unobserved ancestral protein states in order to compute likelihoods. Felsenstein's algorithm can be used to marginalize over discrete ancestral states, such as amino acids, in a computationally efficient manner. However, we do not know whether a similar efficient algorithm exists for marginalizing the continuous ancestral dihedral angle states under the WN diffusion, thereby necessitating a more expensive MCMC algorithm. A possibly greater computational hindrance to considering a phylogeny is the alignment problem, which scales as $O(l_1 l_2 \cdots l_N)$, where $l_i$ is the length of sequence $i$ and $N$ is the number of sequences, although MCMC approaches are possible (Herman et al. 2014).
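For readers unfamiliar with the marginalization over discrete ancestral states mentioned above, the following is a minimal sketch of Felsenstein's pruning recursion for a single site. It assumes a symmetric substitution model with equal rates between all states and uniform equilibrium frequencies, chosen only to keep the example short; it is not the amino acid model used by ETDBN. The tree encoding (a leaf is an integer state, an internal node is a list of `(child, branch_length)` tuples) and all function names are hypothetical.

```python
import math


def transition_matrix(n_states, t):
    """Transition probabilities under a symmetric equal-rates model,
    scaled so that one substitution is expected per unit time."""
    decay = math.exp(-t * n_states / (n_states - 1))
    same = 1.0 / n_states + (1.0 - 1.0 / n_states) * decay
    diff = (1.0 - same) / (n_states - 1)
    return [[same if i == j else diff for j in range(n_states)]
            for i in range(n_states)]


def partial_likelihoods(node, n_states):
    """Return L[s] = P(observed leaves below node | node is in state s)."""
    if isinstance(node, int):  # leaf with an observed state
        return [1.0 if s == node else 0.0 for s in range(n_states)]
    likes = [1.0] * n_states
    for child, branch_length in node:
        child_likes = partial_likelihoods(child, n_states)
        P = transition_matrix(n_states, branch_length)
        for s in range(n_states):
            # Sum over the child's state, then multiply across children.
            likes[s] *= sum(P[s][c] * child_likes[c] for c in range(n_states))
    return likes


def site_likelihood(tree, n_states):
    """Likelihood of one site, with uniform frequencies at the root."""
    root = partial_likelihoods(tree, n_states)
    return sum(root) / n_states


# Three leaves with observed states 0, 1, 0 on a small rooted tree.
tree = [([(0, 0.1), (1, 0.2)], 0.05), (0, 0.3)]
print(site_likelihood(tree, n_states=4))
```

The recursion visits each branch once, so its cost grows linearly with the number of nodes rather than exponentially with the number of ancestral states to be summed over; it is this property for which no counterpart is known for continuous dihedral angle states under the WN diffusion.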

Context-Dependence

Although we believe our model provides a substantial improvement over current stochastic models of sequence and structural evolution, there is still scope for improvement. The WN diffusions used to model dihedral angle evolution adequately capture angular drift (small local changes in dihedral angle), but are less capable of capturing angular shift (large changes in dihedral angle). This is, to a considerable extent, mitigated by the introduction of jump events, as discussed before. A more realistic model would model the entire evolutionary trajectory, allowing an arbitrary number of switches between site-classes, together with neighboring dependencies amongst adjacent sites along the evolutionary trajectory. Similar context-dependent models are typically computationally expensive and require sophisticated inference procedures (Robinson et al. 2003; Yu and Thorne 2006).

Software Availability

Julia code (tested on both Windows and Linux platforms) is available at http://www.computingforbiology.org/software/etdbn


Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

The authors acknowledge funding from the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN). The second author acknowledges support from project MTM2016-76969-P from the Spanish Ministry of Economy and Competitiveness and ERDF. The authors acknowledge valuable comments from three referees that led to substantial improvements in the manuscript, as well as initial discussions with Christian Havn, Ian Lim, and Mathias Cronjäger.

References

Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2):195–201.

Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T. 2008. A generative, probabilistic model of local protein structure. Proc Natl Acad Sci U S A. 105(26):8932–8937.

Boomsma W, Tian P, Frellsen J, Ferkinghoff-Borg J, Hamelryck T, Lindorff-Larsen K, Vendruscolo M. 2014. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proc Natl Acad Sci U S A. 111(38):13852–13857.

Challis CJ, Schmidler SC. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol. 29(11):3575–3587.

Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett. 457(4):413–416.

Echave J, Fernandez FM. 2010. A perturbative view of protein structural variation. Proteins Struct Funct Bioinform. 78(1):173–180.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.

Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat. 125(1):1–15.

Frellsen J, Mardia KV, Borg M, Ferkinghoff-Borg J, Hamelryck T. 2012. Towards a general probabilistic model of protein structure: the reference ratio method. In: Bayesian methods in structural bioinformatics. New York: Springer. p. 125–134.

García-Portugués E, Sørensen M, Mardia KV, Hamelryck T. 2017. Langevin diffusions on the torus: estimation and applications. arXiv:1705.00296.

Gilks WR, Richardson S, Spiegelhalter D. 1995. Markov chain Monte Carlo in practice. London: CRC Press.

Grishin NV. 1997. Estimation of evolutionary distances from protein spatial structures. J Mol Evol. 45(4):359–369.

Grishin NV. 2001. Fold change in evolution of protein structures. J Struct Biol. 134(2):167–185.

Gutin AM, Badretdinov AY. 1994. Evolution of protein 3D structures as diffusion in multidimensional conformational space. J Mol Evol. 39(2):206–209.

Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 15(7):910–917.

Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. 2010. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE 5(11):e13714.

Herman JL, Challis CJ, Novak A, Hein J, Schmidler SC. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol. 31(9):2251–2266.

Holm L, Rosenstrom P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38(Web Server issue):W545–W549.

Katoh K, et al. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14):3059–3066.

Koshi JM, Goldstein RA. 1998. Models of natural mutations including site heterogeneity. Proteins 32(3):289–295.

Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21(6):1095–1109.

Lio P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14(8):726–733.

Miklos I, Lunter G, Holmes I. 2004. A long indel model for evolutionary sequence alignment. Mol Biol Evol. 21(3):529–540.

Mizuguchi K, Deane CM, Blundell TL, Overington JP. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7(11):2469–2471.

Ortiz AR, Strauss CE, Olmea O. 2002. Mammoth (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 11(11):2606–2621.

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 20(10):1692–1704.

Rohl CA, Strauss CE, Misura KM, Baker D. 2004. Protein structure prediction using Rosetta. Methods Enzymol. 383:66–93.

Schwartz AS, Myers EW, Pachter L. 2005. Alignment metric accuracy. arXiv preprint q-bio/0510052.

Shindyalov IN, Bourne PE. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9):739–747.

Siepel A, Haussler D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol. 11(2–3):413–428.

Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 34(1):3–16.

Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18(5):691–699.

Yu J, Thorne JL. 2006. Dependence among sites in RNA evolution. Mol Biol Evol. 23(8):1525–1537.

Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33(7):2302–2309.

Golden et al doi101093molbevmsx137 MBE

2100Downloaded from httpsacademicoupcommbearticle-abstract34820853772139by Edward Boyle Library useron 08 November 2017

Page 12: A Generative Angular Model of Protein Structure Evolutioneprints.whiterose.ac.uk/123598/7/msx137.pdf · motif present in a large number of homologous protein pairs. The generative

FIG 8 Benchmarks of predictive accuracy (measured using angular distance lower is better) on a random subset of ten protein pairs in the testdataset giving a representative view of predictive accuracy under six different combinations of observations The dihedral angles Xb of pb weretreated as missing and were sampled under the model whereas pa was a homologous protein used for the purposes of prediction See the legend offigure 7 for a description of each combination (AndashF) The final set of bars denoted lsquoMean (Nfrac14 38)rsquo are the mean values for the entire test dataset ofNfrac14 38 protein pairs The error bars are the standard errors

FIG 9 Depiction of evolutionary hidden state 3 This hidden state was sampled at 069 of sites (the average was 156) The equilibriumfrequencies of r1 and r2 were p1 frac14 0691 and p2 frac14 0309 respectively The jump rate was cfrac14 3176 The corresponding site-class pair probabilitiesare depicted to the right as a function of evolutionary time Note that the dashed red lines depicting the probabilities for (1 2) and (2 1)superimpose exactly because the probabilities are equalmdashthis holds for the jump probabilities of all hidden states as it is required for time-reversibility In the main figure the two rows depict the parameters encoded by the two site-classes respectively Columns 1 and 3 depict theparameters governing the amino acid and secondary structure stochastic processes respectively The secondary structure classes correspond toHfrac14 helix Sfrac14 sheet and Cfrac14 coil Column 2 depicts the WN diffusions The stationary distributions of the WN diffusions are shown using blackcontour lines the direction of the drifts are indicated by the arrows and the magnitude of the drifts at each position indicated using the colorgradient

Golden et al doi101093molbevmsx137 MBE

2096Downloaded from httpsacademicoupcommbearticle-abstract34820853772139by Edward Boyle Library useron 08 November 2017

different pairwise alignments and corresponding evolutionarytimes This resulted in a set of 38 simulated pairwise align-ments together with corresponding observations implyingthat the true underlying alignments were known for eachof the simulated protein pairs ETDBN and a number otheralignment methods were used to infer pairwise alignmentsfor each The alignment similarity metric (Schwartz et al2005) was used to measure the similarity between the inferredalignments and the true alignments where higher similarityindicates better predictions It was found that when using thesimulated amino acid sequences alone ETDBN (11A5) out-performed all four other methods tested (11A1 MUSCLE11A2 MAFFT 11A3 StatAlign and 11A4 BAli-Phy)However the greater performance of ETDBN compared withother methods cannot be considered a fair comparison as thedata were simulated under the ETDBN model

More revealing in figure 11A was the alignment similar-ity under ETDBN when using different combinations ofsimulated data observations It was found that secondarystructure alone (11A6) performed the worst which is un-surprising given that only three states were available toalign the proteins The second worst in terms of alignmentsimilarity was amino acid sequences alone (11A5) fol-lowed by amino acid sequences and secondary structures(11A7) Interestingly using dihedral angles only (11A8)outperformed both 11A5 (sequences only) and 11A6 (sec-ondary structures only) Finally using amino acid sequencetogether with dihedral angles (11A9) or all three datatypes combined (11A10) outperformed all other combi-nations This illustrates that at least under simulation

conditions increasing the number of data observationsresults in better alignment accuracy

Following that the various alignment methods werebenchmarked against 38 pairwise alignments consisting ofreal sequence and structure observations in the test datasetThese pairwise alignments were obtained from theHOMSTRAD alignments The sequence identity of these pair-wise alignments ranged from 10 to 93 with an averagesequence identity of 39 In addition to methods 1ndash10 infigure 11A five structural alignment methods were also used11 StatAlign (Herman et al 2014) 12 TMAlign (Zhang andSkolnick 2005) 13 Mammoth (Ortiz et al 2002) 14 Dali(Holm and Rosenstrom 2010) and 15 CE (Shindyalov andBourne 1998) These methods were not used in 11A due tothe lack of an appropriate model for simulating the evolutionof three-dimensional protein structures

When benchmarking the MAP estimated alignmentsagainst the HOMSTRAD alignments (fig 11B) using real se-quences alone for inference (Aa Ab) ETDBN (11B5) had asimilar degree of accuracy when compared with several othersequence-based methods (11B1 StatAlign 11B2 BaliPhy11B3 MUSCLE and 11B4 MAFFT) This demonstrates thatETDBN has performance comparable to that of othercommonly-used sequence alignment methods

Using (Sa Sb) alone ETDBN (11B6) had substantially loweralignment similarity compared with sequence only whichwas expected given that a similar result was obtained forthe simulated data (11A6) However when including thereal sequences (11B7) the predictive accuracy was once againcomparable to sequence only inferences (11B1ndash11B5)

FIG 10 Depiction of two histidine-containing phosphocarriers PDB 1pch and 1poh superimposed On the left is a cartoon representation of thetwo proteins corresponding to regions F29-K49 and F29-A42 respectively with posterior jump probabilities at each position overlaid On the rightis a ball-and-stick representation giving atomic detail for a smaller region (I36-G42 and T36-A42 respectively) The exchange between a glutamate(E39 in 1poh) and a glycine (G39 in 1pch) is associated with a large change in dihedral angle as indicated by the curved arrows

Evolutionary TorusDBN doi101093molbevmsx137 MBE

2097Downloaded from httpsacademicoupcommbearticle-abstract34820853772139by Edward Boyle Library useron 08 November 2017

When using (Xa Xb) alone (11B8) the alignment similaritywas found to be somewhat worse than the sequence onlycases Furthermore when introducing the sequences (119)and secondary structures (11B10) in addition to the dihedralangles the similarity remained worse than the sequence onlymethods (11B1ndash11B5) despite the additional informationThese results are in contrast to the results we obtained forsimulated data (11A8ndash11A10)

The non-statistical structural alignment methods (11B12ndash11B15) faired the best likely because they use a criteria similarto that used to align the HOMSTRAD alignments Wheninterpreting these results it is important to note theHOMSTRAD alignments should not be considered the trueunderlying alignments and may even be strongly biased Forexample they may favor the closest structural superimposi-tion of structures or the most parsimonious alignments withthe fewest number of indels In the evolutionary modelingcontext our goal is to distinguish between homologous sites(sites that have evolved via mutation alone) and indels Inpractice it is extremely difficult to obtain the true underlyingalignment (sets of homologies and indels) because it wouldrequire an experiment where every indel event since thecommon ancestor is observed a seeming impossible taskoutside of simulation or laboratory conditions

After further investigation the trend of lower alignmentsimilarity seen in figure 11B6ndash11B10 when using ETDBN withstructural observations compared with sequence only or non-statistical structural alignment methods was found to reverse(11C6ndash11C10) upon calculating the precision of predictinghomologous sites (the fraction of sites which were predictedas homologous and were correctly predicted as such)Therefore when only dihedral angle observations are usedETDBN underpredicts the number of homologous sites how-ever when a homologous site is predicted it is correctly pre-dicted more often than when using only amino acidsequences In particular ETDBN predicted fewer homologoussites with coiled secondary structure compared with homol-ogous sites with helical or sheet secondary structure Thispattern of results may be in part due to the WN diffusionused to model evolution of dihedral angles The WN diffusionis suitable for modeling angular drift (small changes in angleslocalized around a region of the Ramachandran plot) butdoes not sufficiently capture angular shift (large changes inangles between regions of the Ramachandran plot which aremore likely in coiled regions) due to stationarity As notedbefore the jump model is an abstraction intended to capturethe end-points of evolution by allowing a jump between tworegions of the Ramachandran plot abstracting a potentialintermediate evolutionary trajectory for the sake of compu-tational tractability Note that the jump model accuratelycaptures the common cases where a single mutation inducesa large conformational shift

Concluding RemarksThe main achievement of this work is a computationallytractable generative and interpretable probabilistic modelof protein sequence and structure evolution on a local scale

A

B

C

FIG 11 Alignment benchmarks (A) averages of alignment similarityacross different methods and combinations of data observations wherethe observations were simulated from the ETDBN model conditioned on asingle known alignment (B) averages of alignment similarity where the 38HOMSTRAD alignments in the test dataset were taken as the true align-ments (C) averages of precision in predicting homologous site pairs in 38sequence pairs from the test dataset where homologous site pairs in theHOMSTRAD alignments were taken to be the true homologous site pairs

Golden et al doi101093molbevmsx137 MBE

2098Downloaded from httpsacademicoupcommbearticle-abstract34820853772139by Edward Boyle Library useron 08 November 2017

Previous stochastic models of protein sequence and structure evolution emphasized estimation of evolutionary parameters (Challis and Schmidler 2012; Herman et al. 2014). ETDBN is somewhat of a departure from these previous models, but is likewise capable of estimating evolutionary parameters. We show that estimates of evolutionary times inferred under ETDBN are consistent regardless of whether amino acid sequence or dihedral angle observations are used. In addition, the relationship between evolutionary time and angular distance in real proteins is adequately recapitulated in protein pairs sampled under the model, although the angular distance is underestimated for larger evolutionary times, which might be explained by the limited flexibility of the jump model and the failure to take global dependencies into account.

Like previous models, ETDBN is capable of dealing with alignment uncertainty by marginalizing over indel histories; it predicts pairwise MAP consensus alignments with accuracy similar to that of score-based and statistical alignment methods.

The generative nature of ETDBN allows us to demonstrate that the underlying empirical distributions over dihedral angles (depicted using Ramachandran plots) are captured, and that the model is capable of predicting missing observations, such as dihedral angles, from a variety of different data types: for example, an amino acid sequence, a homologous amino acid sequence, a homologous secondary structure, a homologous set of dihedral angles, or any combination thereof.

Based on its local nature, ETDBN does not constitute a homology modeling method in itself. Rather, it can be used as a building block, much like fragment libraries model local structure in protein structure prediction methods. ETDBN places the homology modeling problem on a statistical footing, enabling a number of approaches to later be used, such as multi-level modeling, that is, combining fine-grained distributions (for example, distributions over dihedral angles, such as ETDBN) and coarse-grained distributions (for example, distributions describing the global properties of proteins, such as compactness). In particular, a method referred to as 'the Reference Ratio method' can be used to combine fine-grained and coarse-grained distributions in a statistically principled manner (see Frellsen et al. 2012; Hamelryck et al. 2010). However, we have shown that the current model can already be used for the inference of evolutionary parameters in its present form.

In addition to multi-level modeling, probabilistic models such as ETDBN allow one to account for and to make statements about uncertainty (e.g., with respect to evolutionary time, alignment, etc.) in a rigorous manner. In principle, ETDBN, like TorusDBN (Boomsma et al. 2008), could be used as a proposal distribution. In other words, ETDBN could be used to sample protein structures (possibly conditioned on various data observations) in a computationally efficient manner, such that the resulting samples are expected to be located in regions of high probability density with respect to the true underlying distribution.

A final key feature of our evolutionary model is its interpretable nature. This interpretability enables the identification of potential evolutionary motifs: common patterns of sequence–structure evolution. We identify one such evolutionary motif in 34 different homologous protein pairs. A major direction for future research is the further identification of such evolutionary motifs. Understanding these evolutionary motifs may 1) improve homology modeling predictions, 2) provide more accurate estimates of evolutionary parameters, and 3) produce better models of protein evolution that more realistically capture evolutionary trajectories through sequence and structure space, which may help identify functionally relevant positions that are potential drug targets.

Future Challenges

Pairwise to Phylogeny

For reasons of computational tractability, the implemented model is pairwise, but it is theoretically possible to generalize it to a phylogeny, such as in Herman et al. (2014). In practice, for three or more sequences on a phylogeny, it is necessary to marginalize out the unobserved ancestral protein states in order to compute likelihoods. Felsenstein's algorithm can be used to marginalize over discrete ancestral states, such as amino acids, in a computationally efficient manner. However, we do not know whether a similar efficient algorithm exists for marginalizing the continuous ancestral dihedral angle states under the WN diffusion, thereby necessitating a more expensive MCMC algorithm. A possibly greater computational hindrance to considering a phylogeny is the alignment problem, which scales as $O(l_1 l_2 \cdots l_N)$, where $l_i$ is the length of sequence $i$ and $N$ is the number of sequences, although MCMC approaches are possible (Herman et al. 2014).
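To illustrate how discrete ancestral states can be marginalized efficiently, the following is a minimal sketch of Felsenstein's algorithm (often called the pruning algorithm) on a three-leaf tree. It is not the authors' implementation: a two-state symmetric Markov model stands in for a 20-state amino acid model, and the tree shape, branch lengths, and all function names are hypothetical choices made for illustration. A brute-force sum over the two ancestral nodes is included only to check the result.

```python
import itertools
import math


def transition_matrix(t, rate=1.0):
    """Two-state symmetric model: P[i][j] = P(state j after time t | state i)."""
    same = 0.5 + 0.5 * math.exp(-2.0 * rate * t)
    return [[same, 1.0 - same], [1.0 - same, same]]


# A leaf is ("leaf", observed_state); an internal node is
# ("node", [(child, branch_length), ...]).
def partial_likelihoods(node):
    """Return L[s] = P(observed leaves below node | node is in state s)."""
    if node[0] == "leaf":
        return [1.0 if s == node[1] else 0.0 for s in range(2)]
    result = [1.0, 1.0]
    for child, t in node[1]:
        P = transition_matrix(t)
        child_L = partial_likelihoods(child)
        for s in range(2):
            # Sum out the child's state, one branch at a time.
            result[s] *= sum(P[s][c] * child_L[c] for c in range(2))
    return result


def pruning_likelihood(root, prior=(0.5, 0.5)):
    """Likelihood of the leaf observations, marginalized over all ancestors."""
    L = partial_likelihoods(root)
    return sum(prior[s] * L[s] for s in range(2))


def make_tree(a, b, c):
    """Hypothetical tree ((A:0.1, B:0.2):0.3, C:0.4) with leaf states a, b, c."""
    inner = ("node", [(("leaf", a), 0.1), (("leaf", b), 0.2)])
    return ("node", [(inner, 0.3), (("leaf", c), 0.4)])


def brute_force_likelihood(a, b, c, prior=(0.5, 0.5)):
    """Explicit sum over root state r and inner-node state u (for checking)."""
    total = 0.0
    for r, u in itertools.product(range(2), repeat=2):
        total += (prior[r]
                  * transition_matrix(0.3)[r][u]
                  * transition_matrix(0.1)[u][a]
                  * transition_matrix(0.2)[u][b]
                  * transition_matrix(0.4)[r][c])
    return total


print(pruning_likelihood(make_tree(0, 1, 0)))
print(brute_force_likelihood(0, 1, 0))
```

The recursion visits each branch once and sums over states locally, so its cost grows linearly with the number of nodes, whereas the brute-force sum grows exponentially with the number of ancestral nodes. The open question raised above is whether an analogous recursion exists when the ancestral states are continuous dihedral angles evolving under the WN diffusion.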

Context-Dependence

Although we believe our model provides a substantial improvement over current stochastic models of sequence and structural evolution, there is still scope for improvement. The WN diffusions used to model dihedral angle evolution adequately capture angular drift (small local changes in dihedral angle), but are less capable of capturing angular shift (large changes in dihedral angle). This is to a considerable extent mitigated by the introduction of jump events, as discussed before. A more realistic model would model the entire evolutionary trajectory, allowing an arbitrary number of switches between site-classes, together with neighboring dependencies amongst adjacent sites along the evolutionary trajectory. Similar context-dependent models are typically computationally expensive and require sophisticated inference procedures (Robinson et al. 2003; Yu and Thorne 2006).

Software Availability

Julia code (tested on both Windows and Linux platforms) is available at http://www.computingforbiology.org/software/etdbn.


Supplementary Material

Supplementary materials are available at Molecular Biology and Evolution online.

Acknowledgments

The authors acknowledge funding from the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN). The second author acknowledges support from project MTM2016-76969-P from the Spanish Ministry of Economy and Competitiveness and ERDF. Authors acknowledge valuable comments from three referees that led to substantial improvements in the manuscript, as well as initial discussions with Christian Havn, Ian Lim, and Mathias Cronjäger.

References

Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2):195–201.

Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T. 2008. A generative, probabilistic model of local protein structure. Proc Natl Acad Sci U S A. 105(26):8932–8937.

Boomsma W, Tian P, Frellsen J, Ferkinghoff-Borg J, Hamelryck T, Lindorff-Larsen K, Vendruscolo M. 2014. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proc Natl Acad Sci U S A. 111(38):13852–13857.

Challis CJ, Schmidler SC. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol. 29(11):3575–3587.

Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett. 457(4):413–416.

Echave J, Fernandez FM. 2010. A perturbative view of protein structural variation. Proteins Struct Funct Bioinform. 78(1):173–180.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.

Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat. 125(1):1–15.

Frellsen J, Mardia KV, Borg M, Ferkinghoff-Borg J, Hamelryck T. 2012. Towards a general probabilistic model of protein structure: the reference ratio method. In: Bayesian methods in structural bioinformatics. New York: Springer. p. 125–134.

García-Portugués E, Sørensen M, Mardia KV, Hamelryck T. 2017. Langevin diffusions on the torus: estimation and applications. arXiv:1705.00296.

Gilks WR, Richardson S, Spiegelhalter D. 1995. Markov chain Monte Carlo in practice. London: CRC Press.

Grishin NV. 1997. Estimation of evolutionary distances from protein spatial structures. J Mol Evol. 45(4):359–369.

Grishin NV. 2001. Fold change in evolution of protein structures. J Struct Biol. 134(2):167–185.

Gutin AM, Badretdinov AY. 1994. Evolution of protein 3D structures as diffusion in multidimensional conformational space. J Mol Evol. 39(2):206–209.

Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 15(7):910–917.

Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. 2010. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE 5(11):e13714.

Herman JL, Challis CJ, Novak A, Hein J, Schmidler SC. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol. 31(9):2251–2266.

Holm L, Rosenstrom P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38(Web Server issue):W545–W549.

Katoh K, et al. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14):3059–3066.

Koshi JM, Goldstein RA. 1998. Models of natural mutations including site heterogeneity. Proteins 32(3):289–295.

Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21(6):1095–1109.

Lio P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14(8):726–733.

Miklos I, Lunter G, Holmes I. 2004. A long indel model for evolutionary sequence alignment. Mol Biol Evol. 21(3):529–540.

Mizuguchi K, Deane CM, Blundell TL, Overington JP. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7(11):2469–2471.

Ortiz AR, Strauss CE, Olmea O. 2002. Mammoth (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 11(11):2606–2621.

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 20(10):1692–1704.

Rohl CA, Strauss CE, Misura KM, Baker D. 2004. Protein structure prediction using Rosetta. Methods Enzymol. 383:66–93.

Schwartz AS, Myers EW, Pachter L. 2005. Alignment metric accuracy. arXiv preprint q-bio/0510052.

Shindyalov IN, Bourne PE. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9):739–747.

Siepel A, Haussler D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol. 11(2–3):413–428.

Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 34(1):3–16.

Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18(5):691–699.

Yu J, Thorne JL. 2006. Dependence among sites in RNA evolution. Mol Biol Evol. 23(8):1525–1537.

Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33(7):2302–2309.



different pairwise alignments and corresponding evolutionary times. This resulted in a set of 38 simulated pairwise alignments, together with corresponding observations, implying that the true underlying alignments were known for each of the simulated protein pairs. ETDBN and a number of other alignment methods were used to infer pairwise alignments for each. The alignment similarity metric (Schwartz et al. 2005) was used to measure the similarity between the inferred alignments and the true alignments, where higher similarity indicates better predictions. It was found that when using the simulated amino acid sequences alone, ETDBN (11A5) outperformed all four other methods tested (11A1: MUSCLE, 11A2: MAFFT, 11A3: StatAlign, and 11A4: BAli-Phy). However, the greater performance of ETDBN compared with other methods cannot be considered a fair comparison, as the data were simulated under the ETDBN model.

More revealing in figure 11A was the alignment similarity under ETDBN when using different combinations of simulated data observations. It was found that secondary structure alone (11A6) performed the worst, which is unsurprising given that only three states were available to align the proteins. The second worst in terms of alignment similarity was amino acid sequences alone (11A5), followed by amino acid sequences and secondary structures (11A7). Interestingly, using dihedral angles only (11A8) outperformed both 11A5 (sequences only) and 11A6 (secondary structures only). Finally, using amino acid sequence together with dihedral angles (11A9), or all three data types combined (11A10), outperformed all other combinations. This illustrates that, at least under simulation conditions, increasing the number of data observations results in better alignment accuracy.

Following that, the various alignment methods were benchmarked against 38 pairwise alignments consisting of real sequence and structure observations in the test dataset. These pairwise alignments were obtained from the HOMSTRAD alignments. The sequence identity of these pairwise alignments ranged from 10% to 93%, with an average sequence identity of 39%. In addition to methods 1–10 in figure 11A, five structural alignment methods were also used: 11) StatAlign (Herman et al. 2014), 12) TMAlign (Zhang and Skolnick 2005), 13) Mammoth (Ortiz et al. 2002), 14) Dali (Holm and Rosenstrom 2010), and 15) CE (Shindyalov and Bourne 1998). These methods were not used in 11A due to the lack of an appropriate model for simulating the evolution of three-dimensional protein structures.

When benchmarking the MAP estimated alignments against the HOMSTRAD alignments (fig. 11B) using real sequences alone for inference $(A_a, A_b)$, ETDBN (11B5) had a similar degree of accuracy when compared with several other sequence-based methods (11B1: StatAlign, 11B2: BAli-Phy, 11B3: MUSCLE, and 11B4: MAFFT). This demonstrates that ETDBN has performance comparable to that of other commonly used sequence alignment methods.

Using $(S_a, S_b)$ alone, ETDBN (11B6) had substantially lower alignment similarity compared with sequence only, which was expected given that a similar result was obtained for the simulated data (11A6). However, when including the real sequences (11B7), the predictive accuracy was once again comparable to sequence-only inferences (11B1–11B5).

FIG. 10. Depiction of two histidine-containing phosphocarriers, PDB 1pch and 1poh, superimposed. On the left is a cartoon representation of the two proteins, corresponding to regions F29-K49 and F29-A42, respectively, with posterior jump probabilities at each position overlaid. On the right is a ball-and-stick representation giving atomic detail for a smaller region (I36-G42 and T36-A42, respectively). The exchange between a glutamate (E39 in 1poh) and a glycine (G39 in 1pch) is associated with a large change in dihedral angle, as indicated by the curved arrows.


When using $(X_a, X_b)$ alone (11B8), the alignment similarity was found to be somewhat worse than in the sequence-only cases. Furthermore, when introducing the sequences (11B9) and secondary structures (11B10) in addition to the dihedral angles, the similarity remained worse than that of the sequence-only methods (11B1–11B5), despite the additional information. These results are in contrast to the results we obtained for simulated data (11A8–11A10).

The non-statistical structural alignment methods (11B12–11B15) fared the best, likely because they use criteria similar to those used to construct the HOMSTRAD alignments. When interpreting these results, it is important to note that the HOMSTRAD alignments should not be considered the true underlying alignments, and may even be strongly biased. For example, they may favor the closest structural superimposition of structures or the most parsimonious alignments with the fewest indels. In the evolutionary modeling context, our goal is to distinguish between homologous sites (sites that have evolved via mutation alone) and indels. In practice, it is extremely difficult to obtain the true underlying alignment (sets of homologies and indels), because it would require an experiment where every indel event since the common ancestor is observed, a seemingly impossible task outside of simulation or laboratory conditions.


Page 14: A Generative Angular Model of Protein Structure Evolutioneprints.whiterose.ac.uk/123598/7/msx137.pdf · motif present in a large number of homologous protein pairs. The generative

When using (Xa Xb) alone (11B8) the alignment similaritywas found to be somewhat worse than the sequence onlycases Furthermore when introducing the sequences (119)and secondary structures (11B10) in addition to the dihedralangles the similarity remained worse than the sequence onlymethods (11B1ndash11B5) despite the additional informationThese results are in contrast to the results we obtained forsimulated data (11A8ndash11A10)

The non-statistical structural alignment methods (11B12ndash11B15) faired the best likely because they use a criteria similarto that used to align the HOMSTRAD alignments Wheninterpreting these results it is important to note theHOMSTRAD alignments should not be considered the trueunderlying alignments and may even be strongly biased Forexample they may favor the closest structural superimposi-tion of structures or the most parsimonious alignments withthe fewest number of indels In the evolutionary modelingcontext our goal is to distinguish between homologous sites(sites that have evolved via mutation alone) and indels Inpractice it is extremely difficult to obtain the true underlyingalignment (sets of homologies and indels) because it wouldrequire an experiment where every indel event since thecommon ancestor is observed a seeming impossible taskoutside of simulation or laboratory conditions

After further investigation, the trend of lower alignment similarity seen in figure 11B6–11B10 when using ETDBN with structural observations, compared with sequence-only or non-statistical structural alignment methods, was found to reverse (11C6–11C10) upon calculating the precision of predicting homologous sites (the fraction of sites predicted as homologous that were correctly predicted as such). Therefore, when only dihedral angle observations are used, ETDBN underpredicts the number of homologous sites; however, when a homologous site is predicted, it is correctly predicted more often than when using only amino acid sequences. In particular, ETDBN predicted fewer homologous sites with coiled secondary structure compared with homologous sites with helical or sheet secondary structure. This pattern of results may be in part due to the WN diffusion used to model evolution of dihedral angles. The WN diffusion is suitable for modeling angular drift (small changes in angles localized around a region of the Ramachandran plot) but, due to stationarity, does not sufficiently capture angular shift (large changes in angles between regions of the Ramachandran plot, which are more likely in coiled regions). As noted before, the jump model is an abstraction intended to capture the end-points of evolution by allowing a jump between two regions of the Ramachandran plot, abstracting away a potential intermediate evolutionary trajectory for the sake of computational tractability. Note that the jump model accurately captures the common cases where a single mutation induces a large conformational shift.
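The precision of predicting homologous sites, as described in the text, can be illustrated with a minimal Python sketch. The gapped-string representation of a pairwise alignment, the function names, and the toy sequences are assumptions made for illustration and are not taken from the ETDBN implementation. Recall is reported alongside precision only to show how a method can underpredict homologous sites while remaining precise.

```python
def homologous_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue-index pairs placed in the same column."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # residue indices in the ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))  # both residues present: a homologous site pair
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def precision_recall(predicted, reference):
    """Precision: fraction of predicted pairs that also occur in the reference."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy reference alignment (standing in for a curated alignment).
reference = homologous_pairs("ACDE-FG", "AC-EHFG")
# Toy predicted alignment of the same two sequences with fewer aligned columns.
predicted = homologous_pairs("AC-DE-FG", "ACE--HFG")

precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every predicted pair is present in the reference, but one reference pair is missed, which mirrors the pattern of underprediction combined with high precision.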

Concluding Remarks

The main achievement of this work is a computationally tractable, generative, and interpretable probabilistic model of protein sequence and structure evolution on a local scale.

FIG. 11. Alignment benchmarks. (A) Averages of alignment similarity across different methods and combinations of data observations, where the observations were simulated from the ETDBN model conditioned on a single known alignment. (B) Averages of alignment similarity, where the 38 HOMSTRAD alignments in the test dataset were taken as the true alignments. (C) Averages of precision in predicting homologous site pairs in 38 sequence pairs from the test dataset, where homologous site pairs in the HOMSTRAD alignments were taken to be the true homologous site pairs.


Previous stochastic models of protein sequence and structure evolution emphasized estimation of evolutionary parameters (Challis and Schmidler 2012; Herman et al. 2014). ETDBN is somewhat of a departure from these previous models, but is likewise capable of estimating evolutionary parameters. We show that estimates of evolutionary times inferred under ETDBN are consistent regardless of whether amino acid sequence or dihedral angle observations are used. In addition, the relationship between evolutionary time and angular distance in real proteins is adequately recapitulated in protein pairs sampled under the model, although the angular distance is underestimated for larger evolutionary times, which might be explained by the limited flexibility of the jump model and by the model not taking global dependencies into account.

Like previous models, ETDBN is capable of dealing with alignment uncertainty by marginalizing over indel histories; it predicts pairwise MAP consensus alignments with accuracy similar to that of score-based and statistical alignment methods.

The generative nature of ETDBN allows us to demonstrate that the underlying empirical distributions over dihedral angles (depicted using Ramachandran plots) are captured, and that the model is capable of predicting missing observations, such as dihedral angles, from a variety of different data types: for example, an amino acid sequence, a homologous amino acid sequence, a homologous secondary structure, a homologous set of dihedral angles, or any combination thereof.

Because of its local nature, ETDBN does not constitute a homology modeling method in itself. Rather, it can be used as a building block, much like fragment libraries model local structure in protein structure prediction methods. ETDBN places the homology modeling problem on a statistical footing, enabling a number of approaches to be used later, such as multi-level modeling, that is, combining fine-grained distributions (for example, distributions over dihedral angles, such as ETDBN) and coarse-grained distributions (for example, distributions describing the global properties of proteins, such as compactness). In particular, a method referred to as 'the Reference Ratio method' can be used to combine fine-grained and coarse-grained distributions in a statistically principled manner (see Frellsen et al. 2012; Hamelryck et al. 2010). However, we have shown that the current model can already be used for the inference of evolutionary parameters in its present form.
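As a brief sketch of how such a combination is usually written, the reference ratio construction can be stated as follows. The notation is an assumption introduced here for illustration and does not appear in this article: $g(x)$ is a fine-grained distribution over local structure $x$ (for example, dihedral angles), $y = m(x)$ is a coarse-grained variable computed from $x$ (for example, a compactness measure), $f_1(y)$ is the desired coarse-grained distribution, and $g_1(y)$ is the distribution over $y$ implied by $g$.

```latex
f(x) \;=\; \frac{f_1\bigl(m(x)\bigr)}{g_1\bigl(m(x)\bigr)}\, g(x),
\qquad
g_1(y) \;=\; \int g(x)\, \delta\bigl(y - m(x)\bigr)\, \mathrm{d}x .
```

Under this construction the combined distribution $f$ has $f_1$ as its marginal over $y$, while the conditional distribution of $x$ given $y$ is the same as under $g$.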

In addition to multi-level modeling, probabilistic models such as ETDBN allow one to account for and to make statements about uncertainty (e.g., with respect to evolutionary time, alignment, etc.) in a rigorous manner. In principle, ETDBN, like TorusDBN (Boomsma et al. 2008), could be used as a proposal distribution. In other words, ETDBN could be used to sample protein structures (possibly conditioned on various data observations) in a computationally efficient manner, such that the resulting samples are expected to be located in regions of high probability density with respect to the true underlying distribution.
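One common way to use a generative model as a proposal distribution is an independence Metropolis-Hastings sampler. The sketch below is a toy illustration under stated assumptions: the article does not specify a sampling scheme, the bimodal target on the circle is a stand-in for an unknown "true" distribution over a single angle, and a von Mises distribution plays the role that a model such as ETDBN would play as the proposal. All function names and parameter values are hypothetical.

```python
import math
import random


def log_target(theta):
    """Unnormalised log-density of a toy bimodal target on the circle (assumed)."""
    return math.log(
        math.exp(4.0 * math.cos(theta - 1.0))
        + 0.5 * math.exp(4.0 * math.cos(theta + 2.0))
    )


def log_proposal(theta, mu, kappa):
    """Unnormalised von Mises log-density; the constant cancels in the ratio."""
    return kappa * math.cos(theta - mu)


def independence_sampler(n_steps, mu=1.0, kappa=1.0, seed=0):
    """Independence Metropolis-Hastings: proposals do not depend on the current state."""
    rng = random.Random(seed)
    current = rng.vonmisesvariate(mu, kappa)
    samples, accepted = [], 0
    for _ in range(n_steps):
        candidate = rng.vonmisesvariate(mu, kappa)
        # Importance-weight ratio w(candidate) / w(current), with w = target / proposal.
        log_ratio = (log_target(candidate) - log_proposal(candidate, mu, kappa)) - (
            log_target(current) - log_proposal(current, mu, kappa)
        )
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = candidate
            accepted += 1
        samples.append(current)
    return samples, accepted / n_steps


samples, acceptance_rate = independence_sampler(5000)
print(f"acceptance rate: {acceptance_rate:.2f}")
```

The closer the proposal is to the target, the higher the acceptance rate, which is the sense in which a proposal that already places samples in regions of high probability density is computationally efficient.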

A final key feature of our evolutionary model is its interpretable nature. This interpretability enables the identification of potential evolutionary motifs: common patterns of sequence–structure evolution. We identify one such evolutionary motif in 34 different homologous protein pairs. A major direction for future research is the further identification of such evolutionary motifs. Understanding these evolutionary motifs may 1) improve homology modeling predictions, 2) provide more accurate estimates of evolutionary parameters, and 3) produce better models of protein evolution that more realistically capture evolutionary trajectories through sequence and structure space, which may help identify functionally relevant positions that are potential drug targets.

Future Challenges

Pairwise to Phylogeny

For reasons of computational tractability, the implemented model is pairwise, but it is theoretically possible to generalize it to a phylogeny, as in Herman et al. (2014). In practice, for three or more sequences on a phylogeny, it is necessary to marginalize out the unobserved ancestral protein states in order to compute likelihoods. Felsenstein's algorithm can be used to marginalize over discrete ancestral states, such as amino acids, in a computationally efficient manner. However, we do not know whether a similarly efficient algorithm exists for marginalizing the continuous ancestral dihedral angle states under the WN diffusion, thereby necessitating a more expensive MCMC algorithm. A possibly greater computational hindrance to considering a phylogeny is the alignment problem, which scales as $O(l_1 l_2 \cdots l_N)$, where $l_i$ is the length of sequence $i$ and $N$ is the number of sequences, although MCMC approaches are possible (Herman et al. 2014).
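The marginalization over discrete ancestral states can be sketched with a minimal Python version of Felsenstein's pruning recursion for a single aligned site. The nested-list tree encoding, the equal-rates substitution model with a uniform stationary distribution, and all names below are assumptions chosen to keep the example short; they are not the substitution model or data structures used by ETDBN.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids


def transition_prob(i, j, t, k):
    """Equal-rates model on k states (assumed), one expected substitution per unit time."""
    decay = math.exp(-k * t / (k - 1))
    if i == j:
        return 1.0 / k + (k - 1) / k * decay
    return (1.0 - decay) / k


def partial_likelihoods(node, alphabet):
    """Return L[s] = P(observed leaves below node | node is in state s).

    A leaf is a one-letter string; an internal node is a list of
    (child, branch_length) pairs.
    """
    k = len(alphabet)
    if isinstance(node, str):
        return [1.0 if s == node else 0.0 for s in alphabet]
    result = [1.0] * k
    for child, branch_length in node:
        child_l = partial_likelihoods(child, alphabet)
        for i in range(k):
            # Sum out the child's state, then multiply across children.
            result[i] *= sum(
                transition_prob(i, j, branch_length, k) * child_l[j] for j in range(k)
            )
    return result


def site_likelihood(tree, alphabet):
    """Average the root partial likelihoods over a uniform root distribution."""
    return sum(partial_likelihoods(tree, alphabet)) / len(alphabet)


# Three observed residues at one site: ((A:0.1, C:0.2):0.05, A:0.3).
tree = [([("A", 0.1), ("C", 0.2)], 0.05), ("A", 0.3)]
print(site_likelihood(tree, ALPHABET))
```

Each internal node is visited once and costs a number of operations quadratic in the alphabet size per child, rather than the exponential cost of enumerating every joint assignment of ancestral states. No analogous finite sum is available when the ancestral state is a continuous pair of dihedral angles.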

Context-Dependence

Although we believe our model provides a substantial improvement over current stochastic models of sequence and structural evolution, there is still scope for improvement. The WN diffusions used to model dihedral angle evolution adequately capture angular drift (small local changes in dihedral angle), but are less capable of capturing angular shift (large changes in dihedral angle). This is to a considerable extent mitigated by the introduction of jump events, as discussed before. A more realistic model would model the entire evolutionary trajectory, allowing an arbitrary number of switches between site-classes together with neighboring dependencies amongst adjacent sites along the evolutionary trajectory. Similar context-dependent models are typically computationally expensive and require sophisticated inference procedures (Robinson et al. 2003; Yu and Thorne 2006).
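The difference between drift and shift can be made concrete with a univariate wrapped normal density. This is only a rough illustration: the model described in the article uses a diffusion on the two-dimensional torus, whereas the sketch below evaluates a one-dimensional wrapped normal by truncating its infinite sum, and the mean, standard deviation, and truncation level are assumed values.

```python
import math


def wrapped_normal_pdf(theta, mu, sigma, n_wraps=10):
    """Density of a normal(mu, sigma) wrapped onto the circle, truncated to 2*n_wraps+1 terms."""
    total = 0.0
    for k in range(-n_wraps, n_wraps + 1):
        z = (theta - mu + 2.0 * math.pi * k) / sigma
        total += math.exp(-0.5 * z * z)
    return total / (sigma * math.sqrt(2.0 * math.pi))


# Density of a small displacement (0.2 rad) versus a large one (2.5 rad)
# from the mean, for a concentrated distribution (sigma = 0.5 rad, assumed).
drift = wrapped_normal_pdf(0.2, mu=0.0, sigma=0.5)
shift = wrapped_normal_pdf(2.5, mu=0.0, sigma=0.5)
print(f"drift density / shift density = {drift / shift:.3g}")
```

A distribution concentrated around one region of angle space assigns very little density to angles in a distant region, which is consistent with the motivation given above for adding explicit jump events.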

Software Availability

Julia code (tested on both Windows and Linux platforms) is available at http://www.computingforbiology.org/software/etdbn


Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Acknowledgments

The authors acknowledge funding from the University of Copenhagen 2016 Excellence Programme for Interdisciplinary Research (UCPH2016-DSIN). The second author acknowledges support from project MTM2016-76969-P from the Spanish Ministry of Economy and Competitiveness and ERDF. The authors acknowledge valuable comments from three referees that led to substantial improvements in the manuscript, as well as initial discussions with Christian Havn, Ian Lim, and Mathias Cronjäger.

References

Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2):195–201.

Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T. 2008. A generative, probabilistic model of local protein structure. Proc Natl Acad Sci U S A. 105(26):8932–8937.

Boomsma W, Tian P, Frellsen J, Ferkinghoff-Borg J, Hamelryck T, Lindorff-Larsen K, Vendruscolo M. 2014. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proc Natl Acad Sci U S A. 111(38):13852–13857.

Challis CJ, Schmidler SC. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol. 29(11):3575–3587.

Echave J. 2008. Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett. 457(4):413–416.

Echave J, Fernandez FM. 2010. A perturbative view of protein structural variation. Proteins Struct Funct Bioinform. 78(1):173–180.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.

Felsenstein J. 1985. Phylogenies and the comparative method. Am Nat. 125(1):1–15.

Frellsen J, Mardia KV, Borg M, Ferkinghoff-Borg J, Hamelryck T. 2012. Towards a general probabilistic model of protein structure: the reference ratio method. In: Bayesian methods in structural bioinformatics. New York: Springer. p. 125–134.

García-Portugués E, Sørensen M, Mardia KV, Hamelryck T. 2017. Langevin diffusions on the torus: estimation and applications. arXiv:1705.00296.

Gilks WR, Richardson S, Spiegelhalter D. 1995. Markov chain Monte Carlo in practice. London: CRC Press.

Grishin NV. 1997. Estimation of evolutionary distances from protein spatial structures. J Mol Evol. 45(4):359–369.

Grishin NV. 2001. Fold change in evolution of protein structures. J Struct Biol. 134(2):167–185.

Gutin AM, Badretdinov AY. 1994. Evolution of protein 3D structures as diffusion in multidimensional conformational space. J Mol Evol. 39(2):206–209.

Halpern AL, Bruno WJ. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 15(7):910–917.

Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. 2010. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS ONE 5(11):e13714.

Herman JL, Challis CJ, Novak A, Hein J, Schmidler SC. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol. 31(9):2251–2266.

Holm L, Rosenstrom P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38(Web Server issue):W545–W549.

Katoh K, et al. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14):3059–3066.

Koshi JM, Goldstein RA. 1998. Models of natural mutations including site heterogeneity. Proteins 32(3):289–295.

Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21(6):1095–1109.

Lio P, Goldman N, Thorne JL, Jones DT. 1998. PASSML: combining evolutionary inference and protein secondary structure prediction. Bioinformatics 14(8):726–733.

Miklos I, Lunter G, Holmes I. 2004. A long indel model for evolutionary sequence alignment. Mol Biol Evol. 21(3):529–540.

Mizuguchi K, Deane CM, Blundell TL, Overington JP. 1998. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7(11):2469–2471.

Ortiz AR, Strauss CE, Olmea O. 2002. Mammoth (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 11(11):2606–2621.

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 20(10):1692–1704.

Rohl CA, Strauss CE, Misura KM, Baker D. 2004. Protein structure prediction using Rosetta. Methods Enzymol. 383:66–93.

Schwartz AS, Myers EW, Pachter L. 2005. Alignment metric accuracy. arXiv preprint q-bio/0510052.

Shindyalov IN, Bourne PE. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9):739–747.

Siepel A, Haussler D. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol. 11(2–3):413–428.

Thorne JL, Kishino H, Felsenstein J. 1992. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 34(1):3–16.

Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18(5):691–699.

Yu J, Thorne JL. 2006. Dependence among sites in RNA evolution. Mol Biol Evol. 23(8):1525–1537.

Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33(7):2302–2309.
