Spatial Autocorrelation of Amino Acid Replacement Rates in the Vasopressin Receptor Family

Lorraine Marsh

Received: 3 June 2008 / Accepted: 11 November 2008 / Published online: 4 December 2008

© Springer Science+Business Media, LLC 2008

Abstract Evolutionary rates of sites can be independent

of one another or correlated in some fashion. Significant

spatial autocorrelation was observed for site amino acid

replacement rates in vasopressin receptor family proteins

(VPRs). Spatial autocorrelation of rates is the propensity of

residues to lie near other residues of similar rate in the

folded protein structure. Optimal correlation occurred at a

distance suggesting that residues in contact had correlated

rates. As another way to study the same phenomenon, VPR

was partitioned into >40 contiguous (10-Å)³ spatial

clusters for amino acid replacement rate estimation. Parti-

tioning was done without preconception of functional

regions of the protein and with a random partition control.

Cluster rates exhibited an overdispersed distribution sug-

gesting that rates were not randomly distributed in the

spatial partitions. In tests, cluster partitioning improved

maximum likelihood and Bayesian likelihood models for

VPR evolution. Spatial clusters with outlier rates, or line-

age-specific clusters differing in rate, proved to contain

VPR features likely to be under selection. Thus the spatial

autocorrelation observed is probably not just a statistical

finding, but likely has an evolutionary basis in protein

function.

Keywords Rate variation · Autocorrelation · Clustering · Vasopressin receptor · Bayesian phylogenetic inference · Gamma rate distribution

Introduction

The study of amino acid replacement rate variation allows

better prediction of rates, may improve phylogenetic infer-

ence, and gives insight into selection. There are many

sources of amino acid replacement rate variation among

sites. A number of sophisticated models have been proposed

to correlate site rate with tertiary structure of a protein (Dean

et al. 2002; Choi et al. 2007; Robinson et al. 2003; Marsh

and Griffiths 2005). Many of these are based on secondary

structure, solvent accessibility, and functionality of each site.

In some models protein domains are allowed to evolve at

independent rates (Van Damme et al. 2007). A general

perspective has been that residues that are constrained by

functional or by structural roles are less free to evolve and,

hence, exhibit a lower rate of change. Such roles are typically

spatially limited in proteins. However, because of the folded

structure of proteins, sites that make up a single spatial

domain may not be contiguous in the primary sequence.

A variety of approaches to integrating structural data

into evolutionary models has been proposed. One class of

approaches involves models assigning amino acid

replacement rates to specific classes of structural sites.

Examples are sites located at the surface of the protein or

part of a specific secondary structure or part of a functional

site (Dean et al. 2002; Choi et al. 2006). Methods for

evolutionary analysis based on the folded structure of

proteins have, however, found limited use, in part because

of their complexity and the lack of programs that accept

these analyses as input. However, the advent of major

efforts to solve protein structures provides a widening set

of X-ray crystallographic templates which have great

potential for evolutionary studies.

An alternative approach to site variation is to ignore

protein and DNA origins of rate differences and, instead, to

L. Marsh (✉)

Department of Biology, Long Island University, Brooklyn,

NY 11201, USA

e-mail: [email protected]

J Mol Evol (2009) 68:28–39

DOI 10.1007/s00239-008-9183-4

look at variation as a purely statistical problem. A number

of different functions have been used to fit site variation,

but the most common is the discretized gamma rate dis-

tribution (Yang 1994). This distribution takes parameters

often fit to the data. Despite the convenience, the gamma

rate distribution often underfits the observed data, and the

underlying problem of the source of rate variation remains.

The gamma rate distribution is essentially a curve-fitting

process which treats residue change as a black box. Though

it models the amino acid replacement rate variation and

significantly accommodates many patterns of rate varia-

tion, it does not attempt to capture the underlying biology

of amino acid replacement rate variation (Felsenstein

2001). Other models have been proposed to better fit rate

distributions (Ninio et al. 2007; Huelsenbeck and Suchard

2007). For example, a gamma mixture model fits many

proteins better than the single gamma model (Mayrose

et al. 2005). Some rate distributions are multimodal, for

instance, and poorly fit by the gamma rate distribution but

fit well by mixtures. Discrete models with higher numbers

of parameters that attempt to fit data better have also been

proposed (Susko et al. 2003). These are improvements on

the curve-fitting properties of the distribution but do not

address the biological basis of amino acid replacement rate

variation.

An approach to incorporating the concept of structure

implicitly is to consider that adjacent residues in the linear

sequence are spatially adjacent as well. Linear autocorre-

lation in primary protein sequence describes the tendency

of residues of like function to lie near each other in the

linear sequence and to share amino acid replacement rates

(Stern and Pupko 2006; Chakrabarti and Lanczycki 2007).

Linear autocorrelation of rates can be due to DNA features

(Elango et al. 2008) or protein structure (Mayrose et al.

2007). Linear autocorrelation provides a model to predict

amino acid replacement rate based on sequence position.

However, not all residues adjacent in the folded protein

structure are adjacent in the linear sequence.

The use of structure-based evolutionary models is lim-

ited by the requirement that the protein structure must be

known. Fortunately, protein structures form families and

related proteins share similar structures or folds. For proteins

sharing a fold, a model of the protein of interest can be

inferred using the structure of a related (homologous) protein.

Homology modeling is a method in which a protein of

unknown structure is modeled on a template protein of

known structure. Though homology-modeled proteins may

be only approximate at the atomic level, they tend to be

accurate in topology as long as the correct template is

chosen and alignment is correct (Fiser and Sali 2003). For

instance, the grouping of residues in the vicinity of active

sites is generally valid in properly prepared homology

models even if the model and structure differ in detail

(Chakrabarti and Sowdhamini 2004). For many evolu-

tionary purposes the accuracy of homology models may be

more than adequate and may extend the utility of structural

methods to most proteins (Marsh and Griffiths 2005).

The vasopressin receptor (VPR) is a nonapeptide

receptor of the G-protein-coupled receptor family (GPCR),

related to rhodopsin. The VPR is interesting because it

contains several distinct functional regions: peptide ligand

binding, G-protein binding, core protein stability, and

receptor switch domain. Residues that function together are

predicted to be scattered in the primary sequence of VPRs.

In addition, the VPR has a paralogue, the oxytocin receptor

(OR). The VPR and OR exhibit some sensitivity to each

other’s ligands (vasopressin and oxytocin differ at only two

positions). The VPR acts in blood pressure and fluid

homeostasis, whereas the OR mediates uterine contractions

in labor. Evolutionary selection may differ in these

receptors of different function, providing opportunities to

study paralogue specificity.

Here we describe approaches for analyzing amino acid

replacement rate variation based on spatial autocorrelation.

One method involved correlation of rates in space. The

other method was based on the determination of rates in

three-dimensional (3D) spatial clusters of residues set as

partitions. A variety of tests was used to show that amino

acid replacement rates of residues in these clusters are

correlated and significant to evolutionary analyses. The

method allowed identification of clusters of residues of

VPRs that might share a common selection.

Methods

Sequence Sampling

Sequences were retrieved from the National Center for

Biotechnology Information (NCBI). Diverse vertebrate

vasopressin receptor homologues and paralogues were

selected (Table 1). Octopus octopressin receptor was cho-

sen as an outgroup. Multiple alignment of sequences was

performed using Clustal W (Thompson et al. 1994) except

for the alignment with rhodopsin, which was performed as

described below. Clustal W settings were a gap opening

penalty of 10, a gap extension penalty of 0.2, and a Gonnet

series substitution matrix. Structures were retrieved from

the Protein Data Bank (PDB) (Berman et al. 2000). The

sequence from the rhodopsin structure was taken from the

appropriate PDB file.

Homology Modeling

Homology modeling was used to generate a structural

protein model for study. The human V1aR receptor (414

amino acid residues) was modeled using a rhodopsin

template. Rhodopsin is one of only two possible templates,

and is the best-studied template for modeling G-protein-

coupled receptors. V1aR and rhodopsin (PDB code: 1f88)

were aligned using a dynamic programming method with

gaps suppressed in transmembrane domains. Transmem-

brane segments in GPCRs cannot tolerate gaps or insertions

and must occur in a fixed order. Thus weakly similar

segments can be aligned. The alignment of conserved

GPCR motifs (Shi and Javitch 2002) in each of the seven

transmembrane domains was confirmed by eye. Modeller

8v1 (Fiser and Sali 2003) was used for homology

modeling with the Automodel setting. No attempt was

made to refine the loops of the receptor, which were also

modeled on rhodopsin and contained conserved Cys resi-

dues that constrain rhodopsin loops. Visually, the modeled

loops appeared to be located in appropriate regions of the

receptor.

Autocorrelation Calculations

To understand the role of clustered amino acid replacement

rates, autocorrelation, the tendency of such rates to cluster,

was studied. Autocorrelation of site variation was deter-

mined by finding the correlation of residue amino acid

replacement rate with rates of neighboring residues. This

was accomplished by quantifying the similarity of an index

residue amino acid replacement rate to the rates of residues

within a threshold distance in the context of the folded

protein. All residues in turn were used as index residues so

the analysis was not biased.

The autocorrelation approach adopted here would be

compatible with either parsimony- or model-based meth-

ods of determining rate. Though parsimony has some

limitations, it was adopted for this study. Evolutionary

amino acid replacement rates for each residue of the

protein were determined on a parsimony tree inferred by

Paup* 4.0 (Swofford 1998) using the category ‘tree steps’

as a measure of site rate. Moran’s (1950) I was determined

to calculate spatial autocorrelation. Moran’s I takes values

between -1 (negative autocorrelation) and 1 (positive

autocorrelation), with 0 representing no autocorrelation. Distances

were measured from the C-beta carbon of all residues

except glycine, for which the C-alpha carbon was used.

Significance levels were determined by nonparametric

bootstrap replication. Sites within the distance threshold

were resampled to test if the value of Moran’s I remained

above 0. To eliminate the effect of linear autocorrelation,

only sites separated by more than seven residues were

included in some calculations.
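
The calculation described above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes the textbook form of Moran’s I with binary weights (1 for residue pairs within the distance threshold, 0 otherwise), which may differ in normalization from the statistic reported here. The function name `morans_i` and its parameters are hypothetical; `coords` stands for the C-beta (or glycine C-alpha) coordinates and `rates` for the per-site parsimony step counts.

```python
import numpy as np

def morans_i(coords, rates, max_dist, min_seq_sep=0):
    """Moran's I with binary weights for sites within max_dist of each other.

    coords      : (n, 3) array of residue coordinates, in sequence order
    rates       : length-n array of per-site rates (e.g. parsimony tree steps)
    max_dist    : distance threshold in the same units as coords
    min_seq_sep : pairs separated by this many residues or fewer are ignored,
                  which removes the contribution of linear autocorrelation
    """
    coords = np.asarray(coords, dtype=float)
    rates = np.asarray(rates, dtype=float)
    n = len(rates)
    dev = rates - rates.mean()                      # s_i: deviation from mean rate

    # Pairwise Euclidean distances between residues.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Pairwise separation along the primary sequence.
    idx = np.arange(n)
    seq_sep = np.abs(idx[:, None] - idx[None, :])

    # Binary weights; seq_sep > min_seq_sep also excludes i == j.
    weights = (dist <= max_dist) & (seq_sep > min_seq_sep)

    w_sum = weights.sum()
    denom = (dev ** 2).sum()
    if w_sum == 0 or denom == 0:
        return float("nan")                         # no usable pairs or no rate variation
    numer = (weights * np.outer(dev, dev)).sum()
    return (n / w_sum) * numer / denom
```

Calling `morans_i(coords, rates, max_dist=7.0, min_seq_sep=7)` would correspond to the 7-Å analysis with linear neighbors excluded, under the assumptions stated above.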

Spatial Partitions

Using spatially clustered sites, spatial dependence of rates

could be determined. VPRs were partitioned into 3D spa-

tially contiguous clusters. These clusters were not

contiguous in linear sequence. Using the program Contact

(Marsh 2006) with settings to output clusters, (10 Å)³ cubic

regions were defined. A limit of a minimum of six residues

per cluster was set and residues not associated with a valid

cluster were placed in a spare, nonspatial cluster. The total

number of clusters was 47 (46 spatial clusters averaging

8.43 sites and 1 spare cluster of 26 sites). Residues falling

into more than one spatial cluster were randomly assigned

to one or the other. The net result was an assignment of

each residue to one and only one spatial grouping. New

data matrices were generated by rearranging sites of the

data matrix to make contiguous partitions containing sites

of each cluster. Rates were estimated independently for 46

partitions.
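
A simple way to picture this partitioning is a regular grid of cubes laid over the structure. The sketch below is an assumption-laden illustration and does not reproduce the Contact program: it bins residues into non-overlapping cubes of 10-Å edge (so no residue can fall into two cubes and no random assignment is needed), keeps cubes with at least six residues as clusters, and places all remaining residues in one spare group. The function name `cubic_clusters` and its parameters are hypothetical.

```python
from collections import defaultdict
import numpy as np

def cubic_clusters(coords, edge=10.0, min_size=6):
    """Assign residues to non-overlapping cubic cells of a regular grid.

    Returns (clusters, spare): clusters is a list of lists of residue indices
    for cells holding at least min_size residues; spare collects the indices
    of residues from sparsely occupied cells.
    """
    coords = np.asarray(coords, dtype=float)
    # Shift so the grid origin sits at the minimum corner of the structure.
    cells = np.floor((coords - coords.min(axis=0)) / edge).astype(int)

    groups = defaultdict(list)
    for residue_index, cell in enumerate(cells):
        groups[tuple(cell)].append(residue_index)

    clusters = [members for members in groups.values() if len(members) >= min_size]
    spare = sorted(i for members in groups.values()
                   if len(members) < min_size for i in members)
    return clusters, spare
```

Each residue ends up in exactly one group, so the resulting index lists can be used directly to reorder alignment columns into contiguous partitions.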

An advantage of the 47 clusters was that they allocated

each residue once and only once. A disadvantage was that

Table 1 Receptor proteins used in analysis

Source                        Abbreviation(s)             Ligand
Homo sapiens, human           VPR1a, V1BRhum, V2Rhum      Vasopressin
                              OXYRhum                     Oxytocin
Bos taurus, cow               VPR1Cow, V2Rbovin           Vasopressin
                              OXYRbovin                   Oxytocin
Rattus norvegicus, rat        V1ARrat, V1BRrat, V2Rrat    Vasopressin
                              OXYRrat                     Oxytocin
Gallus gallus, chicken        VTR1chick, VTRpitchick      Vasotocin
Rana catesbeiana, frog        VTRrana                     Vasotocin
Bufo marinus, toad            MTRbufma                    Mesotocin
Takifugu rubripes, fish       VTR1Ataki, VTR1Btaki        Vasotocin
Catostomus commersoni, fish   ITRcatos                    Isotocin
Octopus vulgaris, octopus     OPRoct                      Octopressin

if a special protein feature lay at the boundary of two or

more clusters, it would be split. A second set of clusters

was defined, also of (10-Å)³ cubes, but now comprising 380

overlapping clusters. This set was used for evolutionary

feature detection and to find individual clusters with

anomalous behavior. To determine amino acid replacement

rates for the 380 clusters, an iterative analysis was used.

One cluster was selected at a time to make a partition, with

the remainder of the protein comprising a second partition.

The amino acid replacement rate of the cluster partition

was recorded and the likelihood of the model with the

partitions was noted.

Sites were also partitioned based on surface accessibility.

The V1aR protein was analyzed for accessibility as described

(Marsh and Griffiths 2005). Sites were divided

into surface (≥4.9-Å² accessible surface) or core (<4.9-Å²

accessible surface). The cutoff value for categorizing as

core or surface was determined by optimization of maxi-

mum likelihood (ML) trials. The estimated evolutionary

rate for the surface partition was 2.21 times that of the core

partition. It should be noted that in shape and scale, these

accessibility partitions differed greatly from the spatial

partitions described above.

Maximum Likelihood Analysis

ML studies were carried out to compare cluster methods to

other evolutionary models. In particular, we wanted to

understand whether clustered rate variation based on

location in the 3D structure of the protein fit evolutionary

data better than alternative models for the data. Models

tested included a single rate model, the gamma rate dis-

tribution model, and mixed models. ML studies used the

Codeml module of the PAML 3.14 application. Mega3.1 was

used to generate a VPR topology using the neighbor-joining

(N-J) method and Poisson correction setting. Variations

in tree topology are predicted not to have a large effect on

the analysis used here. Indels in alignment files and the

structural files were removed by the ‘complete deletion’

method. Files for Paml were rearranged during partitioning

to allow noncontiguous sites of each cluster to be grouped.

Simulated unpartitioned proteins were generated with

Evolver in the Paml package with a JTT evolutionary

model and a gamma rate distribution model (α = 0.78,

based on a VPR tree analysis). Sites of simulated protein

were randomly reordered to remove latent linear

autocorrelation.

Model Comparison

The ML analyses provided information about how well a

model fit a set of evolutionary data. It was important to

compare models and test whether differences were

significant. For instance, we wanted to know if the cluster

model was significantly better than a single amino acid

replacement rate model on the VPR data set. Models were

compared using the Akaike Information Criterion (AIC).

$\mathrm{AIC} = 2\,[L(\theta_2) - L(\theta_1)] - 2\,(p_2 - p_1)$, where $L(\theta_1)$ is the

log likelihood of one model given its $p_1$ inferred parameters $\theta_1$, and

$L(\theta_2)$ is the log likelihood of an alternative, not necessarily

nested, model. Alternatively, Bayesian models were com-

pared using a Bayes factor. To calculate the Bayes factor,

the harmonic mean of posterior probabilities was taken as

an estimate of average likelihood (Huelsenbeck and Ron-

quist 2001; Newton and Raftery 1994). The Bayes factor

was calculated as the ratio of likelihoods of two models.
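
The two comparisons above can be written compactly. The following is a minimal sketch under stated assumptions, not the scripts used in the study: inputs are log likelihoods, the Bayes factor is returned on the log scale, and the harmonic mean is computed with a log-sum-exp step for numerical stability. All function names are hypothetical.

```python
import math

def aic_difference(lnL1, p1, lnL2, p2):
    """AIC difference between two models; positive values favor model 2.

    lnL1, lnL2 : maximized log likelihoods of models 1 and 2
    p1, p2     : numbers of free parameters of models 1 and 2
    """
    return 2.0 * (lnL2 - lnL1) - 2.0 * (p2 - p1)

def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods sampled from the posterior.

    log HM = log(n) - log(sum_k exp(-lnL_k)), evaluated stably by factoring
    out the largest exponent.
    """
    neg = [-x for x in log_likelihoods]
    largest = max(neg)
    log_sum = largest + math.log(sum(math.exp(v - largest) for v in neg))
    return math.log(len(neg)) - log_sum

def log_bayes_factor(samples_model2, samples_model1):
    """Log Bayes factor of model 2 over model 1 from MCMC log likelihoods."""
    return log_harmonic_mean(samples_model2) - log_harmonic_mean(samples_model1)
```

The harmonic mean estimator is simple to compute from MCMC output but is known to be unstable, so values obtained this way are best read as rough guides to model support.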

Amino Acid Replacement Rate Dispersion Between

Clusters

Clusters were analyzed for overdispersion of rate, that is,

greater variation in rate between clusters than expected by

chance. The mean number of evolutionary steps per cluster

was used to classify clusters into rate categories and the

distribution of rate classes was tested statistically. A par-

simony tree generated by Paup* 4.0 was used to determine

evolutionary steps for each site in a cluster. Steps were then

summed for each cluster and adjusted for cluster size. For

Poisson tests of overdispersion the variability of number of

sites for each cluster necessitated a different approach.

Clusters of different sizes could not be mixed. Instead the

problem was divided and groups of clusters with the same

number of residues were analyzed together. For each size

class of cluster, a comparison was made to the model

Poisson distribution for the data mean and sample number

appropriate for that size cluster. The results of each size

class analysis were then merged using weighted sums cor-

responding to the number of clusters in each size class. The

net result was a comparison of the rates of clusters to a

model Poisson-based curve. To determine significance the

calculations were repeated using jackknife (50%) replicates,

dropping sites out of clusters. Site rates can be thought of as

having variability due to autocorrelation effects and variability

due to random error. This test analyzed only the

variability of rates due to autocorrelation; random error

was predicted to be Poisson-distributed and hence not to

interfere with the analysis. The statistic variance/mean was

evaluated, with a value >1 indicating overdispersion. The

Poisson distribution is expected for randomly distributed

amino acid replacement rates, whereas overdispersed or

autocorrelated rates will exhibit broader distribution curves.
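
The variance/mean statistic and the pooling across cluster size classes can be sketched as below. This is an illustrative reading of the procedure rather than the original implementation: it assumes the input is the summed parsimony steps per cluster, grouped by the number of residues in the cluster, and that size classes are combined by a mean weighted by the number of clusters in each class. The jackknife resampling step is not shown, and the function names are hypothetical.

```python
import numpy as np

def dispersion_index(step_counts):
    """Variance-to-mean ratio of counts: near 1 for Poisson data, >1 if overdispersed."""
    counts = np.asarray(step_counts, dtype=float)
    mean = counts.mean()
    if counts.size < 2 or mean == 0:
        return float("nan")            # undefined for a single cluster or all-zero counts
    return counts.var(ddof=1) / mean   # sample variance divided by sample mean

def pooled_dispersion(steps_by_size):
    """Combine per-size-class indices, weighting by the number of clusters per class.

    steps_by_size maps cluster size (residues per cluster) to the list of
    summed step counts for all clusters of that size.
    """
    total, weight = 0.0, 0
    for counts in steps_by_size.values():
        d = dispersion_index(counts)
        if not np.isnan(d):
            total += d * len(counts)
            weight += len(counts)
    return total / weight if weight else float("nan")
```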

Simulated Spatially Autocorrelated Trees

The ability of the cluster model to function in phylogenetic

inference was studied. Simulation was used to test the

ability of various models to detect the correct topology of a

tree. The goal was to assess the potential of the structural/

spatial model in phylogenetic reconstruction. Phylogenies

were simulated using the Evolver program in the PAML

3.14 package (Yang 1997). To simulate autocorrelated

protein phylogenies, an unrooted tree of four proteins with

unequal branch lengths was used, with no molecular clock.

Twenty partitions were made in a simulated protein of 400

residues. Each partition was set to a fixed rate to mimic a

spatial cluster. For trials the distribution of amino acid

replacement rates among partitions during simulation was

set by a gamma rate distribution with α = 1 or 2. All trees

simulated for this test emulated spatial autocorrelation but

the analysis models varied and included spatial autocorre-

lation, gamma model, and single rate. More than 100 trials

with randomly simulated trees were analyzed by ML with a

given model and the tree topology with the highest likeli-

hood was noted. The same set of data was analyzed by each

model. The proportion of correct topologies was deter-

mined for each method.
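
The rate structure used in these simulations can be illustrated with a short sketch. It covers only the assignment of per-site rates, not sequence evolution itself (which was done with Evolver), and rests on assumptions not stated in the text: partitions are of equal size, partition rates are drawn from a gamma distribution with mean 1, and the drawn rates are rescaled so that their mean is exactly 1. The function name and the seed are hypothetical.

```python
import numpy as np

def partition_site_rates(n_sites=400, n_partitions=20, alpha=1.0, seed=0):
    """Per-site rates in which all sites of a partition share one gamma-drawn rate."""
    if n_sites % n_partitions != 0:
        raise ValueError("n_sites must be divisible by n_partitions")
    rng = np.random.default_rng(seed)
    # Gamma with shape alpha and scale 1/alpha has mean 1.
    partition_rates = rng.gamma(shape=alpha, scale=1.0 / alpha, size=n_partitions)
    partition_rates /= partition_rates.mean()       # rescale to mean exactly 1
    sites_per_partition = n_sites // n_partitions
    # Contiguous blocks of sites stand in for spatial clusters.
    return np.repeat(partition_rates, sites_per_partition)
```

With the defaults, each of the 20 partitions contributes a block of 20 sites sharing one rate, mimicking a protein in which rates are constant within a spatial cluster.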

Bayesian Evolutionary Analysis

To test the utility of the spatial partition method in phy-

logenetic analysis, Bayesian phylogenetic reconstruction

was carried out. Evolutionary analysis was carried out with

the VPR data set. Partitioned (clustered) and nonpartitioned

(gamma, single rate) models were tested. MrBayes1.1.1

(Huelsenbeck and Ronquist 2001) was used. For the cluster

model, data matrix partitions were set to VPR spatial

clusters. Partitions were derived from 47 (10 Å)³ clusters,

each of which contained a mutually exclusive set of resi-

dues and at least six residues. Each partition was allowed to

equilibrate to its own rate.

An unpartitioned analysis using the gamma model

served as a control, with four amino acid replacement rate

categories used with the discrete gamma option. Bayesian

reconstruction was carried out using four independent

MCMC processes, three heated and one cold. The protein

model was JTT. For Bayes factor analysis, the harmonic

mean of each posterior probability was calculated and used

in model selection as described above.

Results and Discussion

Spatial Autocorrelation of Amino Acid Replacement

Rates and Evolution

Structural influences play an important role in protein

evolution. One of the main goals of this work was to study

whether amino acid replacement rates in VPRs (Table 1)

were spatially clustered and, if so, whether those clusters

had functional significance. The first test was to assay

whether residues positioned next to one another in the

folded structure had, on average, similar rates of evolu-

tionary change. It is known that amino acid replacement

rates of residues near one another in the linear protein

sequence are autocorrelated (Stern and Pupko 2006;

Mayrose et al. 2007). We wanted to determine whether that

correlation could be extended to 3D structure.

Spatial autocorrelation of amino acid replacement rates

was tested on different spatial scales. Moran’s I was used to

quantitate autocorrelation using steps on a parsimony VPR

tree as a surrogate for site rates.

$$I = \frac{\sum_i \sum_j s_i s_j}{\sum_i s_i^2} \qquad (1)$$

where si was the difference between the number of steps at

a site i and the mean number of steps for the population of

sites. Autocorrelation and clustering, by their natures,

depend on distance. The analysis was thresholded to

include only residue pairs with a through-space Euclidean

distance less than some test value representing spatial

scale. Correlation of the amino acid replacement rates of

residues separated by different distances was analyzed. As

shown in Fig. 1, the peak distance for autocorrelation of

rates was 7 Å, which is approximately the range of interaction

of an amino acid residue with surrounding residues.

One interpretation of this result is that rates of amino acid

residues that are in contact in the folded protein tend to be

similar. Autocorrelation values were modest, but they were

significant (P < 0.01) by bootstrap analysis and remained

significant when linear autocorrelation was removed from

the calculation. At a 7-Å spatial distance, Moran’s I with

linear autocorrelation removed was 0.497 (bootstrap

P < 0.005), higher than with linear autocorrelation included.

Thus there is a significant chance that any amino acid

[Fig. 1 plot omitted. x-axis: Distance (Angstroms), 0 to 30; y-axis: Autocorrelation, 0 to 0.5]

Fig. 1 Spatial autocorrelation of amino acid replacement rates in

VPRs. Autocorrelation (Moran’s I) was calculated for different

residue distances to test for possible clustering of rates. Each distance

was inclusive, that is, it included all intermediate distances. The rates

were significantly autocorrelated, using a nonparametric bootstrap

replication test (P < 0.01)

residue of VPR will be in contact with residues of similar

rate.
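
The bootstrap significance test mentioned above can be sketched as follows. The exact resampling scheme is not fully specified in the text, so this is one plausible reading and should be treated as an assumption: the neighbor pairs within the distance threshold are resampled with replacement, the autocorrelation statistic is recomputed as the mean cross-product of rate deviations divided by the mean squared deviation, and the P value is the fraction of replicates in which the statistic is not above 0. The function name and arguments are hypothetical.

```python
import numpy as np

def bootstrap_p_not_positive(pair_products, mean_sq_dev, n_boot=1000, seed=0):
    """Fraction of bootstrap replicates with autocorrelation <= 0.

    pair_products : s_i * s_j for every neighbor pair within the threshold,
                    where s_i is a site's deviation from the mean rate
    mean_sq_dev   : mean of s_i ** 2 over all sites (must be > 0)
    """
    rng = np.random.default_rng(seed)
    pair_products = np.asarray(pair_products, dtype=float)
    n_pairs = len(pair_products)
    # Each row holds the indices of one bootstrap replicate of the pair set.
    resampled = rng.integers(0, n_pairs, size=(n_boot, n_pairs))
    stats = pair_products[resampled].mean(axis=1) / mean_sq_dev
    return float((stats <= 0).mean())
```

A returned value below 0.01 would correspond to the significance level quoted in the text, under the assumptions of this sketch.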

Testing a Spatial Cluster Model of Amino Acid

Replacement Rate Variation

Since spatial autocorrelation occurred, we wanted to

determine whether a model incorporating these results

would be practical. A pure autocorrelation model proved

too unwieldy to be useful. Instead, a simpler, related model

based on partitioning proteins by spatial criteria was tested

to see if it exhibited an improved fit for protein evolution.

A distance of 7 Å (the peak from the previous analysis) is

approximately the average distance between points in a

(10-Å)³ cube. VPR was divided into 46 contiguous (10-Å)³

clusters (plus 1 cluster for stray residues from poorly

occupied clusters), which were used to partition the pro-

tein. This partitioning of data into amino acid replacement

rate classes based on spatial clusters is termed the ‘cluster

model.’ Each cluster had on average 8.4 residues and a

minimum of 6 residues.

To determine if a cluster model improved the descrip-

tion of VPR evolution, ML analysis was used. The goal

was to determine which model fit the evolutionary data

better. For this analysis a fixed tree was used, with branch

length estimation. The amino acid replacement rate of each

spatial cluster partition was independently estimated. ML

analysis with the spatial partitions led to a significantly

better model fit than the single amino acid replacement rate

model (likelihood ratio test [LRT], P < 0.001). Table 2

shows model comparisons as determined by AIC. These

model comparisons were not nested and could not use the

LRT. AIC model comparison is a method in which addition

of parameters is penalized (Akaike 1974). For the cluster

model, amino acid replacement rate parameters had to be

estimated for 46 additional partitions. But despite the

penalization incurred for estimation of these parameters,

the model was supported. The gamma rate distribution was

also tested. As expected the gamma distribution model of

rates was supported as being superior to the single rate

model. A double model with the cluster model plus the

gamma model was somewhat better than the gamma model

alone or the cluster model alone. The observation that the

double model was more likely than either single model

suggested that the gamma and spatial models might capture

distinct features of protein evolution for this data set.
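
As a worked check on how the parameter penalty enters, the first row of Table 2 (one rate versus 47 spatial partitions) can be unpacked with the AIC definition given in Methods. This assumes the tabulated value is the AIC difference between model 2 and model 1, with $L$ denoting log likelihood:

```latex
\begin{aligned}
\mathrm{AIC} &= 2\,[L(\theta_2) - L(\theta_1)] - 2\,(p_2 - p_1) \\
482.2 &= 2\,[L(\theta_2) - L(\theta_1)] - 2\,(46 - 0) \\
2\,[L(\theta_2) - L(\theta_1)] &= 482.2 + 92 = 574.2
\end{aligned}
```

On this reading, the 46 additional rate parameters cost 92 AIC units, and the gain in fit is large enough that the cluster model remains favored after the penalty.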

A number of models for spatial effects on evolution

consider differences in amino acid replacement rate

between inaccessible core residues of a protein and

accessible surface residues. When the VPR proteins were

partitioned based on accessibility, the AIC improved

(Table 2). The magnitude of the improvement, however,

was unexpectedly low. This result probably reflected the

fact that the VPR is a membrane protein. Membrane pro-

teins can exhibit lower accessibility effects than soluble

globular proteins (Goldman et al. 1998; Choi et al. 2007).

Accessibility and spatial clustering are very different ways

of partitioning structure, but each captured significant

correlations with the amino acid replacement rate. The

relative importance of accessibility and spatial effects

could not be generalized by these experiments on a single

protein.

Not all rate partitionings of the protein are equal. When

partitions were generated randomly, rather than being

based on spatial proximity, the AIC test showed that the

random partitionings were inferior to spatial partitioning

(Table 2). Thus success with the partitioning method

appeared to require partitions of clustered residues. It is

reasonable to imagine that these clusters often contain

groups of residues that comprise spatially discrete natural

features. The poor AIC tests of random partitions also

support the validity of the homology model used here as a

structural reference. A poor structural model would gen-

erate partitions no different than random partitions.

Variance of Cluster Amino Acid Replacement Rates

Further tests were performed to examine the nature of the

rate clusters. It was of interest to determine whether the

distribution of amino acid replacement rates of clusters in

proteins was overdispersed. Overdispersed rates would be

more scattered than a Poisson distribution which describes

random clusters. Overdispersion is a predicted feature of

autocorrelation and would be an important confirmation

that the spatial autocorrelation was present. Also, if clusters

with outlying rates existed, these clusters would be can-

didates for regions in which selection of some sort is

Table 2 ML comparison of cluster and gamma models with VPR evolutionary data

Model 1 (no. parameters)      Model 2 (no. parameters)              AIC(a)
One rate (0)                  47 spatial partitions (46)            482.2
One rate (0)                  Gamma rate distribution (1)           619.6
One rate (0)                  Core/surface partition (1)            158.6
Gamma model (1)               47 spatial partitions (46)            -197.4
47 spatial partitions (46)    47 spatial partitions + gamma (47)    264.6
Gamma model (1)               47 spatial partitions + gamma (47)    73.2
47 spatial partitions (46)    47 random partitions (46)             -317.8
’’                            ’’                                    -380.0
’’                            ’’                                    -353.6

(a) Akaike Information Criterion. A positive value indicates that model 2 fits better than model 1

occurring. The VPR protein was subjected to cluster

analysis to determine dispersion (Fig. 2). The results

indicated that rates of spatial partitions are overdispersed in

VPR. Compared to a Poisson distribution, there were sig-

nificantly more low rate and high rate clusters in the VPR

amino acid replacement rate distribution. This analysis

used clusters of about eight residues, and larger or smaller

clusters could give different results. Since the chosen

cluster size (10 Å)³ was about one-fourth to one-third of

the protein diameter, larger clusters might become larger

than functional regions of the protein. From a biological

perspective, autocorrelation might be highest in clusters

with specialized function (negative selection, low rate) or

lack of function (lack of negative selection, fast rate). It is

possible that some clusters were more autocorrelated than

others and that the variance analysis included a mixture of

strongly and weakly autocorrelated clusters. Nonetheless,

the analysis was significant for overdispersion, supporting

autocorrelation.

To further study the nature of spatial differences in

evolutionary amino acid replacement rate, variance in rate

for simulated VPRs and real VPRs was compared. Simu-

lated proteins were generated, without structural input,

under a gamma rate distribution with the gamma α parameter

set to that estimated from the VPR data set. The amino

acid replacement rates of 380 overlapping spatially defined

(10-Å)³ cubic clusters were tested. The distribution of rate

estimates for the natural and simulated VPRs is shown in

Fig. 3. The amino acid replacement rates of clusters in

simulated proteins generated under a gamma rate distribu-

tion (which is overdispersed) exhibited little variance, while

rates for real VPRs were significantly more scattered. This

result shows that spatial autocorrelation is a selected feature

of natural proteins not present in typical simulated proteins.

The gamma distribution assures overall site rate overdispersion

but does not cluster the rates spatially. Simulated

proteins, in their simplest form, cannot be used to test for

evolution based on spatial features of proteins.

Simulated Evolution with Spatial Autocorrelation

A variety of other evolutionary tests was used to examine

the cluster model. One test of a model is its ability to

reconstruct phylogenies that were generated using the

model to simulate protein trees. As described above, simple

simulated proteins do not capture the features of folded

proteins. To simulate proteins with structure, simulated

proteins were generated with partitions equivalent to

clusters in a protein structure. Simulations were carried out

using the cluster model to generate mock 3D-based trees.

Trees of four proteins were generated, with long external

branches and a short internal branch. These conditions have

been shown to add complexity to phylogenetic inference

(Felsenstein 1978). The gamma distribution was used to

simulate the amino acid replacement rates of the autocorrelated partitions and the α value was varied (Table 3) (see

Methods). Alternative evolutionary models (single rate,

gamma, cluster) were used with the PAML3.14 application

to attempt to predict the generating topology. The cluster

model was significantly better than the gamma model for

one of the tested set of conditions (Table 3, row 3).

However, overall the cluster model and gamma model had

similar predictive performances on the cluster simulated

trees. These results suggest that for some trees the cluster model would produce more accurate results. However, no single model was best under all conditions. Under some conditions the misspecified single rate model performed better than the cluster model that generated the data. This occurred with a tree known to cause difficulty for ML inference (Yang 1996).

Fig. 2 Amino acid replacement rate variation in clusters. Rates in clusters set as partitions were compared to a model Poisson distribution to determine if cluster rates were overdispersed. Dashed lines, Poisson model expected for normally dispersed observations; solid line, observed distribution of rate. Poisson model was rejected by jackknife test for dispersion (P < 0.001). Analysis performed on VPR data set. [Plot not reproduced; x-axis, Rate; y-axis, Frequency]

Fig. 3 Differential distribution of amino acid replacement rates in VPRs and simulated proteins. The estimated amino acid replacement rates of 380 overlapping spatial clusters were analyzed for distribution of rates. As a comparison, a simulated protein, for which spatial information was irrelevant, was analyzed in the same way. Dashed line, distribution of rates for VPR proteins; solid line, distribution of rate in the simulated protein. The differences in variances were significant (P < 0.01) using a Fligner nonparametric test. [Plot not reproduced; x-axis, Ln Rate; y-axis, Number of Clusters]

Bayesian Phylogenetic Inference with Cluster Partitions

Spatial autocorrelation was applied to a VPR phylogenetic

analysis to see if the cluster method could create an

accurate phylogeny. A model based on the cluster method

was used for Bayesian phylogenetic inference. The VPR

matrix was divided into 47 spatial partitions. Based on ML

analysis the phylogenetic tree contained short internal

branches between the major groupings of VPRs (VPR1a,

VPR1b, VPR2, OTR) as the trees in the simulation did (not

shown). The analysis was also performed with the gamma

rate distribution model. As shown in Fig. 4 (cluster) and

Fig. 5 (gamma model), the cluster method was able to

produce a consistent tree with resolution similar to or better

than that with the gamma model. With the cluster model

the placement of teleost and amphibian receptors on the

oxytocin receptor lineage was also more in accord with

accepted evolutionary relationships for lower vertebrates

and mammals. It is notable that the cluster method is

compatible with existing phylogenetic applications once

spatial partitions have been assigned. The model should be

applicable to many proteins that have spatially defined

features expected to evolve at a rate different from the bulk of the protein. A special feature of the cluster method, as presented, is that the location of spatial features need not be defined in advance.

Table 3 Topology inference on trees simulating spatial autocorrelation

                                 Partition rate     Inference model: % correct topology
Tree                             distribution, α    Cluster    Gamma    Single rate
1. ((A:3,B:3),.1(C:3,D:3))       1ᵃ                 47.6*      32.9     40.5
2. ((A:3,B:3),.1(C:3,D:3))       2                  87.3       87.3     86.4
3. ((A:3,B:.1),.1(C:3,D:.1))     1                  78.1       75.2     5.2
4. ((A:3,B:.1),.1(C:3,D:.1))     2                  67.3       65.5     65.5
5. ((A:3,B:3),.1(C:.1,D:.1))     1                  49.1       49.1     99.1
6. ((A:3,B:3),.1(C:.1,D:.1))     2                  66.4       63.6     80.0

* P < 0.005, comparing cluster model to gamma rate distribution
ᵃ Discrete gamma α value for distribution of partition rates during simulation

Fig. 4 VPR phylogeny generated with a cluster partition model. Bayesian phylogenetic inference with VPR-related receptors. Forty-seven spatial partitions were used, with amino acid replacement rates allowed to vary between partitions. The reconstruction shown is a 50% majority-rule consensus tree. Numbers at nodes represent the posterior probability that the node is supported. [Tree not reproduced. Taxa: OPR OCT, ITR CATOS, MTR BUFMA, OXYR RAT, OXYR BOVIN, OXYR HUMAN, VTR1 CHICK, V2R RAT, V2R BOVIN, V2R HUMAN, VTRpit CHICK, V1BR HUMAN, V1BR RAT, VTR1b TAKI, VTR1a TAKI, VTR RANA, VPR1 COW, hVPR1a, V1AR RAT]

As described above, ML analysis supported the cluster

model. That analysis involved only estimation of branch

lengths and not tree topology. With the Bayesian phylo-

genetic analysis there was an opportunity to test the cluster

model in the broader context of tree inference. Bayes factor

analysis of Bayesian phylogenetic analysis exhibited a

pattern of model support similar to the ML analysis

(Table 4). In particular, both the cluster model and the

gamma rate distribution model were better than a single

rate model by Bayes factor analysis. The addition of the

cluster model to the gamma model (gamma rate distribu-

tion within each cluster partition) significantly improved

the likelihood, again supporting the concept that these two

approaches to among-sites amino acid replacement rate

variation capture distinct features of the VPR data set. The

cluster model with 47 partitions was superior in compari-

sons to a model with 47 random partitions unrelated to

spatial clustering. The posterior probabilities of the spatial

autocorrelation and the double gamma autocorrelation

models converged after <1 million cycles and 2 million

cycles, respectively, of the MCMC process. The cluster

models, despite their size and a certain level of ineffi-

ciency, did not require special treatment other than

allowing sufficient time for MCMC equilibration.
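
The Ln Bayes factors in Table 4 are consistent with simple differences between the tabulated Ln likelihood values. The short check below recomputes them, treating the tabulated values as log marginal likelihood estimates (an assumption here). The dictionary keys and the helper name are illustrative only.

```python
# Ln likelihood values as reported in Table 4.
LN_LIKELIHOOD = {
    "single_rate": -11041,
    "cluster": -10711,
    "gamma": -10624,
    "gamma_plus_cluster": -10518,
}


def ln_bayes_factor(model, reference):
    """Ln Bayes factor of `model` over `reference` (difference of Ln values)."""
    return LN_LIKELIHOOD[model] - LN_LIKELIHOOD[reference]


cluster_vs_single = ln_bayes_factor("cluster", "single_rate")        # 330
gamma_vs_single = ln_bayes_factor("gamma", "single_rate")            # 417
combined_vs_gamma = ln_bayes_factor("gamma_plus_cluster", "gamma")   # 106
```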

Finding Spatial Regions of Anomalous Amino Acid

Replacement Rates

One goal for analysis of autocorrelated clusters was to

identify candidates for selection. The justification for the

cluster model was the concept that amino acid replacement

rate variation for protein sites might have a biological

underpinning. Functional features of a protein may evolve

at rates different from the protein as a whole. If biological

function supports spatial autocorrelation, then clusters

identified by ML as having anomalous amino acid

replacement rates might be associated with selection.
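
A minimal sketch of how clusters could be ranked by their effect on likelihood, assuming a toy Poisson model of per-site replacement counts in place of the phylogenetic likelihood computed by ML. Each candidate cluster is given its own rate against the bulk of the sites, and the gain in maximized Ln likelihood over a single shared rate is recorded. All counts, cluster names, and function names are hypothetical.

```python
import math


def max_poisson_loglik(counts):
    """Poisson Ln likelihood at the ML rate, omitting the ln(k!) terms."""
    total, n = sum(counts), len(counts)
    if total == 0:
        return 0.0
    return total * math.log(total / n) - total


def partition_gain(counts, cluster_sites):
    """Ln likelihood gain from giving one cluster its own rate."""
    members = set(cluster_sites)
    inside = [counts[i] for i in cluster_sites]
    outside = [c for i, c in enumerate(counts) if i not in members]
    split = max_poisson_loglik(inside) + max_poisson_loglik(outside)
    return split - max_poisson_loglik(counts)


# Hypothetical per-site replacement counts and clusters (site indices).
site_counts = [0, 0, 1, 0, 0, 0, 1, 0, 4, 5, 3, 4, 6, 4, 12, 3, 4, 5, 3, 4]
clusters = {
    "large_uniform_slow": [0, 1, 2, 3, 4, 5, 6, 7],
    "small_extreme_fast": [14],
    "typical": [8, 9, 10, 11],
}
gains = {name: partition_gain(site_counts, sites)
         for name, sites in clusters.items()}
best_cluster = max(gains, key=gains.get)
```

In this toy data the eight uniformly slow sites give a larger gain than the single fastest site, so `best_cluster` is `large_uniform_slow`.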

Regions of proteins evolving at a slower than normal or

faster than normal rate were identified. The most important

regions were identified by the increase in likelihood that

occurred when the region was set as a partition against the

bulk of the protein. These partitions were not necessarily

the same as those with the fastest or slowest amino acid

replacement rate. Instead they seemed to be the partitions

with a large number of sites of uniformly high or low rates.

From the list of the effect of each individual cluster par-

tition on ML, the partition with the largest effect was

chosen (Fig. 6). This low amino acid replacement rate

cluster mapped to the VPR core including portions of

transmembrane (TM) domain 1, TM2, and TM7. This

cluster had, by far, the largest effect on likelihood when set

as a partition during analysis. A high amino acid replace-

ment rate cluster with a large effect on ML (not shown)

contained a portion of the intracellular C-terminal domain

of VPR including cytoplasmically facing residues of helix

8. Both the low rate and the high rate clusters were rea-

sonable candidates for regions whose amino acid

replacement rates would be correlated for functional rea-

sons. The low rate cluster included a portion of an H-

bonded network thought to stabilize GPCRs in the off state prior to activation (Okada et al. 2002). Mutation of residues in this region constitutively activates some GPCRs (Robinson et al. 1992). The C-terminal high rate region, by contrast, is potentially one of the areas of the receptor involved in coupling to G protein. Since the receptors included in this phylogeny include types coupling to several different G-protein types, it is reasonable that rates in this region would be highly variable (Strader et al. 1994). Presumably there has been selection for change in this region as the G-protein type involved in coupling changed. Specific residues in this region have been shown to be essential for individual GPCR family members to function and couple to their cognate G-protein type.

Fig. 5 VPR phylogeny with gamma amino acid replacement rate distribution. A control Bayesian phylogenetic analysis with a gamma rate distribution model. A 50% majority-rule consensus tree is shown. Numbers at nodes represent posterior probability that the node is supported. The gamma α value estimate was 0.891 in this analysis. [Tree not reproduced. Taxa: OPR OCT, VTR RANA, VPR1 COW, hVPR1a, V1AR RAT, VTR1b TAKI, VTR1a TAKI, VTRpit CHICK, V1BR HUMAN, V1BR RAT, VTR1 CHICK, V2R RAT, V2R BOVIN, V2R HUMAN, MTR BUFMA, ITR CATOS, OXYR RAT, OXYR BOVIN, OXYR HUMAN]

Table 4 Assessing model quality by Bayes factors derived from Bayesian phylogenetic reconstructions of a VPR phylogeny

Model              Likelihood (Ln)    Bayes factorᵃ (Ln)
Comparison to single rate model
Single rate        -11,041            [0]
Cluster            -10,711            330
Gamma              -10,624            417
Comparison to gamma model
Gamma + cluster    -10,518            106

ᵃ Model compared to model (Gelman et al. 2004)

Earlier it was shown that the amino acid replacement

rate distribution of spatial clusters was significantly over-

dispersed. One interpretation of that analysis was that rates

in VPRs were randomly distributed according to some

distribution with overdispersed properties. A more bio-

logical explanation might be that VPRs contain functional

regions and that the differing rates associated with selec-

tion on those regions contribute to overdispersion and

spatial autocorrelation. This second perspective suggests

that the extreme rate clusters are not outliers of an over-

dispersed statistical distribution. Instead it appears that they

represent, as above, biologically significant regions.

Lineage Specific Amino Acid Replacement Rate

Clusters

To this point the methods described have focused on

cluster rates across the entire VPR phylogeny. Defining

clusters of residues whose selection differs by lineage is an

interesting problem. By specifically seeking clusters that

differed in amino acid replacement rate by lineage, the

focus was on structural regions whose function had chan-

ged over the tree. For this analysis the paralogue OTR and

VP1aR lineages were separated and ML analysis was

performed separately on each grouping. Figure 7 shows a

cluster with a high estimated rate difference between the

OTR and the VP1aR lineages. This cluster includes part of

the extracellular loop region which has been implicated in

ligand binding (Shi and Javitch 2002). The OTR amino

acid replacement rate was lower than the VP1aR rate for

this cluster (the lineage cluster). The most likely explana-

tion for the lineage cluster is that evolutionary constraints

are relaxed for VP1aR receptors. A change in evolutionary

constraints for other VPR receptors (due to a change in

ligand) has been associated with changes in the ligand

binding site (Cho et al. 2007). Oxytocin and vasopressin,

the principal ligands of OTR and VP1aR, respectively, are

similar, but not identical, ligands. Each receptor binds the ligand of the other to some extent, and this might have

physiological significance. So lineage-specific changes in

the ligand-binding region are of interest. Though this

technique cannot detect selection per se, it can highlight

regions that are candidates for various types of selection.
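
One simple way to rank clusters by lineage difference, once separate rate estimates are available for each lineage, is by the absolute log ratio of the two rates. The sketch below assumes such per-cluster estimates already exist. The cluster names and rate values are hypothetical and are not taken from the analysis.

```python
import math

# Hypothetical ML rate estimates per cluster for the two lineages
# (illustrative values only; not taken from the paper).
cluster_rates = {
    "cluster_A": {"OTR": 0.95, "VP1aR": 1.05},
    "cluster_B": {"OTR": 0.40, "VP1aR": 1.60},
    "cluster_C": {"OTR": 1.30, "VP1aR": 0.90},
}


def ln_rate_ratio(rates):
    """Ln of the VP1aR rate over the OTR rate; > 0 means OTR is slower."""
    return math.log(rates["VP1aR"] / rates["OTR"])


# Rank clusters from largest to smallest lineage difference.
ranked = sorted(
    cluster_rates,
    key=lambda name: abs(ln_rate_ratio(cluster_rates[name])),
    reverse=True,
)
lineage_cluster = ranked[0]
```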

Conclusion

A significant amount of protein amino acid replacement

rate variation is correlated with the location of the site in

the folded protein structure. For sites in VPRs, the rate was

spatially autocorrelated or clustered. This clustering of

amino acid replacement rates was supported by several

independent approaches including ML and Bayesian anal-

yses. The cluster model is conceptually simple and can be

applied to a number of evolutionary analyses. This method

allowed useful partitioning of sites into spatial clusters. The

current model focused on clusters of residues in the protein

which might be described as functional units. These clus-

ters of amino acids may have shared rates because of

shared selection and shared function. The cluster method

captured the fact that amino acid replacement rates are clustered without requiring that the functional regions of the protein be identified. When functional regions were known, that information was incorporated. The ability to identify the rate of clustered sites allowed identification of regions with differing rates of evolution. As an example, we were able to identify a region of the VP1aR ligand binding site with a rate different from that of the corresponding region of OTR, suggesting differences in selection for peptide ligand interaction on the two paralogue lineages. Thus this autocorrelation/cluster model helped provide insight into the evolution of VPR functions. It is likely that autocorrelation will apply to other proteins, especially proteins with regions that carry out some specific function.

Fig. 6 Outlier amino acid replacement rate cluster of the VPR receptors. Three hundred eighty overlapping spatial clusters were tested for those that improved the likelihood most when set as a separate rate partition. A structural model of human V1aR is shown. The low-rate cluster is indicated in gray. Residues in the cluster are presented as spacefilling spheres to emphasize that the amino acid residues in the cluster are spatially contiguous. The rest of the protein is depicted by a white ribbon schematic display with the ligand binding domain toward the top of the figure and the G-protein binding domain at the bottom

Acknowledgments I thank J. L. Thorne and an anonymous

reviewer for helpful suggestions. Carole Griffiths provided stimulat-

ing discussion and advice. The LIU Biocomputing facility provided

resources.

References

Akaike H (1974) A new look at the statistical model identification.

IEEE Trans Automat Contr AC-19:716–723

Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H,

Shindyalov IN, Bourne PE (2000) The protein data bank.

Nucleic Acids Res 28:235–242

Chakrabarti S, Lanczycki CJ (2007) Analysis and prediction of

functionally important sites in proteins. Protein Sci 16:4–13

Chakrabarti S, Sowdhamini R (2004) Regions of minimal structural

variation among members of protein domain superfamilies:

application to remote homology detection and modelling using

distant relationships. FEBS Lett 569:31–36

Cho HJ, Acharjee S, Moon MJ, Oh DY, Vaudry H, Kwon HB, Seong

JY (2007) Molecular evolution of neuropeptide receptors with

regard to maintaining high affinity to their authentic ligands. Gen

Comp Endocrinol 153:98–107

Choi SS, Vallender EJ, Lahn BT (2006) Systematically assessing the

influence of three-dimensional structural context on the molec-

ular evolution of mammalian proteomes. Mol Biol Evol

23:2131–2133

Choi SC, Hobolth A, Robinson DM, Kishino H, Thorne JL (2007)

Quantifying the impact of protein tertiary structure on molecular

evolution. Mol Biol Evol 24:1769–1782

Dean AM, Neuhauser C, Grenier E, Golding GB (2002) The pattern

of amino acid replacements in alpha/beta-barrels. Mol Biol Evol

19:1846–1864

Elango N, Kim SH, Vigoda E, Yi SV (2008) Mutations of different

molecular origins exhibit contrasting patterns of regional

substitution rate variation. PLoS Comput Biol 4:e1000015

Felsenstein J (1978) Cases in which parsimony or compatibility

methods will be positively misleading. Syst Zool 27:401–410

Felsenstein J (2001) Taking variation of evolutionary rates between

sites into account in inferring phylogenies. J Mol Evol 53:447–455

Fiser A, Sali A (2003) Modeller: generation and refinement of

homology-based protein structure models. Methods Enzymol

374:461–491

Gelman A, Carlin JB, Stern HS, Rubin DB (2004) Model checking

and improvement. In: Gelman A, Carlin JB, Stern HS, Rubin DB

(eds) Bayesian data analysis. Chapman and Hall, New York,

pp 157–192

Goldman N, Thorne JL, Jones DT (1998) Assessing the impact of

secondary structure and solvent accessibility on protein evolu-

tion. Genetics 149:445–458

Huelsenbeck JP, Ronquist F (2001) MRBAYES: Bayesian inference

of phylogenetic trees. Bioinformatics 17:754–755

Huelsenbeck JP, Suchard MA (2007) A nonparametric method for

accommodating and testing across-site rate variation. Syst Biol

56:975–987

Marsh L (2006) Evolution of structural shape in bacterial globin-

related proteins. J Mol Evol 62:575–587

Marsh L, Griffiths C (2005) Protein structural influences in rhodopsin

evolution. Mol Biol Evol 22:894–904

Mayrose I, Friedman N, Pupko T (2005) A gamma mixture model

better accounts for among site rate heterogeneity. Bioinformatics

21(Suppl 2):ii151–ii158

Mayrose I, Doron-Faigenboim A, Bacharach E, Pupko T (2007)

Towards realistic codon models: among site variability and

dependency of synonymous and non-synonymous rates. Bioin-

formatics 23:i319–i327

Moran PA (1950) Notes on continuous stochastic phenomena.

Biometrika 37:17–23

Newton MA, Raftery AE (1994) Approximate Bayesian inference by

the weighted likelihood bootstrap (with discussion). J Roy Stat

Soc Ser B 56:3–48

Ninio M, Privman E, Pupko T, Friedman N (2007) Phylogeny

reconstruction: increasing the accuracy of pairwise distance

estimation using Bayesian inference of evolutionary rates.

Bioinformatics 23:e136–e141

Okada T, Fujiyoshi Y, Silow M, Navarro J, Landau EM, Shichida Y

(2002) Functional role of internal water molecules in rhodopsin

revealed by X-ray crystallography. Proc Natl Acad Sci USA

99:5982–5987

Fig. 7 Spatial clusters of amino acids evolving at different rates in

the oxytocin receptor and vasopressin 1a receptor lineages. Overlap-

ping spatial clusters were tested for regions with a different amino

acid replacement rate for two lineages. The best candidate is shown in

gray spacefill presentation. General features are similar to those in

Fig. 6. This region is evolving more slowly in oxytocin receptors than

V1a receptors. The selected residues comprise a portion of the region

of the predicted binding site for vasopressin/oxytocin


Robinson PR, Cohen GB, Zhukovsky EA, Oprian DD (1992)

Constitutively active mutants of rhodopsin. Neuron 9:719–725

Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL (2003)

Protein evolution with dependence among codons due to tertiary

structure. Mol Biol Evol 20:1692–1704

Shi L, Javitch JA (2002) The binding site of aminergic G protein-

coupled receptors: the transmembrane segments and second

extracellular loop. Annu Rev Pharmacol Toxicol 42:437–467

Stern A, Pupko T (2006) An evolutionary space-time model with

varying among-site dependencies. Mol Biol Evol 23:392–400

Strader CD, Fong TM, Tota MR, Underwood D, Dixon RAF (1994)

Structure and function of G protein-coupled receptors. Annu Rev

Biochem 63:101–132

Susko E, Field C, Blouin C, Roger AJ (2003) Estimation of rates-

across-sites distributions in phylogenetic substitution models.

Syst Biol 52:594–603

Swofford DL (1998) PAUP*: phylogenetic analysis using parsimony

(*and other methods). Version 4. Sinauer Associates, Sunderland,

MA

Thompson JD, Higgins DG, Gibson TJ (1994) Clustal W: improving

the sensitivity of progressive multiple sequence alignment

through sequence-weighting, position-specific gap penalties,

and weight matrix choice. Nucleic Acids Res 22:4673–4680

Van Damme EJ, Nakamura-Tsuruta S, Smith DF, Ongenaert M,

Winter HC, Rouge P, Goldstein IJ, Mo H, Kominami J, Culerrier

R, Barre A, Hirabayashi J, Peumans WJ (2007) Phylogenetic and

specificity studies of two-domain GNA-related lectins: genera-

tion of multispecificity through domain duplication and

divergent evolution. Biochem J 404:51–61

Yang Z (1994) Maximum likelihood phylogenetic estimation from

DNA sequences with variable rates over sites: approximate

methods. J Mol Evol 39:306–314

Yang Z (1996) Phylogenetic analysis using parsimony and likelihood

methods. J Mol Evol 42:294–307

Yang Z (1997) PAML: a program package for phylogenetic analysis

by maximum likelihood. Comput Appl BioSci 13:555–556
