evolution studied using protein · own unique 3d structure (also called fold) and corresponds to a...

14
23 EVOLUTION STUDIED USING PROTEIN STRUCTURE Song Yang, Ruben Valas, and Philip E. Bourne One of the principle goals of evolutionary biology is to generate phylogeny that best represents the evolutionary histories of all organisms on earth. Aside from directly investigating the fossil records of ancestor species, all phylogenetic methods depend on the comparison of specific features (homologous characteristics) of contemporary organ- isms to determine the evolutionary relationships between different organisms. Among the features are morphological, physiological, genetic, and genomic which changed as the organisms evolved. The study of evolution changed dramatically with the discovery of DNA and the evolutionary fingerprint it represents. Evolutionary relationships between organisms can be studied by comparing their DNA sequences (Zuckerkandl and Pauling, 1965). Gene mutation is the primary cause of evolution, so utilizing the universal carrier of genetic information as the characteristic by which phylogenetic comparison is made makes sense. This approach has significant advantages over the classical approach in which morphologi- cal and physiological characteristics are used. This is exemplified by the discovery of a third branch of life, the archaea, which have no substantial morphological or physiological differences to other prokaryotes. Archaea were discovered to be a separate domain of life by analyzing small subunit ribosomal RNAs (SSU rRNA) (Woese and Fox, 1977). While studying phylogeny using DNA sequence data has proven very successful, it has its limitations. Since individual genes have different evolutionary rates in different lineages, phylogenies built from individual genes do not always agree (Doolittle, 1995a; Doolittle, 1995b). As a consequence, although efforts have been made to generate a universal tree of life (Woese, 1998a; Woese, 1998b; Forterre and Philippe, 1999), many parts of the tree are still unresolved and highly debated, especially at the root of the tree where the three superkingdoms—archaea, bacteria, and eukaryotes—diverged (Mayr, 1998; Woese, 1998a; Woese, 1998b). Advances in large-scale sequencing technology enable the acquisition of the complete genome of an organism. There are currently hundreds of complete genomes available from Structural Bioinformatics, Second Edition Edited by Jenny Gu and Philip E. Bourne Copyright Ó 2009 John Wiley & Sons, Inc. 561

Upload: others

Post on 02-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

23

EVOLUTION STUDIED USING PROTEINSTRUCTURE

Song Yang, Ruben Valas, and Philip E. Bourne

One of the principle goals of evolutionary biology is to generate phylogeny that best

represents the evolutionary histories of all organisms on earth. Aside from directly

investigating the fossil records of ancestor species, all phylogenetic methods depend on

the comparison of specific features (homologous characteristics) of contemporary organ-

isms to determine the evolutionary relationships between different organisms. Among the

features are morphological, physiological, genetic, and genomic which changed as the

organisms evolved. The study of evolution changed dramaticallywith the discovery ofDNA

and the evolutionaryfingerprint it represents. Evolutionary relationships betweenorganisms

can be studied by comparing their DNA sequences (Zuckerkandl and Pauling, 1965). Gene

mutation is the primary cause of evolution, so utilizing the universal carrier of genetic

information as the characteristic by which phylogenetic comparison is made makes sense.

This approach has significant advantages over the classical approach in which morphologi-

cal and physiological characteristics are used. This is exemplified by the discovery of a third

branch of life, the archaea, which have no substantial morphological or physiological

differences to other prokaryotes. Archaeawere discovered to be a separate domain of life by

analyzing small subunit ribosomal RNAs (SSU rRNA) (Woese and Fox, 1977).

While studying phylogeny using DNA sequence data has proven very successful, it has

its limitations. Since individual genes have different evolutionary rates in different lineages,

phylogenies built from individual genes do not always agree (Doolittle, 1995a;

Doolittle, 1995b). As a consequence, although efforts have been made to generate a

universal tree of life (Woese, 1998a; Woese, 1998b; Forterre and Philippe, 1999), many

parts of the tree are still unresolvedandhighly debated, especially at the root of the treewhere

the three superkingdoms—archaea, bacteria, and eukaryotes—diverged (Mayr, 1998;

Woese, 1998a; Woese, 1998b).

Advances in large-scale sequencing technology enable the acquisition of the complete

genome of an organism. There are currently hundreds of complete genomes available from

Structural Bioinformatics, Second Edition Edited by Jenny Gu and Philip E. BourneCopyright � 2009 John Wiley & Sons, Inc.

561

Page 2: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

species across the tree of life. The growing numbers of genomes being sequenced have led to

a new field of research, phylogenomics, where not only one or a few gene sequences, but the

whole genomes of different organisms are compared and used as metrics for phylogenetic

inference (Delsuc et al., 2005; Snel et al., 2005). The complete genomes do not only contain

the primary sequences of genes; functional sites in the noncoding regions and the overall

genome structure are also under evolutionary pressure and can potentially be used as

comparative features. Thesewhole genome features include gene content—the numbers and

types of genes found in a genome (Snel et al., 1999; Tekaia et al., 1999; Lin and

Gerstein, 2000; House and Fitz-Gibbon, 2002; Wolf et al., 2002), and gene order—the

relative position of genes on the chromosomes (Dandekar et al., 1998; Wolf et al., 2001).

These approaches are able to recover the three superkingdoms of life and verify the main

groupings of the SSU rRNA tree, regardless of the potential prevailing horizontal gene

transfer (HGT) events. However, incongruence still exists, and the search for new homolo-

gous features and tree construction algorithms continues (Philippe et al., 2005).

STRUCTURES AS EVOLUTIONARY UNITS

The three-dimensional structures of proteins have different evolutionary rates than the

sequences fromwhich they are derived.As described in detail in previous chapters, structures

are more conserved than sequences. Changes in sequence can be tolerated provided they do

not perturb the physical and chemical properties that define secondary structure and the

organization of those secondary structures into tertiary units. It is not surprising then that

there are examples of proteins with no apparent sequence similarity and sometimes with no

apparent functional relationship that can have almost identical 3D structures (Pastore and

Lesk, 1990; Flaherty et al., 1991; Schnuchel et al., 1993). This feature of protein structure

makes it a better evolutionary marker than sequence to recognize more distant ancestral

relationships, provided divergent evolution can be separated from convergent evolution.

Convergent evolution implies there is no ancestral relationship, just convergence on a stable

structural arrangement. Given that structure infers distant evolutionary relationships, how do

we recover those relationships? The answer lies in protein domains.

As discussed in Chapter 2, the basic elements of protein structure are protein

domains, which are compact and spatially distinct parts of a protein that can fold

independently of neighboring sequences (Branden and Tooze, 1999). Each domain has its

own unique 3D structure (also called fold) and corresponds to a series of amino acid

sequences that can fold into the domain structure. Protein domains are the building

blocks of proteins; combinations of different numbers and types of domains form

structurally complex proteins with novel functions. Sharing of common domains by

different proteins may infer common ancestry.

It is widely accepted that the number of protein folds is limited; estimates of absolute

number range from 1,000 to 10,000 (Zhang, 1997; Govindarajan et al., 1999; Wolf

et al., 2000), a remarkably small number given the almost infinite possible number of

sequences. Intuitively then, the emergence of a new fold might constitute a significant

evolutionary event. It is timely that structure-basedevolutionary studies are being enabledby

the structural genomics initiative (Chapter 40)which aims to solve3D structures covering all

unique folds and thus provide a complete viewof protein structure space (Burley et al., 1999;

Chandonia and Brenner, 2006). Although some scientists would argue that this goal will

remain elusive (Xie and Bourne, 2005; Marsden et al., 2007).

562 EVOLUTION STUDIED USING PROTEIN STRUCTURE

Page 3: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

Protein domains are not only structural units, but also evolutionary units. Combinations

of different domains and domain duplication are the major evolutionary processes in the

acquisition of novel functions (Doolittle, 1995a; Doolittle, 1995b). During evolution, novel

domains can evolve bymeans of randommutation. Domains can also be lost or horizontally

transferred. Therefore, protein structural domains can be used as structural features to study

the evolutionary history of organisms.

Previous phylogenetic methods were based on the primary sequence of the DNA or

protein, not the 3D structural features of proteins. Part of the reason is that 3D structural

information is so limited; there are less than 50,000 structures in the PDB (Berman

et al., 2000) as of 2007 and those are highly redundant with respect to sequence and

structure. Unlike sequence and genomic data that can be generated by high-throughput

techniques, structure determination, structure genomics notwithstanding, is still relatively

slow. However, the accumulation of complete genome sequences and advances in gene

finding and homologymodeling algorithmsprovide an alternative approach to determine the

structure content of an organismon a genome-wide scale. As discussed in previous chapters,

using domains with existing 3D structures as templates, all the homologous domains in

complete genomes are found by sequence comparison methods. Although whole genome

domain recognition relies on sequence similarity, the resulting protein domain content,

nonetheless, inherits 3D structure information whose evolution is conserved beyond

sequence. Current techniques can reliably assign protein domains that cover over 50% of

the genome for a given organism, making it possible to study evolution through structure

(Buchan et al., 2002; Gough and Chothia, 2002).

PHYLOGENY BY PROTEIN DOMAIN CONTENT

Pioneering work on the study of evolution utilizing genomic structural information

originated with Gerstein (1997), when only one species from each of the three super-

kingdoms had been sequenced. Using a fold recognitionmethod, it was possible to annotate

only 10–20% of the genome, yet this attempt successfully showed that the approach of

studying evolution using structure held much promise. Work in this area continued as more

and more 3D structures became available and sequence comparison algorithms became

more sophisticated (Wolf et al., 1999; Caetano-Anolles and Caetano-Anolles, 2003).

Simultaneously, since the completion of the human genome project in 2001, there has

been an increase in the number of the complete genomes fromawide spectrumof organisms.

As of July 2004, 212 complete genomes (20 archaea, 154 bacteria, and 38 eukaryotes) were

available. At the same time, the number of 3D structures of proteins increased, resulting in

approximately 800 unique folds and 1300-fold superfamilies according to SCOP (Murzin

et al., 1995). Structural annotation of these complete genomes was performed by automatic

homology search algorithms (such as hidden Markov models), and put in public databases,

such as superfamily (Apic et al., 2001) and Gene3D (Buchan et al., 2002). These databases

contain the number, type and position of protein domains along the chromosomes for every

completed genome. Protein domain content data could thus be used to study phylogeny.

Similar to gene-content methods, the method based on protein domain content recon-

struct phylogenetic trees from the distance between organisms, where the distance is

calculated from the proportion of shared protein domains between genomes (Yang

et al., 2005). A neighbor-joining (NJ) phylogenetic tree for a total of 174 taxa (19 archaea,

119 bacteria, and 36 eukaryotes) readily groups all organisms into the three superkingdoms

PHYLOGENY BY PROTEIN DOMAIN CONTENT 563

Page 4: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

with high bootstrap values, in accordance with the phylogeny based on the comparison of

SSU rRNA (Figure 23.1). The major phyla within each superkingdom were also recovered,

although better phylogenies were generated when the taxa were restricted to a single

superkingdom.

The NJ tree of 36 eukaryotes based on the presence and absence of SCOP fold

superfamilies contained the major clades, animals, plant, and fungi. The groupings within

each clade were mostly correct. The 19 archaea genomes were readily divided into the 4

Crenarchaeota and 15 Euryarchaeota. The small-genomed and enigmatic Nanoarchaeum

equitans, reportedly the only known archaeal parasite, appeared near the root of the branch

leading to the Pyrococci. This method was able recover the monophyly of most of the

principle bacterial groups defined by classical taxonomy, including Actinobacteria, Cya-

nobacteria, Proteobacteria, Frimicutes, and Chlamydiae, and so on, although the relation-

ship between groups was still unresolved. One major anomaly was the grouping of parasitic

organisms that have extremely reduced genomes. This so called ‘‘big genome attraction’’

artifact (Lake and Rivera, 2004) is caused by massive gene loss in certain genomes that

induce great differences in genome size and gene content among bacteria. When an

empirical weighting factor aimed to compensate for the artifact was used, the small-genome

Figure 23.1. Overall phylogeny (neighbor-joining) of 174organisms forwhich completegenomes

have been determined. Bootstrap number was limited to the major branch points.

564 EVOLUTION STUDIED USING PROTEIN STRUCTURE

Page 5: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

taxa retreated to their correct locations along with their full-sized counterparts (Yang

et al., 2005).

In this approach, the mere presence or absence of protein domains in genomes more

accurately reconstructed most of the phylogenies than that constructed using the overall

abundance of each domain in a genome. Domain abundance is greatly affected by gene and

chromosome duplication, which is a contributor to the evolutionary distance between

genomes, yet not a uniform process. As a result, excessive duplication can lead to inflated

distances that mask the more crucial differences in the form of gain or loss of individual

domains. The protein domain content of a given genome is changed whenever (i) a new fold

evolves during a long-termdivergence, (ii) a fold is lost as a result of deletion of all or part of a

gene, or (iii) a new fold is acquired by horizontal transfer. Ordinarily, gene duplication on its

own does not give rise abruptly to new folds.

Why is protein domain content better than gene content in constructing phylogenies?

One answer is that proteins (gene products) are modular, and many of them are mosaics of

different domains. Indeed, duplicated and/or shuffled domains are fundamental to establish-

ing functional diversity. Genes may be retained even when the domain content changes and

vice versa. Certainly protein domain contentmeasures evolutionary change differently from

gene content.

There are other intrinsic advantages in using the simple presence or absence of a

structural attribute for phylogenetic purposes. For one, there is less concern about mistaken

paralogy as so often occurs when comparing protein sequences. Moreover, the rate of

sequence change and its attendant problems of site-specific variation do not play a role and

arbitrary decisions about gene designation and function are not issues. Lastly, as we have

seen above, three-dimensional structures are more highly conserved than primary se-

quences, allowing one to see further into the evolutionary past.

In summary, a simple scheme that uses only the presence or absence of protein domains

in genomeswas able to reconstruct the phylogenetic relationship of 174 organisms spanning

the tree of life, achieving comparable results to those methods using sophisticated sequence

analysis and/or combinations of gene content and gene order. This approach demonstrated

that structural information can be used to study evolution.

Variations of this approach have also been tested by other groups. Instead of SCOP, the

Pfam protein domain and family collection, based purely on sequence comparison has been

used (Bentley and Parkhill, 2004). In addition to the presence or absence of single domains,

combinations of domains or domain organizations within a protein are considered as

structural attributes to reconstruct the tree of life (Wang and Caetano-Anolles, 2006;

Fukami-Kobayashi et al., 2007). In general these methods achieve comparable results to

those methods based on protein domain content.

THE LAST UNIVERSAL COMMON ANCESTOR (LUCA)

In 1977 Carl Woese used 16s rRNA to show cellular life can be divided into three

superkingdoms: the eukaryotes, bacteria, and archaea (Woese and Fox, 1977). The archaea

were initially believed to be the oldest superkingdom. Thirty years later the relationship

between these three groups still remains open to debate. At the heart of this debate lies the

origin of the eukaryotes, which appear to be a mix of archaea and bacteria. There are

numerous theories for the origin of the eukaryotes (reviewed in (Embley andMartin, 2006)).

Most of these theories agree that the eukaryotes originated from a symbiosis of an archaea

THE LAST UNIVERSAL COMMON ANCESTOR (LUCA) 565

Page 6: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

and an a-proteobacteria (the mitochondrial ancestor). They disagree on what the archaea

was and where the archaea descend from the bacteria. The sequencing of hundreds of

genomes was supposed to generate enough sequence data to decipher the relationship

between the three superkingdoms, but the debate continues.

Work based upon domain combinations has further fuelled the debate; some trees show

the eukaryotes are more closely related to the bacteria than archaea (Wang and Caetano-

Anolles, 2006), while others show the archaea and eukaryotes are sister clades (Fukami-

Kobayashi et al., 2007). Despite this disagreement structure holds promise for ending this

debate. The mitochondrial transfer has left a stronger sequence signal than the archaeal

ancestor of the eukaryotes (Pisani et al., 2007), so sequence based methods will continue to

be biased towards results which place the eukaryotes closer to the bacteria. Structure, on the

contrary, is not biased in this way.

It has recently been argued that trees are the wrong representation for evolutionary

histories given the large number of horizontal transferswithin the prokaryotes (Doolittle and

Bapteste, 2007). The assumption of a tree structure may be the reason the relationship

between the three superkingdoms is not clear. If we use superfamilies as structure repre-

sentatives, there are superfamilies that the eukaryotes acquired frommitochondria and some

that they acquired from archaea. This moves the eukaryotes closer to both prokaryotic

superkingdomsina tree,butblurswhichbacteriaandarchaeacontributedsuperfamilies to the

eukaryotes. In a network representation, the eukaryotic rootwould have twomajor branches;

folds contributed from the archaea and others from the bacteria. Horizontal transfer of

superfamilies will have some effect on tree reconstruction, but this effect should be lessened

byusing structure insteadof sequence data. Sometimes the transfer of superfamilies does not

matter because the receiving species already had a copy of that superfamily in another gene.

This becomes less of a problem when dealing with domain combinations, as it is less likely

that a particular domain combination is already in the recipient genome.

Previous studies have attempted to predict thegene content of the last universal common

ancestor (LUCA) (Ouzounis et al., 2006), here we present findings of a different approach

using structural information. Since structure is more conserved than sequence it should be

easier to construct LUCA�s domain content than gene content. Obviously any structural

domains that are universal across cellular life were in LUCA. There are about 40 totally

universal superfamilies (Yang et al., 2005). Many genes that were present in LUCA have

been lost in many species so the universal set would greatly undercount LUCA�s domain

content. A recent study estimated that LUCA had at minimum 140 superfamilies (Ranea

et al., 2006). They considered a superfamily to be in LUCA if it was in 90% of all extant

species, and in at least 70% of the archaea and 70% of the eukaryotes.

The problem with this approach is the lack of a defined relationship between the three

superkingdoms. A domain�s presence in the archaea and eukaryotes is irrelevant to that

domain being in LUCA if both of these superkingdoms are derived from bacteria. Any

proposed root for the tree of life infers a LUCA fold set. For example, if LUCA was

chloroflexus like, then LUCA�s fold set can be accurately estimated using parsimony

methods and a tree of the chloroflexus species. Folds that were present in LUCA are not

necessarily essential. LUCAcould not live as a parasite because therewould be no host for it,

so parasitic bacteria are free to lose structures that were essential in LUCA.Without careful

assumptions about structure loss and the relationship of the three superkingdoms anyLUCA

fold set will be misleadingly small. LUCA probably had a wide repertoire of protein

superfamilies, which infers a significant amount of protein evolution occurred before the

emergence of any of the three superkingdoms.

566 EVOLUTION STUDIED USING PROTEIN STRUCTURE

Page 7: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

ANCIENT GEOCHEMICAL ENVIRONMENT REFLECTED BY THE MODERNSTRUCTURE REPERTOIRE

During evolution the genesis of novel protein domain was constrained by the geochemical

environment of that moment, such as the temperature, pH, redox environment, element

availability, and so on. It has recently been suggested that these constraints are reflected in

some structural features of proteins and observed in the current structure repertoire.

Disulfide bonds are covalent bonds formed between two cysteine residues, which, in

addition to other intramolecular interactions, can stabilize structural domains and contribute

to the variability of structural space. However, the disulfide bond itself is volatile under

reducing conditions. Since the oxygen content of the earth�s atmosphere was gradually

increasing during evolution, we can expect a correlation between the emergence of disulfide

bond-dependent domains and the evolution of the earth�s environment (Yang, 2007).

The divergence of the three superkingdomswas estimated to be at about 1.8–2.2 billion

years ago (Doolittle et al., 1996), which is approximately the timewhen the oxygen content

became significant. The emergence of disulfide bond-containing domains can be illustrated

by a Venn diagram that contains the numbers and percentages of disulfide bond-containing

domains in each superkingdom (Figure 23.2). Only 4.7% of the folds common to all

superkingdoms contain disulfide bonds,whichmay have originated before the divergence of

the three superkingdoms. By contrast, 31.9% of the domains unique to eukaryotes are

disulfide bond-containing domains. The result largely confirms that most folds containing

disulfide bonds formed after the oxygen level had increased in the atmosphere.

Figure 23.2. Numbers and percentages of disulfide bond-dependent domains in each super-

kingdom. The two numbers in the parenthesis are the number of disulfide bond-containing fold

and total foldwithin each region, respectively. SCOP (version1.63) contains 765-folds, ofwhich 708-

folds are found in at least one species in this study, and the remaining 57-folds are mainly virus-

related.

ANCIENT GEOCHEMICAL ENVIRONMENT REFLECTED 567

Page 8: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

Another example of geochemical properties that influenced the evolution of proteins

comes in the form of metal ions that are incorporated into a protein�s 3D structure (Dupont

et al., 2006). Many protein domains require the existence of metal ions, such as Fe, Zn, Mn,

and so on, in order to fold and function properly. Sometimes, the same protein fold can

accommodate different typesofmetal ions. Similar to thedisulfidebonds, the emergenceand

evolution of those metal-containing domains are expected to correlate with changes in ion

availability and redox state within the ancient ocean. This is seen in the distribution of

modern metal-containing domains in whole taxa. In a study by Dupont et al. (2006), a

correlation was observed between the proportion ofmetal-containing proteins in each of the

three superkingdoms and the tracemetal bioavailability in the ancient ocean at the time each

superkingdom emerged. Overall this indicates amajor evolutionary shift in biological metal

ion usage, both in how a specificmetal ion is used and whichmetal ion is used. For instance,

eukaryotes have significantlymore folds that incorporate zinc, which is consistent with their

evolution in amore oxygen-rich and hence zinc-rich environment. Likewise, iron hasmoved

from a predominance of iron-sulfur clusters, stable in the redox conditions of early earth, to

heme-like structures more stable under oxidizing conditions. It is interesting to note that

while the earth�s geochemistry and life on earth are to some degree symbiotic, these subjects

are rarely studied together.

THE EVOLUTIONARY HISTORY OF PROTEIN DOMAINS

As evolutionary units used in a very granular way, protein structural domains are able to

decipher phylogenetic relationships among species in the tree of life, address questions

concerning the origin of the three superkingdoms and explore the impact of the ancient

geochemical environment on evolution. At a more detailed level, the evolution of each

individual proteindomainand the formationof theprotein domain repertoire are alsoofgreat

interest. The evolutionary origins of protein domains, identification of protein domain loss,

transfer, duplication and combination with other domains to form new proteins make up the

history of protein domains. The influence of the evolution of domains on the evolution of

proteins and functions as well as the organisms as a whole, remain fundamental and

challenging topics in evolutionary biology.

The evolution of protein domains consists of two different but related aspects: the

changes in protein domains themselves, and the presence or absence of domain and domain

combinations in individual genomes. The former includes important events such as the

innovation of new domains and the gradual changes in the sequences, structures and

functions of protein domains during evolution. Although structure is more conserved than

sequence, during long term evolution, local insertions/deletions/substitutions, circular

permutations, and rearrangements can gradually change the structure and give rise to

protein folds with more structural variation, as indicated in Grishin�s work on the gradual

evolutionary path between different structures (Grishin, 2001).

Scheeff and Bourne investigated the evolutionary history of the protein kinase-like

superfamily, which contains a variety of kinases that phosphorylate different substrates and

play important roles in all three superkingdoms of life (Scheeff and Bourne, 2005). The

comparison of the superfamily through a structural alignment revealed a ‘‘universal core’’

domain consisting only of regions required for ATP binding and the phosphotransfer

reaction. Substantial structural and sequence revisions over long evolutionary timescales,

mainly to accommodate different substrates, were identified and used to construct a

568 EVOLUTION STUDIED USING PROTEIN STRUCTURE

Page 9: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

phylogenetic tree of the protein kinase-like superfamily. Surprisingly, it was found that

atypical protein kinases (those that phosphorylate nonproteins) emerged early as a group and

were not derived through divergence from individual typical protein kinase families.

Structural features such as conserved hydrogen bonding patterns found in the analyzed

kinases amidst significant rearrangements of secondary structural components, provided

insights not achievable from sequence alone.

Evolutionary events such as the duplication of a domain or domain combination in a

genome, domain loss, domain transfers between species,will change the genomic content of

domains or domain combinations, but not their identities. In the last 10 years, with the

accumulationof completegenomes and the improvementof homologydetectionalgorithms,

scientists have started to investigate the distribution of protein domains in the three

superkingdoms of life and other aspects of protein domain evolution (Doolittle, 1995a;

Doolittle, 1995b; Copley et al., 2002; Koonin et al., 2002; Ponting and Russell, 2002).

Cyrus Chothia, Sarah Teichmann and their colleagues investigated the existence of

multi-domain proteins in the three superkingdoms of life and estimated that 2/3 of

prokaryote proteins have two or more domains, whereas 4/5 of proteins in eukaryotes are

multi-domain (Teichmannet al., 1998). In 2001, the investigationofdomain combinations in

40 genomes showed a power-law distribution of the tendency of domains to form combina-

tions (Apic et al., 2001); some two-domain or three-domain combinations frequently recur

in different protein contexts, which were called ‘‘supra-domains’’ (Vogel et al., 2004). A

simulation of the processes of domain duplication and combination suggests that domain

combinations are stochastic processes followed by duplication to varying extents (Vogel

et al., 2005). During the evolution of domains, gene fusion is more common than gene

fission (Kummerfeld and Teichmann, 2005), and convergent evolution is a rare event

(Gough, 2005). A recent analysis suggested that the abundance of protein domains and

domain combinations are correlated with the complexity of the organism, as characterized

by the numbers of cell types each organism contains (Vogel and Chothia, 2006).

Christine Orengo and colleagues approach the topic of domain combination using the

CATH (Orengo et al., 1997; Orengo et al., 2002) protein classification scheme and the

Gene3D genomic domain assignment database (Chapter 22). They found a correlation

between domain abundance and genome size (Ranea et al., 2004). The abundances of some

domains within a genome are independent of genome size, others are proportional, and still

others are nonlinearly distributed. Each type of abundance is roughly correlated with

function, namely: protein translation and biosynthesis; metabolism; and gene regulation,

respectively. For further reading on domain rearrangement thework of Arne Elfsson and his

group is a good place to start: for domain recombination (Bjorklund et al., 2005; Ekman et

al., 2005); and for duplication (Bjorklund et al., 2006).

FILLING IN FOLD SPACE: CURRENT LIMITATIONS

Many of the methods discussed in this chapter rely on the ability to predict the presence of

fold superfamilies in a genome. The superfamily database only predicts superfamilies for

which there is a solved structure, so coverage is limited. For example, the human genomehas

at least one domain assignment for 63% of its proteins, but only 38% of complete protein

sequence is assigned (many proteins have multiple domains). The unassigned regions

are considered to be gaps. Methods that use domain combinations are going to be less

accurate due to the large number of gaps. The key to improving trees built from this data is to

FILLING IN FOLD SPACE: CURRENT LIMITATIONS 569

Page 10: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

decrease the number of domain combinations that have gaps. Recent work has argued that

SCOPalready covers thevastmajority of fold space (Levitt, 2007). It is possible thatmanyof

these gaps represent known superfamilies whose sequences are too far from any known

structure. However, the PDB and SCOP are both biased towards structures that can be

crystallized and aremissingmanymembrane folds. Therefore the superfamily databasewill

continue to be unable to assign large portions of the genome as it is unlikely that structures of

new superfamilies to fill these gaps are going to be solved in the near future. The Pfam

database (Finn et al., 2006) based on sequence has much better coverage, but lacks distant

homologues as seen from structure. Trees built from a mixture of superfamily and Pfam

domains may be a good solution for furthering our understanding of evolution.

CONCLUSION

Protein domains are structural building blocks that define the function of a protein through

their various combinations; they are also the units that encode the evolutionary history of

proteins as well as genomes that contain them. Illustrating the entire evolutionary history of

each protein domain, from the genesis of a new domain or domain combination to the loss

and transferof adomain fromaspecificgenome, can further ourunderstandingof someof the

fundamental unsolved problems in evolution, such as events in the early evolution of life.

REFERENCES

Apic G, Gough J, et al. (2001): Domain combinations in archaeal, eubacterial and eukaryotic

proteomes. J Mol Biol 310(2):311–325.

Bentley SD, Parkhill J (2004): Comparative genomic structure of prokaryotes. Annu Rev Genet

38:771–792.

Berman HM, Westbrook J, et al. (2000): The protein data bank. Nucleic Acids Res 28(1):235–242.

Bjorklund AK, Ekman D, et al. (2005): Domain rearrangements in protein evolution. J Mol Biol

353(4):911–923.

BjorklundAK, EkmanD, et al. (2006): Expansion of protein domain repeats. PLoSComput Biol 2(8):

e114.

Branden C-I, Tooze J (1999): Introduction to Protein Structure. New York: Garland Publishing.

Buchan DW, Shepherd AJ, et al. (2002): Gene3D: structural assignment for whole genes and genomes

using the CATH domain structure database. Genome Res 12(3):503–514.

Burley SK, Almo SC, et al. (1999): Structural genomics: beyond the human genome project. Nat

Genet 23(2):151–157.

Caetano-Anolles G, Caetano-Anolles D (2003): An evolutionarily structured universe of protein

architecture. Genome Res 13(7):1563–1571.

Chandonia JM, Brenner SE (2006): The impact of structural genomics: expectations and outcomes.

Science 311(5759):347–351.

Copley RR, Doerks T, et al. (2002): Protein domain analysis in the era of complete genomes. FEBS

Lett 513(1):129–134.

Dandekar T, Snel B, et al. (1998): Conservation of gene order: a fingerprint of proteins that physically

interact. Trends Biochem Sci 23(9):324–328.

Delsuc F,BrinkmannH, et al. (2005): Phylogenomics and the reconstruction of the tree of life. NatRev

Genet 6(5):361–375.

570 EVOLUTION STUDIED USING PROTEIN STRUCTURE

Page 11: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

Doolittle RF (1995a): The multiplicity of domains in proteins. Annu Rev Biochem 64:287–314.

Doolittle RF (1995b): Of archae and eo:what�s in a name?ProcNatl Acad Sci U SA 92(7):2421–2423.

Doolittle RF, Feng DF, et al. (1996): Determining divergence times of the major kingdoms of living

organisms with a protein clock. Science 271(5248):470–477.

DoolittleWF, Bapteste E (2007): Pattern pluralism and the tree of life hypothesis. Proc Natl Acad Sci

U S A 104(7):2043–2049.

DupontCL,Yang S, et al. (2006):Modern proteomes contain putative imprints of ancient shifts in trace

metal geochemistry. Proc Natl Acad Sci U S A 103(47):17822–17827.

Ekman D, Bjorklund AK, et al. (2005): Multi-domain proteins in the three kingdoms of life: orphan

domains and other unassigned regions. J Mol Biol 348(1):231–243.

Embley TM, Martin W (2006): Eukaryotic evolution, changes and challenges. Nature 440 (7084):

623–630.

Finn RD, Mistry J, et al. (2006): Pfam: clans, web tools and services. Nucleic Acids Res 34(Database

issue):D247–D251.

Flaherty KM,McKay DB, et al. (1991): Similarity of the three-dimensional structures of actin and the

ATPase fragment of a 70-kDa heat shock cognate protein. Proc Natl Acad Sci U S A 88(11):

5041–5045.

Forterre P, Philippe H (1999): Where is the root of the universal tree of life? Bioessays 21(10):

871–879.

Fukami-Kobayashi K, Minezaki Y, et al. (2007): A tree of life based on protein domain organizations.

Mol Biol Evol 24(5):1181–1189.

Gerstein M (1997): A structural census of genomes: comparing bacterial, eukaryotic, and archaeal

genomes in terms of protein structure. J Mol Biol 274(4):562–576.

Gough J, Chothia C (2002): SUPERFAMILY: HMMs representing all proteins of known structure.

SCOP sequence searches, alignments and genome assignments. Nucleic AcidsRes 30 (1):268–272.

Gough J (2005): Convergent evolution of domain architectures (is rare). Bioinformatics 21(8):

1464–1471.

GovindarajanS,RecabarrenR,etal. (1999):Estimatingthetotalnumberofproteinfolds. Proteins35(4):

408–414.

Grishin NV (2001): Fold change in evolution of protein structures. J Struct Biol 134(2–3):167–185.

House CH, Fitz-Gibbon ST (2002): Using homolog groups to create a whole-genomic tree of free-

living organisms: an update. J Mol Evol 54(4):539–547.

Koonin EV,WolfYI, et al. (2002): The structure of the protein universe and genome evolution. Nature

420(6912):218–223.

Kummerfeld SK, Teichmann SA (2005): Relative rates of gene fusion and fission in multi-domain

proteins. Trends Genet 21(1):25–30.

Lake JA, Rivera MC (2004): Deriving the genomic tree of life in the presence of horizontal gene

transfer: conditioned reconstruction. Mol Biol Evol 21(4):681–690.

Levitt M (2007): Growth of novel protein structural data. Proc Natl Acad Sci U S A 104(9):

3183–3188.

Lin J, Gerstein M (2000): Whole-genome trees based on the occurrence of folds and orthologs:

implications for comparing genomes on different levels. Genome Res 10(6):808–818.

Marsden RL, Lewis TA, et al. (2007): Towards a comprehensive structural coverage of completed

genomes: a structural genomics viewpoint. BMC Bioinform 8:86.

Mayr E (1998): Two empires or three? Proc Natl Acad Sci U S A 95(17):9720–9723.

Murzin AG, Brenner SE, et al. (1995): SCOP: a structural classification of proteins database for the

investigation of sequences and structures. J Mol Biol 247(4):536–540.

REFERENCES 571

Page 12: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

Orengo CA,Michie AD, et al. (1997): CATH–a hierarchic classification of protein domain structures.

Structure 5(8):1093–1108.

Orengo CA, Bray JE, et al. (2002): The CATH protein family database: a resource for structural and

functional annotation of genomes. Proteomics 2(1):11–21.

Ouzounis CA, Kunin V, et al. (2006): A minimal estimate for the gene content of the last

universal common ancestor–exobiology from a terrestrial perspective. Res Microbiol 157(1):

57–68.

Pastore A, Lesk AM (1990): Comparison of the structures of globins and phycocyanins: evidence for

evolutionary relationship. Proteins 8(2):133–155.

Philippe H, Delsuc F, et al. (2005): Phylogenomics. Annu Rev Ecol Evol Syst 36:541–562.

Q1 Pisani D, Cotton JA, et al. (2007): Supertrees disentangle the chimeric origin of eukaryotic genomes.

Mol Biol Evol.

Ponting CP, Russell RR (2002): The natural history of protein domains. Annu Rev Biophys Biomol

Struct 31:45–71.

Ranea JA, Buchan DW, et al. (2004): Evolution of protein superfamilies and bacterial genome size.

J Mol Biol 336(4):871–887.

Ranea JA, Sillero A, et al. (2006): Protein superfamily evolution and the last universal common

ancestor (LUCA). J Mol Evol 63(4):513–525.

Scheeff ED, Bourne PE (2005): Structural evolution of the protein kinase-like superfamily. PLoS

Comput Biol 1(5):e49.

Schnuchel A, Wiltscheck R, et al. (1993): Structure in solution of the major cold-shock protein from

Bacillus subtilis. Nature 364(6433):169–171.

Snel B, Bork P, et al. (1999): Genome phylogeny based on gene content. Nat Genet 21(1):108–110.

Snel B, Huynen MA, et al. (2005): Genome trees and the nature of genome evolution. Annu Rev

Microbiol 59:191–209.

Teichmann SA, Park J, et al. (1998): Structural assignments to the Mycoplasma genitalium proteins

show extensive gene duplications and domain rearrangements. Proc Natl Acad Sci U S A 95(25):

14658–14663.

Tekaia F, Lazcano A, et al. (1999): The genomic tree as revealed from whole proteome comparisons.

Genome Res 9(6):550–557.

Vogel C, Berzuini C, et al. (2004): Supra-domains: evolutionary units larger than single protein

domains. J Mol Biol 336(3):809–823.

Vogel C, Teichmann SA, et al. (2005): The relationship between domain duplication and recombina-

tion. J Mol Biol 346(1):355–365.

Vogel C, Chothia C (2006): Protein family expansions and biological complexity. PLoS Comput Biol

2(5):e48.

Wang M, Caetano-Anolles G (2006): Global phylogeny determined by the combination of protein

domains in proteomes. Mol Biol Evol 23(12):2444–2454.

Woese CR, Fox GE (1977): Phylogenetic structure of the prokaryotic domain: the primary kingdoms.

Proc Natl Acad Sci U S A 74(11):5088–5090.

Woese CR (1998a): The universal ancestor. Proc Natl Acad Sci U S A 95 (12):6854–6859.

Woese CR (1998b): Default taxonomy: ErnstMayr�s view of themicrobial world. Proc Natl Acad Sci

U S A 95(19):11043–11046.

Wolf YI, Brenner SE, et al. (1999): Distribution of protein folds in the three superkingdoms of life.

Genome Res 9(1):17–26.

Wolf YI, Grishin NV, et al. (2000): Estimating the number of protein folds and families from complete

genome data. J Mol Biol 299(4):897–905.

572 EVOLUTION STUDIED USING PROTEIN STRUCTURE

Page 13: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

Wolf YI, Rogozin IB, et al. (2001): Genome trees constructed using five different approaches suggest

new major bacterial clades. BMC Evol Biol 1:8.

Wolf YI, Rogozin IB, et al. (2002): Genome trees and the tree of life. Trends Genet 18(9):472–479.

Xie L, Bourne PE (2005): Functional coverage of the human genome by existing structures, structural

genomics targets, and homology models. PLoS Comput Biol 1(3):e31.

Yang S, Doolittle RF, et al. (2005): Phylogeny determined by protein domain content. Proc Natl Acad

Sci U S A 102(2):373–378.

Yang S (2007): Evolution Studied Through Protein Structural Domains. San Diego: Department of

Chemistry and Biochemistry, La Jolla, University of California, p 174.

Zhang CT (1997): Relations of the numbers of protein sequences, families and folds. Protein Eng

10(7):757–761.

Zuckerkandl E, Pauling L (1965): Molecules as documents of evolutionary history. J Theor Biol 8(2):

357–366.

REFERENCES 573

Page 14: EVOLUTION STUDIED USING PROTEIN · own unique 3D structure (also called fold) and corresponds to a series of amino acid sequences that can fold into the domain structure. Protein

Author Query1. Please provide the volume no. and the page range in reference: Pisani et al., 2007.