(nas colloquium) computational biomolecular science


COLLOQUIUM ON COMPUTATIONAL BIOMOLECULAR SCIENCE

NATIONAL ACADEMY OF SCIENCES
WASHINGTON, D.C.

1998

About this PDF file: This new digital representation of the original work has been recomposed from XML files created from the original paper book, not from the original typesetting files. Page breaks are true to the original; line lengths, word breaks, heading styles, and other typesetting-specific formatting, however, cannot be retained, and some typographic errors may have been accidentally inserted. Please use the print version of this publication as the authoritative version for attribution.


NATIONAL ACADEMY OF SCIENCES

Colloquium Series

In 1991, the National Academy of Sciences inaugurated a series of scientific colloquia, five or six of which are scheduled each year under the guidance of the NAS Council’s Committee on Scientific Programs. Each colloquium addresses a scientific topic of broad and topical interest, cutting across two or more of the traditional disciplines. Typically two days long, colloquia are international in scope and bring together leading scientists in the field. Papers from colloquia are published in


COMPLETED NAS COLLOQUIA (1991 TO PRESENT)

Industrial Ecology
May 20–21, 1991; Washington, D.C.
Organizer: C. Kumar N. Patel
Proceedings: February 4, 1992

Images of Science: Science of Images
January 13–14, 1992; Washington, D.C.
Organizer: Albert Crewe
Proceedings: November 3, 1993

Physical Cosmology
March 27–29, 1992; Irvine, California
Organizer: David Schramm
Proceedings: June 3, 1993

Molecular Recognition
September 10–11, 1992; Washington, D.C.
Organizer: Ronald Breslow
Proceedings: February 16, 1993

Human-Machine Communication by Voice
February 8–9, 1993; Irvine, California
Organizer: Lawrence Rabiner
Proceedings: October 24, 1995

Changing Human Ecology and Behavior: Effects on Infectious Diseases
September 27–28, 1993; Washington, D.C.
Organizer: Bernard Roizman
Proceedings: March 29, 1994

The Tempo and Mode of Evolution
January 27–29, 1994; Irvine, California
Organizers: Francisco Ayala, Walter Fitch
Proceedings: July 19, 1994

Chemical Ecology: The Chemistry of Biotic Interaction
March 25–26, 1994; Washington, D.C.
Organizers: Thomas Eisner, Jerrold Meinwald
Proceedings: January 3, 1995

Physics: The Opening to Complexity
June 25–27, 1994; Irvine, California
Organizer: Philip Anderson
Proceedings: July 18, 1995

Self Defense by Plants: Induction and Signaling Pathways
September 15–17, 1994; Irvine, California
Organizers: André Jagendorf, Clarence Ryan
Proceedings: May 9, 1995

Earthquake Prediction
February 10–11, 1995; Irvine, California
Organizer: Leon Knopoff
Proceedings: April 30, 1996

Quasars and Active Galaxies: High Resolution Radio Imaging
March 24–25, 1995; Irvine, California
Organizers: Marshall Cohen, Kenneth Kellerman
Proceedings: December 5, 1995

Vision: From Photon to Perception
May 21–22, 1995; Irvine, California
Organizers: John Dowling, Lubert Stryer, and Torsten Wiesel
Proceedings: January 23, 1996

Science, Technology, and the Economy
October 20–22, 1995; Irvine, California
Organizers: James Heckman, Ariel Pakes, and Kenneth Sokoloff
Proceedings: November 12, 1996

Developmental Biology of Transcription Control
October 25–28, 1995; Irvine, California
Organizers: Roy Britten, Eric Davidson, and Gary Felsenfeld
Proceedings: September 3, 1996

Carbon Dioxide and Climate Change
November 13–15, 1995; Irvine, California
Organizer: Charles Keeling
Proceedings: August 5, 1997

Memory: Recording Experience in Cells and Circuits
February 17–20, 1996; Irvine, California
Organizer: Patricia Goldman-Rakic
Proceedings: November 26, 1996


Elliptic Curves and Modular Forms
March 15–17, 1996; Washington, D.C.
Organizers: Barry Mazur, Karl Rubin
Proceedings: October 14, 1997

Symmetries Throughout the Sciences
May 10–12, 1996; Irvine, California
Organizer: Ernest Henley
Proceedings: December 15, 1996

Genetic Engineering of Viruses and Viral Vectors
June 9–11, 1996; Irvine, California
Organizers: Peter Palese, Bernard Roizman
Proceedings: October 15, 1996

Genetics and the Origin of Species
January 30–February 1, 1997; Irvine, California
Organizers: Francisco Ayala, Walter Fitch
Proceedings: July 22, 1997

The Age of the Universe: Dark Matter and Structure Formation
March 21–23, 1997; Irvine, California
Organizers: David Schramm, P. J. E. Peebles
Proceedings: January 6, 1998

Neuroimaging and Human Brain Function
May 29–31, 1997; Irvine, California
Organizers: Michael Posner, Marcus Raichle
Proceedings: February 3, 1998

Protecting Our Food Supply: The Value of Plant Genome Initiatives
June 2–4, 1997; Irvine, California
Organizers: Michael Freeling, Ronald Phillips, John Axtell
Proceedings: March 5, 1998

Computational Biomolecular Science
September 11–14, 1997; Irvine, California
Organizers: Peter G. Wolynes, Russell Doolittle, J. A. McCammon
Proceedings: May 26, 1998

A Library Approach to Chemistry
October 19–21, 1997; Irvine, California
Organizers: Peter Schultz, Jonathan Ellman


PROGRAM

Computational Biomolecular Science

Thursday, September 11, 1997

Registration and Welcome Reception

Friday, September 12, 1997

Session I, 8:45 AM-12:30 PM

Introduction. Peter Wolynes.
Measuring genome evolution. Peer Bork (EMBL, Heidelberg).
Determining biological function from sequence: Building highly specific sequence motifs for genome analysis. Douglas Brutlag (Stanford).
Experimental studies of protein folding dynamics. William Eaton (NIH).
Coupling the folding of homologous proteins. Ron Elber (Hebrew University).

Chair, Russell Doolittle

Session II, 2:00 PM-5:30 PM

Photoactive yellow protein: Prototype for the PAS domains of sensors and clocks. Elizabeth Getzoff (Scripps Research Institute).
Inhomogeneities in genomic sequence composition. Philip Green (Univ. Washington).
New refinement methods for NOE-distance based NMR structure. Angela Gronenborn (NIH).
Estimation of evolutionary distances between DNA sequences. Wen-Hsiung Li (Univ. Texas, Houston).
Comments by Roy Britten.
After-dinner Lecture. From slide rule to super computer. Hans Frauenfelder (Los Alamos).

Chair, Andrew McCammon

Saturday, September 13, 1997

Session III, 9:00 AM-12:30 PM

Comparing sequence comparison with structure comparison. Michael Levitt (Stanford).
Structural classification of proteins and its evolutionary implications. Alexey Murzin (MRC, Cambridge).
Exploring the protein folding funnel landscape-connection to fast folding experiments. Jose Onuchic (UCSD).
Bridged bimetallic enzymes: A challenge for computational chemistry. Gregory Petsko (Brandeis).

Chair, Andrew McCammon

Session IV, 2:00 PM-5:30 PM

Sequence determinants of protein folding and stability. Robert Sauer (MIT).
The evolution of efficient light harvesting in photosynthesis-one goal, many solutions. Klaus Schulten (Illinois).
Electrostatic steering and ionic tethering in simulations of protein-ligand interactions. Rebecca Wade (EMBL, Heidelberg).
Computer simulation of enzymatic reactions and other biological process; finding out what was optimized by evolution. Arieh Warshel (USC).
After-dinner Lecture. Applications of computers in structural biology. Harold Scheraga (Cornell).

Chair, Peter Wolynes


LIST OF ATTENDEES

Computational Biomolecular Science

Robert K. Adair, Yale University
Paul A. Bash, Argonne National Laboratory
R. L. Bernstein, San Francisco State University
Paul Beroza, CombiChem Inc.
Peer Bork, European Molecular Biology Laboratory
David A. Brant, University of California
Roy J. Britten, California Institute of Technology
Thomas C. Bruice, University of California, Santa Barbara
Douglas Brutlag, Stanford University Medical School
Aloke Chatterjee, Lawrence Berkeley National Laboratory
Jiangang Chen, University of California, Los Angeles
Margaret S. Cheung, University of California, San Diego
Julian D. Cole, Rensselaer Polytechnic Institute
Kumari Devulapalle, University of Southern California, School of Dentistry
Russell F. Doolittle, University of California, San Diego
William Eaton, National Institutes of Health
Ron Elber, Hebrew University
Adrien Elcock, University of California, San Diego
Hans Frauenfelder, Los Alamos National Laboratory
Anthony Gamst, University of California, San Diego
Robert Gerber, University of California, Irvine
Elizabeth D. Getzoff, Scripps Research Institute
Raveh Gill-More, Compugen Ltd.
Adam Godzik, The Scripps Research Institute
Jill E. Gready, Australian National University
Philip Green, University of Washington
Angela M. Gronenborn, National Institutes of Health
William Grundy, University of California, San Diego
Volkhard Helms, University of California, San Diego
Dennis Kibler, University of California, Irvine
Robert Konecny, The Scripps Research Institute
Kristin Korethe, Smith Kline Beecham
Leslie A. Kuhn, Michigan State University
Donald Kyle, Scios Inc.
Peter W. Langhoff, San Diego Supercomputer Center
Michael Levitt, Stanford University, School of Medicine
Jian Li, The Scripps Research Institute
Wen-Hsiung Li, University of Texas
E. N. Lightfoot, University of Wisconsin
Jennifer H. Y. Liu, University of California
Hartmut Luecke, University of California, Irvine
Jia Luo, University of California, Santa Barbara
Zaida Luthey-Schulten, University of Illinois
Jeffry D. Madura, University of South Alabama
J. Andrew McCammon, University of California, San Diego
Gregory Mooser, University of Southern California, School of Dentistry
Victor Munoz, National Institutes of Health
Alexey G. Murzin, Centre for Protein Engineering
Craig Nevill-Manning, Stanford University
Louis Noodleman, The Scripps Research Institute
Hugh Nymeyer, University of California, San Diego
Jose N. Onuchic, University of California, San Diego
Jean-Luc Pellequer, The Scripps Research Institute
Gregory A. Petsko, Brandeis University
Mike Potter, University of California, San Diego
Vijay S. Reddy, The Scripps Research Institute
Carolina M. Reyes, University of California, San Francisco
Roy Riblet, Medical Biology Institute
Andrey Rzhetsky, Columbia University
Suzanne B. Sandmeyer, University of California, Irvine
Robert Sauer, Massachusetts Institute of Technology
Harold Scheraga, Cornell University
Rebecca K. Schmidt, Australian National University
Klaus Schulten, University of Illinois
Soheil Shams, BioDiscovery
Sylvia Spengler, Lawrence Berkeley National Laboratory
Tim Springer, Center for Blood Research
T. P. Straatsma, Pacific Northwest National Laboratory
Ivan Sutherland, Sun Microsystems Laboratories
Mounir Tarek, National Institute of Standards and Technology
Douglas Tobias, University of California, Irvine
Chandra S. Verma, University of York
Rebecca Wade, European Molecular Biology Laboratory
Frederic Y. M. Wan, University of California, Irvine
Arieh Warshel, University of Southern California
Stephen H. White, University of California, Irvine
Peter Wolynes, National Institutes of Health
Willy Wriggers, University of Illinois at Urbana-Champaign
William V. Wright, University of North Carolina
Thomas Wu, Stanford University
Qiang Zhenq, Scios Inc.


Table of Contents

Papers from a National Academy of Sciences Colloquium on Computational Biomolecular Science

Computational biomolecular science
Peter G. Wolynes
5848

Measuring genome evolution
Martijn A. Huynen and Peer Bork
5849–5856

SMART, a simple modular architecture research tool: Identification of signaling domains
Jörg Schultz, Frank Milpetz, Peer Bork, and Chris P. Ponting
5857–5864

Highly specific protein sequence motifs for genome analysis
Craig G. Nevill-Manning, Thomas D. Wu, and Douglas L. Brutlag
5865–5871

A statistical mechanical model for β-hairpin kinetics
Victor Munoz, Eric R. Henry, James Hofrichter, and William A. Eaton
5872–5879

Coupling the folding of homologous proteins
Chen Keasar, Dror Tobi, Ron Elber, and Jeff Skolnick
5880–5883

Photoactive yellow protein: A structural prototype for the three-dimensional fold of the PAS domain superfamily
Jean-Luc Pellequer, Karen A. Wager-Smith, Steve A. Kay, and Elizabeth D. Getzoff
5884–5890

New methods of structure refinement for macromolecular structure determination by NMR
G. Marius Clore and Angela M. Gronenborn
5891–5898

Estimation of evolutionary distances under stationary and nonstationary models of nucleotide substitution
Xun Gu and Wen-Hsiung Li
5899–5905

Precise sequence complementarity between yeast chromosome ends and two classes of just-subtelomeric sequences
Roy J. Britten
5906–5912

A unified statistical framework for sequence comparison and structure comparison
Michael Levitt and Mark Gerstein
5913–5920

Folding funnels and frustration in off-lattice minimalist protein landscapes
Hugh Nymeyer, Angel E. García, and José Nelson Onuchic
5921–5928

Optimizing the stability of single-chain proteins by linker length and composition mutagenesis
Clifford R. Robinson and Robert T. Sauer
5929–5934

Architecture and mechanism of the light-harvesting apparatus of purple bacteria
Xiche Hu, Ana Damjanović, Thorsten Ritz, and Klaus Schulten
5935–5941

Electrostatic steering and ionic tethering in enzyme-ligand binding: Insights from simulations
Rebecca C. Wade, Razif R. Gabdoulline, Susanna K. Lüdemann, and Valère Lounnas
5942–5949

Computer simulations of enzyme catalysis: Finding out what has been optimized by evolution
Arieh Warshel and Jan Florián
5950–5955


PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA


Proc. Natl. Acad. Sci. USA
Vol. 95, p. 5848, May 1998
Colloquium Paper

This paper is the introduction to the following papers, which were presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

Computational biomolecular science

PETER G. WOLYNES

School of Chemical Sciences, University of Illinois, Urbana-Champaign, Urbana, IL 61801

In this century, the study of the molecules of life has transformed the practice of biology as a whole. Molecular thinking now influences the research agenda for scientists studying both the behavior of individual cells and organisms, and the relationships between organisms as in natural history. Even ecology and anthropology are being influenced by this molecular revolution. It is impressive that this transformation has, to a large extent, been made possible by simply identifying (with very clever strategies!) active biological molecules and cataloging their information content through their sequences. One result of all this activity is that raw data about life at the molecular level have become abundant, but understanding their biological meaning remains, in many if not most respects, perplexing. Fortunately, just at this stage, new approaches to understanding the connection between biomolecular sequence and physiological behavior are coming forward. Computation, theory, and novel experimental approaches that utilize the combinatorial power of the genetic code allow us to begin to understand biomolecular function from both the bottom-up atomistic point-of-view of the physical sciences and the top-down view usually associated with the evolutionary perspective.

The goal of this colloquium was to bring together some of the workers from different scientific disciplines who are approaching these problems by using quantitative methods. Because computation plays such a large part in exploiting the information content of sequence data, the conference was entitled “Computational Biomolecular Science,” although some of the essential input of new experiments to this emerging discipline was covered too.

From the bottom-up perspective, the first event to consider on the road from sequence to the biological behavior of an organism is the folding of a linear polymer into a three-dimensional structure. Once a molecule is properly folded, a variety of motions still go on in the folded state. It is through these motions that the biological molecule can function. These dynamical aspects represent complex problems in chemistry and physics. But it is the aptness with which these functions are carried out that at last determines whether the organism containing that molecule can survive in the struggle with other organisms. Quantitatively understanding molecular behavior sufficiently well for understanding this final biological goal requires much work from both the theoreticians and the experimentalists.

The top-down interpretation of molecular data appears to proceed quite differently. Avoiding the complexity of molecular theory, the evolutionary perspective takes inheritance, perhaps the most self-evident aspect of “living” things, as its central concept. Comparing sequences between different organisms then provides clues to their molecular function. In this study, dominant use is made of features of molecules that do not change an organism’s fitness, thus allowing markers of inheritance to be reliably assigned. In a sense then the nonfunctional parts of a molecule’s structure and dynamics are the most useful to the phylogenetically inclined scientist. Convergent evolution is hard to establish by such studies but is critically important to those who wonder whether, from the atomistic perspective, there are indeed general themes to the scheme of life. Despite its sometimes “life as a blackbox” character, the top-down viewpoint has achieved a myriad of successes in the practical applications of biomolecular science.

A gap exists between the two different vantage points of looking at biomolecular information, but there are a surprising number of common concepts. In understanding the folding, motions, and function of biological molecules, for example, a powerful new viewpoint that describes the entire energy landscape of a biomolecule in a statistical fashion is proving essential. Understanding and differentiating between those parts of the energetics and dynamics that are biologically significant and those that can be thought of as random noise is the hallmark of this approach. Similarly, in the comparative top-down approach to understanding sequence data, a tremendous amount of statistical thinking must be done to understand whether a perceptible similarity between two sequences really means the molecules have comparable function or structure or whether the similarity is just an accident. Just as in energy landscape theory, extracting signal from noise is the crucial point to understanding molecular evolution. Such frankly statistical viewpoints must also be brought together when planning modern molecular biology experiments that now begin to allow the study of a huge number of variants of a biomolecule in the laboratory simultaneously.

It became apparent in the meeting that, apart from the general common interest in biomolecules and the common but general theoretical concepts based on statistics, there were many specific problems where the top-down and bottom-up viewpoints can profitably be merged. For example, surveys of genomes reveal widespread structural themes that may be clues to folding thermodynamics and kinetic folding routes. For the atomists, several studies show how the structures of specific sequences can be predicted if knowledge of the sequences of many widely different but evolutionarily related molecules is available. On the other hand, for the evolutionist, an a priori knowledge of structural and energetic patterns in molecules leads to refined algorithms for comparing sequences to obtain reliable phylogenies. Also, convergent evolution can be recognized if both comparative and physical studies are available for proteins in the same family. This breaks evolutionary explanation out of the mold of sophisticated Kipling “just-so” stories into the quantitative mode, most prized by natural scientists.

The papers in this colloquium give a partial snapshot of computational biomolecular science today. The organizers of the meeting, J. A. McCammon, R. F. Doolittle, and I, hope these papers give the readers of the Proceedings an idea of what is going on in a branch of science that is destined to grow much larger in the coming years.

© 1998 by The National Academy of Sciences 0027–8424/98/955848–1$2.00/0
PNAS is available online at http://www.pnas.org.


Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5849–5856, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

Measuring genome evolution

(ortholog/synteny/computer analysis/horizontal gene transfer)

MARTIJN A. HUYNEN* AND PEER BORK

European Molecular Biology Laboratory, Meyerhofstrasse 1, 69012 Heidelberg, Germany, and Max-Delbrück-Centrum for Molecular Medicine, 13122 Berlin-Buch, Germany

ABSTRACT The determination of complete genome sequences provides us with an opportunity to describe and analyze evolution at the comprehensive level of genomes. Here we compare nine genomes with respect to their protein coding genes at two levels: (i) we compare genomes as “bags of genes” and measure the fraction of orthologs shared between genomes and (ii) we quantify correlations between genes with respect to their relative positions in genomes. Distances between the genomes are related to their divergence times, measured as the number of amino acid substitutions per site in a set of 34 orthologous genes that are shared among all the genomes compared. We establish a hierarchy of rates at which genomes have changed during evolution. Protein sequence identity is the most conserved, followed by the complement of genes within the genome. Next is the degree of conservation of the order of genes, whereas gene regulation appears to evolve at the highest rate. Finally, we show that some genomes are more highly organized than others: they show a higher degree of the clustering of genes that have orthologs in other genomes.

Molecular evolution usually is studied at the level of single genes. With the determination of genome sequences we have an opportunity to study it at a higher, comprehensive level, that of complete genomes. This leads to the pertinent question: how can genomic information be used to obtain useful information concerning genome evolution? The goal of this paper is to create baseline expectations for measures of genome distances that are based on gene content. By describing some general patterns one also can identify the exceptions. Measuring evolution at the level of complete genomes is pertinent as it is, after all, the principal level for natural selection. Furthermore, it is intermediate to levels at which evolution has long been studied: namely, the molecular level in genes and genotypes, and the organismal level in the fossil record. The genome in principle contains all of the information necessary to bridge the gap between genotype and phenotype. For example, by knowing the functions of the genes in a genome of a species we can postulate a model for its complete metabolism. However, we have to be careful not to overstate our expectations. The situation might turn out to be analogous to that of proteins, for which, in principle, all information necessary to determine three-dimensional structures in the form of amino acid sequences is known, yet we remain unable to predict their tertiary structures.

Genomes can be analyzed and compared for various features: e.g., nucleotide content, compositional biases of leading and lagging strands in replication (e.g., in Escherichia coli) (1), dinucleotide frequencies (2), the occurrence of repeats (e.g., in virulence genes of Haemophilus influenzae: ref. 3), RNA structures, coding densities, protein coding genes, operons, the size distribution of gene families (4), etc. They also can be compared at a variety of levels: a first-order level where we regard the genome as a “bag of genes” without taking account of interactions between the various components, and a second-order level that considers whether properties of genomes are cross-correlated (e.g., the absence of certain polynucleotides together with the presence of restriction enzymes that specifically cut these polynucleotides; ref. 5). In this paper we focus on first- and second-order patterns in protein coding regions in genomes. Specifically we measure: (i) the fraction of orthologous sequences between genomes, (ii) the conservation of gene order between genomes, and (iii) the spatial clustering of genes in one genome that have an ortholog in another genome. We correlate these measures with the divergence time between the genomes compared. It is not our goal to define new distance measures to construct phylogenetic trees. Rather it is to analyze the conservation and differentiation of patterns between genomes, to show how we can extract useful information from these, and to analyze at what relative time scales they change. The analyses are done on the first nine sequenced Archaea and Bacteria that were publicly available: H. influenzae (6), Mycoplasma genitalium (7), Synechocystis sp. PCC 6803 (8), Methanococcus jannaschii (9), Mycoplasma pneumoniae (10), E. coli (1), Methanobacterium thermoautotrophicum (11), Helicobacter pylori (12), and Bacillus subtilis (13). Although the total number of publicly available genome sequences is growing rapidly, the trends that we observe should remain largely unchanged with the comparison of new species, given the diverse range of evolutionary distances of the species compared in this paper.
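As an illustration of measures (i) and (ii), the following minimal Python sketch computes a shared-ortholog fraction and a gene-order conservation score from toy inputs. The function names, the normalization by the smaller genome, and the definition of "conserved order" as adjacency of neighboring orthologs on a linear gene list are all assumptions made for this example; the text above does not specify the exact definitions used in the analyses.

```python
from typing import Dict, List


def shared_ortholog_fraction(n_genes_a: int, n_genes_b: int, n_ortholog_pairs: int) -> float:
    """Fraction of orthologs shared between two genomes.

    Assumption: the count is normalized by the smaller gene set.
    """
    smaller = min(n_genes_a, n_genes_b)
    if smaller == 0:
        raise ValueError("both genomes must contain at least one gene")
    return n_ortholog_pairs / smaller


def conserved_neighbor_fraction(
    order_a: List[str], order_b: List[str], orthologs: Dict[str, str]
) -> float:
    """Fraction of adjacent gene pairs in genome A, both with orthologs,
    whose orthologs are also adjacent in genome B.

    Assumption: genomes are treated as linear lists (circularity ignored),
    and every ortholog named in `orthologs` occurs in `order_b`.
    """
    position_b = {gene: i for i, gene in enumerate(order_b)}
    total = 0
    conserved = 0
    for left, right in zip(order_a, order_a[1:]):
        if left in orthologs and right in orthologs:
            total += 1
            distance = abs(position_b[orthologs[left]] - position_b[orthologs[right]])
            if distance == 1:
                conserved += 1
    return conserved / total if total else 0.0


# Toy data (hypothetical gene names).
order_a = ["a1", "a2", "a3", "a4"]
order_b = ["b2", "b1", "b9", "b3"]
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3"}

print(shared_ortholog_fraction(len(order_a), len(order_b), len(orthologs)))
print(conserved_neighbor_fraction(order_a, order_b, orthologs))
```

In the toy data, the pair (a1, a2) keeps its neighborhood in genome B while (a2, a3) does not, so the two measures can diverge even for a small gene set.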

Methodological Issues in Comparisons of Genomes

Identification of Orthologous Genes. Defining orthology. In comparing the genes of different genomes it is important that we avoid comparisons of “apples and pears”: i.e., that we are able to identify which genes correspond to each other in the various genomes. Fitch (14) introduced the term “orthologs” for genes whose independent evolution reflects a speciation event rather than a gene duplication event. “Where the homology is the result of gene duplication so that both copies have descended side by side during the history of an organism, (for example, alpha and beta hemoglobin) the genes should be called paralogous (para=in parallel). Where the homology is the result of speciation so that the history of the gene reflects the history of the species (for example, alpha hemoglobin in man and mouse) the genes should be called orthologous (ortho=exact)” (14). Note that orthology and paralogy are defined only with respect to the phylogeny of the genes and not with respect to function.

*To whom reprint requests should be addressed at: European Molecular Biology Laboratory, Meyerhofstrasse 1, 69012 Heidelberg, Germany. e-mail: [email protected].

© 1998 by The National Academy of Sciences 0027–8424/98/955849–8$2.00/0
PNAS is available online at http://www.pnas.org.

Identifying orthology by using relative levels of sequence identity. Ideally one would expect that the orthologous genes of two genomes are those that have the highest pairwise identity, having bifurcated relatively recently compared with genes that duplicated before the speciation. The most straightforward approach to identifying orthologous genes is to compare all genes in genomes with each other, and then to select pairs of genes with significant pairwise similarities. A pair of sequences with the highest level of identity then is considered orthologous.

Auxiliary information for detection of orthology. Auxiliary information that is useful to assess orthology is “synteny”: the presence in both genomes of neighboring sequences that are also orthologs of each other. As shown below, there is little conservation of the order of genes in genomes in evolution at a time when divergence of their orthologous genes reaches a level of 50% amino acid identity (see Fig. 3). Hence the potential for using synteny for identifying orthologs is limited mainly to genomes that have speciated only relatively recently. A second type of auxiliary information that can be used is the comparison of genes with those of a third genome. If two genes from different genomes have the highest level of identity both to each other and to a single gene from a third genome, then this is a strong indication that they are orthologs (see ref. 15 for a large-scale implementation of this idea). However, for a large fraction of genes, identifying orthologs by relative sequence identity is hampered by a variety of evolutionary processes. We describe these in the following sections.

Sequence divergence. At large evolutionary distances, e.g., between Archaea and Bacteria, sequence similarities may be eroded to such an extent that the distance between orthologous sequences is similar to that between sequences that are merely part of the same gene family. More dramatically, homologous sequences can diverge “beyond recognition,” such that the similarity between two orthologs is not higher than the similarity between sequences that are not part of the same gene family and automatic procedures for the recognition of homology fail. A recent survey of genes in Drosophila shows that one-third of the cDNAs code for very fast evolving genes, for which the frequency of amino acid substituting mutations is only 2-fold lower than that of silent mutations, leading to a situation where homologous proteins are barely recognizable after 8,000 years of evolution (16).

Nonorthologous gene displacement. A second event problematic to ortholog identification is nonorthologous gene displacement. This occurs when two nonorthologous genes that are unrelated or only remotely related perform the same function in two organisms (17). This occurs relatively frequently: a comparison of M. genitalium to H. influenzae revealed 12 clear-cut cases (17). As a consequence, orthologs may not be detectable (or are classified as paralogs) in another organism even when the corresponding function is retained.

Gene duplication, gene loss, and horizontal gene transfer. A third process that restricts the identification of orthologous genes is that of gene loss in combination with gene duplication. If two genomes lose different paralogs of an ancestral gene that was duplicated before the speciation event, the remaining genes have highest sequence identity even though they are not orthologs (18). One may test for such an event by checking whether the protein similarity falls into an expected range. This is done implicitly by including (presumably orthologous) sequences from other species in the phylogeny and checking whether the gene tree is in accordance with the species tree (18, 19). Inconsistencies between the species tree and the gene tree can indicate nonorthologous relationships between genes. However, they also can be caused by horizontal gene transfer, in which case the genes still could be orthologs. In general, orthologous relationships, horizontal gene transfer, and ancient gene duplications cannot be cleanly distinguished from one another. Besides the construction of phylogenetic trees, an additional strategy for finding horizontal gene transfer is the comparison of nucleotide frequencies within a genome. Recently transferred genes often display nucleotide frequencies that deviate significantly from the rest of the genome (20, 21). A conservative estimate of the number of genes that recently have been transferred to E. coli, based on nucleotide frequencies and dinucleotide frequencies in genomes, is 10%–15% of the E. coli genome (Phil Green, personal communication; ref. 21). A third strategy for finding horizontal gene transfer is synteny. Because gene order is rarely conserved in evolution, the presence in two distant evolutionary branches of the same order of genes, combined with the absence of this gene order in other more closely related branches, can point to horizontal gene transfer. This strategy has been used successfully to find the example of horizontal gene transfer described in Fig. 1.
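As a rough illustration of the nucleotide-frequency strategy, the sketch below flags genes whose G+C content deviates strongly from the genome-wide mean. It is not the procedure of refs. 20 and 21: the function names, the use of G+C content alone instead of full nucleotide and dinucleotide frequencies, and the z-score cutoff of 2 are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def gc_fraction(seq):
    """Fraction of G and C nucleotides in a non-empty DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_genes(genes, z_cutoff=2.0):
    """Return sorted names of genes with unusual G+C content.

    genes: dict mapping a gene name to its nucleotide sequence.
    A gene is reported when its G+C fraction lies more than z_cutoff
    standard deviations from the mean over all genes (hypothetical cutoff).
    """
    gc = {name: gc_fraction(seq) for name, seq in genes.items()}
    mu = mean(gc.values())
    sd = pstdev(gc.values())
    if sd == 0:
        return []  # all genes have identical composition
    return sorted(name for name, value in gc.items()
                  if abs(value - mu) / sd > z_cutoff)
```

Such a composition screen can only suggest candidates for recent transfer; genes transferred long ago, or from a donor with similar composition, would not stand out.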

Orthology in multidomain proteins. In multidomain proteins two levels of orthology can be distinguished: one is at the level of single domains, a second at the level of the whole protein. This may lead to situations where nonorthologous proteins possess orthologous domains. Modularity of genes in the sense that modules can have different positions, but the same function, in various proteins, is not well documented in Bacteria and Archaea. A first step toward modularity, the presence of “gene fusion” or “gene splitting,” however, does occur regularly. Comparative analysis of the genomes of H. influenzae and E. coli showed 10 clear-cut cases of genes that were separate in E. coli but part of a single gene in H. influenzae, and 24 clear-cut cases of genes that were separate in H. influenzae but part of a single gene in E. coli (unpublished data).

A much more complicated scenario, in which many of the factors described above (multidomain proteins, synteny, and horizontal gene transfer) are involved, is shown in Fig. 1. In general, a combination of the various evolutionary processes described above leads to a situation where, although orthology was defined originally as a one-to-one relationship between proteins, it must be considered a many-to-many relationship.

From homologs to orthologs. The advent of powerful, easy-to-use tools, such as PSI-BLAST (22), to find homologous sequences is likely to shift the emphasis in sequence analysis from predicting homology to predicting orthology. It is clear that, at present, there is not a single, simple, and perfect solution to the question of orthology. Orthology is methodologically defined; that is, depending on what is asked of the genomes that are compared, different methods to find orthologous genes are being used. We use a minimal definition when we are interested only in the number of orthologs shared between genomes at various phylogenetic distances. Orthologs then are defined in the following manner: (i) they have the highest level of pairwise identity when compared with the identities of either gene to all other genes in the other’s genome; (ii) the pairwise identity is significant (E, the expected fraction of false positives, is smaller than 0.01); and (iii) the similarity extends to at least 60% of one of the genes. The region of similarity is not required to cover the majority of both genes, to include the possibility of gene fusion and gene splitting. In more detailed comparisons between a small number of genomes, auxiliary information was used to determine orthology, such as the order of genes and the comparison to genes from a third genome (see legend to Fig. 1).
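The minimal definition can be sketched as a filter followed by a reciprocal best-hit selection. The code below is an illustration only: the `Hit` record, its field names, and the decision to apply the E-value and coverage filters before choosing the best hit are assumptions, and ties in identity are broken arbitrarily by input order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise comparison between a gene of genome A and a gene of genome B."""
    gene_a: str
    gene_b: str
    identity: float    # fraction of identical residues in the aligned region
    e_value: float     # expected fraction of false positives
    aligned_len: int   # length of the region of similarity
    len_a: int         # length of the protein from genome A
    len_b: int         # length of the protein from genome B


def minimal_orthologs(hits, max_e=0.01, min_cover=0.60):
    """Return sorted (gene_a, gene_b) pairs satisfying criteria (i)-(iii)."""
    # (ii) significance and (iii) coverage of at least one of the two genes,
    # which is equivalent to covering min_cover of the shorter gene
    usable = [h for h in hits
              if h.e_value < max_e
              and h.aligned_len / min(h.len_a, h.len_b) >= min_cover]
    # (i) highest identity for each gene, seen from either genome
    best_a, best_b = {}, {}
    for h in usable:
        if h.gene_a not in best_a or h.identity > best_a[h.gene_a].identity:
            best_a[h.gene_a] = h
        if h.gene_b not in best_b or h.identity > best_b[h.gene_b].identity:
            best_b[h.gene_b] = h
    # keep a pair only when each gene is the other's best hit
    return sorted((h.gene_a, h.gene_b) for h in best_a.values()
                  if best_b[h.gene_b] is h)
```

Because coverage is required for only one of the two genes, a short gene that matches part of a longer fused gene can still be reported, in line with the allowance for gene fusion and gene splitting.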

Given all of these complications in the finding of orthologs and the oversimplified view of evolution that the term suggests, one could conclude that it is better not to use it at all, or only in those cases where one does not have conflicting information from various sources about the phylogeny of the genes. One also can argue that it is exactly in these cases, where there are conflicts in the information about orthology from different sources, that evolution shows some of its most interesting aspects. Orthology is an important refinement over homology in describing the phylogenetic relations between genes, as long as one always keeps in mind the caveats described above and as long as the methods for determining orthology are well defined.

FIG. 1. An example of complexities in assigning orthology to multidomain proteins. The M. thermoautotrophicum genes MTH444 (a sensory transduction histidine kinase) and MTH445 (a sensory transduction regulatory protein) are orthologs of the Synechocystis sequences slr0473 (phytochrome; ref. 41) and slr0474, respectively (the gene nomenclature is from the GenBank files of complete genomes; the first letters of gene names generally represent the initials of the genomes). The arguments for orthology are: (i) The genes have a 34.8% and a 40.2% identity to each other, which is significantly higher than either of them has to other sequences in the other’s genome. (ii) They are neighboring genes in both genomes. (iii) Both MTH444 and slr0473 have the highest level of identity to a single sequence from a third species, Archaeoglobus fulgidus (42), AF1483; the same is true for MTH445 and slr0474 with respect to AF1472. Interestingly, the level of identity of the Synechocystis sequences slr0473 and slr0474 is significantly higher to the M. thermoautotrophicum and A. fulgidus sequences than it is to any of the sequences in the Bacteria, including sequences in Synechocystis itself. The reverse is even more dramatic: MTH445, AF1472, and MTH444, AF1483 are more identical, not only to their Synechocystis orthologs, but also to 27 respectively 28 other sequences in Synechocystis than they are to sequences in their own genomes. These 27 (28) sequences are paralogs of slr0473 (slr0474). The similarity between MTH444 and AF1483 is slightly lower than that between AF1483 and slr0473, whereas the similarity between AF1472 and MTH444 is significantly higher than that of either of them to slr0473. Neighbor-joining clusterings of the histidine kinase orthologs together with their most similar sequences from the three genomes (A) illustrate the most likely evolutionary scenario: a horizontal transfer of the genes from the branch that has led to Synechocystis to the branch leading to M. thermoautotrophicum and A. fulgidus. Given the relative similarities of the proteins, this event occurred after a major amplification of the histidine kinase family in Synechocystis and not long before the split of the branches that led to M. thermoautotrophicum and A. fulgidus. The fact that none of the proteins have a detectable homolog in M. jannaschii, which branched off in the Archaea not long before the branching of A. fulgidus and M. thermoautotrophicum, supports this hypothesis. The only inconsistency is the fact that in the clustering of the kinases, AF1483 and slr0473 are slightly more similar to each other than either is to MTH444. (B) Domain architecture of slr0473, AF1483, and MTH444. The genes slr0473 and AF1483 encode multidomain proteins, carrying GAF (43) domains and PAS (44, 45) motifs at their N terminus. The PAC motif (44, 45) could be detected only in AF1483. The GAF domain and PAS and PAC motifs are absent in MTH444, and have been replaced by three transmembrane regions (see ref. 11). All three genes possess a histidine kinase domain (HisKc) at their C terminus; 3′ to the slr0473 and MTH444 genes are the regulatory response genes slr0474 and MTH445. The distances between the reading frames are short: 15 nucleotides in Synechocystis, and the reading frames overlap in M. thermoautotrophicum. In A. fulgidus the spatial association between these genes is absent. The absence of the GAF and PAS domains in MTH444 might have caused different selective constraints in MTH444 than in slr0473 and AF1483, and thus increased its rate of evolution, thereby reducing its similarity to its A. fulgidus and Synechocystis orthologs at a relatively high rate. The GAF, PAC, and PAS domains were predicted by using the SMART system (ref. 46; http://www.bork.embl-heidelberg.de/Modules/sinput.shtml).

Timing Genome Divergence. To compare the rates at which various properties of genomes change, a central reference for the divergence between genomes is required. Measurement of the divergence times between the three “domains” (Archaea, Bacteria, and Eukarya) on the basis of protein dissimilarities recently has gained considerable attention and has been the subject of some controversy (see ref. 23 and references therein). The estimates of the date of the last common ancestor vary from 2 billion (24) to 3–4 billion years ago (23). The major assumptions in estimating divergence times from distances between protein sequences are: (i) the proteins are of vertical descent; i.e., they have not been horizontally transferred into the genome following the speciation of the species compared; and (ii) the proteins act as a molecular clock, having rates of amino acid substitutions that do not vary over time and between the lineages. Here we use proteins to scale divergence between and within the Archaea and the Bacteria. It is not our intention to estimate absolute divergence times; rather it is to compare the different relative rates at which genomes evolve. Thus we translate the protein dissimilarities between the species into amino acid substitutions per position per gene, using an equation derived by Grishin (25), which corrects for variations in substitution rates for both amino acids and sites: q = ln(1 + 2d)/(2d), where q is the fraction of identical amino acids between the proteins and d is the number of amino acid substitutions per site. Grishin’s equation recently was used by Doolittle et al. (23) and gives reasonable estimates for the divergence between Bacteria and Archaea. Stringent criteria were used to select a set of genes that had orthologs in all of the nine genomes compared: (i) each gene had the highest level of identity to at least five of the other genes (relative to other genes in those five genomes; see our minimal definition of orthology above); and (ii) there were no conflicting hits: from each genome only one protein was selected. The resulting set of 34 proteins is surprisingly small. It contains 17 ribosomal proteins, five tRNA synthetases, two signal recognition particles, two proteins with unknown function, and eight metabolic enzymes. Interestingly, the set consists almost exclusively of proteins that interact with RNA or synthesize RNA. In estimating divergence times of the genomes of Archaea and Bacteria it could be useful to check whether the protein similarities follow the phylogenetic tree (23), given the previously recognized ancient horizontal transfer of metabolic enzymes from Bacteria to Archaea (26), and more recent occasions of horizontal gene transfer (Fig. 1). However, because Archaeal genomes are chimeric, they were treated as such by obtaining a central reference for the distance between genomes by averaging over the proteins’ distances, irrespective of their phylogenetic trees. As Grishin’s equation tends to overestimate the number of amino acid substitutions per position for low levels of identities between genes (27), the median of the estimates of the number of amino acid substitutions was used in preference to the mean. The results are used in the following sections.
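Grishin’s equation gives q as a function of d, so obtaining d from an observed identity requires a numerical inversion. The sketch below does this by bisection and then takes the median over a set of proteins, as described above. The function names and the choice of bisection are assumptions for illustration, not a description of how the original calculation was carried out.

```python
import math
from statistics import median


def identity_from_distance(d):
    """Grishin's relation q = ln(1 + 2d) / (2d); q tends to 1 as d tends to 0."""
    if d == 0:
        return 1.0
    return math.log1p(2.0 * d) / (2.0 * d)


def distance_from_identity(q):
    """Solve q = ln(1 + 2d) / (2d) for d (substitutions per site) by bisection."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    if q == 1.0:
        return 0.0
    lo, hi = 0.0, 1.0
    # q decreases monotonically with d, so enlarge the bracket until it holds the root
    while identity_from_distance(hi) > q:
        hi *= 2.0
    for _ in range(200):  # fixed number of halvings, ample for double precision
        mid = (lo + hi) / 2.0
        if identity_from_distance(mid) > q:
            lo = mid  # identity still too high, so d must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0


def median_distance(identities):
    """Median of the per-protein distance estimates between two genomes."""
    return median(distance_from_identity(q) for q in identities)
```

For orientation, one substitution per site corresponds to q = ln(3)/2, or about 55% identity, which fits the statement that protein identity levels remain above 50% within the first time unit.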

Comparing Genomes as “Bags of Genes”

Shared Orthologous Genes. The decrease of the number of shared orthologs in time. A straightforward comparison between genomes simply considers genes, and not the correlation between genes: i.e., a genome is regarded as a “bag of genes.” Taking this a step further, we measure how the number of shared orthologs between two genomes decreases with their divergence time (Fig. 2). The results show that the fraction of shared orthologous sequences decreases rapidly in evolution, faster than the level of pairwise identity between the shared orthologs. Although the fraction of shared orthologs between Archaea and Bacteria is less than among the Bacteria, the most dramatic reduction in the fraction of shared orthologs takes place on shorter time scales within the Bacteria and Archaea, when protein identity levels between genomes are still above 50%.
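The fraction of shared orthologs depends on which genome is taken as the reference, as noted in the legend to Fig. 2. A minimal sketch of the two asymmetric fractions and their average follows; the function name and the input format (a list of ortholog pairs plus the two gene counts) are assumptions.

```python
def shared_ortholog_fractions(n_genes_a, n_genes_b, ortholog_pairs):
    """Return (fraction of A with an ortholog in B, fraction of B with one in A, average).

    ortholog_pairs: iterable of (gene_in_a, gene_in_b) tuples.
    n_genes_a, n_genes_b: total numbers of genes in the two genomes.
    """
    with_ortholog_a = {a for a, _ in ortholog_pairs}
    with_ortholog_b = {b for _, b in ortholog_pairs}
    frac_a = len(with_ortholog_a) / n_genes_a
    frac_b = len(with_ortholog_b) / n_genes_b
    return frac_a, frac_b, (frac_a + frac_b) / 2.0
```

With the same set of pairs, the smaller genome always receives the higher fraction, which is why a small genome appears more similar to a large one than the reverse.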

Non-tree-like aspects of the evolution of gene content. Even over large evolutionary distances such as those between Archaea and Bacteria, different pairs of genomes share different orthologs. For example, M. genitalium shares different orthologs with M. jannaschii than with M. thermoautotrophicum (see legend to Fig. 2). This demonstrates a non-tree-like aspect of the evolution of the gene content of genomes: phylogenetically closely related species do not necessarily share the orthologous genes that either of them shares with a phylogenetically distant species.

Differential Genome Analysis. Pairwise genome comparison. Instead of focusing on genomes’ similarities one can focus on their dissimilarities; i.e., “differential genome analysis” (28). Such analysis can be particularly revealing if the genomes are closely related but have different phenotypes, in which case one can identify the genetic basis for their differences. For example, of the genes in the pathogen H. influenzae that do not have a homolog in the relatively benign E. coli, a large fraction, 60%, are (potentially) involved in H. influenzae’s pathogenesis (28). These genes encode proteins that are located on the surface of the cell or are involved in the production of toxins, or are virulence factors, or are homologous to proteins present only in pathogenic species. By contrast, of the proteins in H. influenzae that do have an ortholog in E. coli only an estimated 12% can be considered host interaction factors.

Multiple genome comparison. Differential genome analysis can be extended to multiple genomes. One then can analyze the correlation between shared gene content and shared phenotypic features of the species compared. This is demonstrated in a comparison of the two pathogens H. influenzae and H. pylori with E. coli. H. influenzae and H. pylori share 17 orthologs that do not have a homolog in E. coli. Of these, a large fraction (12) are related to pathogenicity (unpublished data). Differential genome analysis also can be used to select genes responsible for other differences in phenotypes, e.g., metabolism. The main requirement is that the genomes are sufficiently close in evolution that the identification of orthologs is reliable and that the differences in genome content reflect mainly the phenotypic feature that one is interested in.
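Once homologs and orthologs have been assigned, differential genome analysis reduces to set operations on gene identifiers. The sketch below assumes those assignments are already available as sets; the function and argument names are hypothetical.

```python
def differential_genes(genes, with_homolog_in_reference, with_ortholog_in_partner=None):
    """Genes of a query genome that lack a homolog in a reference genome.

    genes: all gene identifiers of the query genome.
    with_homolog_in_reference: query genes that have a homolog in the reference.
    with_ortholog_in_partner: optional; query genes that have an ortholog in a
        second genome sharing the phenotype of interest. When given, only genes
        shared with that partner are kept (the multiple genome comparison).
    """
    candidates = set(genes) - set(with_homolog_in_reference)
    if with_ortholog_in_partner is not None:
        candidates &= set(with_ortholog_in_partner)
    return candidates
```

In the comparison described above, the query would be H. influenzae, the reference E. coli, and the partner H. pylori; the resulting set corresponds to the shared orthologs without an E. coli homolog.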

Measuring Correlations Between Genes

Conservation of the Spatial Association of Genes. Quantification of the differentiation of gene order. Synteny, the conservation of the order of genes, has been extensively studied already. Although some conservation of the order of genes in genomes has been reported (29, 30), the emphasis has been on the drastic rearrangement of gene order in evolution (31–33). The evolution of the spatial organization of the genome is being studied for three reasons: (i) To calibrate the rate at which it evolves. (ii) To study the genome organization of the last common ancestor (34). Shared gene order between the Archaea and the Bacteria is assumed to date back to their last common ancestor, with the exception of horizontal gene transfer (Fig. 1). (iii) To estimate the time scale at which gene regulation changes during evolution. The spatial association of genes is related to their regulation, e.g., in the case of operons.

FIG. 2. The relationship between genome similarity, measured as the fraction of shared orthologs, and time, measured as the number of amino acid substitutions per protein per position in a set of 34 orthologs. + shows the fraction of sequences in a genome A that has an ortholog in another genome B, and vice versa. This measure is asymmetric: a relatively small genome like H. influenzae is more similar to a large one like E. coli than E. coli is similar to H. influenzae. ` shows the average of the two asymmetric similarities. Here we use a minimal definition of orthology: sequences that between two genomes have the highest, significant (E<0.01) level of pairwise identity that covers at least 60% of one of the proteins are regarded as orthologs. Sequences were compared with the Smith-Waterman algorithm (47), using a parallel Bioccelerator computer. The relationship between sequence identity and the number of amino acid substitutions per position as calculated with Grishin’s equation (25) is given for comparison. If one assumes that the divergence time between the Archaea and Bacteria is 3.5 billion years (23), the unit of one amino acid substitution corresponds to about 875 million years. In this estimate of divergence time the Mycoplasmas and H. pylori are not included, because they have a relatively high rate of evolution. The highest six divergence times correspond to the comparisons of the Mycoplasmas and H. pylori with the Archaea. As is clear from the figure, the fraction of shared orthologs between genomes decreases more rapidly in evolution than does the protein identity. Note that the base level of shared orthologs at which the figure saturates consists only partly of a set of sequences that are shared by all the genomes compared. For example, there are 15 orthologous pairs shared between M. genitalium and M. thermoautotrophicum of which none of the genes has a homolog at the E<0.01 level in M. jannaschii. Of this set, the ones with the highest level of protein identity are: DnaK and DnaJ (MG305 and MG019), heat shock proteins with 51% and 50% identity, respectively, to their M. thermoautotrophicum ortholog; deoxyribose-phosphate aldolase (MG050) with 40% identity; a pyrophosphatase (MG351) with 40.5% identity; and a transcriptional regulator (MG448) with 45% identity. Genes that are shared by M. genitalium and M. jannaschii but that are absent in M. thermoautotrophicum include proteins from glycolysis, such as pyruvate kinase (MG216) with 29.1% identity and glucose-6-phosphate isomerase (MG111) with 27% protein identity.

The conservation of gene order was related to genome divergence time (Fig. 3). The results show a drastic rearrangement of genomes within the first time unit, during which protein identity levels remain above 50%, after which a saturation level is reached. Notice that the order of orthologous genes is less preserved than their presence (compare with Fig. 2). At the divergence time at which the saturation level is reached, the genes that are still paired are in general subunits of proteins, ribosomal proteins, or proteins involved in ABC transport. A detailed examination (T. Dandekar, M.A.H., and P.B., unpublished data) of all conserved pairs of proteins in three Gram-negative bacteria (E. coli, H. influenzae, and H. pylori) and in three Archaea (M. thermoautotrophicum, M. jannaschii, and A. fulgidus) has shown that, for nearly all cases, there is experimental evidence for direct physical interaction between these proteins (see also ref. 31). As mentioned previously, this observation has implications for the study of horizontal gene transfer. Synteny between phylogenetically distant species of genes for proteins that do not show physical interaction indicates recent horizontal gene transfer events.

FIG. 3. Conservation of the order of genes within the genome. Shown is the number of genes that are orthologs in both genomes, and that have at least one neighboring gene that is the same ortholog in both genomes, divided by the total number of shared orthologs between the genomes. The x axis shows the divergence of the genomes measured in amino acid substitutions per position. The figure clearly indicates the rapid differentiation of gene order in evolution. Gene order between genomes is less conserved than the fraction of shared orthologs (compare with Fig. 2).
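The gene order measure in the legend to Fig. 3 can be sketched as follows. The code assumes that each genome is given as an ordered list of gene identifiers on a circular chromosome, that "neighboring" means directly adjacent in that list, and that orthologs are one-to-one; these are assumptions of the illustration, and the function name is hypothetical.

```python
def gene_order_conservation(order_a, order_b, orthologs):
    """Fraction of shared orthologs with at least one conserved neighbor.

    order_a, order_b: gene identifiers in chromosomal order (treated as circular).
    orthologs: dict mapping a gene of genome A to its ortholog in genome B.
    """
    if not orthologs:
        return 0.0
    pos_a = {gene: i for i, gene in enumerate(order_a)}
    pos_b = {gene: i for i, gene in enumerate(order_b)}

    def neighbors(order, pos, gene):
        i, n = pos[gene], len(order)
        return {order[(i - 1) % n], order[(i + 1) % n]}

    conserved = 0
    for gene_a, gene_b in orthologs.items():
        # orthologs (in B) of the genes flanking gene_a in genome A
        mapped = {orthologs[x] for x in neighbors(order_a, pos_a, gene_a)
                  if x in orthologs}
        # conserved if one of them also flanks gene_b in genome B
        if mapped & neighbors(order_b, pos_b, gene_b):
            conserved += 1
    return conserved / len(orthologs)
```

Gene orientation and the distance between reading frames are ignored here; a stricter measure could require both.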

Gene order and operons. Given the widely accepted concept of the operon, it is perhaps surprising that there is so little conservation of gene order. Why the gene order that is conserved only concerns proteins that show physical interaction might be explained by Fisher’s model of gene clustering (35). Fisher argued that the linkage between genes of proteins that function well together will tend to increase, to prevent the separation of a co-adapted pair of alleles by recombination.

It is clear that operons do not consist only of genes for proteins that show physical interaction (reviewed in ref. 36). However, what is conserved of operons over large time scales seems indeed to concur with Fisher’s hypothesis. A theory that explains the rearrangement of operons has to include an explanation for the existence of operons. The overall rearrangement of operons does not support any theory that is based on functional relationships of the proteins coded by the genes in the operon, unless one specifically can show that functional relationships of the genes change over the time scales on which we observe the rearrangement of operons. The recently proposed theory of “selfish operons” holds that operons exist because they increase the probability that genes that function together are transferred together in horizontal gene transfer (36). This model was based on the observation that operon structure is conserved between E. coli and Salmonella typhimurium. The model applies only to “nonessential” genes, genes that are relatively dispensable, which can be lost and then reintroduced into the genome through horizontal operon transfer. It does not, for example, apply to the ribosomal genes that are strongly clustered, are essential, and for which we have no evidence for horizontal gene transfer. It does, however, apply to pathogenicity islands and pathogenicity islets, clusters of genes that play a role in pathogenicity, and do indeed show evidence for horizontal gene transfer (37).

Regulatory Elements. With the determination of ortholo gous genes and conservation of gene order one can begin to determine whetherintergenic regions are conserved. The degree of conservation of intergenic regions is remarkably low and is diverging much faster than the geneorder (Y.DiazLazcoz, M.A.H. and P.B., unpublished results). The pattern in Fig. 4 can be regarded as an exception, demonstrating that at leastin some cases gene regulation is preserved. At the 5� end of the ribosomal genes rpl11 and rpll in E.coli lies an RNA secondary structurepotentially involved in the regulation of expression of the rpl11 operon (38). The structure is conserved

FIG. 4. Conservation of an RNA secondary structure at the 5� end of rpl11 operon in Bacterial genomes. The order of theribosomal protein genes rpl11 and rpll is conserved in all of the Bacteria analyzed. The gene nusG is a transcriptionantitermination factor. Amif is an oligopeptide transport ATP-binding protein, and deoD codes for a purine-nucleosidephosphorylase. The number between the first and second gene indicates the length of the intergenic region. Surprisingly, thesecondary structure is absent from H.pylori, even though it shares the presence of nusG 5� of rpl11 with E.coli, whereasH.influenzae lacks NusG at this position. Notice furthermore that the element has been deleted in H.pylori rather than lostbecause of point mutations, as there is no space left between nusG and rpl11 in H.pylori. The element is also present inM.pneumoniae, but is absent from the Archaea. The element is part of the 5� leader of the L11 mRNA sequence and is likelyto function in the autoregulation of the rpl11 operon (ref. 38 and Y.Diaz-Lazcoz, M.A.H. and P.B., unpublished data).

MEASURING GENOME EVOLUTION 5853

Abou

t thi

s PD

F fil

e: T

his

new

dig

ital r

epre

sent

atio

n of

the

orig

inal

wor

k ha

s be

en re

com

pose

d fro

m X

ML

files

cre

ated

from

the

orig

inal

pap

er b

ook,

not

from

the

orig

inal

type

setti

ng fi

les.

Pag

e br

eaks

are

true

to th

e or

igin

al; l

ine

leng

ths,

wor

d br

eaks

, hea

ding

sty

les,

and

oth

er ty

pese

tting

-spe

cific

form

attin

g, h

owev

er, c

anno

t be

reta

ined

, and

som

e ty

pogr

aphi

c er

rors

may

hav

e be

en a

ccid

enta

lly in

serte

d. P

leas

e us

e th

e pr

int v

ersi

on o

f thi

s pu

blic

atio

n as

the

auth

orita

tive

vers

ion

for a

ttrib

utio

n.

Page 15: (NAS Colloquium) Computational Biomolecular Science

in all Bacterial genomes analyzed in this paper, with the notable exception of H.pylori.

Co-Occurrence of Genes. Some genomes are more organized than others. If neighboring genes tend to function together in one genome, as they do in the case of operons, then they should both occur in another genome, even if they are not neighbors or part of the same operon. We show (Fig. 5A) that this is indeed the case. If gene A has a neighboring gene B, then if the ortholog of B (B′) occurs in another genome, the probability that the ortholog of A (A′) occurs in the other genome is increased (compare Fig. 2). In other words, orthologs shared between two genomes tend to be clustered in at least one of the genomes. Part of the results of Fig. 5A are caused by genes that occur as neighbors in both of the genomes compared. The analysis was therefore repeated to include only genes that are separated in one genome (X) but neighbors in another genome (Y). The fraction of genes that are neighbors in Y was compared with the expected fraction, given a model of random shuffling of genes (see Fig. 5B for methods). Results show that genes from a genome Y that have an ortholog in genome X tend to cluster in Y. The trend is present in all genomes except M.genitalium, and is particularly pronounced in the genomes of E.coli and B.subtilis. This surprising result suggests that most genomes are organized, yet some genomes are more organized than others. We assume that the genes that occur in one genome and are neighbors in another genome are in some way or another related in function. One explanation for the high degree of clustering in E.coli and B.subtilis is that they consist to a large fraction of recent horizontal gene transfers, which could increase the prevalence of polycistronic operons in their genomes.
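The comparison behind Fig. 5A can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the toy gene identifiers, and the treatment of the genome as a linear (rather than circular) gene list are all assumptions made for illustration. It contrasts the baseline probability that a gene has an ortholog in a second genome with the same probability conditioned on a neighboring gene having an ortholog there.

```python
def ortholog_probabilities(gene_order, has_ortholog):
    """Return (baseline, conditional) ortholog probabilities for one genome.

    gene_order   -- gene identifiers in chromosomal order (treated as linear)
    has_ortholog -- set of genes that have an ortholog in the second genome
    """
    n = len(gene_order)
    if n == 0:
        raise ValueError("gene_order must not be empty")

    # Baseline: fraction of all genes with an ortholog in the other genome.
    baseline = sum(g in has_ortholog for g in gene_order) / n

    # Conditional: over all ordered (gene, neighbor) pairs in which the
    # neighbor has an ortholog, how often does the gene itself have one?
    pairs = 0
    hits = 0
    for i, gene in enumerate(gene_order):
        for j in (i - 1, i + 1):
            if 0 <= j < n and gene_order[j] in has_ortholog:
                pairs += 1
                hits += gene in has_ortholog
    conditional = hits / pairs if pairs else float("nan")
    return baseline, conditional


# Hypothetical toy genome: shared genes form one block (g1-g3) plus g6.
order = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
shared = {"g1", "g2", "g3", "g6"}
baseline, conditional = ortholog_probabilities(order, shared)
print(f"baseline={baseline:.2f} conditional={conditional:.2f}")
```

When shared genes are clustered, the conditional value exceeds the baseline; for randomly scattered genes the two values should be close.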

Co-occurrence of genes and the conservation of pathways. Instead of analyzing spatial association of orthologs, one can analyze whether orthologs show “genome association”: i.e., they either occur together in a genome or are both absent from a genome. Such an analysis could, in principle, be used to reconstruct which genes are functionally related. The fact that orthologs that both occur in two genomes have a relatively high probability of spatial association in one of the genomes (Fig. 5A), even if they are separated in the other genome (Fig. 5B), in itself points to the usefulness of this idea. By analogy to approaches using the covariation of the nucleotide content of positions in RNA (39) to predict which positions interact with each other, one can use the covariation in the occurrence of proteins to create a model of which proteins depend on each other for their function. Such information could be used to reconstruct metabolic pathways or signaling pathways. The important assumption is that the structure of the pathway was constant throughout evolution. Nonorthologous gene displacement, where a gene assumes the functions of another in a pathway, suggests that pathways are more conserved than the presence of orthologous genes. Our observation of the co-occurrence of the genes dnaJ and dnaK in a small set of orthologs that are shared by M.genitalium and M.thermoautotrophicum, but not by M.jannaschii (see legend of Fig. 2), shows that the correlation of functionally related genes is present in phylogenetically distant species.
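The idea of “genome association” can be sketched as a comparison of presence/absence profiles across genomes. The Python fragment below is only an illustration under stated assumptions: the function name and the profile encoding (1 = ortholog present, 0 = absent) are invented here, and "geneX" is a hypothetical gene. Only the dnaJ/dnaK pattern over the three genomes listed is taken from the text.

```python
def genome_association(profile_a, profile_b):
    """Fraction of genomes in which two genes are both present or both absent."""
    if len(profile_a) != len(profile_b) or not profile_a:
        raise ValueError("profiles must be non-empty and of equal length")
    agreements = sum(bool(a) == bool(b) for a, b in zip(profile_a, profile_b))
    return agreements / len(profile_a)


# Columns: M.genitalium, M.thermoautotrophicum, M.jannaschii
profiles = {
    "dnaJ": (1, 1, 0),   # pattern described in the text
    "dnaK": (1, 1, 0),   # pattern described in the text
    "geneX": (1, 0, 1),  # hypothetical gene, for contrast only
}

for other in ("dnaK", "geneX"):
    score = genome_association(profiles["dnaJ"], profiles[other])
    print(f"dnaJ vs {other}: {score:.2f}")
```

With only a handful of genomes such a score is not statistically meaningful, and it is not corrected for the baseline probability of sharing an ortholog, which depends on phylogenetic distance.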

The existence of associated genes and the conservation of this association are important parameters in determining the degree of epistasis of genome evolution and determine the shape of the “adaptive landscape” (40) in which genome evolution operates. For an analysis of covariation in the occurrence of genes to be statistically meaningful, more genomes than the nine that were analyzed here are required. Furthermore, one needs to correct for the “baseline” probability that a gene from one genome has an ortholog in another genome, which depends on the phylogenetic distance between the genomes (Fig. 2).

Comparing Rates of Genome Evolution

We have studied several indicators of genome evolution and followed their conservation over time (Fig. 6). The resulting calibration curves not only quantify the divergence of these indicators, but also have practical value as they show what information can be extracted from new microbial genomes

FIG. 5. (A) The probability that a gene in genome A has an ortholog in another genome B if a neighboring gene in A has an ortholog in genome B. The probabilities clearly increase, as compared with the average probability of having an ortholog in another genome (compare Fig. 2). (B) The relative degree of clustering of genes in one genome (A) that have an ortholog in another genome (B). The analysis includes only genes that are clustered (“neighbors”) in genome A, but not in B (and vice versa). Shown is the number of genes in A that have an ortholog in B and have at least one neighboring gene that also has an ortholog in B, divided by the expected number. The expected number of genes that are neighbors in a genome, given a random distribution, is calculated as follows: Given X genes that are randomly distributed over a genome with Y loci, the probability that a gene from X has no neighboring genes from X (it lies isolated) is the probability that it has neither a left neighbor nor a right neighbor from X: P0 = [(Y − X)/(Y − 1)] × [(Y − X − 1)/(Y − 2)]. The expected fraction of genes from X with at least one neighbor from X is P1,2 = 1 − P0. The fraction of genes in genome A with at least one neighbor that also has an ortholog in genome B is thus divided by P1,2 to get the relative clustering of the genes in genome A. The relative clustering is averaged over the genome comparisons of one genome versus the eight other genomes. The names of the species have been abbreviated to the first letters of their genus and species name. All genomes except M.genitalium show more clustering of genes than expected. Given its small size, M.genitalium has relatively little room to cluster the genes that have an ortholog in another genome above the expected level of clustering: i.e., most of the genes that have an ortholog in another genome are expected to be neighbors in M.genitalium. The correlation with genome size is not perfect, however. For example, Synechocystis, which has a relatively large genome, shows relatively little genome organization.
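The expectation used in Fig. 5B can be written out as a short Python sketch. The function names and the example numbers are assumptions for illustration; only the formula for P0 and the definition P1,2 = 1 − P0 come from the legend.

```python
def isolated_probability(x, y):
    """P0: chance that one of x randomly placed genes has no neighbor from
    the same set, in a genome with y loci (formula from the Fig. 5 legend)."""
    if y <= 2 or not 1 <= x <= y:
        raise ValueError("need y > 2 and 1 <= x <= y")
    return ((y - x) / (y - 1)) * ((y - x - 1) / (y - 2))


def expected_neighbor_fraction(x, y):
    """P1,2 = 1 - P0: expected fraction with at least one neighbor from the set."""
    return 1.0 - isolated_probability(x, y)


def relative_clustering(observed_fraction, x, y):
    """Observed fraction of clustered genes divided by the random expectation."""
    return observed_fraction / expected_neighbor_fraction(x, y)


# Hypothetical numbers: 100 shared genes in a genome of 1,000 loci, of which
# half are observed to have at least one shared neighbor.
print(round(expected_neighbor_fraction(100, 1000), 4))
print(round(relative_clustering(0.5, 100, 1000), 2))
```

A ratio above 1 indicates more clustering than random placement would give. When x approaches y, as in a small genome where most genes are shared, P1,2 approaches 1 and the ratio cannot rise much above 1, which matches the remark about M.genitalium.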


given their phylogenetic position. The calibration curves will require refinement when more data become available, but they already provide levels of expectation, deviations from which are of potential interest (e.g., synteny of genes in distant species that cannot be found in other species is an indicator of horizontal gene transfer; Fig. 1). In particular, more relatively closely related genomes that have protein identity levels higher than 50% will be essential to provide more precise estimates of the rates at which genome organization and gene regulation evolve. The calibration curves also should influence the analysis strategy; e.g., if a closely related genome is available, orthologs are relatively easy to discriminate from other members of multigene families. By analogy to profile search techniques, it is helpful to include not too closely related but also not too divergent species in the first round of the analysis, where the closeness of the relationship depends on the features one wants to identify. For example, to study the evolution of gene regulation one needs to compare more closely related species than to study the evolution of gene order. To study the evolution of gene content, one needs to compare even less related species, whereas the study of the evolution of metabolism requires the comparison of the most distantly related species.

FIG. 6. Relative rates of genome evolution. The curves were fitted from the fraction of shared orthologs (Fig. 2) and the conservation of the order of genes (Fig. 3); the curve that shows the relationship between protein identity and the number of amino acid substitutions per position according to Grishin’s equation (Fig. 2) was added for comparison. Intergenic regions are even less conserved than the order of genes. Nonorthologous gene displacement indicates that metabolism is more conserved than the fraction of shared orthologous genes.

Current analysis of genomes is driven by the prediction of functional features at the molecular and cellular level; it is based on the presence and absence of certain genes in the context of phenotypic expectations. Expectations about horizontal gene transfers and the loss, the acquisition, or displacement of entire pathways (the entire metabolism in the case of the Archaea) and the study of the correlations of gene occurrence will enable us to identify functional cascades in greater detail. Identification of weak regulatory signals in the genomes requires a sensitive comparative analysis. The puzzling evolution of nonconserved but ever-present operons is only one indication that many genetic and evolutionary mechanisms are yet to be detected and quantified.

We are very grateful to Chris Ponting, Berend Snel, Yolande Diaz-Lazcoz, Thomas Dandekar, and Joerg Schultz for providing data and useful discussions. The work was supported by the Bundesministerium für Bildung, Wissenschaft, Forschung und Technologie (Germany) and Deutsche Forschungsgemeinschaft.

1. Blattner, F.E., III, Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J.D., Rode, C.K. & Mayhew, G.F. (1997) Science 277, 1455–1462.
2. Karlin, S., Mrazek, J. & Campbell, A. (1997) J. Bacteriol. 179, 3899–3913.
3. Hood, D.W., Deadman, M.E., Jennings, M.P., Bisercic, M., Fleischmann, R.D., Venter, J.C. & Moxon, E.R. (1996) Proc. Natl. Acad. Sci. USA 93, 11121–11125.
4. Huynen, M.A. & van Nimwegen, E. (1998) Mol. Biol. Evol., in press.
5. Gelfand, M.S. & Koonin, E.V. (1997) Nucleic Acids Res. 25, 2430–2439.
6. Fleischmann, R., Adams, M., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A. & Merrick, J.M. (1995) Science 269, 496–512.
7. Fraser, C.M., White, O., Casjens, S., Huang, W.M., Sutton, G.G., Clayton, R., Lathigra, R., Ketchum, K.A., Dodson, R. & Hickey, E.K. (1995) Science 270, 397–403.
8. Kaneko, T., Sato, S., Kotani, H., Tanaka, A., Asamizu, E., Nakamura, Y., Miyajima, N., Hirosawa, M., Sugiura, M. & Sasamoto, S. (1996) DNA Res. 3, 109–136.
9. Bult, C.J., White, O., Olsen, G.J., Zhou, L., Fleischmann, R.D., Sutton, G.G., Blake, J.A., FitzGerald, L.M., Clayton, R.A. & Gocayne, J.D. (1996) Science 273, 1058–1072.
10. Himmelreich, R., Hilbert, H., Plagens, H., Pirkl, E., Li, B. & Herrmann, R. (1996) Nucleic Acids Res. 24, 4420–4449.
11. Smith, D.R., Doucette-Stamm, L.A., Deloughery, C., Lee, H., Dubois, J., Aldredge, T., Bashirzadeh, R., Blakely, D., Cook, R. & Gilbert, K. (1997) J. Bacteriol. 179, 7135–7155.
12. Tomb, J.-F., White, O., Kerlavage, A.R., Clayton, R.A., Sutton, G.G., Fleischmann, R.D., Ketchum, K.A., Klenk, H.P., Gill, S. & Dougherty, B.A. (1997) Nature (London) 388, 539–547.
13. Kunst, F., Ogasawara, N., Moszer, I., Albertini, A.M., Alloni, G., Azevedo, V., Bertero, M.G., Bessieres, P., Bolotin, A. & Borchert, S. (1997) Nature (London) 390, 249–256.
14. Fitch, W.M. (1970) Syst. Zool. 19, 99–110.
15. Tatusov, R.L., Koonin, E.V. & Lipman, D.J. (1997) Science 278, 631–637.
16. Schmid, K. & Tautz, D. (1997) Proc. Natl. Acad. Sci. USA 94, 9746–9750.
17. Koonin, E.V., Mushegian, A.R. & Bork, P. (1996) Trends Genet. 12, 334–336.
18. Page, R.D.M. (1994) Syst. Biol. 43, 58–77.
19. Yuan, Y.P., Eulenstein, O., Vingron, M. & Bork, P. (1998) Bioinformatics, in press.
20. Medigue, C., Rouxel, Y., Vigier, P., Henaut, A. & Danchin, A. (1991) J. Mol. Biol. 222, 851–856.
21. Lawrence, J.G. & Ochman, H. (1997) J. Mol. Evol. 44, 383–397.
22. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Nucleic Acids Res. 25, 3389–3402.
23. Feng, D.F., Cho, G. & Doolittle, R.F. (1997) Proc. Natl. Acad. Sci. USA 94, 13028–13033.
24. Doolittle, R.F., Feng, D.F., Tsang, S., Cho, G. & Little, E. (1996) Science 271, 470–477.
25. Grishin, N.V. (1995) J. Mol. Evol. 41, 675–679.
26. Koonin, E.V., Mushegian, A.R., Galperin, M.Y. & Walker, D.R. (1997) Mol. Microbiol. 25, 619–637.
27. Feng, D.-F. & Doolittle, R.F. (1997) J. Mol. Evol. 44, 361–370.
28. Huynen, M., Diaz-Lazcoz, Y. & Bork, P. (1997) Trends Genet. 13, 389–390.
29. Tatusov, R.L., Mushegian, A.R., Bork, P., Brown, N.P., Hayes, W.S., Borodovsky, M., Rudd, K. & Koonin, E.V. (1996) Curr. Biol. 6, 279–291.
30. Tamames, J., Casari, G., Ouzounis, C. & Valencia, A. (1997) J. Mol. Evol. 44, 66–73.
31. Mushegian, A.R. & Koonin, E.V. (1996) Trends Genet. 12, 289–290.
32. Watanabe, H., Mori, H., Itoh, T. & Gojobori, T. (1997) J. Mol. Evol. 44, 57–64.
33. Kolsto, A.B. (1997) Mol. Microbiol. 24, 241–248.
34. Siefert, J.L., Martijn, K.A., Abdi, F., Widger, W.R. & Fox, G.E. (1997) J. Mol. Evol. 45, 467–472.
35. Fisher, R.A. (1930) The Genetical Theory of Natural Selection (Oxford Univ. Press, Oxford).
36. Lawrence, J.G. & Roth, J.R. (1996) Genetics 143, 1843–1860.
37. Barinaga, M. (1996) Science 272, 1261–1263.


38. Branlant, C., Krol, A., Machatt, A. & Ebel, J.P. (1981) Nucleic Acids Res. 9, 293–307.
39. Gutell, R.R., Power, A., Hertz, G., Putz, E. & Stormo, G. (1993) Nucleic Acids Res. 20, 5785–5795.
40. Wright, S. (1932) in Proceedings of the Sixth International Congress on Genetics, ed. Jones, D.F. (Brooklyn Botanical Garden, New York), Vol. 1, pp. 356–366.
41. Yeh, K.C., Wu, S.H., Murphy, J.T. & Lagarias, J.C. (1997) Science 277, 1505–1508.
42. Klenk, H.P., Clayton, R.A., Tomb, J.F., White, O., Nelson, K.E., Ketchum, K.A., Dodson, R.J., Gwinn, M., Hickey, E.K. & Peterson, J.D. (1997) Nature (London) 390, 364–370.
43. Aravind, L. & Ponting, C.P. (1997) Trends Biochem. Sci. 22, 458–45.
44. Zhulin, I.B., Taylor, B.L. & Dixon, R. (1997) Trends Biochem. Sci. 22, 331–333.
45. Ponting, C.P. & Aravind, L. (1997) Curr. Biol. 7, R674–R677.
46. Schultz, J., Milpetz, F., Bork, P. & Ponting, C.P. (1998) Proc. Natl. Acad. Sci. USA 95, 5857–5864.
47. Smith, T. & Waterman, M.S. (1981) J. Mol. Biol. 147, 195–197.


Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5857–5864, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

SMART, a simple modular architecture research tool: Identification of signaling domains

(computer analysis/diacylglycerol kinases/DEATH domain/disease genes/automatic sequence annotation)

JORG SCHULTZ*†, FRANK MILPETZ*†, PEER BORK*†‡, AND CHRIS P. PONTING§

*European Molecular Biology Laboratory, Meyerhofstr. 1, 69012 Heidelberg, Germany; †Max-Delbrück-Center for Molecular Medicine, Robert-Rössle-Str. 10, 13122 Berlin, Germany; and §University of Oxford, The Old Observatory, South Parks Road, Oxford OX1 3RR, United Kingdom

ABSTRACT Accurate multiple alignments of 86 domains that occur in signaling proteins have been constructed and used to provide a Web-based tool (SMART: simple modular architecture research tool) that allows rapid identification and annotation of signaling domain sequences. The majority of signaling proteins are multidomain in character with a considerable variety of domain combinations known. Comparison with established databases showed that 25% of our domain set could not be deduced from SwissProt and 41% could not be annotated by Pfam. SMART is able to determine the modular architectures of single sequences or genomes; application to the entire yeast genome revealed that at least 6.7% of its genes contain one or more signaling domains, approximately 350 more than previously annotated. The process of constructing SMART predicted (i) novel domain homologues in unexpected locations such as band 4.1-homologous domains in focal adhesion kinases; (ii) previously unknown domain families, including a citron-homology domain; (iii) putative functions of domain families after identification of additional family members, for example, a ubiquitin-binding role for ubiquitin-associated domains (UBA); (iv) cellular roles for proteins, such as predicted DEATH domains in netrin receptors further implicating these molecules in axonal guidance; (v) signaling domains in known disease genes such as SPRY domains in both marenostrin/pyrin and Midline 1; (vi) domains in unexpected phylogenetic contexts such as diacylglycerol kinase homologues in yeast and bacteria; and (vii) likely protein misclassifications exemplified by a predicted pleckstrin homology domain in a Candida albicans protein, previously described as an integrin.

The functions of only a small fraction of known proteins have been determined by experiment. As a result, the use of computational sequence analysis tools is essential for the annotation of novel genes or genomes, and the prediction of protein structure and function. Currently, the most informative of these techniques are database search tools such as BLAST (1) and FASTA (2) that identify similar sequences with associated statistical significance estimates. Current limitations of the use of these programs concern less the aspects of search sensitivity and more the functional annotation of identified homologues. Annotation terms such as “hypothetical protein” or “suppressor of spt3 mutations” are helpful neither to the user’s prediction of structure and function, nor to computational procedures attempting to automatically predict function from sequence.

An additional aspect concerns the annotation of complete genomes. Existing eubacterial and archaeal genomes have been analyzed with little regard to the existence of domains, because multidomain proteins in these organisms are relatively few in number. The domain as a functional and structural unit in eukaryotic proteins, however, is pre-eminent. For example, the majority of human extracellular proteins are multidomain in character (for reviews see refs. 3 and 4) and many complex eukaryotic signaling networks involve proteins containing multiple domains with catalytic, adaptor, effector, and/or stimulator functions (5). Several dozen such “signaling domains” are known (for a review see ref. 6). The importance of modular proteins in disease is emphasized by the recent observation that the majority of positionally cloned human disease genes encode multidomain proteins, many of which are, in fact, signaling proteins (7). On the other hand, the view of the domain as a fundamental unit of structure and function is not universally accepted: not a single noncatalytic signaling domain is annotated in the widely distributed Saccharomyces cerevisiae genome directory that catalogs the genes of this complete genome (8).

Thus, there is a need to coordinate knowledge stored in the literature with that stored in sequence databases to facilitate the research of those in the scientific community who require the annotation of genes and genomes. It is our goal to provide an extensively annotated collection of cytoplasmic signaling domain alignments that enables rapid and sensitive detection of additional domain homologues as a Web-based tool.

Because it is difficult to distinguish those domains that perform cytoplasmic signaling roles from those that primarily function in transport, protein sorting, or cell cycle regulation, and for reasons of brevity, we shall discuss those domains that fall under two categories. (i) Cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase, or phospholipase enzymatic activities or those that stimulate GTPase-activation or guanine nucleotide exchange; these activities are known to mediate transduction of an extracellular signal toward the nucleus resulting in the initiation of a cellular response. (ii) Cytoplasmic domains that occur in at least two proteins with different domain organizations, of which one also contains a domain that is categorized under (i) (for a complete list of such domains see Table 1).

Domain collections that cover a wide spectrum of cellular functions do exist in the forms of motif, alignment block, or profile databases such as PROSITE (9), BLOCKS (10), PRINTS

‡To whom reprint requests should be addressed.
© 1998 by The National Academy of Sciences 0027–8424/98/955857–8$2.00/0
PNAS is available online at http://www.pnas.org.
Abbreviations: SMART, simple modular architecture research tool; DAG, diacylglycerol; PH, pleckstrin homology; PTB, phosphotyrosine binding; SH, Src homology; rcm, rostral cerebellar malformation gene product; HMM, Hidden Markov model.


(11), or Pfam (12) and provide a guide for the annotation of new proteins. However, there is a necessary trade-off in these collections between exhaustive coverage of domains and optimal sensitivity, specificity, and annotation quality. We have chosen to initiate the collection of gapped alignments of signaling domains because these are imperfectly covered in large collections and often include homologues with extremely divergent sequences. This collection is designed to be updated easily and is provided with a Worldwide Web interface enabling automatic sequence annotation with evolutionary, functional, and structural information. The resulting SMART procedure, a simple modular architecture research tool, offers a high level of sensitivity and specificity coupled with ease of use.

Number of domains annotated

                                           Yeast genome  SwissProt: S.cerevisiae SwissProt: Homo sapiens
Domain   Full name or function             SMART         SMART   SPa     Pfamb   SMART   SPa     Pfamb
RAS-like small GTPases
         RAB                               9             9       9       –       19      19      –
         RAN                               2             2       2       –       1       1       –
         RAS                               3             3       3       –       11      11      –
         RHO                               6             6       6       –       13      13      –
         SAR                               1             1       1       –       0       0       –
         Others                            11            10      7       24(all) 5       3       48(all)
RanBD    Ran-binding domain                3             3       1       –       5       5       –
RasGAP   GAP for Ras-like GTPases          4             3       3       –       3       3       –
RasGEF   GEF for Ras-like GTPases          5             5       4       –       0       0       –
RasGEFN  In some RasGEFs                   4             4       0       –       0       –       0
RGS      Regulator of G-protein signaling  3             1       1       –       11      6       –
RhoGAP   GAP for Rho-like GTPases          9             6       3       –       8       4       –
RhoGEF   GEF for Rho-like GTPases          4             4       3       –       7       6       –
SAM      Sterile alpha motif               6             3       0       –       11      1       –
SH2      Src homology 2                    1             1       1       1       51      51      51
SH3      Src homology 3                    28            25      25      25      65      63      57
SPRY     In splA and Ryanodine receptors   3             3       0       –       7       0       –
TBC      In Tre-2, BUB2p, and Cdc16p       10            7       0       –       1       0       –
TPR      Tetratricopeptide repeat          72            69      39      16      40      0       7
UBA      Ubiquitin-associated domain       10            8       0       –       12      0       –
UBCc     Ubiquitin-conjugating enzyme      13            13      13      13      12      12      12
UBX      Ubiquitin-related domain          8             4       0       –       1       0       –
VHS      In VPS-27, Hrs and STAM           4             3       0       –       0       0       –
VPS9     In VPS-9-like proteins            2             1       1       –       1       0       –
WH1      WASp homology domain 1            1             1       0       –       2       0       –
WW       Conserved WW motif                9             8       7       7       9       9       9
ZU5      In ZO-1 and UNC-5                 0             0       0       –       4       0       –
ZZ       Dystrophin-like zinc finger       2             2       0       –       4       1       –
Totals   86                                622           544     383     290     1,137   886     704

Numbers of domains detected by SMART in the yeast genome, and in the yeast and human fractions of the SwissProt database, are compared with the numbers of domains derived from HMMer analysis and Pfam HMMs scanned against these database fractions, and the numbers of annotations in SwissProt. Many of these domains are reviewed elsewhere (5, 6), and additional references may be found via the SMART Web site (http://www.bork.embl-heidelberg.de/Modules/sinput.shtml).
a Annotations in SwissProt.
b Annotations using the hmmfs program of the HMMer package with Pfam-derived HMMs (“–” indicates where no Pfam HMM was available).

METHODS

Construction of Multiple Sequence Alignments and Choice of the Search Program. Of the 86 domain families, multiple alignments of 83 had been published previously (for references, see the annotation that accompanies the SMART Web site). These alignments were refined according to constraints described elsewhere (13) that included minimization of insertions/deletions in conserved alignment blocks, optimization of amino acid property conservation within these blocks, and closing of unnecessary gaps within insertion/deletion regions. Gapped alignments were constructed in preference to ungapped ones to allow the prediction of domain limits and as a result of their greater information content. Care was taken to build alignments that encompassed all secondary structures of domains whose tertiary structures are known. For remaining domains, investigations of sequence similarities beyond previously published domain limits were undertaken; this resulted in N-terminal extension of the previously described PX domain alignment by a single predicted β-strand, and identification of a conserved N-terminal motif in guanine nucleotide exchange factors for Ras-like GTPases. Prediction of domain limits also was aided by close proximities of domains to others with well-known limits, and to bona fide N- and C-terminal residues.

Alignments were updated to include additional predicted homologues. Because no single database searching algorithm currently is able to detect all putative homologues that are detectable by the combination of all searching methods (13), three iterative methods (HMMer, MoST, and WiseTools; refs. 14–16) were used to detect candidate homologues (HMMer and MoST thresholds: 25 bits and E < 0.01). Before their addition to multiple alignments, candidate homologue sequences were subjected to analyses using BLAST (1), Ssearch (2), and/or MACAW (17) to estimate the statistical significance of sequence similarities (PSIBLAST, BLAST, and Ssearch thresholds: E < 0.01). Those sequences that were considered homologues based on statistical significance estimates, and to a lesser extent on experimentally determined biological context, were used to construct alignments, profiles, and Hidden Markov models (HMMs).


As described above, care was taken to establish alignments representing entire structural domains. However, the termini were found to be the least conserved regions of alignments, and several profiles represent incomplete portions of domains. In two cases, phospholipase D and protein tyrosine phosphatase homologues, only short conserved “motifs” (conservation patterns representing an incomplete domain structure) are detectable across the domain family (18–20). For these examples, profiles/HMMs were calculated only from these short motifs to maximize the amino acid similarity signal-to-noise ratio (13).

Assignment and Calibration of Thresholds for Automatic Runs. Score thresholds are required to provide automatic assignment of true positives and true negatives. No current method, including those that provide E- or p-value representations of score significances, can be relied on to provide suitable values for these thresholds in all cases. As a result, manual intervention was necessary to estimate threshold values on the basis of published homology arguments and, for example, on the results of individual BLAST or Ssearch queries. SWise (16) was chosen as an established algorithm able to provide similarity scores for query sequences when compared with the alignment database; however, the SMART database method can be applied to any algorithm that provides similarity scores.

For each alignment an SWise (16) threshold (Tp) was established that represents the lowest score allowable for sequences to be considered as “true positives” or homologues. As such, this single-step procedure detects many true positives but fails to detect a few previously proposed homologues (“false negatives”) that score at levels just below that of the top “true negative.” A proportion of false negatives could not be assigned as homologues without further statistical evidence. However, consideration that domains such as ARM, C2, CBS, IQ, LIM, PDZ, SH2, SH3, and WW (Table 1) frequently are found as repeats enabled several false negatives to be detected by using estimations of an additional threshold value, Tr (Tr < Tp). Tr represents a repeats threshold for a protein where at least one of the repeats scores above Tp (Fig. 1). Two or more repeats scoring above the average of Tp and Tr [(Tp + Tr)/2] also were considered false negatives and thus accepted as homologues. Some domains that appear to be found only as tandem repeats (for example, EF-hands, tetratricopeptide repeats, and armadillo repeats) are reported only if two or more copies are found that score above a low threshold Tr. To predict the subfamily of a particular domain (for example, whether a tyrosine or a serine/threonine kinase, or whether a tyrosine-specificity or a dual-specificity phosphatase) further thresholds Ts (Ts > Tp) also were estimated; no subfamily predictions are made for those domain homologues that score above Tp but below Ts.
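The threshold rules described above can be summarized in a small Python sketch. This is an interpretation of the prose, not SMART's actual code: the function names, the use of strict "greater than" comparisons, and the example scores are assumptions, and the special handling of domains found only as tandem repeats is left out.

```python
def accept_domain_hits(scores, tp, tr):
    """Return the scores of one domain profile in one protein that would be
    accepted as homologues under the Tp/Tr rules sketched in the text."""
    if not tr < tp:
        raise ValueError("expected Tr < Tp")
    if any(s > tp for s in scores):
        # One copy clears Tp, so further repeats only need to clear Tr.
        return [s for s in scores if s > tr]
    # No copy clears Tp: accept only if two or more clear the midpoint.
    midpoint = (tp + tr) / 2
    above_mid = [s for s in scores if s > midpoint]
    return above_mid if len(above_mid) >= 2 else []


def subfamily_predicted(score, ts):
    """A subfamily is predicted only for hits scoring above Ts (Ts > Tp)."""
    return score > ts


# Hypothetical scores and thresholds, chosen only to exercise each rule.
print(accept_domain_hits([30, 12], tp=25, tr=10))   # one strong hit rescues a weak repeat
print(accept_domain_hits([20, 19], tp=25, tr=10))   # two repeats above the midpoint
print(accept_domain_hits([20, 12], tp=25, tr=10))   # a single sub-threshold hit is rejected
```

Keeping the rules in one function makes the asymmetry explicit: a weak repeat is accepted only in the company of either a clear hit or a second moderately scoring copy.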

Subset alignments of a given domain family were constructed not only to improve the specificity of functional predictions, but also for divergent families for which a single descriptor (profile/HMM) was found to be unable to detect the entire set of known homologues [e.g., C2 and pleckstrin homology (PH) domains; refs. 21 and 22]. Construction of multiple profiles, each representing a different region of the domain phylogenetic tree, resulted in “overlapping” profiles that, when used in combination, found the maximal number of homologues. Sensitivity and specificity are maintained by using combinations of Ts and Tp. Overlapping hits from nonhomologous profiles, which can occur because of inserted domains (23), all are reported.
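One way to combine the output of several subfamily profiles is sketched below. The text states only that overlapping hits from nonhomologous profiles are all reported; keeping the best-scoring hit among overlapping hits from profiles of the same family is an assumption made for this illustration, and the `Hit` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    family: str    # domain family, e.g. "PH"
    profile: str   # subfamily profile that produced the hit (hypothetical name)
    start: int     # first residue of the match, inclusive
    end: int       # last residue of the match, inclusive
    score: float


def overlaps(a: Hit, b: Hit) -> bool:
    """True when the two residue ranges share at least one position."""
    return a.start <= b.end and b.start <= a.end


def combine_hits(hits: List[Hit]) -> List[Hit]:
    """Merge hits from several profiles into one annotation list.

    Assumed rule: among overlapping hits of the same family keep the best
    score; overlapping hits from different families are all reported.
    """
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        clash = any(k.family == hit.family and overlaps(k, hit) for k in kept)
        if not clash:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


hits = [
    Hit("PH", "PH_a", 10, 110, 25.0),
    Hit("PH", "PH_b", 15, 105, 31.0),
    Hit("SH3", "SH3", 40, 95, 40.0),   # inserted domain inside the PH region
]
print([h.profile for h in combine_hits(hits)])  # ['PH_b', 'SH3']
```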

Seeding and Updating Procedure. To reduce redundancy and subfamily bias within sequence families, seed alignments were calculated by using an iterative semiautomatic procedure. In a first step all database sequences considered homologous, given the threshold procedures described above, are subjected to a CLUSTALW phylogenetic tree construction (24). Only a single sequence from every branch of the tree that is shorter than a defined threshold (the default distance is 0.2, which corresponds approximately to 80% identity; ref. 24) is retained in the alignment. From this seed alignment, a profile is derived, leading to reiteration of the database search procedure until convergence. For example, four iterations were required to build a Src homology 2 (SH2) seed alignment containing 95 sequences, of a total of 548 SH2 domains identified in the translated EMBL sequence database.
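The redundancy-reduction step can be illustrated with a simplified stand-in. The procedure described above prunes a CLUSTALW tree; the sketch below instead uses a greedy filter on pairwise distances computed from prealigned sequences, which is an assumption made to keep the example self-contained. The function names and the toy alignment are hypothetical; only the 0.2 default comes from the text.

```python
from typing import Dict, List


def p_distance(a: str, b: str) -> float:
    """Fraction of differing residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # nothing comparable: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def select_seeds(aligned: Dict[str, str], threshold: float = 0.2) -> List[str]:
    """Greedily keep sequences that lie further than `threshold` from every seed.

    A stand-in for tree-based pruning, not the CLUSTALW procedure itself.
    """
    seeds: List[str] = []
    for name, seq in aligned.items():
        if all(p_distance(seq, aligned[s]) > threshold for s in seeds):
            seeds.append(name)
    return seeds


toy_alignment = {
    "seq_a": "ACDEFGHIKL",
    "seq_b": "ACDEFGHIKV",  # distance 0.1 from seq_a, so it is dropped
    "seq_c": "ACDEWWWWKL",  # distance 0.4 from seq_a, so it is kept
}
print(select_seeds(toy_alignment))  # ['seq_a', 'seq_c']
```

The same distance test could be reused when screening new database entries: a sequence further than the threshold from every current seed would be a candidate for a new branch.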

FIG. 1. Calibration of thresholds. Selection of thresholds from the distributions of SH3 domain scores. (Upper) A histogram of SWise scores for the best match (optimal alignment; in green) of proteins with an SH3 domain profile. (Lower) Similar histograms for the second- and third-best matches (suboptimal alignments; in light blue and dark blue, respectively). Optimal alignment scores less than threshold Tp are mostly derived from sequences considered unlikely to contain SH3 domain homologues. Threshold Tp was selected as the score of the lowest-scoring true positive. Domains that are repeated twice or more in the same protein and that each score above a lower threshold (Tr) are considered to be true positives.

With new sequences entering databases daily, seed alignments and derived profiles need to be updated accordingly. SMART incorporates a facility whereby daily database updates are screened for the presence of signaling domains. Those that represent a new branch of the domain family phylogenetic tree (i.e., with a distance of greater than 0.2) are recorded for inclusion in future SMART domain set updates. The alignments are accessible via the SMART Web server.

Implementation into a Web Server. SMART has been provided with a user interface (http://www.bork.embl-heidelberg.de/Modules/sinput.shtml) that allows rapid and automatic annotation of the signaling domain composition of any query protein sequence. A graphical display is provided showing domain positions within the query sequence. The SMART set of signaling domains is annotated extensively via hyperlinks to Medline and the Molecular Modeling Database via Entrez (25), thus providing easy access to information relating sequence, homology, structure, and function. As the set of signaling sequences is necessarily incomplete and as there may be other domains represented in the query sequence, direct access also is provided to Pfam (12), a domain database that includes a variety of different domain types, yet provides a lower representation of signaling domains and with lower sensitivity (see Discussion). Intrinsic features of the query such as coiled coil regions (26), low complexity regions (27), and transmembrane regions (28) also are displayed. Annotated or unannotated regions of the query sequence can be subjected individually to gapped BLAST searches (1); the reduced search space allows higher sensitivity in these searches.
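The gain from searching with a single region rather than the whole query can be seen from the Karlin-Altschul relation E = K·m·n·exp(−λS), in which the expected number of chance hits scales with the query length m and the database length n. The sketch below is illustrative only: the values of `lam` and `k`, the score, and the sequence lengths are placeholders chosen for the example, and edge-effect corrections applied by real search programs are ignored.

```python
import math


def expected_chance_hits(score: float, query_len: int, db_len: int,
                         lam: float = 0.267, k: float = 0.041) -> float:
    """Karlin-Altschul expectation E = K * m * n * exp(-lambda * S).

    `lam` and `k` are placeholder statistical parameters (assumptions).
    """
    return k * query_len * db_len * math.exp(-lam * score)


db_residues = 50_000_000  # hypothetical database size
full_query = expected_chance_hits(60, query_len=1500, db_len=db_residues)
one_region = expected_chance_hits(60, query_len=300, db_len=db_residues)

# The same raw score is more significant for the shorter query.
print(f"whole protein: E = {full_query:.3g}")
print(f"single region: E = {one_region:.3g}")
print(f"ratio = {full_query / one_region:.1f}")
```

For a fixed score the ratio of the two expectations equals the ratio of the query lengths (here 1,500/300 = 5), so a match that falls short of a significance cutoff with the full-length protein may pass it when only the unannotated region is submitted.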

Benchmarking Protocol. To assess the sensitivity and selectivity of SMART, results were compared with annotations held by SwissProt, because this represents the best-annotated protein sequence database extant (and includes all those annotations covered by the PROSITE database), as well as with the Pfam domain collection, because this represents the most comprehensive set of gapped alignments available (12). Our intention here was not to provide justifications for the inclusion or exclusion of particular sequences in domain alignments, but to compare literature information as represented by the SMART database with the same information as represented by the SwissProt and Pfam databases. All S. cerevisiae and human sequences were extracted from SwissProt and annotated by using the SMART protocol. Because these organisms are well-studied and their proteins relatively well-annotated, they represent a stringent test for annotation procedures. The SMART domain annotations were compared manually with those derived from HMMer (14) analysis and those contained in SwissProt (Table 1); the hmmfs program and a 25-bit threshold were used for the HMMer analysis. As SwissProt release 34 does not contain all yeast sequences, the complete set of S. cerevisiae ORFs also was subjected to SMART analysis (Table 1).

FIG. 2. Schematic representations, produced using SMART, of the domain architectures of proteins discussed in the text. See Table 1 for the identified domains; gray lines (no SMART match) might contain other known domains not included in SMART. Putative homologues were identified during SWise (16) searches and/or PSI-BLAST (1) searches (E<0.01). (a) Domain recognition: A novel PTB domain was identified in tensin, resulting in completion of its modular architecture assignment. A PSI-BLAST search with a previously predicted PTB domain in C. elegans F56D2.1 (53) yields the tensin PTB after four passes. Prediction of molecular function via domain hit: Identification of a domain homologous to band 4.1 protein in focal adhesion kinase (FAK) isoforms. FAKs are predicted to bind cytoplasmic portions of integrins in a similar manner to that of talin, another band 4.1 domain-containing protein. A PSI-BLAST search with a band 4.1-like domain (41_HUMAN, residues 206–401) revealed band 4.1-like domains in human, bovine, and Xenopus FAK isoforms by pass 3. (b) Detection of new domains because of search space reduction: Putative DEP domains in ROM1 and ROM2 were identified by using SWise (16) and HMMer (14), but could not be detected by using PSI-BLAST. Analysis of the regions surrounding identified domains revealed the presence of a novel domain in the C-terminal regions of ROM1 and ROM2 that occurs also in several Ste20-like protein kinases and mouse citron (CNH, citron homology). A gapped BLAST search of the region of citron C-terminal to its PH domain (CTRO_MOUSE, residues 1134–1457) reveals significant similarity with yeast ROM2 (E=1×10⁻⁵). (c) Functional predictions for an entire domain family: A region of p62 known to bind ubiquitin (40), and its homologous sequence in the Drosophila protein ref(2)P, scored as the highest putative true negatives in a SWise search. We predict ubiquitin-binding functions for UBA domains. PSI-BLAST searches were unable to corroborate this prediction. (d) Prediction of cellular functions: Although not indicated in the primary sources (43, 44), a DEATH domain was found in rcm and other UNC5 homologues, in agreement with a previous claim (41). At the molecular level, this domain in UNC5 is predicted to form a heterotypic dimer with a homologous domain in UNC44, implying a cellular role in axon guidance. A gapped BLAST search with the known DEATH domain of death-associated protein kinase (DAPK_HUMAN, residues 1304–1396) predicts a DEATH domain in rat UNC5H1 (E=9×10⁻³). (e) Signaling domains in “disease genes”: Pyrin or marenostrin, a protein that is mutated in patients with Mediterranean fever and is similar to butyrophilin, contains a SPRY domain. PSI-BLAST with the SPRY domain of human DDX1 (EMBL:X70649, residues 124–240) yields a butyrophilin homologue by pass 5 and pyrin/marenostrin (residues 663–759) by pass 7. (f) Homologues of domains involved in eukaryotic signaling may not be eukaryotic-specific: DAG kinases have been found previously in mammals, invertebrates, plants, and slime mold. However, it is apparent that DAG kinase homologues of unknown function are present in yeasts and in eubacteria (see Fig. 3). A gapped BLAST search with Bacillus subtilis bmrU (BMRU_BACSU) yields significant similarities with Arabidopsis thaliana DAG kinase (KDG1_ARATH; E=4×10⁻⁴) and a Schizosaccharomyces pombe ORF (SPAC4A8.07c; E=1×10⁻⁷). (g) Identification of potential misclassifications: A PH domain and the lack of an obvious transmembrane sequence indicate a cytoplasmic and signaling role for a protein (INT1_CANAL) previously thought to be a yeast integrin. A PSI-BLAST search with the N-terminal PH domain of pleckstrin yielded INT1_CANAL in pass 3.

RESULTS

Comparison with SwissProt and Pfam. Of all protein sequence databases, SwissProt is the most extensively annotated, making use of literature- and sequence-derived (9) data as source material. As a result the SwissProt database is a valuable resource for investigators searching for hints of the structure and function of their sequences of interest. Consequently, it is appropriate to compare SMART-derived annotations with those contained in SwissProt.

SMART detected 548 and 1,137 domains in the yeast and human subsets of SwissProt, respectively (Table 1). Of these, 165 and 251 domains (30% and 22%, respectively) are not annotated in SwissProt. Many of these belong to the 29 domain families that are contained in SMART and yet are not annotated in SwissProt. By contrast, all SwissProt annotations relating to our domain set were detected by SMART, with the exception of a small set of domain fragments. Only 23 of the SMART domain families are represented by Prosite motifs or patterns. Moreover, because Prosite motifs commonly represent active site regions, it is apparent that these do not detect the several homologues of kinases, phosphatases, or ubiquitin-conjugating enzymes that have dispensed with their active site residues.
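The comparison reported in this paragraph amounts to a set difference between two annotation lists followed by a percentage. A minimal sketch follows; the (protein, domain) tuple representation, the function names, and the toy annotations are assumptions made for illustration, whereas the counts passed to `percent` are those quoted above.

```python
from typing import Set, Tuple

Annotation = Tuple[str, str]  # (protein identifier, domain name); hypothetical layout


def missing_from_reference(smart: Set[Annotation],
                           reference: Set[Annotation]) -> Set[Annotation]:
    """Domain annotations found by SMART that the reference database lacks."""
    return smart - reference


def percent(part: int, whole: int) -> float:
    """Express `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# Toy data, invented purely to show the set operation.
smart_hits = {("P1", "SH3"), ("P1", "SH2"), ("P2", "PH")}
reference_hits = {("P1", "SH2")}
print(sorted(missing_from_reference(smart_hits, reference_hits)))

# Counts quoted in the text: yeast 165 of 548, human 251 of 1,137.
print(round(percent(165, 548)), round(percent(251, 1137)))  # 30 22
```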

The current set of Pfam HMMs, when compared with the yeast and human SwissProt subsets, detected 290 and 704 domains, respectively. Forty-six of the 86 SMART domain types are not represented currently in Pfam. Moreover, the Pfam set does not yet allow subfamily annotation for domain families such as small GTPases, protein kinases, or protein phosphatases. Pfam and HMMer were able to identify several incomplete domain sequences that SMART could not. SMART was not designed to detect domain fragments because it was considered valuable to detect complete domains, thereby allowing assignment of putative domain boundaries. Consequently, the HMMer (hmmfs) option of SMART has been provided to allow detection of incomplete domain sequences.

Identification of Signaling Domains in Yeast. Annotation of the complete yeast genome (6,218 ORFs) revealed that 420 yeast proteins (6.7%) contain at least one of the domains included in SMART. This is larger than a previous estimate that 2% of yeast proteins are involved in signaling (8), which approximates to the percentage of S. cerevisiae proteins known to be kinase homologues. SMART identifies a total of 622 domains (Table 1); two or more domains occur in 96 of the 420 signaling proteins. Results of the SMART annotation of the yeast proteins identified are summarized in a Web page (http://www.bork.embl-heidelberg.de/Modules/syeast.html), which was generated by using SMART’s graphical output features.

These results imply an improvement by SMART on other tools and current best-annotated databases in the particular field of signaling. An additional feature of SMART is its ability to facilitate predictions of the structures and/or functions of proteins when a hit is recorded. The following examples illustrate several such instances that arise from a domain hit.

Domain Annotation and Deduction of Functional Features. During construction of the SMART database, tensin and focal adhesion kinase (pp125FAK), which both are localized to focal contacts, were found to contain previously unrecognized domains. Fig. 2a shows the modular architecture of tensin, an actin filament capping protein that is known to contain large coiled coil regions, an SH2 domain (29), and an N-terminal domain homologous to protein tyrosine phosphatases (PTPs) (20). SMART predicts a phosphotyrosine binding domain [PTB; also called phosphotyrosine interaction (PI) domain] (Table 1) in tensin’s most C-terminal region, which has not previously been ascribed a domain homology. Each of tensin’s three globular domains (PTP, SH2, and PTB/PI) has been implicated in phosphotyrosine-mediated signaling. This is consistent with previous findings that tensin is a substrate of the tyrosine kinase pp125FAK (30), which is also highly tyrosine-phosphorylated when activated (reviewed in ref. 31).

Application of SMART procedures to pp125FAK homologues predicts band 4.1-homologous domains in their N-terminal regions that bind the cytoplasmic regions of integrins (32) (Fig. 2a). Although one has to be cautious when inferring functional information simply from domain identification, on this occasion the band 4.1 domains are likely to perform similar molecular functions because talin, another band 4.1 domain-containing protein, is known also to bind integrin cytoplasmic domains (33).

FIG. 3. Multiple alignments of selected RasGEFN domains. A conserved region was found in the N-terminal regions of several proteins with RasGEF (Cdc25-like) domains (37). Surprisingly, this N-terminal domain may be present in the sequence either close to, or far from, the RasGEF domain. A PSI-BLAST search using a region (residues 898–946) of C. albicans Cdc25 (CC25_CANAL) and E<0.01 identified each of the sequences in Fig. 3 within nine passes before convergence. Predicted (54) secondary structure and 90% consensus sequences are shown beneath the alignments; SwissProt/PIR/EMBL accession codes and residue limits are given after the alignments. Residues are colored according to the consensus sequence [green: hydrophobic (h), ACFGHIKLMRTVWY; blue: polar (p), CDEHKNQRST; red: small (s), ACDGNPSTV; red: tiny (u), AGS; cyan: turn-like (t), ACDEGHKNQRST; green: aliphatic (l), ILV; and magenta: alcohol (o), ST]. The SwissProt sequence KMHC_DICDI has been altered to account for probable frameshifts.

Reducing the Search Space Enables Identification of Novel Domains. S. cerevisiae ROM1 and ROM2 are sequence-similar proteins that each contain a PH domain and a RhoGEF domain that stimulates exchange of Rho1GDP with Rho1GTP (34). Construction of the SMART databases led to the identification of a putative DEP domain (35) in both ROM1 and ROM2 (Fig. 2b). Comparison of the ROM1 and ROM2 sequences showed a further region of similarity C-terminal to their PH domains. This region [“citron-homology” (CNH) domain] was identified as being homologous to the mouse RhoGTP/RacGTP-binding protein, citron (36), and to the C-terminal regions of several Ste20-like protein kinases (Fig. 2b). A novel domain family (VHS) of unknown function(s) also has been detected in Vps27, Hrs, STAM, and other proteins.

A conserved domain in Cdc25p-like proteins mediates their activities as guanine nucleotide exchange factors for Ras or Ral (37). Each of these molecules contains N-terminal extensions. We find additional amino acid similarities in these regions, and these represent a novel domain family (Fig. 3). Surprisingly, this domain (which we call RasGEFN) can be contiguous to, or far from, the catalytic domain. A construct of p140 Ras-GRF that lacks this region is constitutively active (38), so it is likely that the RasGEFN domain performs a suppressor function.
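The 90% consensus line described in the legend to Fig. 3 can be illustrated as follows. The residue classes are those listed in the legend; the rule used here (report a single residue if it reaches the cutoff, otherwise the smallest class that does, with gaps counted in the denominator) is an assumption about how such a consensus might be derived, not a description of the program used for the figure.

```python
from collections import Counter

# Residue classes as listed in the legend to Fig. 3.
CLASSES = {
    "o": set("ST"),              # alcohol
    "l": set("ILV"),             # aliphatic
    "u": set("AGS"),             # tiny
    "s": set("ACDGNPSTV"),       # small
    "p": set("CDEHKNQRST"),      # polar
    "t": set("ACDEGHKNQRST"),    # turn-like
    "h": set("ACFGHIKLMRTVWY"),  # hydrophobic
}


def column_consensus(column: str, fraction: float = 0.9) -> str:
    """Consensus symbol for one alignment column (hypothetical helper).

    Returns the residue itself if it fills at least `fraction` of the column,
    else the smallest class reaching `fraction`, else '.'.
    """
    residues = column.upper()
    if not residues:
        return "."
    total = len(residues)
    residue, count = Counter(residues).most_common(1)[0]
    if residue != "-" and count / total >= fraction:
        return residue
    # Try the most specific (smallest) classes first.
    for symbol, members in sorted(CLASSES.items(), key=lambda kv: len(kv[1])):
        if sum(r in members for r in residues) / total >= fraction:
            return symbol
    return "."


print(column_consensus("IIIIIIIIII"))  # I
print(column_consensus("ILVILVILVI"))  # l
print(column_consensus("STSTSTSTST"))  # o
print(column_consensus("ACDEFGHIKL"))  # .
```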

Deducing Functional Features of a Domain Family Via a Protein Hit. Although such cases are rare, we have identified additional members of a domain family in regions of proteins that already have been shown to perform particular functions. Such findings can suggest comparable functions for other members of the domain family. The ubiquitin-associated (UBA) domain (Table 1) has been shown to be contained in several enzymes implicated in ubiquitination (39). We have identified a UBA domain in a region of p62, a phosphotyrosine-independent ligand of the p56lck SH2 domain (40), that is known to bind ubiquitin (Fig. 2c). Ubiquitin-binding functions are predicted for other UBA domains.

Prediction of Cellular Function. Particular domains have been implicated in certain cellular events. For example, DEATH domains (Table 1) are present in proteins associated with apoptosis and/or axonal guidance (41, 42). Recent reports (43, 44) identify the rostral cerebellar malformation gene product (rcm) and its homologues as putative netrin receptors. These reports do not indicate the presence of a DEATH domain in rcm or its homologues, even though the domain’s presence may be readily demonstrated by sequence analysis (Fig. 2d) or from its identification in the rcm Caenorhabditis elegans orthologue, UNC-5 (41). As the DEATH domain of UNC-5 is not annotated in databases, this is one of many instances where the potential of domain identification to predict cellular function has been unfulfilled. DEATH domains often form homotypic or heterotypic dimers (42). Because the DEATH domain-containing protein UNC-44 (45) and the putative netrin receptor UNC-5 are known to be involved in axonal guidance, we predict that transduction of the netrin-initiated signal involves heterodimerization of UNC-5 and UNC-44 DEATH domains.

Identification of Signaling Domains in Genes That Are Involved in Diseases. A recent study of 70 positionally cloned human genes mutated in diseases found that a significantly high proportion of these “disease genes” possess roles in cell signaling (7). In accordance with this, the SMART alignment database contains several novel signaling domains in these genes (including the DEATH domain in rcm-like netrin receptors; see above). Fig. 2e shows the modular architecture of pyrin (46) (also called marenostrin; ref. 47). Mutations in the pyrin gene result in Mediterranean fever syndromes, which are inherited inflammatory disorders. In addition to a ret-like zinc finger, pyrin/marenostrin and other butyrophilin-like homologues contain a SPRY domain, a domain of unknown function found triplicated in ryanodine receptors and singly in other proteins (48) (Table 1). Midline 1, a pyrin homologue that also contains a SPRY domain, is mutated in patients with Opitz G/BBB syndrome (49).

Identification of Domains in Different Phyla. The range of species in which a particular domain type is found can correlate with the evolution of specific signaling pathways; many of the known cascades are expected only in animals or eukaryotes (3). Thus, identification of DAG kinase homologues in yeast and eubacteria (Fig. 2f) is clearly a surprise. Although further experimentation is required to infer functional features, the presence of conserved, presumably catalytic, residues in the alignment (data not shown) and the occurrence of DAG kinase activities in prokaryotes (50) suggest that the yeast and bacterial DAG kinase homologues possess similar molecular, but perhaps not cellular, roles to those of their animal and plant homologues.

Significance of Domain Detection and Functional Prediction. Annotation of molecular function in sequence databases and even in the literature is difficult to interpret given that the term function may describe phenomena occurring at distinct levels, such as those of amino acids, domains, proteins, molecular complexes, cells, or organisms. Nevertheless, the examples shown above demonstrate that annotation of a certain domain can provide useful hints toward experimental characterization of function at different levels. Domain identification also might provide a counter-argument to a previously proposed molecular function. For example, identification of a PH domain and the absence of a detectable transmembrane region in a supposed integrin from C. albicans (Fig. 2g) argue strongly against its proposed role in cell adhesion (51). Integrins are transmembrane proteins that link the extracellular matrix with the cytoskeleton and normally contain, except for the β4 subunit, short cytoplasmic sequences. The finding of a PH domain and high sequence similarity to S. cerevisiae BUD4 argues for a signaling role in bud site selection.

DISCUSSION

Many proteins are multidomain in character and possess multiple functions that often are performed by one or more component domains. A Web-based tool (SMART) has been designed that makes use of mainly public domain information to allow easy and rapid annotation of signaling multidomain proteins. The tool contains several unique aspects, including automatic seed alignment generation, automatic detection of repeated motifs or domains, and a protocol for combining domain predictions from homologous subfamilies. The ability of SMART to annotate single sequences or large datasets is exemplified by the cases described in Results, including annotation of the complete set of yeast ORFs.

Currently, large-scale or genome analysis is commonly performed by annotating ORFs with a single “best hit” from similarity searches. Ambiguities as to whether hits represent orthologs (i.e., homologues in different organisms that arose from speciation rather than intragenome duplication and are likely to have a corresponding function; ref. 52) or paralogs (other members of multigene families) are not resolved, and omission of domain annotation also leads to misprediction of function. As most signaling proteins are multidomain in character, only annotation at the domain level avoids ambiguities in assigning homologies and functions to sequences, which may propagate further on additional findings of homology. Furthermore, deduction of the modular architecture is essential for the understanding of the complexities of multidomain eukaryotic signaling molecules; current annotation, however, does not adequately provide this information (Table 1). As examples of this, the existence of noncatalytic signaling domains cannot be deduced from the current yeast genome directory (8) and no human RasGEF domains currently are annotated in SwissProt. Graphical representation of the complement of modular proteins in a completed genome (e.g., the 622 signaling domains in 420 yeast proteins: http://www.bork.embl-heidelberg.de/Modules/syeast.html) might provide the basis for relating experimentally derived information concerning domains and multidomain proteins to cellular events such as signaling.

Although other collections, such as PROSITE, Pfam, BLOCKS, and PRINTS, contain many more distinct domains or motifs, the focus of SMART on signaling allows significantly enhanced detection sensitivity, permits the inclusion of many families that are not represented in other collections, and offers a high level of specificity (i.e., a low rate of false positives, which is essential for large-scale analysis). The SMART database shall be continually updated; alignment updates shall be semiautomated to avoid misalignments. Thus, forthcoming SMART database versions shall be hand-checked to provide datasets of high quality. In future, experimental findings that advance the understanding of domain structure and function also shall be provided via updates. As SMART is designed to obtain biologically relevant results without dependency on a single database search technique, there is potential to modify underlying methods to improve performance.

Note Added in Proof. Recent improvements to the SMART system include implementation of SWise-derived E-values and addition of more than 80 extracellular domains. A ProfileScan Server (http://ulrec3.unil.ch/software/PFSCAN_form.html) has appeared recently that includes facilities that are similar or complementary to those of SMART.

We thank colleagues at the European Molecular Biology Laboratory and Ewan Birney for many helpful discussions. We also thank Bernhard Sulzer for computational assistance. C.P.P. is a Wellcome Trust Career Development Fellow and a member of the Oxford Centre for Molecular Sciences, and was supported in part by a European Molecular Biology Organization Short-Term Fellowship. J.S. and P.B. were supported by the European Union, Bundesministerium für Bildung und Forschung (Germany), and the Deutsche Forschungsgemeinschaft.

1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Nucleic Acids Res. 25, 3389–3402.
2. Pearson, W.R. (1991) Genomics 11, 635–650.
3. Doolittle, R.F. (1995) Annu. Rev. Biochem. 64, 287–314.
4. Bork, P., Downing, A.K., Kieffer, B. & Campbell, I.D. (1996) Q. Rev. Biophys. 29, 119–167.
5. Bork, P., Schultz, J. & Ponting, C.P. (1997) Trends Biochem. Sci. 22, 296–298.
6. Ponting, C.P., Schultz, J. & Bork, P. (1997) Trends Biochem. Sci. 22, Poster Suppl. C04.
7. Mushegian, A.R., Bassett, D.E., Jr., Boguski, M., Bork, P. & Koonin, E.V. (1997) Proc. Natl. Acad. Sci. USA 94, 5831–5836.
8. Mewes, H.W., Albermann, K., Bahr, M., Frishman, D., Gleissner, A., Hani, J., Heumann, K., Kleine, K., Maierl, A., Oliver, S.G., et al. (1997) Nature (London) 387, Suppl., 7–65.
9. Bairoch, A., Bucher, P. & Hofmann, K. (1997) Nucleic Acids Res. 25, 217–221.
10. Henikoff, J.G., Pietrokovski, S. & Henikoff, S. (1997) Nucleic Acids Res. 25, 222–225.
11. Attwood, T.K., Beck, M.E., Bleasby, A.J., Degtyarenko, K., Michie, A.D. & Parry-Smith, D.J. (1997) Nucleic Acids Res. 25, 212–217.
12. Sonnhammer, E.L., Eddy, S.R. & Durbin, R. (1997) Proteins 28, 405–420.
13. Bork, P. & Gibson, T.J. (1996) Methods Enzymol. 266, 162–184.
14. Eddy, S.R., Mitchison, G. & Durbin, R. (1995) J. Comput. Biol. 2, 9–23.
15. Tatusov, R.L., Altschul, S.F. & Koonin, E.V. (1994) Proc. Natl. Acad. Sci. USA 91, 12091–12095.
16. Birney, E., Thompson, J. & Gibson, T. (1996) Nucleic Acids Res. 24, 2730–2739.
17. Schuler, G.D., Altschul, S.F. & Lipman, D.J. (1991) Proteins 9, 180–190.
18. Ponting, C.P. & Kerr, I.D. (1996) Protein Sci. 5, 914–922.
19. Koonin, E.V. (1996) Trends Biochem. Sci. 21, 242–243.
20. Haynie, D.T. & Ponting, C.P. (1996) Protein Sci. 5, 2643–2646.
21. Ponting, C.P. & Parker, P.J. (1996) Protein Sci. 5, 162–166.
22. Gibson, T.J., Hyvonen, M., Musacchio, A., Saraste, M. & Birney, E. (1994) Trends Biochem. Sci. 19, 349–353.
23. Russell, R.B. (1994) Protein Eng. 7, 1407–1410.
24. Thompson, J.D., Higgins, D.G. & Gibson, T.J. (1994) Nucleic Acids Res. 22, 4673–4680.
25. Hogue, C.W.V., Ohkawa, H. & Bryant, S.H. (1996) Trends Biochem. Sci. 21, 226–229.
26. Lupas, A., Van Dyke, M. & Stock, J. (1991) Science 252, 1162–1164.
27. Wootton, J.C. & Federhen, S. (1996) Methods Enzymol. 266, 554–573.
28. Fasman, G.D. & Gilbert, W.A. (1990) Trends Biochem. Sci. 15, 89–92.
29. Davis, S., Lu, M.L., Lo, S.H., Lin, S., Butler, J.A., Druker, B.J., Roberts, T.M., An, Q. & Chen, L.B. (1991) Science 252, 712–715.
30. Richardson, A. & Parsons, J.T. (1996) Nature (London) 380, 538–540.
31. Ilic, D., Damsky, C.H. & Yamamoto, T. (1997) J. Cell Sci. 110, 401–407.
32. Schaller, M.D., Otey, C.A., Hildebrand, J.D. & Parsons, J.T. (1995) J. Cell. Biol. 130, 1181–1187.
33. Knezevic, I., Leisner, T.M. & Lam, S.C.T. (1996) J. Biol. Chem. 271, 16416–16421.
34. Ozaki, K., Tanaka, K., Imamura, H., Hihara, T., Kameyama, T., Nonaka, H., Hirano, H., Matsuura, Y. & Takai, Y. (1996) EMBO J. 15, 2196–2207.
35. Ponting, C.P. & Bork, P. (1996) Trends Biochem. Sci. 21, 245–246.
36. Madaule, P., Furuyashiki, T., Reid, T., Ishizaki, T., Watanabe, G., Morii, N. & Narumiya, S. (1995) FEBS Lett. 377, 243–248.
37. Boguski, M.S. & McCormick, F. (1993) Nature (London) 366, 643–654.
38. Buchsbaum, R., Telliez, J.-B., Goonesekera, S. & Feig, L.A. (1996) Mol. Cell. Biol. 16, 4888–4896.
39. Hofmann, K. & Bucher, P. (1996) Trends Biochem. Sci. 21, 172–173.
40. Vadlamudi, R.K., Joung, I., Strominger, J.L. & Shin, J. (1996) J. Biol. Chem. 271, 20235–20237.
41. Hofmann, K. & Tschopp, J. (1995) FEBS Lett. 371, 321–323.
42. Feinstein, E., Kimchi, A., Wallach, D., Boldin, M. & Varfolomeev, E. (1995) Trends Biochem. Sci. 20, 342–344.
43. Leonardo, E.D., Hinck, L., Masu, M., Keino-Masu, K., Ackerman, S.L. & Tessier-Lavigne, M. (1997) Nature (London) 386, 833–838.
44. Ackerman, S.L., Kozak, L.P., Przyborski, S.A., Rund, L.A., Boyer, B.B. & Knowles, B.B. (1997) Nature (London) 386, 838–842.
45. Otsuka, A.J., Franco, R., Yang, B., Shim, K.H., Tang, L.Z., Zhang, Y.Y., Boontrakulpoontawee, P., Jeyaprakash, A., Hedgecock, E., Wheaton, V.I., et al. (1995) J. Cell. Biol. 129, 1081–1092.
46. The International FMF Consortium (1997) Cell 90, 797–807.
47. The French FMF Consortium (1997) Nat. Genet. 17, 25–31.
48. Ponting, C.P., Schultz, J. & Bork, P. (1997) Trends Biochem. Sci. 22, 193–194.
49. Quaderi, N.A., Schweiger, S., Gaudenz, K., Franco, B., Rugarli, E.I., Berger, W., Feldman, G.J., Volta, M., Andolfi, G., Gilgenkrantz, S., et al. (1997) Nat. Genet. 17, 285–291.
50. Loomis, C.R., Walsh, J.P. & Bell, R.M. (1985) J. Biol. Chem. 260, 4091–4097.
51. Gale, C., Finkel, D., Tao, N., Meinke, M., McClellan, M., Olson, J., Kendrick, K. & Hostetter, M. (1996) Proc. Natl. Acad. Sci. USA 93, 357–361.
52. Fitch, W.M. (1970) Syst. Zool. 19, 99–113.
53. Bork, P. & Margolis, B. (1995) Cell 80, 693–694.
54. Rost, B., Sander, C. & Schneider, R. (1994) Comput. Appl. Biosci. 10, 53–60.


Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5865–5871, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

Highly specific protein sequence motifs for genome analysis

CRAIG G. NEVILL-MANNING, THOMAS D. WU, AND DOUGLAS L. BRUTLAG*

Department of Biochemistry, Stanford University, Stanford, CA 94305–5307

ABSTRACT We present a method for discovering conserved sequence motifs from families of aligned protein sequences. The method has been implemented as a computer program called EMOTIF (http://motif.stanford.edu/emotif). Given an aligned set of protein sequences, EMOTIF generates a set of motifs with a wide range of specificities and sensitivities. EMOTIF also can generate motifs that describe possible subfamilies of a protein superfamily. A disjunction of such motifs often can represent the entire superfamily with high specificity and sensitivity. We have used EMOTIF to generate sets of motifs from all 7,000 protein alignments in the BLOCKS and PRINTS databases. The resulting database, called IDENTIFY (http://motif.stanford.edu/identify), contains more than 50,000 motifs. For each alignment, the database contains several motifs whose probability of matching a false positive ranges from 10^-10 to 10^-5. Highly specific motifs are well suited for searching entire proteomes while generating very few false predictions. IDENTIFY assigns biological functions to 25–30% of all proteins encoded by the Saccharomyces cerevisiae genome and by several bacterial genomes. In particular, IDENTIFY assigned functions to 172 proteins of unknown function in the yeast genome.

Assigning function to genes in newly sequenced genomes requires highly specific search and comparison methods (1–4). The process involves first identifying all ORFs or coding regions in the genome and translating them into putative protein sequences. These protein sequences then are compared with (i) databases of individual protein sequences, (ii) databases of protein consensus sequences, or (iii) families of aligned proteins (4–9). Finally, the remaining unassigned proteins may be compared with known protein folds or structures by using sequence-structure alignment or threading methods (10–16).

In large-scale searches for biological function, a high level of specificity is critical to minimize the number of false predictions made among the thousands of genes in a genome. Many popular sequence similarity methods calculate expectation values that can be used together with a threshold to guarantee a specific level of false predictions. However, such highly specific similarity search methods often sacrifice sensitivity and fail to find all of the members of a particular protein family in a genome. On the other hand, protein sequence motifs usually are generated manually in an attempt to maximize sensitivity while sacrificing specificity, thus giving rise to relatively high frequencies of false predictions (17, 18).

In this paper, we present a highly systematic and objective method, called EMOTIF (19), for determining sequence motifs from aligned sets of protein sequences. Unlike most methods that attempt to find a single “best” motif optimized at one level of sensitivity and specificity, EMOTIF generates many possible motifs over a wide range of sensitivity and specificity. Thus, EMOTIF can generate extremely specific motifs that will produce fewer than one expected false prediction per 10^10 tests, as well as more sensitive motifs that cover all members of a family. EMOTIF also can be used to find several highly specific motifs that characterize different subsets of a protein family. By combining these highly specific motifs together in a disjunction, we can potentially describe a protein family with both high specificity and sensitivity.

We have applied EMOTIF to two large data sets of aligned protein families, the BLOCKS and the PRINTS databases (7, 9, 20). Together, these data sets contain nearly 7,000 alignments representing protein active sites, substrate binding sites, superfamily signatures, and so on. By applying EMOTIF to all of these alignments, we have generated a database called IDENTIFY, which contains more than 50,000 sequence motifs with specificities varying from one expected false positive prediction in 10^5 tests to as low as one expected false positive prediction in 10^10 tests. IDENTIFY can be used to scan newly sequenced ORFs from genomic sequences for function. Each IDENTIFY motif has an associated specificity, indicating the likelihood that a match is a true or false prediction.

By using the IDENTIFY database of motifs, we have scanned all ORFs in several bacterial genomes and in the yeast genome for function. IDENTIFY was able to determine the function of 25–30% of all of the proteins in these genomes, usually resulting in 3–4 motifs per protein identified. In particular, IDENTIFY was able to assign a function to 172 of the 833 ORFs whose function was labeled as unknown.

METHODS

Motif Substitution Groups. A sequence motif is a particular kind of representation called a regular expression (21). It represents a generalization about the range of variability that occurs in corresponding positions across a family of protein sequences. A sequence motif represents variability by specifying a group of amino acids permitted at each position. In our notation, this group of amino acids is enclosed by brackets, e.g., [ILMV]. When only a single amino acid is allowed in a position, that amino acid is represented by a single character without brackets. On the other hand, when a position has no meaningful conservation, all 20 amino acids are permitted; in that case, we use the wild-card character ‘.’. For a sequence to match a motif, each of the amino acids in the sequence must be permitted by the corresponding group in the motif. In some cases, we may relax this requirement to allow one or more mismatches.
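Because the motif notation described above is a subset of ordinary regular-expression syntax, matching can be illustrated with a standard regex engine. The following is a minimal sketch, not the EMOTIF implementation; the function names (`motif_to_regex`, `find_matches`, `count_mismatches`) and the example motif and sequence are hypothetical and chosen only for illustration.

```python
import re
from typing import List


def motif_to_regex(motif: str) -> "re.Pattern[str]":
    """Validate a motif such as 'M[FYW].[KR]' and compile it as a regex.

    Single letters, bracketed groups and the '.' wild card are already
    valid regular-expression syntax, so no translation is needed.
    """
    if not re.fullmatch(r"(?:[A-Z]|\.|\[[A-Z]+\])+", motif):
        raise ValueError(f"not a valid motif: {motif!r}")
    return re.compile(motif)


def find_matches(motif: str, sequence: str) -> List[int]:
    """Return 0-based start positions at which the motif matches."""
    pattern = motif_to_regex(motif)
    # Wrapping the motif in a lookahead also reports overlapping matches.
    scanner = re.compile(f"(?=({pattern.pattern}))")
    return [m.start() for m in scanner.finditer(sequence)]


def count_mismatches(motif: str, segment: str) -> int:
    """Count positions of an equal-length segment that the motif rejects."""
    positions = re.findall(r"\[[A-Z]+\]|[A-Z.]", motif)
    if len(positions) != len(segment):
        raise ValueError("segment length must equal motif length")
    return sum(1 for p, aa in zip(positions, segment) if not re.fullmatch(p, aa))


if __name__ == "__main__":
    print(find_matches("M[FYW].[KR]", "AAMFAKMYCRT"))
    print(count_mismatches("M[FYW].[KR]", "MLAK"))
```

The `count_mismatches` helper shows one way the strict matching requirement could be relaxed: a segment is accepted when the count stays at or below a chosen tolerance.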

To characterize the types of variability observed in nature, we conducted an empirical study of amino acid groups by using two databases of protein families. The BLOCKS database (22) contains short, ungapped regions that are highly conserved, according to sequence characteristics. The HSSP database (23) contains global alignments of sequences based on structural alignments. We examined all possible subsets S of amino acids to find those groups that are well conserved. We had two criteria for conservation: (i) compactness: amino acids within the group should substitute for one another with relatively high frequency, and (ii) isolation: amino acids outside the group should substitute for those in the group with relatively low frequency. These criteria follow those often used in cluster analysis (24).

*To whom reprint requests should be addressed at: Department of Biochemistry, Beckman Center B400, Stanford University, Stanford, CA 94305–5307. e-mail: [email protected].

© 1998 by The National Academy of Sciences 0027–8424/98/955865–7$2.00/0
PNAS is available online at http://www.pnas.org.

To measure compactness and isolation, we first used the BLOCKS and HSSP databases to provide a set of conditional counts c(a|S), which equals the total number of occurrences of amino acid a in all aligned positions that contain the group S. Conceptually, we found all aligned positions that contain S, and then tabulated all amino acids from those positions. Then, we computed conditional frequencies

[equation missing in source]

where the quantity f(a|S) is defined only for amino acids a not in group S.

For each group, we computed the expected conditional frequencies and the standard error of the proportion for amino acids outside the group:

[equation missing in source]

where c(a′) is the marginal count of amino acid a′ over all aligned positions.

We then computed a separation score for each group, as follows:

[equation missing in source]

where Z(a|S) is a conditional relative deviate, or Z-score. The first term represents our measure of compactness, and the second term represents our measure of isolation. Based on these separation scores, we found all amino acid groups that had a separation score greater than three standard errors, which is equivalent to a significance level of 0.01. Further details of our analysis are presented in ref. 25.
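The counting step described above can be sketched in code. This is only an illustration under stated assumptions, not the authors' procedure: it assumes that a column "contains the group S" when every member of S occurs in it, that frequencies for amino acids outside S are normalized over non-members only, and that the Z-score is the usual deviate of an observed proportion from its expected value. The separation score itself is not attempted here because its formula is not given in the text. The function names and toy columns are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Dict, Iterable


def conditional_counts(columns: Iterable[str], group: str) -> Counter:
    """c(a|S): residue counts over columns that contain every member of S."""
    members = set(group)
    counts: Counter = Counter()
    for column in columns:
        if members <= set(column):
            counts.update(column)
    return counts


def z_scores_outside(columns: Iterable[str], group: str) -> Dict[str, float]:
    """Z-score of each non-member's conditional frequency vs. its marginal one."""
    columns = list(columns)
    members = set(group)
    cond = conditional_counts(columns, group)
    marginal: Counter = Counter()
    for column in columns:
        marginal.update(column)

    # Totals restricted to amino acids outside the group (an assumption).
    n = sum(c for a, c in cond.items() if a not in members)
    total_outside = sum(c for a, c in marginal.items() if a not in members)

    scores: Dict[str, float] = {}
    for a, c in marginal.items():
        if a in members or n == 0:
            continue
        observed = cond[a] / n
        expected = c / total_outside
        standard_error = sqrt(expected * (1.0 - expected) / n)
        if standard_error > 0:
            scores[a] = (observed - expected) / standard_error
    return scores


if __name__ == "__main__":
    toy_columns = ["IIVV", "ILVA", "KKRR", "IVAA"]  # each string is one aligned column
    print(conditional_counts(toy_columns, "IV"))
    print(z_scores_outside(toy_columns, "IV"))
```

In this toy input, alanine co-occurs with the [IV] columns more often than its overall frequency would suggest and gets a positive score, while lysine never co-occurs and gets a negative one.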

Our criteria were met by 30 substitution groups in the BLOCKS database and 51 substitution groups in the HSSP database. The HSSP database yielded more groups because of its larger size, and because our criterion is based on statistical significance. Twenty substitution groups were conserved empirically in both databases, and the validation by both databases provides good evidence that these groups are indeed conserved in nature. If we arrange these groups hierarchically, we obtain the set of amino acid groups shown in Fig. 1. We used these substitution groups to define the space of motifs available to describe protein families.

Motif Enumeration and Ranking. A conserved region may be described by many possible motifs, with different levels of coverage and specificity. To better understand the choices involved, consider the sequence alignment in Fig. 2a. We can cover all sequences in the training set if we select the smallest group of amino acids that accounts for all of the amino acids in each position. For example, every sequence has methionine in the first position, so the first position of the motif should specify M. In the second position, both phenylalanine and tyrosine occur. The smallest group of amino acids from Fig. 1 that accounts for the entire position is [FYW], which allows tryptophan to occur in addition to phenylalanine and tyrosine. Using this group is tantamount to inferring that this position requires an aromatic amino acid. In the third position, no allowable group can account for the diverse amino acids that are observed, so to achieve complete coverage we must place a wild-card character in this position.
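The column-by-column construction of a full-coverage motif described above can be sketched as follows. The list of substitution groups is an assumption made for illustration: only [FYW], [ILMV] and [KR] are named in the text, and the complete set from Fig. 1 is not reproduced here. The toy alignment is likewise hypothetical.

```python
from typing import Iterable, List

# Illustrative groups only; the full set shown in Fig. 1 is not listed in the text.
SUBSTITUTION_GROUPS = ["FYW", "ILMV", "KR"]


def cover_column(column: Iterable[str], groups: List[str] = SUBSTITUTION_GROUPS) -> str:
    """Return the most specific motif symbol that accepts every residue in a column."""
    observed = set(column)
    if len(observed) == 1:
        return next(iter(observed))          # fully conserved: a single letter
    candidates = [g for g in groups if observed <= set(g)]
    if not candidates:
        return "."                           # no allowed group fits: wild card
    return f"[{min(candidates, key=len)}]"   # smallest covering group


def full_coverage_motif(alignment: List[str]) -> str:
    """Build the motif that matches every sequence of an ungapped alignment."""
    return "".join(cover_column(column) for column in zip(*alignment))


if __name__ == "__main__":
    toy_alignment = ["MFAR", "MYGK", "MFSR"]
    print(full_coverage_motif(toy_alignment))
```

For the toy alignment, the first column is conserved, the second is covered by the aromatic group, the third needs a wild card, and the fourth is covered by [KR], mirroring the reasoning in the paragraph above.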

FIG. 1. Substitution groups. Groups of amino acids found to occur together in columns of aligned sequences in both the BLOCKS and HSSP databases. Only groups of amino acids that occur together at a significant frequency and are separated from all other amino acids at a level of significance of less than 0.01 are included. The substitution groups are arranged hierarchically to show relationships between their physical properties.

The resulting motif, shown in Fig. 2b, has complete coverage, because it describes the entire training set, but it can be affected by problems with the data. Consider again the alignment in Fig. 2a. In the eighth position from the right, every sequence but one contains a leucine. The first sequence, however, contains a proline at this position. This may be the result of a sequencing error, a rare mutation, or a sequence that has been erroneously assigned to the family. In any case, if the first sequence were removed from consideration in the formation of the motif, this position in the motif would change from ‘.’ to L. Doing this reduces the coverage of the motif by one sequence, but makes it more specific.

Even in the absence of problems in the data, motifs with high coverage generally have low specificity, thereby resulting in false positives. In constructing a motif, we are faced then with a fundamental tradeoff between coverage and specificity. The EMOTIF algorithm explores this tradeoff for a particular alignment by exhaustively generating all possible motifs using the allowable substitution groups and quantifying the coverage and specificity for each motif.
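To make the idea of exhaustive enumeration concrete, here is a brute-force sketch for a very small alignment. It is not the EMOTIF algorithm, which prunes the search space; it simply tries every combination of per-column symbols and keeps those that match enough training sequences. The group list, function names and toy data are assumptions for illustration.

```python
import re
from itertools import product
from typing import Iterator, List, Tuple

GROUPS = ["FYW", "ILMV", "KR"]  # illustrative subset, not the full set of Fig. 1


def column_options(column: Tuple[str, ...]) -> List[str]:
    """Symbols worth trying at one column: '.', observed residues, groups containing them."""
    options = {"."}
    for aa in set(column):
        options.add(aa)
        options.update(f"[{g}]" for g in GROUPS if aa in g)
    return sorted(options)


def enumerate_motifs(alignment: List[str], min_coverage: int) -> Iterator[Tuple[str, int]]:
    """Yield (motif, coverage) for every candidate motif meeting the coverage floor."""
    per_column = [column_options(col) for col in zip(*alignment)]
    for choice in product(*per_column):
        motif = "".join(choice)
        coverage = sum(1 for seq in alignment if re.fullmatch(motif, seq))
        if coverage >= min_coverage:
            yield motif, coverage


if __name__ == "__main__":
    toy_alignment = ["MFR", "MYK", "MFK"]
    for motif, coverage in enumerate_motifs(toy_alignment, min_coverage=2):
        print(motif, coverage)
```

The number of candidates grows exponentially with alignment width, which is why a coverage floor and pruning matter in practice.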

Another feature of our example bears discussion. The sequences can be partitioned into two subclasses based on the amino acid in the fourth position. The first group has arginine in this position, whereas the second group has lysine. All sequences in the first group have tyrosine in the final position, whereas none in the second group do. Indeed, partitioning the sequences in this way allows the conserved region to be described by two highly specific motifs, rather than a single, more general one. Fig. 2c shows the motif for the first group. Thirteen positions are more specific than in the motif for the entire set of sequences, resulting in a factor of 10^10 increase in specificity. Thus, by finding motifs that cover only part of the training set, EMOTIF is potentially able to discover subfamilies within a superfamily and characterize them with a specific motif.

FIG. 2. Aligned block of 34 tubulin proteins and two motifs representing these sequences. (a) An aligned block of 34 tubulin proteins and the sequence variation observed among them. (b) One possible sequence motif for the alignment in a that can be formed by using the amino acid substitution groups from Fig. 1. (c) A much more specific sequence motif that can be used to represent the upper 19 tubulin sequences, which form a group more closely related to each other than to the lower 15 sequences.

We define specificity as the probability that a random sequence would match the motif. To calculate this, we assume that the amino acids at each position of a random sequence are independent and identically distributed. We use the observed distribution of amino acids in the SWISSPROT database as an estimate for this distribution. The specificity of a motif then is simply the product of the probabilities in each position. A wild-card character matches with probability 1.0, and a specific amino acid matches with the probability taken from the database. A group of amino acids matches with the sum of the probabilities of the individual amino acids. So the probability of the motif in Fig. 2b is

$$p(M)\cdot 1\cdot[p(F)+p(W)+p(Y)]\cdot[p(K)+p(R)]\cdot 1\cdot 1\cdot p(F)\cdot\,\ldots\,\cdot 1.$$

We have found empirically that this estimate accurately predicts false positive rates for matches of motifs against large protein databases, so the assumption of independence of positions is reasonable in practice.
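The product rule for specificity can be written out in a few lines. This is a sketch under an explicit simplification: the background distribution below is uniform (0.05 per residue), which is a placeholder and not the SWISSPROT distribution used in the text. The function name `motif_specificity` is hypothetical.

```python
import re
from math import prod
from typing import Dict


def motif_specificity(motif: str, background: Dict[str, float]) -> float:
    """Probability that a random segment matches the motif, assuming independent positions."""
    positions = re.findall(r"\[[A-Z]+\]|[A-Z.]", motif)
    per_position = []
    for position in positions:
        if position == ".":
            per_position.append(1.0)  # a wild card matches anything
        else:
            # Single letter or bracketed group: sum the member probabilities.
            per_position.append(sum(background[aa] for aa in position.strip("[]")))
    return prod(per_position)


if __name__ == "__main__":
    # Placeholder background: uniform over the 20 amino acids (an assumption).
    uniform = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
    print(motif_specificity("M[FYW].[KR]", uniform))
```

With a real background table, rare residues such as tryptophan would contribute smaller factors than common ones such as leucine, so motifs of the same shape can differ considerably in specificity.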

The EMOTIF algorithm exhaustively generates all possible motifs for a particular alignment using the allowable substitution groups, and quantifies the coverage and specificity for each motif. The graph in Fig. 3 illustrates the tradeoff between these quantities. Each point in the graph corresponds to a single motif for the alignment of 159 segments of tubulin sequences similar to those shown in Fig. 2a. The vertical axis is the specificity of the motif, which ranges from 1 to 10^-44. The horizontal axis is the coverage of the motif, measured as the number of training sequences that the motif matches. In this case, the training set contains 159 sequences, and motifs covering fewer than 30% of the total (47 sequences) were not generated. The EMOTIF algorithm uses a lower limit on coverage to help prune the search space and to allow all motifs to be generated efficiently. Typically, the lower limit on coverage is 30%, but this value may be specified by the user. Because coverage of the training set is an integer, the graph consists of a series of vertical lines, one for each number of sequences covered. Note that even if two motifs lie in the same vertical line, meaning that they cover the same number of sequences, they do not necessarily cover the same particular subset of sequences.

FIG. 3. Enumeration of tubulin motifs by EMOTIF. EMOTIF generates all possible sequence motifs that can cover at least 30% of 159 tubulin sequences in a training set. Each motif is plotted as a dot in the figure, where the horizontal axis gives the coverage of the motif (number of sequences covered in the training set), and the vertical axis plots the specificity of the motif as the probability of matching a random protein segment. The motifs occur in vertical lines because coverage is an integer quantity. The lower curve is the Pareto-optimal curve, which represents the most specific motif at each level of sensitivity.

An ideal motif would lie in the lower right of the graph, with complete coverage and maximum specificity. However, the tradeoff between coverage and specificity makes the ideal motif unattainable. Motifs at the extremes are generally undesirable. Motifs in the lower left of the graph are very specific but account for only 30% of the training set. Motifs in the upper right are very sensitive, but result in a high number of expected false positives. Because EMOTIF displays the tradeoff between coverage and specificity explicitly, we may choose optimal motifs that achieve a desired level of specificity. One strategy for searching a large database is to require that the expected number of false positives be less than one. The expected number of false positives is approximately equal to the specificity of the motif multiplied by the number of possible match positions in the database. For example, a search of the GenPept protein sequence database, which contains 10^8 amino acids, achieves fewer than one expected false positive when the motif has a specificity of 10^-8 or less. This specifies those motifs below a particular horizontal line in the graph. For searches of smaller databases, the line would be higher, and therefore, we could use more sensitive motifs. For searches of larger databases, the line would be lower, and we would require more specific motifs. Given this restriction, the optimal motif for a particular level of specificity would be the one beneath the line having the highest sensitivity, as approximated by coverage of the training set.
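The relationship between specificity, database size and expected false positives is simple arithmetic, sketched below. The function names are hypothetical, and the number of match positions is approximated by the number of residues in the database, as in the GenPept example above.

```python
def expected_false_positives(specificity: float, database_residues: int) -> float:
    """Approximate expected false matches: specificity times possible match positions."""
    return specificity * database_residues


def specificity_threshold(database_residues: int, allowed_false_positives: float = 1.0) -> float:
    """Largest specificity value (match probability) that keeps the expectation within the limit."""
    return allowed_false_positives / database_residues


if __name__ == "__main__":
    genpept_residues = 10**8  # size quoted in the text
    print(expected_false_positives(1e-8, genpept_residues))
    print(specificity_threshold(genpept_residues))
    # A smaller database tolerates a less specific motif for the same expectation.
    print(specificity_threshold(10**6))
```

The threshold moves in inverse proportion to database size, which is the "higher line for smaller databases, lower line for larger ones" behaviour described in the text.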

The space of optimal motifs also is reduced by the principle of dominance. For any particular level of coverage, a motif that is more specific dominates one that is less specific. On the graph, for any vertical line, a motif that has fewer expected false positives dominates those with more expected false positives. A similar argument can be made for motifs with a particular level of specificity: a motif with high coverage dominates those with lower coverage. The dominating motifs lie along a Pareto-optimal curve, shown in Fig. 3 as a line along the lower right frontier of motifs. No motif on that line can be made more specific without reducing its coverage, nor be made to cover more sequences without reducing its specificity. Therefore, motifs on or near this line should be used for searching tasks. In practice, we select the motif on the Pareto-optimal line with maximum coverage at the desired level of specificity.
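The dominance rule lends itself to a short sketch. Each motif is represented here as a hypothetical tuple of (pattern, coverage, specificity), where a smaller specificity value means a more specific motif; the data and function names are illustrative only.

```python
from typing import List, Optional, Tuple

Motif = Tuple[str, int, float]  # (pattern, coverage, match probability)


def pareto_front(motifs: List[Motif]) -> List[Motif]:
    """Keep motifs not dominated by another with >= coverage and <= match probability."""
    # Visit motifs from highest coverage down; within equal coverage, most specific first.
    ordered = sorted(motifs, key=lambda m: (-m[1], m[2]))
    front: List[Motif] = []
    best_so_far = float("inf")
    for motif in ordered:
        # Survives only if strictly more specific than everything covering at least as much.
        if motif[2] < best_so_far:
            front.append(motif)
            best_so_far = motif[2]
    return front


def pick_motif(motifs: List[Motif], max_match_probability: float) -> Optional[Motif]:
    """Highest-coverage Pareto-optimal motif meeting the specificity requirement."""
    eligible = [m for m in pareto_front(motifs) if m[2] <= max_match_probability]
    return max(eligible, key=lambda m: m[1]) if eligible else None


if __name__ == "__main__":
    candidates = [
        ("A", 159, 1e-3), ("B", 120, 1e-9), ("C", 120, 1e-6),
        ("D", 80, 1e-12), ("E", 60, 1e-10),
    ]
    print(pareto_front(candidates))
    print(pick_motif(candidates, 1e-8))
```

In the toy data, C is dominated by B (same coverage, less specific) and E is dominated by D (lower coverage and less specific), so neither appears on the front.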

Disjunctive Motifs. By allowing only part of the training set to be covered, we obtain motifs that may fail to describe an entire family or superfamily, thereby resulting in lower sensitivity. To solve this problem, we use disjunctive motifs to achieve high specificity and sensitivity. After we apply EMOTIF to a given training set and select an optimal motif at a given level of specificity, we can invoke EMOTIF on the sequences that were not covered. This generates a second motif, which together with the first motif covers more of the training set than the first motif alone. This process may be continued until some coverage criterion is met, such as coverage of 90% of the training set.
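The iterative procedure for building a disjunction can be sketched as a loop around any motif finder. The finder is passed in as a function; `toy_finder` below is a deliberately crude stand-in invented for this example and does not reflect how EMOTIF selects motifs. All names and the toy sequences are assumptions.

```python
import re
from collections import Counter
from typing import Callable, List, Optional


def disjunctive_motifs(
    sequences: List[str],
    find_motif: Callable[[List[str]], Optional[str]],
    target_coverage: float = 0.9,
    max_motifs: int = 10,
) -> List[str]:
    """Repeatedly find a motif for the still-uncovered sequences until coverage is reached."""
    remaining = list(sequences)
    chosen: List[str] = []
    needed = target_coverage * len(sequences)
    while remaining and len(sequences) - len(remaining) < needed and len(chosen) < max_motifs:
        motif = find_motif(remaining)
        if motif is None:
            break
        uncovered = [s for s in remaining if not re.fullmatch(motif, s)]
        if len(uncovered) == len(remaining):
            break  # the new motif covered nothing; stop rather than loop forever
        chosen.append(motif)
        remaining = uncovered
    return chosen


def toy_finder(seqs: List[str]) -> Optional[str]:
    """Stand-in finder: take sequences sharing the commonest last residue,
    then write conserved columns as letters and the rest as wild cards."""
    if not seqs:
        return None
    key = Counter(s[-1] for s in seqs).most_common(1)[0][0]
    subset = [s for s in seqs if s[-1] == key]
    return "".join(col[0] if len(set(col)) == 1 else "." for col in zip(*subset))


if __name__ == "__main__":
    family = ["MFARY", "MYGRY", "MFSRY", "MFAKF", "MYGKF"]
    print(disjunctive_motifs(family, toy_finder))
```

A sequence matches the disjunction if it matches any motif in the returned list, so coverage can only increase as motifs are added, while the combined false positive rate is at most the sum of the individual rates.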

To evaluate the increase in coverage possible with this approach, we obtained disjunctive motifs for each of the 7,000 multiple sequence alignments in the BLOCKS and PRINTS databases. The disjunctive motif strategy requires one parameter: a desired minimum level of specificity. We applied our strategy for five levels of specificity, from 10^-6 to 10^-10, by factors of 10. For each level of specificity, we measured the number of motifs required to achieve 90% coverage for each sequence alignment. The results of our experiments are shown in Fig. 4. At a specificity level of 10^-10, 65% of the sequence alignments had 90% coverage by a single motif, whereas at a specificity level of 10^-6, 80% of the blocks had 90% coverage by a single motif. At a specificity level of 10^-10, 80% of the sequence alignments had 90% coverage by a disjunction of two motifs, whereas at a specificity level of 10^-6, nearly 95% of blocks had 90% coverage by a disjunction of two motifs. It appears that for reasonable levels of specificity, one or two motifs are sufficient to cover most sequence alignments reasonably well in these databases.

A disjunction of motifs may identify subfamilies in the training set. Each subfamily can be described specifically by its own motif. For instance, the graph in Fig. 3 shows motifs that are clustered into distinct groups. The clustering suggests the presence of several subfamilies in the training set. In fact, the training set, which consists of tubulins, can be divided biologically into subfamilies, and the various clusters in the figure correspond to motifs that cover α-tubulins only, β-tubulins only, both α- and β-tubulins, and α-, β- and γ-tubulins. We have developed methods for identifying subfamilies optimally using criteria from statistics and minimum description length principles. These methods are discussed in further detail in ref. 19.

The IDENTIFY Motif Database. We used the results of the above experiments to produce a motif database for evaluating individual sequences and searching sequence databases. At each level of specificity, we obtained approximately 10,000 motifs. The collective database of motifs is called the IDENTIFY database. The motifs are grouped according to the level of specificity for which they are optimal. For large databases requiring high specificity, motifs at the 10^-10 level are most appropriate. For smaller databases requiring less specificity, motifs at the 10^-6 level may be appropriate.

RESULTS

Unidentified ORFs from Yeast. We have applied the IDENTIFY database to predict functions in unidentified ORFs in Saccharomyces cerevisiae. At the time of the experiment (May 1997), there were 6,220 known ORFs in the yeast genome database (http://genome-www.stanford.edu/Saccharomyces), of which 833 had no confirmed function (26). We applied the IDENTIFY database to each translated ORF, and assigned a predicted function based on matches to motifs. Table 1 shows how many ORFs are identified by motifs at each level of specificity. For example, using the motifs at a specificity of 10^-10, we assigned putative functions to 61 ORFs. Forty-one of these had no annotation whatsoever, indicating that other methods (e.g., BLAST and PROSITE) had failed to identify any significant homology to a known protein. Based on the calculated specificity of the motifs, along with the number of motifs and size of the ORF database, the expected number of false positives is 0.02, so it is highly likely that all of the assignments are correct. Relaxing the procedure a little by using motifs with specificity at least 10^-9 produces 86 assignments, including 59 not previously annotated. Again, the expected number of false positives is less than one. At the other end of the spectrum, the 10^-7 set produced 172 predicted functions, but the expected number of false positives is 17.

FIG. 4. The number of motifs required to cover at least 90% of the protein family in the IDENTIFY database. EMOTIF was used to generate one or more motifs that cover at least 90% of all the sequences in each of 7,000 alignments in the BLOCKS or PRINTS databases at five different levels of specificity. Plotted are the number of motifs that are required to cover at least 90% of the sequences in the alignment.

To test these 172 predictions, we compared our results with those in the Sacch3D database (http://genome-www.stanford.edu/Sacch3D) (S. Chervitz, J. M. Cherry, and D. Botstein, personal communication). This database compares each of the translated ORFs in S. cerevisiae against proteins of known structure by using sensitive alignment and threading approaches. Of the 833 unidentified ORFs, 83 had functions assigned by Sacch3D alone, 124 had functions assigned by IDENTIFY alone, and 48 had functions assigned by both programs. Of the 48 functions assigned by both programs, all assignments were identical. Overall, 255 of the unidentified ORFs had a putative function assigned by one or both of the programs.
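The counts in this comparison can be checked with a few lines of arithmetic. The sketch reads the figure of 124 as ORFs assigned by IDENTIFY only, which is the reading under which the reported totals of 172 and 255 agree; the variable names are hypothetical.

```python
# Counts quoted in the text for the 833 yeast ORFs of unknown function.
sacch3d_only = 83
identify_only = 124   # read as "IDENTIFY but not Sacch3D" (an interpretation)
both = 48
unidentified_orfs = 833

assigned_by_either = sacch3d_only + identify_only + both
assigned_by_identify = identify_only + both
assigned_by_sacch3d = sacch3d_only + both
percent_assigned = round(100 * assigned_by_either / unidentified_orfs, 1)

if __name__ == "__main__":
    print(assigned_by_either, assigned_by_identify, assigned_by_sacch3d, percent_assigned)
```

Under this reading, the 172 IDENTIFY predictions split into 124 unique to IDENTIFY and 48 shared with Sacch3D, and about 31% of the previously unidentified ORFs receive a putative function from at least one program.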

We analyzed our results at the level of motifs. The BLOCKS and PRINTS databases often contain several sequence alignments for a given family of proteins. Each alignment corresponds to a different conserved segment of the protein. On average, these databases contain three sequence alignments per protein family. Therefore, a match of a sequence to several distinct motifs from the same family provides independent confirmations of the predicted function. In the 48 ORFs with functions assigned by both IDENTIFY and Sacch3D, the IDENTIFY database matched 137 distinct motifs. Of these 137 motif matches, 129 predicted the same function as Sacch3D. We believe that such independent predictions of function provide an indication of the reliability of motif matches by IDENTIFY.

Whole Genome Analysis. We applied IDENTIFY to search for functions in all ORFs in several genomes including S. cerevisiae, Haemophilus influenzae, and Methanococcus jannaschii. To assess the performance of IDENTIFY, we tested our assignments against the annotations for each genome as follows. For those ORFs with annotations, we extracted keywords from the description, ignoring common words such as protein, enzyme, and domain. We also extracted significant keywords from the associated entry for the motif from the BLOCKS or PRINTS sequence alignment databases. We considered an assignment correct if the significant keywords from the genomic annotation matched significant keywords from the alignment annotation. If there was no match, then the prediction was incorrect, or the annotations were either insufficient or described the same function differently. To decide among these alternatives, we examined each of the remaining predictions manually (4,647 in total over three genomes).
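The automatic keyword comparison can be sketched as a set intersection. The stop-word list contains only the three common words named in the text; the tokenization rule, the minimum word length and the example annotations are assumptions added for illustration, and the function names are hypothetical.

```python
import re
from typing import Set

# Common words named in the text; a real list would likely be longer.
COMMON_WORDS = {"protein", "enzyme", "domain"}


def significant_keywords(description: str, ignore: Set[str] = COMMON_WORDS) -> Set[str]:
    """Lower-case alphanumeric tokens, minus common words and very short tokens."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {w for w in words if w not in ignore and len(w) > 2}


def annotations_agree(genome_annotation: str, alignment_annotation: str) -> bool:
    """True if the two descriptions share at least one significant keyword."""
    shared = significant_keywords(genome_annotation) & significant_keywords(alignment_annotation)
    return bool(shared)


if __name__ == "__main__":
    print(annotations_agree("Tubulin alpha chain protein", "Tubulin signature domain"))
    print(annotations_agree("hypothetical protein", "protein kinase domain"))
```

A mismatch under such a rule does not by itself mean the prediction is wrong, since two annotations can describe the same function in different words, which is why the remaining cases were inspected manually.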

Table 1. Assignment of function to 833 yeast ORFs of unknown function

Specificity   # ORFs assigned   # ORFs assigned with no annotations   # Motifs assigned   Expected # of false motif assignments
10^-10        61                41                                    179                 0.02
10^-9         86                59                                    238                 0.2
10^-8         103               69                                    301                 1.7
10^-7         172               121                                   488                 17

Table 2 summarizes the predictions for the seven genomes by using motifs from IDENTIFY at different levels of specificity. For each genome and level of specificity, the third column shows the number of correct predictions, as determined by automatic keyword matches. The fourth column contains the number of predictions that could not be verified by automatic keyword matching, but were found to be correct by manual inspection. In the fifth column are the number of predictions that were not confirmed by the annotations. Many of these cases corresponded to ORFs without annotations, whereas other cases showed conflicts between the annotated function and the function predicted by IDENTIFY. The conflicting predictions may be incorrect or may perhaps be plausibly related to the annotated functions. The sixth column shows the number of incorrect predictions expected by chance, based on the number of motifs, their specificity, and the size of the genomes. In the bacterial genomes and in the yeast genome with the most specific motifs, there were fewer incorrect predictions than expected. The seventh column shows the number of ORFs for which a function was predicted correctly by IDENTIFY. This is different from the number of correct predictions, because each ORF may match several motifs in the database, each resulting in a predicted function. The eighth column shows the total number of ORFs in the entire genome, and the final column shows the percentage of ORFs for which a function was predicted by IDENTIFY.

Depending on the level of specificity used, the IDENTIFY program predicts functions that match the genomic annotation for 22–26% of ORFs in the yeast genome, 28–30% of the ORFs in H. influenzae, and 9–11% of the ORFs in M. jannaschii. The relatively few predictions for M. jannaschii may be because of its evolutionary divergence from those species that have been sequenced more extensively. In addition, the IDENTIFY program predicts several functions that are not confirmed by the genome annotations. Based on a 10^-9 level of specificity, we predict novel functions in 31 ORFs in yeast, 33 ORFs in H. influenzae, and 21 ORFs in M. jannaschii. On average, three motifs are assigned to each ORF that is identified. These motifs often represent distinct BLOCKS or PRINTS alignments from a single protein family, thus supporting each other in the assignment of a particular function to a protein. Because these motifs often confirm or support each other, the probability of a false positive prediction is likely to be much less than that of a single motif match.

DISCUSSION

Principled Motif Generation. Motifs, including those in the PROSITE database (17, 18), generally have been generated manually. In this paper, we introduce a method for generating motifs automatically. Automated methods are becoming increasingly important as sequence databases grow. An automated method requires knowledge about sequence conservation. For EMOTIF, this knowledge is encoded as an allowed set of amino acid substitution groups. Although we have presented an empirical analysis that supports a certain set of groups (Fig. 1), the algorithm may be easily adapted to use other sets of amino acid substitution groups. For instance, substitution groups based on chemical principles (27, 28) may be appropriate in certain cases.

Other researchers have generated motifs from a predefined set of substitution groups (29, 30), but these sets of allowable groups often have been too limited. Previous sets of substitution groups generally have been mutually exclusive, meaning that each amino acid may belong to only a single group. In contrast, we use overlapping groups, which allows each amino acid to belong to more than one group. This is biologically appropriate, because each amino acid has several properties and can serve different functions, depending on the biochemical context. In some contexts, the size of an amino acid may be critical; in others, its charge may be the conserved property.

Table 2. Genomes scanned by using IDENTIFY

Genome         Specificity  Total motifs  Motifs     Assignments  Expected     ORFs        Total   % of total
                            assigned &    verified   unverified   false        identified  ORFs    ORFs
                            verified      manually                assignments                      identified
S. cerevisiae  10^-10       4,442         909        9            0            1,345       6,220   22%
               10^-9        4,679         1,027      31           5            1,466               24%
               10^-8        4,994         1,114      124          42           1,621               26%
H. influenzae  10^-10       1,804         644        11           0            479         1,697   28%
               10^-9        1,899         703        33           0            503                 30%
M. jannaschii  10^-10       349           115        3            0            157         1,680   9%
               10^-9        403           135        21           0            192                 11%
M. genitalium  10^-10       297           75         4            0            96          467     21%
               10^-9        331           87         7            0            108                 23%
Syn. sp.       10^-10       1,369         389        21           2            447         3,169   14%
               10^-9        1,569         461        34           20           513                 16%
M. pneumoniae  10^-10       304           75         6            0            101         677     15%
               10^-9        350           89         8            0            117                 17%
H. pylori      10^-10       476           100        16           0            200         1,566   13%
               10^-9        576           121        18           0            233                 15%

The ORFs encoded in the genomes of S. cerevisiae, H. influenzae, and M. jannaschii were scanned by using the IDENTIFY database. The motif assignments then were verified as described in the text. The number and percentage of ORFs identified by these motif assignments also were calculated. On average, approximately three motifs were assigned to each ORF that was identified.
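As a consistency check on Table 2, the final column can be recomputed from the "ORFs identified" and "Total ORFs" columns. The short sketch below does this; the tuple layout and function name are choices made here for illustration, while the numbers are taken from the table.

```python
# (genome, specificity exponent, ORFs identified, total ORFs, reported %)
TABLE_2 = [
    ("S. cerevisiae", -10, 1345, 6220, 22),
    ("S. cerevisiae", -9, 1466, 6220, 24),
    ("S. cerevisiae", -8, 1621, 6220, 26),
    ("H. influenzae", -10, 479, 1697, 28),
    ("H. influenzae", -9, 503, 1697, 30),
    ("M. jannaschii", -10, 157, 1680, 9),
    ("M. jannaschii", -9, 192, 1680, 11),
    ("M. genitalium", -10, 96, 467, 21),
    ("M. genitalium", -9, 108, 467, 23),
    ("Syn. sp.", -10, 447, 3169, 14),
    ("Syn. sp.", -9, 513, 3169, 16),
    ("M. pneumoniae", -10, 101, 677, 15),
    ("M. pneumoniae", -9, 117, 677, 17),
    ("H. pylori", -10, 200, 1566, 13),
    ("H. pylori", -9, 233, 1566, 15),
]


def percent_identified(identified, total):
    """Percentage of ORFs with a predicted function, rounded to a whole percent."""
    return round(100 * identified / total)


# Rows whose recomputed percentage differs from the reported one.
mismatches = [row for row in TABLE_2 if percent_identified(row[2], row[3]) != row[4]]
print(mismatches)  # [] : every reported percentage agrees with the counts
```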

By using only an allowed set of substitution groups, we avoid the problem of overfitting, which occurs commonly when motifs are generated manually. Overfitting occurs when a motif is designed to cover all variability in a training set, even when such variability may be caused by errors or may not be biologically meaningful. Errors in training sets may arise for a variety of reasons: (i) the sequence data may contain errors, including insertions, deletions, or substitutions; (ii) one or more sequences may be misaligned; (iii) the sequences may be contaminated, meaning that some sequences in the alignment may not truly belong to a particular family; or (iv) the family may contain subfamilies or subclasses, each of which may generalize well individually, but not together. Biologically meaningless variation occurs when the observed variation is caused by mutations that do not affect the structure or function of the protein. For instance, if a position in a protein family were to contain one example each of alanine, cysteine, and valine, the observed variation likely would be biologically meaningless because we know of no chemical or physical reasons that these three amino acids should be conserved together. Therefore, a motif that contains the group [ACV] would be an example of overfitting the data. A biologically meaningful generalization of the observed variation would depend on the available substitution groups. In our set of substitution groups, these three amino acids would be generalized by the wild-card character.
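The generalization rule in this paragraph can be illustrated with a small sketch: a column of observed residues is replaced by the smallest allowed group that contains all of them, and by the wild card if no allowed group does. The groups listed here are hypothetical stand-ins (the empirically derived set of Fig. 1 is not reproduced), and the function name and the "." wild-card symbol are assumptions.

```python
# Illustrative, overlapping substitution groups only. For example, F belongs to
# both "FWY" and "FILMV". This is NOT the set of groups derived in the paper.
ALLOWED_GROUPS = ["AG", "ST", "DE", "KR", "NQ", "FWY", "ILMV", "FILMV"]
WILDCARD = "."


def generalize_column(residues):
    """Generalize the residues observed at one alignment position.

    Returns the residue itself if the position is invariant, the smallest
    allowed group covering all observed residues if one exists, and the
    wild card otherwise.
    """
    observed = set(residues)
    if len(observed) == 1:
        return next(iter(observed))
    candidates = [g for g in ALLOWED_GROUPS if observed <= set(g)]
    if not candidates:
        return WILDCARD
    return "[" + min(candidates, key=len) + "]"


print(generalize_column("KKK"))  # K
print(generalize_column("ILV"))  # [ILMV]
print(generalize_column("ACV"))  # .  (no allowed group holds A, C, and V)
```

Because an ad hoc group such as [ACV] can never be emitted, the motif cannot be fitted to arbitrary variation in the training set.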

Nevertheless, groups that are difficult to interpret biologically, such as [ACV], occur frequently in PROSITE. In that database, motifs are constructed by using 867 distinct amino acid substitution groups. A few groups are used frequently, such as [ILMV], which occurs 826 times in PROSITE. In fact, the 20 most frequently used groups account for 60% of the groups used by motifs in PROSITE. On the other hand, the vast majority of distinct groups (more than 70%) occur in only a single motif, and an additional 13% of groups occur in only two motifs. These groups are probably examples of overfitting.

Overfitting is of concern in machine learning, because at some point, further fitting of the training set worsens performance on future test sets. For example, the group [ACV] may cover the training set entirely, but it does not allow for any other amino acid at that position, which may worsen predictive power if, in fact, there is no true conservation at that position.

Enumeration Strategy. EMOTIF uses an enumeration strategy that generates all possible motifs for a given protein family. It is somewhat surprising that, in most cases, EMOTIF is able to enumerate all motifs within a few seconds. Most enumeration strategies in computer science are impractical because the space of solutions is typically so large that a complete enumeration cannot be performed in tractable time. In fact, in an early version of a motif generating program called SeqClass (31), we used a heuristic search strategy to find the single best motif. However, heuristic search strategies are not guaranteed to find the globally optimal solution. On the other hand, an enumerative strategy, if tractable, will guarantee an optimal solution. The tractability of EMOTIF relies on the fact that sequences in a protein family are related, so a single motif may be the most specific one for many different subsets of the training set. Therefore, the space of possible motifs often is limited in practice by the amount of variability possible in the protein family. For additional efficiency, EMOTIF sets a lower limit on coverage of the training set; motifs that cover less than 30% of the training set are not enumerated. The value of 30% still enables EMOTIF to recognize up to three equal-sized subfamilies, each of which covers about 33% of the training set.

Enumeration affords three major advantages over heuristic search. First, as mentioned above, it guarantees finding the optimal motif for a particular criterion. Second, an enumeration approach finds optimal motifs for multiple criteria simultaneously. For example, EMOTIF provides optimal motifs for a wide range of specificities, each of which may be useful for a particular task. For instance, scanning an entire database may require highly specific motifs, whereas characterizing a single protein sequence may require motifs with much lower specificity. A single run of EMOTIF on a single protein family will find the optimal motif at each level of specificity in advance. We have exploited this advantage in constructing the IDENTIFY database, which provides optimal motifs at different levels of specificity for different tasks.

The third advantage of an enumeration strategy is that it produces a two-dimensional graph, such as in Fig. 3, which characterizes variability in a protein family. The graph provides clues about possible subfamilies, as exemplified by the α-, β-, and γ-tubulins. In addition, the shape of the Pareto-optimal line also gives insight into the structure of the set of sequences. Bulges in the line toward the lower right indicate clusters of sequences, whereas a hyperbolic line along the top and left of the graph results from sequences that form no discernible clusters. Finally, the graph helps users view the tradeoff between coverage and specificity for various motifs and allows them to select motifs interactively.
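The Pareto-optimal line mentioned here can be computed from an enumerated list of motifs by discarding every motif that another motif beats on both criteria. The sketch below is one simple way to do this and is not taken from EMOTIF; the motif names and numbers are invented, and specificity is represented as an assumed chance-match probability (smaller means more specific).

```python
def pareto_optimal(motifs):
    """Return the names of motifs not dominated by any other motif.

    motifs: list of (name, coverage, chance_match_probability).
    Motif A dominates motif B if A has coverage at least as high and a
    chance-match probability at least as low, and is strictly better in one.
    """
    front = []
    for name, cov, prob in motifs:
        dominated = any(
            (c >= cov and p <= prob) and (c > cov or p < prob)
            for _, c, p in motifs
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical enumerated motifs for one protein family.
candidates = [
    ("m1", 1.00, 1e-6),
    ("m2", 0.80, 1e-9),
    ("m3", 0.80, 1e-7),   # dominated by m2 (same coverage, less specific)
    ("m4", 0.35, 1e-12),
    ("m5", 0.30, 1e-10),  # dominated by m4 (lower coverage, less specific)
]
print(pareto_optimal(candidates))  # ['m1', 'm2', 'm4']
```

Each surviving motif is the most specific one available at its level of coverage, which is why a single enumeration run can serve tasks that need different specificities.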

Assigning Function to Novel Proteins. The motifs in the IDENTIFY database are particularly valuable for assigning function to newly sequenced proteins, either individually or in large-scale searches. Motifs are particularly well-suited to large-scale searching tasks. Motifs can be used to search a database very quickly, and many fast algorithms for performing regular expression searches exist. In addition, because motifs in the IDENTIFY database are characterized by their specificity, a search using motifs can be tailored to provide maximum sensitivity for a given desired level of specificity and to minimize false positives.
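Because a motif built from substitution groups and wild cards is a regular expression, a scan can be sketched with a standard regex library. The example below uses Python's `re` module; the motif, sequence identifiers, and sequences are invented for illustration and do not come from the IDENTIFY database.

```python
import re


def scan(motif, sequences):
    """Return (sequence id, 0-based start) for every motif occurrence.

    The motif is wrapped in a lookahead so that overlapping occurrences
    are also reported.
    """
    pattern = re.compile("(?=" + motif + ")")
    hits = []
    for seq_id, seq in sequences.items():
        for match in pattern.finditer(seq):
            hits.append((seq_id, match.start()))
    return hits


# Hypothetical motif: a group, two wild cards, an invariant G, two groups.
motif = "[ILMV]..G[KR][ST]"
orfs = {
    "orf1": "MAVLSGGKSTLLA",
    "orf2": "MKTAYIAKQR",
}
print(scan(motif, orfs))  # [('orf1', 3)]
```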

Each motif also is linked to the BLOCKS or PRINTS databases, which describe the family of proteins from which it was derived. Because these protein families typically have several members, a match to a motif may provide an association with several other members of the family. In addition, when a match to a motif is obtained, that motif may be used to search sequence databases, such as SWISSPROT and GenPept, for other proteins that share this motif. This function, which is implemented in IDENTIFY, provides all sequences that may share a closely related form of the motif and thereby represent a particular subfamily containing the motif.

More importantly, most families in the PRINTS and BLOCKS databases are represented by several motifs, each corresponding to a different conserved region of the family. On average, each family has 3–4 conserved regions. The presence of multiple conserved regions increases the sensitivity of a search using motifs. Furthermore, they provide additional certainty about a functional assignment, above the statistical estimate of significance, when several independent motifs match a given unknown sequence.

Motifs, such as those in IDENTIFY, are useful for assigning functions to proteins even in the absence of any homology apart from the limited motif regions. Unlike similarity search methods that weight every position in a sequence alignment to some extent, motifs evaluate only those positions that show conservation in the training set. Hence, motifs can discover function and assign a protein to a family even if that protein is so distantly related that it shows no sequence similarity outside the motifs. This explains why IDENTIFY can assign function to 172 proteins from the yeast genome that have no significant homology to any known protein. The frequency with which IDENTIFY assigns function to these nonhomologous proteins (172/833=21%) is somewhat less than the frequency with which IDENTIFY assigns function to the bulk of the yeast proteins (1,621/6,220=26%). The ability of motifs to assign function by using only homology at particular positions makes them particularly useful for evaluating newly sequenced genomes such as M. jannaschii, most of whose proteins are not homologous to those of other organisms.

Currently, IDENTIFY assigns function to about 25–30% of novel protein sequences. This limit reflects, among other things, the fraction of newly sequenced proteins that share at least one motif with a current protein family present in the BLOCKS or PRINTS databases. As more genomes are sequenced and more protein families are defined in these databases, IDENTIFY should be able to assign function to a larger fraction of proteins. Despite this current limitation, IDENTIFY is a valuable tool for assignment of function to newly sequenced proteins, especially in those cases where there are no significant sequence similarities by alignment, profile, or hidden Markov methods.

Availability. Access to the EMOTIF and IDENTIFY programs is available over the Internet at http://motif.stanford.edu/emotif and http://motif.stanford.edu/identify. Nonprofit institutions wishing to install the programs locally may send requests to D.L.B. ([email protected]). Commercial and for-profit institutions can license the programs from Pangea Systems Inc. or from Stanford’s Office of Technology Licensing.

This work was supported by a grant from SmithKline Beecham Pharmaceuticals and by Grant LM 05716 from the National Library of Medicine. T.D.W. is a Howard Hughes Medical Institute Physician Postdoctoral Fellow.

1. Scharf, M., Schneider, R., Casari, G., Bork, P., Valencia, A., Ouzounis, C. & Sander, C. (1994) ISMB 2, 348–353.
2. Casari, G., Ouzounis, C., Valencia, A. & Sander, C. (1996) in GeneQuiz II: Automatic Function Assignment for Genome Sequence Analysis, Pacific Symposium and Biocomputing, 1996 (World Scientific, Kohala Coast, HI), pp. 707–709.
3. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Nucleic Acids Res. 25, 3389–3402.
4. Sonnhammer, E.L., Eddy, S.R. & Durbin, R. (1997) Proteins 28, 405–420.
5. Attwood, T.K., Beck, M.E., Bleasby, A.J. & Parry-Smith, D.J. (1994) Nucleic Acids Res. 22, 3590–3596.
6. Krogh, A., Brown, M., Mian, I.S., Sjolander, K. & Haussler, D. (1994) J. Mol. Biol. 235, 1501–1531.
7. Henikoff, J.G. & Henikoff, S. (1996) Methods Enzymol. 266, 88–105.
8. Gribskov, M. & Veretnik, S. (1996) Methods Enzymol. 266, 198–211.
9. Attwood, T.K., Beck, M.E., Bleasby, A.J., Degtyarenko, K., Michie, A.D. & Parry-Smith, D.J. (1997) Nucleic Acids Res. 25, 212–217.
10. Holm, L. & Sander, C. (1994) Nucleic Acids Res. 22, 3600–3609.
11. Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536–540.
12. Holm, L. & Sander, C. (1995) Trends Biochem. Sci. 20, 478–480.
13. Brenner, S.E., Chothia, C., Hubbard, T.J.P. & Murzin, A.G. (1996) Methods Enzymol. 266, 635–642.
14. Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B. & Thornton, J.M. (1997) Structure 5, 1093–1108.
15. Holm, L. & Sander, C. (1997) Nucleic Acids Res. 25, 231–234.
16. Hubbard, T.J.P., Murzin, A.G., Brenner, S.E. & Chothia, C. (1997) Nucleic Acids Res. 25, 236–239.
17. Bairoch, A. & Apweiler, R. (1997) Nucleic Acids Res. 25, 31–36.
18. Bairoch, A., Bucher, P. & Hofmann, K. (1997) Nucleic Acids Res. 25, 217–221.
19. Nevill-Manning, C., Sethi, K., Wu, T.D. & Brutlag, D.L. (1997) ISMB-97 4, 202–209.
20. Henikoff, S., Henikoff, J.G., Alford, W.J. & Pietrokovski, S. (1995) Gene 163, GC17–GC26.
21. Hopcroft, J.E. & Ullman, J.D. (1979) Introduction to Automata Theory, Languages, and Computation (Addison-Wesley, Reading, MA).
22. Henikoff, J.G., Pietrokovski, S. & Henikoff, S. (1997) Nucleic Acids Res. 25, 222–225.
23. Schneider, R., de Daruvar, A. & Sander, C. (1997) Nucleic Acids Res. 25, 226–230.
24. Jain, A.K. & Dubes, R.C. (1988) Algorithms for Clustering Data (Prentice Hall, Englewood Cliffs, NJ).
25. Wu, T.D. & Brutlag, D.L. (1996) ISMB-96 3, 230–240.
26. Cherry, J.M., Ball, C., Weng, S., Juvik, G., Schmidt, R., Adler, C., Dunn, B., Dwight, S., Riles, L., Mortimer, R.K. & Botstein, D. (1997) Nature (London) 387, 67–73.
27. Kidera, A., Yonishi, Y., Masahito, O., Ooi, T. & Scheraga, H.A. (1985) J. Protein Chem. 4, 23–55.
28. Nakai, M., Kidera, A. & Kanehisa, M. (1988) Protein Eng. 2, 93–100.
29. Smith, R.F. & Smith, T.F. (1990) Proc. Natl. Acad. Sci. USA 87, 118–122.
30. Saqi, M.A. & Sternberg, M.J. (1994) Protein Eng. 7, 165–171.
31. Wu, T.D. & Brutlag, D.L. (1995) ISMB 3, 402–410.


Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5872–5879, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

A statistical mechanical model for β-hairpin kinetics

VICTOR MUÑOZ,* ERIC R. HENRY, JAMES HOFRICHTER, AND WILLIAM A. EATON*
Laboratory of Chemical Physics, Building 5, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892–0520

ABSTRACT Understanding the mechanism of protein secondary structure formation is an essential part of the protein-folding puzzle. Here, we describe a simple statistical mechanical model for the formation of a β-hairpin, the minimal structural element of the antiparallel β-pleated sheet. The model accurately describes the thermodynamic and kinetic behavior of a 16-residue, β-hairpin-forming peptide, successfully explaining its two-state behavior and apparent negative activation energy for folding. The model classifies structures according to their backbone conformation, defined by 15 pairs of dihedral angles, and is further simplified by considering only the 120 structures with contiguous stretches of native pairs of backbone dihedral angles. This single sequence approximation is tested by comparison with a more complete model that includes the 2^15 possible conformations and 15×2^15 possible kinetic transitions. Finally, we use the model to predict the equilibrium unfolding curves and kinetics for several variants of the β-hairpin peptide.
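The state counts quoted in the abstract can be reproduced by brute-force enumeration. In the sketch below each of the 15 dihedral angle pairs is either native (1) or nonnative (0); the function name is an assumption, and the count of 120 is taken to mean stretches containing at least one native pair, so the fully nonnative state is not included in it.

```python
from itertools import product

N_PAIRS = 15  # pairs of backbone dihedral angles in the 16-residue peptide


def is_single_sequence(conf):
    """True if the native pairs (1s) form exactly one contiguous stretch."""
    trimmed = "".join(map(str, conf)).strip("0")
    return len(trimmed) > 0 and "0" not in trimmed


all_conformations = list(product((0, 1), repeat=N_PAIRS))
n_full = len(all_conformations)                                 # 2**15
n_single = sum(is_single_sequence(c) for c in all_conformations)

# In the complete model every conformation can change any one of its
# 15 pairs in an elementary step.
n_transitions = N_PAIRS * n_full

print(n_single, n_full, n_transitions)  # 120 32768 491520
```

The single sequence count also follows from the closed form N(N+1)/2 with N = 15.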

As is evident from the presentations at this Colloquium, the continuous discovery of thousands of new gene sequences is producing a revolution in all aspects of protein physics, chemistry, and biology. Foremost among these is the protein-folding problem. C. B. Anfinsen, in his Nobel Prize winning experiments at the National Institutes of Health (1), showed that a denatured protein can refold spontaneously to form a biologically functional (native) structure. From this result, Anfinsen concluded that the information for determining the three-dimensional structure is somehow encoded in the amino acid sequence. This work has led to the realization that it should in principle be possible to calculate the three-dimensional structure of a protein from its amino acid sequence. Calculating the structure from the sequence has become known as the first part of the protein-folding problem and currently engages a large number of theoretical and computational scientists. The second part of the protein-folding problem is to understand how a protein folds. That is, what are the kinetics and mechanism (or mechanisms) of protein folding? This question is in many ways more challenging because for in vitro folding the ultimate answer is a description of the distribution of three-dimensional structures as a function of time, as the polypeptide progresses from a nearly random set of structures to the unique, compact native protein. An additional motivation for kinetic studies is their relation to the evolution of protein sequences. Evolution preserves protein sequences that correspond to structures with functions that are important to the organism. Theoretical studies by Wolynes and coworkers (2) have suggested how rapid folding to the native structure is yet another evolutionary pressure.

The experimental investigation of the kinetics and mechanism of protein folding has been aided by several recent theoretical and technological advances. The theoretical advances include analytical approaches (2–4), simulations of simplified representations of proteins (2, 5–8), and all-atom molecular dynamics calculations (9–11). This work has painted a comprehensive picture of possible general mechanisms and has provided a framework for experimentalists to think more clearly about the problem. It also has helped define questions, design new experiments, and interpret experimental results. Important technological advances include the availability of a great variety of materials from protein engineering and peptide synthesis, the development of more rapid kinetic methods (12, 13), and increased computer power. The combination of these advances now permits the development of an “aufbau” approach to protein folding. This approach starts with the investigation of isolated secondary structural elements: α-helices, β-structures, and loops. The relative simplicity of these elements should permit their mechanism of formation to be described in much greater detail than is possible for proteins. Such studies include the development of statistical mechanical models which quantitatively reproduce equilibrium populations and kinetic progress curves. Once the kinetics and mechanism of the elements are understood, it should be possible to investigate structures of increasing size and complexity.

We have begun to study secondary structural elements by using nanosecond-resolved kinetic methods and statistical mechanical modeling (14). The thermodynamic and kinetic behavior of the α-helix has been studied for more than 40 years (15–20). Only recently, however, have kinetic measurements been made on helices of size and composition comparable with those found in proteins (21–23). Also, early theoretical studies (16, 17) were limited by the lack of computer power, preventing the detailed modeling of experimental kinetic data on helix formation that is now possible (13). The experimental and theoretical study of the kinetics of loops and β-structures is a new subject. Jones et al. (24) and Hagen et al. (25, 26) used a nanosecond photochemical triggering method to study loop formation in cytochrome c by determining the diffusion-limited rate for an intramolecular ligand-binding reaction. We also recently reported a thermodynamic and kinetic study of a β-hairpin formed by the 16 C-terminal residues of streptococcal protein G B1 (Fig. 1) (27). This peptide had been shown to adopt the β-hairpin conformation by Blanco et al. (29) using NMR spectroscopy. Our β-hairpin experiments consisted of measuring the thermal unfolding curve for the 16-residue peptide between 273 K and 363 K and measuring the relaxation kinetics following 15-degree nanosecond laser temperature jumps to final temperatures ranging from 288 K to 328 K (27). The three principal experimental results from this study were:

*To whom reprint requests may be addressed: e-mail: [email protected] or [email protected]–8424/98/955872–8$0.00/0PNAS is available online at http://www.pnas.org.


(i) the β-hairpin peptide exhibits two-state behavior in both its equilibrium and kinetics; (ii) the apparent activation energy for the folding rate calculated from the two-state analysis is negative; and (iii) the rate of β-hairpin formation is much (>10-fold) slower than that of the α-helices that have been studied up to now in short peptides.

FIG. 1. Chemical, structural, and schematic representations of the β-hairpin. The sequence corresponds to the C-terminal fragment containing residues 41–56 of protein G B1 (28). Dashed lines indicate hydrogen bonds or hydrophobic interactions.

To explain these results, we used a simple statistical mechanical model which was only briefly described (27). Here, we present a detailed description of the model, test one of its major approximations, and use the model to predict kinetic and equilibrium properties expected for other β-hairpin-forming peptides. We shall see that analysis of β-hairpin thermodynamics and kinetics addresses many of the same issues that arise in considering the folding of a small protein.

Description of the Model. Our objective has been to develop a model for protein secondary structure kinetics, which can be used to analyze experimental data and to predict new experiments. In this work, the model is applied to a β-hairpin, but it also can be applied to helices and is readily adapted for more complex structures. We adopt a description that uses pairs of φ,ψ dihedral angles to define the conformation of each molecule; the complete native structure is formed when all of the residues have native values for these angles. Formation of the native structure is opposed by the loss of conformational entropy and favored by the formation of stabilizing interactions, i.e., hydrogen bonds and hydrophobic interactions (Fig. 1). The model postulates that two groups interact only when all of the dihedral angles of the sequence connecting them are native. This restriction considerably simplifies the model by identifying three-dimensional structures with sequences of peptide bond conformations.

A second simplifying step is to consider only two conformations for the backbone dihedral angles, native and nonnative (in a spirit similar to the “correct” and “incorrect” parameter of the Zwanzig model; ref. 30). The nonnative conformation of a dihedral angle pair is not a unique conformation but is the set of all conformations that are incompatible with the native structure. An additional feature of the model is that pairs of φ,ψ dihedral angles are assumed to rotate between native and nonnative values simultaneously.† We chose the dihedral angles ψ of residue i and φ of residue i + 1 (Fig. 2) so that the peptide bond, rather than the residue, is the conformational unit. Formation of a backbone-backbone hydrogen bond is therefore associated with the transformation of one pair of ψi, φi+1 angles in each β strand from nonnative to native values.
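The postulate that two groups interact only when every conformational unit between them is native can be written as a one-line test. The following sketch assumes a 0-based indexing in which unit k is the peptide bond linking residues k and k+1; the function name and the example conformation are illustrative choices, not part of the published model.

```python
def can_interact(native, i, j):
    """True if residues i and j (0-based) are allowed to interact.

    native[k] is True when the peptide-bond unit linking residues k and k+1
    has native values for its pair of dihedral angles. The interaction is
    allowed only if every unit between the two residues is native.
    """
    lo, hi = sorted((i, j))
    return all(native[lo:hi])


# 16 residues give 15 peptide-bond units; here units 5..9 are native.
native = [False] * 15
for k in range(5, 10):
    native[k] = True

print(can_interact(native, 5, 10))  # True: units 5-9 are all native
print(can_interact(native, 3, 12))  # False: units 3, 4, 10, 11 are nonnative
```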

FIG. 2. Choice of dihedral angle pairs for motion in elementary kinetic steps.

In our thermodynamic description of the β-hairpin, we consider only three factors. These are the stabilizing effect of the hydrogen bonds between the backbone carbonyl and amide of the N- and C-terminal β strands, the stabilizing effect of the three hydrophobic interactions among the four side chains of the hydrophobic cluster (Fig. 1), and the destabilizing effect of the loss of conformational entropy when fixing pairs of dihedral angles in the native hairpin conformation. Nonnative interactions, such as wrong hydrogen bonds or hydrophobic interactions, are ignored. We also ignore electrostatic interactions among the charged side chains and chain termini (their importance could be assessed by experiments on the ionic strength dependence of the equilibrium and kinetics, which have not yet been performed). Each thermodynamic factor is considered to be homogeneous, i.e., independent of side chain and position in the native structure. We assume that the free energies of formation for each of the three hydrophobic interactions, ∆Gsc, are identical. Each of the backbone-backbone hydrogen bonds, including the one in the turn region, is assumed to have the same free energy, ∆Ghb. The conformational entropy loss for the strand and turn regions also is assumed to be the same (∆Sconf), which is equivalent to assuming that the residues in the turn have a propensity for this conformation equal to the propensity of the strand residues to be in a strand conformation. To further reduce the number of parameters, we assume that the hydrogen bond is purely enthalpic, i.e., ∆Ghb=∆Hhb, and that the hydrophobic interactions are temperature-independent over the temperature range studied.

†When pairs of dihedral angles are used instead of single dihedral angles, the specification of a pair of angles produces a problem in phasing between the loss of entropy and the compensating decrease in interaction free energy. Either choice of φ,ψ pairs represents a compromise. This can be illustrated by considering the formation of a six-residue β-hairpin with a side-chain interaction between residues two and five. To form the backbone-backbone hydrogen bond requires native values for four dihedral angles, φ3,ψ3,φ4,ψ4. If we were only concerned with hydrogen bond formation, as in helix-coil theory for homopolypeptides, then the natural choice for the dihedral angle pairs would be the φ and ψ associated with the same residue, in this case the two pairs φ3,ψ3 and φ4,ψ4. With this choice, however, formation of the side-chain interaction between residues two and five requires that eight dihedral angles assume native values, when only six, i.e., ψ2,φ3,ψ3,φ4,ψ4, and φ5, actually are required. So, in choosing ψi,φi+1 instead of φi,ψi pairs, we overestimate the loss in entropy associated with formation of the first hydrogen bond, in favor of accurately representing the compensation between entropy loss and formation of side-chain interactions in subsequent steps.

For the elementary kinetic steps (motion of individual dihedral angle pairs), we choose a transition state that can be described in terms of the equilibrium thermodynamic parameters. It is natural to assume that there is an entropy barrier to forming a native dihedral angle pair, so we equate the entropy of activation to the equilibrium entropy loss. For some steps, native dihedral angle pair formation is not associated with stabilizing interactions, whereas in others it is associated with the formation of hydrogen bonds or both hydrogen bonds and hydrophobic interactions. We assume that all native interactions are broken in the transition state. Also, we include the possibility that these steps have an activation barrier, Eo, in addition to the barriers imposed by the equilibrium free energy changes.

We must next decide how to treat the temperature dependence of the prefactor for these kinetic steps because it will have a significant effect on the height of the potential energy barrier required to fit the kinetic data. In investigating the viscosity dependence of the conformational relaxation rate of myoglobin, Ansari et al. (31, 32) found that the data could be well represented by a preexponential factor proportional to 1/(σ+η), where η is the solvent viscosity and σ is the contribution to the effective friction from interacting protein atoms (4 cP in myoglobin). A much greater fraction of the β-hairpin peptide atoms interact with solvent, so we expect σ to be smaller. Simulations of β-hairpin formation by Klimov and Thirumalai (33) suggest a 1/η dependence (σ=0) so, in the absence of direct experimental data, we use a prefactor proportional to 1/η. The net result is that the model is completely defined by only five parameters: three equilibrium parameters, ∆Hhb, ∆Gsc, and ∆Sconf, and two kinetic parameters, ko(To), the preexponential factor at the reference temperature, and an activation energy, Eo.

A final, major simplifying feature in the model is the single sequence approximation first used by Schellman (34) in describing the helix-coil equilibrium and recently by us in describing helix-coil kinetics of a 21-residue peptide (23). In the single sequence approximation, only species with a contiguous run of native peptide bonds are considered. All other structures are ignored. For the β-hairpin peptide, which has 16 residues (15 peptide bonds), there are 2^15 (=32,768) possible molecular conformations. The single sequence approximation reduces this number to 121. In helix-coil theory, the justification for the single sequence approximation is the expectation that for short polypeptides there is a low probability of nucleating more than one stretch of helix in any individual molecule. For the β-hairpin, we give the justification a posteriori by comparing with a more complete model in which the approximation is not made.
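The species counts quoted above can be checked by direct enumeration. The following is a minimal sketch, assuming each of the 15 peptide bonds is labeled either h (native) or c (nonnative) and that a "single sequence" species is one with at most one contiguous run of h; the function and variable names are illustrative only.

```python
from itertools import product

N_BONDS = 15  # peptide bonds in the 16-residue hairpin


def count_native_stretches(conformation):
    """Count maximal runs of 'h' in a sequence of 'h'/'c' characters."""
    return sum(1 for run in "".join(conformation).split("c") if run)


total = 0
single_sequence = 0
for conformation in product("ch", repeat=N_BONDS):
    total += 1
    # Keep the all-coil species (0 runs) and species with exactly one run.
    if count_native_stretches(conformation) <= 1:
        single_sequence += 1

ignored = total - single_sequence
print(total, single_sequence, ignored)  # 32768 121 32647
```

The count of 121 also follows in closed form: there are n(n+1)/2 = 120 ways to place one contiguous stretch on n = 15 bonds, plus the all-coil species.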

Partition Functions. The nonnative conformation of the peptide bond (coil, c) is taken as the reference state and assigned a weight of 1. The weight of a peptide bond in the native conformation (hairpin, h) is exp(∆Sconf/R), and the weight for a single stretch of j contiguous native peptide bonds, starting with peptide bond i [i.e., the ψ of residue i and φ of residue i+1 (Fig. 2)], is:

$$w_{j,i}=\exp\left[-\frac{\Delta G_{j,i}-jT\Delta S_{conf}}{RT}\right];\qquad \Delta G_{j,i}\equiv p\,\Delta H_{hb}+q\,\Delta G_{sc} \tag{1}$$

where p is the number of backbone-backbone hydrogen bonds and q is the number of side-chain hydrophobic interactions in the native stretch. In this model, there are 2^15 conformations for the 16-residue hairpin arising from all of the possible combinations of hs and cs. The weight of each of these conformations is simply the product of the weights of each of the native stretches that it contains, and the partition function is the sum of the 2^15 weights.

The model can be greatly simplified by considering only those species that contain a single stretch of native peptide bonds (the “standard” single sequence approximation). This simplification results in a model with only 121 species with the partition function:

[2]

where n+1 is the total number of residues (16 in this β-hairpin). The equilibrium probability of the all-coil conformation is P0,0=1/Q, and the equilibrium probability for all other conformations is Pj,i=wj,i/Q.
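A form of the standard single sequence partition function consistent with these definitions can be sketched as follows. The index limits are an assumption made here: a stretch of j native peptide bonds starting at bond i must fit within the n peptide bonds, so i runs from 1 to n−j+1.

```latex
Q = 1 + \sum_{j=1}^{n}\sum_{i=1}^{n-j+1} w_{j,i},
\qquad
P_{0,0} = \frac{1}{Q},
\qquad
P_{j,i} = \frac{w_{j,i}}{Q}
```

With n = 15 the double sum has n(n+1)/2 = 120 terms, which together with the all-coil term of weight 1 gives the 121 species of the model.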

To test the accuracy of this standard single sequence approximation, we compared the equilibrium curves of the model with and without this approximation in Fig. 3. The approximation significantly overestimates the fraction of folded hairpin. This problem arises because the standard single sequence approximation does not properly account for the entropy of the system, as has been discussed by Qian and Schellman (35) for the helix-coil transition. The population of each of the 32,647 ignored species [such as cchcchcccchcccc with a weight of exp(3∆Sconf/R)] is quite small, but because their number is large, their contribution to the entropy of the system is significant. In particular, most of the ignored species do not contain significant hairpin structure, and ignoring them underestimates the stability of the unfolded hairpin. The number of species ignored by the standard single sequence approximation grows geometrically with peptide length, precluding its application to molecules of different length.

FIG. 3. Comparison of thermal unfolding curve for the β-hairpin predicted by standard single sequence (121 species) and complete (32,768 species) models. The fractional population of molecules containing the intact hydrophobic cluster is plotted vs. temperature. The points are derived from a two-state analysis of the fluorescence equilibrium curves. The dashed curve is the fit to the data using the standard single sequence (121-state model) partition function (Eqs. 1 and 2). The continuous curve is predicted by the 2^15-state partition function using the parameters from the fit with the standard single sequence model (∆Sconf=–3.09 cal mol⁻¹ K⁻¹, ∆Hhb=–0.86 kcal mol⁻¹, ∆Gsc=–2.19 kcal mol⁻¹).

The underestimation of the entropy can be minimized by defining a “coil” state that includes not only the all-c species (ccccccccccccccc) but also all the possible combinations of h and c peptide bonds that do not have just one single native stretch. For all those conformations, we ignore native interactions (even for a species such as ccchhhhhhhhhchc, which has the backbone conformation of the β-turn as well as residues Y45 and F52 in position to make a hydrophobic interaction). The weight of the coil state now becomes:

[3]

where the second term eliminates the contribution to the coil state by conformations with a single stretch of native peptide bonds (see Eq. 1). The partition function in this “modified” single sequence approximation is:

[4]
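The coil-state weight described above can be checked numerically. The sketch below assumes a closed form inferred from the text rather than taken from the typeset equation: the sum over all 2^n conformations with interactions ignored, (1+s)^n with s = exp(∆Sconf/R), minus the purely entropic weights of the single-stretch conformations. It compares that form with a brute-force sum over every conformation assigned to the coil state. The value of ∆Sconf is the fit value quoted in the Fig. 4 caption; function names are illustrative.

```python
import math
from itertools import product

R = 1.987e-3        # gas constant, kcal mol^-1 K^-1
DS_CONF = -2.74e-3  # kcal mol^-1 K^-1 (fit value quoted in Fig. 4)
N_BONDS = 15

s = math.exp(DS_CONF / R)  # weight of one native peptide bond


def coil_weight_closed_form(n, s):
    """All conformations (interactions ignored) minus single-stretch ones."""
    all_conformations = (1.0 + s) ** n
    single_stretches = sum((n - j + 1) * s ** j for j in range(1, n + 1))
    return all_conformations - single_stretches


def coil_weight_brute_force(n, s):
    """Sum entropic weights over conformations without exactly one h-run."""
    total = 0.0
    for conf in product("ch", repeat=n):
        runs = [r for r in "".join(conf).split("c") if r]
        if len(runs) != 1:  # all-coil, or two or more native stretches
            total += s ** conf.count("h")
    return total


w00 = coil_weight_closed_form(N_BONDS, s)
w00_check = coil_weight_brute_force(N_BONDS, s)

# Probability of one particular five-h conformation within the coil state,
# e.g. cchhhchhccccccc, under the same assumption.
p_example = s ** 5 / w00
```

Under this assumption the modified partition function is the coil weight plus the sum of the single-stretch weights wj,i of Eq. 1.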

Rate Equations. To transform the equilibrium description of the model with the modified single sequence approximation into a kinetic model, we begin by assuming that conformations are connected if they can be interconverted by single h→c or c→h transitions. Species that contain a single stretch of native peptide bonds are connected to those other species that contain one more or one less native peptide bond at either end of the stretch. The rate constant for adding native peptide bond i to a native stretch, ki+, is given by:

[5]

where ko is the preexponential factor at the reference temperature To, ηo is the solvent viscosity at To, and Eo is the activation energy for rotation of the peptide bond. The rate constants for removing native peptide bonds i or i+j–1 from a native stretch of length j that starts at residue i (Fig. 2) are given by:

[6]

It is less straightforward to treat the contribution to the overall kinetics of the system of those additional species that now have been included in the coil state. For example, a coil conformation such as cchhhchhccccccc can convert to a single sequence conformation cchhhhhhccccccc by a single c→h transition. We assume that the rate for this process is equal to k6+ (Eq. 5) times the probability of finding this particular conformation within the coil state [i.e., exp(5∆Sconf/R)/w0,0 for the above example]. We then can define an overall rate that is the summation of the rates for all possible transitions between the coil state and each conformation with a single native stretch. The overall rate for going from the coil state to a conformation with a stretch of j native peptide bonds starting at residue i is given by:

[7]

[8]

where

Using these rates (Eqs. 5–8), the population of the 121 molecular species of the model as a function of time is described by the following set of master equations:

[9]

Despite its complexity, this treatment of the kinetics maintains detailed balance. Moreover, it implicitly includes all the kinetic connections involving single h→c or c→h transitions for each of the 120 single sequence species without increasing the size of the rate matrix. The physical description in this approximation is, however, somewhat artificial. For example, our definition of the coil species requires that a c→h transition that does not occur at the end of a native stretch (such as ccchhhhhhhhcccc→ccchhhhhhhhchcc) transforms the molecule back to the coil state instead of closer to the fully formed hairpin. However, an additional single transition (ccchhhhhhhhchcc→ccchhhhhhhhhhcc) returns the molecule to a more complete hairpin conformation.
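The role of detailed balance in a master equation of this kind can be illustrated with a toy system. The sketch below is not the hairpin model: it uses three hypothetical states in a linear chain with arbitrary statistical weights and forward rates, fixes each reverse rate from detailed balance, and integrates the populations with a simple explicit Euler step (chosen here for brevity; it is not the stiff integrator used for the real model). The populations relax to the equilibrium distribution w/Q.

```python
def reverse_rate(k_forward, w_from, w_to):
    """Reverse rate (to -> from) that satisfies detailed balance."""
    return k_forward * w_from / w_to


weights = [1.0, 0.2, 3.0]  # hypothetical statistical weights
Q = sum(weights)
p_eq = [w / Q for w in weights]

# Hypothetical forward rates between neighboring states (arbitrary units).
k_plus = {(0, 1): 1.0, (1, 2): 2.0}
rates = dict(k_plus)
for (a, b), k in k_plus.items():
    rates[(b, a)] = reverse_rate(k, weights[a], weights[b])


def euler_step(p, dt):
    """One explicit Euler step of dP_s/dt = sum_r (k_rs P_r - k_sr P_s)."""
    dp = [0.0] * len(p)
    for (a, b), k in rates.items():
        flux = k * p[a] * dt
        dp[a] -= flux
        dp[b] += flux
    return [x + d for x, d in zip(p, dp)]


p = [1.0, 0.0, 0.0]  # start entirely in state 0
for _ in range(20000):
    p = euler_step(p, 0.01)

# Detailed balance: equilibrium forward and reverse fluxes are equal.
balanced = all(
    abs(rates[(a, b)] * p_eq[a] - rates[(b, a)] * p_eq[b]) < 1e-12
    for (a, b) in k_plus
)
```

Because every transition removes from one state exactly what it adds to another, the total population is conserved at each step.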

Test of Modified Single Sequence Approximation. We tested the modified single sequence approximation by comparing it with a “complete” model that considers all 2^n (=2^15=32,768) possible conformations explicitly. To perform the test, we fit the experimental data with the modified single sequence approximation model to obtain parameters that were then used in simulations using the complete model. The fit and simulations of the equilibrium data are shown in Fig. 4a. The equilibrium description is rather similar for both models, in contrast to the standard single sequence approximation (Fig. 3). This result confirms our interpretation that underestimation of the entropy of the unfolded ensemble is the main deficiency in the standard single sequence approximation. In the modified single sequence approximation, however, there is a small overestimation of the fraction of unfolded hairpin. The major contribution to this difference is the small subset of species that have significant β-hairpin structure (including stabilizing interactions) but are counted as species of the coil state (which have no stabilizing interactions) in the modified single sequence approximation.

We also tested the kinetic description with simulations carried out with the complete model, in which there are n2^n (=491,520) possible kinetic transitions. Only 450 of those are explicitly included in the modified single sequence approximation model. The fitting to the kinetic experiments with this model was performed by floating its five parameters to produce the best least-squares fit to the observed progress curves for the nine experimental temperature jumps between 288 and 328 K (Fig. 4c). This was carried out using the equilibrium populations at the initial temperatures (before the T-jump) and integrating the rate equations (Eq. 9) by using rate constants evaluated at the final temperatures (after the T-jump). These parameters were then used in kinetic simulations with the complete model. The rate matrix for this model was constructed using an automatic pattern-matching algorithm (E.R. Henry, unpublished data).‡
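The transition counts can be reproduced with a short calculation. In the complete model each of the 2^n conformations has n single-bond neighbors, giving n·2^n = 491,520 directed transitions for n = 15. For the single sequence species, the sketch below assumes a particular counting convention, which the text does not spell out: growth or shrinkage of a stretch at either end, plus the steps between the all-coil species and each one-bond stretch. That convention happens to give 450.

```python
N_BONDS = 15

# Complete model: every conformation can flip any one of its n bonds.
n_states = 2 ** N_BONDS
complete_transitions = N_BONDS * n_states

# Nonzero entries of the complete rate matrix: off-diagonal transitions
# plus one diagonal entry per state.
rate_matrix_nonzeros = complete_transitions + n_states


def single_sequence_transitions(n):
    """Directed single-bond steps among single-stretch species and all-coil."""
    count = 0
    for j in range(1, n + 1):          # stretch length
        for i in range(1, n - j + 2):  # first native bond of the stretch
            last = i + j - 1
            grow = (i > 1) + (last < n)  # room to extend at either end
            count += 2 * grow            # each growth step and its reverse
    count += 2 * n  # all-coil <-> each one-bond stretch
    return count


explicit_transitions = single_sequence_transitions(N_BONDS)
print(complete_transitions, explicit_transitions, rate_matrix_nonzeros)
# 491520 450 524288
```

The 524,288 nonzero entries are of the same order as the roughly 500,000 nonzero matrix elements mentioned in the footnote on the numerical integration.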

FIG. 4. Comparison of thermal unfolding curves and kinetics for modified single sequence and complete models. (a) Fractional population of the hydrophobic cluster as a function of temperature. Derived from a two-state analysis of fluorescence equilibrium curves (large dots). Fit to the data with the modified single sequence model (Eqs. 3 and 4), producing the parameters ∆Sconf=–2.74 cal mol⁻¹ K⁻¹, ∆Hhb=–0.96 kcal mol⁻¹, ∆Gsc=–1.94 kcal mol⁻¹ (dashed line). Calculated with the complete model using these parameters (continuous line). Fraction of native hydrogen bonds calculated using the model with modified single sequence approximation (dotted line). (b) Simulations of progress curves for the complete model (continuous line) and the model using the modified single sequence approximation (dotted line). The fractional population of the hydrophobic cluster vs. time is plotted following a temperature jump from 283 to 298 K. The dashed lines are single exponential fits to the simulated progress curves at times >10 ns, the resolution of the T-jump instrument. The fits of the modified single sequence model to the kinetic data were performed using the LSODA routine (36), which incorporates algorithms for solving both stiff and nonstiff systems of equations. The resulting parameters were k0=8.0×10^8 s⁻¹ and E0=0 (equilibrium parameters same as in a). The equilibrium and kinetic parameters are slightly different from those reported by Muñoz et al. (27) for two reasons. One is that in the previous work the viscosity dependence was not included in the preexponential factor, and the second is that in the present work the kinetic and equilibrium data were fit simultaneously, whereas in the previous analysis (27), the equilibrium data were fit independently. (c) Arrhenius plot of relaxation times following 15 degree temperature jumps. The points are the experimental relaxation rates, whereas the dashed curve through the points is obtained from the fit to the data using the modified single sequence model. The continuous curve is obtained from single exponential fits to the kinetic progress curves generated by the complete 2^15-state model using the kinetic parameters from the modified single sequence model.

The results for the two models are very similar, with fluorescence progress curves that can be represented as a biexponential process in each case. There is initially a small-amplitude phase, corresponding to very rapid reequilibration among conformations in the global free energy minimum of the folded state, followed by a slower large-amplitude phase, corresponding to crossing of the free energy barrier separating the folded and unfolded states (see Fig. 5 and below). Overall, the agreement between the two models must be considered very good and justifies the use of the modified single sequence approximation. There are, however, significant differences, and the relaxation rates for the major phase (the only one detected experimentally) are about a factor of three faster in the complete model (Figs. 4 b and c). This effect is produced because the modified single sequence approximation ignores the stabilizing interactions in the rates connecting conformations included in the coil state with the conformations in the folded state (with a stretch of seven or more native peptide bonds). For example, the transition cccchhhhhhchccc→cccchhhhhhhhccc is less probable in the simpler model because ignoring the two hydrogen bonds of the starting conformation lowers its population by a factor of 25 [=exp(–2∆Hhb/RT)].
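The factor of 25 can be reproduced from the fitted hydrogen bond enthalpy. This is a short arithmetic sketch; the temperature of 298 K is an assumption made here, since the text does not state the temperature at which the factor was evaluated.

```python
import math

R = 1.987e-3   # gas constant, kcal mol^-1 K^-1
DH_HB = -0.96  # kcal mol^-1, fit value quoted in the Fig. 4 caption
T = 298.0      # K, assumed temperature

# Population factor lost by ignoring two hydrogen bonds.
factor = math.exp(-2 * DH_HB / (R * T))
print(round(factor, 1))  # 25.6
```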

Predictions for Other β-Hairpins. An important consequence of having a statistical mechanical model for β-hairpin formation is that it can be used to make specific predictions that can be tested experimentally. A useful way of examining the results of the model is to consider the free energy as a function of the fraction of native peptide bonds, its natural reaction coordinate (Figs. 5a and 6a). The model postulates that formation of a β-hairpin in the absence of side-chain interactions is continuously uphill in free energy because backbone hydrogen bonds do not compensate for the loss in conformational entropy of forming native peptide bonds in both β-strands. Side-chain interactions are, therefore, necessary for the stability of the hairpin and determine the position and height of the free energy barrier for hairpin formation. This hairpin is stabilized by a cluster of four hydrophobic side chains (W43, Y45, F52, and V54), making three hydrophobic interactions (Fig. 1). Based on our model, the folding free energy barrier for this hairpin is crossed when the seven central peptide bonds become native and the first hydrophobic interaction (between residues Y45 and F52) is formed.

‡The system of equations is stiff and was integrated using an iterative multi-step backward differentiation formula method (37), as implemented in the CVODE package (36, 38). This algorithm requires the solution of a set of nonlinear algebraic equations by Newton iteration at each time step. Each Newton iteration in turn requires solving an N×N linear system A∆P=residual, where the matrix A is derived from the rate matrix K. For N=32,768, this problem is rather too large to solve using standard methods (39). However, the matrix A is sparse, containing only ~500,000 nonzero elements of a possible 10^9. Therefore, an iterative generalized minimal residual method (40) appropriate for large sparse linear systems, as implemented in the CVODE package (36, 38), was used. The performance of the algorithm was improved dramatically in this application by Jacobi (diagonal) preconditioning or very simple block-diagonal preconditioning (40).

FIG. 5. Prediction of equilibrium and kinetic properties of β-hairpins with additional interactions. Red, original hairpin; blue, hairpin with interaction in the β-turn (residues D47–K50); and green, hairpin with interaction between end residues (residues R41–E56). (a) Free energy profiles (not including kinetic barriers). (b) Population of the hydrophobic cluster (continuous lines) and fraction of hydrogen bonds (dotted lines) for the three hairpins. (c) Arrhenius plot of the kinetics of the three hairpins. Relaxation rates (continuous lines), folding rates (dotted lines), and unfolding rates (dashed-dotted lines). The folding and unfolding rates have been calculated from a two-state fit to the relaxation rates and equilibrium constants generated with the modified single sequence model.
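The two-state decomposition mentioned in the caption follows from two standard relations: the relaxation rate is the sum of the folding and unfolding rates, and the equilibrium constant is their ratio. The sketch below assumes the equilibrium constant is defined as folded over unfolded; the function name and the numerical values are hypothetical.

```python
def two_state_rates(k_relax, K_eq):
    """Split a relaxation rate into folding and unfolding rates.

    Assumes k_relax = k_f + k_u and K_eq = k_f / k_u (folded/unfolded).
    """
    k_u = k_relax / (1.0 + K_eq)
    k_f = k_relax - k_u
    return k_f, k_u


# Hypothetical example: relaxation rate 4.0 (arbitrary units), K_eq = 3.
k_f, k_u = two_state_rates(4.0, 3.0)
print(k_f, k_u)  # 3.0 1.0
```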

Many of the predictions of the model are immediately apparent upon examination of the free energy profile. The existence of two global minima separated by a significant free energy barrier (Fig. 5a) explains the two-state behavior and exponential kinetics. The species at the barrier maximum has two backbone-backbone hydrogen bonds and therefore a lower energy than the coil state, explaining the apparent negative activation energy in the two-state analysis (assuming a simple Arrhenius expression for the rate constants with a temperature-independent prefactor). The global minimum on the folded side of the free energy barrier consists of several molecular conformations, with the species at the lowest free energy having the intact hydrophobic cluster but not the maximum number of backbone-backbone hydrogen bonds and native peptide bonds. This result could explain why the population of the hydrophobic cluster obtained by fitting fluorescence data is higher than the fraction of native dihedral angles estimated by NMR (29).

An interesting prediction of this model of β-hairpin formation is that local and long-range interactions have very different effects on the free energy surface of the hairpin and, therefore, on its equilibrium and kinetic properties. To illustrate this point, we have performed simulations with two variants. In one of these variants, we include a side-chain interaction, which could result from a favorable electrostatic interaction between D47 and K50, which stabilizes the β-turn by 1 kcal/mol. In the other variant, a similar interaction is introduced between the first and last residues in the hairpin by mutating “in machina” glycine 41 to arginine. This computational experiment is similar in spirit to the protein engineering approach to folding kinetics (41–43). When positioned in the β-turn, the interaction is local (between residues i and i+3), and it significantly affects both the thermodynamics and kinetics of hairpin formation by lowering the free energy of all states that contain native interactions (Fig. 5a). Both the population of species with the hydrophobic cluster and the fraction of hydrogen bonds increase at all temperatures, and the Tm increases by ~20 K (Fig. 5b). The folding rate is accelerated by about a factor of four, whereas the unfolding rate is slightly decelerated (Fig. 5c). Because the peak of the free energy barrier stays at the same position along the reaction coordinate, the change in rates results simply from the change in the barrier height, as is commonly assumed in interpreting the effects of single residue perturbations in protein folding. When the interaction is introduced between the end residues, it is long range (i,i+15) and its effects on the folding properties are rather insignificant. The Tm changes by only ~2 K, and the change in rates is very small, with the largest change in the unfolding rate. Thus, the simulations suggest that, in hairpins, the interactions closest to the β-turn exert the largest effect on the folding rate; interactions between the ends of the strands may stabilize the hairpin structure but have very little effect on the folding rate.

Another important point raised by our model of β-hairpin formation is that the shape of the free energy barrier and the position of its maximum along the reaction coordinate are determined by a delicate balance between the loss in conformational entropy and stabilization from side-chain interactions. To address this point, we have simulated two variants of the original β-hairpin: a hairpin with the hydrophobic cluster placed one residue closer to the center of the molecule (W44, Y46, F51, V53) and another one with the hydrophobic cluster one residue closer to the ends (W42, Y44, F53, V55). The effects of these changes on the equilibrium and kinetic properties are shown in Figs. 6 b and c, respectively. Moving the hydrophobic cluster one residue in either direction does not modify the interaction energies of the hairpin; however, the model predicts a dramatic effect on its free energy profile (Fig. 6a). If the cluster is moved closer to the β-turn, both the minimum in the folded ensemble and the top of the free energy barrier are shifted toward less structure (closer to the unfolded ensemble). This is accompanied by an increase in stability, an acceleration of the folding rate, and almost no change in the unfolding rate. If the cluster is moved one residue in the opposite direction (toward the ends), the stability is decreased and the folding rate decreases, and there is only a small change in the unfolding rate. Displacements of the top of the free energy barrier have been reported in folding experiments on small proteins (44). The model indicates that, for a β-hairpin, the top of the free energy barrier is simply determined by the position of the stabilizing side chains in the sequence.

FIG. 6. Prediction of equilibrium and kinetic properties of β-hairpins with repositioned hydrophobic cluster. Red, original hairpin; blue, hairpin with hydrophobic cluster moved one residue closer to the β-turn; and green, hairpin with hydrophobic cluster moved one residue closer to the ends. (a) Free energy profiles. (b) Population of the hydrophobic cluster (continuous lines) and fraction of hydrogen bonds (dotted lines) for the three hairpins. (c) Arrhenius plot of the kinetics of the three hairpins. Relaxation rates (solid lines), folding rates (dotted lines), and unfolding rates (dashed-dotted lines). The folding and unfolding rates have been calculated as in Fig. 5.

Caveats. One criticism of the model that we have presented is that only native interactions are considered. This excludes the possibility, for example, of forming a turn at additional positions in the sequence, which would result in nonnative hydrogen bonds and nonnative hydrophobic interactions. There is no evidence in the NMR data of significant population of other hairpin conformations (29). Nonnative interactions also can affect the kinetics in the same two ways as in proteins (2). They can produce local minima in the energy landscape, which can result in the population of intermediate structures at equilibrium, or they can produce transient trapping of misfolded structures, which are not present at equilibrium. We have not yet found any evidence for equilibrium intermediates in the folding of the β-hairpin. For transient trapping to be observable as a separate kinetic phase, the residence time in the trapped state must be longer than the relaxation time for the overall hairpin-coil transition. Transient trapping does not appear to be occurring in this β-hairpin because the progress curves at all temperatures can be well fit with a single exponential function (27).

Another criticism of the model is that there are no native backbone or side-chain interactions between two residues unless the peptide bonds of all intervening residues have the native conformation. This postulate excludes the possibility, for example, of initiation by forming the hydrophobic cluster followed by zipping up of the hydrogen bonds. The transition state in this mechanism would be a ~10-residue loop. One is tempted by this mechanism because of the close correspondence of the β-hairpin relaxation time and the time of ~1 µs estimated by Hagen et al. (25, 26) to form a 10-residue loop. A 10-residue loop is predicted by Thirumalai (45) to be the most probable loop size in proteins, longer loops being less probable because of the larger entropy loss and shorter loops because of chain stiffness. One could possibly distinguish between the two mechanisms by measuring the kinetics for a β-hairpin in which the hydrophobic cluster is moved closer to the β-turn. Our model predicts that the rate of formation should speed up because the transition state now occurs earlier along the reaction coordinate (Fig. 6a), whereas the loop model would predict a slower rate of formation. This consideration was in fact one of the motivations for the predictions discussed above.

The most convincing test of the model will of course come from measurements on other β-hairpin peptides. Another approach to both testing and refining the model is to examine the results of simulations. All-atom molecular dynamics simulations of temperature jump kinetic experiments (at experimental temperatures) may be feasible in the near future (10, 11). In the meantime, it should be useful to examine the results of Langevin simulations of simplified representations of the peptide (33). Because large numbers of sufficiently long trajectories are possible with this method, kinetic progress curves actually can be simulated. Examination of these trajectories might reveal dominant mechanisms and structural species that must be included in a kinetic model. The results will, however, depend critically on the choice of potential functions.

Is the model unnecessarily complex? Could we use a much simpler model, even a two-state model? The main problem with a two-state model is that it has little predictive value. In a two-state model, one would postulate a transition state structure or, as in the case of proteins, try to determine the transition state by structural perturbation experiments. The examples in Fig. 6 show that small structural perturbations lead to changes in the transition state and would therefore also result in incorrect predictions for the change in rates. Nevertheless, the model could be simplified. If we consider only residue-residue interactions, then in the single sequence approximation, the model reduces to an eight-state model. In such a model, quartets of dihedral angles change simultaneously in single kinetic steps. This model can explain the experimental data, as well as predict the properties of other β-hairpins. It is, however, not straightforward to extend a model based on interactions to more complex structures because of the difficulty in defining rules that specify the rates of all elementary kinetic steps in terms of just a few parameters.

We thank Attila Szabo and Peter Wolynes for helpful comments on the manuscript.

1. Anfinsen, C.B. (1973) Science 181, 223–230.
2. Bryngelson, J.D., Onuchic, J.N., Socci, N.D. & Wolynes, P.G. (1995) Proteins Struct. Funct. Genet. 21, 167–195.
3. Bryngelson, J.D. & Wolynes, P.G. (1987) Proc. Natl. Acad. Sci. USA 84, 7524–7528.
4. Orland, H., Garel, T. & Thirumalai, D.T. (1996) in Recent Developments in Theoretical Studies of Proteins, ed. Elber, R. (World Scientific, Singapore), pp. 197–268.
5. Dill, K.A., Bromberg, S., Yue, K., Fiebig, K.M., Yee, D.P., Thomas, P.D. & Chan, H.S. (1995) Protein Sci. 4, 561–602.
6. Karplus, M. & Sali, A. (1995) Curr. Opin. Struct. Biol. 5, 58–73.
7. Shakhnovich, E.I. (1997) Curr. Opin. Struct. Biol. 7, 29–40.
8. Klimov, D.K. & Thirumalai, D. (1996) Proteins Struct. Funct. Genet. 26, 411–441.
9. Guo, Z., Brooks, C.B. & Boczko, E.M. (1997) Proc. Natl. Acad. Sci. USA 94, 10161–10166.
10. Li, A. & Daggett, V. (1996) J. Mol. Biol. 257, 412–429.
11. Lazaridis, T. & Karplus, M. (1997) Science 278, 1928–1931.
12. Eaton, W.A., Thompson, P.A., Chan, C.-K., Hagen, S.J. & Hofrichter, J. (1996) Structure (London) 4, 1133–1139.
13. Gray, H.B. & Valentine, J.S. (1998) Acc. Chem. Res., in press.
14. Eaton, W.A., Hofrichter, J., Muñoz, V. & Thompson, P.A. (1998) Acc. Chem. Res., in press.
15. Zimm, B., Doty, P. & Iso, K. (1959) Proc. Natl. Acad. Sci. USA 45, 1601–1607.
16. Schwarz, G. (1965) J. Mol. Biol. 11, 64–77.
17. Poland, D. & Scheraga, H.A. (1970) in Theory of Helix-Coil Transitions in Biopolymers (Academic, New York).
18. Gruenewald, B., Nicola, C.U., Lustig, A. & Schwarz, G. (1979) Biophys. Chem. 9, 137–147.
19. Chakrabartty, A. & Baldwin, R.L. (1995) Adv. Protein Chem. 46, 141–176.
20. Muñoz, V. & Serrano, L. (1995) Curr. Opin. Biotech. 6, 382–386.
21. Williams, S., Causgrove, T.P., Gilmanshin, R., Fang, K.S., Callender, R.H., Woodruff, W.H. & Dyer, R.B. (1996) Biochemistry 35, 691–697.
22. Gilmanshin, R., Williams, S., Callender, R.H., Woodruff, W.H. & Dyer, R.B. (1997) Biochemistry 36, 15006–15012.
23. Thompson, P.A., Eaton, W.A. & Hofrichter, J. (1997) Biochemistry 36, 9200–9210.
24. Jones, C.M., Henry, E.R., Hu, Y., Chan, C.-K., Luck, S.D., Bhuyan, A.K., Roder, H., Hofrichter, J. & Eaton, W.A. (1993) Proc. Natl. Acad. Sci. USA 90, 11860–11864.
25. Hagen, S.J., Hofrichter, J., Szabo, A. & Eaton, W.A. (1996) Proc. Natl. Acad. Sci. USA 93, 11615–11617.
26. Hagen, S.J., Hofrichter, J. & Eaton, W.A. (1997) J. Phys. Chem. 100, 12008–12021.
27. Muñoz, V., Thompson, P.A., Hofrichter, J. & Eaton, W.A. (1997) Nature (London) 390, 196–199.
28. Gronenborn, A.M., Filpula, D.R., Essig, N.Z., Achari, A., Whitlow, M., Wingfield, P.T. & Clore, G.M. (1991) Science 253, 657–661.
29. Blanco, F.J., Rivas, G. & Serrano, L. (1994) Nat. Struct. Biol. 1, 584–590.
30. Zwanzig, R. (1995) Proc. Natl. Acad. Sci. USA 92, 9801–9804.
31. Ansari, A., Jones, C.M., Henry, E.R., Hofrichter, J. & Eaton, W.A. (1992) Science 256, 1796–1798.
32. Ansari, A., Jones, C.M., Henry, E.R., Hofrichter, J. & Eaton, W.A. (1994) Biochemistry 33, 5128–5145.
33. Klimov, D.K. & Thirumalai, D. (1997) Phys. Rev. Lett. 79, 317–320.
34. Schellman, J.A. (1958) J. Phys. Chem. 62, 1485–1494.
35. Qian, H. & Schellman, J.A. (1992) J. Phys. Chem. 96, 3987–3994.
36. Hindmarsh, A.C. & Petzold, L.R. (1995) Comput. Phys. 9, 148–155.
37. Hairer, E. & Wanner, G. (1996) Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems (Springer, Berlin), 2nd Ed.
38. Cohen, S.C. & Hindmarsh, A.C. (1994) CVODE User Guide (Lawrence Livermore Natl. Lab, Livermore, CA).
39. Lawson, C.L. & Hanson, R.J. (1974) Solving Least Squares Problems (Prentice-Hall, Englewood Cliffs, NJ).
40. Barrett, R., Berry, M., Chan, T.F., Demmel, J., Donato, J.M., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C. & Van der Vorst, H. (1993) Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods (Soc. Indust. Appl. Math., Philadelphia).
41. Fersht, A.R., Matouschek, A. & Serrano, L. (1992) J. Mol. Biol. 224, 771–782.
42. Itzhaki, L.S., Otzen, D.E. & Fersht, A.R. (1995) J. Mol. Biol. 254, 260–288.
43. Onuchic, J., Socci, N.D., Luthey-Schulten, Z. & Wolynes, P.G. (1996) Fold. Des. 1, 441–450.
44. Silow, M. & Oliveberg, M. (1997) Biochemistry 36, 7633–7637.
45. Camacho, C.J. & Thirumalai, D. (1995) Proc. Natl. Acad. Sci. USA 92, 1277–1281.


Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5880–5883, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

Coupling the folding of homologous proteins

CHEN KEASAR*†, DROR TOBI*, RON ELBER*‡, AND JEFF SKOLNICK§

*Department of Physical Chemistry, Department of Biological Chemistry, Fritz Haber Research Center for Molecular Dynamics and Wolfson Center for Applied Structural Biology, Hebrew University, Givat Ram, Jerusalem 91904, Israel; †Department of Structural Biology, Stanford School of Medicine, Stanford University, Stanford, CA 94305; and §Department of Molecular Biology, Scripps Research Institute, La Jolla, CA 92037

ABSTRACT The empirical observation that homologous proteins fold to similar structures is used to enhance the capabilities of an ab initio algorithm to predict protein conformations. A penalty function that forces homologous proteins to look alike is added to the potential and is employed in the coupled energy optimization of several homologous proteins. Significant improvement in the quality of the computed structures (as compared with the computational folding of a single protein) is demonstrated and discussed.

It is convenient to classify methods of predicting protein conformations into one of two main categories: (a) methods that optimize energy functions and (b) methods that search through databases of protein structures. In the present manuscript we call a “energy minimization methods” and b “homology.” The division is not sharp. For example, many of the energy functions that are used in the prediction of three-dimensional conformations of proteins include information extracted from databases of protein structures.

The conformation of the global energy minimum, even if we succeed in finding it, may differ from the native fold for two possible reasons: (1) the empirical energy is inaccurate, and (2) the native fold does not correspond to a global free energy minimum. To address point 1, an adjustment of the energy function may follow, whereas to address point 2 the folding pathways (and not only the most stable state) are required. We propose below a combination of the homology and the energy approaches. The combination improves the prediction of structures of homologous proteins even if their conformations do not correspond to a global minimum of the individual molecules.

In the homology approach, an empirical observation on databases of protein structures is employed: proteins with comparable sequences adopt a similar structure in the native configuration. This information is used to build models of unknown structures. The required degree of similarity between sequences is uncertain, but a bet with a significant safety margin is 40% sequence identity. A model of a protein with an unknown structure can be built by using an experimental structure of a protein with a comparable sequence.

The homology protocol is the most accurate approach we have today to model protein structures on the computer. Its disadvantage is the necessity of having a similar sequence with a known structure.

In this manuscript, we describe a connection between the two approaches that improves the performance of energy optimization techniques while maintaining their generality. In the next section, we describe the algorithm, and an example for a “real” protein follows. Finally, we explain why the suggested coupling optimizes better than straightforward annealing. We suggest two reasons for the improvement. The first is smoothing of the potential energy surface, which makes it more accessible for global optimization. The second is averaging in sequence space (over the homologous proteins), which enhances weak signals.

The Algorithm. We consider N homologous proteins with sequence identity sufficient to suggest structural similarity. The structure of the family is unknown, making the “energy minimization” approach the right choice to predict the structure (it is the only choice). The N sequences are aligned, using established sequence alignment techniques (1). Here, we assume that the alignment is adequate. The example discussed below did not include deletions or insertions of amino acids in the sequence. However, an extension that includes deletions and insertions is straightforward.

The energy of a homologous set of proteins. An energy function, Etotal, is defined, which includes the sum of all the individual energies of the homologous proteins and a coupling term that penalizes structural diversity.

$$E_{\mathrm{total}} = \sum_{i=1}^{N} \varepsilon_i(X_i) + \sum_{i<j} \Delta_{ij}(X_i, X_j)$$ [1]

Xi is the vector of coordinates for the i-th homologous protein, and εi(Xi) is its unique energy function. ∆ij(Xi, Xj) is a function that measures and penalizes structural diversity between proteins i and j: the larger the difference between the two structures, the higher the value of the penalty function. In Eq. 1, we sum over the diversities of all pairs. Optimization of Etotal provides a prediction for the structure of the family of the homologous proteins.

Exploring conformations. We used the Lattice Monte Carlo Program (LMCP) of Skolnick and Kolinski (2). The Monte Carlo procedure uses different moves on the lattice to modify a starting conformation of the protein chain. Each of the proteins is modified independently. A displacement δXi is chosen according to the LMCP protocol, so that ⟨δXi δXj⟩ is zero for i≠j (⟨…⟩ denotes an ensemble average). New protein energies, εi(Xi+δXi) (i=1,…,N), are computed and supplemented by the i-th measure of structural diversity, ∆i(Xi+δXi) = Σj ∆ij. The displacement in Xi (δXi) is now accepted or rejected according to the usual Monte Carlo criterion with an energy εi(Xi)+∆i(Xi). The Monte Carlo test is repeated for all the homologous structures {i=1,…,N}.
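The acceptance rule described above can be illustrated with a minimal sketch. It is not the LMCP code: the one-dimensional integer "lattice", the quadratic energies, the quadratic penalty, and the names `coupled_metropolis_sweep` and `demo` are all assumptions chosen only to keep the example short. Each copy proposes a move from its own coordinate alone, and the move is tested with the Metropolis criterion on εi plus the summed penalty against the other copies.

```python
import math
import random

def coupled_metropolis_sweep(xs, energies, penalty, propose, kT, rng):
    """Try one move per copy; accept on eps_i(X_i) + Delta_i(X_i)."""
    n_accept = 0
    for i in range(len(xs)):
        def local_energy(xi):
            # Delta_i: summed penalty of copy i against all other copies
            delta_i = sum(penalty(xi, xs[j]) for j in range(len(xs)) if j != i)
            return energies[i](xi) + delta_i
        trial = propose(xs[i], rng)  # generated from X_i only, ignoring the penalty
        d_e = local_energy(trial) - local_energy(xs[i])
        if d_e <= 0 or rng.random() < math.exp(-d_e / kT):
            xs[i] = trial
            n_accept += 1
    return n_accept

def demo(seed=0):
    """Toy run: three 'homologous' copies with slightly shifted minima."""
    rng = random.Random(seed)
    centers = [0.0, 0.5, -0.5]                      # hypothetical per-sequence minima
    energies = [lambda x, c=c: (x - c) ** 2 for c in centers]
    penalty = lambda a, b: 0.5 * (a - b) ** 2       # hypothetical stand-in for Delta_ij
    propose = lambda x, r: x + r.choice((-1, 1))    # unit step on an integer lattice
    xs = [10, -8, 3]
    for _ in range(500):
        coupled_metropolis_sweep(xs, energies, penalty, propose, kT=0.05, rng=rng)
    return xs
```

In this toy setting the copies are pulled both toward their own minima and toward each other, so they end up within one lattice site of the shared minimum. Because each copy only needs the current coordinates of the others, the loop body maps naturally onto one processor per copy.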

‡To whom reprint requests should be addressed. e-mail: [email protected].
© 1998 by The National Academy of Sciences 0027–8424/98/955880–4$2.00/0
PNAS is available online at http://www.pnas.org.
Abbreviations: LMCP, Lattice Monte Carlo Program; RMSD, root mean square difference.


The generation of δXi (but not the decision on its acceptance) depends only on Xi and does not take into account the penalty function on structural diversity. This protocol may lead to a large number of step rejections. However, the above choice is the simplest to use in a parallel computing environment, and it further leaves some room for future publications.

The parallel environment is important because the computations are pursued typically on a cluster of workstations or on a parallel computer with multiple CPUs. Each of the homologous proteins is assigned to one processor. The processor calculates the displacement δXi and the energy εi(Xi)+∆i(Xi) and decides whether to accept or to reject the move. The correlation with other structures is built using the function ∆i(Xi) that depends on all the other coordinates. The conformations of the set are therefore sampled from the canonical ensemble with an “energy” Etotal.

To compute the penalty function ∆i(Xi), it is necessary to have the coordinates of all the proteins on each of the processors. The update of the coordinates can be done with every Monte Carlo step. However, to reduce communication overhead, we usually update the coordinates only after a few Monte Carlo steps.

A related algorithm can be easily formulated for molecular dynamics, solving Newton’s equations of motion:

$$M_i \frac{d^2 X_i}{dt^2} = -\nabla_{X_i}\left[\varepsilon_i(X_i) + \Delta_i(X_i)\right]$$

where Mi is the mass matrix for protein i. Nevertheless, the lattice approach offers a number of unique advantages for the protein folding problem, which are discussed elsewhere (3).

We now return to the functional form of the penalty on structural variations. We experimented with two measures:

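(a) Root mean square difference (RMSD) of the shared Cα coordinates after optimal overlap (4):

$$\mathrm{RMSD}(X_i, X_j) = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left|x_k^{(i)} - x_k^{(j)}\right|^2}$$

where n is the number of shared Cα atoms and x_k^{(i)} is the position of the k-th shared Cα of protein i after the optimal superposition.

(b) L is the fraction of dissimilar contacts in the maps of the two structures. L=(number of dissimilar contacts)/(total number of contacts in the two maps). (Two residues are considered to be in contact if their Cα distance is ≤6.5 Å.)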
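Measure (a) can be sketched in a few lines. The procedure of ref. 4 finds the optimal rotation in three dimensions; the sketch below is a simplification that assumes two-dimensional coordinates, where the optimal rotation angle has a closed form. The function name `rmsd_after_overlap_2d` is hypothetical.

```python
import math

def rmsd_after_overlap_2d(a, b):
    """RMSD between two equal-length lists of (x, y) points after removing
    the best translation and rotation (2D simplification, no reflection)."""
    n = len(a)
    ca = (sum(p[0] for p in a) / n, sum(p[1] for p in a) / n)
    cb = (sum(p[0] for p in b) / n, sum(p[1] for p in b) / n)
    a0 = [(x - ca[0], y - ca[1]) for x, y in a]   # centered copies
    b0 = [(x - cb[0], y - cb[1]) for x, y in b]
    # The rotation angle maximizing the overlap sum(b . R a) is atan2(cross, dot)
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a0, b0))
    cross = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(a0, b0))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    sq = 0.0
    for (ax, ay), (bx, by) in zip(a0, b0):
        rx, ry = c * ax - s * ay, s * ax + c * ay  # rotate a onto b
        sq += (rx - bx) ** 2 + (ry - by) ** 2
    return math.sqrt(sq / n)
```

A rigidly moved copy of a chain gives an RMSD of zero up to rounding, whereas chains of different shape give a positive value that grows with the global mismatch.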

The RMSD is a common and useful measure of global similarity. However, it performs poorly in detecting similar folds of structural segments. For example, if the secondary structure elements are predicted correctly but their packing is incorrect, the RMSD is typically high. In contrast, L, which is not as widely used as the RMSD, detects local similarities and shows a more uniform change in value as the structure quality decreases.
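The contact-map measure L can be sketched as follows. Two details are assumptions rather than statements of the source: "dissimilar contacts" is read as contacts present in exactly one of the two maps, and the denominator is the sum of the contact counts of both maps. The value returned when neither map has any contact (0.0) is also a convention chosen here; the helper names are hypothetical.

```python
import math
from itertools import combinations

def contact_set(coords, cutoff=6.5):
    """Pairs (i, j), i < j, whose distance is within the cutoff (in angstroms)."""
    return {(i, j) for i, j in combinations(range(len(coords)), 2)
            if math.dist(coords[i], coords[j]) <= cutoff}

def contact_dissimilarity(coords_a, coords_b, cutoff=6.5):
    """L = (contacts found in only one map) / (contacts in both maps combined)."""
    a = contact_set(coords_a, cutoff)
    b = contact_set(coords_b, cutoff)
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # assumed convention: no contacts means no dissimilar contacts
    return len(a ^ b) / total
```

With this reading, two different fully extended chains, neither of which has any contact, score L = 0, the same value as two identical compact structures.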

Both functions are useful in comparing the final structure to the native fold; however, the task of forcing the different chains to look alike is best done with the RMSD. The application of L is problematic because maps with no contacts at all have (of course) no dissimilar contacts. As a result, restraining the structures to similar L values pushes the system to unfolded swollen states. We therefore used the RMSD. The specific functional form of ∆ij(Xi, Xj) is listed below:

Simulation Protocol and Results. We provide a numerical example for a family of pancreatic hormones (5). In Table 1, we list the seven sequences that were used in the runs with coupled optimizations (6).

We performed 100 Monte Carlo simulated annealing runs of the protein 1ppt and 142 coupled runs of structures of seven sequences, which were optimized in parallel. Only one experimental structure (of 1ppt) is available, and we compared the results of the computations with it. In Fig. 1, we show the energies of the 100 Monte Carlo runs of 1ppt as a function of the RMSD from the native structure. Also shown are the energies of the 142 coupled runs.

The runs that employed seven coupled proteins clearly cluster near lower RMSD values and therefore provide a better prediction. The lowest energy structures of the coupled and the uncoupled runs (our best guesses of the native conformation) are shown in Fig. 2. Again, the coupled runs provide a better answer. The improvement does not require an increase in computational effort. Each of the uncoupled 1ppt runs was seven times longer than the run of the seven sequences.

Another example for a protein family (homeodomain) can be found in ref. 6. Yet another study employed coupling in a two-dimensional lattice (7) and showed an even more profound improvement.

DISCUSSION

Here we discuss the question of “why.” Why does the proposed algorithm improve structure prediction? We have seen one example, and other examples are available in the literature (6, 7). From a global optimization perspective, it may be surprising that optimizing a system N times larger (N homologous sequences) is easier than optimizing one sequence at a time. In the limit in which the optimizations are completely independent, they should take approximately N times longer.

Obviously, the coupling plays an important role in increasing the computational efficiency and accuracy. To understand the effect, it is useful to consider a simpler system first in which the “homologous” proteins are all the same molecule. Hence, only structural diversity remains. Etotal is now Etotal = Σi [ε(Xi)+∆i]. A single energy function [ε(Xi)] is used for the different conformations of the proteins {i=1,…,N}.

In Fig. 3, we compare annealing results with coupled and uncoupled energy functions for the protein 1fsd. The distribution clearly shows that better energies are obtained when the coupling (of identical proteins) is employed. Hence, a better optimization protocol is obtained without sequence diversity. However, it is important to emphasize that the quality of the structures (as opposed to the energies) is not necessarily better because it depends also on the quality of the energy function.

Table 1. The seven coupled sequences that were used in the present work

PDB, protein data bank.


FIG. 1. Comparison between coupled and uncoupled Monte Carlo runs. X, uncoupled runs; black circles, coupled runs. Each point is the final configuration of a simulated annealing run. Note that the coupled runs end more frequently at lower energies and lower RMSD values.

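The Monte Carlo procedure produces conformations that are sampled from the canonical ensemble. The weight of a coupled state X1,…,XN at a temperature T is given by

$$w(X_1,\ldots,X_N) \propto \exp\left\{-\frac{1}{k_B T}\sum_{i=1}^{N}\left[\varepsilon(X_i) + \Delta_i\right]\right\}$$

where kB is the Boltzmann constant. Note that ∆i depends on the coordinates of the rest of the copies and that, without the coupling, we are getting the classical Boltzmann factor for a set of N noninteracting copies:

$$w_0(X_1,\ldots,X_N) \propto \exp\left\{-\frac{1}{k_B T}\sum_{i=1}^{N}\varepsilon(X_i)\right\}$$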

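Consider now the discrete formula for the quantum path integral of a system with a potential energy ε(X):

$$Q \propto \int dX_1 \cdots dX_N \exp\left\{-\frac{\beta}{N}\sum_{i=1}^{N}\left[\varepsilon(X_i) + \frac{m N^2}{2\hbar^2\beta^2}\left(X_i - X_{i+1}\right)^2\right]\right\}$$

where m is the mass matrix, β = 1/kBT, and ħ is the Planck constant divided by 2π (8). For convenience, we define

$$\lambda = \frac{m N^2}{2\hbar^2\beta^2}$$

and we also set N+1≡1. The new “coupled” energy Ecoupled resembles Etotal:

$$E_{\mathrm{coupled}} = \sum_{i=1}^{N}\left[\varepsilon(X_i) + \lambda\left(X_i - X_{i+1}\right)^2\right]$$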

The quantum expression couples only pairs of nearest neighbor structures because the coupling corresponds to a physical entity: the kinetic energy. Nevertheless, if λ is sufficiently large (the protein is close to being “classical”), the different structures will remain similar at each sampling point, exactly what we wanted to achieve in the folding of homologous proteins. The similarity between Etotal and Ecoupled is therefore self-evident and hints at the origin of the enhanced optimization as discussed below.

Expressions that are related to the above quantum expression were investigated in the global optimization field (9). The key idea in a number of pioneering approaches was to define a new energy function. The new energy function is a local spatial average with different choices of densities: Eaverage(X) = ∫E(X0)ρ(X, X0)dX0, where ρ(X, X0) is the density. The result Eaverage(X) is a smoother potential that is easier to minimize (Fig. 4), as was shown by numerous examples (9–13). Examples of smoothing densities include (but are not limited to) Gaussians (9–11), square boxes (12), and a discrete sample of points (13). For example, the above quantum expression is an average of ε(X) over a discrete number of sampling points {X1,…,XN}. The discrete averaging has advantages and disadvantages. An advantage is its simplicity. We do not have to perform complex integrals to obtain the average. The smoothing is done by a direct sum over the sampling points. However, because the number of points is small, the averaging is less effective compared with other methods that use analytical densities and integrals. Nevertheless, discrete smoothing is very suggestive for lattice calculations of the type that we present here.
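Discrete smoothing can be illustrated on a toy one-dimensional potential. The sketch below is not taken from the paper: the rugged function, the choice of sampling offsets (eight points spread evenly over one period of the ripple), and all names are assumptions picked so that the effect is easy to see. The smoothed energy at X is simply the direct sum of the original energy over the sampling points around X, divided by their number.

```python
import math

def rugged(x):
    """Toy potential: a broad well decorated with short-wavelength ripples."""
    return 0.05 * x * x + math.sin(5.0 * x)

def smoothed(energy, x, offsets):
    """Discrete local average of the energy over sampling points x + d."""
    return sum(energy(x + d) for d in offsets) / len(offsets)

def count_local_minima(values):
    """Number of interior grid points lower than both neighbors."""
    return sum(1 for i in range(1, len(values) - 1)
               if values[i] < values[i - 1] and values[i] < values[i + 1])

period = 2.0 * math.pi / 5.0     # wavelength of the ripple in rugged()
n_points = 8                     # hypothetical number of sampling points
offsets = [(k - (n_points - 1) / 2.0) * period / n_points for k in range(n_points)]

grid = [i * 0.01 for i in range(-1000, 1001)]
raw_minima = count_local_minima([rugged(x) for x in grid])
smooth_minima = count_local_minima([smoothed(rugged, x, offsets) for x in grid])
```

Here the many ripple minima of the raw potential collapse into the single minimum of the underlying well. With fewer or badly placed sampling points the cancellation is only partial, which mirrors the remark above that discrete averaging is less effective than analytical densities.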

FIG. 2. Comparing the native structure (a) with the lowest energy structures of the coupled (b) and the uncoupled (c) runs.

To conclude, one of the reasons that the proposed protocol improves structure prediction is spatial potential averaging. This improvement is in the spirit of a number of recent global optimization procedures (9).


FIG. 3. Comparing the optimized energies from multiple simulated annealing runs for coupled and uncoupled simulations. Dark bars, coupled runs; light bars, uncoupled runs.

Another important feature of the present protocol is averaging in sequence space. We return to the N different sequences. Each of the proteins has a unique energy surface. By virtue of experimental observations, we know that all the homologous proteins share similar structures at their native fold. At approximately the same coordinate Xnative, we expect all the proteins to be in an energy minimum. Therefore, all of the energies are correlated, and their sum is a quantity that increases linearly with N.

FIG. 4. A schematic drawing of potential smoothing using discrete, local averaging: ⟨V(X)⟩ = (1/N) Σi V(Xi)·δX,Xi.

On the other hand, for unfolded states the energies of the different homologous structures are not necessarily correlated. Consider, for example, a correlated mutation at the hydrophobic core of the protein. To maintain the compactness of the hydrophobic core of the native state, valine and tryptophan may replace a pair of phenylalanines. At unfolded conformations, it is not necessary to assume that the contacts and the energies of the above residues are still correlated. It is more likely that the energies are not correlated. We therefore estimate

$$\sum_{i=1}^{N}\varepsilon_i(X_{\mathrm{unfolded}}) \propto \sqrt{N}$$

This estimate is in the spirit of the Random Energy Model as applied to proteins (14).

The new energy surface Etotal is therefore distorted in a favorable way when compared with the original εi. The shared minimum (whose existence is supported by databases of protein structures) is deeper compared with other portions of the energy surface. The enhancement of the well depth of the shared minimum may make it the global energy minimum of the new average energy even if originally it was not. This enhancement suggests that the new protocol may be effective for kinetically stable proteins.
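The scaling argument can be checked numerically under a deliberately crude assumption that is not part of the paper: every sequence contributes the same well depth at the shared native coordinate, while at an unfolded conformation each sequence contributes an independent Gaussian energy with zero mean and unit variance. The function names and parameters are hypothetical.

```python
import math
import random

def native_sum(n_seq, well_depth=-1.0):
    """Correlated case: every sequence sits in a minimum of the same depth."""
    return n_seq * well_depth

def rms_of_unfolded_sum(n_seq, n_conf, rng):
    """Uncorrelated case: root-mean-square of the summed energies, estimated
    over n_conf random unfolded conformations."""
    total = 0.0
    for _ in range(n_conf):
        s = sum(rng.gauss(0.0, 1.0) for _ in range(n_seq))
        total += s * s
    return math.sqrt(total / n_conf)

def gap_ratio(n_seq, n_conf=4000, seed=1):
    """Depth of the shared minimum relative to the typical unfolded sum."""
    rng = random.Random(seed)
    return abs(native_sum(n_seq)) / rms_of_unfolded_sum(n_seq, n_conf, rng)
```

Under these assumptions the native sum grows like N while the typical unfolded sum grows like the square root of N, so `gap_ratio(n_seq)` comes out close to the square root of `n_seq`: adding sequences deepens the shared minimum relative to the rest of the surface.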

It is the combination of the spatial and sequence averaging that provides the significant improvement in structure prediction of ab initio algorithms, as discussed above.

This research was supported by a Binational Science Foundation grant (to R.E. and J.S.) and by National Institutes of Health Grant GM37408 (to J.S.).

1. Sander, C. & Schneider, R. (1991) Proteins 9, 56–68.
2. Kolinski, A. & Skolnick, J. (1994) Proteins 18, 338–352.
3. Kolinski, A. & Skolnick, J. (1996) Lattice Models of Protein Folding, Dynamics and Thermodynamics (Chapman & Hall, London).
4. Kabsch, W. (1976) Acta Crystallogr. A 32, 922–923.
5. Glover, I. & Blundell, T. (1983) Biopolymers 22, 293–304.
6. Keasar, C., Elber, R. & Skolnick, J. (1997) Fold. Des. 2, 247–259.
7. Keasar, C. & Elber, R. (1995) J. Phys. Chem. 99, 11550–11556.
8. Feynman, R.P. (1982) Statistical Mechanics: A Set of Lectures (Benjamin Cummins, Reading, MA).
9. Straub, J.E. (1996) Optimization Techniques with Applications to Proteins in Recent Developments in Theoretical Studies of Proteins, ed. Elber, R. (World Scientific, Singapore), pp. 137–197.
10. Piela, L., Kostrowicki, J. & Scheraga, H.A. (1989) J. Phys. Chem. 96, 4024–4035.
11. Shalloway, D. (1992) Global Optimizat. 2, 281.
12. Andricioaei, I. & Straub, J.E. (1996) Comput. Phys. 10, 449–454.
13. Roitberg, A. & Elber, R. (1991) J. Chem. Phys. 95, 9277–9287.
14. Bryngelson, J.D. & Wolynes, P.G. (1989) J. Phys. Chem. 93, 6902–6915.


Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5884–5890, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

Photoactive yellow protein: A structural prototype for the three-dimensional fold of the PAS domain superfamily

JEAN-LUC PELLEQUER*, KAREN A. WAGER-SMITH†, STEVE A. KAY†, AND ELIZABETH D. GETZOFF*‡

*Department of Molecular Biology and The Skaggs Institute for Chemical Biology, †National Science Foundation Center for Biological Timing, Department of Cell Biology, The Scripps Research Institute, 10550 N. Torrey Pines Rd., La Jolla, CA 92037.

ABSTRACT PAS domains are found in diverse proteins throughout all three kingdoms of life, where they apparently function in sensing and signal transduction. Although a wealth of useful sequence and functional information has become recently available, these data have not been integrated into a three-dimensional (3D) framework. The very early evolutionary development and diverse functions of PAS domains have made sequence analysis and modeling of this protein superfamily challenging. Limited sequence similarities between the ~50-residue PAS repeats and one region of the bacterial blue-light photosensor photoactive yellow protein (PYP), for which ground-state and light-activated crystallographic structures have been determined to high resolution, originally were identified in sequence searches using consensus sequence probes from PAS-containing proteins. Here, we found that by changing a few residues particular to PYP function, the modified PYP sequence probe also could select PAS protein sequences. By mapping a typical ~150-residue PAS domain sequence onto the entire crystallographic structure of PYP, we show that the PAS sequence similarities and differences are consistent with a shared 3D fold (the PAS/PYP module) with obvious potential for a ligand-binding cavity. Thus, PYP appears to prototypically exhibit all the major structural and functional features characteristic of the PAS domain superfamily: the shared PAS/PYP modular domain fold of ~125–150 residues, a sensor function often linked to ligand or cofactor (chromophore) binding, and signal transduction capability governed by heterodimeric assembly (to the downstream partner of PYP). This 3D PAS/PYP module provides a structural model to guide experimental testing of hypotheses regarding ligand-binding, dimerization, and signal transduction.

A large and growing set of multidomain protein sensors and transcription factors involved in signal transduction include PAS domain sequences (http://www.whoi.edu/biology/hahnm.html). The PAS acronym was coined originally (1) to describe the ~270-residue region encompassing two direct sequence repeats (PAS-A and PAS-B) of ~50 residues each, that had been identified in the Drosophila Period clock protein (PER) (2), the vertebrate Aryl hydrocarbon receptor nuclear translocator (ARNT) (3), and the Drosophila Single-minded (SIM) (2). These three proteins are involved in regulation of circadian rhythms, activation of the xenobiotic response, and cell fate determination, respectively. More recently, PAS domains have been found in many other proteins, including histidine kinases (4), light receptor and regulator proteins (5), clock proteins (6, 7), sensor proteins (oxygen/redox sensors), ion channels (5), and a Ser/Thr kinase with a putative redox-sensing or flavin-binding domain, in which PAS regions are named “LOV” (light, oxygen, or voltage) (8). These PAS-containing proteins occur in a wide range of living organisms including eubacteria, archaea, cyanobacteria, fungi, plants, insects, and mammals (5). PAS-containing proteins have been categorized (5) into three functional subgroups: (i) transcription activators [DNA-binding proteins with both basic helix-loop-helix (bHLH) and PAS sequence motifs], (ii) sensor modules of two-component regulatory systems (oxygen sensor, nitrogen fixation, sensor kinase, etc.), and (iii) ion channels (in eucarya).

One function of the PAS domain is to mediate protein-protein interactions (9, 7). Dimerization has been demonstrated for many transcriptional activators such as the aryl hydrocarbon receptor (AHR), ARNT, SIM, hypoxia-inducible factor 1 (HIF-1), Member Of the PAS superfamily (MOPs), and the trachealess (TRH) protein. Dimerization is known to be mediated by both the bHLH region of these transcription activators and by their PAS repeats (9–14). Some PAS-containing proteins lack a bHLH region (PER) but can still either homodimerize (9) or heterodimerize with other bHLH-PAS-containing proteins (9–11) through their PAS domains in vitro. A second function for PAS domains is ligand and/or cofactor binding, as is the case for AHR (15) and for the heme-binding bacterial O2-sensing protein FixL (16).

Sequence similarities between the PAS repeats and photoactive yellow protein (PYP), a self-contained bacterial blue-light photoreceptor (17–19) implicated in negative phototaxis (20), were identified by Lagarias (21), who probed a sequence database with a 43-residue consensus sequence constructed from the PAS-A and PAS-B domains of phytochromes, which are the red and far-red photoreceptors in plants. Recently, many more PAS domains have been shown to exhibit sequence similarities with PYP (22, 7, 23, 5, 4), and sequence alignments have been extended into the “S2” region that is located immediately C-terminal to the original PAS repeats (5) and into a second more C-terminal region of the PAS B sequences, termed “PAC” (4). PYP also exhibits the functions characteristic of PAS domains: sensing (of light), binding of ligand/cofactor (chromophore), and signal transduction through protein-protein interaction (with the downstream partner of PYP).

‡To whom reprint requests should be addressed. E-mail: [email protected].

© 1998 by The National Academy of Sciences 0027–8424/98/955884–7$2.00/0

PNAS is available online at http://www.pnas.org.

Abbreviations: PYP, photoactive yellow protein; PER, period; ARNT, aryl hydrocarbon receptor nuclear translocator; SIM, single-minded; bHLH, basic helix loop helix; AHR, aryl hydrocarbon receptor; HIF-1, hypoxia-inducible factor 1; MOPs, member of the PAS superfamily; TRH, trachealess; 3D, three dimensional.

PHOTOACTIVE YELLOW PROTEIN: A STRUCTURAL PROTOTYPE FOR THE THREE-DIMENSIONAL FOLD OF THE PAS DOMAIN SUPERFAMILY


Here, we present the hypothesis that the entire PYP protein fold is the structural prototype for the modular, three-dimensional (3D) PAS A and PAS B domain folds in PAS-containing proteins. We define the PAS/PYP module with a length of ~125–150 residues that matches the length of the PYP sequence and encompasses the entire PYP protein fold. PYP therefore appears to be both structurally and functionally a prototypical PAS domain, with a modular single-domain fold that links sensing and ligand binding to signal transduction via dimerization with another protein. A 3D molecular model of the PAS B domain of the human protein ARNT, based on the PYP structure, provides a detailed structural framework for integrating and differentiating sequence and functional data and suggests specific regions to test experimentally for involvement in ligand binding, dimerization, and signal transduction.

METHODS

Sequence Alignment. Representative members of the ARNT family were aligned to each other using the program PILEUP (24) from the Genetics Computer Group (GCG) suite (25). Several PAS-containing sequences, identified from previous publications (4, 5, 26) and obtained from the National Center for Biotechnology Information Entrez Web service (http://www.ncbi.nlm.nih.gov/Entrez/), were then successively added to the alignment with the PILEUP program. Finally, the sequence of PYP was added manually to complete the sequence alignment (see Fig. 2). Automatic alignment of the PYP sequence with the PAS domain sequences is hindered by the presence of a few apparently mismatched residues that lower the alignment score (mainly Gly→hydrophobic). Such residues are discussed in detail later.
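To make the effect of such mismatches concrete, the following minimal Python sketch scores two pre-aligned fragments column by column. The scoring values (match, mismatch, gap), the function name, and both fragments are illustrative assumptions; they are not the PILEUP parameters, and the fragments are not taken from Fig. 2.

```python
def score_alignment(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Sum a simple per-column score for two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            total += gap        # a gap in either sequence
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # e.g. Gly aligned with a hydrophobic residue
    return total


# Hypothetical fragments: identical except for Gly versus Phe in column 1.
pas_like = "FVLDENGR"
pyp_like = "GVLDENGR"

mismatch_score = score_alignment(pas_like, pyp_like)   # 7 matches, 1 mismatch -> 6
identical_score = score_alignment(pas_like, pas_like)  # 8 matches -> 8
print(mismatch_score, identical_score)
```

With real substitution matrices the penalty for a Gly→hydrophobic column differs in size, but the qualitative effect is the same: each such column pulls the total score down.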

Molecular Modeling. Using the alignment in Fig. 2, we built a 3D model of the PAS B domain of the human ARNT protein, based on the coordinates of dark-state PYP (2phy.pdb) (27). Nonidentical side chains were replaced using the program XFIT (28). Side-chain conformations were taken from a rotamer dictionary based on the library of Tuffery et al. (29). In each case, the most common rotamer was used unless it displayed strong steric clashes with any backbone atoms. Alternative rotamers were used for residues 33, 34, 62, 63, and 112 (PYP numbering is used throughout this paper), and a significant deviation from the standard rotamers was necessary to fit residue 29. The resulting ARNT model was energy-minimized with the XPLOR program (30), using the conjugate gradient method (31) and the CHARMM22 all-atom parameter set (32). The dielectric constant was set to one. The shifted electrostatic and the switched van der Waals functions were selected using a cut-on value of 6.5 Å and a cut-off value of Å. Nonbonded interactions were cut off beyond 9 Å. Hydrogen atoms were added with the HBUILD program (33), and their positions were energy-minimized until the norm of the gradient was <0.1. Then, while the backbone was kept fixed, all side-chain positions in the model were energy-minimized with the electrostatic energy term turned off, until the norm of the gradient was <0.5. This was followed by a short minimization of all atomic positions (norm of the gradient <2.0) to remove any remaining clashes between side-chain and main-chain atoms. Two-residue insertions at positions 87 and 98 and a single-residue deletion at position 69 were introduced with the program TURBOFRODO (34). Energy minimization was performed again as described above. The root-mean-square deviation between the backbone atoms (N, Cα, and C) of the model and of PYP is 0.76 Å. A quality check performed with the program PROCHECK (35) showed that 87.1% of φ-ψ angles were in the most favored regions, compared with 90.6% in the crystallographic structure of dark-state PYP.
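A backbone root-mean-square deviation of the kind quoted above can be computed as in the following Python sketch. It assumes that the two coordinate sets are already in the same frame (reasonable for a model built directly on its template) and that equivalent atoms are listed in the same order; no superposition step is performed. The function name and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists.

    Each list holds (x, y, z) tuples in angstroms, for example the N, CA, and C
    atoms of every residue shared by a model and its template.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


# Toy example: every atom is displaced by exactly 1 angstrom along z.
template = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.5, 1.0, 0.0)]
model = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (2.5, 1.0, 1.0)]
print(rmsd(template, model))
```

Residues inserted into or deleted from the model have no counterpart in the template and would have to be left out of both lists before the comparison.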

RESULTS

The PAS/PYP Module. PYP provides a structural and functional prototype for the 3D fold of the PAS domain superfamily (Fig. 1), which we name the PAS/PYP module. PYP is a self-contained, bacterial blue-light photoreceptor, with an unusual fold characterized by a central six-stranded β-sheet with N- and C-terminal β-strands in the center (27). The overall PYP fold breaks down into four segments: (i) the N-terminal helical lariat (residues 1–28), including helices α1 and α2; (ii) the first three-stranded half of the central β-sheet (residues 29–69), including the β1, β2 hairpin, two short intervening α-helices (α3 and α4), β3, and an overlapping turn of π-helix; (iii) the helical connector (residues 70–86), composed predominantly of the long α5-helix that diagonally crosses the β-sheet to connect the two edge β-strands; and (iv) the last three-stranded half of the central β-sheet, including β4, a connecting loop, and the β5, β6 hairpin. PYP has a hydrophobic core on each side of the central β-sheet (27). The N-terminal helical lariat caps one side of the β-sheet to form the smaller hydrophobic core. The remaining helices and loops, together with the central β-sheet, surround the 4-hydroxycinnamoyl chromophore to form the larger hydrophobic core. Helix α3 and flanking residues contribute the hydrogen-bonding network for the phenolic hydroxyl at the tip of the chromophore.
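For bookkeeping, the four segments can be held in a small lookup table, as in the Python sketch below. The residue boundaries follow the legend of Fig. 1 (1–28, 29–69, 70–87, 88–125, PYP numbering); the names `SEGMENTS` and `segment_of` are hypothetical and introduced only for illustration.

```python
# Segment boundaries in PYP numbering, as listed in the legend of Fig. 1.
SEGMENTS = [
    ("N-terminal cap", 1, 28),
    ("PAS core", 29, 69),
    ("helical connector", 70, 87),
    ("beta-scaffold", 88, 125),
]


def segment_of(residue):
    """Return the name of the PAS/PYP segment containing a PYP residue number."""
    for name, first, last in SEGMENTS:
        if first <= residue <= last:
            return name
    raise ValueError(f"residue {residue} is outside the 125-residue PYP sequence")


# Residues discussed in the text: chromophore attachment site and Arg gateway.
for number in (52, 69):
    print(number, segment_of(number))
```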

The PAS core (Fig. 1), the second segment of the PAS/PYP module, provides the photosensing active site of PYP and roughly corresponds to the traditional repeating PAS sequence motif of ~50 amino acids (1–3). This key portion of the PYP structure, ending at the Cys69 attachment site for the chromophore, forms the majority of the immediate environment of the chromophore, provides all residues that hydrogen bond to the chromophore, and supplies the Arg52 gateway (27). The Arg gateway likely participates in PYP heterodimerization with a downstream signal transduction protein, by moving and providing solvent access to the chromophore during the long-lived, bleached, signaling intermediate of the PYP light cycle (36).

FIG. 1. A proposed PAS/PYP 3D fold illustrated on the PYP structure. The N-terminal cap, colored in purple, contains residues 1 to 28. The PAS core, colored in gold, is the domain where higher sequence homology is found among various members of the PAS-containing molecules. It spans residues 29 to 69. The helical connector, colored in green, includes a short loop followed by the helix α5 and spans residues 70 to 87. The β-scaffold, colored in blue, contains the last three strands of PYP, spanning residues 88 to 125.

The β-scaffold, the fourth segment of the PAS/PYP module, provides a long platform with a characteristic β-sheet twist that supports the PAS core and completes the central six-stranded PYP β-sheet. This β-scaffold (Fig. 1) approximately matches the PAC sequence motif described by Ponting and Aravind (4). Within this β-scaffold, the end of the fourth β-strand plus the ω-loop (37) connecting β4 and β5 wrap around the PAS core to complete the chromophore environment. In the PAS “S-box” nomenclature created by Zhulin and coworkers (5), the S1 box corresponds to the last three-fourths of the PAS core, and the S2 box covers most of the β-scaffold.

The central β-sheet of the PAS/PYP module is protected from solvent by the two remaining segments: the N-terminal cap protects one side, whereas the helical connector combines with parts of the PAS core to protect the other. The PAS-related LOV (light, oxygen, or voltage) sequence motif encompasses the PAS core, the helical connector, and the β-scaffold. This sequence region was identified by Briggs and coworkers (8) in the plant protein NPH1 (which participates in the signal-transduction pathway for phototropism) and in a family of proteins regulated by environmental factors that could change their redox status.

Sequence Conservation Between PYP and PAS Domains. PAS domain sequences occur in all three kingdoms of life and act in a multitude of regulatory, sensing, and signal transduction pathways. Thus, sequence similarities are limited and may be further obscured by the functional variation of buried active-site residues that would otherwise exhibit the relative conservation expected of inward-facing residues of the hydrophobic core. To evaluate the limited sequence similarity between PYP and PAS domains, we compiled a full-length PYP sequence alignment with a set of PAS domain sequences (Fig. 2). We used automated protein sequence alignment to align the full-length sequences of closely related PAS proteins (Fig. 2), including the original PAS trio of PER, ARNT, and SIM, as well as the more recently discovered mammalian CLOCK proteins and their homologues. PYP sequences from Ectothiorhodospira halophila (38), corresponding to the crystallographic structure, and three other bacteria were added manually, starting from the sequence registration within the PAS core identified by Lagarias et al. (21) using a phytochrome PAS repeat consensus sequence as the search probe. ARNT was chosen as the major PAS protein family to include and model-build because of the many known sequences from diverse species, their relatively unambiguous alignment with PYP sequences, and the importance of ARNT as a common regulatory partner in many PAS protein heterodimers (12, 13, 39, 40). In trial alignments, we discovered that the PAS-B sequences align better with PYP than do their more N-terminal PAS-A counterparts. As shown in Fig. 2, we found that the sequence alignment of PYP with PER and with these bHLH-PAS-containing transcriptional activators can be extended both N- and C-terminally to encompass the entire PYP single-domain fold.

FIG. 2. Similarities revealed by multiple sequence alignment of several members of the PAS-containing proteins and members of the PYP family. The alignment was performed using the program PILEUP in the GCG suite (25), starting from the ARNT molecules, then adding each PAS-containing molecule in the list. PYP molecules were manually aligned on the top (see Methods). White letter amino acids are conserved in both PYP and PAS-containing proteins. Red letter amino acids highlight significant differences between PYP and PAS-containing proteins. The secondary structure of PYP is displayed on the top using the color coding from Fig. 1. Helices are represented by “noodles”, strands by arrows, and loops by lines. Accession numbers from the SwissProt database, as extracted using the Entrez web service (http://www.ncbi.nlm.nih.gov/Entrez/protein.html), are P16113 (pyp_ecto), X98888 (pyp_rhodosp), X98889 (pyp_rhodoba), M19029 (sim_fly), U33427 (trh_fly), U22431 (hifa_human), U51627 (mop3_human), X03636 (per_fly), U10325 (arnt_mouse), M69238 (arnt_human), D45239 (arnt_rabbit), and AF020426 (arnt_fly). Other sequences were obtained through the Entrez service by searching full names of proteins (pyp_chroma, clock_mouse, and arnt_trout).

The diversity of PAS domain sequences and the low sequence similarity among the more distant members made sequence alignment challenging but also led to an interesting discovery. Automated alignment with standard programs failed to properly align PYP sequences with the PAS proteins. Similarly, when searching the nonredundant protein sequence database with the BLAST server (http://www.ncbi.nlm.nih.gov/BLAST), the PYP sequence alone failed to select any PAS-domain proteins (although PAS protein sequences can successfully select PYP; ref. 21). However, we discovered that by simply changing a few functionally key PYP residues (G29 into F, G47 into V, and R52 into Y), the modified sequence was now able to pick a member of the ARNT family in a BLAST search. These residues were chosen because their specific roles in PYP would not likely be conserved in any other PAS domain proteins. As shown in Fig. 2, PYP-conserved G29 and G47 are replaced by large hydrophobic residues in the PAS domain sequences. In sequence evolution, the conservation of glycine residues often results from their unrestricted backbone dihedral angles, freed by the absence of a Cβ atom. However, in the PYP structures, neither G29 nor G47 occupies a region of φ, ψ dihedral space forbidden to other larger amino acids. Instead, substitution of these glycines in PYP appears to be spatially prohibited by the proximity of adjacent side chains that make key interactions with the chromophore of PYP, which is unlikely to be present in PAS domain proteins. In PYP, substitution at G29 would likely interfere with buried E46, which forms an important hydrogen bond with the phenolic oxygen of the deprotonated chromophore of dark-state PYP (27). Likewise, substitution at PYP G47 would likely interfere with buried Y42, which hydrogen bonds with the same phenolic oxygen of the PYP chromophore. During the PYP photocycle, R52 actively participates in signal transduction by undergoing conformational changes that allow the photoisomerized chromophore access to solvent (36). Although these three PYP function-specific residues would not require conservation in PAS proteins, standard substitution matrices used for automated sequence alignment cannot reasonably accommodate the resulting sequence substitutions and thus fail to perform appropriate sequence alignments. Such alignment peculiarities may often stymie sequence alignment programs and preclude sequence alignment where no structural information is available.
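The three substitutions described above (G29 into F, G47 into V, and R52 into Y) can be applied to a query sequence before a database search, as in the Python sketch below. The helper name `apply_substitutions` and the mutation-string format are assumptions made for illustration, and the placeholder sequence is synthetic: it only carries the right residues at positions 29, 47, and 52 and is not the PYP sequence.

```python
import re


def apply_substitutions(sequence, substitutions):
    """Apply point substitutions such as 'G29F' (1-based numbering) to a sequence.

    Each substitution is checked against the original residue so that a
    numbering mistake raises an error instead of silently changing the query.
    """
    residues = list(sequence)
    for item in substitutions:
        parsed = re.fullmatch(r"([A-Z])(\d+)([A-Z])", item)
        if parsed is None:
            raise ValueError(f"cannot parse substitution {item!r}")
        old, position, new = parsed.group(1), int(parsed.group(2)), parsed.group(3)
        if not 1 <= position <= len(residues):
            raise ValueError(f"position {position} is outside the sequence")
        if residues[position - 1] != old:
            raise ValueError(f"expected {old} at {position}, found {residues[position - 1]}")
        residues[position - 1] = new
    return "".join(residues)


# Synthetic placeholder (NOT the real PYP sequence): Ala everywhere except
# the three positions named in the text.
placeholder = ["A"] * 60
for position, residue in ((29, "G"), (47, "G"), (52, "R")):
    placeholder[position - 1] = residue
placeholder = "".join(placeholder)

modified = apply_substitutions(placeholder, ["G29F", "G47V", "R52Y"])
print(modified)
```

The modified string would then be submitted as the query in place of the native sequence.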

At the sequence level, the alignment presented in Fig. 2 identifies several interesting features. First, only two short insertions into PYP are required: between residues 87 and 88 (PYP numbering is used throughout) and between residues 98 and 99. Second, a single short gap in SIM, TRH, and HIF-1α occurs near the N terminus. Third, PYP C69, which carries the chromophore, is deleted. Fourth, the following residues are mostly conserved among all sequences: V4, D34, G37, I39, N43, G51, P54, V57, I58, G59, K60, N61, F63, P68, D71, F79, F92, Y94, V120, and F121. Fifth, the following residues represent the major differences between PYP and the PAS domains: G29, Y42, E46, G47, R52, D65, A67, T/A70, E/D93, and D/A97. These sequence conservations and differences are clarified at the 3D level, based upon the PYP structure.
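Lists of conserved positions like the one above can be drawn from an alignment by tallying residues column by column. The Python sketch below shows one simple way to do so; the function name and the four-column toy alignment are hypothetical and are not taken from Fig. 2, and no claim is made that this is how the lists above were compiled.

```python
from collections import Counter


def column_conservation(alignment):
    """Return (most common residue, fraction of sequences) for each column.

    Gap characters ('-') are never reported as the consensus residue, but they
    still count in the denominator, so gapped columns score lower.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    summary = []
    for column in zip(*alignment):
        residues = [symbol for symbol in column if symbol != "-"]
        if not residues:
            summary.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        summary.append((residue, count / len(column)))
    return summary


# Toy alignment of three sequences and four columns.
toy_alignment = ["GDPV", "GNPI", "G-PV"]
conservation = column_conservation(toy_alignment)
fully_conserved = [i for i, (_, fraction) in enumerate(conservation) if fraction == 1.0]
print(fully_conserved)
```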

Structural Consequences of Sequence Conservation and Differences. Because PAS domains function to mediate macromolecular interactions along signaling pathways, the development and analysis of 3D structural models for specific PAS/PYP modules is immediately useful for designing experiments to probe function. We explored the potential structural and functional consequences of significant sequence similarities and differences by examining the structural roles of the conserved residues in PYP (Fig. 3) and by mapping residues exhibiting sequence conservation and major sequence differences onto the molecular model of the ARNT PAS B domain (Fig. 4).

FIG. 3. Detailed residue interactions for conserved residues that form specific side-chain to main-chain hydrogen bonds in PYP and appear to be retained in PAS-containing molecules. First, Asp 34 OD1 hydrogen bonds to three backbone nitrogens from residues D36, G37, and N38 (2.87 Å, 3.08 Å, and 2.89 Å, respectively). Most residues at position 34 have an atom OD1 (Asp, Asn) or similar (OG1 in Ser, Thr). Second, Asn 43 OD1 hydrogen bonds to three backbone nitrogen atoms: A30, A45, and E46 (2.96 Å, 3.55 Å, and 3.06 Å). All residues at position 43 in Fig. 2 have an atom OD1 or OE1. Third, Asn 61 OD1 and ND2 hydrogen bond to three backbone nitrogens and one backbone oxygen: F62, F63, K64, and D36 (3.39 Å, 3.09 Å, 3.00 Å, and 2.96 Å). Almost all residues at position 61 in Fig. 2 have an atom OD1 or OG1. Drawing made with the program MOLSCRIPT (46).

In general, both the greatest sequence conservation (Figs. 2 and 3) and the most striking sequence differences (Fig. 2) occur within the PAS core (Fig. 4), which represents the “active site” and chromophore ligand-binding region of PYP. The 20 natural amino acids do not all have the same probability of being structurally conserved. Pro residues, because of their backbone conformational restriction, and Gly residues, especially those with left-handed backbone conformations, often are evolutionarily conserved within a protein family. Both Gly and Pro residues frequently contribute to kinks and turns in the polypeptide chain. Pro 54 in PYP is conserved in all PAS domains shown in Fig. 2. PYP has five left-handed Gly residues: G7 and G59 participate in type II β-turns, G37 mediates a β-bulge, G51 ends α3, and G86 ends the helical connector α5. In the PAS domain proteins of Fig. 2, P54 is conserved completely; G51 and G59 of PYP are conserved predominantly or substituted with residues (Asp, Asn, and Glu) that are fairly tolerant of left-handed backbone conformations; G7 and G37 of PYP also are mostly conserved but show more unusual substitutions; and the remaining PYP residues with left-handed backbone conformations (G86 and Q99) occur near regions tolerant of insertions, making analysis of their conformations difficult.
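Whether a residue adopts a left-handed backbone conformation can be read from its φ angle, the dihedral defined by the atoms C(i-1), N(i), Cα(i), and C(i). The Python sketch below computes a signed dihedral angle from four points and, as a deliberate simplification assumed here, treats any positive φ as left-handed. The function names and test points are hypothetical; the points are a geometric toy case, not protein coordinates.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle p0-p1-p2-p3 in degrees (positive = clockwise
    when looking along the central bond from p1 to p2)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / length, b1[1] / length, b1[2] / length)
    # Keep only the components of b0 and b2 perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def is_left_handed(phi):
    """Simplified test (an assumption of this sketch): positive phi."""
    return phi > 0.0


# Geometric toy case with the central bond along z.
phi = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(phi, is_left_handed(phi))
```

A fuller treatment would also take ψ into account and use region boundaries from a Ramachandran analysis rather than the sign of φ alone.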

FIG. 4. The PAS domain of the human ARNT protein modeled from the PYP crystal structure (2phy.pdb). The Cα trace is represented by a tube. The N-terminal cap is colored in magenta, the PAS core in gold, the helical connector in green, and the β-scaffold in blue. Conserved side chains between PYP and PAS-containing proteins are displayed in white. Amino acids that significantly vary between PYP and PAS-containing molecules are drawn in red. PYP’s chromophore is displayed in yellow in the same orientation as in the PYP molecule. Most conserved residues are located in the hydrophobic core of the PAS core domain (in gold). Most of the significantly variant amino acids (in red) occur in the vicinity of the chromophore pocket, as expected for molecules that carry out different biological functions. In the ARNT model, H67 and F29 occupy the chromophore pocket. Figure displayed using the Application Visualization System (AVS) (AVS, Waltham, MA).

Other residues likely to be conserved are those that make critical hydrogen bonds required either for proper folding or for stabilizing a particular fold. In particular, residues that form key side-chain hydrogen bonds to main-chain atoms of other residues are expected to be conserved. Residues D34, N43, K60, and N61 all serve this function in PYP, and their counterparts in the PAS domains of Fig. 2 should be able to maintain this hydrogen-bonding function. Asp 34 in PYP can make up to three hydrogen bonds to peptide nitrogen atoms from residues 36–38 (Fig. 3, Top). This requires an appropriately positioned hydrogen-bond acceptor atom like OD1 of Asp or OD1 from Asn (ARNT, TRH, Fig. 3, Top) or OG1 from Ser/Thr (PER, CLOCK, HIF-1α). Similarly, the Asn at position 43 also can make up to three hydrogen bonds to backbone nitrogen atoms of residues 30, 45, and 46. This particular interaction also can be mediated by the atom OD1 of the Asp in ARNT, PER, CLOCK, MOP3, SIM, and HIF-1α or by an OE1 from a Glu in TRH (Fig. 3, Middle). In these two examples, hydrogen bonds stabilize a turn structure. The last example involves hydrogen bonds between a conserved Asn in PYP and ARNT at position 61, where both its hydrogen-bond donor and acceptor atoms can make contacts with other backbone atoms (Fig. 3, Bottom). Although this Asn is not fully conserved in every PAS domain (Fig. 2), the hydrogen bonds to OD1 that mediate π-helix formation are conserved by most residues at position 61 (Asn/Ser/Thr residues, Fig. 3, Bottom).
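The donor-acceptor distances listed in the legend of Fig. 3 can be screened against a distance criterion, as in the Python sketch below. Two assumptions are made here: the distances are paired with partner residues in the order in which the legend lists them, and a cutoff of 3.5 Å is used purely for illustration (the paper does not state a cutoff). The dictionary and function names are hypothetical.

```python
# Distances in angstroms, transcribed from the legend of Fig. 3 and paired with
# partner atoms in the order listed there (an assumption of this sketch).
FIG3_DISTANCES = {
    "Asp34 OD1": {"D36 N": 2.87, "G37 N": 3.08, "N38 N": 2.89},
    "Asn43 OD1": {"A30 N": 2.96, "A45 N": 3.55, "E46 N": 3.06},
    "Asn61 OD1/ND2": {"F62 N": 3.39, "F63 N": 3.09, "K64 N": 3.00, "D36 O": 2.96},
}


def contacts_within(distances, cutoff=3.5):
    """Keep, for each side-chain atom, the partners at or below the cutoff."""
    return {
        atom: sorted(partner for partner, value in partners.items() if value <= cutoff)
        for atom, partners in distances.items()
    }


for atom, partners in contacts_within(FIG3_DISTANCES).items():
    print(atom, partners)
```

With this illustrative cutoff, the 3.55 Å contact from Asn 43 to A45 falls just outside, which shows how sensitive such counts are to the chosen threshold.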

The segments of the PYP protein outside of the PAS core exhibit lesser sequence conservation. Here, those residues that are conserved across all or most of the PAS domain protein families (Fig. 2) generally have identifiable roles in defining the secondary and tertiary structure of PYP. Both the helical connector and the β-scaffold show lesser sequence conservation (Fig. 2), consistent with their apparent role in maintaining appropriate secondary structural elements for the PAS/PYP module. The largest length variations in the sequence alignment are located at PYP positions 87–88 in the helical connector, which was observed previously to differ among PAS domains (5). The only other insertion in PYP maps to the turn joining the first two strands (β4 and β5) of the β-scaffold (Figs. 2 and 4). This insertion, found in every PAS domain sequence in Fig. 2, may highlight a significant structural difference distinguishing PYP from other PAS-containing proteins. The N-terminal cap, which is located on the opposing side of the module from the PYP active site, appears the least conserved among PAS domain proteins and in some cases may be substituted with other structures capable of protecting the central β-sheet from solvent.

Hydrophobic residues are usually buried in the core of a protein and therefore should show some conservation of their hydrophobic character. However, unlike the specificity of particular side-chain to main-chain hydrogen bonds, the nonspecific nature of hydrophobic packing allows more liberty in sequence variation. In particular, complementary substitutions of hydrophobic residues that maintain a well-packed hydrophobic core are frequently observed (41). Hydrophobic core residues conserved between PYP and PAS domains include V4, I39, V57, I58, F63, F79, F92, Y94, V120, and F121, which are respectively conserved as or replaced by V4, V/F39, L/V57, L/V58, Y/M/V/L63, F/Y/H79, F/Y92, L/M/F/A94, I/F120, and I/V121 in PAS domains (Figs. 2 and 4).
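The substitution pattern listed above can be checked against a simple hydrophobicity classification, as sketched below in Python. The set of residues treated as hydrophobic is an assumption of this sketch (classifications differ between scales, and Tyr, Ala, and Cys are borderline); the table is transcribed from the list in the text, and the names are hypothetical.

```python
# Residues treated as hydrophobic in this sketch (an assumed, simplified class).
HYDROPHOBIC = set("AVLIMFWYC")

# PYP position -> (PYP residue, residues observed in the PAS domains),
# transcribed from the list in the text.
CORE_POSITIONS = {
    4: ("V", "V"),
    39: ("I", "VF"),
    57: ("V", "LV"),
    58: ("I", "LV"),
    63: ("F", "YMVL"),
    79: ("F", "FYH"),
    92: ("F", "FY"),
    94: ("Y", "LMFA"),
    120: ("V", "IF"),
    121: ("F", "IV"),
}


def non_hydrophobic_substitutions(core_positions, hydrophobic=HYDROPHOBIC):
    """Return {position: [residues]} for PAS residues outside the hydrophobic set."""
    exceptions = {}
    for position, (_, pas_residues) in core_positions.items():
        outliers = [residue for residue in pas_residues if residue not in hydrophobic]
        if outliers:
            exceptions[position] = outliers
    return exceptions


print(non_hydrophobic_substitutions(CORE_POSITIONS))
```

Under this classification the only residue flagged is the His found at position 79 in some PAS domains; every other listed substitution stays within the hydrophobic class.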

PAS Domain Differences in the PYP Chromophore Environment. Although PYP and PAS-containing proteins share sensor and signal transduction functions, PYP incorporates both functions within a single, small, globular domain of 125 residues, whereas the much larger multidomain PAS-containing proteins like the phytochromes segregate these functions to different domains. The covalently bound 4-hydroxycinnamoyl chromophore of PYP that is necessary for light-mediated negative phototaxis is not found in any of the well-characterized PAS domain proteins. Hence, it is expected that residues of the chromophore environment will differ between PYP and PAS domains. Indeed, most of the PYP residues that are not shared with the other PAS domain proteins interact with the chromophore. In the PAS domain proteins in Fig. 2, chromophore-bound PYP Cys69 is deleted, the inward-facing hydrophilic side chains (Y42, E46, and T50) that form the hydrogen-bonding network stabilizing the phenolic hydroxyl at the buried tip of the 4-hydroxycinnamoyl chromophore (27, 42) are replaced with hydrophobic residues, and the R52 side chain that forms the gateway of the chromophore to solvent (27, 36, 42) is replaced by Tyr.

In the ARNT model (Fig. 4), the cavity created by the absence of the PYP chromophore is partly filled by the highly conserved H67, which can make a buried salt bridge with the conserved E/D70 in the PAS domains (Fig. 4). However, a predominantly hydrophobic cavity about one-half the size of a heme remains, caused by the reduction in size of residues Y42, F62, and F96 in PYP becoming V42, I62, and S96 in ARNT. This cavity might provide insights into the ligand-binding properties of the PAS domains, including possible specific hydrophilic interactions with conserved residues H67 and E70. Interestingly, PAS-conserved ARNT F29, replacing PYP G29, also can be directed into the central cavity because PYP E46 has been replaced with a smaller hydrophobic residue (Cys in ARNT). This explains why G29 was identified as PYP-specific during a BLAST search, as described above. A large hydrophobic residue at this location can help to fill the void left by the absence of PYP’s chromophore.

The last major sequence differences between PYP and PAS-containing proteins in the chromophore environment are located at positions 93 and 97, where negative charges are replaced by positive charges. These residues are part of a surface array of positive charges that includes R93, R95, K97, W99, and W101 in ARNT and in most PAS domains. These residues are very close to the insertion located at position 98 in PYP (Fig. 4), which was modeled to be a turn (consistent with predictions by the program DSSP; ref. 43). In PYP, this turn is positioned at the entrance of the chromophore pocket and may be important for a putative interaction with a partner to accomplish the role of PYP in signal transduction.

Protein-Protein Dimerization Interface. During PYP’s light cycle, the chromophore undergoes a trans-to-cis isomerization and the protein rearranges slightly to accommodate the new chromophore configuration (36). The largest movements, taking place in the chromophore and the side chain of R52, presumably send the signal to the unknown downstream partner of PYP, leading eventually to negative phototaxis. Therefore, the surface of PYP surrounding these moving residues (Fig. 4) likely provides the interaction face for heterodimerization with the signaling partner of PYP. In other PAS-containing proteins, the PAS domains mediate homo- or heterodimer formation, in most cases with other PAS domains. We suggest that the interaction face of PYP is the prototype for the dimerization interface of the PAS domain protein superfamily.

This putative dimerization interface includes residues from three regions: (i) a central region (51–68) that includes residues located within 10 Å of Y52, (ii) the loop 95–103 that includes the insertion in the β-scaffold, and (iii) two residues from α3 in the PAS core domain, which are H44 and R45 in ARNT. Residues that are exposed to solvent are displayed in Fig. 5A. The molecular surface of these regions is shown in Fig. 5B. Exposed side chains are H44, R45, Y52, Q53, Q55, E56, K60, F65, R95, K97, N98, Q98A, E98B, W99, W101, and R103. Almost all of the residue types found in this interface can form specific contacts to other amino acids and are characteristic of other protein-protein interfaces. The loop composed of residues 95–103, which differs between PYP and the PAS-containing proteins in Fig. 2, might be involved in the recognition specificity for PAS dimerization because of the low sequence homology among PAS domains in this area (Fig. 2).
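A distance-based selection such as "residues located within 10 Å of Y52" can be expressed as in the following Python sketch. It assumes one representative point per residue (for example, the Cα atom); the text does not specify which atoms were used. The function name and all coordinates are hypothetical toy values, not taken from the model.

```python
import math


def residues_within(coordinates, center, cutoff=10.0):
    """Return sorted residue numbers whose point lies within cutoff of the center.

    `coordinates` maps residue number -> (x, y, z) in angstroms, with one
    representative point per residue. The center residue itself is excluded.
    """
    cx, cy, cz = coordinates[center]
    selected = []
    for residue, (x, y, z) in coordinates.items():
        if residue == center:
            continue
        distance = math.sqrt((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2)
        if distance <= cutoff:
            selected.append(residue)
    return sorted(selected)


# Toy coordinates (hypothetical) keyed by PYP residue number.
toy_coordinates = {
    52: (0.0, 0.0, 0.0),
    53: (3.8, 0.0, 0.0),
    60: (6.0, 8.0, 0.0),    # exactly 10 angstroms from residue 52
    98: (0.0, 0.0, 9.5),
    4: (20.0, 0.0, 0.0),    # far from residue 52
}
print(residues_within(toy_coordinates, center=52))
```

Solvent exposure, which the text uses as a second criterion, is not addressed by this sketch and would need a separate accessibility calculation.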

DISCUSSION

Although sequence comparison alone is insufficient to demonstrate the proposed structural similarities between PYP and the PAS domains, low sequence homology was expected because of the evolutionary diversity among PAS-containing proteins, which populate all three kingdoms of life. Instead, the similarity and potential homology identified between PYP and the PAS domain superfamily is corroborated by the finding that a modified PYP sequence, replacing only three residues specific to PYP function, allows automated selection of several PAS domain sequences from a nonredundant protein sequence database, albeit with low scores, as expected. The potential homology between PYP and PAS domains is further supported by our ability to generate, from the PYP crystallographic structure, a well-behaved molecular model for the PAS domain from ARNT, a member of the PAS domain superfamily. Indeed, only two insertions and a single deletion were needed for building a 3D model of the PAS-B domain of human ARNT, which exhibits a root-mean-square deviation of 0.76 Å from PYP’s 3D structure, and comparable quality and stability. Moreover, from the resulting molecular model of the ARNT PAS domain, it is possible to identify and understand visually the similarities and differences between PYP and the PAS domains, supporting the proposition for a PAS/PYP prototypical fold. Provided that the PAS/PYP module hypothesis is valid, the 3D model of the ARNT PAS-B domain provides useful insights concerning the two major known functions of PAS domains: protein-protein interaction and ligand binding.

FIG. 5. Predicted PAS functional interactions. (A) Amino acid side chains that might participate in a protein-protein interaction are highlighted. The central segment, formed by residues 51–68, is located within 10 Å of residue Y52 (in yellow). The second area, from residues 95–103 (in cyan), is made of a loop in which an insertion occurs at position 98 in each PAS-containing molecule. The third area is made of two residues (orange) adjacent to the central segment (yellow), H44 and R45, whose side chains point toward the solvent. (B) Same as A, but the molecular surface for these residues is displayed.

Protein-protein interactions between the ubiquitous partner ARNT and PAS-containing proteins AHR, SIM, MOPs, TRH, and HIF-1α (9–11, 13, 14, 44, 45) have been identified in vitro and in vivo. These interactions are mediated by both the bHLH region and the PAS domains. Because the bHLH region is a self-dimerizing structural motif (3), the PAS domains evidently supply the recognition specificity needed for these interactions, rather than driving the interactions per se. This hypothesis is supported by recent work from Zelzer and coworkers (14) revealing that the swapping of PAS domains between SIM and TRH confers the functional specificity of the PAS domain, rather than that of the parent protein. Based on the experimentally determined structures of PYP light-cycle intermediates (36), we propose that the region of the PAS/PYP fold that is involved in protein-protein interaction is centered around residue 52, as shown in Fig. 5. This hypothesis can be tested by site-directed mutagenesis of residues highlighted in this interface (Fig. 5).

Several PAS-containing proteins bind ligands and/or cofactors. To date, however, mapping the ligand binding to the PAS domain itself has been demonstrated only for the FixL (16, 21, 4) and AHR (15) proteins. The PAS domain of FixL binds heme, whereas the PAS-B domain of AHR binds dioxin and other polycyclic aromatic hydrocarbons. Interestingly, for both molecules, the minimum size of the PAS domain that is able to bind the ligand is ~130 residues, the size of the entire PAS/PYP module. Because of a reduction in size of several side chains in ARNT compared to PYP, the 3D model of the ARNT PAS domain displays an internal cavity large enough to accommodate a medium-sized ligand (one-half a heme). Thus, the region previously occupied by the PYP chromophore is a logical choice for a ligand pocket.

In summary, PYP appears to exhibit all of the major structural and functional features characteristic of the PAS domain superfamily: a modular domain of ~125–150 residues, a sensor function linked to ligand or cofactor binding, and signal transduction capability governed by heterodimeric assembly. Thus, we propose the testable hypothesis that the entire PYP protein fold is the structural prototype for the modular, 3D, PAS-A and PAS-B domain folds in PAS-containing proteins. This PAS/PYP module provides a structural model to guide experimental testing of hypotheses regarding ligand binding, dimerization, and signal transduction in PAS proteins.

We thank Christopher Bruns and Christopher D. Putnam for preliminary sequence searches, alignments, and analyses; C. Bruns for help with the illustrations; and C. Bruns, C. D. Putnam, Ulrich K. Genick, and John Tainer for useful criticism and discussion. Research on PYP is funded by the National Institutes of Health Grant GM37684 (to E.D.G.).

1. Nambu, J.R., Lewis, J.O., Jr., Wharton, K.A. & Crews, S.T. (1991) Cell 67, 1157–1167.
2. Crews, S.T., Thomas, J.B. & Goodman, C.S. (1988) Cell 52, 143–151.
3. Hoffman, E.C., Reyes, H., Chu, F.-F., Sander, F., Conley, L.R., Brooks, B.A. & Hankinson, O. (1991) Science 252, 954–958.
4. Ponting, C.P. & Aravind, L. (1997) Curr. Biol. 7, 674–677.
5. Zhulin, I.B., Taylor, B.L. & Dixon, R. (1997) Trends Biochem. Sci. 22, 331–333.
6. King, D.P., Zhao, Y., Sangoram, A.M., Wilsbacher, L.D., Tanaka, M., Antoch, M.P., Steeves, T.D.L., Vitaterna, M.H., Kornhauser, J.M., Lowrey, P.L., et al. (1997) Cell 89, 641–653.
7. Kay, S.A. (1997) Science 276, 753–754.
8. Huala, E., Oeller, P.W., Liscum, E., Han, I.-S., Larsen, E. & Briggs, W.R. (1997) Science 278, 2120–2123.
9. Huang, Z.J., Edery, I. & Rosbash, M. (1993) Nature (London) 364, 259–262.
10. Lindebro, M.C., Poellinger, L. & Whitelaw, M.L. (1995) EMBO J. 14, 3528–3539.
11. McGuire, J., Coumailleau, P., Whitelaw, M.L., Gustafsson, J.-A. & Poellinger, L. (1996) J. Biol. Chem. 270, 31353–31357.
12. Jiang, B.-H., Rue, E., Wang, G.L., Roe, R. & Semenza, G.L. (1996) J. Biol. Chem. 271, 17771–17778.
13. Hogenesch, J.B., Chan, W.K., Jackiw, V.H., Brown, R.C., Gu, Y.-Z., Pray-Grant, M., Perdew, G.H. & Bradfield, C.A. (1997) J. Biol. Chem. 272, 8581–8593.
14. Zelzer, E., Wappner, P. & Shilo, B.Z. (1997) Genes Dev. 11, 2079–2089.
15. Fukunaga, B.N., Probst, M.R., Reisz-Porszasz, S. & Hankinson, O. (1995) J. Biol. Chem. 270, 29270–29278.
16. Monson, E.K., Weinstein, M., Ditta, G.S. & Helinski, D.R. (1992) Proc. Natl. Acad. Sci. USA 89, 4280–4284.
17. Meyer, T.E. (1985) Biochim. Biophys. Acta 806, 175–183.
18. Meyer, T.E., Yakali, E., Cusanovich, M.A. & Tollin, G. (1987) Biochemistry 26, 418–423.
19. Meyer, T.E., Tollin, G., Hazzard, J.H. & Cusanovich, M.A. (1989) Biophys. J. 56, 559–564.
20. Sprenger, W.W., Hoff, W.D., Armitage, J.P. & Hellingwerf, K.J. (1993) J. Bacteriol. 175, 3096–3104.
21. Lagarias, D.M., Wu, S.-H. & Lagarias, J.C. (1995) Plant Mol. Biol. 29, 1127–1142.
22. Linden, H. & Macino, G. (1997) EMBO J. 16, 98–109.
23. Crosthwaite, S., Dunlap, J.C. & Loros, J.J. (1997) Science 276, 753–754.
24. Feng, D.F. & Doolittle, R.F. (1987) J. Mol. Evol. 25, 351–360.
25. Devereux, J., Haeberli, P. & Smithies, O. (1984) Nucleic Acids Res. 12, 387–395.
26. Hahn, M.E., Karchner, S.I., Shapiro, M.A. & Perera, S.A. (1997) Proc. Natl. Acad. Sci. USA 94, 13743–13748.
27. Borgstahl, G.E.O., Williams, D.R. & Getzoff, E.D. (1995) Biochemistry 34, 6278–6287.
28. McRee, D.E. (1992) J. Mol. Graphics 10, 44–47.
29. Tuffery, P., Etchebest, C., Hazout, S. & Lavery, R. (1991) J. Biomol. Struct. Dyn. 8, 1267–1289.
30. Brünger, A.T. (1992) X-PLOR, A system for X-ray crystallography and NMR (Yale University, New Haven, CT), Version 3.1.
31. Powell, M.J.D. (1977) Math. Program. 12, 241–254.
32. Brooks, B., Bruccoleri, R., Olafson, B., States, D., Swaminathan, S. & Karplus, M. (1983) J. Comp. Chem. 4, 187–217.
33. Brünger, A.T. & Karplus, M. (1988) Proteins 4, 148–156.
34. Roussel, A. & Cambillau, C. (1989) TURBO-FRODO in Silicon Graphics Geometry Partners Directory (Silicon Graphics, Mountain View, CA), Version 5.2.
35. Laskowski, R.A., MacArthur, M.W., Moss, D.S. & Thornton, J.M. (1993) J. Appl. Crystallogr. 26, 283–291.
36. Genick, U.K., Borgstahl, G.E.O., Ng, K., Ren, Z., Pradervand, C., Burke, P.M., Srajer, V., Teng, T.-Y., Schildkamp, W., McRee, D.E., et al. (1997) Science 275, 1471–1475.
37. Leszczynski, J.F. & Rose, G.D. (1986) Science 234, 849–855.
38. Baca, M., Borgstahl, G.E.O., Boissinot, M., Burke, P.M., Williams, D.R., Slater, K.A. & Getzoff, E.D. (1994) Biochemistry 33, 14369–14377.
39. Wang, G.L., Jiang, B.-H., Rue, E.A. & Semenza, G.L. (1995) Proc. Natl. Acad. Sci. USA 92, 5510–5514.
40. Kallio, P.J., Pongratz, I., Gradin, K., McGuire, J. & Poellinger, L. (1997) Proc. Natl. Acad. Sci. USA 94, 5667–5672.
41. Getzoff, E.D., Tainer, J.A., Stempien, M.M., Bell, G.I. & Hallewell, R.A. (1989) Proteins Struct. Funct. Genet. 5, 322–336.
42. Genick, U.K., Devanathan, S., Meyer, T.E., Canestrelli, I.L., Williams, E., Cusanovich, M.A., Tollin, G. & Getzoff, E.D. (1997) Biochemistry 36, 8–14.
43. Kabsch, W. & Sander, C. (1983) Biopolymers 22, 2577–2637.
44. Ohshiro, T. & Saigo, K. (1997) Development (Cambridge, U.K.) 124, 3975–3986.
45. Sonnenfeld, M., Ward, M., Nystrom, G., Mosher, J., Stahl, S. & Crews, S. (1997) Development (Cambridge, U.K.) 124, 4571–4582.
46. Kraulis, P.J. (1991) J. Appl. Crystallogr. 24, 946–950.

Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5891–5898, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

New methods of structure refinement for macromolecular structure determination by NMR

(coupling constants/chemical shifts/conformational database/diffusion anisotropy/dipolar couplings)

G. MARIUS CLORE* AND ANGELA M. GRONENBORN*

Laboratory of Chemical Physics, Building 5, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892–0520

ABSTRACT Recent advances in multidimensional NMR methodology have permitted solution structures of proteins in excess of 250 residues to be solved. In this paper, we discuss several methods of structure refinement that promise to increase the accuracy of macromolecular structures determined by NMR. These methods include the use of a conformational database potential and direct refinement against three-bond coupling constants, secondary ¹³C shifts, ¹H shifts, $T_1/T_2$ ratios, and residual dipolar couplings. The latter two measurements provide long range restraints that are not accessible by other solution NMR parameters.

The two major techniques for determining the three-dimensional structures of macromolecules at atomic resolution are x-ray crystallography in the solid state (single crystals) and NMR spectroscopy in solution. Unlike crystallography, NMR measurements are not hampered by the ability or inability of a protein to crystallize. The size of macromolecular structures that can be solved by NMR has been increased dramatically over the last few years (1). The development of a wide range of two-dimensional (2D) NMR experiments in the early 1980s culminated in the determination of the structures of a number of small proteins (2, 3). Under exceptional circumstances, 2D NMR techniques can be applied successfully to determine structures of proteins up to ~100 residues (4, 5). Beyond ~100 residues, however, 2D NMR methods fail, principally because of spectral complexity that cannot be resolved in two dimensions. In the late 1980s and early 1990s, a series of major advances took place in which the spectral resolution was increased by extending the dimensionality to three and four dimensions (1). In addition, by combining such multidimensional experiments with heteronuclear NMR, problems associated with large linewidths can be circumvented by making use of heteronuclear couplings that are large relative to the linewidths. The first successful application of these methods to a protein greater than ~12 kDa was achieved in 1991 with the determination of the solution structure of interleukin 1β, a protein of 18 kDa and 153 residues (6). Concomitant with spectroscopic advances, significant improvements have taken place in the accuracy with which macromolecular structures can be determined. Thus, it is now potentially feasible to determine the structures of proteins in the 15- to 35-kDa range at a resolution comparable to ~2.5-Å resolution crystal structures (7). The upper limit of applicability is probably 60–70 kDa, and the largest single-chain proteins solved to date are ~30 kDa, comprising ~260 residues (8, 9). In this paper, we discuss a number of new refinement strategies aimed at both facilitating NMR structure determination and increasing the accuracy of the resulting structures. These include direct refinement against three-bond coupling constants (10) and ¹³C and ¹H shifts (11–13), as well as the use of conformational database potentials (14, 15). More recently, methods have been developed to obtain structural restraints that characterize long range order a priori (16–18). These methods include making use of the dependence of heteronuclear relaxation on the rotational diffusion anisotropy of nonspherical molecules and of residual dipolar contributions to one-bond heteronuclear couplings arising from small degrees of alignment of molecules in a magnetic field.

General Principles of NMR Structure Determination. Irrespective of the algorithm used, any structure determination by NMR seeks to find the global minimum region of a target function $E_{tot}$ given by $E_{tot} = E_{cov} + E_{vdw} + E_{NMR}$ (Eq. 1), where $E_{cov}$, $E_{vdw}$, and $E_{NMR}$ are terms representing the covalent geometry (bonds, angles, planarity, and chirality), the nonbonded contacts, and the experimental NMR restraints, respectively (19). Algorithms currently used include simulated annealing in both Cartesian (20, 21) and torsion angle space (22), metric matrix distance geometry (23), and minimization with a variable target function in torsion angle space (24).

The main source of geometric information contained in the experimental NMR restraints is provided by the nuclear Overhauser effect (NOE). The NOE (at short mixing times) is proportional to the inverse sixth power of the distance between the protons, so its intensity falls off very rapidly with increasing distance between proton pairs. Consequently, NOEs usually are observed only for proton pairs separated by ≤5–6 Å. Despite the short range nature of the observed interactions, the short approximate interproton distance restraints derived from NOE measurements can be highly conformationally restrictive, particularly when they involve residues that are far apart in the sequence but close together in space (1, 19).
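The steep distance dependence can be made concrete with a minimal Python sketch. It assumes only the inverse sixth power relationship stated above; the function name and the 2.5 Å reference distance are illustrative choices, not values taken from the text.

```python
def relative_noe_intensity(distance, reference_distance=2.5):
    """Return the NOE intensity at `distance` (Å) relative to a reference pair.

    Assumes intensity is proportional to distance**-6 (short mixing times).
    The 2.5 Å reference is an arbitrary illustrative choice.
    """
    if distance <= 0 or reference_distance <= 0:
        raise ValueError("distances must be positive")
    return (reference_distance / distance) ** 6


if __name__ == "__main__":
    # Tabulate the falloff between 2.5 and 6 Å.
    for distance in (2.5, 3.0, 4.0, 5.0, 6.0):
        print(f"{distance:.1f} Å: {relative_noe_intensity(distance):.4f}")
```

Under this assumption, doubling the distance from 2.5 to 5.0 Å reduces the intensity by a factor of 64, which is why proton pairs much beyond 5–6 Å rarely give observable NOEs.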

Systematic bias arising from the different algorithms used to calculate the structures may be introduced via the first two terms, $E_{cov}$ and $E_{vdw}$, in Eq. 1. The values of bond lengths, bond angles, planes, and chirality are known to very high accuracy, so it is clear that the deviations from idealized geometry, as represented by the term $E_{cov}$, should be kept very small. The second term, $E_{vdw}$, representing the nonbonded contacts, is associated with considerably more uncertainty than the covalent geometry (25, 26). Given the numerous ways to represent $E_{vdw}$ (for example, a simple van der Waals repulsion term or a complete empirical energy function including a van der Waals Lennard-Jones 6–12 potential, an electrostatic potential, and a hydrogen bonding potential), it is evident that variability is introduced via $E_{vdw}$. It is therefore essential to ensure that the calculated structures display good nonbonded contacts.

*To whom reprint requests should be addressed. e-mail: [email protected] and [email protected]. PNAS is available online at http://www.pnas.org. Abbreviations: 2D, two-dimensional; NOE, nuclear Overhauser effect.

The uncertainties associated with the covalent geometry and van der Waals terms can introduce errors of ~0.3 Å in the coordinates (26). The major determinant of accuracy, however, resides in the number and quality of the experimental NMR restraints that enter into the third term, $E_{NMR}$, in Eq. 1.

Although a high resolution, carefully refined x-ray structure of a given protein may not be identical to the “true” solution structure, it is likely to be reasonably close in many instances, as evidenced, for example, by the excellent agreement (≤1 Hz rms deviation) between the experimentally determined values of the $^3J_{HN\alpha}$ coupling constants in solution and their corresponding calculated values from crystal structures (10, 27, 28). Moreover, it is generally the case that three-bond coupling constants, ¹³C secondary shifts, and ¹H shifts calculated from high resolution crystal structures agree better with the experimentally measured values than those calculated from the corresponding NMR structures (refined in the absence of coupling constant and chemical shift restraints) (10–13, 25). It is therefore instructive to examine the dependence of the backbone rms difference between NMR and x-ray structures on the precision of the NMR structures (25). This dependence is shown in Fig. 1 for 14 proteins, for which both NMR and x-ray structures are available and which are representative of some of the different programs used in NMR protein structure determination (25). A linear relationship is evident. In addition, in cases in which both low and high precision NMR structures are available for the same protein, the high precision structure is significantly closer to the x-ray structure than the low precision one. The data can be fit to a straight line with a correlation coefficient of 0.9 and a limiting rms difference between NMR and x-ray structures of ~0.45 Å. Moreover, all of the monomeric NMR structures with a precision of better than 0.5 Å are 0.85 Å or less away from the corresponding crystal structures. Given the fact that the coordinate errors in 1.5- to 2-Å resolution x-ray structures are ~0.2–0.3 Å (7, 29), these data provide empirical evidence that an accuracy of 0.4–0.8 Å in the backbone coordinates is attainable under appropriate circumstances by using current NMR methodology (25).

FIG. 1. Correlation between backbone precision of NMR structures and their agreement with x-ray structures. Where the backbone rms difference between the average NMR coordinates (NMR) and the corresponding x-ray structures is available, the values are represented as circles. When only the average backbone rms difference between an ensemble of NMR structures (<NMR>) and the corresponding x-ray structure is quoted in the literature, squares are used. The straight line represents a linear fit to the data with a slope of 0.70, an intercept of 0.45 Å, and a correlation coefficient of 0.9. The structures are as follows: p53(mon), p53(dim), and p53(tet) are the monomer, dimer, and tetramer, respectively, of the p53 oligomerization domain (51); IL-8, interleukin-8 monomer (52); Hir(new), highly refined structure of hirudin (53); IL-1β, interleukin-1β (6, 7); BPTI, bovine pancreatic trypsin inhibitor (54); eglin c (55); PC, French bean plastocyanin (56); tendamistat (57); Hir(old), hirudin (58); Cyp-CsA, cyclophilin-cyclosporin A complex (59); Mb, carbonmonoxy myoglobin (helices plus heme; ref. 60); CPI, potato carboxypeptidase inhibitor (61); PCP-B, procarboxypeptidase B (62); and BSPI, barley serine proteinase inhibitor 2 (63). The values given exclude conformationally disordered regions as described in the papers cited. Note that the NMR structures of IL-8 and Hir(old) were obtained before the corresponding x-ray structures and that the NMR structure of tendamistat was obtained independently of and at the same time as the x-ray structure. Reproduced from ref. 25.
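As a worked illustration of the linear fit quoted in the Fig. 1 legend, the following Python sketch evaluates the fitted line for a given backbone precision. The slope and intercept are the values quoted in the legend; the function and constant names are hypothetical, and the line is an empirical trend over 14 proteins, not a predictor for any individual structure.

```python
FIT_SLOPE = 0.70      # dimensionless, as quoted in the Fig. 1 legend
FIT_INTERCEPT = 0.45  # Å, as quoted in the Fig. 1 legend


def fitted_rms_to_xray(precision):
    """Backbone rms difference to the x-ray structure (Å) on the fitted line.

    `precision` is the backbone precision of the NMR ensemble in Å.
    """
    if precision < 0:
        raise ValueError("precision must be non-negative")
    return FIT_SLOPE * precision + FIT_INTERCEPT


if __name__ == "__main__":
    for precision in (0.0, 0.5, 1.0):
        print(f"precision {precision:.1f} Å -> {fitted_rms_to_xray(precision):.2f} Å")
```

A precision of 0.5 Å lies at about 0.80 Å on the fitted line, in keeping with the observation that monomeric structures of that precision are 0.85 Å or less from the corresponding crystal structures.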

The accuracy of NMR structures will be affected by errors in the interproton distance restraints. These errors can arise from two sources: (i) misassignments and (ii) errors in distance estimates. Errors due to misassignments may be quite common in low resolution NMR structures. Fortunately, in many cases, these errors are of relatively minor consequence and do not result in the generation of an incorrect fold. Systematic errors in distance estimates may be introduced in attempts to obtain precise distance restraints. For example, iterative relaxation matrix analysis of the NOE intensities (30) and direct refinement against the NOE intensities (31, 32), while accounting for spin diffusion, can result in systematic errors from several sources, such as the presence of internal motions (not only on the picosecond time scale but also on the nanosecond to millisecond time scales); insufficient time for complete relaxation back to equilibrium to occur between successive scans; and differential efficiency of magnetization transfer between protons and their attached heteronucleus in multidimensional heteronuclear NOE experiments (26). For these reasons, it is probably prudent at the present time, at least in cases dealing with proteins, to convert the NOE intensities into loose approximate interproton distance restraints (e.g., 1.8–2.7 Å, 1.8–3.3 Å, 1.8–5.0 Å, and, if appropriate, 1.8–6.0 Å for strong, medium, weak, and very weak NOEs, respectively) with the lower bounds given by the sum of the van der Waals radii of two protons. These distance ranges are sufficiently generous to take into account untoward effects in the conversion of NOE intensities into distances (2, 3, 19, 26). Using this approach, systematic errors in the interproton distance restraints generally will be introduced only at the boundary of two distance ranges.
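The loose distance ranges can be represented as a simple lookup, as in the following Python sketch. The bounds are the example values given above; the dictionary, the function name, and the flat-bottomed definition of a violation are assumptions made for illustration rather than a description of any particular refinement program.

```python
# Example bounds (Å) for each NOE class, as quoted in the text.
NOE_BOUNDS = {
    "strong": (1.8, 2.7),
    "medium": (1.8, 3.3),
    "weak": (1.8, 5.0),
    "very weak": (1.8, 6.0),
}


def restraint_violation(distance, noe_class):
    """Return how far `distance` (Å) lies outside the allowed range.

    The result is 0.0 anywhere inside the range, so no penalty is implied
    for distances that merely differ from a single target value.
    """
    lower, upper = NOE_BOUNDS[noe_class]
    if distance < lower:
        return lower - distance
    if distance > upper:
        return distance - upper
    return 0.0


if __name__ == "__main__":
    print(restraint_violation(3.0, "strong"))  # outside the strong range
    print(restraint_violation(3.0, "medium"))  # inside the medium range
```

The example shows why misclassification matters mainly at class boundaries: a 3.0 Å contact violates a "strong" restraint but satisfies a "medium" one.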

In the case of experimental structures calculated with an incomplete set of NOE restraints (i.e., comprising <90% of the structurally useful NOEs), there is no doubt that errors, arising both from misassignments and from the incorrect classification of NOEs into the various loose approximate distance ranges, will occur, resulting in less accurate structures. This loss in accuracy is due to the fact that, until a significant degree of redundancy is present in the NOE restraints, such errors often can be accommodated readily without unduly compromising the agreement with either the experimental NMR restraints or the restraints for covalent geometry and nonbonded contacts. However, once 90% of the structurally useful NOEs have been assigned and incorporated into the restraints set, corresponding typically to an average of 15 restraints per residue with >60% of the NOEs involving unique proton pairs, two sensitive and complementary techniques can be employed easily to identify and correct such errors.

The first method involves an analysis of the distribution of restraints violations in the ensemble of calculated structures. If a given restraint is systematically violated in more than, for example, 20% of the calculated structures, even by as little as 0.1 Å, it is highly likely that it should either be reclassified into the next looser category (i.e., strong to medium, medium to weak) or that errors in NOE assignments are present (26).
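A minimal Python sketch of this screening step follows. It assumes that per-structure violations (in Å) have already been computed for every restraint; the data layout, the restraint identifiers, and the function name are hypothetical, while the 20% and 0.1 Å thresholds are the example values given above.

```python
def systematically_violated(violations, min_violation=0.1, max_fraction=0.2):
    """Return the restraints violated in more than `max_fraction` of structures.

    `violations` maps a restraint identifier to a list with one violation
    (Å) per calculated structure; 0.0 means the restraint is satisfied.
    """
    flagged = []
    for restraint_id, per_structure in violations.items():
        n_violated = sum(1 for v in per_structure if v >= min_violation)
        if n_violated / len(per_structure) > max_fraction:
            flagged.append(restraint_id)
    return flagged


if __name__ == "__main__":
    # Hypothetical ensemble of five structures and two restraints.
    example = {
        "noe_1": [0.0, 0.15, 0.20, 0.0, 0.0],  # violated in 2 of 5 structures
        "noe_2": [0.0, 0.0, 0.0, 0.0, 0.30],   # violated in 1 of 5 structures
    }
    print(systematically_violated(example))
```

Flagged restraints are candidates for reclassification into the next looser category or for a check of the underlying NOE assignment; the sketch only identifies them and does not decide between the two.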

The second approach uses complete cross-validation to assess the completeness of the experimental restraints and the degree to which each distance restraint can be predicted by the remaining ones (33). Typically, this approach involves calculating a series of simulated annealing structures in which the restraints are partitioned randomly into a test set comprising ~10% of the data and a reference set. Only the reference set is incorporated into the target function, and each calculation is carried out with a different test and reference set pair, thereby permitting one to fully explore the constraining power of the NOE restraints. The average agreement with all of the test sets as well as the atomic rms shift after complete cross-validation then provides an indicator of accuracy.
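The partitioning step can be sketched in Python as follows. This is only an illustration of how test and reference sets might be generated so that every restraint falls in exactly one test set; the function name, the seed argument, and the restraint labels are hypothetical, and the simulated annealing calculation that would use each reference set is not shown.

```python
import random


def cross_validation_partitions(restraints, test_fraction=0.1, seed=0):
    """Yield (reference_set, test_set) pairs covering all restraints.

    The restraints are shuffled once and cut into consecutive test sets of
    about `test_fraction` of the data; the remainder of the data forms the
    matching reference set.
    """
    rng = random.Random(seed)
    shuffled = list(restraints)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    for start in range(0, len(shuffled), n_test):
        test_set = shuffled[start:start + n_test]
        reference_set = shuffled[:start] + shuffled[start + n_test:]
        yield reference_set, test_set


if __name__ == "__main__":
    restraints = [f"noe_{i}" for i in range(50)]
    for reference_set, test_set in cross_validation_partitions(restraints):
        # A structure calculation would use reference_set only and then be
        # scored against test_set.
        print(len(reference_set), len(test_set))
```

With 50 restraints and a 10% test fraction, this produces ten pairs of 45 reference and 5 test restraints.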

Finally, a further check on the correctness of the structures is provided by verifying that all short interproton distances (e.g., <3.5 Å) predicted by the structures are in fact observed in the NOE spectra (25). Indeed, this procedure forms the basis of the iterative refinement process; the structures at each successive stage of refinement are used to predict all short interproton distance contacts, which then are searched for in the NOE spectra. In general, the vast majority of interproton distances <3.5 Å, and certainly all of those <2.5 Å, should be observed. Exceptions can occur occasionally if the linewidths of the corresponding resonances are broadened severely because of some sort of intermediate chemical exchange process on the chemical shift scale (caused, for example, by multiple conformations or microheterogeneity), resulting in severe attenuation of the NOE cross peaks.

Additional Experimental NMR Restraints that Define Short Range Order. Although the interproton distance restraints derived from NOEs provide the mainstay of NMR structure determination, direct refinement against other experimental NMR restraints is both feasible and desirable. In this section, we consider experimental restraints that provide short range structural information, specifically three-bond coupling constants (10), secondary ¹³C chemical shifts (11), and ¹H chemical shifts (12, 13).

Three-Bond Coupling Constants. Three-bond coupling constants are related to torsion angles by the Karplus (34) equation: $^3J(\lambda) = A\cos^2(\lambda) + B\cos(\lambda) + C$, where λ is the torsion angle corresponding to the three-bond coupling, and A, B, and C are constants obtained by nonlinear optimization to yield the best fit between experimental $^3J$ values and values calculated from a series of very high resolution x-ray structures. The coupling constants can be converted directly into loose torsion angle restraints (19). Alternatively, direct refinement against coupling constants can be achieved by adding the potential $E_J = k_J (J_{obs} - J_{calc})^2$ (where $k_J$ is a force constant and $J_{obs}$ and $J_{calc}$ are the observed and calculated values of the coupling constants) (10).
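The Karplus relationship and the associated harmonic potential translate directly into code. In the Python sketch below, the coefficient values are illustrative placeholders of the magnitude commonly used for backbone HN-Hα couplings and are not taken from the text; the function names and the unit force constant are likewise assumptions.

```python
import math

# Illustrative Karplus coefficients in Hz (assumed, not given in the text).
A, B, C = 6.51, -1.76, 1.60


def karplus(torsion_deg, a=A, b=B, c=C):
    """Three-bond coupling (Hz) for a torsion angle given in degrees."""
    angle = math.radians(torsion_deg)
    return a * math.cos(angle) ** 2 + b * math.cos(angle) + c


def coupling_energy(j_obs, j_calc, k_j=1.0):
    """Harmonic penalty E_J = k_J * (J_obs - J_calc)**2."""
    return k_j * (j_obs - j_calc) ** 2


if __name__ == "__main__":
    j_calc = karplus(180.0)
    print(f"J(180 deg) = {j_calc:.2f} Hz")
    print(f"E_J for J_obs = 9.0 Hz: {coupling_energy(9.0, j_calc):.3f}")
```

In a refinement, `j_calc` would be recomputed from the current coordinates at every step, so the penalty pulls each torsion angle toward values whose predicted coupling matches the measured one.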

From the standpoint of refinement, the most useful coupling constant, insofar as it can be measured accurately and easily by quantitative J correlation spectroscopy and its Karplus relationship has been parametrized reliably, is the $^3J_{HN\alpha}$ coupling, which is related directly to the φ backbone torsion angle (35). The Karplus curve for $^3J_{HN\alpha}$, however, is symmetric about φ = –120°, such that one cannot distinguish φ = –120° + α from φ = –120° – α from the $^3J_{HN\alpha}$ coupling alone (36). Where appropriate, this degeneracy can be resolved by quantitative J-correlation measurement of the $^3J_{COCO}$ coupling, which has its steepest φ dependence close to φ = –120° (36).
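The degeneracy about φ = –120° can be checked numerically. The Python sketch below assumes that the torsion angle entering the Karplus equation is φ – 60°, and it uses illustrative coefficients; neither the phase offset nor the coefficients are stated in the text, so both should be read as assumptions.

```python
import math


def j_hn_alpha(phi_deg, a=6.51, b=-1.76, c=1.60):
    """Backbone HN-Halpha coupling (Hz) as a function of phi in degrees.

    Assumes the Karplus angle is phi - 60 degrees and uses illustrative
    coefficients (both are assumptions, not values from the text).
    """
    angle = math.radians(phi_deg - 60.0)
    return a * math.cos(angle) ** 2 + b * math.cos(angle) + c


if __name__ == "__main__":
    for alpha in (10.0, 30.0, 60.0):
        above = j_hn_alpha(-120.0 + alpha)
        below = j_hn_alpha(-120.0 - alpha)
        print(f"alpha = {alpha:4.1f}: {above:.3f} Hz vs {below:.3f} Hz")
```

Each pair of values printed is identical, which is the reason a second coupling with a steep φ dependence near –120° is needed to tell the two solutions apart.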

It is also worth noting that the three-bond amide deuterium isotope shift experienced by ¹³Cα resonances, $^3\Delta C^{\alpha}(ND)$, is related to the backbone ψ angle by a Karplus-type relationship of the form $^3\Delta C^{\alpha}(ND) = 30.1 + 22.2\cos(\psi - 90°)$ ppb (37) and hence can be incorporated into structure refinement in exactly the same manner as three-bond coupling constants.

Secondary ¹³C Chemical Shifts. There is a clear empirical correlation between the protein backbone conformation, defined in terms of the φ and ψ torsion angles, and the ¹³Cα and ¹³Cβ secondary chemical shifts (that is, the difference between observed shifts and random coil shifts) (38, 39). In addition, ab initio quantum mechanical calculations have indicated that the φ,ψ angles dominate shielding for Cα and Cβ atoms (40). Because the secondary ¹³Cα and ¹³Cβ shifts provide information on φ as well as ψ and because they are readily measured, it is clearly useful to incorporate them directly into the refinement algorithm.

The strategy that we used makes use of an empirical surface describing the expected Cα and Cβ secondary chemical shifts as a function of the backbone torsion angles φ and ψ, derived from the structurally ordered regions of a set of four proteins whose ¹³C chemical shifts were known and for which high resolution crystal structures are available (38). The expectation surface is given by

$$C^{\alpha}_{expected}(\phi,\psi) = \frac{\sum_k \{C^{\alpha}_k \exp[-((\phi-\phi_k)^2 + (\psi-\psi_k)^2)/S]\}}{\sum_k \{\exp[-((\phi-\phi_k)^2 + (\psi-\psi_k)^2)/S]\}},$$

and similarly for $C^{\beta}_{expected}$ (where the sums run over the database residues k, and S is a Gaussian scale factor given by $-r^2/\ln 0.5$, where r is the radius of the Gaussian; in this case r = 17.7° and S = 450). The average rms difference between the observed chemical shift values and the empirical surface is ~1.1 ppm. Direct refinement against the ¹³Cα and ¹³Cβ shifts is carried out by adding the potential

$$E_{Cshift} = k_{Cshift} \sum_i [(\Delta C^{\alpha}_i)^2 + (\Delta C^{\beta}_i)^2],$$

where $\Delta C^{n}_i = C^{n}_{expected}(\phi_i,\psi_i) - C^{n}_{observed,i}$ (n = α or β) and $k_{Cshift}$ is a force constant (with a value chosen to yield an rms difference between observed and calculated shifts of ~1 ppm) (11).

To use simulated annealing to improve the agreement of the observed and expected carbon chemical shifts, the partial derivatives of the energy along φ and ψ (i.e., the forces along φ and ψ) also must be calculated. These are given by

$$\partial E_{Cshift}/\partial\phi_i = 2k_{Cshift}[\Delta C^{\alpha}_i \, \partial C^{\alpha}_{expected}/\partial\phi + \Delta C^{\beta}_i \, \partial C^{\beta}_{expected}/\partial\phi],$$

and likewise for ψ. Because there is no explicit function fitted to the expectation values, the partial derivatives of $C^{\alpha}_{expected}$ and $C^{\beta}_{expected}$ with respect to φ and ψ are approximated by the local slopes of the expectation value grid about the grid point (φ, ψ) at which the energy is evaluated.
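The idea of an expectation surface with numerically estimated slopes can be sketched in Python. The sketch assumes a normalized Gaussian-weighted average over database points with scale factor S = 450, wraps angular differences into [–180°, 180°), and estimates slopes by central differences with a 1° step; the weighting form, the wrapping, the step size, and all names are assumptions made for illustration, and the toy database stands in for the real set of protein structures.

```python
import math


def wrap_degrees(delta):
    """Map an angular difference onto the interval [-180, 180)."""
    return (delta + 180.0) % 360.0 - 180.0


def expected_shift(phi, psi, database, scale=450.0):
    """Gaussian-weighted average of database secondary shifts at (phi, psi).

    `database` is a list of (phi_k, psi_k, shift_k) tuples, angles in degrees.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for phi_k, psi_k, shift_k in database:
        d2 = wrap_degrees(phi - phi_k) ** 2 + wrap_degrees(psi - psi_k) ** 2
        weight = math.exp(-d2 / scale)
        weighted_sum += weight * shift_k
        weight_total += weight
    return weighted_sum / weight_total


def local_slopes(phi, psi, database, step=1.0, scale=450.0):
    """Central-difference slopes of the surface along phi and psi (ppm/deg)."""
    d_phi = (expected_shift(phi + step, psi, database, scale)
             - expected_shift(phi - step, psi, database, scale)) / (2.0 * step)
    d_psi = (expected_shift(phi, psi + step, database, scale)
             - expected_shift(phi, psi - step, database, scale)) / (2.0 * step)
    return d_phi, d_psi


if __name__ == "__main__":
    # Toy database of two points (hypothetical values).
    toy_database = [(-10.0, 0.0, 1.0), (10.0, 0.0, 3.0)]
    print(expected_shift(0.0, 0.0, toy_database))
    print(local_slopes(0.0, 0.0, toy_database))
```

The slopes returned by `local_slopes` play the role of the partial derivatives of the expected shifts in the force calculation, so that no analytical form of the surface is required.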

Although the information contained in the secondary ¹³Cα and ¹³Cβ chemical shifts is to some extent redundant with that offered by $^3J_{HN\alpha}$ coupling constants, the two experimental measures are complementary (11). Thus, the values of the $^3J_{HN\alpha}$ coupling constants depend only on φ, whereas the ¹³Cα and ¹³Cβ chemical shifts depend on both φ and ψ. Moreover, $^3J_{HN\alpha}$ coupling constants may not be measurable for all residues because of small values of the couplings, line broadening, or chemical shift overlap of the backbone nitrogen atoms. In contrast, ¹³Cα and ¹³Cβ shifts are obtained readily for almost all residues.

¹H Chemical Shifts. Proton chemical shifts are influenced by short range ring current effects from aromatic groups, magnetic anisotropy of C=O and C-N bonds, and electric field effects arising from charged groups. Recent developments in empirical models for ¹H chemical shift calculations have shown that it is now possible to predict ¹H chemical shifts for nonexchangeable protons to within 0.23–0.25 ppm for proteins for which high resolution crystal structures are available (41, 42).

The calculated 1H chemical shift σcalc can be decomposed into four terms: the “random coil” (σrandom), “ring current” (σring), “magnetic anisotropy” (σani), and “electric field” (σE) shifts (41). σring depends on the distance and orientation of the aromatic ring to the proton of interest. σani represents the sum of the anisotropies arising from the C=O and C-N bonds of the backbone and the side chain functional groups of Asp, Glu, Asn, and Gln and depends on distance ($r^{-3}$) and orientation of the proton from these functional groups. Finally, σE depends on the distance ($r^{-2}$) between the charged heavy atom and the proton, the angle between the charged heavy atom-proton and C-proton vectors, and the charge on the heavy atom.

Direct refinement against 1H chemical shifts is carried out by adding a 1H chemical shift term, $E_{prot} = \sum_i k_{prot}(\sigma_{calc,i} - \sigma_{obs,i})^2$, where kprot is the force constant and σobs,i and σcalc,i are the observed and calculated 1H chemical shifts, respectively, of proton i (12). For nonstereospecifically assigned methylene and methyl groups, a modification of Eprot is required to make maximal use of the shift information (13). Specifically, this involves making use of a set of potentials that involve the sums and differences of the chemical shifts to automatically handle chemical shifts involving prochiral centers without the need for making a priori stereospecific assignments (13).
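As a small illustration of the harmonic 1H shift term, the sketch below sums kprot(σcalc,i − σobs,i)² over protons and returns the derivative of the energy with respect to each calculated shift. It is a sketch only: the function names are hypothetical, the shift values used in any example are made up, and the chain-rule step from shifts to atomic coordinates (which requires the empirical shift model of ref. 41) is deliberately left out.

```python
def proton_shift_energy(calc_shifts, obs_shifts, k_prot=1.0):
    """E_prot = sum_i k_prot * (sigma_calc_i - sigma_obs_i)**2 (shifts in ppm)."""
    return sum(k_prot * (c - o) ** 2 for c, o in zip(calc_shifts, obs_shifts))


def proton_shift_gradient(calc_shifts, obs_shifts, k_prot=1.0):
    """dE_prot / d(sigma_calc_i) for each proton i.

    Converting these into forces on atoms would additionally need the
    derivative of each calculated shift with respect to the coordinates,
    which is not modelled here.
    """
    return [2.0 * k_prot * (c - o) for c, o in zip(calc_shifts, obs_shifts)]
```

For example, calculated shifts of 8.1 and 4.5 ppm against observed shifts of 8.0 and 4.7 ppm give an energy of 0.05 (in units of kprot·ppm²), dominated by the second proton.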

Results of Refinement Against Three-Bond Coupling Constants and 13C and 1H Shifts. Provided there are no severe errors in the interproton distance restraints, refinement against 3JHNα coupling constants, 13C shifts, and 1H shifts reduces the rms difference between calculated and observed values to approximately the level of the expected errors (~0.5–1 Hz, ~1 ppm, and ~0.2–0.3 ppm, respectively) without significantly impairing the agreement with the other restraints in the target function (i.e., experimental interproton distance and torsion angle restraints, covalent geometry, and non-bonded contacts) (10–13). In addition, provided the quality of the initial structures is high, refinement results in only small overall atomic rms shifts with no increase in precision at the expense of accuracy.

We have found 13C shifts particularly useful in regions that are ordered but possess no regular secondary structure. Examples that come to mind are the N-terminal tail of the transcription factor GAGA (43) and the transcriptional coactivator HMG-I/Y (44) bound to the minor groove of DNA. In such cases, the secondary 13C shifts permit one to readily exclude certain backbone conformations.

Whereas coupling constants and 13C shifts are related directly to specific torsion angles, 1H shifts are influenced by close spatial proximity of various functional groups and are particularly useful in the presence of aromatic groups. Indeed, 1H shift refinement was critical in establishing the correct dimer interface in the structure of the C-terminal DNA binding domain of HIV-1 integrase (45). Another example is provided by Fig. 2, which illustrates the effect of 1H shift refinement, arising from the presence of a tryptophan residue, on the active site of reduced human thioredoxin (12).

Additional NMR Restraints that Define Long Range Order. Until recently, NMR structure determination has relied exclusively on restraints whose information is entirely local and restricted to atoms close in space, specifically NOE-derived short (<5 Å) interproton distance restraints, which may be supplemented by coupling constants, 13C secondary shifts, and 1H shifts as described above. The success of these methods is mainly due to the fact that short interproton distances between units far apart in a linear array are conformationally highly restrictive. However, there are numerous cases in which restraints that define long range order can supply invaluable structural information (16, 17). In particular, they permit the relative positioning of structural elements that do not have many short interproton distance contacts between them. Examples of such systems include modular and multidomain proteins and linear nucleic acids. Two approaches recently have been introduced that directly provide restraints that characterize long range order a priori. The first relies on the dependence of heteronuclear (15N or 13C) longitudinal (T1) and transverse (T2) relaxation times, specifically T1/T2 ratios, on rotational diffusion anisotropy (16), and the second relies on residual dipolar couplings in oriented macromolecules (17, 18). The two methods provide restraints that are related in a simple geometric manner to the orientation of one-bond internuclear vectors (e.g., N-H and C-H) relative to an external tensor. In the case of the T1/T2 ratios, the tensor is the diffusion tensor (16). In the case of residual dipolar couplings, the tensor may be the magnetic susceptibility tensor for molecules aligned in a magnetic field (17), the molecular alignment tensor for molecules aligned by anisotropic media such as liquid crystals (46), the electric field tensor for molecules aligned by an electric field, or the optical absorption tensor for molecules aligned by polarized light.

FIG. 2. View of the active site and neighboring regions of reduced human thioredoxin showing a superposition of 40 simulated annealing structures before (blue) and after (red) 1H chemical shift refinement. Reproduced from ref. 12.

Refinement Against T1/T2 Ratios. Heteronuclear relaxation has been used for a long time to provide information on internal dynamics. The 15N transverse relaxation time T2 is a function of frequency-dependent and -independent spectral density terms, whereas the 15N longitudinal relaxation time T1 is only a function of the frequency-dependent terms. For axially symmetric rotational diffusion (i.e., Dzz≠Dxx=Dyy, where Dzz, Dxx, and Dyy are the diagonal elements of the diffusion tensor) characterized by diffusion tensor constants parallel (D∥=Dzz) and perpendicular (D⊥=[Dxx+Dyy]/2) to the unique axis of the diffusion tensor, the spectral density J(ω), in the limit of very fast, axially symmetric internal motions, is given by $J(\omega) = S^2 \sum_{k=1}^{3} A_k[\tau_k/(1+\omega^2\tau_k^2)]$, where ω is the angular resonance frequency, and S is the generalized order parameter for rapid internal motion; τ1, τ2, and τ3 are time constants given by $(6D_{\perp})^{-1}$, $(D_{\parallel}+5D_{\perp})^{-1}$, and $(4D_{\parallel}+2D_{\perp})^{-1}$; and the terms A1, A2, and A3 are given by $(1.5\cos^2\theta-0.5)^2$, $3\sin^2\theta\cos^2\theta$, and $0.75\sin^4\theta$, where θ is the angle between the time-averaged N-H bond vector orientation in the molecular frame and the unique axis of the diffusion tensor (47). In the absence of large amplitude internal motions and conformational exchange line broadening, the 15N T1/T2 ratio for a protein with an axially symmetric diffusion tensor depends only on three variables: the angle θ (arising from the Ak terms) and the diffusion tensor constants D∥ and D⊥. As described below, D∥ and D⊥ are extracted readily from the ensemble of 15N T1 and T2 relaxation times.
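The θ dependence of the spectral density can be made concrete with a short Python sketch. It assumes the three-term form J(ω) = S² Σ A_k τ_k/(1+ω²τ_k²) with τ1 = 1/(6D⊥), τ2 = 1/(D∥+5D⊥), τ3 = 1/(4D∥+2D⊥) and the A_k coefficients quoted in the text; the function names and the diffusion constants used in any example are hypothetical. The sketch stops at J(ω) and does not attempt the full T1 and T2 expressions, which also require dipolar and chemical shift anisotropy constants not given here.

```python
import math


def orientation_weights(theta_deg):
    """Return (A1, A2, A3) for an N-H vector at angle theta to the unique axis."""
    theta = math.radians(theta_deg)
    c2 = math.cos(theta) ** 2
    s2 = math.sin(theta) ** 2
    return ((1.5 * c2 - 0.5) ** 2, 3.0 * s2 * c2, 0.75 * s2 ** 2)


def spectral_density(omega, theta_deg, d_par, d_perp, order_s2=1.0):
    """J(omega) for axially symmetric diffusion (rad/s and 1/s units).

    order_s2 is the squared generalized order parameter S^2.
    """
    taus = (1.0 / (6.0 * d_perp),
            1.0 / (d_par + 5.0 * d_perp),
            1.0 / (4.0 * d_par + 2.0 * d_perp))
    weights = orientation_weights(theta_deg)
    return order_s2 * sum(a * t / (1.0 + (omega * t) ** 2)
                          for a, t in zip(weights, taus))
```

Two sanity checks follow directly from the formulas: the three weights sum to 1 for any θ, and when D∥ = D⊥ all three time constants coincide, so the orientation dependence vanishes. For D∥ > D⊥, J(0) is larger for a vector parallel to the unique axis than for a perpendicular one, which is the origin of the larger T1/T2 ratios for parallel N-H vectors.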

Thus, the individual T1/T2 ratios provide a direct measure of the angle θ between the N-H bond vector and the unique axis of the diffusion tensor. This orientation is not known a priori, so we allowed it to float by making use of an external, initially arbitrarily positioned axis, defined by a single C-C bond, positioned 50 Å away from the structure (16). The geometric content of the T1/T2 ratios is incorporated into simulated annealing refinement by adding the potential term $E_{anis} = k_{anis}[(T_1/T_2)_{calc} - (T_1/T_2)_{obs}]^2$, where kanis is a force constant and (T1/T2)obs and (T1/T2)calc are the observed and calculated values of T1/T2, respectively. At each step of the simulated annealing protocol, Eanis is evaluated by calculating the angle between the N-H vectors and the unique axis of the diffusion tensor, defined by the floating C-C bond vector. The desired target difference between observed and calculated T1/T2 ratios, based on the experimental uncertainty in the measured T1/T2 values, is achieved by empirically adjusting the value of kanis.

To apply T1/T2 refinement, the values of D∥ and D⊥ must be determined directly from the ensemble of measured T1/T2 ratios without reference to a known structure. For a uniform distribution of N-H bond vectors in space, the probability of finding an N-H vector that makes an angle θ with the unique axis of the diffusion tensor is proportional to sinθ (16). Hence, θ values near 90° are statistically most probable. These are the amides that yield the lowest T1/T2 ratios. The probability of finding an N-H bond vector with θ≈0° is low, and, consequently, the T1/T2 ratio for θ=0° is not extracted as easily from the range of experimentally observed T1/T2 ratios. Experimentally, (T1/T2)min and an initial estimate of (T1/T2)max are obtained by taking the average of the lowest and highest T1/T2 ratios, respectively, such that the SDs in their estimates are equal to the measurement error. Initial estimates for D∥ and D⊥ then are obtained by simultaneously best-fitting the complete equations describing (T1/T2)min, (T1/T2)max, and the ratio of these two terms. Because the initial estimate of (T1/T2)max is likely to underestimate the true value of (T1/T2)max, for the reasons discussed above, the estimated value of (T1/T2)max is increased in a stepwise manner (in increments of 5% up to a 35% increase), yielding new values of D∥ and D⊥. For each set of values, an ensemble of simulated annealing structures is calculated, and the dependence of the rms difference between observed and calculated T1/T2 values, ∆(T1/T2), on the estimated value of (T1/T2)max is examined. The minimum of this function yields the best estimates for D∥ and D⊥. The minimum is relatively shallow, and the structure is not significantly affected by using D∥ and D⊥ values that change (T1/T2)max by up to ±15% but keep (T1/T2)min constant.
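The statistical argument (probability proportional to sinθ) and the stepwise scan of (T1/T2)max can be illustrated numerically. The following Python sketch is only an aid to intuition: the 10° windows are an arbitrary choice, the function names are hypothetical, and the scan merely lists candidate (T1/T2)max values without performing any structure calculation.

```python
import math


def fraction_within(theta_lo_deg, theta_hi_deg):
    """Fraction of isotropically distributed vectors with theta in [lo, hi].

    The normalized density on [0, 180] degrees is 0.5*sin(theta), so the
    fraction is 0.5*(cos(lo) - cos(hi)).
    """
    lo, hi = math.radians(theta_lo_deg), math.radians(theta_hi_deg)
    return 0.5 * (math.cos(lo) - math.cos(hi))


def stepwise_max_estimates(initial_max, step=0.05, max_increase=0.35):
    """Candidate (T1/T2)max values: the initial estimate raised in 5% steps."""
    n_steps = int(round(max_increase / step))
    return [initial_max * (1.0 + step * k) for k in range(n_steps + 1)]


# Vectors within 10 degrees of the unique axis (either direction) versus
# vectors within 10 degrees of the perpendicular plane.
near_parallel = fraction_within(0.0, 10.0) + fraction_within(170.0, 180.0)
near_perpendicular = fraction_within(80.0, 100.0)
```

With these windows only about 1.5% of vectors lie close to the axis whereas about 17% lie close to the perpendicular plane, which is why the minimum T1/T2 ratio is well sampled and the maximum tends to be underestimated.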

The general, fully asymmetric case, in which Dxx≠Dyy, is treated in an analogous manner (16). The 15N T1/T2 ratio then depends not only on the angle θ between the z axis of the diffusion tensor and the N-H vector orientation but also on the angle φ that describes the position of the projection of the N-H vector on the x-y plane, relative to the x axis. The rhombicity factor η is defined as 3/2(Dyy–Dxx)/[Dzz–0.5(Dyy+Dxx)]. In practice, for most proteins with large diffusion anisotropy [2Dzz/(Dxx+Dyy) ≥ ~1.5], η is found to be smaller than 0.4. Even at the high end of this range (η=0.4), the dependence of the T1/T2 ratio on φ is relatively weak (introducing changes in the predicted T1/T2 ratio that are of a magnitude comparable to the uncertainty in the measurements). Although the effect of rhombicity of the diffusion tensor on the T1/T2 ratio is relatively small, including its effect in the structure refinement procedure does not pose any fundamental problem. In this case, the floating diatomic molecule, used above to describe the orientation of the diffusion tensor in the structure calculations for the axially symmetric case, is replaced by an artificial tetraatomic molecule comprising atoms X, Y, Z, and O, with three mutually perpendicular bonds, X-O, Y-O, and Z-O, corresponding to the x, y, and z axes of the diffusion tensor, respectively. Calculation of Eanis is completely analogous to the axially symmetric case but uses the full, five-term expression for the spectral density. A set of structure calculations, carried out for a small number of η values (typically 0, 0.2, and 0.4), then indicates whether inclusion of rhombicity leads to better agreement with the experimental T1/T2 data. As pointed out above, however, the T1/T2 ratio is only a weak function of η, and the exact value of η often is defined poorly by the NMR data.

For the heteronuclear 15N T1/T2 method to be applicable, the molecule must tumble anisotropically (i.e., it must be nonspherical). The minimum ratio of the diffusion anisotropy (D∥/D⊥) for which heteronuclear T1/T2 refinement will be useful depends entirely on the accuracy and uncertainties in the measured T1/T2 ratios. In practice, the difference between the maximum and minimum observed T1/T2 ratio must exceed the uncertainty in the measured T1/T2 values by an order of magnitude. This typically means that D∥/D⊥ should be greater than ~1.5 (16).

Direct refinement against 15N T1/T2 ratios has been applied to the N-terminal domain of enzyme I (EIN), a 30-kDa protein of 259 residues (16). EIN is elongated in shape with a diffusion anisotropy of ~2. As a result, the observed T1/T2 ratios range from ~14 when the N-H vector is perpendicular to the diffusion axis to ~30 when the N-H vector is parallel to the diffusion axis. EIN consists of two domains, and of the 2,818 NOEs used to determine its structure, only 38 involve interdomain contacts (8). Refinement against the T1/T2 ratios resulted in a small change in the relative orientations of the two domains without perturbing the structures of the individual domains.

Refinement Against Residual Dipolar Couplings. The expression for the residual dipolar coupling δ(θ,φ) between two directly bonded nuclei can be simplified to the form $\delta(\theta,\phi) = D_a(3\cos^2\theta - 1) + \tfrac{3}{2}D_r(\sin^2\theta\cos 2\phi)$, where Da and Dr are the axial and rhombic components of the traceless diagonal tensor D given by 1/3 [Dzz–(Dxx+Dyy)/2] and 1/3 (Dxx–Dyy), respectively, with Dzz>Dyy≥Dxx; θ is the angle between the interatomic vector and the z axis of the tensor; and φ is the angle that describes the position of the projection of the interatomic vector on the x-y plane, relative to the x axis (48). Note that the terms Da and Dr subsume various constants including the gyromagnetic ratios of the two nuclei, the distance between the two nuclei, the generalized order parameter S for internal motion of the internuclear vector, the magnetic field strength, and the medium permeability. [It is worth pointing out that, because Da and Dr scale with S and not S², the assumption of a uniform S value introduces a negligible error of at most a few percent in the dipolar coupling providing S²≥0.6, particularly when one considers that S² values in structured regions of a protein typically fall in the 0.85±0.05 range (17)].

The applicability of the residual dipolar coupling method depends on the degree of alignment of the molecule in the magnetic field (17). The magnetic susceptibility of most diamagnetic proteins is dominated by aromatic residues but also contains contributions from the susceptibility anisotropies of the peptide bonds. The magnetic susceptibility anisotropy tensors of these individual contributors are generally not colinear, so the net value of the magnetic susceptibility anisotropy in diamagnetic proteins is usually small. Much larger magnetic susceptibility anisotropies are obtained if many aromatic groups are stacked on each other in such a way that their magnetic susceptibility contributions are additive, as in the case of nucleic acids. Hence, alignment induced by the magnetic field is suited ideally to nucleic acids and protein-nucleic acid complexes (17). In practice, the residual dipolar couplings must exceed the uncertainty in their measured values by an order of magnitude, which typically means that the magnetic susceptibility anisotropy should be ~$-10\times10^{-34}$ m³ per molecule, which is ~10 times greater than that for benzene. This translates into values of Da, obtained by measuring the difference in one-bond coupling constants at, for example, 360 and 750 MHz, of ~0.5 Hz for N-H vectors and ~0.9 Hz for C-H vectors. To obtain these values with sufficient accuracy requires that the one-bond couplings be measured by constant-time J-modulated correlation spectroscopy (49). More recently, it has been shown that high degrees of alignment in a magnetic field, corresponding to values of Da of ~10 Hz for N-H vectors and 18 Hz for C-H vectors, can be achieved readily by the addition of dilute liquid crystalline media, while retaining the sensitivity and resolution of spectra recorded in isotropic media (46). As a result, it becomes feasible to measure several different types of residual dipolar couplings by simply examining the splittings in 2D or 3D coupled correlation spectra. In particular, the much smaller residual couplings for other types of internuclear vectors, such as C′-N (10 times smaller than N-H) and Cα-C′ and C-C (~6 times smaller than N-H), are experimentally accessible.

The geometric content of the residual dipolar couplings is incorporated into the simulated annealing protocol by including the term $E_{dipolar} = k_{dipolar}(\delta_{calc} - \delta_{obs})^2$, where kdipolar is a force constant and δobs and δcalc are the observed and calculated values of the residual dipolar couplings, respectively (17). Just as for Eanis in the case of T1/T2 refinement, Edipolar is evaluated by calculating the θ and φ angles between the appropriate bond vectors (e.g., N-H, Cα-H, or Cα-C′) and an external arbitrary axis system, defined by an artificial tetraatomic molecule comprising atoms X, Y, Z, and O, with three mutually perpendicular bonds, X-O, Y-O, and Z-O, representing the x, y, and z axes of the tensor, respectively (17, 18).
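A minimal Python sketch of the dipolar term follows. It assumes the coupling has the form δ(θ,φ) = Da(3cos²θ − 1) + (3/2)Dr sin²θ cos2φ with Dr = R·Da, and it takes θ and φ as given inputs rather than deriving them from atomic coordinates and the floating axis system. Function names and numerical values are hypothetical.

```python
import math


def residual_dipolar_coupling(theta_deg, phi_deg, d_a, rhombicity):
    """delta(theta, phi) in the same units as d_a (e.g. Hz); Dr = rhombicity * d_a."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    d_r = rhombicity * d_a
    return (d_a * (3.0 * math.cos(theta) ** 2 - 1.0)
            + 1.5 * d_r * math.sin(theta) ** 2 * math.cos(2.0 * phi))


def dipolar_energy(angles, observed, d_a, rhombicity, k_dipolar=1.0):
    """Sum of k_dipolar * (delta_calc - delta_obs)**2 over all measured vectors.

    angles   : iterable of (theta_deg, phi_deg) pairs, one per bond vector
    observed : iterable of observed couplings in the same order
    """
    return sum(
        k_dipolar * (residual_dipolar_coupling(t, p, d_a, rhombicity) - obs) ** 2
        for (t, p), obs in zip(angles, observed)
    )
```

With Da = 10 Hz and R = 0.3, a vector along z (θ = 0°) gives 2Da = 20 Hz, and a vector along y (θ = φ = 90°) gives −Da(1 + 1.5R) = −14.5 Hz, the two extremes of the distribution.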

To apply residual dipolar coupling refinement, the values of Da and the rhombicity R (defined as Dr/Da) must be determined directly from the experimental data (18). The minimum value of the residual dipolar coupling, δmin, occurs at θ=φ=90°, such that Da is given by –δmin/(1+1.5R). Experimentally, a reliable value of δmin is obtained by taking the average of the smallest residual dipolar couplings such that the SD of the estimated δmin value is equal to the measurement error. The maximum value of the residual dipolar coupling, δmax, which occurs at θ=0°, is given by 2Da. As in the case of the T1/T2 ratios discussed above (16), a reliable estimate of δmax is more difficult to obtain from the experimental data because the probability of finding a bond vector with θ≈0° is low. Consequently, if measurements are available for only a single type of internuclear vector, the value of δmax, and hence the value of Da, generally will be underestimated by 15–20%. Nevertheless, the observed value of δmax still can be used to obtain an upper limit for the value of R given by [–2δmin(obs)/δmax(obs)–1]/1.5 (18).
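The two relations in this paragraph, Da = −δmin/(1+1.5R) and the upper limit R ≤ [−2δmin(obs)/δmax(obs) − 1]/1.5, are simple enough to check numerically. The Python sketch below uses hypothetical function names and made-up coupling values; it only demonstrates why an underestimated δmax produces an upper bound on R rather than an estimate.

```python
def rhombicity_upper_limit(delta_min_obs, delta_max_obs):
    """Upper limit for R from the observed extreme couplings."""
    return (-2.0 * delta_min_obs / delta_max_obs - 1.0) / 1.5


def axial_component(delta_min_obs, rhombicity):
    """Da implied by the observed minimum coupling for a given R."""
    return -delta_min_obs / (1.0 + 1.5 * rhombicity)


# Illustrative values only: a tensor with Da = 10 Hz and R = 0.3 has
# delta_min = -14.5 Hz and delta_max = 20 Hz.
r_exact = rhombicity_upper_limit(-14.5, 20.0)
# If the observed maximum is 15% too small (17 Hz), the limit exceeds the true R.
r_bound = rhombicity_upper_limit(-14.5, 17.0)
```

Here r_exact recovers 0.3, whereas r_bound is about 0.47, so a δmax that is too small inflates the apparent rhombicity and the formula can only bound R from above.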

Because δmin can be determined accurately experimentally (for R<0.6) but Da cannot be obtained independently of R (unless a good estimate of δmax is available), the strategy we use when residual dipolar couplings have been measured for only a single type of internuclear vector involves calculating a series of structure ensembles for different estimates of R. (Note that the rhombicity reaches a maximum value of 2/3 when Dzz=–Dxx and Dyy=0; at this point the z and x axes are interchangeable so that the probability of finding an N-H vector perpendicular to the z axis is the same as finding one parallel to the z axis). The dependence of the rms difference between target and calculated dipolar couplings on the estimated value of R (Rest) shows a minimum when Rest is approximately equal to the target value of R (Rtarget) (18). The same type of dependence is observed for the total energy of the target function, reflecting not only the agreement between target and calculated dipolar couplings but also small changes in the agreement between target and calculated values of the other terms in the target function (18).

Because the distribution of the different vector types relative to the tensor is not identical, it becomes possible, once measurements are available for two or more types of internuclear vectors, to obtain reliable values of Da and R from the observed minimum (δmin), maximum (δmax), and most probable (δp) values of the normalized residual dipolar couplings. The residual dipolar couplings for different internuclear vectors are normalized readily because $D_{a,CD} = D_{a,AB}(\gamma_C\gamma_D/\gamma_A\gamma_B)(r_{AB}^3/r_{CD}^3)$, where AB and CD are two types of internuclear vector (e.g., N-H and Cα-H); γA, γB, γC, and γD are the gyromagnetic ratios of atoms A, B, C, and D, respectively; and rAB and rCD are the internuclear A-B and C-D distances. A histogram of the normalized residual dipolar couplings displays a powder spectrum with the property that δmin+δmax+δp=0. The values of Da and R then can be obtained readily by least squares minimization of the following three equations: δmin(obs)=–Da(1+1.5R), δmax(obs)=2Da, and δp(obs)=–Da(1–1.5R). Indeed, model calculations with four different proteins of differing sizes and secondary structure content indicate that, if the N-H, Cα-H, and Cα-C′ residual dipolar couplings are measured for only 50% of the residues, Da and R can be determined in this manner to within better than 5% and ±0.1, respectively, which is quite sufficient because variations in the estimated value of Da and R of ±10% and ±0.15 have a negligible effect on the calculated structures (18). If residual dipolar couplings are measured only for the N-H and Cα-H vectors, Da and R still can be determined to within an accuracy of better than 10% and ±0.15.
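The least squares step can be written in closed form. Substituting Dr = R·Da makes the three equations linear in (Da, Dr): δmin = −Da − 1.5Dr, δmax = 2Da, and δp = −Da + 1.5Dr. The two columns of the resulting design matrix, (−1, 2, −1) and (−1.5, 0, 1.5), are orthogonal, so the normal equations decouple. The Python sketch below implements that solution; the function name and the numerical values are hypothetical, and equal weighting of the three observations is an assumption.

```python
def fit_da_and_r(delta_min, delta_max, delta_p):
    """Least squares estimate of (Da, R) from the three characteristic couplings.

    Minimizes the summed squared residuals of
        delta_min = -Da - 1.5*Dr
        delta_max =  2*Da
        delta_p   = -Da + 1.5*Dr
    with Dr = R*Da. Orthogonal design-matrix columns give each parameter as
    a projection of the data onto its own column.
    """
    d_a = (2.0 * delta_max - delta_min - delta_p) / 6.0
    d_r = (delta_p - delta_min) / 3.0
    return d_a, d_r / d_a


# Illustrative, noise-free input generated from Da = 10 Hz, R = 0.3.
d_a_fit, r_fit = fit_da_and_r(-14.5, 20.0, -5.5)
```

With consistent input the fit returns the generating values exactly; with noisy input the three observations are reconciled in the least squares sense, and the constraint δmin+δmax+δp=0 is no longer required to hold.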

An example of the structural impact of residual dipolar coupling refinement is illustrated in Fig. 3 for the case of a complex of the transcription factor GATA-1 with a 16-bp oligonucleotide (17). In this instance, the addition of only 90 dipolar coupling restraints to the ~1,500 NOE and ~300 torsion angle restraints resulted in a substantial improvement in the quality of the protein backbone, as judged by an approximately twofold reduction in the number of residues lying outside the most favored region of the Ramachandran φ, ψ plot (17). With the exception of a single region, the ensembles of structures calculated with and without dipolar couplings overlap (Fig. 3). There is, however, a substantial displacement (accompanied by a maximal ~4-Å rms shift in the backbone coordinates of residue 22) in the short loop (residues 21–24) that connects strands β3 and β4. Because this loop has low mobility, as judged from 15N relaxation data, this is a good example illustrating one of the principal shortcomings of NMR structure determination based on NOE measurements, namely an ill-defined region due to lack of long range NOE restraints. The only NOEs observed for residues 22 and 23 are either intraresidue or sequential, and there are no long range NOEs involving residues 21 through 24. Hence, the precision of the backbone coordinates for this loop is lower than that for the α-helix and β-strands. Even though there are loose torsion angle restraints for the φ and ψ angles of these residues, accumulation of errors in the experimental restraints (for example, an NOE interproton distance restraint that is slightly too short, even by as little as 0.1 Å) becomes an important factor in determining the orientation of this loop with respect to the rest of the protein.

FIG. 3. View showing best-fit superpositions of the restrained regularized mean coordinates obtained with and without dipolar coupling restraints. The protein is shown as a ribbon diagram drawn through the Cα positions. The loop between strands β3 and β4 (residues 21–24) is shown in magenta for the structure obtained with dipolar coupling restraints and in grey for the structure obtained without dipolar coupling restraints. Adapted from ref. 17.

Refinement with a Conformational Database Potential. In the context of simulated annealing refinement, it is found generally that conventional nonbonded interaction terms (either attractive-repulsive or purely repulsive) have very poor discriminatory power between high and low probability local conformations (14). This can be circumvented by the use of a conformational database potential derived from high resolution, highly refined protein and nucleic acid crystal structures that biases the sampling during simulated annealing refinement to conformations that are energetically possible by limiting the choices of dihedral angles to those that are known to be physically realizable (14, 15).

The database potential, which is partitioned into various one, two, three, and four dimensional distributions (Table 1), is created as follows (14). For each distribution, the fractional probability Pi for a residue to appear within a particular bin (with each dimension digitized in increments of 8–10°) is converted into a potential of mean force $E_{DB}(i) = -k_{DB}\ln P_i$, where kDB is a scale factor. Because the conformational database energy is not a continuous function but rather is known in discrete blocks, the partial derivatives are approximated in a manner analogous to that used for the 13C chemical shift potential term (11). To this end, the energy for every rotatable bond (or set of rotatable bonds) being refined against the conformational database potential is defined by looking up the value in the grid bin that encompasses the current dihedral angle(s), and the partial derivatives of the energy with respect to the rotatable bond angles then are approximated by the local slope of the energy function, defined by $\partial E_{DB}(\phi_i)/\partial\phi \approx -k_{DB}[E_{DB}(\phi_{i-1}) - E_{DB}(\phi_{i+1})]/2$, where EDB(φi) is the database energy of bin i along the rotatable bond φi and EDB(φi–1) and EDB(φi+1) are the database energies of the bins that precede and follow the bin that contains the actual energy value.
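The conversion from bin populations to a potential of mean force, and the finite-difference slope, can be sketched for a one-dimensional distribution as follows. This Python example is illustrative only: the bin counts are invented, the 10° bin width is one value from the 8–10° range mentioned in the text, the probability floor for empty bins is an assumption not taken from the source, and the slope is expressed per degree with the scale factor already folded into the tabulated energies.

```python
import math


def database_energy(counts, k_db=1.0, floor=1e-6):
    """E_DB(i) = -k_DB * ln(P_i) for each bin of a 1D periodic histogram.

    `floor` keeps empty bins finite; it is a hypothetical safeguard.
    """
    total = float(sum(counts))
    return [-k_db * math.log(max(c / total, floor)) for c in counts]


def energy_and_slope(angle_deg, energies, bin_width=10.0):
    """Look up the bin containing the angle; return (energy, dE/dangle).

    The slope is the central difference between the bins that follow and
    precede the current bin, with periodic wrap-around.
    """
    n = len(energies)
    i = int(math.floor((angle_deg % 360.0) / bin_width)) % n
    e_prev, e_next = energies[(i - 1) % n], energies[(i + 1) % n]
    return energies[i], (e_next - e_prev) / (2.0 * bin_width)


# Invented populations for a single dihedral: one well-populated rotamer
# centred on the 60-70 degree bin, everything else sparsely populated.
counts = [1] * 36
counts[5], counts[6], counts[7] = 20, 50, 20
energies = database_energy(counts)
```

On either side of the populated bin the slope points back toward it (negative at 55°, positive at 75°), so the resulting force drives the dihedral toward the frequently observed conformation.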

Table 1. Summary of database potentials

A. Proteins
  One-dimensional
    χ4: Arg, Lys
  Two-dimensional
    φ/ψ: Gly, Pro, X-Pro, H-bonding*, Val/Ile, rest
    χ1/χ2: Leu, Ile, Gln/Glu, Arg/Lys/Met, Asn, Asp, Cys(ox), His, Trp, Phe/Tyr
    χ2/χ3: Met, Gln, Glu, Lys, Arg
  Three-dimensional
    Val, Ile, Phe/Tyr/Trp, Leu, X-Pro, Gln/Glu/Arg/Lys/Met, Cys(red)/His/Asp/Asn, Ser, Thr, Cys(ox), Pro
    χ1/χ2/χ3: Gln, Glu, Arg, Lys, Met
  Four-dimensional
B. Nucleic acids
  Two-dimensional
  Three-dimensional

*Residues with a hydrogen bond donor or acceptor in the γ or δ position (Ser, reduced cysteine, Asp, Asn, Ser, and Thr).
†The scale factor used for the interresidue potentials must be set to a value ~10-fold lower than that for the intraresidue potentials; otherwise, undesirable bias in the structures may be introduced. Typically, the final value of the scale factor for the intraresidue conformational database potentials is set to 1.0.

It should be noted that there is one significant difference between the protein and nucleic acid conformational database potentials (15). In the case of the protein conformational database potential, the energy values for the various minima in the multidimensional potential energy surfaces provide a true reflection of the probability of occurrence of particular conformations because protein structures in solution and the crystal state are essentially the same. In the case of nucleic acids, however, and in particular DNA, the frequency of occurrence of different forms in the crystal state does not necessarily reflect their probability of occurrence in solution. For example, in solution under physiological conditions, short DNA oligonucleotides are invariably B-form. In the crystal, however, A, B, or Z-forms can occur depending on the crystallization conditions. As a result, the A and Z forms of DNA are overrepresented in the database, and the energy values for the different minima in the multidimensional potential energy surfaces comprising the nucleic acid conformational database potential do not necessarily reflect their probability of occurrence in solution. This does not, however, affect the positions of the various minima so that, as far as structure refinement is concerned, the nucleic acid conformational database potential still serves its primary function, namely biasing sampling to conformations that are realizable physically.

The effect of incorporating the conformational database potential into refinement is to improve the stereochemistry of the structures in terms of the quality of the Ramachandran plot, the rotamer distributions, and the number of bad contacts (14, 15). If there are no significant errors in the experimental restraints, conformational database refinement will not impact the agreement between the calculated and target experimental, covalent, and van der Waals restraints. The presence of errors in the experimental restraints, however, will be reflected by a large deterioration in the agreement between calculated and target restraints upon conformational database refinement (14). Hence, incorporation of the conformational database provides a good indicator of the quality of both the model and the experimental restraints (14).

Some may regard the introduction of a conformational database energy term as a major step toward empiricism in NMR structure refinement, adding a term with apparently no direct physical counterpart, whose effect will be to make the dihedral angle distributions in NMR refined structures look more like those in crystal structures. However, the combined quality and quantity of high (≤2 Å) resolution protein structures in the crystallographic databases (50) argues strongly against such a viewpoint and makes it very difficult to ignore the available experimental observations relating to dihedral angles in proteins. First, it is invariably the case that high resolution x-ray structures show significantly better agreement with solution observables, such as coupling constants, 13C chemical shifts, and proton chemical shifts, than the corresponding NMR structures, including the very best ones (obtained in the absence of direct coupling constant and chemical shift restraints) (10–13, 27, 28, 41, 42). Hence, in most cases, a high (≤2 Å) resolution crystal structure of a soluble globular protein will provide a better description of the structure in solution than the corresponding NMR structure. Second, the probability distributions for the various dihedral angles observed in the crystallographic database are a direct result of the underlying physical chemistry of the system and as such provide a perfectly reasonable, albeit empirically derived, measure of the relative energetics of different combinations of dihedral angles (14). Third, the discriminating and converging power of the conformational database potential with regard to dihedral angles is significantly better than that of the currently available empirical nonbonded potentials. This point is hardly surprising because the conformational database potential acts directly on rotatable bonds whereas the nonbonding potentials do not.

A question that is invariably asked about the conformational database potential is whether one will be able to pick up unusual sidechain or backbone conformations. Inspection of high resolution protein x-ray structures indicates that one safely can assume that 90–95% of all residues have a sidechain conformation resembling that of a common rotamer (50). Under these conditions, residues that truly exhibit a skewed rotamer conformation will be spotted by specific discrepancies between the model and the experimental restraints, and in most circumstances such violations will be accounted for by special structural features of the model. Moreover, one should be especially careful in believing a nonrotamer sidechain conformation in NMR structures in the absence of extensive NOE and coupling constant data relating to that particular residue. Exactly the same arguments can be applied to φ, ψ angles located in unfavorable regions of the Ramachandran plot, which likewise should be treated with extreme caution unless there is extensive experimental evidence to the contrary (50).

We thank Ad Bax, Dan Garrett, John Kuszewski, and Nico Tjandra for many stimulating discussions.

1. Clore, G.M. & Gronenborn, A.M. (1991) Science 252, 1390–1399.
2. Wüthrich, K. (1986) NMR of Proteins and Nucleic Acids (Wiley, New York).
3. Clore, G.M. & Gronenborn, A.M. (1987) Protein Eng. 1, 275–288.
4. Dyson, H.J., Gippert, G.P., Case, D.A., Holmgren, A. & Wright, P.E. (1990) Biochemistry 29, 4129–4136.
5. Forman-Kay, J.D., Clore, G.M., Wingfield, P.T. & Gronenborn, A.M. (1991) Biochemistry 30, 2685–2698.
6. Clore, G.M., Wingfield, P.T. & Gronenborn, A.M. (1991) Biochemistry 30, 2315–2323.
7. Clore, G.M. & Gronenborn, A.M. (1991) J. Mol. Biol. 221, 47–53.
8. Garrett, D.S., Seok, Y.J., Liao, D.-I., Peterkofsky, A., Gronenborn, A.M. & Clore, G.M. (1997) Biochemistry 36, 2517–2530.
9. Martin, J.R., Mulder, F.A.A., Karimi-Nejad, Y., van der Zwan, J., Mariani, M., Schipper, D. & Boelens, R. (1997) Structure 5, 521–532.
10. Garrett, D.S., Kuszewski, J., Hancock, T.J., Lodi, P.J., Vuister, G.W., Gronenborn, A.M. & Clore, G.M. (1994) J. Magn. Reson. 104, 99–103.
11. Kuszewski, J., Qin, J., Gronenborn, A.M. & Clore, G.M. (1995) J. Magn. Reson. 106, 92–96.
12. Kuszewski, J., Gronenborn, A.M. & Clore, G.M. (1995) J. Magn. Reson. 107, 293–297.
13. Kuszewski, J., Gronenborn, A.M. & Clore, G.M. (1996) J. Magn. Reson. 112, 79–81.
14. Kuszewski, J., Gronenborn, A.M. & Clore, G.M. (1996) Protein Sci. 5, 1067–1080.
15. Kuszewski, J., Gronenborn, A.M. & Clore, G.M. (1997) J. Magn. Reson. 125, 171–177.
16. Tjandra, N., Garrett, D.S., Gronenborn, A.M., Bax, A. & Clore, G.M. (1997) Nat. Struct. Biol. 4, 443–449.
17. Tjandra, N., Omichinski, J.G., Gronenborn, A.M., Clore, G.M. & Bax, A. (1997) Nat. Struct. Biol. 4, 732–738.
18. Clore, G.M., Gronenborn, A.M. & Tjandra, N. (1998) J. Magn. Reson. 131, 159–162.
19. Clore, G.M. & Gronenborn, A.M. (1989) CRC Crit. Rev. Biochem. Mol. Biol. 24, 479–564.
20. Clore, G.M., Brünger, A.T., Karplus, M. & Gronenborn, A.M. (1986) J. Mol. Biol. 191, 523–551.
21. Nilges, M., Clore, G.M. & Gronenborn, A.M. (1988) FEBS Lett. 229, 317–324.
22. Stein, E.G., Rice, L.M. & Brünger, A.T. (1997) J. Magn. Reson. 124, 154–164.
23. Havel, T.F. & Wüthrich, K. (1985) J. Mol. Biol. 182, 381–394.
24. Braun, W. (1987) Q. Rev. Biophys. 19, 115–157.
25. Gronenborn, A.M. & Clore, G.M. (1995) CRC Crit. Rev. Biochem. Mol. Biol. 30, 351–385.
26. Clore, G.M., Robien, M.A. & Gronenborn, A.M. (1993) J. Mol. Biol. 231, 81–102.
27. Bartik, K., Dobson, C.M. & Redfield, C. (1993) Eur. J. Biochem. 215, 255–266.
28. Wang, A.C. & Bax, A. (1996) J. Am. Chem. Soc. 118, 2483–2494.
29. Luzzati, V. (1952) Acta Crystallogr. 5, 802–810.
30. Borgias, B.A., Gochin, M., Kerwood, D.J. & James, T.L. (1990) Progr. NMR Spectrosc. 22, 83–100.
31. Yip, P. & Case, D.A. (1991) J. Magn. Reson. 83, 643–648.
32. Nilges, M., Habazettl, P., Brünger, A.T. & Holak, T.A. (1991) J. Mol. Biol. 219, 499–510.
33. Brünger, A.T., Clore, G.M., Gronenborn, A.M., Saffrich, R. & Nilges, M. (1993) Science 261, 328–331.
34. Karplus, M. (1963) J. Am. Chem. Soc. 85, 2870.
35. Bax, A., Vuister, G.W., Grzesiek, S., Delaglio, F., Wang, A.C., Tschudin, R. & Zhu, G. (1994) Methods Enzymol. 239, 79–106.
36. Hu, J.-S. & Bax, A. (1996) J. Am. Chem. Soc. 118, 8170–8171.
37. Ottiger, M. & Bax, A. (1997) J. Am. Chem. Soc. 119, 8070–8075.
38. Spera, S. & Bax, A. (1991) J. Am. Chem. Soc. 113, 5491–5492.
39. Wishart, D.S. & Sykes, B.D. (1994) J. Biomol. NMR 4, 171–180.
40. Oldfield, E. (1995) J. Biomol. NMR 5, 217–225.
41. Osapay, K.A. & Case, D.A. (1991) J. Am. Chem. Soc. 113, 9436–9444.
42. Williamson, M.P. & Asakura, T. (1993) J. Magn. Reson. 101, 63–71.
43. Omichinski, J.G., Pedone, P.V., Felsenfeld, G., Gronenborn, A.M. & Clore, G.M. (1997) Nat. Struct. Biol. 4, 122–132.
44. Huth, J.R., Bewley, C.A., Nissen, M.S., Evans, J.N.S., Reeves, R., Gronenborn, A.M. & Clore, G.M. (1997) Nat. Struct. Biol. 4, 657–665.
45. Lodi, P.J., Ernst, J.A., Kuszewski, J., Hickman, A.B., Engelman, A., Craigie, R., Clore, G.M. & Gronenborn, A.M. (1995) Biochemistry 34, 9826–9833.
46. Tjandra, N. & Bax, A. (1997) Science 278, 1111–1114.
47. Woessner, D.E. (1962) J. Chem. Phys. 36, 647–654.
48. Bothner-By, A.A. (1995) in Encyclopedia of Nuclear Magnetic Resonance, eds. Grant, D.M. & Harris, R.K. (Wiley, Chichester, U.K.), pp. 2932–2938.
49. Tjandra, N., Grzesiek, S. & Bax, A. (1996) J. Am. Chem. Soc. 118, 6264–6272.
50. Kleywegt, G.J. & Jones, T.A. (1997) Methods Enzymol. 227, 208–230.
51. Clore, G.M., Ernst, J.A., Clubb, R.T., Omichinski, J.G., Kennedy, W.M.P., Sakaguchi, K., Appella, E. & Gronenborn, A.M. (1995) Nat. Struct. Biol. 2, 321–332.
52. Clore, G.M. & Gronenborn, A.M. (1991) J. Mol. Biol. 217, 611–620.
53. Szyperski, T., Güntert, P., Stone, S.R. & Wüthrich, K. (1992) J. Mol. Biol. 228, 1192–1205.
54. Berndt, K.D., Güntert, P., Orbons, L.P.M. & Wüthrich, K. (1992) J. Mol. Biol. 227, 757–775.
55. Hyberts, S.G., Goldberg, M.S., Havel, T.F. & Wagner, G. (1992) Protein Sci. 1, 736–751.
56. Moore, J.M., Lepre, C., Gippert, G.P., Chazin, W.J., Case, D.A. & Wright, P.E. (1991) J. Mol. Biol. 221, 533–555.
57. Billeter, M., Kline, A.D., Braun, W., Huber, R. & Wüthrich, K. (1989) J. Mol. Biol. 206, 677–687.
58. Folkers, P.J.M., Clore, G.M., Driscoll, P.C., Dodt, J., Köhler, S. & Gronenborn, A.M. (1989) Biochemistry 28, 2601–2617.
59. Spitzfaden, C., Braun, W., Wider, G., Widmer, H. & Wüthrich, K. (1994) J. Biomol. NMR 4, 463–482.
60. Osapay, K., Theriault, Y., Wright, P.E. & Case, D.A. (1994) J. Mol. Biol. 244, 183–197.
61. Clore, G.M., Gronenborn, A.M., Nilges, M. & Ryan, C.A. (1987) Biochemistry 26, 8012–8023.
62. Billeter, M., Vendrell, J., Wider, G., Aviles, F.X., Coll, M., Guasch, A., Huber, R. & Wüthrich, K. (1992) J. Biomol. NMR 2, 1–10.
63. Clore, G.M., Gronenborn, A.M., James, M.N.G., Kjaer, M., McPhalen, C.A. & Poulsen, F.M. (1987) Protein Eng. 1, 313–318.


Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5899–5905, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

Estimation of evolutionary distances under stationary and nonstationary models of nucleotide substitution

(substitution models/rate heterogeneity/stationarity/nonstationarity/estimation bias)

XUN GU* AND WEN-HSIUNG LI†‡

*Institute of Molecular Evolutionary Genetics, 328 Mueller Laboratory, Pennsylvania State University, University Park, PA 16802; and †Human Genetics Center, SPH, University of Texas, P.O. Box 20334, Houston, TX 77225

ABSTRACT Estimation of evolutionary distances has always been a major issue in the study of molecular evolution because evolutionary distances are required for estimating the rate of evolution in a gene, the divergence dates between genes or organisms, and the relationships among genes or organisms. Other closely related issues are the estimation of the pattern of nucleotide substitution, the estimation of the degree of rate variation among sites in a DNA sequence, and statistical testing of the molecular clock hypothesis. Mathematical treatments of these problems are considerably simplified by the assumption of a stationary process in which the nucleotide compositions of the sequences under study have remained approximately constant over time, and there now exist fairly extensive studies of stationary models of nucleotide substitution, although some problems remain to be solved. Nonstationary models are much more complex, but significant progress has been recently made by the development of the paralinear and LogDet distances. This paper reviews recent studies on the above issues and reports results on correcting the estimation bias of evolutionary distances, the estimation of the pattern of nucleotide substitution, and the estimation of rate variation among the sites in a sequence.

Evolutionary distances (usually designated by d) such as the number of nucleotide substitutions between two DNA sequences (K) are basic quantities in the study of molecular evolution because they are required for computing the rate of evolution in a DNA or protein sequence, for inferring the evolutionary relationships among genes or organisms, and for estimating the divergence dates between taxa or genes (1–9). For these purposes, however, it is essential to obtain reliable estimates of evolutionary distances. Indeed, if the evolutionary distances are not accurately estimated, all distance matrix methods of tree reconstruction may be misleading (5–6, 8). Because accurate estimation of evolutionary distances requires a realistic model of nucleotide substitution, much effort has been made to develop general models of nucleotide substitution (4, 8).

If the process of nucleotide substitution is stationary, i.e., if the nucleotide compositions of the sequences under study have been approximately constant over time, then fairly general models of nucleotide substitution can be developed. For the stationary, time-reversible model (the SR model), Lanave et al. (10), Gu and Li (11), and others (12–14) have developed methods for estimating K. This model includes many other models as special cases (see next page). Moreover, Gu and Li (11) have recently extended the SR model to include rate variation among sites, i.e., the SRV model, in which SRV stands for stationary, time-reversible, and rate-variable.

When nucleotide frequencies change with time so that stationarity does not hold, phylogenetic reconstruction using distances estimated under a stationary model can be misleading because it tends to group together sequences of similar nucleotide compositions irrespective of their true evolutionary relationships (15–18). Nonstationarity greatly complicates the mathematics. Fortunately, significant progress has been made with the development of the paralinear (19) and LogDet distances (17, 20). However, both methods assume a uniform rate among sites, and so methods for dealing with rate heterogeneity remain to be developed.

An issue related to the estimation of evolutionary distances is the estimation of the pattern of nucleotide substitution. This pattern can be reliably estimated under stationarity (21–23) but is difficult to estimate under nonstationarity. Another problem closely related to distance estimation is how to estimate the degree of rate variation among sites (24–29). Many methods have been proposed for this purpose under a specific distribution (e.g., a gamma distribution). However, how to estimate rate heterogeneity without assuming a specific distribution has been unclear (30). These issues will be considered in this paper.

A further issue is that estimation bias usually occurs when the sequence length is short so that stochastic effects are strong. Although the bias tends to become trivial as the sequence length increases, it is desirable to correct the bias because in practice many sequences studied are actually very short (31–32).

The purpose of this article is to review recent studies on the above issues and to present our results.

Stationary Models

The SR Model. Assume that nucleotide substitution follows a stationary Markov process (10–14). Denote A, G, T, and C as 1, 2, 3, and 4, respectively. Let R be the rate matrix whose ij-th element $r_{ij}$ is the rate of change from nucleotide i to nucleotide j ($i \ne j$; i, j = 1, 2, 3, 4); the diagonal elements are given by $r_{ii} = -\sum_{j \ne i} r_{ij}$. Then the matrix of transition probabilities P for t time units is given by $P = e^{Rt}$, where the ij-th element $P_{ij}$ is the probability of transition from nucleotide i to nucleotide j after t evolutionary time units.
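As a concrete illustration of the relation between R and P, the following minimal sketch builds a rate matrix from its off-diagonal elements and evaluates the matrix exponential with a truncated Taylor series plus scaling and squaring. The function names, the choice of series, and the Jukes-Cantor rate value a = 0.1 are assumptions made here for illustration; they are not taken from the paper.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=40, squarings=10):
    """Approximate exp(m) by a Taylor series on m / 2**squarings, then square back."""
    n = len(m)
    scale = 2.0 ** squarings
    small = [[x / scale for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term now holds small**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def rate_matrix(off_diagonal):
    """Set each diagonal element to r_ii = -sum_{j != i} r_ij."""
    n = len(off_diagonal)
    r = [row[:] for row in off_diagonal]
    for i in range(n):
        r[i][i] = -sum(r[i][j] for j in range(n) if j != i)
    return r


def transition_matrix(r, t):
    """P = exp(R t) for rate matrix r and time t."""
    return mat_exp([[x * t for x in row] for row in r])


# Hypothetical Jukes-Cantor example: every change occurs at rate a.
a = 0.1
R = rate_matrix([[0.0 if i == j else a for j in range(4)] for i in range(4)])
P = transition_matrix(R, t=2.0)
```

For the Jukes-Cantor case the diagonal of P has the closed form 1/4 + (3/4)exp(−4at), which gives a simple check on the numerical result.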

‡To whom reprint requests should be addressed, e-mail: [email protected].
© 1998 by The National Academy of Sciences 0027–8424/98/955899–7$2.00/0. PNAS is available online at http://www.pnas.org.
Abbreviations: SR, stationary, time-reversible; SRV, SR rate-variable; NR, time-irreversible; TR, time-reversible.

The substitution process is reversible in time if and only if $\pi_i r_{ij} = \pi_j r_{ji}$, where $\pi_i$ is the equilibrium frequency of nucleotide i. The preceding relation implies that the off-diagonal elements of R can be expressed as

Therefore, the SR model is a nine-parameter model and includes many models as special cases, e.g., the models of Jukes and Cantor (33), Kimura (34), Tajima and Nei (35), Hasegawa et al. (21), and Tamura and Nei (22). The SR model has been studied by many authors (10–14, 23, 36).
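The parameter count can be checked directly from the reversibility condition. In the sketch below, $s_{ij}$ is a symbol introduced here for illustration only and should not be read as the notation of the original display equation; all equilibrium frequencies are assumed to be positive.

```latex
\pi_i r_{ij} = \pi_j r_{ji}
\;\Longrightarrow\;
\frac{r_{ij}}{\pi_j} = \frac{r_{ji}}{\pi_i} \equiv s_{ij} = s_{ji}
\;\Longrightarrow\;
r_{ij} = \pi_j\, s_{ij} \qquad (i \ne j).
% Free parameters:
%   6 symmetric coefficients s_{ij} (i < j)
% + 3 independent frequencies (\pi_1 + \pi_2 + \pi_3 + \pi_4 = 1)
% = 9 parameters.
```

Setting all $s_{ij}$ equal and all $\pi_i = 1/4$ gives the one-parameter Jukes-Cantor model as a special case.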

Consider two sequences (designated by 1 and 2) that have evolved from O, their common ancestor, t time units ago (Fig. 1). Under stationarity, time-reversibility means that the substitution process from the common ancestor O to sequences 1 and 2 is equivalent to the substitution process from 1 through O to 2 (or from 2 through O to 1), whose transition probability matrix for 2t time units is given by

$$P = e^{2tR}.$$ [1]

Let $\lambda_k$ (k = 1, 2, 3, 4) be the k-th eigenvalue of the rate matrix R; one of them is zero, say $\lambda_4 = 0$. Let $z_k$ be the k-th eigenvalue of P. Eq. 1 implies $z_k = e^{2t\lambda_k}$. Gu and Li (11) showed that the evolutionary distance defined by the average number of substitutions per site (i.e., $d = -2t\sum_i \pi_i r_{ii}$) is given by

$$d = -\sum_{k=1}^{3} c_k \ln z_k,$$ [2]

where the constants $c_k$ are determined by the eigenmatrix of P. Eq. 2 is generally valid since all eigenvalues $z_k$ are real under the SR model (11, 37). For example, under the Jukes-Cantor model (33), $z_1 = z_2 = z_3 = 1 - 4p/3$ and $c_1 = c_2 = c_3 = 1/4$ so that Eq. 2 is reduced to $d = -(3/4)\ln(1 - 4p/3)$, where p is the proportion of nucleotide differences between the two sequences.

The SR distance can be estimated from the data matrix J, whose ij-th element ($J_{ij}$) is the frequency of sites at which the nucleotides in the two sequences are i and j, respectively. By time-reversibility, we have $J_{ij} = \pi_i P_{ij}$. Therefore, the ij-th element of P (for 2t time units) can be estimated by $\hat{P}_{ij} = J_{ij}/\pi_i$ (i, j = 1, …, 4), where $\pi_i$ and $J_{ij}$ are easily obtained from the sequence data. Let matrix $\hat{P}$ consist of $\hat{P}_{ij}$. Its eigenvalues $\hat{z}_k$ (k = 1, 2, 3) can be computed by a standard algorithm, and the constants $\hat{c}_k$ (k = 1, 2, 3) are computed from $u_{ik}$ and $v_{kj}$, the elements of the corresponding eigenmatrix U and its inverse matrix V, respectively. For details, see Saccone et al. (38), Gu and Li (11), and Li and Gu (39). The sampling variance of d and the variance-covariance matrix for more than two DNA sequences can be found in Gu and Li (11).
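The estimation steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the distance is taken to be the number of substitutions per site, d = −2t Σ_i π_i r_ii, and instead of an explicit eigen-decomposition the sketch computes 2tR = ln P̂ with a truncated power series for the matrix logarithm, which gives the same matrix whenever every eigenvalue of P̂ lies in (0, 2). Symmetrizing J before estimating π and P̂ is also an assumption made here, as are all function names.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_log(p, terms=400):
    """ln(P) from the series sum_{k>=1} (-1)**(k+1) (P - I)**k / k."""
    n = len(p)
    x = [[p[i][j] - float(i == j) for j in range(n)] for i in range(n)]
    power = [row[:] for row in x]
    result = [[0.0] * n for _ in range(n)]
    for k in range(1, terms + 1):
        sign = 1.0 if k % 2 else -1.0
        for i in range(n):
            for j in range(n):
                result[i][j] += sign * power[i][j] / k
        power = mat_mul(power, x)
    return result


def sr_distance(j_matrix):
    """Return (d, R*) where R* = ln(P_hat) estimates 2tR and d = -sum_i pi_i R*_ii."""
    n = len(j_matrix)
    # Assumption: average J with its transpose so that J_ij = J_ji,
    # as time-reversibility implies in expectation.
    j_sym = [[0.5 * (j_matrix[i][j] + j_matrix[j][i]) for j in range(n)]
             for i in range(n)]
    pi = [sum(row) for row in j_sym]
    p_hat = [[j_sym[i][j] / pi[i] for j in range(n)] for i in range(n)]
    r_star = mat_log(p_hat)
    d = -sum(pi[i] * r_star[i][i] for i in range(n))
    return d, r_star


# Hypothetical Jukes-Cantor data matrix with p = 0.3 differing sites.
p = 0.3
J = [[(1 - p) / 4 if i == j else p / 12 for j in range(4)] for i in range(4)]
d_hat, r_star_hat = sr_distance(J)
```

For this data matrix the result should agree with the Jukes-Cantor formula d = −(3/4)ln(1 − 4p/3) quoted above. The series converges slowly when an eigenvalue of P̂ is close to zero, so an eigen-decomposition is the better choice for highly diverged sequences.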

FIG. 1. Two DNA sequences diverged t time units ago.

Eq. 2 can be used to define many additive distances by choosing appropriate constants $c_k$ (Table 1), e.g., the number of nucleotide substitutions per site (K), the number of transitional substitutions per site (A), the number of transversional substitutions per site (B), and the number of substitutions from nucleotides i to j ($D_{ij}$). These distance measures are useful for phylogenetic analysis and molecular clock testing.

The SRV Model. Rate variation among sites can be incorporated into the SR model by assuming $r_{ij} = a_{ij}u$, where $a_{ij}$ is a constant and u varies according to a gamma distribution

$$f(u) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, u^{\alpha-1} e^{-\beta u},$$ [3]

with mean $\bar{u} = \alpha/\beta$; α is the shape parameter and determines the degree of rate variation. Under this model, the (mean) transition probability matrix P for 2t time units is given by

$$P = \left[I - \frac{2t}{\alpha}\bar{R}\right]^{-\alpha},$$ [4]

where I is the identity matrix and the mean rate matrix $\bar{R} = \bar{u}A$, where matrix A consists of $a_{ij}$ (11). From Eq. 4, one can show that the k-th eigenvalue of P is given by

$$z_k = \left(1 - \frac{2\lambda_k t}{\alpha}\right)^{-\alpha},$$ [5]

where $\lambda_k$ is the k-th eigenvalue of $\bar{R}$. It follows that the evolutionary distance under the SRV model is given by

$$d = \alpha \sum_{k=1}^{3} c_k \left(z_k^{-1/\alpha} - 1\right).$$ [6]

The constants $c_k$ are determined in the same manner as above (Table 1). Note that Eq. 4 reduces to Eq. 1 and Eq. 6 to Eq. 2 as α → ∞, i.e., when the substitution rate is uniform among sites.
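The limiting behavior noted above can be checked numerically for the Jukes-Cantor special case. The sketch assumes gamma-distributed rates with shape α and unit mean, so that averaging $e^{2t\lambda u}$ over sites gives $z = (1 - 2\lambda t/\alpha)^{-\alpha}$ and hence $-2\lambda t = \alpha(z^{-1/\alpha} - 1)$; with the Jukes-Cantor values $z_k = 1 - 4p/3$ and $c_k = 1/4$ quoted earlier this yields the distance below. The function names are illustrative.

```python
import math


def jc_distance(p):
    """Jukes-Cantor distance for a uniform rate: d = -(3/4) ln(1 - 4p/3)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_gamma_distance(p, alpha):
    """Jukes-Cantor distance with gamma rate variation (shape alpha, unit mean)."""
    z = 1.0 - 4.0 * p / 3.0
    return 0.75 * alpha * (z ** (-1.0 / alpha) - 1.0)


p = 0.3
uniform = jc_distance(p)
nearly_uniform = jc_gamma_distance(p, alpha=1e6)   # large alpha: almost no rate variation
strong_variation = jc_gamma_distance(p, alpha=0.5)  # small alpha: strong rate variation
```

With the same observed p, the gamma-corrected distance exceeds the uniform-rate distance, which is the sense in which ignoring rate variation underestimates d.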

Furthermore, Eq. 6 can be generalized to any distribution f(u) for the rate variation among sites. Let $G(s) = \int_0^{\infty} e^{su} f(u)\,du$ be the moment-generating function of f(u). Gu and Li (11) showed that $z_k = G(2\lambda_k t)$, k = 1, 2, 3, 4. Thus, the general additive distance is given by

$$d = -\sum_{k=1}^{3} c_k\, G^{-1}(z_k),$$ [7]


where $G^{-1}$ is the inverse function of the moment-generating function G. For example, consider the invariant+gamma model (26, 40–41): (i) for a given site, the probability of being invariable (i.e., u = 0) is θ, whereas the probability of being variable is 1 − θ; and (ii) among the sites that are variable, the substitution rate follows a gamma distribution. By applying Eq. 7, one can show that the evolutionary distance under the invariant+gamma distribution is given by

[8]

For other distributions, see Waddell et al. (30).

Table 1. The constants $c_k$ in the general SR or SRV distance

K is the number of substitutions per site; A is the number of transitional substitutions per site; B is the number of transversional substitutions per site; and $D_{ij}$ is the number of substitutions from nucleotides i to j per site. The subscripts j ≠ i ∈ Ts and j ≠ i ∈ Tv mean that the differences between nucleotides i and j are transitional and transversional, respectively.

Bias-Corrected SR and SRV Distances. Our computer simulation has shown that when the sequence length is short the SR and SRV methods tend to overestimate the evolutionary distance. The bias can be corrected as follows.

Let $\hat{d}$ be an estimate of the SR or SRV distance. We use the first three terms of the Taylor expansion to obtain an approximate expression of $E[\hat{d}]$. For the SR model,

[9]

Therefore, the bias-corrected SR distance is given by

$$d_c = \hat{d} - \delta,$$ [10]

where δ is defined as

[11]

and $\mathrm{Var}(\hat{z}_k)$ can be obtained by the method of Gu and Li (11).

The bias-corrected distance under the SRV model also can be written as Eq. 10, except that δ is replaced by

[12]
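The Taylor-expansion argument can be worked through as follows. This is a sketch under explicit assumptions rather than a reproduction of the printed Eqs. 9, 11, and 12: the SR distance is taken as $d = -\sum_k c_k \ln z_k$, the SRV distance as $d = \alpha\sum_k c_k (z_k^{-1/\alpha} - 1)$, the constants $c_k$ are treated as fixed, and $\hat{z}_k$ is assumed to be unbiased for $z_k$.

```latex
% SR model: expand ln(\hat z_k) about z_k to second order.
\ln \hat z_k \approx \ln z_k + \frac{\hat z_k - z_k}{z_k}
               - \frac{(\hat z_k - z_k)^2}{2 z_k^2}
\;\Longrightarrow\;
E[\hat d\,] \approx d + \frac{1}{2}\sum_{k=1}^{3} c_k
               \frac{\mathrm{Var}(\hat z_k)}{z_k^{2}},
\qquad
\delta_{\mathrm{SR}} \approx \frac{1}{2}\sum_{k=1}^{3} c_k
               \frac{\mathrm{Var}(\hat z_k)}{\hat z_k^{2}} .

% SRV model: g(z) = \alpha (z^{-1/\alpha} - 1), so
% g''(z) = (1 + 1/\alpha)\, z^{-(2 + 1/\alpha)} and
\delta_{\mathrm{SRV}} \approx \frac{1}{2}\left(1 + \frac{1}{\alpha}\right)
               \sum_{k=1}^{3} c_k\, \hat z_k^{-(2 + 1/\alpha)}\,
               \mathrm{Var}(\hat z_k) .
```

Under these assumptions the correction is positive whenever the $c_k$ are positive, consistent with the overestimation reported for short sequences, and $\delta_{\mathrm{SRV}}$ reduces to $\delta_{\mathrm{SR}}$ as α → ∞.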

Computer Simulation. Extensive computer simulations on the performance of the SR and SRV methods have been conducted in this study and in Rodriguez et al. (14), Zharkikh (31), and Gu and Li (11). The results can be summarized as follows.

(i) When the sequence length (L) is long and the rate of substitution is uniform among sites, the SR method performs well, whereas simpler methods [e.g., Kimura’s two-parameter method (34)] give biased estimates if some assumptions of the method are violated (11, 14, 31). Because the actual substitution pattern of DNA evolution may be complex, the SR method is preferred when the sequences are long, say, longer than 1,000 bp.

(ii) The SR method may give large biases when the sequence length is short (say, L≤200), but the biases can be substantially reduced by the bias-corrected SR distance (Table 2). As L becomes longer than 2,000 bp, the estimation bias virtually decreases to zero. The same comment applies to the SRV method (Table 3).

(iii) The SR method performs well even when DNA sequence evolution is not time-reversible (see models NR1 and NR2 in Table 2). Therefore, the assumption of time-reversibility, which simplifies the estimation problem considerably, may not have serious effects on distance estimation.

(iv) When the substitution rate varies among sites, the evolutionary distance can be seriously underestimated by the SR method; note that this bias is systematic and cannot be eliminated by increasing sequence length. As shown in Table 3, the SRV method performs well and the estimation bias vanishes when L is long.

(v) The methods developed by Gu and Li (11) for estimating sampling variance under the SR and SRV models appear to be reliable except when L<200 and d>1.0.

(vi) The mean squared error defined by MSE = bias² + Var(d) is useful for comparing the relative performance of two methods because for a simple method, the sampling variance tends to be smaller but the bias tends to be larger (11). For example, using this criterion, Gu and Li (11) found that SR is superior to JC when L>500 bp and that SRV is always superior to SR when the substitution rate varies among sites.
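The mean squared error criterion in point (vi) can be computed from simulation replicates as sketched below. The function name and the replicate values are hypothetical; the population variance is used so that bias² + Var equals the average squared deviation from the true distance exactly.

```python
from statistics import fmean, pvariance


def bias_variance_mse(estimates, true_d):
    """Return (bias, variance, MSE) of distance estimates around the true distance."""
    mean_est = fmean(estimates)
    bias = mean_est - true_d
    # Population variance of the replicates around their own mean.
    var = pvariance(estimates, mu=mean_est)
    return bias, var, bias * bias + var


# Hypothetical replicate estimates for a true distance of 0.5.
replicates = [0.5, 0.6, 0.7]
bias, var, mse = bias_variance_mse(replicates, true_d=0.5)
```

Comparing the MSE of two estimators on the same replicates shows whether a smaller variance outweighs a larger bias, which is the trade-off described in point (vi).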

Table 2. The mean of distances (d) over simulation replicates estimated by the bias-corrected SR method and the SR method

                 Sequence length (L)
Model     200              500              2000
(1) d=0.5
JC        0.503 (0.506)    0.506 (0.516)    0.501 (0.502)
K2P       0.507 (0.517)    0.502 (0.506)    0.501 (0.502)
TN        0.508 (0.516)    0.504 (0.507)    0.501 (0.502)
TmN       0.505 (0.516)    0.505 (0.509)    0.501 (0.502)
SR        0.509 (0.517)    0.503 (0.506)    0.501 (0.502)
NR1       0.509 (0.517)    0.503 (0.507)    0.501 (0.502)
NR2       0.510 (0.517)    0.505 (0.509)    0.501 (0.502)
(2) d=1.0
JC        1.036 (1.082)    1.013 (1.029)    1.005 (1.008)
K2P       1.072 (1.093)    1.008 (1.038)    1.003 (1.009)
TN        1.046 (1.089)    1.015 (1.037)    1.006 (1.010)
TmN       1.061 (1.085)    1.016 (1.050)    1.005 (1.012)
SR        1.049 (1.085)    1.006 (1.038)    1.005 (1.009)
NR1       1.057 (1.090)    1.009 (1.044)    1.005 (1.011)
NR2       1.071 (1.094)    1.015 (1.055)    1.006 (1.012)

The value presented in each case is the mean of d estimated by the bias-corrected SR method and the value in parentheses by the (uncorrected) SR method. Simulation models: JC, the Jukes-Cantor model (33); K2P, Kimura’s two-parameter model (34), with a transition/transversion ratio of 4. For TN (Tajima and Nei, Ref. 35), TmN (Tamura and Nei, Ref. 22), SR, and the two time-irreversible models (NR1 and NR2), see Gu and Li (11) for a detailed description.

Table 3. The mean of distance (d) estimated by the SRV method and the bias-corrected SRV method

L      True d    SR+V model       NR2+V model
(1) α=0.5
200    0.3       0.317 (0.325)    0.320 (0.334)
       0.5       0.520 (0.552)    0.555 (0.574)
       1.0       1.068 (1.179)    1.193 (1.303)
500    0.3       0.303 (0.307)    0.305 (0.310)
       0.5       0.508 (0.517)    0.510 (0.520)
       1.0       1.027 (1.061)    1.037 (1.077)
(2) α=1.0
200    0.3       0.312 (0.318)    0.313 (0.319)
       0.5       0.513 (0.528)    0.531 (0.544)
       1.0       1.038 (1.126)    1.063 (1.149)
500    0.3       0.306 (0.309)    0.306 (0.309)
       0.5       0.508 (0.514)    0.502 (0.507)
       1.0       1.013 (1.037)    1.022 (1.053)
(3) α=2.0
200    0.3       0.307 (0.311)    0.308 (0.312)
       0.5       0.514 (0.526)    0.514 (0.524)
       1.0       1.046 (1.132)    1.060 (1.146)
500    0.3       0.300 (0.302)    0.300 (0.301)
       0.5       0.503 (0.508)    0.502 (0.507)
       1.0       1.012 (1.034)    1.015 (1.043)

The value presented in each case is the mean of d estimated by the bias-corrected SRV method, and the value in parentheses by the (uncorrected) SRV method. See the note of Table 2 for details.

Estimating the Pattern of Nucleotide Substitution. The pattern of nucleotide substitution can be measured by the off-diagonal elements of the rate matrix R. For simplicity, these elements are usually rescaled, and here, we define the pattern of nucleotide substitution as R* = 2tR. Consider two DNA sequences (Fig. 1) under the SR model. Denote the diagonal matrix of the eigenvalues of $P = e^{2tR}$ by $\mathrm{diag}(z_1, z_2, z_3, z_4)$. By matrix theory, we have $P = U\,\mathrm{diag}(z_1, z_2, z_3, z_4)\,U^{-1}$, where U is the eigenmatrix of P. Then, the substitution pattern R* = 2tR = ln P can be expressed as

$$R^{*} = U\,\mathrm{diag}(\ln z_1, \ln z_2, \ln z_3, \ln z_4)\,U^{-1}.$$ [13]

Therefore, using the same procedure, we can estimate the evolutionary distance and the pattern of nucleotide substitution simultaneously. In the same manner, under the SRV model one can show that the pattern of nucleotide substitution can be estimated by

[14]

where (see also ref. 42).

It is known that estimation of the pattern of nucleotide substitution can be significantly improved by using n>2 sequences, but the estimation procedure becomes complex because it needs to consider the phylogenetic tree of the sequences, which may be unknown. The following simple method does not require knowledge of the tree topology. For a given pair of sequences i and j, which diverged $t_{ij}$ time units ago, the transition probability matrix under the SR model is $P^{(ij)} = e^{2t_{ij}R}$. By multiplying $P^{(ij)}$ over all pairs of sequences, we have

$$\prod_{i<j} P^{(ij)} = e^{2\tau R},$$ [15]

where $\tau = \sum_{i<j} t_{ij}$. Similarly, under the SRV model, one can show that

[16]

Therefore, when the transition probability matrix for each pair of sequences has been estimated, which is denoted by $\hat{P}_{ij}$, we first compute $\hat{P}(2\tau) = \prod_{i<j} \hat{P}_{ij}$. Then, under the SR or SRV model, the substitution pattern R* = 2τR for n sequences can be estimated by an approach similar to that for the case of two sequences. The sampling variances for the estimated substitution pattern can be obtained by the analytical method developed by Gu and Li (11) or by a simple resampling technique (e.g., bootstrapping).

When many sequences are considered for estimating the substitution pattern, the time scale τ in Eq. 16 can be very large, resulting in some elements in R* larger than one. Because we are more concerned with the relative rates among the types of nucleotide substitutions, it is better to provide a normalized substitution pattern. A simple normalization procedure is to compute where M = n(n–1)/2 and the weight $w_{ij} = 1/d_{ij}$.

A General Measure of Rate Variation Among Sites

Gu et al. (26) suggested a normalized measure (ρ) for evaluating the relative strength of the rate variation among sites:

$$\rho = \frac{\mathrm{Var}(u)}{\bar{u}^{2} + \mathrm{Var}(u)} = \frac{C_v^{2}}{1 + C_v^{2}},$$ [17]

where Var(u) and $\bar{u}$ are the variance and mean of the evolutionary rate (u) for any distribution f(u), and $C_v = \sqrt{\mathrm{Var}(u)}/\bar{u}$ is the coefficient of variation. As ρ varies from 0 to 1, the rate heterogeneity increases from a uniform rate over sites (ρ = 0 or $C_v$ = 0) to the maximum heterogeneity (ρ = 1 or $C_v$ = ∞). Therefore, ρ can directly reflect rate heterogeneity, and unlike the shape parameter α of the gamma distribution, it does not depend on a specific distribution.

In the following we describe a simple method for estimating ρ without assuming a specific model for the rate variation among sites. We assume (i) at each site nucleotide substitution follows a Poisson process, and (ii) the evolutionary rate u varies among sites according to the distribution f(u). Let X be the number of substitutions at a nucleotide site with rate u. Then, the first two conditional moments of X are given by E[X|u] = uT and E[X²|u] = uT + (uT)², respectively, where T is the total evolutionary time. It follows that the first two (unconditional) moments of X over all sites are E[X] = E[E(X|u)] = TE[u] and E[X²] = E[E[X²|u]] = TE[u] + T²E[u²], respectively, where E[u] and E[u²] are the first two moments of f(u), respectively. Let m = E[X] and V = E[X²] − m², and let $\bar{u}$ = E[u] and Var(u) = E[u²] − $\bar{u}^2$. One can show that m = $\bar{u}$T and V = $\bar{u}$T + Var(u)T², and so $C_v = \sqrt{V - m}/m$. Therefore, the parameter ρ is given by

$$\rho = \frac{V - m}{V - m + m^{2}}.$$ [18]
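Given per-site substitution counts, the moment relations above translate into a short calculation. The sketch uses ρ = (V − m)/(V − m + m²), which follows from m = ūT and V = ūT + Var(u)T²; the function name, the use of the population variance, and the clamping of a negative V − m to zero are assumptions made here for illustration.

```python
def rho_from_counts(counts):
    """Estimate rho from per-site substitution counts via their mean and variance."""
    n = len(counts)
    m = sum(counts) / n
    if m == 0:
        raise ValueError("rho is undefined when no substitutions are observed")
    v = sum((x - m) ** 2 for x in counts) / n
    # Assumption: sampling noise can make V < m; treat that as no rate variation.
    excess = max(v - m, 0.0)
    return excess / (excess + m * m)


# Hypothetical counts: half of the sites never change, half change four times.
rho_mixed = rho_from_counts([0, 0, 4, 4])
# Hypothetical counts with no excess variance over the Poisson expectation.
rho_uniform = rho_from_counts([2, 2, 2, 2])
```

In practice the counts would come from an ancestral-sequence reconstruction on a known tree, and their accuracy limits the accuracy of the estimate.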

To estimate ρ from sequence data, we need to know the number of substitutions at each site. Conventionally, this number is inferred by the parsimony method (43) when the phylogenetic tree is known. However, the parsimony method tends to underestimate the true number of substitutions (29, 44). Gu and Zhang (29) solved this problem by using a combination of ancestral sequence inference and maximum likelihood estimation. Let X̂i be the number of substitutions at the ith site estimated by Gu and Zhang's method (29). Then, m and V can be estimated by the mean and variance of X̂i over the L sites (L is the sequence length), so that ρ̂ can be easily obtained from Eq. 18 without knowing the distribution f(u).

The biological meaning of ρ can be easily understood by using the following simple model. Let v be the mutation rate at a site. For invariant sites, the substitution rate is 0, and for the other sites, the rate is hv, where 0<h≤1. The average substitution rate of the gene is therefore u=(1–θ)hv, where θ is the frequency of invariable sites. It is easy to show that Cv=√(θ/(1–θ)) and ρ=θ. Thus, the substitution rate can be expressed as

u=(1–ρ)hv. [19]

This formula predicts a negative correlation between substitution rate and the rate variation among sites, which has been observed by J. Zhang and X. Gu (unpublished results).
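The moment estimator of ρ described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the function names `rho_from_moments` and `estimate_rho` are hypothetical, the per-site substitution counts are assumed to have been inferred already (for example by the method of ref. 29), the population variance is used for V, and estimates are truncated at zero when V does not exceed m.

```python
import statistics


def rho_from_moments(m, V):
    """Return (Cv, rho) from the mean m and variance V of per-site counts.

    Uses Cv = sqrt(V - m)/m and rho = (V - m)/(V - m + m^2).
    Assumption of this sketch: if V <= m (no variance in excess of the
    Poisson expectation), both estimates are truncated to 0.
    """
    if m <= 0:
        raise ValueError("mean number of substitutions must be positive")
    excess = max(V - m, 0.0)
    cv = excess ** 0.5 / m
    rho = excess / (excess + m * m)
    return cv, rho


def estimate_rho(counts):
    """Estimate (m, V, Cv, rho) from a list of per-site substitution counts."""
    m = statistics.fmean(counts)
    V = statistics.pvariance(counts, mu=m)  # population variance (assumption)
    cv, rho = rho_from_moments(m, V)
    return m, V, cv, rho


if __name__ == "__main__":
    # Hypothetical counts for eight sites, for illustration only.
    print(estimate_rho([0, 0, 0, 0, 4, 4, 4, 4]))
```

As a consistency check, feeding `rho_from_moments` the exact moments of the invariable-sites model (a fraction θ of sites with rate 0 and the rest with a common rate) returns ρ=θ, in agreement with the text.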

Nonstationary Models

LogDet and Paralinear Distances. The paralinear (19) and LogDet (17, 20) distances have been proposed to deal with nonstationarity. They are based on the most general model of nucleotide substitution. Historically, these methods can be traced back to Barry and Hartigan (13) and Cavender and Felsenstein (45).

Consider the evolution of two sequences (Fig. 1). Denote the diagonal matrix of nucleotide frequencies at node k (k=0, 1, 2) by Πk=diag(πk1, ..., πk4), where the subscript j refers to nucleotide j. Let J be the data matrix as defined previously. Then, the paralinear distance (between sequences 1 and 2) is defined as

d12=–(1/4) ln[det(J)/√(det(Π1) det(Π2))]. [20]

In Eq. 21, the constant –ln 4 is added because it does not change any property of the original LogDet distance but makes the biological interpretation easier (32). The paralinear and LogDet distances have the following properties:

(i) Both distances are based on the most general model of nucleotide substitution, i.e., the 12-parameter model (17, 19, 20, 31). Moreover, they are valid even if the rate matrix R varies among lineages. Therefore, in the case where the assumption of a uniform substitution rate among sites holds, the paralinear and LogDet distances are very useful for phylogenetic reconstruction when nucleotide frequencies are nonstationary (19, 20, 32).

(ii) For the neighbor-joining method and related methods, the two distance measures give the same tree topology (32). However, there are some differences between the two distances. First, the paralinear distance between two sequences is the sum of “paralinear” lengths of the branches involved. Thus, the branch lengths under a given tree can be well estimated from the paralinear distance matrix by the least-squares method. In contrast, this property does not hold for the LogDet distance. Second, the LogDet distance is particularly useful for testing the molecular clock hypothesis under nonstationarity, whereas the paralinear distance is not suitable for this purpose (see Eqs. 27 and 28).

(iii) The biological interpretation of the two distances can be described as follows. Let µ(k) be the arithmetic mean rate in lineage k (k=1, 2), and µ=(µ(1)+µ(2))/2. Gu and Li (32) showed that the expected paralinear distance (Eq. 20) is given by

[22]

and the expected LogDet distance (Eq. 21) is given by

[23]

Note that, when the nucleotide frequency is stationary, Eq. 22 reduces to d=2µt, which is the expected number of substitutions between the two sequences and is equivalent to the SR distance with ck=1/4 (Eq. 2). Eq. 23 reduces to d=2µt if

(iv) The approximate sampling variance of the paralinear distance is given by

[24]

and that of the LogDet distance is given by

[25]

where L is the sequence length and Mij is the ij-th element of M=J⁻¹ (13, 20, 32). For more than two sequences, the method for computing the variance-covariance matrix of the two distances has been developed by Gu and Li (32).
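The two distances can be illustrated with a short Python sketch for a pair of aligned sequences. Several points are assumptions of the sketch rather than statements from the text: the paralinear distance is taken in its commonly cited form, –(1/4) ln[det J/√(det Π1 det Π2)]; the LogDet distance is taken as –(1/4) ln det J – ln 4; J is estimated as the matrix of relative frequencies of aligned nucleotide pairs; sites containing anything other than A, C, G, or T are skipped; and all function names are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"


def divergence_matrix(seq1, seq2):
    """Return the 4x4 matrix J of relative frequencies of aligned pairs."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (equal length)")
    index = {n: i for i, n in enumerate(NUCLEOTIDES)}
    counts = [[0.0] * 4 for _ in range(4)]
    used = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in index and b in index:  # skip gaps and ambiguity codes
            counts[index[a]][index[b]] += 1.0
            used += 1
    if used == 0:
        raise ValueError("no comparable sites")
    return [[c / used for c in row] for row in counts]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
    return det


def paralinear(J):
    """Paralinear distance (assumed form); math.log raises if det(J) <= 0."""
    freq1 = [sum(row) for row in J]                             # sequence 1
    freq2 = [sum(J[i][j] for i in range(4)) for j in range(4)]  # sequence 2
    scale = math.sqrt(math.prod(freq1) * math.prod(freq2))
    return -0.25 * math.log(determinant(J) / scale)


def logdet(J):
    """LogDet distance with the constant -ln 4 (assumed form)."""
    return -0.25 * math.log(determinant(J)) - math.log(4.0)
```

When all nucleotide frequencies equal 1/4 in both sequences, the two functions return the same value, which is one way to check an implementation.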

Bias-Corrected Paralinear and LogDet Distances. Because the data matrix J and the nucleotide frequencies can be directly estimated from the sequence data, the estimation of paralinear and LogDet distances is simple (19, 20). However, our simulation study has revealed that the true (paralinear or LogDet) distance can be overestimated when the sequences are short (32), a situation similar to the SR/SRV distance. Gu and Li (32) obtained the following bias-corrected paralinear or LogDet distance:

d̂c=d̂–2Var(d̂), [26]

where d̂ and Var(d̂) are the estimates of the “standard” paralinear or LogDet distance and the sampling variance, respectively (see Eqs. 20, 21, 24, 25).
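Eq. 26 is a one-line correction once the distance and its sampling variance are available. The sketch below is a minimal illustration; the function name `bias_corrected_distance` and the numbers in the usage line are hypothetical and are not taken from Table 4.

```python
def bias_corrected_distance(d_hat, var_d_hat):
    """Apply Eq. 26: subtract twice the sampling variance from the estimate.

    d_hat     -- uncorrected paralinear or LogDet distance
    var_d_hat -- estimated sampling variance of d_hat
    """
    if var_d_hat < 0:
        raise ValueError("a variance cannot be negative")
    return d_hat - 2.0 * var_d_hat


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(bias_corrected_distance(0.60, 0.01))
```

Because the sampling variance shrinks as the sequence length grows, the correction becomes negligible for long sequences, which is consistent with the pattern described for the simulations.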

The performance of the bias-corrected distances has been examined by extensive computer simulation (32). We considered two DNA sequences (Fig. 1) that evolve under a very general model: in one lineage the nucleotide substitution follows a time-reversible model (TR) and in the other lineage it follows a time-irreversible model (NR). The rate matrices of TR and NR are designed to be very different, and the equilibrium GC% is 70% in TR but only 17% in NR (see ref. 32 for details). Moreover, the initial GC% at node O (Fig. 1) is set to 15%, 50%, or 70% in three cases. Our simulation results indicate that, when the sequence length is short, the bias-corrected paralinear or LogDet distance performs considerably better than the uncorrected distance (Table 4).

Testing the Molecular Clock Hypothesis Under Nonstationarity. The relative rate test (2) can be described as follows. Consider three species as shown in Fig. 2, where species 3 is an outgroup. To test whether the evolutionary rate in lineage O1 is the same as that in lineage O2 (i.e., the molecular clock hypothesis), one tests whether or not the difference D=d13–d23 is significantly different from zero. Wu and Li (2), Gu and Li (46), Muse and Weir (47), Tajima (48), and others have developed tests for the case of stationarity. When the nucleotide frequencies are nonstationary, D≠0 can arise from differences in nucleotide frequencies between the two sequences. Gu and Li (32) showed that this problem can be avoided by using the LogDet distance; that is,

D=d13–d23=(µ(1)–µ(2))t, [27]

where t is the divergence time between species 1 and 2 (Fig. 2). To test whether D is significantly different from zero, one can estimate the sampling variance of D, V(D)=V(d13)+V(d23)–2Cov(d13, d23), by the method of Gu and Li (32). When the sequence is long, the statistic Z=D/√V(D) follows approximately the standard normal distribution (2). Actually, this new relative rate test can be easily generalized to the two-cluster test of Li and Bousquet (49) and Takezaki et al. (50), who considered the case of stationarity (Gu and Li, unpublished data).

Table 4. Statistical performances of the bias-corrected paralinear distance

Initial GC%   L       d       d̂c             d̂
2µt=0.5
50%           200     0.486   0.488 (0.4%)   0.497 (2.3%)
              500     0.486   0.489 (0.6%)   0.492 (1.2%)
              2,000   0.486   0.487 (0.2%)   0.488 (0.4%)
70%           200     0.555   0.556 (0.2%)   0.572 (3.1%)
              500     0.555   0.557 (0.4%)   0.563 (1.4%)
              2,000   0.555   0.555 (0.0%)   0.557 (0.4%)
15%           200     0.607   0.599 (1.3%)   0.637 (4.9%)
              500     0.607   0.602 (0.8%)   0.613 (1.0%)
              2,000   0.607   0.609 (0.3%)   0.611 (0.7%)
2µt=0.8
50%           200     0.770   0.766 (0.5%)   0.791 (2.7%)
              500     0.770   0.768 (0.3%)   0.777 (0.9%)
              2,000   0.770   0.770 (0.0%)   0.772 (0.3%)
70%           200     0.858   0.842 (1.9%)   0.890 (3.7%)
              500     0.858   0.854 (0.5%)   0.868 (1.2%)
              2,000   0.858   0.859 (0.1%)   0.862 (0.5%)
15%           200     0.926   0.880 (5.0%)   0.986 (6.5%)
              500     0.926   0.918 (0.9%)   0.946 (1.2%)
              2,000   0.926   0.925 (0.1%)   0.930 (0.5%)

L is the sequence length; d is the true value of the paralinear distance; d̂c and d̂ are the means of d estimated by the bias-corrected and uncorrected paralinear distances. The percentage values in parentheses are the biases of d̂c (i.e., |d̂c–d|/d×100%) and d̂ (i.e., |d̂–d|/d×100%), respectively.

FIG. 2. The phylogeny used for molecular clock testing.

On the other hand, if dij is measured by the paralinear distance, one can show that D'=d13–d23 is given by

[28]

Obviously, D' is affected by differences in nucleotide frequencies and thus not suitable for testing the molecular clock hypothesis.
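The relative rate test built on Eq. 27 can be sketched as follows. This is a minimal illustration under stated assumptions: the LogDet distances d13 and d23, their sampling variances, and their covariance are taken as already estimated (the text refers to ref. 32 for that step), the statistic is taken to be D divided by the square root of V(D), and the function name `relative_rate_test` and the example numbers are hypothetical.

```python
import math


def relative_rate_test(d13, d23, var13, var23, cov):
    """Return (D, Z, two-sided P) for the relative rate test.

    D    = d13 - d23
    V(D) = V(d13) + V(d23) - 2 Cov(d13, d23)
    Z    = D / sqrt(V(D)), compared with the standard normal distribution
           (a large-sample approximation).
    """
    D = d13 - d23
    var_D = var13 + var23 - 2.0 * cov
    if var_D <= 0:
        raise ValueError("V(D) must be positive")
    z = D / math.sqrt(var_D)
    # Two-sided tail probability of a standard normal variate.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return D, z, p_value


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(relative_rate_test(0.30, 0.25, 0.0004, 0.0004, 0.0003))
```

The covariance term matters because d13 and d23 share the branch leading to the outgroup; ignoring it would overstate V(D) and make the test conservative.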

Discussion

In the above, we discussed the estimation of evolutionary distances and related issues under three models of nucleotide substitution: the SR model (10–14, 36), the SRV model (11), and the nonstationary model (13, 17, 19, 20, 32, 45). The conclusions can be summarized as follows. (i) Under stationarity, the evolutionary distances and the pattern of nucleotide substitution can be estimated under the SR or SRV model. (ii) When the nucleotide frequencies are nonstationary, the paralinear or LogDet distances should be used. However, although both distances lead to the same tree topology, the branch lengths of a tree can be appropriately estimated only from the paralinear distances, whereas the molecular clock hypothesis should be tested by the LogDet distance. (iii) The proposed bias-corrected methods for the SR/SRV and paralinear/LogDet distances are useful when the sequences are shorter than 500 bp. (iv) A general measure for the rate variation among sites is proposed, which does not depend on any specific distribution of rates.

In principle, the SR/SRV and paralinear/LogDet distances can be easily extended to more complex models in which the dimension of the rate matrix R is >4 (51–55). Two interesting cases are the amino acid-based model (a general 20×20 model) and the codon-based model (a general 61×61 model). However, our preliminary simulation showed that, even for the amino acid-based model, these distances are subject to large sampling variances unless the sequence is very long, say, longer than 2,000 amino acids; the sampling variance would be much larger for the codon-based model. Indeed, because there are too many unknown parameters, the distances cannot be estimated accurately. Thus, one should be cautious when applying these methods to analyze amino acid sequence data.

We suggested using ρ (related to the coefficient of variation Cv) as a general measure of rate heterogeneity. However, Waddell et al. (30) questioned its usefulness because they found that, for a given sequence data set, the estimated Cv value differs under different assumptions of rate distribution. This dilemma has now been removed because we have developed a method for estimating ρ (or Cv) that does not require any specific model of rate distribution. Apparently, the discrepancy found by Waddell et al. (30) is caused by sampling errors or the unsuitability of the model.

When the nucleotide frequencies are not stationary, the paralinear and LogDet methods provide concise and elegant distance measures for phylogenetic inference and molecular clock testing. However, how to incorporate the effect of rate heterogeneity among sites into these two distances is a problem that remains to be solved.

This study was supported by National Institutes of Health Grants GM 30998 (to W.H.L.) and GM 20293 (to Masatoshi Nei, Pennsylvania State University).

1. Li, W.H., Wu, C.I. & Luo, C.C. (1985) in Molecular Evolutionary Genetics, ed. MacIntyre, R.J. (Plenum, New York), pp. 1–94.
2. Wu, C.I. & Li, W.H. (1985) Proc. Natl. Acad. Sci. USA 82, 1741–1745.
3. Saitou, N. & Nei, M. (1987) Mol. Biol. Evol. 4, 406–425.
4. Nei, M. (1987) Molecular Evolutionary Genetics (Columbia Univ. Press, New York).
5. Nei, M. (1996) Annu. Rev. Genet. 30, 371–403.
6. Felsenstein, J. (1988) Annu. Rev. Genet. 22, 521–565.
7. Doolittle, R.F., Feng, D.F., Tsang, S., Cho, G. & Little, E. (1996) Science 271, 470–477.
8. Li, W.H. (1997) Molecular Evolution (Sinauer, Sunderland, MA).
9. Gu, X. (1997) Mol. Biol. Evol. 14, 861–866.
10. Lanave, C., Preparata, G., Saccone, C. & Serio, G. (1984) J. Mol. Evol. 20, 86–93.
11. Gu, X. & Li, W.H. (1996) Proc. Natl. Acad. Sci. USA 93, 4671–4676.
12. Tavare, S. (1986) Lect. Math. Life Sci. 17, 57–86.
13. Barry, D. & Hartigan, J.A. (1987) Biometrics 43, 261–276.
14. Rodriguez, F., Oliver, J.F., Marin, A. & Medina, J.R. (1990) J. Theor. Biol. 142, 485–501.
15. Hasegawa, M. & Hashimoto, T. (1993) Nature 361, 23.
16. Sogin, M.L., Hinkle, G. & Leipe, D.D. (1993) Nature 362, 795.
17. Steel, M.A. (1994) Appl. Math. Lett. 7, 19–24.

18. Galtier, N. & Gouy, M. (1996) Proc. Natl. Acad. Sci. USA 92, 11317–11321.
19. Lake, J.A. (1994) Proc. Natl. Acad. Sci. USA 91, 1455–1459.
20. Lockhart, P.J., Steel, M.A., Hendy, M.D. & Penny, D. (1994) Mol. Biol. Evol. 11, 605–612.
21. Hasegawa, M., Kishino, H. & Yano, T. (1985) J. Mol. Evol. 22, 160–174.
22. Tamura, K. & Nei, M. (1993) Mol. Biol. Evol. 10, 512–526.
23. Yang, Z. (1994) J. Mol. Evol. 39, 105–111.
24. Uzzell, T. & Corbin, K.W. (1971) Science 172, 1089–1096.
25. Yang, Z. (1993) Mol. Biol. Evol. 10, 1396–1401.
26. Gu, X., Fu, X.Y. & Li, W.H. (1995) Mol. Biol. Evol. 12, 546–557.
27. Sullivan, J.K., Holsinger, K.E. & Simon, C. (1995) Mol. Biol. Evol. 12, 988–1001.
28. Kelly, C. & Rice, J. (1996) Math. Biosci. 133, 85–109.
29. Gu, X. & Zhang, J. (1997) Mol. Biol. Evol. 14, 1106–1113.
30. Waddell, P.J., Penny, D. & Moore, T. (1997) Mol. Phylogenet. Evol. 8, 33–50.
31. Zharkikh, A. (1994) J. Mol. Evol. 39, 315–329.
32. Gu, X. & Li, W.H. (1996) Mol. Biol. Evol. 13, 1375–1383.
33. Jukes, T.H. & Cantor, C.R. (1969) in Mammalian Protein Metabolism, ed. Munro, H.N. (Academic, New York), pp. 21–123.
34. Kimura, M. (1980) J. Mol. Evol. 16, 111–120.
35. Tajima, F. & Nei, M. (1984) Mol. Biol. Evol. 1, 269–285.
36. Steel, M., Szekely, L. & Hendy, M. (1994) J. Comp. Biol. 1, 153–163.
37. Keilson, J. (1979) Markov Chain Models: Rarity and Exponentiality (Springer, New York).
38. Saccone, C., Lanave, C., Pesole, G. & Preparata, G. (1990) Methods Enzymol. 183, 570–583.
39. Li, W.H. & Gu, X. (1996) Methods Enzymol. 266, 449–459.
40. Miyamoto, M.M. & Fitch, W.M. (1996) Syst. Biol. 45, 568–575.
41. Tourasse, N. & Gouy, M. (1997) Mol. Biol. Evol. 14, 287–298.
42. Yang, Z. & Kumar, S. (1996) Mol. Biol. Evol. 13, 650–659.
43. Fitch, W.M. (1971) Syst. Zool. 20, 406–416.
44. Wakeley, J. (1993) J. Mol. Evol. 37, 613–623.
45. Cavender, J.A. & Felsenstein, J. (1987) J. Classification 4, 57–71.
46. Gu, X. & Li, W.H. (1992) Mol. Phylogenet. Evol. 234, 185–192.
47. Muse, S.V. & Weir, B.S. (1992) Genetics 132, 269–276.
48. Tajima, F. (1993) Genetics 135, 599–607.
49. Li, P. & Bousquet, J. (1992) Mol. Biol. Evol. 9, 1185–1189.
50. Takezaki, N., Rzhetsky, A. & Nei, M. (1995) Mol. Biol. Evol. 12, 823–833.
51. Dayhoff, M.O. (1978) Atlas of Protein Sequence and Structure (Natl. Biomed. Res. Found., Silver Spring, MD), Vol. 5.
52. Schoniger, M. & von Haeseler, A. (1994) Mol. Phylogenet. Evol. 3, 240–247.
53. Goldman, N. & Yang, Z. (1994) Mol. Biol. Evol. 11, 725–736.
54. Muse, S.V. & Gaut, B.S. (1994) Mol. Biol. Evol. 11, 715–724.
55. Rzhetsky, A. (1995) Genetics 141, 771–783.

Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5906–5912, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

Precise sequence complementarity between yeast chromosome ends and two classes of just-subtelomeric sequences

(Saccharomyces cerevisiae/inverted recombination/sequence exchange/telomeres/Y′ and X2 repeats)

ROY J. BRITTEN*

Division of Biology, California Institute of Technology, 101 Dahlia Avenue, Corona del Mar, CA 92625

ABSTRACT The terminal regions (last 20 kb) of Saccharomyces cerevisiae chromosomes universally contain blocks of precise sequence similarity to other chromosome terminal regions. The left and right terminal regions are distinct in the sense that the sequence similarities between them are reverse complements. Direct sequence similarity occurs between the left terminal regions and also between the right terminal regions, but not between any left ends and right ends. With minor exceptions the relationships range from 80% to 100% match within blocks. The regions of similarity are composites of familiar and unfamiliar repeated sequences as well as what could be considered “single-copy” (or better “two-copy”) sequences. All terminal regions were compared with all other chromosomes, forward and reverse complement, and 768 comparisons are diagrammed. It appears there has been an extensive history of sequence exchange or copying between terminal regions. The subtelomeric sequences fall into two classes. Seventeen of the chromosome ends terminate with the Y′ repeat, while 15 end with the 800-nt “X2” repeats just adjacent to the telomerase simple repeats. The just-subterminal repeats are very similar to each other except that chromosome 1 right end is more divergent.

Once the complete Saccharomyces cerevisiae DNA sequence became available (1, 2), it appeared worthwhile to see if an insight into the origin of repeated sequences could be obtained, since all repeated sequences of this yeast strain are available for examination in the complete sequence. The initial stage has been to examine the terminal regions, and that is what is reported here. Naturally the results overlap the many previous studies, but they differ from what has been published by the completeness of the examination of the terminal relationships. There have been extensive examinations of yeast telomeres and of pairing and recombination processes revealing extensive regions of subterminal sequence relationships, but they will not be reviewed here and reference is made to previous reviews (3–8).

There is of course a question as to what can be learned by merely examining sequence similarities, so this work is an experiment, but there are new results of significance. Here the telomeric and subtelomeric sequences are referred to together as the terminal regions, which include the last 20 kb of each chromosome. By custom, numbering starts at the left end. Many left terminal region sequences are the reverse complement of some right terminal region sequences, and no cases occur of significant lengths of precise direct sequence similarity between any parts of left and right terminal regions among all of the chromosomes. However, this observation does not signify that there is a consistent orientation of the arbitrary historical identification of the ends of the yeast chromosomes. The situation is clarified if, for test purposes, the numbering of a chromosome is reversed. After this test change all left end sequences would still be the reverse complement of the right end sequences. Only a few specific relationships would change, while the reverse complementary pattern as a whole would remain unchanged. The reverse complementary relationships between chromosome ends have been clear to some workers (e.g., ref. 9) but I am not aware of a discussion of their significance.

RESULTS

Initial Tests of Terminal Reverse Complementarity. To test the general occurrence of reverse complementary relationship between chromosome terminal regions, the reverse complement of a 5-kb terminal segment of each of the left ends of the 16 chromosomes was compared with all of the chromosomes, using FASTA (10). Reverse complementary regions were observed at the right end of several chromosomes for each of the 16 searches. Always one or more examples extended all the way to the right end of the chromosome. A similar study was carried out for all chromosomes with probes that were the reverse complement of the terminal 2-kb right ends. Also in every case several chromosomes were found with left terminal regions the reverse complement of the right terminal regions, and in every case the sequence similarity of at least one chromosome extended all the way to the left end. The program that was used, FASTA, selected the best-fitting regions, as it is designed to do, and the regions of similarity were often more extensive than exhibited in this initial search. To avoid missing significant similarities the best-scoring chromosomes found by the left end probes were divided into 1-kb-long fragments to form libraries that were searched with long probes that consisted of the reverse complement of the left ends of each of the 16 chromosomes. In this way all of the significant regions of reverse complementary sequence similarity were determined, often broken by internal nonmatching regions. The results are diagrammed in Fig. 1. Fig. 1 shows the right ends of the chromosomes with the matching region of the left end of the matching chromosome identified and the percent similarity (reverse complement) for 1-kb regions listed.
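The preparation of probes and libraries described above can be sketched in Python. The sketch covers only the sequence handling (reverse complementing a terminal segment and cutting a chromosome into 1-kb fragments); the similarity search itself was done with FASTA and is not reproduced here. All function names are hypothetical, and the default lengths simply mirror the 5-kb, 2-kb, and 1-kb values mentioned in the text.

```python
# Translation table for complementing DNA; N is kept as N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def left_end_probe(chromosome, length=5000):
    """Reverse complement of the first `length` nucleotides (left end)."""
    return reverse_complement(chromosome[:length])


def right_end_probe(chromosome, length=2000):
    """Reverse complement of the last `length` nucleotides (right end)."""
    return reverse_complement(chromosome[-length:])


def fragment_library(chromosome, size=1000):
    """Cut a chromosome into consecutive fragments of `size` nucleotides.

    Returns a list of (start, fragment) pairs with 0-based start
    positions; the last fragment may be shorter than `size`.
    """
    return [(start, chromosome[start:start + size])
            for start in range(0, len(chromosome), size)]


if __name__ == "__main__":
    toy = "AACCGGTTAC"  # toy sequence, for illustration only
    print(left_end_probe(toy, 4), fragment_library(toy, 4))
```

Keeping the start position with each fragment makes it possible to map a reported match back to a chromosome coordinate, as is needed for a diagram such as Fig. 1.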

In 16 of 16 cases there are high quality reverse complement sequence similarities of left terminal regions at the right ends of different chromosomes. Often several chromosomes share in this sequence similarity, since there are extensive direct terminal region similarities among the same ends of the yeast chromosomes. While these terminal regions include repeated sequences, they are not entirely composed of them. They include regions consisting of just the few copies resulting from the terminal relationships reported here. The precision of match often reaches 100% in the best-matching regions as shown in Fig. 1. In many cases the central parts of the overlapping regions have the highest precision, and in almost every case one or another of the end parts of the overlap has a lower precision. This suggests that multiple events of sequence exchange are responsible. In every case the telomeric end of the sequence similarity ceases only at the end of the known sequence.

*To whom reprint requests should be addressed. e-mail: [email protected].
© 1998 by The National Academy of Sciences 0027-8424/98/955906-7$2.00/0
PNAS is available online at http://www.pnas.org.
Abbreviation: SGD, Saccharomyces Genome Database.

FIG. 1. Selected subset of right end sequences showing extensive similarity (reverse complement) to each of the 16 left ends. Long segments of the left ends of all 16 chromosomes were converted to reverse complement and compared with libraries of 1-kb segments of particular chromosomes, chosen on the basis of previous comparisons. The left column is the number of the chromosome from which the left end probe was extracted. The second column is the number of the chromosome with which it was compared. The top of the figure shows the position in the right end sequence in kilobases. The first example is for chromosome (chr) 1 left end compared with chr 14 and exhibits a reverse complementary region of just over 3 kb terminating at the right end of chr 14. The double line shows the length of the matches found by FASTA and the % match is shown below. The next example, left end of chr 2 on right end of chr 8, is a case of Y′ repeat similarity and exhibits minor internal deletions or very poorly matching regions. The next example, chr 3 left end on chr 11, does not involve the Y′ repeat and is mostly made up of single-copy (2 copy) sequences. The next example, chr 4 left end (reverse complement) on chr 10, matches for 18.5 kb and requires two lines for display with the leftward part on the upper line, a very extensive reverse complementary region. This figure displays a subset of reverse complementary regions that does not necessarily include the longest and best-matching regions.

FIG. 2. Terminal region similarity to the Y′ repeat. The probe was 8 kb of the right end of chromosome 15. By using FASTA it was compared with a library of the whole yeast genome divided into 1-kb segments. (Upper) All of the direct sequence matches of the 16 chromosomes. All of the significant matches are at the right terminal regions that are displayed. At the top is a scale exhibiting kilobases from the ends. (Lower) Reverse complementary matches, all of which are in the left terminal regions. The symbols indicate the quality of match over the regions identified by FASTA. XXX represents greater than 94% match, while 999 represents from 85% to 94%. The other symbols follow the same pattern down to 666, which represents between 55% and 65% match. All of the longer matches are with the Y′-containing chromosome ends. The short matches at the ends are the X2 repeats to be described in a later section.
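The symbol coding used in the figure can be written as a small lookup. This is a hypothetical sketch: the caption states only the XXX, 999, and 666 classes, so the intermediate classes 888 (75% up to 85%) and 777 (65% up to 75%), the exact treatment of boundary values, and the function name `match_symbol` are assumptions.

```python
def match_symbol(percent_match):
    """Map a percent match to the three-character symbol used in the figures.

    Stated in the caption: XXX (> 94%), 999 (85-94%), 666 (55-65%).
    Assumed here: 888 and 777 fill the gap in 10-point steps, and each
    lower bound is inclusive. Returns None below 55%.
    """
    if percent_match > 94:
        return "XXX"
    if percent_match >= 85:
        return "999"
    if percent_match >= 75:
        return "888"  # assumed intermediate class
    if percent_match >= 65:
        return "777"  # assumed intermediate class
    if percent_match >= 55:
        return "666"
    return None


if __name__ == "__main__":
    # Hypothetical percent matches for consecutive 1-kb segments.
    print([match_symbol(p) for p in (100, 96.5, 90, 80, 70, 60, 40)])
```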

A variety of searches indicate that there are no long direct sequence similarities between regions near (within 20 kb) the left and those near the right end of the same or any other chromosome. However, there are a few short and imprecise direct sequence similarities between the opposite ends of different chromosomes (200–400 nt and 70% identity or less). These appear to be members of short repeated sequence families that are near the termini. Their presence is almost certainly the result of a different set of phenomena from the long and precise sequence similarities. Often there are about 5-kb-long precise sequence similarities, both forward and reverse complement, between the TY elements at locations spread through the yeast chromosomes. The terminal regions include isolated long terminal repeats (LTRs) of these elements but none of the complete elements.

Genes in the Duplicated Terminal Regions. The presence or absence of genes in the regions shown in Fig. 1 was explored by using thedetailed maps and tabular information of the Stanford Saccharomyces Genome Database (SGD: see acknowledgements). No genes were foundthat meet the stiff criterion that the gene must be genetically mapped as well as confirmed by the DNA sequence. However many ORFs,ranging up to more than 5 kb in length, are present in the terminal regions. The 18.5-kb inverted sequence similarity between the left end ofchromosome 4 and the right end of chromosome 10 (Fig. 1) includes two genes identified on the basis of very good sequence similarity. Theseare listed in the SGD as related genes in the respective terminal regions. Only one of the regions listed in Fig. 1 does not have an ORFrecognized and listed in the SGD, but the reverse complement on the matching chromosome in this case does have a listed ORF. The regionsreported below, such as the extensive reverse complementary region between the left end of chromosome 16 and the right end of chromosome15. share genes that are recognized as closely related in the SGD. In strains that carry the SUC gene it is found in this region between the X and Y� repeats (3). In addition the Y� repeats contain conserved ORFs that are expressed in meiosis (3). Recently the duplication of genomic ORFshas been examined (11), and

PRECISE SEQUENCE COMPLEMENTARITY BETWEEN YEAST CHROMOSOME ENDS AND TWO CLASSES OF JUST-SUBTELOMERIC SEQUENCES

5908

Abou

t thi

s PD

F fil

e: T

his

new

dig

ital r

epre

sent

atio

n of

the

orig

inal

wor

k ha

s be

en re

com

pose

d fro

m X

ML

files

cre

ated

from

the

orig

inal

pap

er b

ook,

not

from

the

orig

inal

type

setti

ng fi

les.

Pag

e br

eaks

are

true

to th

e or

igin

al; l

ine

leng

ths,

wor

d br

eaks

, hea

ding

sty

les,

and

oth

er ty

pese

tting

-spe

cific

form

attin

g, h

owev

er, c

anno

t be

reta

ined

, and

som

e ty

pogr

aphi

c er

rors

may

hav

e be

en a

ccid

enta

lly in

serte

d. P

leas

e us

e th

e pr

int v

ersi

on o

f thi

s pu

blic

atio

n as

the

auth

orita

tive

vers

ion

for a

ttrib

utio

n.

Page 70: (NAS Colloquium) Computational Biomolecular Science

some of the complementary regions described here are reported in ref. 11 because they include ORFs.

Y′ Repeat Terminal Patterns. It is a matter of interest how much the known repeated sequences contribute to the terminal sequence similarities. Fig. 2 shows the sequence similarities to a particular Y′ repeat that occurs at the right terminal region of chromosome 15. All of the left end copies are the reverse complement of the right end copies, and all of the copies appear to be terminal in occurrence except that in some cases, such as chromosome 12 right end, there are additional inner copies. The Y′ repeat and other sequences in its region form a part of the similarities observed in this work. In some cases large lengths of Y′ sequence similarity (reverse complement) occur at both ends (chromosomes 5, 8, 12, and 16), but often there is Y′ sequence only at one end. The Y′ repeat occurs at about half of the ends and the expectation, if the chromosome ends that contain Y′ repeats were a random set, is that they would occur at both ends of four chromosomes, as observed. There are also three chromosomes for which both ends lack the Y′ repeats, another observation that suggests randomness. Nine chromosomes include a Y′ repeat at one end but lack it at the other (9).
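The randomness argument above can be checked with a few lines of arithmetic. The sketch below assumes, as the text does, that each of the 32 ends carries a Y′ repeat independently with probability of about one half; the function name and its parameters are illustrative only and are not part of the original analysis.

```python
def expected_end_pattern(n_chromosomes=16, p_end=0.5):
    """Expected number of chromosomes with a repeat at both ends,
    at exactly one end, or at neither end, if ends are independent."""
    both = n_chromosomes * p_end * p_end
    neither = n_chromosomes * (1 - p_end) * (1 - p_end)
    one = n_chromosomes * 2 * p_end * (1 - p_end)
    return {"both": both, "one": one, "neither": neither}


expected = expected_end_pattern()
# Counts quoted in the text: 4 chromosomes with Y' at both ends,
# 9 with Y' at one end only, 3 with Y' at neither end.
observed = {"both": 4, "one": 9, "neither": 3}

for key in ("both", "one", "neither"):
    print(f"{key:8s} expected {expected[key]:.1f}  observed {observed[key]}")
```

Under this simple model the expected counts are 4, 8, and 4, which is close to the observed 4, 9, and 3.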

Terminal Relationships as a Whole. To examine the terminal 20-kb sequence similarities as a whole, a large number of comparisons are required. Each chromosome left end was compared as reverse complement against the full length of all chromosomes. Direct sequence searches were also made with the left and right terminal regions, adding up to 768 comparisons. The results are shown diagrammatically in Fig. 3. To make these comparisons the whole yeast genome was divided into about 12,000 1-kb segments in a library. FASTA was used to compare each of the terminal region 20-kb segments with the library. With this method all occurrences of significant sequence similarity are detected. Fig. 3 shows the regions of similarity to the reverse complement of the left end of each of the 16 chromosomes. The upper left block is the set of 16 similarities of the chromosome 1 left terminal 20-kb region (reverse complement). The next block below is the same for chromosome 2 and below that chromosome 3, etc. Fig. 3 Right gives this information for chromosomes 9–16. These similarities all occur in the right terminal regions of the 16 chromosomes. With the aid of a hand lens you will notice that the majority of these similarities are symbolized XXX and thus are better than 95% matches. Many are literally 100% matches. There are also a significant number of 999s, meaning 85–95% matches. The overall view without a hand lens shows the patterns very well. For example, the second diagram in Fig. 3 Left is for chromosome 2 left end (reverse complement), which contains the Y′ repeat, and thus the pattern shows all of the right ends that also contain the Y′ repeat. Thus it is clear that the left ends of chromosomes 2, 5, 6, 8, 9, 10, 12, 13, 14, and 16 contain the Y′ repeat, and similarly it is present on the right ends of chromosomes 4, 5, 7, 8, 12, 15, and 16, which have a consistent relationship to the other Y′ repeats. There is a lot of variation in these patterns, and the Y′ repeat is merely a prominent part. Chromosome 1 is exceptional and includes the W′ repeat, shown on the first line of Fig. 3 as more than 8 kb of precise reverse complement similarity between the two ends of chromosome 1. In addition there is an extensive region of direct similarity to chromosome 8 right end (not shown). The long and precise direct sequence relationships of the left ends are restricted to the left ends of the other chromosomes, and the extensive direct sequence similarities of the right ends all occur on right ends (not shown). Direct terminal region similarity data will appear on my web page, as there is not space here.
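The preparation steps described above (reverse-complementing a 20-kb terminal probe and cutting the genome into a library of 1-kb segments) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the search itself (done with FASTA in the text) is not reproduced, and the breakdown of the 768 comparisons is one reading that reproduces the stated total.

```python
# Translation table for complementing DNA; case is preserved.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def left_end_probe(chromosome_seq, length=20_000):
    """Hypothetical helper: left terminal region as a reverse-complement probe."""
    return reverse_complement(chromosome_seq[:length])


def segment_library(chromosomes, size=1_000):
    """Cut every chromosome into consecutive segments of `size` bases.

    `chromosomes` maps a name to its sequence; the result is a list of
    (name, start, segment) tuples. The last segment may be shorter.
    """
    library = []
    for name, seq in chromosomes.items():
        for start in range(0, len(seq), size):
            library.append((name, start, seq[start:start + size]))
    return library


# One way to arrive at 768: 16 reverse-complement left-end probes against
# 16 chromosomes, plus 32 direct probes (left and right ends) against 16.
n_comparisons = 16 * 16 + 32 * 16
```

A genome of roughly 12 Mb cut this way yields about 12,000 segments, in line with the library size quoted in the text.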

If a left end has few reverse complementary similarities to the right ends of other chromosomes, it also has few direct similarities to the left ends of other chromosomes, reflecting the presence or absence of the Y′ repeats. There are many precise matches, including one very extensive match of 18.5 kb between chromosome 4 left end and chromosome 10 right end, mentioned earlier. At the very end of Fig. 3 is shown a match of nearly 20 kb between left chromosome 15 (reverse complement) and right chromosome 16. These surely resulted from extensive events of recombination or copying between the ends of inverted chromosome pairs. The extent to which the Y′ repeats are involved in exchanges is not obvious, but it seems likely that their precise sequence similarities are the result of both exchanges between opposite ends of different chromosomes and exchanges between same ends. There are several examples of extensive relationship separate from the Y′ sequences. For example, left ends of chromosomes 9 and 10 are direct copies of each other for 20 kb. A separate examination showed that the copying extends for another 2.1 kb toward the centromere.

All of the similarities in the terminal regions are shown in Fig. 3, and the long and precise terminal sequence relationships are restricted to the last 20 kb at each end shown there. In every case there are also examples of short and imperfect sequence similarity occurring in a variety of locations, representing short repetitive sequences that for the most part are members of unknown families. The number of these similarities to a given chromosome terminal region ranges from 3 to 16 except for long terminal repeats (LTRs) of mobile elements, for example on chromosome 2 and chromosome 15 left ends. In these cases delta elements are present that match about 100 other sequences. In addition, at 14 kb on chromosome 15 left end is a 200-bp-long element that matches about 60 other locations and is unknown to me. These short and imprecise similarities and the LTR sequence similarities are quite distinct categories of relationship from the major long and precise similarities that are involved in the pattern of exclusively complementary relationships between the left and right terminal regions.

Just-Subtelomeric Sequences and the “X2” Repeat. In attempting to find some indication of the function of the reverse complementary relationship between the ends, it seemed that the very terminal sequence patterns might contain clues. Terminal short sequences (2 kb) of all chromosomes were multiply aligned with CLUSTALW, with the left end sequences present as reverse complement. The resulting alignments are quite good almost to the end. The interesting result is that the just-subtelomeric sequences fall into two classes with very good sequence similarity within the classes but none between the classes. The first class is made up of the telomeric end of the Y′ repeat. It occurs as the just-subtelomeric sequence of all of the Y′-containing chromosome ends as listed above. The second class includes the X repeat (3) but is more extensive, and thus it has been named the X2 repeat to avoid confusion with previous descriptions of X repeats. It is present on all of the chromosome ends that do not contain the Y′ repeat, and Fig. 4 shows the consensus sequence for the 11 best-matching members. Most members agree with about 90% accuracy with the consensus, but chromosome 1 right end matches only 62%. There are other occurrences of the X2 sequence, which may be important. It occurs centromeric of the Y′ repeat on all of the Y′-containing chromosome ends. Thus the X2 sequence occurs on all chromosome ends, although the example 6 kb in from the left end of chromosome 5 is quite short (125 nt). It is possible that if the X2 repeat has a function it could be carried out from either location. There are small sequences between X and Y′ known as STR sequences, some combination of which is found at most ends. These probably form a part of the X2 sequence. Most of the X2 repeats that occur centromeric of the Y′ repeats are well conserved and quite similar to those in just-subtelomeric locations. The conservation of the sequences of 30 of the X2 repeats cannot easily be explained by recombination because of their different locations.

The alignments of the just-subtelomeric sequences are of sufficient precision that they can be used to decide if chromosome sequences are complete to the ends. The analysis shows that half a dozen ends are incomplete, missing nearly 200 nt from the just-subtelomeric sequences. Nevertheless, for every end enough of the just-subterminal region is present to permit clear identification of the X2 repeat or the Y′ repeat. Most of the missing sequence ends are from Y′-containing ends, apparently due to difficulties of cloning. An alternative could be that some other mechanism besides telomerase stabilizes the ends of the chromosomes and the listed sequences are correct, but for the present it seems safer to assume that the sequences are incomplete.

FIG. 3. Partial results of 768 comparisons of 20-kb probes with the complete yeast genome. Twenty-kilobase terminal segments of all 32 chromosome ends are compared as reverse complement with a library of the whole yeast genome divided into 1-kb segments, using FASTA. All of the regions where sequence similarity was recognized are indicated by numbers on the diagrams of the 20-kb terminal regions, each symbol representing 200 bp. The numbers describe the precision of match with the meaning mentioned in the legend of Fig. 2. In each block the 16 chromosomes are arranged in order from top to bottom, numbered at the left. At the top of each block are listed the distances from the end of the chromosomes in kb. The comparisons of left end reverse complements are shown and the other comparisons will be on my web page (www.cco.caltech.edu/~rbritten). This diagram describes the similarities to 20-kb reverse complement probes with a block for each probe, the upper left exhibiting the relationships of a probe from the left end of chromosome 1. The next block below is for a probe from chromosome 2 left end, and this probe contains the Y′ repeat, so that all of the chromosomes with a Y′ at their right ends show large regions of sequence similarity in this block. This relationship establishes the major visible pattern of Fig. 3, but there are many other features. Where a probe does not contain the Y′ repeat, as for 1 or 3, the pattern is simpler and includes the just-subterminal similarities where the right end sequence does not contain the Y′ repeat. Where the right end sequence does contain the Y′ repeat, a small block of similarity appears at about 6 kb in from the end, where the copy of the X2 repeat is present in these chromosomes.

FIG. 4. The X2 repeat consensus. The 11 chromosome ends that include very good copies of the just-subtelomeric X2 repeat were aligned by using CLUSTALW, allowing for the fact that left ends are reverse complements of right ends. A consensus sequence was made showing all positions in uppercase where six or more matched. The X2 consensus begins where there is general agreement between all examples and ends early in the telomerase simple sequence. In this 800-nt region all the 11 sequences match each other with high quality, between 86% and 93%. They are in complete agreement for 63% of the positions. The conserved core of the X repeat (3) begins at position 27 and agrees well to about position 400. The sequence has structure but is not recognizably a tandem repeat of a simpler sequence.
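The consensus rule in the legend (uppercase where six or more of the 11 aligned sequences agree) can be sketched in a few lines. This is a minimal illustration, not the procedure actually used: the function name is hypothetical, and showing less well supported positions in lowercase and all-gap columns as "-" are assumptions, since the legend only specifies the uppercase rule.

```python
from collections import Counter


def consensus(aligned, min_agree=6, gap="-"):
    """Column-wise consensus of equal-length aligned sequences.

    A column is written in uppercase when at least `min_agree` sequences
    share the most common base, otherwise in lowercase (assumed convention).
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    out = []
    for column in zip(*aligned):
        counts = Counter(base.upper() for base in column if base != gap)
        if not counts:
            out.append(gap)  # every sequence has a gap here
            continue
        base, n = counts.most_common(1)[0]
        out.append(base if n >= min_agree else base.lower())
    return "".join(out)


# Toy alignment of three sequences with a threshold of three.
toy = ["ACGT", "ACGA", "ACCA"]
toy_consensus = consensus(toy, min_agree=3)
```

For the 11 sequences of Fig. 4 the default threshold of six corresponds to a simple majority.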

DISCUSSION

The terminal regions (the last 20 kb) of the yeast chromosomes stand out by sharing blocks of sequence with each other. They share only with each other and not with other regions of the chromosomes, except for short, less precise repeats, which apparently have an origin different from the longer high-precision sequence similarities. The consistent reverse complementary relationship of the left and right terminal regions suggests that there has been a history of exchange events occurring between inverted pairs of chromosomes. Consider the third block down in Fig. 3 Left, diagramming the similarities to chromosome 3 left end. There is an extensive block of nearly 7 kb of similarity to chromosome 11 right end that is the product of sequence exchange or copying between inverted pairs of chromosomes. However, the similarities to all the other chromosomes are due to the fact that the left end of chromosome 3 contains the X2 repeat. There are the just-subtelomeric similarities to the right ends of chromosomes 1, 2, 3, 6, 9, 10, 11, 13, and 14. Also there are similarities to the X2 repeats at the inner or centromeric ends of the Y′ repeats on the other chromosomes. These X2 repeat similarities must be considered to be due to selection on these sequences rather than exchange. That is because they occur in both the inner and terminal locations, and it is difficult to see how exchange could be limited to these short X2 repeats.

Nevertheless, all of these relationships are reverse complementary between the left and right ends. The pattern observed for chromosome 3 is frequent and is shown in Fig. 3 for all cases where the probe left end sequence does not include a Y′ repeat. What explains the reverse complementary relationship between left and right ends where massive exchange between opposite ends does not seem to have occurred recently? It seems likely that the telomerase simple sequences have a required orientation with respect to the ends and thus must be reverse complements of each other. It is assumed that DNA synthesis proceeds outward toward the telomeres. While this is consistent with the reverse complementary relationships of the whole terminal regions, it does not supply a reason for them except for the telomerase simple sequences. There is probably a functional reason for the orientation of the X2 sequences, having to do with specific protein bindings that are part of the complex control system that stabilizes the chromosome ends.

Candidate models for the exchange process are recombination by breakage and rejoining, sequence conversion, or some unknown mechanism for sequence copying and insertion. The insertions and deletions and mismatches within the similar regions shown in Fig. 1 suggest that in many cases multiple events have occurred and that deletions and base substitutions leading to mismatch have occurred subsequent to the events that created precise inverted sequence matches. It is likely that the process has been regenerative, because once inverted matching sequence regions are present between opposite ends the chromosomes could be aligned by the mechanisms that permit normal recombination to occur between matching pairs of diploid chromosomes. The existence of many near perfect (95–100%) matching regions in Figs. 1, 2, and 3 suggests that the typical event has been quite recent, and only a little base substitution has occurred since. Thus the events are quite frequent on an evolutionary time scale. It is interesting that in many cases the terminal simple sequence created by telomerase is included in the sequence similarity. In every alignment for which both sequences extend into the telomerase simple sequences the alignment extends as far as the sequences go and remains good but not perfect. It may be due to patterns created


by the telomerase, but exchange has to be considered as partially responsible. These exchanges very likely influence the stability of the terminal regions of the yeast chromosomes (4).

It is clear that the known long repeats of the terminal regions (e.g., Y′, X′, and W′) are an intimate part of the process of terminal region sequence exchange that is probably responsible for the patterns shown in Fig. 3. It seems likely that they originate in this process. That leaves open the question as to whether all of these duplications and multiplications are simply the inevitable result of the process of terminal region sequence exchange. Whether they have useful functions is yet uncertain, though the X2 repeat (Fig. 4) is well conserved and present on every chromosome end. There are no examples of precise and long direct sequence similarity between terminal regions on the opposite ends of chromosomes. This finding suggests that the orientation of these sequences or part of them is important to yeast survival. These sequences point outwards (or inwards depending on point of view) from all yeast chromosomes. They are terminated by the simple sequences generated by telomerase. The telomerase sequences are clearly significant to chromosome stability and replication, but there is good evidence that they carry out other functions. Changing their length affects survival (4).

The central issue is, of course, the evolutionary role and potential function of the reverse complementary relationship of the terminal regions, but little can yet be said. Finally, it seems very unlikely that this pattern of asymmetry is restricted to yeast chromosomes. As the human genome project advances so that sufficient lengths of terminal regions are available, it will be interesting to see how well the reverse complementary relationship holds in our own genome. The prediction is that it will be very similar to the yeast situation, with allowance for different telomerase-synthesized sequences and lengths and distinct sets of repeats in the subterminal regions.

The yeast chromosomal sequences were obtained from the Stanford SGD (http://genome-www.stanford.edu/Saccharomyces/). Thanks to Ed Louis for preprints. Johnny Williams prepared useful software in the Perl language. This work was supported by National Institutes of Health grants.

1. Goffeau, A., Aert, R., Agostini-Carbone, M.L., Ahmed, A., Aigle, M., Alberghina, L., Albermann, K., Albers, M., Aldea, M., Alexandraki, D., et al. (1997) Nature (London) Suppl. 387, 5–105.
2. Pryde, F.E., Gorham, H.C. & Louis, E.J. (1997) Curr. Opin. Genet. Dev. 7, 822–828.
3. Louis, E.J. (1995) Yeast 11, 1553–1573.
4. Zakian, V.A. (1996) Annu. Rev. Genet. 30, 141–172.
5. Kramer, K.M. & Haber, J.E. (1993) Genes Dev. 7, 2345–2356.
6. Wellinger, R.J., Ethier, K., Labrecque, P. & Zakian, V.A. (1996) Cell 85, 423–433.
7. Louis, E.J., Naumova, E.S., Lee, A., Naumov, G. & Haber, J.E. (1994) Genetics 136, 789–802.
8. Flint, J., Bates, G.P., Clark, K., Dorman, A., Willingham, D., Roe, B.A., Micklem, G., Higgs, D.R. & Louis, E.J. (1997) Hum. Mol. Genet. 6, 1305–1314.
9. Louis, E.J. & Borts, R.H. (1995) Genetics 139, 125–136.
10. Pearson, W.R. & Lipman, D.J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444–2448.
11. Coissac, E., Maillier, E. & Netter, P. (1997) Mol. Biol. Evol. 14, 1062–1074.


Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5913–5920, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

A unified statistical framework for sequence comparison and structure comparison

(sequence analysis/structure analysis/fold family/database statistics/protein evolution)

MICHAEL LEVITT*† AND MARK GERSTEIN‡

*Department of Structural Biology, Stanford University, Stanford, CA 94305; and ‡Molecular Biophysics and Biochemistry Department, P.O. Box 208114, Yale University, New Haven, CT 06520–8114

ABSTRACT We present an approach for assessing the significance of sequence and structure comparisons by using nearly identical statistical formalisms for both sequence and structure. Doing so involves an all-vs.-all comparison of protein domains [taken here from the Structural Classification of Proteins (scop) database] and then fitting a simple distribution function to the observed scores. By using this distribution, we can attach a statistical significance to each comparison score in the form of a P value, the probability that a better score would occur by chance. As expected, we find that the scores for sequence matching follow an extreme-value distribution. The agreement, moreover, between the P values that we derive from this distribution and those reported by standard programs (e.g., BLAST and FASTA) validates our approach. Structure comparison scores also follow an extreme-value distribution when the statistics are expressed in terms of a structural alignment score (essentially the sum of reciprocated distances between aligned atoms minus gap penalties). We find that the traditional metric of structural similarity, the rms deviation in atom positions after fitting aligned atoms, follows a different distribution of scores and does not perform as well as the structural alignment score. Comparison of the sequence and structure statistics for pairs of proteins known to be related distantly shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that there are very few pairs with significant similarity in terms of sequence but not structure, whereas many pairs have significant similarity in terms of structure but not sequence.

Comparison is a most fundamental operation in biology. Measuring the similarities between “things” enables us to group them in families, cluster them in trees, and infer common ancestors and an evolutionary progression. Biological comparisons can take place at many levels, from that of whole organisms to that of individual molecules. We are concerned here with comparison on the latter level, specifically, with comparisons of individual protein sequences and structures. (For an example of systematic comparison applied to whole organisms, see refs. 1 and 2.)

Our overall aim is to describe these two types of comparisons in a self-consistent, unified framework. For sequence or structure comparison, each act of comparing one “entity” to another (that is, either comparing two sequences or two structures) involves two steps. First, the two objects are aligned optimally through the introduction of gaps in such a way as to maximize their residue-by-residue similarity. This operation generates some form of total similarity score for the number of residues matched: traditionally, a percent identity for sequences or an rms for structures, although we will use other measures. Second, one has to assess the significance of this score in the context of what is known about the proteins currently in the database.

In earlier papers, Gerstein and Levitt (3, 30) extended the work of Subbiah et al. (4) and Laurents et al. (5) and described an approach for structural alignment in an analogous fashion to the traditional approach for sequence alignment (6–9). Like sequence alignment, this method involves applying dynamic programming to a matrix of similarities between individual residues to optimize their overall correspondence through the introduction of gaps.

In this paper, we tackle the second of the two steps in protein comparison: assessing significance. We developed a simple empirical approach for calculating the significance of an alignment score based on doing an all-vs.-all comparison of the database and then curve fitting to the distribution of scores of true negatives. This allows us to express the significance of a given alignment score in terms of a P value, which is the chance that an alignment of two randomly selected proteins would obtain this score. We applied our approach consistently to both sequences and structures. For sequences, we could compare our fit-based P values with the differently derived statistical score from commonly used programs such as BLAST and FASTA (10–13). The agreement we found validated our approach. For structure alignment, we followed a parallel route to derive an expression for the P value of a given alignment in terms of the structural alignment score.

Our work followed on much that recently has been done assessing the significance of sequence and structure comparison. One of the major developments in the past few years has been the implementation of probabilistic scoring schemes (13–16). These give the significance of a match in terms of a P value rather than an absolute, “raw” score (such as percent identity). This places scores from very different programs in a common framework and provides an obvious way to set a significance cutoff (that is, at P ≤ 0.0001 or 0.01%). P values were first used in the BLAST family of programs, where they are derived from an analytic model for the chance of an arbitrary ungapped alignment (10, 17). P values subsequently have been implemented in other programs, such as FASTA and gapped BLAST, by using a somewhat different formalism (13, 18, 19).

†To whom reprint requests should be addressed. e-mail: michael.levitt@stanford.edu.
© 1998 by The National Academy of Sciences 0027–8424/98/955913–8$2.00/0
PNAS is available online at http://www.pnas.org.
Abbreviation: scop, Structural Classification of Proteins.


There are currently many methods for structural alignment (20–31). Some of these are associated with probabilistic scoring schemes. In particular, one method (VAST) computes a P value for an alignment based on measuring how many secondary structure elements are aligned as compared with the chance of aligning this many elements randomly (28). Another method (27, 32) expresses the significance of an alignment in terms of the number of standard deviations it scores above the mean alignment score in an all-vs.-all comparison (i.e., a Z-score).

Data Set Used for Testing. One of the most important aspects of our analysis is that we carefully tested it against the known structural relationships. This testing allowed us to decide unambiguously whether a given comparison resulted in a true- or false-positive and to decide objectively between different statistical schemes. In particular, structures were taken from the Protein Data Bank (33, 34), and definitions of domains, structural classes, and structural similarities were taken from the Structural Classification of Proteins (scop) database (version 1.32; refs. 35–37). The creators of scop have clustered the domains in the Protein Data Bank on the basis of sequence identity (38, 39). At a sequence identity level of 40%, this clustering resulted in 941 unique sequences corresponding to the known structural domains. These 941 sequences were what we used as test data for both the sequence and structure comparisons. They contained 390 different superfamilies and 281 different folds. Because they had a considerably closer and more certain relationship than fold pairs, we concentrated here on superfamily pairs. These 2,107 nontrivial, pairwise relationships between the domains formed our set of true-positives.
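The bookkeeping behind such a test set can be sketched briefly. The code below is an illustration only: the function names and the toy labels are hypothetical, and the real superfamily assignments come from scop. It counts ordered all-vs.-all pairs (for 941 sequences this reproduces the 884,540 pairs quoted in the legend of Fig. 1) and unordered pairs that share a superfamily label, which is how a set of true-positives could be tallied.

```python
from collections import Counter


def ordered_pairs(n):
    """Number of ordered pairs (A-B and B-A both counted), excluding self-pairs."""
    return n * (n - 1)


def same_group_pairs(labels):
    """Number of unordered pairs of items that share a label."""
    return sum(c * (c - 1) // 2 for c in Counter(labels).values())


all_vs_all = ordered_pairs(941)

# Toy superfamily labels for six domains (hypothetical).
toy_labels = ["sf1", "sf1", "sf1", "sf2", "sf2", "sf3"]
toy_true_positives = same_group_pairs(toy_labels)
```

In the toy example the three "sf1" domains contribute three pairs and the two "sf2" domains one pair, and the singleton "sf3" contributes none.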

Sequence Comparison Statistics. Sequence matching was done with standard approaches. In particular, we used the SSEARCH implementation of the Smith-Waterman algorithm (7) [from the FASTA package, version 3 (12, 40); the URL is ftp://ftp.virginia.edu/pub/fasta], with a gap-opening penalty of –12, a gap-extension penalty of –2, and the BLOSUM50 substitution matrix [which has a maximal match score of 13 (for C to C) and an average match score of –0.36].
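For readers unfamiliar with the algorithm, a minimal sketch of Smith-Waterman local alignment scoring with affine gap penalties follows. It is not SSEARCH: the scoring function here is a toy match/mismatch scheme standing in for BLOSUM50, and the convention that a gap of length k costs `gap_open + k * gap_extend` is an assumption made for the illustration. Only the score is computed, not the alignment itself.

```python
def smith_waterman_affine(a, b, score, gap_open=12, gap_extend=2):
    """Best local alignment score of sequences a and b (Gotoh-style recurrences).

    `score(x, y)` returns the substitution score for residues x and y.
    A gap of length k is charged gap_open + k * gap_extend (assumed convention).
    """
    neg_inf = float("-inf")
    m = len(b)
    h_prev = [0] * (m + 1)      # best scores for the previous row
    f = [neg_inf] * (m + 1)     # best score ending in a gap in b, per column
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (m + 1)
        e = neg_inf             # best score ending in a gap in a, current row
        for j in range(1, m + 1):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open - gap_extend)
            f[j] = max(f[j] - gap_extend, h_prev[j] - gap_open - gap_extend)
            diag = h_prev[j - 1] + score(a[i - 1], b[j - 1])
            h_cur[j] = max(0, diag, e, f[j])
            best = max(best, h_cur[j])
        h_prev = h_cur
    return best


def toy_score(x, y):
    """Toy substitution scores (not BLOSUM50): +5 match, -4 mismatch."""
    return 5 if x == y else -4


left, right = "ACGTACGTAC", "GGTTGGTTGG"
identical = smith_waterman_affine("ACGT", "ACGT", toy_score)
gapped = smith_waterman_affine(left + right, left + "CCC" + right, toy_score)
```

In the second example the best alignment matches all 20 shared residues and pays for one gap of length 3, giving 20 × 5 − (12 + 3 × 2) = 82 under the assumed gap convention.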

A probability-density function for sequence-comparison scores. Each pairwise sequence comparison was best quantified by three numbers, $S_{\mathrm{seq}}$, $n$, and $m$, where $S_{\mathrm{seq}}$ is the raw sequence alignment score and $n$ and $m$ are the lengths of the two sequences compared. Comparing all possible pairs of sequences allowed us to calculate an observed probability density, $\rho^{o}_{\mathrm{seq}}$, for the chance of finding a pair of sequences with particular values for $S_{\mathrm{seq}}$ and $\ln(nm)$. Fig. 1A shows the density for pairs between all sequences. This includes the scores for

FIG. 1. A probability-density distribution for sequence comparison scores, contoured against $S_{\mathrm{seq}}$, the sequence alignment score (along the horizontal axis), and $\ln(nm)$, where $n$ and $m$ are the lengths of the pair of sequences (along the vertical axis). This density is related closely to the raw data (via normalization) obtained by counting the number of pairs with particular $S_{\mathrm{seq}}$ and $\ln(nm)$ values. Because of the wide range of density values, contours of the logarithm of the density, $\log \rho^{o}_{\mathrm{seq}}$, are drawn with an interval of 1 (a full order of magnitude). When contouring the logarithm of a density function, special attention must be paid to the zero values. Here, a zero value is set to 0.001, which effectively lifts the entire surface by 3 log units. The data then are smoothed by averaging with a Gaussian function [$\exp(-s^{2}/(\Delta S_{\mathrm{seq}}/3)^{2})$] over a window 14 units wide along the $S_{\mathrm{seq}}$ axis. This smoothing together with the treatment of zeros serves to emphasize the smallest observed counts (values of 1) by surrounding them with three contour levels. (A) Data from all 884,540 pairs between any one of the 941 sequences and any other sequence (pairs A-B and B-A are both included). The significant sequence matches are seen as the isolated spots at high values of the score $S_{\mathrm{seq}}$. (B) Data from 352,168 pairs, including only those pairs of sequences in different scop classes. We also exclude pairs between an all-α or all-β domain and an α+β domain, as well as sequences that are not in one of the five main scop classes: α, β, α/β, α+β, and α+β (multidomain). This exclusion is done to ensure that no significant matches will be found, which indeed is seen in the figure by the absence of any outlying spots at high score values. Thus, the density in B is free of any significant matches and shows the underlying density distribution expected for comparison of unrelated sequences.

A UNIFIED STATISTICAL FRAMEWORK FOR SEQUENCE COMPARISON AND STRUCTURE COMPARISON 5914


~300 sequence pairs that are related closely, which clearly show up as “spots” on the right side of the plot. These high-scoring “true-positives” are removed in Fig. 1B, which shows the density for just the pairs in different structural classes (42), i.e., the pairs that definitely are unrelated. This is the density distribution that we aim to fit.

Fig. 2A shows the density distribution as a function of Sseq for sections at constant ln(nm). The clear linear relationship between log ρ°seq and Sseq at high values of Sseq is indicative of an extreme-value distribution, ρcseq(Z)=exp(–Z–exp(–Z)).

The variable “Z” was defined in terms of Sseq and ln(nm) by using the “Z-score-like” expression Z=(Sseq–µseq)/σseq, where µseq=a ln(nm)+b and σseq=a are the most likely sequence score and width parameter for the distribution. The two adjustable parameters a and b were obtained by fitting the calculated density ρcseq(Z) to the observed density ρ°seq(Z) for all values of Sseq and ln(nm). Substituting for µseq and σseq in the expression for Z above gave Z=(Sseq–a ln(nm)–b)/a=Sseq/a–ln(nm)–b/a.

To derive specific values for the a and b parameters, we fit the above formulas to the observed density distribution obtained by comparing pairs in different scop classes, getting a=5.84 and b=–26.3. The fit was done by least-squares optimization by using the simplex minimizer in MATLAB (MathWorks, Natick, MA). It has a residual of 0.084, which was calculated by using the standard relation r=∑ wi(Oi–Ci)2/∑ wi(Oi)2, where i indexes “bins” with particular Sseq and ln(nm) values, Oi=log ρ°seq is the observed density in a bin, Ci=log ρcseq is the calculated density in a bin, wi=1/Ni is a weighting factor, Ni is the number of sequence pairs in a bin, and the summation is over all bins, i, with ln(nm) between 5.9 and 13.5.
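The weighted residual used to judge the fit can be sketched in a few lines of Python. This is only an illustration of the relation quoted above, not the original MATLAB code; the function name and the list-based inputs are assumptions made for the example.

```python
def weighted_residual(observed, calculated, weights):
    """Weighted residual r = sum(w*(O-C)^2) / sum(w*O^2) over bins.

    observed, calculated: per-bin log densities (O_i and C_i in the text).
    weights: per-bin weights w_i (1/N_i for sequences, 1 for structures).
    All three arguments are plain sequences of equal length (an assumption
    of this sketch).
    """
    numerator = sum(w * (o - c) ** 2
                    for o, c, w in zip(observed, calculated, weights))
    denominator = sum(w * o ** 2 for o, w in zip(observed, weights))
    return numerator / denominator


# Hypothetical two-bin example: only the first bin is mis-predicted.
example = weighted_residual([2.0, 4.0], [1.0, 4.0], [1.0, 1.0])
```

A residual of zero corresponds to a calculated density that reproduces every bin exactly.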

A cumulative sequence distribution function, giving the P value. To estimate the statistical significance of a particular comparison in terms of particular Sseq, n, and m values, we needed the cumulative distribution function Pseq(z>Z), which is defined as the probability that matching any two random sequences will give a z value greater than or equal to Z. This is just the integral of ρcseq(z)=exp(–z–exp(–z))=exp(–z) exp(–exp(–z)) from z=Z to z=∞, so that Pseq(z>Z)=1–exp(–exp(–Z)). Writing Z in terms of Sseq, n, and m gives

Pseq(s>Sseq)=1–exp(–exp(–Sseq/a+ln(nm)+b/a)),

where the parameters a and b are given above.

FIG. 2. Cross-sections of the sequence and structure density distribution show they are both extreme-value distributions and that the calculated distribution fits the observed distribution well. (A) Plots of the logarithm of the observed, log ρ°seq, and calculated, log ρcseq, sequence pair densities against the sequence match score Sseq; log ρ°seq is taken from the data for pairs in different classes (Fig. 1B). Each panel shows the variation of the density with Sseq for a particular value of ln(nm), where nm is the product of the lengths of the sequences compared; this value is indicated by assuming n=m and showing the value of n. The observed density is clearly an extreme-value distribution with a linear fall-off of log ρ°seq with Sseq. The calculated distribution obtained with a two-parameter fit (dashed line, see text) is a good fit for all values of n [or ln(nm)]. (B) Plots of the logarithm of the observed, log ρ°str, and calculated, log ρcstr, structure pair densities against the structure match score Sstr; log ρ°str is taken from the data for pairs in different classes (Fig. 4B). Each panel shows the variation of the density with Sstr for a particular value of the number of aligned residues, N. The calculated distribution obtained with a five-parameter fit (dashed line, see text) is a good fit for all values of N.

Relation to BLAST P value. For sequence comparison without gaps, Karlin and Altschul (10, 11) derived the following cumulative distribution function: PK&A(S>Sseq)=1–exp(–exp(–λ(Sseq–ln(Kmn)/λ)))=1–exp(–exp(–λSseq+ln(Kmn))), where λ and K are calculated analytically based on the sequence composition and amino acid scoring matrix. Comparison of their analytical form with our P value expression shows that λ=1/a and K=exp(b/a). Substituting the specific values for a and b that we calculated from the fit, we found that λ=0.171 and K=0.011. For the particular database sequences and amino acid scoring matrix used here, the values for λ calculated by Karlin and Altschul’s formula ranged from 0.217 to 0.259, all somewhat larger than our value for λ.
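As a worked illustration, the closed-form sequence P value and its mapping onto the Karlin-Altschul parameters can be written as a short Python sketch. It assumes the fitted values a=5.84 and b=–26.3 quoted in the text; the function names are hypothetical and do not correspond to any BLAST or FASTA interface.

```python
import math

A_SEQ = 5.84    # fitted parameter a from the text
B_SEQ = -26.3   # fitted parameter b from the text


def z_seq(score, n, m, a=A_SEQ, b=B_SEQ):
    """Z = (Sseq - a*ln(nm) - b) / a for sequences of lengths n and m."""
    return (score - a * math.log(n * m) - b) / a


def p_seq(score, n, m, a=A_SEQ, b=B_SEQ):
    """Pseq = 1 - exp(-exp(-Z)), the extreme-value tail probability.

    -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny,
    which is the regime of highly significant scores.
    """
    return -math.expm1(-math.exp(-z_seq(score, n, m, a, b)))


def karlin_altschul_equivalents(a=A_SEQ, b=B_SEQ):
    """Return (lambda, K) implied by the fit: lambda = 1/a, K = exp(b/a)."""
    return 1.0 / a, math.exp(b / a)


lam, k = karlin_altschul_equivalents()
```

With the rounded parameters above, `lam` and `k` evaluate to roughly 0.171 and 0.011, in line with the values reported in the text.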

Relation to FASTA E value. In the FASTA sequence comparison programs (12, 13, 18), the significance of a given alignment score Sfa is estimated by fitting an extreme-value distribution to scores resulting from comparison of a given query sequence to each sequence in the database. The distribution is recomputed for each new query so that, unlike our approach, each query sequence is associated with a different distribution function. This type of association has the advantage of allowing for any peculiarities of the query sequence (e.g., composition bias), but it also means that one cannot estimate the significance of a single pairwise comparison of two sequences.

The value used by FASTA in judging the significance of a sequence similarity is known as the expectation value or E value (here Efa). The P value, defined above, gives the statistical significance of a single comparison whereas the E value is an estimate of the expected number of false-positives (dissimilar matches with a significant score) for a search of the entire database. With Ndb entries in the database, the E value Eseq is calculated from our Pseq(s>Sseq) as Eseq=Ndb Pseq. The E values we obtained were very similar to those found by FASTA over a very wide range of values (Fig. 3). When one considers that our closed-form Eseq depends on only two parameters for all pairs whereas Efa is optimized separately for each query sequence (941×2=1,882 parameters in all), this agreement is astonishing.

FIG. 3. The statistical significance derived here is shown to be similar to that derived in a completely different way by the sequence comparison program SSEARCH from the FASTA package (13). We plotted the expected number of errors per search of the database obtained by Pearson’s method, log(Efa), against the same value calculated here, log(Eseq) (which is a function of the sequence match score Sseq and the lengths of the two sequences). To be more specific, Efa is the E value output by the FASTA SSEARCH program whereas Eseq is calculated as 940Pseq(s>Sseq) for score Sseq. The accuracy of our simple two-parameter fit is confirmed by the fact that most pairs of log(Efa) and log(Eseq) values are perfectly correlated, lying along the line log(Efa)=log(Eseq) over the entire range.

Measuring coverage vs. error rate to compare different formalisms for significance-statistics. We have presented two forms of E value statistics for sequence comparison: our method, Eseq, which is based on fitting a two-parameter model to the observed distribution of alignment scores; and the FASTA method, Efa, which is based on fitting different distributions for each query. Now we naturally are led to ask whether there is an objective way to decide which formalism performs best on some representative test data.

The seminal work of Brenner et al. (39) and Brenner (43) provides a framework for such an assessment by using the known true-positives in the scop database and a coverage-vs.-error plot. To compare any two significance-statistics formalisms, we proceeded as follows for each:

(i) For each of the pairs in the all-vs.-all comparison (941×940 pairs), we determined an E value and noted whether the pair was a true-positive or true-negative (for true-positives, both sequences must belong to protein domains with the same fold in the scop classification). (ii) We sorted the pairs by increasing E value. (iii) We counted down the list from best to worst until the number of false-positives was 1% of the total number of database entries (here, this was 9 false-positives, which is ~1% of 941). (iv) We got the threshold E value at this point, which ideally should be close to 0.01, so as to correspond to the 1% error rate per query. (v) Finally, we got the number of entries that were more significant than the threshold E value; this number defined the coverage, which should be as large as possible.
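The five steps above can be sketched as a small Python routine. This is a minimal illustration under stated assumptions: each pair is represented as a hypothetical `(e_value, is_true_positive)` tuple, coverage is counted as the number of true-positives ranked ahead of the threshold, and ties in E value are not treated specially.

```python
def allowed_false_positives(n_db, error_rate=0.01):
    """Number of false-positives tolerated: 1% of the database size by default."""
    return max(1, int(error_rate * n_db))


def coverage_at_error_rate(pairs, n_db, error_rate=0.01):
    """Return (threshold_e_value, coverage) for a list of scored pairs.

    pairs: iterable of (e_value, is_true_positive) tuples (assumed layout).
    The list is sorted by increasing E value and walked from best to worst
    until the allowed number of false-positives has been seen.
    """
    limit = allowed_false_positives(n_db, error_rate)
    coverage = 0
    false_seen = 0
    for e_value, is_true_positive in sorted(pairs, key=lambda pair: pair[0]):
        if is_true_positive:
            coverage += 1
            continue
        false_seen += 1
        if false_seen == limit:
            return e_value, coverage
    # Fewer false-positives than the limit: no threshold was reached.
    return None, coverage


# Hypothetical toy data for a 200-entry database (2 false-positives allowed).
toy_pairs = [(1e-10, True), (1e-8, True), (1e-3, False),
             (1e-2, True), (0.05, False), (0.1, True)]
threshold, coverage = coverage_at_error_rate(toy_pairs, n_db=200)
```

For the 941-entry database used here, `allowed_false_positives(941)` gives the 9 false-positives mentioned in step (iii).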

Here, we compared the coverage and error rate of our sequence score statistics with those of FASTA (Eseq vs. Efa). At the threshold E value, our sequence statistics had log Eseq=–1.98 and a coverage of 328, and the FASTA statistics had a log Efa of –1.68 and a coverage of 379. The FASTA statistics had better coverage, but our statistics had an almost perfect threshold value, which should be –2 for 1% error rate.

Structure Comparison Statistics. The procedure we used for pairwise structural alignment is described in detail in Gerstein and Levitt (3, 30) and is summarized only briefly here. Our core method was based on iterative application of dynamic programming. As such, it was a simple application of the Needleman-Wunsch sequence alignment (6). It originally was derived from the ALIGN program of Cohen (21, 31), with many subsequent refinements. One starts with two structures in an arbitrary orientation. Then one computes all pairwise distances between every atom in the first structure and every atom in the second, which results in an interprotein distance matrix in which each entry, dij, corresponds to the distance between residue i in the first structure and residue j in the second (interresidue distances usually are expressed between α-carbons). This distance matrix, dij, can be converted into a similarity matrix, Sij, through the relationship Sij=M/(1+(dij/d0)2), where M=20 and d0=5 Å.
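The conversion from interprotein distances to similarity scores can be illustrated with a short Python sketch. It assumes each structure is given as a list of α-carbon coordinates (x, y, z) in Å; the function name and data layout are hypothetical and are not taken from the original alignment program.

```python
import math

M = 20.0   # maximum similarity score, from the text
D0 = 5.0   # distance scale in angstroms, from the text


def similarity_matrix(coords_a, coords_b, m=M, d0=D0):
    """Return S[i][j] = M / (1 + (d_ij / d0)^2) for two coordinate lists.

    coords_a, coords_b: sequences of (x, y, z) alpha-carbon positions
    (an assumed input format). d_ij is the Euclidean distance between
    residue i of the first structure and residue j of the second.
    """
    return [[m / (1.0 + (math.dist(p, q) / d0) ** 2) for q in coords_b]
            for p in coords_a]


# Two coincident residues score M; residues 5 angstroms apart score M/2.
demo = similarity_matrix([(0.0, 0.0, 0.0)],
                         [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)])
```

The score decays smoothly with distance, so well-superposed residue pairs dominate the matrix while distant pairs contribute almost nothing.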

One applies dynamic programming to the similarity matrix to get equivalences (using a gap opening penalty of M/2=10 and no gap extension penalty) and uses them to least-squares fit the first structure onto the second one (44). Then one repeats the procedure, finding all pairwise distances and doing dynamic programming to get new equivalences, until the process converges. After an alignment is determined, it can be “refined” by eliminating the worst-fitting pairs of aligned residues and then refitting to get a new rms in a similar fashion to the core-finding procedure in Gerstein and Altman (45, 46). This refinement is necessary because the dynamic programming used tries to match as many residues as possible. (It is a global, as opposed to local, method.)

The structural comparison score and the rms. At the end of the procedure, we were left with a number of scores characterizing our final alignment. The score optimized by dynamic programming was the sum of the similarity matrix scores Sij minus the total penalty for opening gaps. We refer to this as “Sstr.” To be more explicit, it was computed from the following formula:

Sstr=M(∑ 1/(1+ (dij/d0)2)–Ngap/2),

where Ngap is the total number of gaps (not including gaps at the end of a chain) and the summation is carried out over all pairs, ij, of equivalenced residues. The more traditional score is the rms deviation in α-carbon position after doing a least-squares fit on the aligned atoms (the “rms”). rms-based statistics were used in our earlier work (for example, refs. 3–5) and have been used in almost all other work in structural alignment.
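Because Sstr depends only on the distances between equivalenced residues and on the gap count, it can be evaluated for any alignment once the structures are superposed. The following Python sketch illustrates this together with the rms; it assumes the distances (in Å) have already been measured after the least-squares fit, and the function names are hypothetical.

```python
import math


def structural_score(aligned_distances, n_gaps, m=20.0, d0=5.0):
    """Sstr = M * (sum over aligned pairs of 1/(1 + (d/d0)^2) - Ngap/2).

    aligned_distances: distances between equivalenced alpha-carbons after
    superposition (assumed to be supplied by the caller).
    n_gaps: number of internal gaps (end gaps are not counted in the text).
    """
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in aligned_distances)
    return m * (total - n_gaps / 2.0)


def rms_deviation(aligned_distances):
    """Root-mean-square of the same distances, the traditional score."""
    n = len(aligned_distances)
    return math.sqrt(sum(d * d for d in aligned_distances) / n)


# One perfect pair, one pair 5 angstroms apart, and one internal gap.
score = structural_score([0.0, 5.0], n_gaps=1)
```

A single badly fitting pair raises the rms sharply but lowers Sstr by at most M, which is one way to see why the two scores weight the aligned residues differently.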

A probability-density function for structural alignment scores. To derive significance-statistics for the structural alignment score Sstr, we proceeded exactly as we did for sequence comparison. Structural alignment of all pairs in the database gave us an observed probability distribution for comparison scores, which was a function of the number of residues matched N and the comparison score Sstr (Fig. 4A). This distribution contained the many pairs of structures that were similar, and these pairs stood out with high Sstr scores. Fig. 4B shows data for pairs that were in different scop structural classes and, therefore, should not have had any structural similarity. Fig. 4B is much “cleaner” than Fig. 4A and shows the underlying distribution expected for the comparison of structures that are not similar.

Fig. 2B shows the density distribution as a function of Sstr for sections at constant N. There is a close parallel between the structural alignment score Sstr and the sequence alignment score, Sseq, in Fig. 2A, and both can be modeled by an extreme-value distribution. Thus, we fit the calculated structure density by ρcstr(Z)=exp(–Z–exp(–Z)), where the variable Z is defined in terms of Sstr and N by using Z=(Sstr–µstr)/σstr. The most likely structure score µstr and the width parameter σstr have a more complicated dependence on the number of matched residues N than was the case for sequences, with µstr(N)=c ln(N)2+d ln(N)+e (if N<120), µstr(N)=a ln(N)+b (if N≥120) and σstr(N)=f ln(N)+g (if N<120), σstr(N)=f ln(120)+g (if N≥120).

Continuity of function values and slopes allows a and b to be written in terms of c, d, and e. To be more specific, at N=120, a ln(N)+b=c ln(N)2+d ln(N)+e and a=2c ln(N)+d. Thus, the expressions for µstr(N) and σstr(N) involve five independent parameters: c, d, e, f, and g. We determined these five parameters via least-squares optimization by using the SIMPLEX minimizer in MATLAB, which yielded c=18.4, d=–4.50, e=2.64, f=21.4, and g=–37.5 (a=171.8 and b=–419.3 were derived as described above). The residual was 0.288. It was given by the same formula as was used for the residual in the sequence statistics fit with Oi=log ρ°str, Ci=log ρcstr, and wi=1, and the summation was over bins with any value of Sstr and N between 30 and 170 residues. The resulting fit of the observed and calculated distribution (Fig. 2B) was good for all values of N and Sstr.
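The piecewise definitions and the continuity conditions can be checked numerically with a brief Python sketch. It assumes the rounded values of c, d, e, f, and g quoted in the text; the function names are hypothetical. Evaluating the continuity conditions at N=120 with these rounded inputs gives a≈171.7 and b≈–419.1, and the slope condition a=2c ln(120)+d requires a to be the positive coefficient.

```python
import math

C, D, E, F, G = 18.4, -4.50, 2.64, 21.4, -37.5  # fitted values from the text
N_SWITCH = 120  # number of aligned residues at which the form changes


def linear_branch(c, d, e, n_switch):
    """Solve for a and b so that value and slope are continuous at n_switch."""
    x = math.log(n_switch)
    a = 2.0 * c * x + d                  # slope of the quadratic in ln(N)
    b = c * x * x + d * x + e - a * x    # matches the function value
    return a, b


A, B = linear_branch(C, D, E, N_SWITCH)


def mu_str(n):
    """Most likely structure score for n aligned residues."""
    x = math.log(n)
    if n < N_SWITCH:
        return C * x * x + D * x + E
    return A * x + B


def sigma_str(n):
    """Width parameter; constant above N_SWITCH."""
    return F * math.log(min(n, N_SWITCH)) + G


def z_str(s_str, n):
    """Z = (Sstr - mu_str(N)) / sigma_str(N)."""
    return (s_str - mu_str(n)) / sigma_str(n)
```

The same helper `linear_branch` applies unchanged to any other quadratic-to-linear fit in ln(N), given its own c, d, e, and switching point.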

FIG. 4. The logarithm of the density distribution for structure comparison scores, log ρ°str, is contoured against Sstr, the structural alignment score (along the horizontal axis), and N, the number of aligned residues (along the vertical axis). By following the protocol used for Fig. 1, the raw data obtained by counting the number of pairs with the particular Sstr and N values are “lifted” and smoothed over a window 90 units wide along the Sstr axis, and the log value is contoured in intervals of 1 log unit. Given the different scales used for Sseq and Sstr, the extent of smoothing is very similar for both. (A) Data from all 884,540 pairs between any one of the 941 sequences and any other sequence. (B) Data from 352,168 pairs, including only those pairs of sequences in different scop classes (described in Fig. 1). Comparison of A and B shows that the true-positive structural matches are seen in the contours at the higher values of the alignment score Sstr and also at higher values of the number of matches N. The density in B is free of these significant matches and shows the underlying density distribution expected for comparison of unrelated structures.


A cumulative structure distribution function, giving the P value. To estimate the statistical significance of a particular structure comparison in terms of its Sstr and N values, we proceeded as we did for sequence comparison. We integrated the score distribution to determine a cumulative distribution function Pstr, defined as the probability that matching two random structures will give a z value greater than or equal to Z. The structure score distribution has the same extreme-value form as the sequence score distribution, so the derivation of Pstr follows that of Pseq, with Pstr(z>Z)=1–exp[–exp(–Z)], where Z is expressed in terms of Sstr and N by using

Z=(Sstr–(c ln(N)2+d ln(N)+e))/(f ln(N)+g), N<120

Z=(Sstr–(a ln(N)+b))/(f ln(120)+g), N≥120

and the seven parameters a, b, c, d, e, f, and g are given above.

Structural comparison statistics based on rms. The traditional characterization of a structural alignment is in terms of the number of residues matched, N, and the rms deviation from fitting these matched residues, R. It is convenient to focus on ln(R), which ensures that there is good separation of values for small R, where the significant pairs occur. We calculated a probability distribution ρ°rms[ln(R),N] for the observed rms values of true-negative pairs in the same fashion as we did earlier for the observed distribution of structural alignment scores ρ°str(Sstr,N).

The fact that log ρ°rms varies very slowly with ln(R) near the maximum (Fig. 5) led us to fit the calculated density by using ρcrms(Z)=exp(–Z4), where Z is defined in terms of ln(R) and N as Z=(ln(R)–µrms(N))/σrms(N), with µrms(N)=c ln(N)2+d ln(N)+e (if N<60), µrms(N)=a ln(N)+b (if N≥60) and σrms(N)=f ln(N)+g (if N<60), σrms(N)=f ln(60)+g (if N≥60). The values of the five independent parameters c, d, e, f, and g were determined by least-squares optimization by using the SIMPLEX minimizer in MATLAB, which yielded c=0.155, d=–0.619, e=1.73, f=0.0922, and g=0.212 (a=0.650 and b=–0.872 were determined as before to ensure continuity).

FIG. 5. The fit to the structure pair density by using the rms score. The observed, log ρ°rms, and calculated, log ρcrms, structure pair density distributions are plotted against the rms score ln(R) for different numbers of aligned residues, N. The observed structure pair density, which is derived from pairs in different classes, is clearly not an extreme-value distribution because it is symmetrical about the maximum value and falls off faster than a linear function with increasing Z. In fact, it is best fit by exp(–Z4). The calculated distribution obtained with a five-parameter fit (dashed line) is a good fit when the number of aligned residues exceeds 50.

To estimate the statistical significance of a particular comparison in terms of its R and N values, we derived a cumulative distribution function Prms(z<Z), defined as the probability that any z will be less than or equal to a given Z. This was just the integral of ρcrms(z) from z=–∞ to z=Z. Because the function exp(–z4) cannot be integrated analytically, we integrated it numerically for z from –5 to Z and tabulated its value for 10,000 different Z values from –5 to 5.
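The tabulation described above can be sketched in Python with the trapezoid rule. This is an illustration only: the choice of the trapezoid rule, the normalization of the table by the total integral over [–5, 5], and the linear interpolation between table entries are assumptions of this sketch, and the function names are hypothetical.

```python
import bisect
import math


def tabulate_rms_cdf(z_min=-5.0, z_max=5.0, n_points=10_000):
    """Tabulate the cumulative integral of exp(-z^4) on an even grid.

    Returns (zs, cdf, total), where cdf is normalized to run from 0 to 1
    and total is the unnormalized integral over [z_min, z_max].
    """
    step = (z_max - z_min) / (n_points - 1)
    zs = [z_min + i * step for i in range(n_points)]
    density = [math.exp(-z ** 4) for z in zs]
    cumulative = [0.0]
    for i in range(1, n_points):
        # Trapezoid rule on each grid interval.
        cumulative.append(cumulative[-1]
                          + 0.5 * (density[i - 1] + density[i]) * step)
    total = cumulative[-1]
    return zs, [value / total for value in cumulative], total


def p_rms(z, zs, cdf):
    """Look up P(z' <= z) by linear interpolation in the table."""
    if z <= zs[0]:
        return 0.0
    if z >= zs[-1]:
        return 1.0
    j = bisect.bisect_right(zs, z)       # zs[j-1] <= z < zs[j]
    t = (z - zs[j - 1]) / (zs[j] - zs[j - 1])
    return cdf[j - 1] + t * (cdf[j] - cdf[j - 1])


ZS, CDF, TOTAL = tabulate_rms_cdf()
```

The integrand is negligible beyond |z|=5, so the total agrees closely with the exact integral over the whole real line, 2Γ(5/4), and by symmetry the tabulated probability at Z=0 is one-half.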

Comparing structure comparison statistics: Alignment score Sstr vs. rms. Once we had derived structure comparison statistics based on structural alignment score Sstr and rms, we could compare them. The same coverage-vs.-error scheme used above to compare the two formulae for sequence alignment significance could be used again here. When assessed in terms of coverage (number of true-positives found) at a given error rate on our test data, the E value statistics based on Sstr gave a much better performance (i.e., had a larger coverage) than those based on rms. To be more specific, we compared the two approaches (Estr vs. Erms) in exactly the same way that we previously had compared our sequence E value to that produced by FASTA (Eseq vs. Efa). We found that, at the 1% error threshold, the rms-based statistics have log(Erms)=–32.8 and a coverage of 202 whereas the structural-alignment score statistics have log(Estr)=–1.58 and a coverage of 627. Clearly, the statistics based on Sstr perform much better because the threshold is much more reliable (i.e., closer to the value of –2 for an error rate of 1%) and the true-positive coverage is >3-fold higher. The difference between Estr and Erms is striking and confirms that the structure score is much better than the rms score.

There are other reasons why the structural alignment score Sstr is a more reliable indicator than rms: (i) Sstr depends most strongly on the best-fitting atoms whereas rms depends most on the worst-fitting atoms; (ii) Sstr penalizes gaps, whereas rms does not; and (iii) Sstr is formally analogous to the score one gets from a standard sequence comparison, Sseq, because both quantities are derived from a “dynamic-programming” similarity matrix. As dynamic programming finds a maximum score over many possible alignments, it is reasonable that both Sstr and Sseq should follow an extreme-value distribution. However, this is not a trivial result, as the scores are not independent, random variables whose maximum must follow such a distribution.

Relationship Between Sequence Comparison and Structure Comparison. Having derived sequence and structure significance scores by using all-vs.-all comparisons on the same database of 941 sequences and structures, we were in a position to compare directly structure and sequence significance scores. Fig. 6 shows such a comparison for the 2,107 pairs of proteins in our data set that are considered to be related evolutionarily according to scop (i.e., they are the true-positives in the same superfamily). The lines at log(Eseq)=–2 and at log(Estr)=–2 divide the 2,107 true-positive pairs among four quadrants, depending on whether their sequence or structure matches are significant, as follows:

Top right (1,204 pairs; nonsignificant sequence match, nonsignificant structure match). Over half (1,204 of 2,107) of the pairs of domains thought to be evolutionarily related by scop fall into this category of having no significant match, indicating that the combination of manual measures used in scop is more sensitive than either automatic sequence or structure comparison.


FIG. 6. Comparison of structure significance with sequence significance. Plots of the structure significance, log(Estr), against the sequence significance, log(Eseq), for the 2,107 pairs of proteins judged to be homologous in the scop database (in the same superfamily). Pairs are distinguished by the extent of their structural match, with solid squares used for pairs with N≥70 and unfilled diamonds used for N<70. The horizontal and vertical dashed lines, which divide the figure into four quadrants, are at log(Estr)=–2 and at log(Eseq)=–2, respectively. Both of these thresholds correspond to an E value of 10^–2 and P value of 10^–2/941≈10^–5 so that we judge matches with lower values to be significant at the 1% level.

Lower left (244 pairs; significant sequence match, significant structure match). These pairs are evenly distributed in the lower left quadrant, indicating that the sequence and structure significance scores are on the same scale.

Lower right (576 pairs; nonsignificant sequence match, significant structure match). There are many more pairs with good structure matches but without sequence matches than the converse (sequence match but no structure match). This fact objectively shows how structure is conserved more than sequence in evolution. These 576 pairs are very good test cases for threading algorithms that match a sequence to a structure, and we currently are testing them in this way.

Top left (83 pairs; significant sequence match, nonsignificant structure match). Almost all of the pairs (70 of 83) in this category involve matches with a small number of residues (N<70). For such short matches, the structures may be deformed and may not match well. There are seven labeled pairs that are exceptions because the match is extensive (N>70), but the pairs structurally are less similar than would be expected from the strong sequence match. These seven exceptions involve 11 coordinate sets. Three of these sets were solved by x-ray crystallography to only medium resolution (>2.9 Å; 1mys, 1scm, and 1tlk), five were solved by NMR (1prr, 1ntr, 2pld, 2pna, and 1tnm), and three are high resolution x-ray structures (better than 1.7 Å for 1osa, 3chy, and 1sha). None of the seven exceptional pairs involved two high resolution structures, and it seems likely that some of the seven exceptions would have had a more significant structural match if both structures in the pair were determined to a high resolution. Furthermore, as determined from consultation of a Database of Macromolecular Movements (ref. 47; see database at http://bioinfo.mbb.yale.edu/MolMovDB), some of the seven exceptions involved proteins that had been solved in different conformational states. In particular, 1osa, 1mys, and 1scm involved proteins with the highly flexible calmodulin fold. These are clearly examples for which one would expect sequence similarity but structural differences.
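The assignment of a pair to one of the four quadrants follows directly from the two thresholds. A minimal Python sketch, assuming log E values are supplied as base-10 logarithms and that values strictly below –2 count as significant; the function name and the quadrant labels as strings are illustrative choices.

```python
LOG_E_THRESHOLD = -2.0  # 1% error rate per query, from the text


def classify_pair(log_e_seq, log_e_str, threshold=LOG_E_THRESHOLD):
    """Return the quadrant of a (log Eseq, log Estr) point.

    Sequence significance runs along the horizontal axis and structure
    significance along the vertical axis, so significant (low) values lie
    to the left and toward the bottom.
    """
    seq_significant = log_e_seq < threshold
    str_significant = log_e_str < threshold
    if seq_significant and str_significant:
        return "lower left"
    if str_significant:
        return "lower right"
    if seq_significant:
        return "top left"
    return "top right"


# Pair counts reported in the text for each quadrant.
REPORTED_COUNTS = {"top right": 1204, "lower left": 244,
                   "lower right": 576, "top left": 83}
```

The four reported counts add up to the 2,107 true-positive pairs, so every pair falls into exactly one quadrant.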

DISCUSSION AND CONCLUSION

Summary. We have presented an approach for assessing in a unified statistical framework the significance of a given comparison of proteins, whether involving sequences or structures. For either sequence or structure we fit an extreme-value distribution to the observed distribution obtained from the all-vs.-all comparison of the database (i.e., between pairs of scop domains in different structural classes). For sequence comparison, this extreme-value distribution is as expected: We empirically observed for gapped alignments what Karlin and Altschul (11) derived for ungapped ones. We also gave a simple formula for the E value that is likely to be useful for pairwise comparisons without involving searches of the entire database.

For structure comparison, we found that the score distribution follows an extreme-value distribution when expressed in terms of the structural alignment score Sstr. By using this measure, expressions for statistical significance can be formulated in an almost identical way for structure as they are for sequence. It is important to realize that, although the Sstr is produced naturally by our specific alignment method, it can be calculated from any arbitrary structural alignment. Thus, by using our formulas, a significance can be computed from the results of any structural alignment program. Using the more traditional rms deviation as a score does not lead to as reliable a measure of structural significance.

In connection with this, it is interesting that recent work (39, 43) indicates that the significance statistics based on optimized “sum” scores from dynamic programming (i.e., Smith-Waterman scores, which are essentially sums of BLOSUM matrix values minus gap penalties) perform much better than those based on the traditional measure of sequence similarity, percentage identity, which parallels the poor performance of our structural alignment statistics based on the traditional rms. It is disconcerting that such well established and intuitive measures as percentage identity or rms perform so much worse than the statistical measures based on the sequence or structure alignment scores.

Furthermore, it is surprising that over half of the relationships between distant homologues in scop were not statistically significant (at a rate of 1% error per query) using either pure sequence comparison or pure structure comparison. Almost all of the pairs found by sequence comparison were found by structure comparison, but there were many pairs found by structure comparison that were not found by sequence comparison. Overall, structural comparison was able to detect about twice as many of the scop distant homology superfamily pairs as sequence comparison (at the same rate of error).

Future Directions. The approach we have used to derive statistical significance easily could be generalized to other contexts. In particular, it can be adapted to provide significance statistics for threading. We have not presented a detailed examination of the significance values for specific pairs of sequences or structures. Such an examination could prove to be a useful endeavor in the future, particularly if it focused on pairs of proteins with the same fold but insignificant E values and those with different folds but significant E values. These two classes of pairs characterize the twilight zone for structure, which has yet to be described fully.

We thank S.E. Brenner for carefully reading the manuscript and S.E. Brenner and T. Hubbard for providing the pdb40d-1.32 database. M.G. acknowledges the National Science Foundation for support (Grant DB1–9723182), and M.L. acknowledges the Department of Energy (Grant DE-FG03–95ER62135).

1. Rohlf, F. & Slice, D. (1990) Syst. Zool. 39, 40–59.
2. Bookstein, F.L. (1991) Morphometric Tools for Landmark Data (Cambridge Univ. Press, Cambridge, U.K.).
3. Gerstein, M. & Levitt, M. (1998) Protein Sci. 7, 445–456.
4. Subbiah, S., Laurents, D.V. & Levitt, M. (1993) Curr. Biol. 3, 141–148.
5. Laurents, D.V., Subbiah, S. & Levitt, M. (1994) Protein Sci. 3, 1938–1944.
6. Needleman, S.B. & Wunsch, C.D. (1971) J. Mol. Biol. 48, 443–453.
7. Smith, T.F. & Waterman, M.S. (1981) J. Mol. Biol. 147, 195–197.
8. Doolittle, R.F. (1987) Of URFs and ORFs (Univ. Sci. Books, Mill Valley, CA).
9. Gribskov, M. & Devereux, J. (1992) Sequence Analysis Primer (Oxford Univ. Press, New York).
10. Karlin, S. & Altschul, S.F. (1990) Proc. Natl. Acad. Sci. USA 87, 2264–2268.
11. Karlin, S. & Altschul, S.F. (1993) Proc. Natl. Acad. Sci. USA 90, 5873–5877.
12. Lipman, D.J. & Pearson, W.R. (1985) Science 227, 1435–1441.
13. Pearson, W.R. (1996) Methods Enzymol. 266, 227–259.
14. Karlin, S., Bucher, P., Brendel, V. & Altschul, S.F. (1991) Annu. Rev. Biophys. Biophys. Chem. 20, 175–203.
15. Altschul, S.F., Boguski, M.S., Gish, W. & Wootton, J.C. (1994) Nat. Gen. 6, 119–129.
16. Bryant, S.H. & Altschul, S.F. (1995) Curr. Opin. Struct. Biol. 5, 236–244.
17. Altschul, S.F. & Gish, W. (1996) Methods Enzymol. 266, 460–480.
18. Pearson, W.R. (1997) Comput. Appl. Biosci. 13, 325–332.
19. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Nucleic Acids Res. 25, 3389–3402.
20. Remington, S.J. & Matthews, B.W. (1980) J. Mol. Biol. 140, 77–99.
21. Satow, Y., Cohen, G.H., Padlan, E.A. & Davies, D.R. (1987) J. Mol. Biol. 190, 593–604.
22. Taylor, W.R. & Orengo, C.A. (1989) J. Mol. Biol. 208, 1–22.
23. Artymiuk, P.J., Mitchell, E.M., Rice, D.W. & Willett, P. (1989) J. Inform. Sci. 15, 287–298.
24. Sali, A. & Blundell, T.L. (1990) J. Mol. Biol. 212, 403–428.
25. Vriend, G. & Sander, C. (1991) Proteins 11, 52–58.
26. Russell, R.B. & Barton, G.B. (1992) Proteins 14, 309–323.
27. Holm, L. & Sander, C. (1993) J. Mol. Biol. 233, 123–128.
28. Gibrat, J.F., Madej, T. & Bryant, S.H. (1996) Curr. Opin. Struct. Biol. 6, 377–385.
29. Falicov, A. & Cohen, F.E. (1996) J. Mol. Biol. 258, 871–892.
30. Gerstein, M. & Levitt, M. (1996) in Proc. Fourth Int. Conf. on Intell. Sys. Mol. Biol. (American Association for Artificial Intelligence Press, Menlo Park, CA), pp. 59–67.
31. Cohen, G.H. (1998) J. Appl. Crystallography (in press).
32. Holm, L. & Sander, C. (1996) Science 273, 595–602.
33. Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer, E.E., Jr., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977) J. Mol. Biol. 112, 535–542.
34. Abola, S.J., Prilusky, J. & Manning, N.O. (1997) Methods Enzymol. 277, 556–571.
35. Murzin, A., Brenner, S.E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536–540.
36. Brenner, S., Chothia, C., Hubbard, T.J.P. & Murzin, A.G. (1996) Methods Enzymol. 266, 635–642.
37. Hubbard, T.J.P., Murzin, A.G., Brenner, S.E. & Chothia, C. (1997) Nucleic Acids Res. 25, 236–239.
38. Brenner, S., Hubbard, T., Murzin, A. & Chothia, C. (1995) Nature (London) 378, 140.
39. Brenner, S., Chothia, C., Hubbard, T. (1998) Proc. Natl. Acad. Sci. USA (in press).
40. Pearson, W.R. & Lipman, D.J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444–2448.
41. Henikoff, S. & Henikoff, J.G. (1993) Proc. Natl. Acad. Sci. USA 19, 6565–6572.
42. Levitt, M. & Chothia, C. (1976) Nature (London) 261, 552–558.
43. Brenner, S.E. (1996) Ph.D. thesis (Cambridge Univ., Cambridge, U.K.).
44. Kabsch, W. (1976) Acta Cryst. A 32, 922–923.
45. Gerstein, M. & Altman, R. (1995) Computer Applications in the Biosciences 11, 633–644.
46. Gerstein, M. & Altman, R. (1995) J. Mol. Biol. 251, 161–175.
47. Gerstein, M., Lesk, A.M. & Chothia, C. (1994) Biochemistry 33, 6739–6749.


Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5921–5928, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

Folding funnels and frustration in off-lattice minimalist protein landscapes

HUGH NYMEYER,* ANGEL E. GARCÍA,† AND JOSÉ NELSON ONUCHIC*‡

*Department of Physics, University of California at San Diego, La Jolla, California 92093–0319; and †Theoretical Biology and Biophysics Group, T10 MS K710, Los Alamos National Laboratory, Los Alamos, New Mexico 87545

ABSTRACT A full quantitative understanding of the protein folding problem is now becoming possible with the help of the energy landscape theory and the protein folding funnel concept. Good folding sequences have a landscape that resembles a rough funnel where the energy bias towards the native state is larger than its ruggedness. Such a landscape leads not only to fast folding and stable native conformations but, more importantly, to sequences that are robust to variations in the protein environment and to sequence mutations. In this paper, an off-lattice model of sequences that fold into a β-barrel native structure is used to describe a framework that can quantitatively distinguish good and bad folders. The two sequences analyzed have the same native structure, but one of them is minimally frustrated whereas the other one exhibits a high degree of frustration.

The ability of proteins to spontaneously fold into unique three-dimensional structures has amazed scientists for the last few decades. Since the beginning of molecular biology, it has been recognized that proteins are responsible for controlling most functions in living organisms, and that their functionality strongly depends on their shape. How are these biological molecules able to fold? This question has been a puzzle that has not yet been completely answered, but a lot has been learned in recent years.

Energy landscape theory and the funnel concept provide the theoretical framework towards a quantitative understanding of the folding question (1, 2). This alternative view of the folding mechanism replaced the earlier idea that there must exist a single pathway for the folding event with clearly defined chemical intermediates (3, 4). After early seminal contributions by Gō (5), Bryngelson and Wolynes realized in the late 1980s (6, 7) that a full understanding of the folding process would have to involve a global overview of the protein energy landscape. Inspired by this view, Leopold and collaborators (8) introduced the concept of a funnel landscape to describe good folding sequences, a landscape that resembles a partially rough funnel riddled with traps where the protein can transiently reside. In such a funnel there is not a unique folding pathway but a multiplicity of folding routes, all converging towards the native state. Late in the folding process, the protein may be trapped in single pathways but, at this stage, most of the protein has already found its correct folding configuration and the search becomes limited. Several other groups have also participated in the development of this new view that has flourished in the 1990s. Even though the following list is clearly incomplete, in addition to the previous references, the reviews in refs. 9–20 provide a detailed description of the landscape perspective.

The description that follows provides a qualitative understanding of a funnel landscape. Unlike protein-like heteropolymers, random heteropolymers with a tendency to collapse do not have a well defined three-dimensional conformation, but a collection of completely different low energy structures. How can we differentiate between these two kinds of sequences? Imagine that we want to discover a sequence that favors a particular structure, called the native structure. A major task at this point is to choose a good reaction coordinate (or order parameter) that measures the similarity between this native structure and any other conformation that may be adopted by this heteropolymer. For lattice minimalist models, a successful coordinate has been Q, the fraction of native tertiary contacts (9, 21–25). For real proteins many other choices are possible and, in most cases, several of them may be necessary, such as fraction of native secondary structures and fraction of native helix caps (1, 26). For our pictorial description we consider only a single Q, varying between 0 and 1 (native structure). As shown in Fig. 1, an ideally designed folding sequence has the energy of its conformations proportional to Q plus some roughness introduced by the nonnative contacts. This correlation between energy and structure not only introduces a bias that favors the native configuration but it also proportionally biases all nonnative conformations, depending on their degree of similarity to the folded state. This correlation is responsible for the funnel shape of the landscape. It is important to notice that even conformations that are completely different but have similar Q (native parts are different) have similar energies. A random sequence would display no such correlation between energy and structure, leading to the rough landscape shown in Fig. 1.

For a protein-like heteropolymer to have the energy proportional to the global order parameter Q, its stabilizing contacts should be equally distributed throughout the entire structure. All native interactions should favor folding, and they should be equally important, i.e., the system exhibits no “frustrated” interactions. This is the ideal situation and, although real proteins may not be so perfect, they clearly need to minimize frustration, an idea proposed by Bryngelson and Wolynes (6). Because proteins are finite systems, if they have a single ground state, there is always a temperature below which this lowest-energy state is stable. This temperature is called the folding temperature, Tf. On the other hand, because the landscape is rugged, there is also a temperature below which the kinetics is controlled by long-lived low-energy traps and not by the bias toward the native conformation. This temperature is called the glass temperature, Tg. Minimally frustrated sequences require sufficient bias to have the folding temperature larger than the glass temperature. Therefore this competition between energetic bias toward the native conformation and roughness is fundamental in determining the folding mechanism, and it leads to a diversity of folding scenarios that are discussed elsewhere (2). All these ideas are further explored later in this paper.

‡To whom reprint requests should be addressed, e-mail: jonuchic@ucsd.edu.
© 1998 by The National Academy of Sciences 0027–8424/98/955921–8$2.00/0
PNAS is available online at http://www.pnas.org.
Abbreviations: MD, molecular dynamics; MFPT, mean first passage time; MODC, molecule optimal dynamic coordinates.

FIG. 1. (a) Energy landscape for a random heteropolymer. Notice that the presence of low energy states that are completely dissimilar is a direct consequence of the small energy bias toward the native state relative to the roughness of the landscape. (b) Funnel-like energy landscape for a minimally frustrated heteropolymer. A clearly favored native structure can be observed in the bottom of this funnel. Because of this dominant bias, all the other low energy states are similar to the native one.

Sequences with a good folding funnel not only are fast folders at temperatures around the folding temperature but, most importantly, they are robust folders. Robustness is an essential property in biology. Minor variations in the folding environment such as small changes in pH, temperature, denaturant concentration, or, even more interesting, variations because of mutations may affect the native configuration in favor of other low-energy structures. If these other low-energy structures are similar to the folded one, the consequences are minor. The “new” native conformation is very similar to the “old” one. The observed linear dependence between logarithms of the folding/unfolding rates and the folding free energy is a direct indication that this is the case for proteins (27–31). Frustrated sequences, on the other hand, not only are slow folders but also may have the structure of their native state drastically changed under minor variations of the conditions described above.

This diversity of scenarios suggested by the landscape theory and the funnel concept can be observed by simulations of protein folding in computer models. Such simulations can be carried out at many different levels. Ideally they should be at the atomistic level but, because of computational limitations, this approach has limited itself to insights into local aspects of folding (32, 33) and characterizing ensembles of states for unfolded proteins (34–37). Thus minimalist models have been of major importance in our understanding of protein folding. Lattice models have been the center of these studies. They include the simple ones exploited in the early 1980s (5, 38, 39), and more recently in studies by several other groups (8, 12, 15, 16, 20, 40–46). These models have really improved our present understanding of protein folding. Off-lattice models have also been studied (47–54), but little has been done in this landscape context, making this point the focus of this paper.

In addition to simulations, new experiments have been devised to probe early folding events and to explore the landscape of small fast-folding proteins (NMR dynamic spectroscopy, protein engineering, laser-initiated folding, and ultrafast mixing; see, for example, refs. 13, 14, 28, 55–67, and 85). Fast-folding proteins fold on millisecond timescales and have a single domain, i.e., they have a single, well defined funnel (68). The combination of landscape theory, simulations, and this new family of experiments is providing the basis for a quantitative understanding of the protein folding mechanism.

In this paper we show results for an off-lattice minimalist model where we explore the behavior of two folding sequences with the same native structure, but with one containing a higher degree of frustration. A quantitative landscape framework for quantifying differences between good and bad folding sequences emerges from this comparison. Because most of the existing landscape analysis has been performed for lattice simulations, we present in the next section a summary of some selected lattice results to help with our discussion of the off-lattice simulations.

A Summary of Lattice Minimalist Models

Minimalist models of protein folding must contain all the features necessary to understand the folding mechanism. In its simplest version a heteropolymer must contain at least two kinds of monomers whose interactions obey some simplified interaction rule; i.e., heteropolymers may be thought of as a necklace of beads of two or more kinds. The question to be answered is what sequences of beads are able to fold into a unique three-dimensional structure. In an effort to mimic the hydrophobic effect, Dill and collaborators (12) proposed the first set of interactions, called the HP model, where the interactions between H (hydrophobic) groups are attractive and all the other ones are zero. Another popular model, which is used for our simulations of 27-mers in a cubic lattice, is the one where the interactions between nearest neighbor beads of the same color are more favorable (strong attractive interaction) than the ones between beads of different colors (weak interaction). Sequences built with two kinds of beads are called two-letter codes, those with three kinds of beads are three-letter codes, and so on.

The low-energy states of heteropolymers composed of random sequences of two or more kinds of beads are collapsed states that try to maximize the number of contacts between beads of the same color. The polymeric nature of the chain, however, prohibits all favorable interactions from being satisfied simultaneously, and some contacts occur between beads of different color. These are clearly frustrated interactions, because the polymer would rather have the maximum number of favorable interactions. Thus different low-energy states may have different structures with a different set of frustrated contacts.
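The contact rule just described can be sketched in a few lines of Python. This is only an illustration, not the code used for the lattice studies cited here: the function name `lattice_energy` and the numerical values of `e_same` and `e_diff` are placeholders assumed for the example, since the text specifies only that same-color contacts are strong and different-color contacts are weak.

```python
def lattice_energy(positions, colors, e_same=-3.0, e_diff=-1.0):
    """Contact energy of a chain on a cubic lattice.

    positions: list of (x, y, z) integer sites, in chain order.
    colors:    sequence of bead labels, one per site.
    e_same / e_diff: assumed (hypothetical) energies for contacts between
    beads of the same / different color.
    """
    site = {tuple(p): i for i, p in enumerate(positions)}
    energy = 0.0
    for i, (x, y, z) in enumerate(positions):
        # Looking only in the +x, +y, +z directions counts each pair once.
        for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            j = site.get((x + dx, y + dy, z + dz))
            # Skip empty sites and beads bonded along the chain.
            if j is not None and abs(i - j) > 1:
                energy += e_same if colors[i] == colors[j] else e_diff
    return energy


# A four-bead chain folded into a square has one nonbonded contact (beads 0 and 3).
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(lattice_energy(square, "AABA"))  # same-color contact
print(lattice_energy(square, "AAAB"))  # different-color (frustrated) contact
```

In this picture a frustrated contact is simply one that contributes `e_diff` where the chain would have preferred `e_same`.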


The 27-mer in a cubic lattice is a nice system to simulate because, even though it is not possible to enumerate all its conformations, we can enumerate all its maximally collapsed configurations, the ≈103,000 3×3×3 cubes. The details of these studies can be found elsewhere (see, for example, refs. 21, 40, and 69). Investigation of several two-letter or three-letter sequences has taught us that most of the sequences are bad folders, and the good folding ones maximize the number of favorable (strong) native contacts and minimize the number of strong nonnative contacts in unfolded conformations. This strategy maximizes the energy bias toward the native state and at the same time reduces the ruggedness of the landscape, which is mostly determined by the nonnative contacts. As expected, by increasing the number of different kinds of beads, it becomes easier to obtain minimally frustrated sequences.

How can we quantify good folders? The simplest measure, proposed by Bryngelson and Wolynes, is to determine the folding temperature (Tf) and the glass temperature (Tg) of a sequence. The folding temperature can be easily determined, and it has been chosen as the temperature where the native state is occupied 50% of the time. For good folding sequences, the protein-like heteropolymer really behaves as a two-state system, i.e., depending on the temperature, the protein is mostly folded or unfolded, and it is rarely found in some intermediate conformation. In this case, folding is a cooperative sharp first-order-like transition and, therefore, any quantity that is able to distinguish between these two states can be used as a probe of the folding transition. This is not the case for bad folders, where this transition is broad and noncooperative. The discussion in the later section on signatures of folders for our off-lattice models makes this distinction clear.
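The 50% occupancy criterion for Tf lends itself to a short numerical sketch. The function below assumes the native-state population has already been measured at a few temperatures and simply interpolates linearly to the crossing point; the name `folding_temperature` and the example numbers are hypothetical and are not taken from the simulations discussed here.

```python
def folding_temperature(temps, native_population, threshold=0.5):
    """Interpolate the temperature at which the native population crosses
    the threshold (0.5 for the definition of Tf used in the text)."""
    points = sorted(zip(temps, native_population))
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        # A sign change of (population - threshold) brackets the crossing.
        if (p0 - threshold) * (p1 - threshold) <= 0 and p0 != p1:
            return t0 + (threshold - p0) * (t1 - t0) / (p1 - p0)
    raise ValueError("native population never crosses the threshold")


# Made-up occupancies for a sharply two-state (cooperative) folder.
temps = [1.0, 1.4, 1.6, 2.0]
occupancy = [0.95, 0.70, 0.30, 0.05]
print(folding_temperature(temps, occupancy))
```

For a cooperative folder the occupancy curve is steep near the crossing, so the interpolated value is insensitive to the spacing of the temperature grid; for a bad folder the broad transition makes the estimate correspondingly less sharp.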

How is the glass temperature identified? The situation is more problematic, but it can be clearly defined. On the basis of the fact that long-lived traps are the source of the glass transition, Socci and Onuchic provided an operational definition for the glass transition (69). If trapping were not a problem, lowering the temperature should speed up folding because it favors collapse. As the temperature gets lowered, however, there is a point where a substantial slowdown of folding happens. This temperature has been called the kinetic glass transition and is similar to the “thermodynamic” glass transition proposed by Bryngelson and Wolynes (2, 70).

A more sophisticated analysis has been developed recently. It has been shown that for a good folding sequence around Tf, the kinetics of its folding event can be described as a stochastic motion of a few reaction coordinates (or order parameters) on an effective potential defined by the free energy as a function of these order parameters (7, 22, 25, 71). In the simplest possible representation, this motion can be assumed to be diffusive, with a configurational diffusion coefficient that incorporates, in an average sense, transient occupation of short-lived traps.§ In this regime the folding event is exponential and the folding time can be estimated by using diffusive reaction rate theory (22, 72). As the temperature gets closer to the glass temperature, this description completely breaks down. The protein is now being caught in long-lived traps, and the folding kinetics is controlled by the escape time from these traps. Because there is a full ensemble of these times, the kinetics of the folding event becomes nonexponential. This behavior is illustrated in Fig. 2 for a minimally frustrated three-letter code 27-mer.
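The contrast between exponential and nonexponential folding kinetics can be illustrated with a small Python sketch. It is not the fitting procedure used for Fig. 2; it assumes only a list of first passage (folding) times from repeated runs, and the helper names are hypothetical. For a single exponential the maximum-likelihood rate is the inverse of the mean folding time and the coefficient of variation is 1, whereas a broad ensemble of trap escape times tends to push the coefficient of variation above 1.

```python
import math


def exponential_rate(first_passage_times):
    """Maximum-likelihood rate of a single-exponential distribution (1 / mean)."""
    return len(first_passage_times) / sum(first_passage_times)


def coefficient_of_variation(first_passage_times):
    """Standard deviation divided by mean; 1 for a single exponential."""
    n = len(first_passage_times)
    mean = sum(first_passage_times) / n
    var = sum((t - mean) ** 2 for t in first_passage_times) / n
    return math.sqrt(var) / mean


def stretched_exponential(t, tau, beta):
    """Unfolded fraction exp(-(t/tau)**beta); beta = 1 is a single exponential."""
    return math.exp(-((t / tau) ** beta))


def exponential_quantiles(tau, n):
    """Deterministic stand-in for n exponential samples with mean tau."""
    return [-tau * math.log(1.0 - (i + 0.5) / n) for i in range(n)]


fast = exponential_quantiles(1.0, 10000)            # one timescale
trapped = fast[::2] + exponential_quantiles(100.0, 5000)  # two very different timescales
print(exponential_rate(fast), coefficient_of_variation(fast))
print(coefficient_of_variation(trapped))
```

A coefficient of variation well above 1, as in the second case, is one quick sign that a single rate no longer describes the data and that a multi-exponential or stretched-exponential form is worth fitting.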

Clearly, a lot has been learned about the folding mechanism by investigating these lattice models. The question is how can we use these ideas to understand folding of real proteins beyond a qualitative way. Because lattice models include only tertiary contacts, a quantitative correspondence between these models and real proteins needs to consider additional order parameters, particularly secondary structure formation. An attempt towards this goal has been made by Onuchic and collaborators (21). Using an analytical theory of helix-coil transition in collapsed heteropolymers to renormalize the secondary structure, they have proposed a law of corresponding states to relate small fast-folding proteins (around 50–60 amino acids) with lattice simulations of a minimally frustrated three-letter code 27-mer.

FIG. 2. Log-log plots [as proposed by Frauenfelder and collaborators (84)] of the distribution of folding times for a minimally frustrated three-letter code 27-mer. Time is shown in units of the number of Monte Carlo (MC) steps. The solid lines represent single-exponential fits through the data. Calculations were performed by Socci and collaborators (71). Around Tf=1.509, single exponentials, consistent with the diffusive picture, are a good representation of the data. As the temperatures approach the glass temperature (Tg≈1), escape from long-lived traps starts to control the dynamics, leading to a stretched-exponential (power-law) behavior as expected for glass dynamics. The dashed line at T=1.12 is a double-exponential fit and the dashed ones at T=1.00 and 0.89 are stretched-exponential fits.

This correspondence between lattice models and real proteins, however, still is very limited. To explore all possible folding scenarios, there is a need to include these additional reaction coordinates (order parameters) explicitly. The off-lattice minimalist models are suited for this task. Simple off-lattice models of proteins can have protein-like shapes with well defined secondary structural elements, as in real proteins. In addition, the continuum character of the configurational variables forces the unique folded state to be one basin of attraction with an entropy proportional to the volume of the basin and not a single conformation.

In this paper we show how the quantitative analysis that has been performed for lattice models to distinguish between good and bad folders can be generalized for off-lattice models. It should become clear how this framework can be used to analyze any other models, including the ones with a full atomistic description. The system analyzed here has the native conformation of a small four-strand β-barrel protein, and it is investigated for two different sequences, a minimally frustrated one and a frustrated one. The comparison between the results obtained for both of them makes apparent how the landscape theory and the funnel concept can be used to quantitatively explore the folding of protein-like heteropolymers and even of real proteins.

The β-Barrel Model

Two sequences, one minimally frustrated and one frustrated, are analyzed. Both of them are Cα protein models, 46 monomers long, which fold into β-barrel-shaped structures but have different potentials. The first sequence, introduced by Honeycutt and Thirumalai (73), is (B)9(N)3(PB)4(N)3(B)9(N)3(PB)5P with monomers that are labeled hydrophobic (B), hydrophilic (P), or neutral (N). This model, which we refer to as the BPN model, has been studied on several other occasions (10, 49, 50), and similar α-helical models have also been studied (74).

§Refs. 1 and 71 provide a detailed description for this formalism, including the dependence of the glass transition on the order parameters.

The energetics of the BPN model is described by a potential that combines bond-angle, dihedral, and van der Waals terms.
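A form of this potential consistent with the quantities defined in the text (the bond-angle spring constant, the dihedral coefficients A and B, and the van der Waals coefficients S1 and S2) can be sketched as follows. The prefactors, the summation limits, and the symbols k_θ, θ_0, and φ are assumptions made here for illustration, following the usual convention for this class of models.

```latex
V = \sum_{\text{angles}} \frac{k_\theta}{2}\,(\theta - \theta_0)^2
  + \sum_{\text{dihedrals}} \bigl[ A\,(1 + \cos\phi) + B\,(1 + \cos 3\phi) \bigr]
  + \sum_{j \ge i+3} 4\,\varepsilon\, S_1
    \left[ \left(\frac{\sigma}{r_{ij}}\right)^{12}
         - S_2 \left(\frac{\sigma}{r_{ij}}\right)^{6} \right]
```

With this form, S2=1 gives an attractive well, S2=0 leaves only the excluded volume term, and S2=–1 makes the pair interaction repulsive at all distances.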

The van der Waals interaction is used to mimic the hydrophobic/hydrophilic character of the different monomer types. To achieve this, the S1 and S2 coefficients are chosen to create attractive interactions between all BB monomer pairs, repulsive ones between all PP and PB pairs, and only excluded volume interactions between the pairs PN, BN, and NN. BB interactions have S1=1 and S2=1, PP and PB interactions have S1=2/3 and S2=–1, and all interactions involving N monomers have S1=1 and S2=0.¶
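The sequence pattern and the S1, S2 assignments above can be checked with a short Python sketch. The coefficient values are those quoted in the text; the 12-6 functional form with a 4ε prefactor in `pair_energy`, and all function names, are assumptions introduced only for this illustration.

```python
import re

# Sequence as written in the text: (B)9(N)3(PB)4(N)3(B)9(N)3(PB)5P
BPN_PATTERN = "(B)9(N)3(PB)4(N)3(B)9(N)3(PB)5P"


def expand_sequence(pattern):
    """Expand '(XY)n' groups into repeated letters; bare letters are kept."""
    parts = []
    for group, count, single in re.findall(r"\(([A-Z]+)\)(\d+)|([A-Z])", pattern):
        parts.append(group * int(count) if group else single)
    return "".join(parts)


def pair_coefficients(a, b):
    """Return (S1, S2) for a monomer pair, using the values quoted in the text."""
    if "N" in (a, b):
        return 1.0, 0.0          # excluded volume only
    if a == "B" and b == "B":
        return 1.0, 1.0          # attractive
    return 2.0 / 3.0, -1.0       # PP and PB: repulsive


def pair_energy(a, b, r, eps=1.0, sigma=1.0):
    """Assumed form: 4*eps*S1*[(sigma/r)**12 - S2*(sigma/r)**6]."""
    s1, s2 = pair_coefficients(a, b)
    x6 = (sigma / r) ** 6
    return 4.0 * eps * s1 * (x6 * x6 - s2 * x6)


sequence = expand_sequence(BPN_PATTERN)
print(len(sequence), sequence)
print(pair_energy("B", "B", 2 ** (1 / 6)))  # bottom of the attractive BB well
```

Expanding the pattern gives the 46 monomers stated in the text, and under the assumed form only BB pairs have a negative (attractive) energy at any distance.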

As becomes clear further on, this model exhibits a high degree of frustration, probably due to the long range and nonspecific character of the interactions. To contrast with the BPN model, we developed a minimally frustrated one. In this model only the interactions between monomers that form native contacts (i.e., contacts found in the native β-barrel) are attractive. By doing that we remove the roughness created by nonnative contacts, recovering nearly ideal folding behavior (see discussion in the introduction). We refer to this model as the Gō-like model because it is similar to the one introduced by Gō and collaborators (76).||

To construct the Gō-like model, we take a quenched structure from the BPN model and identify all contacts of the type i, j>i+3 within a distance of 1.167σ. This produces 47 pairs of monomers distributed mainly between the B-monomers (see Fig. 3); several of the monomers in the turns and in one end have no contacts. All attractive van der Waals interactions between monomers are turned off except for these 47 pairs. All other pairs have only the repulsive 1/r¹² term, responsible for excluded volume. The native pairs have an attractive interaction with a well depth of ε and an energy minimum at 1.2σ. This choice of interactions results in only minor differences between the ground state structure and the original quenched model. All bond and angle interactions are the same as in the BPN model. (There are many possible ways to construct a Gō-like model, because the choice of the number of native contacts is somewhat arbitrary. The one adopted by us is reasonable for the purpose of building a minimally frustrated sequence with this native conformation, but it is not unique.)
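The construction just described can be sketched in Python as follows. The contact rule (j>i+3, distance below 1.167σ), the well depth ε, and the minimum at 1.2σ are taken from the text; the 12-6 shape of the native well, the 4ε prefactor on the repulsive term, the function names, and the contact cutoff passed to `fraction_native` are assumptions made only for this illustration.

```python
import math


def native_contacts(coords, cutoff=1.167):
    """Pairs (i, j) with j > i + 3 closer than the cutoff (units of sigma)."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + 4, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                pairs.append((i, j))
    return pairs


def go_pair_energy(r, native, eps=1.0, r_min=1.2, sigma=1.0):
    """Native pairs: assumed 12-6 well of depth eps with its minimum at r_min.
    All other pairs: repulsive 1/r**12 term only (prefactor assumed)."""
    if native:
        x6 = (r_min / r) ** 6
        return eps * (x6 * x6 - 2.0 * x6)
    return 4.0 * eps * (sigma / r) ** 12


def fraction_native(coords, native_pairs, contact_cutoff):
    """Q: fraction of the native pairs that are within contact_cutoff."""
    formed = sum(
        1 for i, j in native_pairs
        if math.dist(coords[i], coords[j]) < contact_cutoff
    )
    return formed / len(native_pairs)


# Toy six-bead hairpin (hypothetical coordinates, not the model's ground state).
hairpin = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (3, 1, 0), (0, 1, 0)]
pairs = native_contacts(hairpin)
print(pairs, go_pair_energy(1.2, native=True))
print(fraction_native(hairpin, pairs, contact_cutoff=1.5))
```

Applied to the quenched BPN structure, `native_contacts` would play the role of the step that yields the 47 native pairs, and `fraction_native` the role of the folding order parameter Q.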

Already in the development of these potentials, the differing level of robustness of the two models is apparent. Although both models are weakly sensitive to changes in the angle interactions, the BPN model is very sensitive to changes in the strength of the dihedral energy interaction, unlike the Gō-like model. Weakening of the intrinsic trans preferences in the BPN model by 25% makes the original native structure unstable at all temperatures. On the other hand, the dihedral preferences in the Gō-like model can be strengthened or weakened while maintaining the same ground state structure. Even total elimination of the backbone rotamer preferences (A=0.0ε and B=0.2ε), adopted by us in this paper, reduces the stability by only 36%, leaving a wide temperature window between Tf and Tg.

FIG. 3. An illustration of the ground state of the Gō-like model. Each arrow represents an attractive interaction that exists between two monomers. There are 47 of these interactions. The only nonbonded interaction between two monomers without a connecting arrow is a repulsive 1/r¹² term responsible for excluded volume.

Signatures of Good and Bad Folders

Thermodynamics. The first clear indication of the different degrees of frustration between these models comes from analyzing their thermodynamic properties. Similar to what is observed in lattice simulations (1, 71), minimally frustrated systems are characterized by equivalent folding pathways, and such systems have cooperative folding transitions.

Figs. 4 and 5 show the specific heat and the degree of folding

FIG. 4. The specific heat, Cv, of the BPN model (Upper) is contrasted with the collapse and folding denaturation curves(Lower). Compared to the minimally frustrated Go �-like model (see Fig. 5), it shows a reduced level of cooperativity. Noticethat collapse occurs prior to folding and that, even at the lowest temperature, the number of native contacts is far frommaximal. Reliable sampling could not be performed below 0.4ε because at these temperatures the kinetics is controlled byescape from long-lived traps. In particular, the lowest bump in the specific heat is partially an artifact of the low T sampling.

¶For both models, we work in reduced units—i.e., all units are defined in terms of the monomer mass M, the bond length σ, and theenergy ε. Time is thus measured in units of and friction in units of τ –l. Also, all bonds are fixed with the shake algorithm(75), and bond angles are set to have a rest value of 105° and a spring constant of 40ε(rad)–2. The BPN model has stiff local transpreferences for the dihedral angles except at the loop regions. Thus the BPN coefficients for the dihedral interactions are set as A=1.2εand B=0.2ε for all the dihedral interactions except those involving two or more neutral monomers, in which case, A=0.0ε and B=0.2ε,leading to a small barrier but no preference among the three possible backbone rotamers. As a consequence of this choice of dihedralcoefficients, rigid strands appear at all temperatures below the collapse temperature.

||This model is also similar to the associative memory Hamiltonian used by Wolynes and collaborators (48) in the limit of a single memory.


(Q) and collapse (C) order parameters as a function of temperature for both models. The difference in folding cooperativity between them is noticeable. The BPN model has a broad transition region centered at Tc≈0.72ε that is mainly a collapse transition, although the collapsed structures are rather restricted in their conformations. Nearly all the collapsed structures have a four-stranded topology like the ground state. This similarity is reflected in the increase in the folding order parameter simultaneously with the collapse one. Fig. 6 provides strong evidence that most of the entropy is lost upon collapse. Even though states ≈70% similar to the native one are formed below Tc, the native state itself is not populated until well below this temperature. The temperature at which this occurs is not known exactly because it is below T=0.4ε, where our sampling is not reliable. Notice from Fig. 6 that for temperatures below 0.5ε, this model “runs out” of entropy at Q≈0.3, indicating that its kinetics is now controlled by escape from long-lived low-energy traps (glassy regime). The structural properties of these low-energy structures are discussed in the subsection on the ground state.

FIG. 5. The specific heat, Cv, of the Gō-like model versus temperature (Upper) is contrasted with the mean values of Q and C versus temperature (Lower). Notice the simultaneity of the collapse and folding events as well as the high degree of cooperativity of the folding transition.

FIG. 6. The thermodynamic functions plotted as a function of the folding order parameter, Q, for the BPN model. F is the free energy, TS is the temperature times the entropy, and E is the energy. The temperatures are measured in units of ε (0.6ε is just below the collapse temperature). All curves are in units of kBT for and are shifted relative to the native state. The lack of an energy bias toward the native state is apparent. The entropy plots also illustrate the onset of glassy behavior at temperatures below 0.5ε (model runs out of entropy at Q≈0.3). At these low temperatures, the dynamics becomes controlled by the escape time from long-lived traps.

In contrast to the BPN model, the Gō-like model shows a single sharp peak for the specific heat centered at 0.42ε. This “latent heat” coincides with increases in Q and C; thus collapse and folding occur simultaneously at this temperature.
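As a reminder of why a sharp specific heat peak signals a cooperative transition, the sketch below estimates Cv at a single temperature from the energy fluctuations of a sampled trajectory, using the standard canonical relation Cv = (⟨E²⟩ − ⟨E⟩²)/(kB T²). This is only an illustration: the paper combines runs at several temperatures with multiple histograms, and the function name `specific_heat` and the choice kB = 1 (reduced units) are assumptions made here.

```python
import numpy as np

def specific_heat(energies, temperature, k_b=1.0):
    """Estimate Cv from energy fluctuations at one temperature.

    energies    : potential (or total) energies sampled at `temperature`
    temperature : temperature in the same reduced units as the energies
    k_b         : Boltzmann constant (1.0 in reduced units; an assumption)
    """
    energies = np.asarray(energies, dtype=float)
    # Canonical fluctuation formula: Cv = var(E) / (kB * T^2).
    return energies.var() / (k_b * temperature ** 2)
```

A two-state-like system that hops between low-energy folded and high-energy unfolded ensembles near the transition has a large energy variance there, which is what produces a single sharp peak.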

Even though several order parameters can monitor collapse and folding [for example, rms deviation from the native conformation, principal component analysis coordinates (77, 78), radius of gyration, secondary structure measures, and contact measures], in all our analyses C and Q are used to probe collapse and folding, respectively. Both of them have been normalized to 1 (relative to the maximum number of contacts in the quenched native configuration). (Of course this means that there are a few states with C>1.) For the purpose of calculating Q or C, we define contacts to exist between any two monomers with indices i and j>i+3 that are within 1.8σ of each other, even though a shorter cutoff is used when we determine the “native” contacts for the native structure. This flexibility allows the native contacts to fluctuate slightly. For the BPN model, we used a cutoff of 1.2σ to define native contacts, which are exactly the attractive ones in the Gō-like model. The details of our results are relatively insensitive to the choice of cutoff for classifying contacts as native.
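The contact definition above can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: the function names are hypothetical, coordinates are assumed to be in units of σ, and both Q and C are normalized here by the number of native contacts, which is one reading of the normalization described in the text.

```python
import numpy as np

def contact_pairs(coords, cutoff, min_sep=4):
    """Return the set of (i, j) pairs with j >= i + min_sep closer than cutoff.

    min_sep=4 implements the condition j > i + 3 from the text.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    # Full pairwise distance matrix between beads.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i_idx, j_idx = np.triu_indices(n, k=min_sep)
    mask = dist[i_idx, j_idx] < cutoff
    return set(zip(i_idx[mask].tolist(), j_idx[mask].tolist()))

def order_parameters(coords, native_pairs, cutoff=1.8):
    """Return (Q, C) for one configuration.

    Q counts only native contacts; C counts every contact, so C can exceed 1.
    """
    pairs = contact_pairs(coords, cutoff)
    n_native = len(native_pairs)
    q = len(pairs & native_pairs) / n_native
    c = len(pairs) / n_native
    return q, c
```

The native pair set would be built once from the quenched native structure with the shorter cutoff, for example `contact_pairs(native_coords, 1.2)`, and then reused for every frame.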

The thermodynamic functions for both models are plotted versus Q in Figs. 6 and 7. The curves have been shifted to have the energy, entropy, and free energy equal to zero at the native state. The Gō-like model shows a very good funnel: the energy and entropy increase smoothly with Q. This behavior, as expected from landscape theory (1, 6–8), has also been observed in lattice simulations (22, 71). The individual energy and entropy terms are very large, around 10 to 100 kBT, but they almost cancel each other, yielding a much smaller residual free energy [recall that our potentials already renormalize the effect of the solvent (2)] and, as in lattice models, a small free energy barrier of ≈3kBTf exists at the folding temperature. Also, since the low-energy states are all very similar to the

FIG. 7. The thermodynamic functions plotted as a function of the folding order parameter, Q, for the Gō-like model. The temperatures are measured in units of ε. All curves are in units of kBT for and are shifted relative to the native state. Notice that, even for temperatures far below the folding temperature, this model does not “run out” of entropy, indicating the presence of a very good funnel as expected for minimally frustrated systems.


native configuration (Q≈1), this system is very robust and therefore, as discussed in the introduction, insensitive to reasonable changes in the environment (changes in temperature or changes that affect the potential) and mutations.

The behavior of the BPN model (Fig. 6) differs sharply from that of the Gō-like model. The free energy plots indicate a noncooperative second-order-like collapse transition near Tc (0.72ε) with little preference among collapsed structures. The native structure is selected from a large ensemble of dissimilar low-energy structures. Most of the energy gain is used upon collapse, leaving almost nothing to bias the search among collapsed states toward the native configuration. As discussed above, the entropy decreases sharply for states with Q>0.3 for temperatures just below the collapse temperature (around 0.5ε). This entropy crisis heralds the onset of glassy dynamics controlled by escape from long-lived low-energy traps. This glassy behavior is supported by three other effects that become prominent near and below this temperature: a rapid increase in the folding time as the temperature is reduced, the existence of nonexponential relaxation, and the occurrence of specific folding trajectories that are unrelated to the underlying free energy surface as plotted versus a few order parameters. Also, because low-energy states may be so dissimilar, this model shows no robustness. Minor changes in the environment and mutations may cause dramatic changes in the structure of the native state (see further discussion in the next two subsections).

Sampling for the determination of the thermodynamic behavior is done using the AMBER (79) program. Molecular dynamics (MD) simulations are performed at constant temperature (80) with a coupling time of 0.1τ and a time step of 0.005τ. Samples taken at several temperatures are combined by using multiple histograms (81). Simulations are done at various temperatures ranging from 0.02ε to 1.2ε. Each temperature simulation is preceded by a 2-million-step equilibration that starts from the final conformation of the previous higher temperature simulation. At each temperature 4,000 configurations are collected.

Kinetics. To fully explore the dynamics of the folding event, a series of folding simulations is performed for both models at different temperatures. MD simulations are done using a leap-frog Langevin integrator (adapted from ref. 82). We measure kinetic quantities with a γ of 0.2τ⁻¹, which is a factor of 10 larger than the measured value for amino acids in water (83). We do not believe the use of a lower friction constant will qualitatively change our results, although folding timescales are probably decreased by a factor of 10. Simulations of the Gō-like model for different values of the friction constant show a folding rate that varies linearly for γ greater than 2.0τ⁻¹, and this variation appears to be temperature independent. No appreciable difference in folding behavior is noticed for the different values of γ. The same dependence is also observed for the BPN model (72).

On the order of 100 simulations are performed at each temperature. Each simulation is preceded by 200,000 simulation steps at 1.6ε to unfold and randomize the system. The final coordinates and velocities of this simulation are used as the starting point for the folding simulations. Q is calculated for every tenth structure, and the simulation is halted when a native structure with Q=1 is reached. The length of the folding run is used to calculate the mean first passage time (MFPT) for each temperature. The MFPT increases rapidly at both low and high temperatures. In the BPN model, the minimum MFPT is about 900τ and occurs at 0.6ε. In the Gō-like model, the minimum MFPT is about 100τ and occurs at 0.2ε.
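The MFPT protocol described above can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' code: `mean_first_passage_time` is a hypothetical helper, each run is assumed to be stored as a list of Q values sampled every `stride` integration steps, and the time step `dt` is left as a parameter because the value used for the kinetic runs is not restated here.

```python
import numpy as np

def mean_first_passage_time(q_traces, dt, stride=10, q_native=1.0):
    """Average the first time each run reaches Q >= q_native.

    q_traces : list of 1-D sequences of Q, one per folding run
    dt       : integration time step in units of tau (assumed known)
    stride   : Q is evaluated every `stride` steps (every tenth in the text)
    Returns (mfpt, n_unfinished); runs that never fold are only counted,
    so the MFPT is a lower bound whenever n_unfinished > 0.
    """
    times = []
    unfinished = 0
    for trace in q_traces:
        hits = np.nonzero(np.asarray(trace) >= q_native)[0]
        if hits.size:
            # Index of the first native frame converted to simulation time.
            times.append(hits[0] * stride * dt)
        else:
            unfinished += 1
    mfpt = float(np.mean(times)) if times else float("nan")
    return mfpt, unfinished
```

Reporting the number of unfinished runs alongside the mean matters at low temperature, where some trajectories hit the simulation time limit before folding.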

The increase in the folding time at low and high temperatures is a prediction of the energy landscape theory (2, 7). As discussed in the preceding section and ref. 22, the increase in the MFPT at high temperatures is caused by the growth of the folding barrier, whereas the increase at low temperatures (before the glass transition) is due to changes in the prefactor of the folding rate, which depends on a configurational diffusion coefficient that averages the effect of short-lived traps.

FIG. 8. Log-log plots of the unfolded population as a function of time for the BPN model (Upper) and the Gō-like model (Lower). The dashed lines are exponential fits to the data, and the single solid line is a power-law fit for the BPN model. From Upper, we can notice that the BPN model starts to become nonexponential at temperatures just below collapse (around T=0.6ε). Deviations from single-exponential behavior are caused by a few deep traps with different escape times. Around this temperature, the kinetics is roughly bi-exponential. As the temperature gets lower, the number of these low-energy traps increases substantially, leading to the power-law decay. The onset of nonexponential kinetics in the Gō-like model does not occur until temperatures much lower than the folding temperature (around T=0.1ε). All simulations are truncated at 50,000τ.

Similar to lattice models (see the preceding section and ref. 69), a simple way to estimate the glass transition is to use the operational definition of a kinetic glass transition temperature Tg, the temperature at which the MFPT for folding has fallen to 1% of its maximal value. The approximate value of Tg for the BPN model is 0.4ε, and for the Gō-like model it is 0.05ε.** This gives for the two models a Tf/Tg ratio of about 0.9 and 8, respectively. These ratios place the BPN and Gō-like models squarely in the groups of strongly and minimally frustrated systems.

A hallmark of glassy dynamics is nonexponential relaxation. As in Fig. 2, Fig. 8 shows log-log plots of the unfolded population as a function of time for both sequences. In these plots, an exponentially decaying population falls sharply, whereas glassy dynamics exhibits a power-law (or stretched exponential) decay (71, 84). The BPN model starts to deviate from exponential folding around 0.6ε, where the decay is bi-exponential. This is evidence that the system is starting to be trapped in nonnative conformations. At 0.45ε there is a continuum of folding times, controlled by the escape times from a large ensemble of long-lived low-energy traps. This is reflected in a power-law decay with folding times ranging from 500τ to at least 50,000τ, the time limit for individual folding simulations. The second relaxation time, at temperatures where the kinetics is roughly bi-exponential, is most likely caused by a trap in which the first completely hydrophobic strand is bent backwards to contact itself. Although there are several unfavorable dihedrals in this conformation, the large number of BB contacts makes it an exceptionally low-energy trap. On the other hand, the Gō-like model decay can be fit by

**The ruggedness for the Gō-like model is very small because the energy is roughly proportional to Q. This is apparent from Fig. 7, where the entropy as a function of Q is almost temperature independent.


a single exponential for temperatures much lower than the folding temperature, all the way down to ≈0.1ε. Therefore, no long-lived traps exist for the relevant temperatures around Tf. The lack of folding events in either system within the first 10–20τ is due to the intrinsic collapse time; systems that fold in this time are collapsing directly into the native structure.
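One simple way to tell exponential from power-law decay of the unfolded population is to fit both forms by linear regression in the appropriate logarithmic coordinates and compare the residuals. The sketch below does this with `numpy.polyfit`; it is an assumed analysis for illustration (the function name `fit_decay` and the residual comparison are not taken from the paper, which shows the fits graphically in Fig. 8).

```python
import numpy as np

def fit_decay(times, unfolded_fraction):
    """Fit P(t) with an exponential and with a power law.

    Exponential: ln P = a - t / tau        (straight line in t)
    Power law:   ln P = a - alpha * ln t   (straight line in ln t)
    Returns the fitted tau and alpha with the squared residual of each fit.
    """
    t = np.asarray(times, dtype=float)
    p = np.asarray(unfolded_fraction, dtype=float)
    keep = (t > 0) & (p > 0)          # logarithms need positive values
    t, log_p = t[keep], np.log(p[keep])

    exp_coef = np.polyfit(t, log_p, 1)
    exp_resid = float(np.sum((np.polyval(exp_coef, t) - log_p) ** 2))

    log_t = np.log(t)
    pow_coef = np.polyfit(log_t, log_p, 1)
    pow_resid = float(np.sum((np.polyval(pow_coef, log_t) - log_p) ** 2))

    return {
        "tau": float(-1.0 / exp_coef[0]),
        "exp_resid": exp_resid,
        "alpha": float(-pow_coef[0]),
        "pow_resid": pow_resid,
    }
```

With on the order of 100 runs per temperature the tail of the population curve is noisy, so in practice one would restrict the fit to times where more than a few runs remain unfolded.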

Single folding runs of the BPN model (T≈0.5ε) show long-lived traps that are not visible from plots of the potential of mean force. These trapped trajectories individually show little relation to this effective potential. This behavior becomes prominent near and below Tg because the folding kinetics is then controlled by escape from low-energy long-lived traps.

The Nature of the Ground State. The inherent frustration of the BPN model compared to the Gō-like one can be visualized by measuring the occupation of the different collapsed states. Using MD trajectories of both models, we perform a cluster analysis of the collapsed states in terms of collective motions that best (in a least-square sense) represent the system fluctuations. These coordinates are called molecule optimal dynamic coordinates (MODC) (77, 78). The MODCs are obtained by diagonalizing the covariance matrix of selected dynamic variables (in our case, the Cartesian coordinates for the sequence beads). The largest eigenvalue MODC best describes the atomic fluctuations and, in this case, is sufficient to differentiate the various long-lived low-energy traps.
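The diagonalization step described above amounts to a principal component analysis of the bead coordinates. A minimal sketch, assuming the frames have already been superimposed to remove overall translation and rotation (the text does not restate that step), could look like this; `principal_modes` is a hypothetical name.

```python
import numpy as np

def principal_modes(trajectory):
    """Diagonalize the coordinate covariance matrix of a trajectory.

    trajectory : array of shape (n_frames, n_beads, 3), pre-aligned
    Returns (eigenvalues, eigenvectors, projections), sorted so that
    column 0 is the mode with the largest eigenvalue.
    """
    x = np.asarray(trajectory, dtype=float)
    x = x.reshape(len(x), -1)            # flatten to (n_frames, 3 * n_beads)
    x = x - x.mean(axis=0)               # fluctuations about the mean
    cov = x.T @ x / len(x)               # covariance matrix
    evals, evecs = np.linalg.eigh(cov)   # symmetric matrix: ascending order
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    projections = x @ evecs              # coordinates along each mode
    return evals, evecs, projections
```

A histogram of `projections[:, 0]`, converted to −kBT ln(probability), gives a potential of mean force along the first mode of the kind plotted in Fig. 9 Lower.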

In Fig. 9 Upper, a low temperature trajectory of the BPN model is plotted, using the two primary MODCs for this trajectory. Fig. 9 Lower shows the free energy as a function of the primary MODC. Superimposed in Lower is the free energy of the Gō-like model, when the same MODCs are used, showing that only the native cluster is occupied. The Gō-like model trajectory, not shown here, occupies only the native cluster instead of the ensemble of different structures occupied by the BPN one.

Notice that each cluster does not necessarily correspond to a single structure. The rms deviation between structures in different clusters is about 1σ and within a single cluster is less than 1/2σ, whereas crystallographic structures of proteins have backbone rms deviations of about 1/3 the typical Cα–Cα distance, i.e., 1/3σ. Also, structures in different clusters have different packing arrangements of the hydrophobic monomers. Therefore, each cluster corresponds to one or a few different packing arrangements. Most differ by a combined longitudinal translation and 180° rotation of one or more of the strands, and interconversion among them involves “reptation-like” moves (53).

FIG. 9. Cluster analysis of a low temperature trajectory (T≈0.32ε) for both models. Upper plots a trajectory for the BPN model as a function of the first two MODCs, and it shows that multiple clusters are often occupied. A trajectory for the Gō-like model, not shown here, mostly occupies the native cluster. Supporting these observations, Lower shows the potential of mean force (PMF) for both models as a function of the first MODC. Each minimum corresponds to a different cluster. While the PMF for the BPN model has several low-energy minima, the Gō-like PMF has a single, well-defined minimum at the native cluster.

Concluding Remarks

A framework based on the energy landscape theory and the funnel concept, which is able to quantitatively estimate the degree of frustration of folding sequences, has been presented. Thermodynamic and kinetic measures are used to distinguish between good folders (minimally frustrated) and bad folders (frustrated). Good folding sequences have a weakly rugged funnel-like landscape with low-energy states that have structurally similar configurations. The folding kinetics is exponential for temperatures around Tf, and the system is very robust to reasonable changes in the environment and mutations. The situation reverses for frustrated sequences. The landscape is rugged and the low-energy states are dissimilar. Around Tf, the kinetics is controlled by escape from different low-energy traps and therefore is nonexponential. The robustness observed for good folding sequences becomes nonexistent.

Also, a comparison between two sequences that fold into the same native conformation, one frustrated and one minimally frustrated, has been presented as an application of this framework. Notice, however, that the landscape theory predicts a diversity of folding scenarios that cannot be discussed by a single example. Even though different order parameters may be necessary to describe different systems and their respective folding scenarios, this framework will apply to all of them. By departing from the minimalist lattice models and moving to off-lattice ones, we can now develop a much richer collection of folding models and understand the folding conditions for each of them. In addition, this framework is not limited to minimalist models. It can be applied to the folding of proteins at full atomistic representation. At this level the kinetic data will be very limited, but the thermodynamic analysis alone is already very informative. By comparing these results with the ones obtained for the minimalist models, we should be able to identify the possible folding scenarios and quantitatively understand the folding mechanism for real proteins at an atomic resolution.

We thank Nick Socci, Gerhard Hummer, Jorge Chahine, Peter Wolynes, Joan Shea, and Charlie Brooks for helpful discussions. This work was supported by the National Science Foundation (Grant MCB-9603839). It was also partially supported by Los Alamos/University of California directed research and development (UCDRD) funds and by a molecular biophysics training grant (NIH T32 GN08326) for H.N.

1. Onuchic, J.N., Luthey-Schulten, Z. & Wolynes, P.G. (1997) Annu. Rev. Phys. Chem. 48, 545–600.
2. Bryngelson, J.D., Onuchic, J.N., Socci, N.D. & Wolynes, P.G. (1995) Proteins Struct. Funct. Genet. 21, 167–195.
3. Englander, S.W. & Mayne, L. (1992) Annu. Rev. Biophys. Biomol. Struct. 21, 243–265.
4. Kim, P.S. & Baldwin, R.L. (1990) Annu. Rev. Biochem. 59, 631–660.
5. Gō, N. (1983) J. Stat. Phys. 30, 413–423.
6. Bryngelson, J.D. & Wolynes, P.G. (1987) Proc. Natl. Acad. Sci. USA 84, 7524–7528.
7. Bryngelson, J.D. & Wolynes, P.G. (1989) J. Phys. Chem. 93, 6902–6915.
8. Leopold, P.E., Montal, M. & Onuchic, J.N. (1992) Proc. Natl. Acad. Sci. USA 89, 8721–8725.
9. Dill, K.A. & Chan, H.S. (1997) Nat. Struct. Biol. 4, 10–19.
10. Guo, Z.Y. & Thirumalai, D. (1995) Biopolymers 36, 83–102.
11. Garel, T., Orland, H. & Thirumalai, D. (1996) in Recent Developments in Theoretical Studies of Proteins, ed. Elber, R. (World Scientific, Singapore), pp. 197–268.
12. Dill, K.A., Bromberg, S., Yue, K., Fiebig, K.M., Yee, D.P., Thomas, P.D. & Chan, H.S. (1995) Protein Sci. 4, 561–602.
13. Fersht, A.R. (1997) Curr. Opin. Struct. Biol. 7, 3–9.


14. Eaton, W.A., Munoz, V., Thompson, P., Chan, C.K. & Hofrichter, J. (1997) Curr. Opin. Struct. Biol. 7, 10–14.
15. Mirny, L.A., Abkevich, V. & Shakhnovich, E.I. (1996) Folding Design 1, 103–116.
16. Sali, A., Shakhnovich, E. & Karplus, M. (1994) J. Mol. Biol. 235, 1614–1636.
17. Scheraga, H.A. (1992) Protein Sci. 1, 691–693.
18. Honig, B. & Cohen, F.E. (1996) Folding Design 1, R17–R20.
19. Zwanzig, R. (1995) Proc. Natl. Acad. Sci. USA 92, 9801–9804.
20. Pande, V.S., Grosberg, A.Y. & Tanaka, T. (1994) Proc. Natl. Acad. Sci. USA 91, 12972–12975.
21. Onuchic, J.N., Wolynes, P.G., Luthey-Schulten, Z. & Socci, N.D. (1995) Proc. Natl. Acad. Sci. USA 92, 3626–3630.
22. Socci, N.D., Onuchic, J.N. & Wolynes, P.G. (1996) J. Chem. Phys. 104, 5860–5868.
23. Socci, N.D., Nymeyer, H. & Wolynes, P.G. (1997) Physica D 107, 366–382.
24. Guo, Z., Brooks, C. & Boczko, E. (1997) Proc. Natl. Acad. Sci. USA 94, 10161–10166.
25. Plotkin, S.S. & Wolynes, P.G. (1998) Phys. Rev. Lett., in press.
26. Saven, J.G. & Wolynes, P.G. (1996) J. Mol. Biol. 257, 199–216.
27. Wolynes, P.G., Schulten, Z.L. & Onuchic, J. (1996) Chem. Biol. 3, 415–432.
28. Riddle, D.S., Santiago, J.V., Bray, S.T., Doshi, N., Grantcharova, V., Yi, Q. & Baker, D. (1997) Nat. Struct. Biol. 4, 805–809.
29. Scalley, M.L. & Baker, D. (1997) Proc. Natl. Acad. Sci. USA 94, 10636–10640.
30. Mines, G.A., Pascher, T., Lee, S.C., Winkler, J.R. & Gray, H. (1996) Chem. Biol. 3, 491–497.
31. Itzhaki, L.S., Otzen, D.E. & Fersht, A.R. (1995) J. Mol. Biol. 254, 260–288.
32. Hirst, J.D. & Brooks, C.L. (1995) Biochemistry 34, 7614–7621.
33. Simmerling, C. & Elber, R. (1994) J. Am. Chem. Soc. 116, 2534–2547.
34. Boczko, E.M. & Brooks, C.L. (1995) Science 269, 393–396.
35. Daggett, V. & Levitt, M. (1993) J. Mol. Biol. 232, 600–619.
36. Hünenberger, P.H., Mark, A.E. & van Gunsteren, W.F. (1995) Proteins 21, 196–213.
37. Hansmann, U.H.E. & Okamoto, Y. (1993) J. Comput. Chem. 14, 1333–1338.
38. Miyazawa, S. & Jernigan, R.L. (1985) Macromolecules 18, 534–552.
39. Covell, D.G. & Jernigan, R.L. (1990) Biochemistry 29, 3287–3294.
40. Socci, N.D. & Onuchic, J.N. (1995) J. Chem. Phys. 103, 4732–4744.
41. Hao, M.-H. & Scheraga, H.A. (1994) J. Phys. Chem. 98, 4940–4948.
42. Camacho, C.J. & Thirumalai, D. (1993) Phys. Rev. Lett. 71, 2505–2508.
43. Govindarajan, S. & Goldstein, R.A. (1996) Proc. Natl. Acad. Sci. USA 93, 3341–3345.
44. Reva, B.A., Finkelstein, A.V., Rykunov, D.S. & Olson, A.J. (1996) Proteins 26, 1–8.
45. de Araújo, A.F.P. & Pochapsky, T.C. (1996) Folding Design 1, 299–314.
46. Shrivastava, I., Vishveshwara, S., Cieplak, M., Maritan, A. & Banavar, J.R. (1995) Proc. Natl. Acad. Sci. USA 92, 9206–9209.
47. Levitt, M. & Warshel, A. (1975) Nature (London) 253, 694–698.
48. Friedrichs, M.S., Goldstein, R.A. & Wolynes, P.G. (1991) J. Mol. Biol. 222, 1013–1034.
49. Guo, Z., Thirumalai, D. & Honeycutt, J.D. (1992) J. Chem. Phys. 97, 525–535.
50. Guo, Z. & Brooks, C.L., III (1997) Biopolymers 42, 745–757.
51. Sasai, M. (1995) Proc. Natl. Acad. Sci. USA 92, 8438–8442.
52. Irbäck, A. & Potthast, F. (1995) J. Chem. Phys. 103, 10298–10305.
53. Berry, R.S., Elmaci, N., Rose, J.P. & Vekhter, B. (1997) Proc. Natl. Acad. Sci. USA 94, 9520–9524.
54. Nelson, E.D., Eyck, L.T. & Onuchic, J.N. (1997) Phys. Rev. Lett. 79, 3534–3537.
55. Burton, R.E., Huang, G.S., Daugherty, M.A., Calderone, T.L. & Oas, T.G. (1997) Nat. Struct. Biol. 4, 305–310.
56. Elove, G.A., Bhuyan, A.K. & Roder, H. (1994) Biochemistry 33, 6925–6935.
57. Jennings, P. & Wright, P. (1993) Science 262, 892–896.
58. Plaxco, K.W. & Dobson, C.M. (1996) Curr. Opin. Struct. Biol. 6, 630–636.
59. López-Hernández, E. & Serrano, L. (1996) Folding Design 1, 43–55.
60. Sosnick, T.R., Mayne, L. & Englander, S.W. (1996) Proteins 24, 413–426.
61. Ballew, R.M., Sabelko, J. & Gruebele, M. (1996) Nat. Struct. Biol. 3, 923–926.
62. Phillips, C.M., Mizutani, Y. & Hochstrasser, R.M. (1995) Proc. Natl. Acad. Sci. USA 92, 7292–7296.
63. Williams, S., Causgrove, T.P., Gilmanshin, R., Fang, K.S., Callender, R.H., Woodruff, W.R. & Dyer, R.B. (1996) Biochemistry 35, 691–697.
64. Mathews, C.R. (1993) Annu. Rev. Biochem. 62, 653–683.
65. Cordes, M.H.J., Davidson, A.R. & Sauer, R.T. (1996) Curr. Opin. Struct. Biol. 6, 3–10.
66. Raschke, T.M. & Marqusee, S. (1997) Nat. Struct. Biol. 4, 298–304.
67. Lin, L., Pinker, R.J., Forde, K., Rose, G.D. & Kallenbach, N.R. (1994) Nat. Struct. Biol. 1, 447–452.
68. Wolynes, P.G., Onuchic, J.N. & Thirumalai, D. (1995) Science 267, 1619–1620.
69. Socci, N.D. & Onuchic, J.N. (1994) J. Chem. Phys. 101, 1519–1528.
70. Wang, J., Onuchic, J. & Wolynes, P.G. (1996) Phys. Rev. Lett. 76, 4861–4864.
71. Socci, N.D., Onuchic, J.N. & Wolynes, P.G. (1998) Proteins Struct. Funct. Genet., in press.
72. Klimov, D.K. & Thirumalai, D. (1997) Phys. Rev. Lett. 79, 317–320.
73. Honeycutt, J.D. & Thirumalai, D. (1992) Biopolymers 32, 695–709.
74. Guo, Z. & Thirumalai, D. (1996) J. Mol. Biol. 263, 323–343.
75. Ryckaert, J.P., Ciccotti, G. & Berendsen, H.J.C. (1977) J. Comput. Phys. 23, 327–341.
76. Ueda, Y., Taketomi, H. & Gō, N. (1978) Biopolymers 17, 1531–1548.
77. García, A.E. (1992) Phys. Rev. Lett. 68, 2696–2699.
78. García, A.E., Hummer, G., Blumfield, R. & Krumhansl, J.A. (1997) Physica D 107, 225–239.
79. Pearlman, D.A., Case, D.A., Caldwell, J.W., Ross, W.S., Cheatham, T.E., III, Ferguson, D.M., Seibel, G.L., Singh, U.C., Weiner, P. & Kollman, P. (1995) AMBER, version 4.1 (Univ. of California, San Francisco).
80. Berendsen, H.J.C., Postma, J.P.M., van Gunsteren, W.F., DiNola, A. & Haak, J.R. (1984) J. Chem. Phys. 81, 3684–3690.
81. Ferrenberg, A.M. & Swendsen, R.H. (1989) Phys. Rev. Lett. 63, 1195–1198.
82. van Gunsteren, W.F. & Berendsen, H.J.C. (1982) Mol. Phys. 45, 637–647.
83. Lide, D.R., ed. (1994) Handbook of Chemistry and Physics (CRC, Boca Raton, FL), 75th Ed., pp. 6–253.
84. Frauenfelder, H., Parak, F. & Young, R.D. (1988) Annu. Rev. Biophys. Biophys. Chem. 17, 451–479.
85. Grantcharova, V. & Baker, D. (1997) Biochemistry 36, 15685–15692.


Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5929–5934, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

Optimizing the stability of single-chain proteins by linker length and composition mutagenesis

CLIFFORD R. ROBINSON* AND ROBERT T. SAUER†

Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139

ABSTRACT Linker length and composition were varied in libraries of single-chain Arc repressor, resulting in proteins with effective concentrations ranging over six orders of magnitude (10 µM–10 M). Linkers of 11 residues or more were required for biological activity. Equilibrium stability varied substantially with linker length, reaching a maximum for glycine-rich linkers containing 19 residues. The effects of linker length on equilibrium stability arise from significant and sometimes opposing changes in folding and unfolding kinetics. By fixing the linker length at 19 residues and varying the ratio of Ala/Gly or Ser/Gly in a 16-residue-randomized region, the effects of linker flexibility were examined. In these libraries, composition rather than sequence appears to determine stability. Maximum stability in the Ala/Gly library was observed for a protein containing 11 alanines and five glycines in the randomized region of the linker. In the Ser/Gly library, the most stable protein had seven serines and nine glycines in this region. Analysis of folding and unfolding rates suggests that alanine acts largely by accelerating folding, whereas serine acts predominantly to slow unfolding. These results demonstrate an important role for linker design in determining the stability and folding kinetics of single-chain proteins and suggest strategies for optimizing these parameters.

The construction of single-chain or hybrid proteins is a potentially powerful method for generating proteins with novel functions and improved properties (1–11). A critical element in such efforts is the design of the peptide linkers that serve to connect different protein domains or subunits. Designed linkers are usually glycine-based peptides with lengths calculated to span the minimum distance between the C terminus of one subunit or domain and the N terminus of the next. How important is linker design in determining the properties of single-chain proteins? Alterations in linker regions have been found to affect the stability, oligomeric state, proteolytic resistance, and solubility of single-chain proteins (12–23), but few systematic investigations of these relationships have been reported. Here, we test the effects of linker design on the stability, protein folding kinetics, and biological activity of single-chain Arc repressor. Wild-type Arc is a dimer with identical subunits, and Arc-L1-Arc is a single-chain variant with a 15-residue linker connecting the subunits (see Fig. 1). The L1 linker of Arc-L1-Arc holds the subunits at an effective concentration (Ceff) of 3 mM. By varying linker length and composition, we have isolated single-chain variants with effective subunit concentrations ranging from 10 µM to 10 M, corresponding to changes in the free energy of unfolding (∆Gu) from 3 to 11 kcal/mol. These differences in stability arise from changes in the folding and unfolding rates, suggesting that linker design can affect protein stability by altering the free energies of both the native and denatured states.
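The correspondence between the range of effective concentrations and the range of unfolding free energies quoted above can be checked with a short calculation. The sketch below assumes that the stability difference between two variants scales as RT ln of the ratio of their effective concentrations and that the temperature is 25°C; both are assumptions made for this illustration, and `delta_delta_g` is a hypothetical helper.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)

def delta_delta_g(c_eff_high, c_eff_low, temperature_k=298.15):
    """Stability difference (kcal/mol) implied by two effective concentrations.

    Assumes ddG = R * T * ln(c_eff_high / c_eff_low); concentrations in M.
    """
    return R_KCAL * temperature_k * math.log(c_eff_high / c_eff_low)

# Six orders of magnitude in Ceff: 10 uM versus 10 M.
span = delta_delta_g(10.0, 10e-6)
```

Under these assumptions `span` comes out at roughly 8 kcal/mol, in line with the quoted change in ∆Gu from 3 to 11 kcal/mol.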

MATERIALS AND METHODS

Cassettes coding for glycine-rich linkers ranging from 3 to 59 residues (Fig. 3A) were synthesized using an Applied Biosystems 381A DNA synthesizer and were purified as described (9). A precursor plasmid (pLA3), constructed to facilitate subcloning of linker library cassettes, contains tandem arc genes connected by a GGT ACC GGT adapter, which encodes Gly-Thr-Gly and contains unique KpnI and AgeI restriction sites. Cassette libraries coding for 19-residue linkers with different amounts of Gly or Ala were constructed by synthesizing an oligonucleotide that formed a hairpin:

                                            AAA
5′-ACACCTTGAGGTACCCGA(GSA)15GGTACCTAACAGGCG    A
                         3′-CCATGGATTGTCCGC AAAA

The GGTACC sequences (underlined in the original) are KpnI sites. S represents a mixture of G and C, and thus, the GSA codons encode either glycine (GGA) or alanine (GCA). Three otherwise identical oligonucleotides with different G/C ratios at the randomized positions (1:1; 3:1; 1:3) were synthesized to facilitate identification of a wide range of compositions. A cassette library encoding random combinations of glycine (GGT) and serine (AGT) was constructed in the same manner. Second strand synthesis was carried out using Sequenase v.2.0 (United States Biochemical) for 2 h at 37°C in Sequenase buffer containing 1 mM dNTPs. Cassettes were digested with KpnI and ligated to the KpnI backbone of pLA3. Following transformation into Escherichia coli strain HB101, colonies were picked randomly and the appropriate region of the single-chain arc gene was sequenced using the dideoxy method. Plasmid DNAs encoding in-frame constructs were transformed into E. coli strain UA2F for assays of activity in vivo (24) and into E. coli X90–λO cells for protein expression.
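The reason for synthesizing oligonucleotides at three G/C ratios can be illustrated with a short calculation. The sketch below assumes that each GSA codon is an independent draw at the nominal G:C ratio, so the number of alanines per linker follows a binomial distribution. That idealization ignores synthesis bias and the later selection for activity in vivo. The function name and the choice of 16 randomized codons (the count given in the abstract) are illustrative assumptions, not part of the original protocol.

```python
from math import comb


def alanine_count_distribution(n_codons: int, frac_c: float) -> list[float]:
    """Return P(k alanines) for k = 0..n_codons.

    Assumption: every GSA codon independently receives C (GCA, Ala)
    with probability frac_c and G (GGA, Gly) otherwise.
    """
    return [
        comb(n_codons, k) * frac_c**k * (1.0 - frac_c) ** (n_codons - k)
        for k in range(n_codons + 1)
    ]


def mean_alanines(pmf: list[float]) -> float:
    """Expected number of alanines under the distribution pmf."""
    return sum(k * p for k, p in enumerate(pmf))


if __name__ == "__main__":
    # G:C ratios named in the text, expressed as the fraction of C.
    for label, frac_c in [("3:1", 0.25), ("1:1", 0.50), ("1:3", 0.75)]:
        pmf = alanine_count_distribution(16, frac_c)
        most_likely = max(range(len(pmf)), key=pmf.__getitem__)
        print(f"G:C {label}: mean Ala = {mean_alanines(pmf):.1f}, "
              f"most likely count = {most_likely}")
```

Under this idealized model, a single 1:1 mixture would rarely yield linkers with very few or very many alanines, which is consistent with the stated purpose of using skewed ratios to sample a wide range of compositions.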

All single-chain Arc proteins contained a (His)6 tail to facilitate purification using Ni-nitrilotriacetic acid chromatography. Protein purification, fluorescence and circular dichroism (CD) spectroscopy, analytical ultracentrifugation, and gel mobility-shift assays were performed as described (9, 25). Protein stability was assayed by urea denaturation by following changes in intrinsic tryptophan fluorescence intensity at 337 nm or CD ellipticity at 234 nm. For these experiments, the protein concentration was 10 µM in buffer containing 50 mM

*Present address: 3-Dimensional Pharmaceuticals, Exton, PA.
†To whom reprint requests should be addressed. e-mail: [email protected].
© 1998 by The National Academy of Sciences 0027–8424/98/955929–6$2.00/0
PNAS is available online at http://www.pnas.org.
Abbreviations: Ceff, effective concentration; CD, circular dichroism.

OPTIMIZING THE STABILITY OF SINGLE-CHAIN PROTEINS BY LINKER LENGTH AND COMPOSITION MUTAGENESIS 5929


Tris·HCl (pH 7.5 at 25°C), 250 mM KCl, and 0.1 mM EDTA (26). Values of ∆Gu and m were obtained by fitting denaturation data to a two-state model by nonlinear least squares methods (26). Effective concentrations were calculated by using the equation Ceff = exp[(m2·∆G1/m1 − ∆G2)/RT], where m1 and ∆G1 are values for the single-chain protein, and m2 and ∆G2 are values for wild-type Arc (1.48 kcal/mol·M and 10.3 kcal/mol, respectively) (26). Stopped-flow kinetic experiments of protein folding and unfolding were monitored by changes in fluorescence at protein concentrations between 1 and 10 µM in the buffer used for stability measurements (26). Unfolding was initiated by urea-jump experiments (mixing ratio 1:10) to yield a final urea concentration of 7 or 9.1 M. Refolding was initiated by mixing protein denatured in 6.0–9.6 M urea with low urea buffer (1:5 ratio) to yield final urea concentrations between 1.0 and 4.5 M. Rate constants were obtained by fitting the kinetic data to single exponentials. In all cases, the residuals of the fits were distributed randomly. For ease of comparison among each library of variants, rates were either measured at a single urea concentration or measured at a series of urea concentrations and extrapolated to this reference concentration by using linear regression of ln(k) vs. [urea] plots (R>0.99).
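The effective-concentration equation above can be evaluated directly. The following is a minimal sketch, assuming a temperature of 25°C, energies in kcal/mol, and a 1 M standard state, so the result is in molar units. The function name and the example inputs for the single-chain protein are hypothetical; only the wild-type values (1.48 kcal/mol·M and 10.3 kcal/mol) come from the text.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def effective_concentration(dg1: float, m1: float,
                            dg2: float = 10.3, m2: float = 1.48,
                            temp_k: float = 298.15) -> float:
    """Ceff = exp[(m2*dG1/m1 - dG2)/RT], in molar units.

    dg1, m1: unfolding free energy and m-value of the single-chain protein.
    dg2, m2: the same quantities for wild-type Arc (defaults from the text).
    temp_k:  assumed temperature (25 C); not restated next to the equation.
    """
    rt = R_KCAL * temp_k
    return math.exp((m2 * dg1 / m1 - dg2) / rt)


if __name__ == "__main__":
    # Hypothetical single-chain protein with the wild-type m-value.
    ceff = effective_concentration(dg1=8.4, m1=1.48)
    print(f"Ceff = {ceff * 1e3:.0f} mM")
```

Because Ceff depends exponentially on the free-energy term, a change of roughly 1.4 kcal/mol in ∆G1 shifts Ceff by about one order of magnitude at 25°C, which is why the variants described here span such a wide concentration range.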

FIG. 1. (A) Tandem copies of the arc gene connected by DNA encoding a linker region comprise the gene for single-chain Arc repressor. (B) One model of how a linker might connect the two subunits (colored gray and white) of single-chain Arc. The positions of the N and C termini are indicated. Prepared using MOLSCRIPT (34) and coordinates of wild-type Arc (33).

RESULTS

Variation of Linker Length. A library of single-chain arc genes with linkers composed of Gly, Ser, and Thr and lengths varying from 3 to 59 aa was constructed (Fig. 3A). The fraction of Gly in different linkers ranges from 66 to 80%. The linkers and corresponding proteins are named LLX and Arc-LLX-Arc (Length Library, X=number of residues), respectively. No intracellular expression of the Arc-LL8-Arc protein was detected. Arc-LL3-Arc expressed to high levels, but monomers, dimers, and higher-order oligomers were observed following SDS electrophoresis and Western analysis. This behavior may indicate “cross-folding,” as has been observed with single-chain antibodies that have very short linkers (27, 28). The remaining 13 proteins in this library were all expressed at high levels and electrophoresed as monomers. The Arc-LLX-Arc variants were tested for repression of transcription of the Pant promoter in E. coli strain UA2F, using resistance to streptomycin as an assay of biological activity (24). Arc-LLX-Arc proteins with linkers containing 13 or more residues had wild-type activities. Arc-LL11-Arc was partially active; single-chain molecules with the LL3, LL8, and LL9 linkers were inactive. Modeling studies show that connecting the Arc subunits with linkers shorter than 13 residues would require the linker to cross the DNA-binding surface of the protein, distortion of the structure, or both.

Single-chain Arcs with linkers LL9–LL59 were purified for biophysical characterization. All of these single-chain proteins had CD and fluorescence spectra similar to wild-type Arc. Arc-LL11-Arc, Arc-LL19-Arc, and Arc-LL31-Arc were analyzed by analytical ultracentrifugation and found to be monomeric at concentrations between 10 and 100 µM (data not shown). Proteins containing the three longest linkers (LL47, LL51, and LL59) tended to precipitate at concentrations >100 µM, possibly because of aggregation caused by cross-folding of the Arc subunits.

The thermodynamic stabilities of Arc-LLX-Arc proteins with linkers from 9 to 57 residues were determined by urea denaturation studies, revealing that the 19-residue linker provides maximal stability. As shown in Fig. 2 for a subset of these proteins, there are large changes in the concentration of urea required for denaturation of proteins with different linker lengths, but the curves are roughly parallel, indicating that the denaturant m-values (variation of ∆Gu with urea) are similar. Fig. 3B shows the variations of ∆Gu and Ceff with linker length. For linkers from 9 to 19 residues, stability of the single-chain protein increased with length. Arc-LL9-Arc was the least stable (∆Gu ≈ 3 kcal/mol; Ceff ≈ 6 µM) and Arc-LL19-Arc was the most stable (∆Gu = 8.4 kcal/mol; Ceff = 80 mM) of the proteins examined. Increases in linker length past 19 residues resulted in decreasing stability until a plateau was reached at ≈4.5 kcal/mol (Ceff ≈ 150 µM) for linkers between 47 and 59 residues.
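The link between ∆Gu, the m-value, and the position of a denaturation curve can be sketched numerically. The example below assumes the common linear-extrapolation form of the two-state model for a unimolecular (single-chain) protein, in which the unfolding free energy falls linearly with urea concentration. The text states only that a two-state model was fitted, so this functional form, the function names, and the m-value of 1.5 kcal/mol·M are illustrative assumptions.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def fraction_unfolded(urea_m: float, dg_u: float, m_value: float,
                      temp_k: float = 298.15) -> float:
    """Two-state fraction unfolded for a monomeric protein.

    Assumes dG(urea) = dg_u - m_value * urea_m (linear extrapolation),
    with dg_u in kcal/mol and m_value in kcal/(mol*M).
    """
    dg = dg_u - m_value * urea_m
    k_unfold = math.exp(-dg / (R_KCAL * temp_k))
    return k_unfold / (1.0 + k_unfold)


def midpoint_urea(dg_u: float, m_value: float) -> float:
    """Urea concentration at which half of the protein is unfolded."""
    return dg_u / m_value


if __name__ == "__main__":
    m_assumed = 1.5  # hypothetical m-value shared by all variants
    for dg_u in (3.0, 4.5, 8.4):  # stabilities quoted in the text
        cm = midpoint_urea(dg_u, m_assumed)
        print(f"dGu = {dg_u} kcal/mol -> midpoint at {cm:.1f} M urea")
```

With a shared m-value, changing ∆Gu shifts the midpoint along the urea axis without changing the steepness of the transition, which is the sense in which roughly parallel curves indicate similar m-values.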

The linker-dependent changes in stability arise from changes in both the folding and unfolding rates, as measured in urea-jump, stopped-flow, kinetic experiments. Fig. 3 C and D show that both the folding and unfolding rate constants vary significantly as the linker length is changed. In 7 M urea, Arc-LL9-Arc unfolds with a rate constant (ku) of ≈3,000 s⁻¹. As the linker length is increased from 9 to 19, there is a roughly exponential decrease in ku that spans 3–4 orders of magnitude and reaches a value of ≈1 s⁻¹ for Arc-LL19-Arc. Changes in linker length between 19 and 59 residues do not change ku appreciably. Thus, linkers shorter than 19 residues reduce the

FIG. 2. Linker length has large effects on the stability of single-chain Arc to urea denaturation. Data are shown for linkers LL9, LL11, LL17, LL19, LL31, and LL47, whose sequences are listed in Fig. 3A. Fraction unfolded was calculated by fitting plots of CD ellipticity (234 nm) vs. urea concentration to a two-state unfolding transition. The solid lines represent the best theoretical fits of the experimental data.


free energy barrier between the native state and the transition state.

FIG. 3. Properties of linker-length variants of single-chain Arc. (A) Linker sequences. (B) Equilibrium stability and effective concentration vary with linker length. Error bars indicate one SD from three independent experiments. (C) Folding rates in 2 M urea. (D) Unfolding rates in 7 M urea. Experimental conditions: protein 1–10 µM, 25°C, 50 mM Tris·HCl (pH 7.5), 250 mM KCl, and 0.1 mM EDTA.

The refolding rate (kf) in 2 M urea has a maximum value of ≈1,000 s⁻¹ for the Arc-LL13-Arc protein. Decreasing the linker by four residues to a length of nine causes a 30-fold decrease in the folding rate. As the linker length is increased from 13 to 47 residues, the refolding rate also decreases. Over this range, there is a roughly exponential decrease in kf that spans nearly four orders of magnitude. Little change in kf is seen for linkers between 47 and 59 residues. These results show that linker length can have large effects on the free energy difference between the denatured state and the transition state. Moreover, the length optima for equilibrium stability (19 residues), refolding (13 residues), and unfolding (19–59 residues) are different. The 19-residue linker provides the greatest equilibrium stability because it is the best compromise between reasonably fast refolding and slow unfolding.
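Rates quoted at a single reference urea concentration, such as those above, may come from extrapolation of ln(k) vs. [urea] as described in Materials and Methods. The sketch below shows one way such an extrapolation could be carried out with an ordinary least-squares line. The function names are hypothetical and the data points are synthetic, generated from an arbitrary straight line purely for illustration; they are not measurements from this study.

```python
import math


def fit_ln_k(urea: list[float], rates: list[float]) -> tuple[float, float]:
    """Least-squares line through (urea, ln k); returns (slope, intercept)."""
    ys = [math.log(k) for k in rates]
    n = len(urea)
    mean_x = sum(urea) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in urea)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(urea, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rate_at(reference_urea: float, urea: list[float],
            rates: list[float]) -> float:
    """Extrapolate (or interpolate) the rate constant to reference_urea."""
    slope, intercept = fit_ln_k(urea, rates)
    return math.exp(intercept + slope * reference_urea)


if __name__ == "__main__":
    # Synthetic refolding data: ln k = 9.0 - 1.2 * [urea] (made-up line).
    urea_points = [3.0, 3.5, 4.0, 4.5]
    k_points = [math.exp(9.0 - 1.2 * u) for u in urea_points]
    k_ref = rate_at(2.0, urea_points, k_points)
    print(f"extrapolated k at 2.0 M urea: {k_ref:.0f} per second")
```

The extrapolation is only as reliable as the linearity of ln(k) in urea over the extrapolated range, which is presumably why the authors report the correlation of their regression plots.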

Effects of Linker Composition. To assess the effects of varying the number of glycines in the linker, the length of the linker was fixed at 19 residues and 16 internal positions were randomized between Ala and Gly (ALX library) or between Ser and Gly (SLX library) by using the strategy described in Materials and Methods. For these experiments, the libraries were first selected for Arc repressor activity in vivo and then the sequences of individual members were determined. Sixteen proteins comprise the ALX library; the linkers in these proteins contain from 3 to 15 alanines (Fig. 4A). Ten proteins, with 3–11 serines in the linker region, comprise the SLX library (Fig. 5A). All of the Arc-ALX-Arc and Arc-SLX-Arc proteins were expressed at high levels, were purified, and had CD and fluorescence spectra similar to wild-type Arc. In the ALX library, variants with eight or more linker alanines showed some tendency to aggregate during purification and handling but were monomeric at concentrations of 1–20 µM as judged by analytical ultracentrifugation and the concentration independence of equilibrium stability and refolding rates. All other proteins in the ALX and SLX libraries were highly soluble.

The number of non-glycine residues in the 19-residue linker has a significant effect on the equilibrium stability of proteins in both the ALX and SLX libraries, as determined by urea denaturation. In the ALX library (Fig. 4 A and B), Arc-AL11-Arc, which contains 11 alanines and 5 glycines in the randomized portion of the linker, has the maximum stability (∆Gu ≈ 11 kcal/mol; Ceff ≈ 8 M). Arc-AL3-Arc, with 3 alanines and 13 glycines in the randomized region of the linker, is far less stable (∆Gu ≈ 3 kcal/mol; Ceff ≈ 10 µM), suggesting that too much linker flexibility is detrimental to stability. Fig. 4B shows, however, that stability also decreases when the number of alanines is increased past the optimum value of 11, indicating that linkers that are too inflexible also limit protein stability. The same general trends are observed in the SLX library; proteins with too many or too few glycines are significantly less stable than Arc-SL7-Arc (∆Gu ≈ 7 kcal/mol; Ceff ≈ 7 mM). There are, however, two significant differences between the ALX and SLX results. Maximum stability occurs for a protein containing eight glycines in the randomized portion of the linker in the SLX library but for a protein containing only five glycines in this region in the ALX library. Moreover, the stabilities of the most stable variants in each library also differ significantly; Arc-AL11-Arc has an effective concentration that is 1,000-fold greater than Arc-SL7-Arc. We interpret these differences as indicating that the identity of the non-glycine residues in the linker is as important as the number of these residues in determining stability. By contrast, the positions of the glycine and non-glycine residues in the randomized portion of the linker seem to be unimportant. Five pairs of variants in the ALX library and three pairs in the SLX library have the same composition but different sequences. In each of these cases, the stabilities of these variants (indicated by open and closed symbols in Figs. 4B and 5B) were found to be the same within experimental error.
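The 1,000-fold difference in effective concentration between Arc-AL11-Arc and Arc-SL7-Arc can be checked against the quoted stabilities (≈11 vs. ≈7 kcal/mol). The sketch below assumes that both proteins share the same m-value, in which case the effective-concentration equation from Materials and Methods reduces to a simple ratio rule. The function name and the temperature of 25°C are assumptions made for illustration.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def ddg_from_ceff_ratio(ratio: float, temp_k: float = 298.15) -> float:
    """Free-energy difference (kcal/mol) implied by a ratio of Ceff values.

    Assumes equal m-values, so that Ceff_a / Ceff_b = exp(ddG / RT).
    """
    return R_KCAL * temp_k * math.log(ratio)


if __name__ == "__main__":
    ddg = ddg_from_ceff_ratio(1000.0)
    print(f"1,000-fold in Ceff corresponds to {ddg:.1f} kcal/mol")
```

Under these assumptions a 1,000-fold ratio corresponds to about 4 kcal/mol, in line with the approximate difference between the two reported ∆Gu values.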

Another significant difference between the ALX and SLX libraries is observed in the unfolding kinetics (Figs. 4D and 5D). In the ALX library, the unfolding rate of different variants only changes by a factor of 20. In the SLX library, the unfolding rates change by >1,000-fold. In addition, the shapes


of these plots are very different. The ALX data are concave upward, with a minimum occurring for the protein with seven alanines and eight glycines in the randomized portion of the linker. In the SLX library, by contrast, ku decreases exponentially with the number of serines. The rate constants for refolding in the ALX library change by more than five orders of magnitude, reaching a maximum for variants with 11 or 12 alanines in the randomized part of the linker (Fig. 4C). Because changes in the unfolding rate are small for the ALX proteins, the changes in equilibrium stability arise almost exclusively from changes in the refolding rate. In the SLX library, variants differ over a 300-fold range in refolding rates with a maximum between four and seven serines. Because

FIG. 4. Properties of ALX variants with 19-residue linkers differing in Ala/Gly composition. (A) Linker sequences. (B) Equilibrium stability and effective concentration vary with number of alanines. For compositional isomers, closed and open symbols represent “a” and “b” variants, respectively. Error bars indicate one SD from three independent experiments. (C) Folding rates in 4.5 M urea. (D) Unfolding rates in 9.1 M urea. See Fig. 3 for conditions.

FIG. 5. Properties of SLX variants with 19-residue linkers differing in Ser/Gly composition. (A) Linker sequences. (B) Equilibrium stability and effective concentration vary with number of serines. For compositional isomers, closed and open symbols represent “a” and “b” variants, respectively. Error bars indicate one SD from three independent experiments. (C) Folding rates in 2.25 M urea. (D) Unfolding rates in 9.1 M urea. See Fig. 3 for conditions.


much larger changes are seen in the unfolding rates, the changes in equilibrium stability for the SLX proteins are dominated by the changes in unfolding kinetics. These results emphasize once again that the chemical identity of the non-glycine residues in the linker can have a profound effect on the biophysical properties of the single-chain proteins.

DISCUSSION

Linker length and composition exert a surprisingly large influence on the stability of single-chain Arc repressor. In the LLX linker length library, the most stable protein has a linker of 19 residues, and adding or deleting a few amino acids decreases stability (Fig. 3B). These length effects on stability arise from changes in the folding and unfolding rates. In the regime from 59 to 13 residues, shortening the linker accelerates folding. This observation is explained most simply if the denatured subunit domains are constrained to smaller and smaller regions of conformation space by shorter linkers and thus require less random sampling before essential collisions required for folding occur. We note, however, that the length dependence of the stability of single-chain Arc variants in this regime is significantly steeper than for loop-length variants of single-chain Rop (29) and is modeled poorly by simple, random walk, entropic considerations (30). As the linker length decreases from 13 to 11 to 9 residues, there is a decrease in the folding rate of the corresponding Arc-LLX-Arc protein. At some point, the linkers must become too short to connect the subunits in the native conformation without strain. In fact, in the linker length regime from 19 to 9 residues, the unfolding rates of the corresponding Arc-LLX-Arc proteins increase exponentially as the linkers become shorter, suggesting that shorter tethers in this length range introduce more and more strain into the native structure. Presumably, proteins with the LL17, LL15, and LL13 linkers do not show decreased folding rates because of compensating changes in conformational search efficiency.

Glycine is generally used in designed linkers because the absence of a β-carbon permits the polypeptide backbone to access dihedral angles that are energetically forbidden for other amino acids (31). Thus, a glycine-rich linker will be more flexible than a linker of comparable length composed of non-glycine residues. Our results, however, indicate that too much linker flexibility is detrimental to single-chain protein stability. In the ALX (alanine/glycine) library, maximum stability was observed when the 16-residue randomized region contained 11 alanines and 5 glycines. In the SLX (serine/glycine) library, the most stable protein had seven serines and nine glycines in the randomized portion of the linker. In both libraries, plots of stability vs. the number of non-glycine residues are relatively regular and proteins with the same linker compositions have comparable stabilities (Figs. 4B and 5B). Both observations suggest that it is the composition rather than the sequence of the linker that is important in determining stability. A single exception to this generalization is provided by Arc-LL19-Arc and Arc-SL3-Arc, which have the same composition but stabilities differing by 3.4 kcal/mol. The first three residues of the linker are Gly-Thr-Ser in Arc-SL3-Arc, which has lower stability, and Gly-Gly-Gly in Arc-LL19-Arc, suggesting that the conformational flexibility imparted by glycine may be important at the junction between the C terminus of the first subunit and the N terminus of the linker.

In the ALX library, the main effects of alanine composition on stability result from changes in the refolding rate. For example, as the number of alanines in the linker increases from 3 to 11, the folding rates of the corresponding proteins increase by 30,000-fold. Alanine restricts the number of allowed conformations of the linker compared with glycine and, in this length regime, probably accelerates the conformational search that occurs during folding. Increasing the number of alanines to 14 or 15 then reduces the folding rate, probably because these linkers become too inflexible. When serine is substituted for glycine, there are also effects on the refolding rate but with several differences: the optimal number of serines is smaller than the optimal number of alanines (7 Ser vs. 11 Ala), the difference between the fastest and slowest folders is smaller (≈2,000-fold for SLX vs. ≈30,000-fold for ALX), and the maximum folding rates are different (in 2.25 M urea, the fastest ALX protein folds ≈250 times faster than the fastest SLX protein). Clearly, alanine and serine affect linker flexibility in rather different ways.

Large differences between alanine and serine are also apparent when comparing effects on the unfolding rate. As the number of serines in the linker increases, the unfolding rate continues to decrease over a 5,000-fold range (Fig. 5D). By contrast, in the alanine library, the minimum unfolding rate is observed for a protein with seven alanines and the total change between the slowest and fastest unfolders is only 15-fold. We presume that the ability of serine to form hydrogen bonds allows formation of new stabilizing interactions in the native state, but whether these interactions are within the linker or involve interactions between the linker and the body of the single-chain protein is unknown. Because alanines in the linker primarily affect folding rates whereas serine has the largest effects on unfolding rates, it seems possible that optimizing the composition of Gly, Ser, and Ala in a linker library might produce single-chain molecules with even greater stabilities than those described here. Preliminary studies also suggest that the effects of length and composition may be interdependent. For example, linkers of different lengths may have different optimal compositions.

Variations in linker length or composition caused no significant changes in repressor activity in vivo except in proteins with linkers shorter than 11 residues. In gel mobility-shift assays, Arc-LL19-Arc and Arc-AL11-Arc, which have 19-residue linkers, bound operator DNA as strongly as wild-type Arc dimers (data not shown). In earlier work, however, we found that Arc-L1-Arc (which is identical to Arc-LL15-Arc) had a 10-fold enhanced affinity for operator DNA (9, 26). In single-chain Arc, the linker connects the N-terminal arm of the second subunit to the C terminus of the first subunit; in wild-type Arc, this N-terminal arm is disordered in solution (32) but folds against the operator in the protein-DNA complex (33). The L1/LL15 linker may increase operator affinity by helping to restrict the conformation of the arm in solution, thereby reducing the entropic penalty for ordering the arm upon DNA binding (9). By this model, lengthening the linker to 19 residues probably reduces constraints on the arm conformation.

In summary, we find that changes in linker length and composition can produce substantial changes in the stability and folding kinetics of single-chain Arc. Poly-glycine linkers maximize the conformational freedom of the polypeptide backbone but do not result in optimal stability. For single-chain or hybrid protein designs that have folding problems, alterations in linker length and/or composition should provide a useful method for increasing stability.

We thank David Goldenberg for helpful discussions. This work was supported by a National Institutes of Health postdoctoral fellowship (to C.R.R.) and by National Institutes of Health Grant AI-15706 (to R.T.S.).

1. Bird, R.E., Hardman, K.D., Jacobson, J.W., Johnson, S., Kaufman, B.M., Lee, S.-M., Lee, T., Pope, S.H., Riordan, G.S. & Whitlow, M. (1988) Science 242, 423–426.
2. Pomerantz, J.L., Sharp, P.A. & Pabo, C.O. (1995) Science 267, 93–96.
3. Predki, P.F. & Regan, L. (1995) Biochem. 34, 9834–9839.


4. Hallewell, R.A., Laria, I., Tabrizi, A., Carlin, C., Getzoff, E.D., Tainer, J.A., Cousens, L.S. & Mullenbach, G.T. (1989) J. Biol. Chem. 264, 5260–5268.
5. Bizub, D., Weber, I.T., Cameron, C.E., Leis, J.P. & Skalka, A.M. (1991) J. Biol. Chem. 266, 4951–4958.
6. Kim, S.-H., Kang, C.-H., Kim, R., Cho, J.M., Lee, Y.-B. & Lee, T.-K. (1989) Protein Eng. 2, 571–575.
7. Liang, H., Sandberg, W.S. & Terwilliger, T.C. (1993) Proc. Natl. Acad. Sci. USA 90, 7010–7014.
8. Toth, M.J. & Schimmel, P. (1986) J. Biol. Chem. 261, 6643–6646.
9. Robinson, C.R. & Sauer, R.T. (1996) Biochem. 35, 109–116.
10. O’Shea, E.K., Rutkowski, R. & Kim, P.S. (1992) Cell 68, 699–708.
11. Pantoliano, M.W., Bird, R.E., Johnson, S., Asel, E.D., Dodd, S.W., Wood, J.F. & Hardman, K.D. (1991) Biochem. 30, 10117–10125.
12. Mallender, W.D. & Voss, E.W., Jr. (1994) J. Biol. Chem. 269, 199–206.
13. Rumbley, C.A., Denzin, L.K., Yantz, L., Tetin, S.Y. & Voss, E.W., Jr. (1993) J. Biol. Chem. 268, 13667–13674.
14. Stemmer, W.P., Morris, S.K. & Wilson, B.S. (1993) BioTechniques 14, 256–265.
15. Lieschke, G.J., Rao, P.K., Gately, M.K. & Mulligan, R.C. (1997) Nat. Biotech. 15, 35–40.
16. Eustance, R.J. & Schleif, R.F. (1996) J. Bacteriol. 178, 7025–7030.
17. Govindaraj, S. & Poulos, T.L. (1996) Protein Sci. 5, 1389–1393.
18. Kortt, A.A., Lah, M., Oddie, G.W., Gruen, C.L., Burns, J.E., Pearce, L.A., Atwell, J.L., McCoy, A.J., Howlett, G.J., Metzger, D.W., et al. (1997) Protein Eng. 10, 423–433.
19. Whitlow, M., Bell, B.A., Feng, S.-L., Filpula, D., Hardman, K.D., Hubert, S.L., Rollence, M.L., Wood, J.F., Schott, M.E., Milenic, D.E., et al. (1993) Protein Eng. 6, 989–995.
20. Deonarain, M.P., Rowlinson-Busza, G., George, A.J.T. & Epenetos, A.A. (1997) Protein Eng. 10, 89–98.
21. Tang, Y., Jiang, N., Parakh, C. & Hilvert, D. (1996) J. Biol. Chem. 271, 15682–15686.
22. Newton, D.L., Xue, Y., Olson, K.A., Fett, J.W. & Rybak, S.M. (1996) Biochem. 35, 545–553.
23. Huston, J.S., McCartney, J., Tai, M.-S., Mottola-Hartshorn, C., Jin, D., Warren, F., Keck, P. & Oppermann, H. (1993) Int. Rev. Immunol. 10, 195–217.
24. Bowie, J.U. & Sauer, R.T. (1989) Proc. Natl. Acad. Sci. USA 86, 2152–2156.
25. Milla, M.E., Brown, B.M. & Sauer, R.T. (1993) Protein Sci. 2, 2198–2205.
26. Robinson, C.R. & Sauer, R.T. (1996) Biochem. 35, 13878–13884.
27. Poljak, R.J. (1994) Structure 2, 1121–1123.
28. Perisic, O., Webb, P.A., Holliger, P., Winter, G. & Williams, R.L. (1994) Structure 2, 1217–1226.
29. Nagi, A.D. & Regan, L. (1997) Fold. Des. 2, 67–75.
30. Chan, H.S. & Dill, K.A. (1988) J. Chem. Phys. 90, 492–509.
31. Ramachandran, G.N. & Sasisekharan, V. (1968) Adv. Protein Chem. 23, 283–437.
32. Breg, J.N., van Opheusden, J.H.J., Burgering, M.J.M., Roelens, R. & Kaptein, R. (1990) Nature (London) 346, 586–589.
33. Raumann, B.E., Rould, M.A., Pabo, C.O. & Sauer, R.T. (1994) Nature (London) 367, 754–757.
34. Kraulis, P.J. (1991) J. Appl. Cryst. 24, 946–950.


Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5935–5941, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

Architecture and mechanism of the light-harvesting apparatus of purple bacteria

XICHE HU, ANA DAMJANOVIĆ, THORSTEN RITZ, AND KLAUS SCHULTEN

Beckman Institute and Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL 61801

ABSTRACT Photosynthetic organisms fuel their metabolism with light energy and have developed for this purpose an efficient apparatus for harvesting sunlight. The atomic structure of the apparatus, as it evolved in purple bacteria, has been constructed through a combination of x-ray crystallography, electron microscopy, and modeling. The detailed structure and overall architecture reveal a hierarchical aggregate of pigments that utilizes, as shown through femtosecond spectroscopy and quantum physics, elegant and efficient mechanisms for primary light absorption and transfer of electronic excitation toward the photosynthetic reaction center.

The prevalent color green in Earth’s biosphere is testimony to the important role that chlorophylls play in harnessing the energy of the Sun to fuel the metabolism of photosynthetic life forms. Chlorophylls are assisted in their light-harvesting role by carotenoids, also widely known through their coloration of petals and fruits in plants. Photosynthetic organisms have evolved intricate aggregates of chlorophylls and carotenoids for efficient light harvesting and exploit in subtle ways the laws of quantum mechanics. This role of chlorophylls and carotenoids has emerged in full detail only recently, when the atomic structures of proteins involved in bacterial photosynthetic light harvesting were solved by a combination of x-ray crystallography, electron microscopy, and molecular modeling.

However, the conceptual foundation for our present understanding of light harvesting was laid long ago, when Emerson and Arnold demonstrated that it required hundreds of chlorophylls to reduce one molecule of CO2 under saturating flash light intensity (1, 2). To explain the cooperative action of these chlorophylls, Emerson and Arnold postulated that only very few chlorophylls in the primary reaction site, termed the photosynthetic reaction center (RC), directly take part in photochemical reactions; most chlorophylls serve as light-harvesting antennae by capturing the sunlight and funneling electronic excitation toward the RC. This notion gave rise to the definition of the photosynthetic unit (PSU) as an ensemble of an RC with associated light-harvesting complexes containing up to 250 chlorophylls, and became widely accepted only when Duysens carried out a critical experiment in which energy transfer between different chlorophylls was observed (3).

A wealth of accumulated evidence proves that the organization of PSUs, to surround an RC with aggregates of chlorophylls and associated carotenoids, is universal in both photosynthetic bacteria and higher plants (2, 4–6).

Of the known photosynthetic systems, the PSU of purple bacteria is the most studied and best characterized. Fig. 1 depicts schematically the intracytoplasmic membrane of purple bacteria with its primary photosynthetic apparatus. In the PSU, an array of light-harvesting complexes captures light and transfers the excitation energy to the photosynthetic RC. This article focuses on the primary processes of light harvesting and electronic excitation transfer that occur in the PSU, and describes the role of molecular modeling in elucidating the underlying mechanisms.

In most purple bacteria, the photosynthetic membranes contain two types of light-harvesting complexes, light-harvesting complex I (LH-I) and light-harvesting complex II (LH-II) (7). LH-I directly surrounds the RCs (8, 9), whereas LH-II is not in direct contact with the RC but transfers energy to the RC through LH-I (10, 11). For some bacteria, such as Rhodopseudomonas (Rps.) acidophila and Rhodospirillum (Rs.) molischianum strain DSM 120 (12), there exists a third type of light-harvesting complex, LH-III. A 1:1 stoichiometry exists between the RC and LH-I (9); the number of LH-IIs and LH-IIIs varies according to growth conditions such as light intensity and temperature (13).

Purple bacteria absorb light in a spectral region complementary to that of plants and algae, mainly at wavelengths of about 500 nm through carotenoids and above 800 nm through bacteriochlorophylls (BChls). Fig. 2 shows the energy levels for the key electronic excitations in the PSU. There exists a pronounced energetic hierarchy in the light-harvesting system: LH-III absorbs light at the highest energy (800 and 820 nm); the LH-II complex, which surrounds LH-I, absorbs maximally at 800 nm and 850 nm; and LH-I, which in turn surrounds the RC, absorbs at a lower energy (875 nm) (11). The energy cascade serves to funnel electronic excitations from the LH-IIIs and LH-IIs through LH-I to the RC. Time-resolved picosecond and femtosecond spectroscopy revealed that excitation transfer within the PSU occurs on a subpicosecond time scale and at near-unit (95%) efficiency (14, 15).
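The energetic hierarchy is easier to compare when the absorption maxima are converted from wavelength to energy. The following Python sketch applies the standard relations ν̃ = 10⁷/λ (in cm⁻¹, with λ in nm) and E = hc/λ to the band positions quoted above. The function names and the band labels in the list are illustrative choices, not taken from the paper.

```python
# Convert absorption maxima (nm) to wavenumbers (cm^-1) and photon energies (eV).
HC_EV_NM = 1239.841984  # h*c in eV*nm (rounded CODATA value)


def wavenumber_cm1(wavelength_nm: float) -> float:
    """Wavenumber in cm^-1 for a wavelength given in nm."""
    return 1.0e7 / wavelength_nm


def photon_energy_ev(wavelength_nm: float) -> float:
    """Photon energy in eV for a wavelength given in nm."""
    return HC_EV_NM / wavelength_nm


# Band positions quoted in the text; the labels are illustrative.
BANDS_NM = [
    ("carotenoid", 500.0),
    ("LH-II / LH-III, 800 nm", 800.0),
    ("LH-III, 820 nm", 820.0),
    ("LH-II, 850 nm", 850.0),
    ("LH-I, 875 nm", 875.0),
]

if __name__ == "__main__":
    for label, nm in BANDS_NM:
        print(f"{label:24s} {wavenumber_cm1(nm):8.1f} cm^-1  {photon_energy_ev(nm):.3f} eV")
```

Longer wavelengths map to lower energies, so the sequence 800 nm, 850 nm, 875 nm corresponds to a downhill path from LH-II through LH-I toward the RC.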

Today, structures of the major components of the bacterial photosynthetic apparatus are available at atomic resolution. Structures of the RC are known for Rps. viridis (16) as well as for Rhodobacter (Rb.) sphaeroides (17). Recently, high resolution crystal structures of LH-II have been determined for Rps. acidophila (18) and for Rs. molischianum (19). Based on a high degree of homology of the αβ-heterodimer of LH-I from Rb. sphaeroides to that of LH-II of Rs. molischianum (12, 20), an atomic structure for LH-I of Rb. sphaeroides has been modeled (21).

Accordingly, a structural model for the bacterial PSU has been established and consists of LH-IIs, LH-I, and the RC; this model provides detailed knowledge of the organization of chromophores in the photosynthetic membrane and opens a door to the study of excitation transfer in the PSU based on a priori principles.

© 1998 by The National Academy of Sciences 0027–8424/98/955935–7$2.00/0
PNAS is available online at http://www.pnas.org.

Abbreviations: RC, reaction center; PSU, photosynthetic unit; LH-I and LH-II, light-harvesting complexes I and II; BChl, bacteriochlorophyll; PBS, phycobilisome; PCP, peridinin-chlorophyll-protein.

Structure of Light-Harvesting Complexes

FIG. 1. Schematic representation of the photosynthetic apparatus in the intracytoplasmic membrane of purple bacteria. The RC (red) is surrounded by the light-harvesting complex I (LH-I, green) to form the LH-I-RC complex, which is surrounded by multiple light-harvesting complexes LH-II (green), forming altogether the PSU. Photons are absorbed by the light-harvesting complexes and excitation is transferred to the RC initiating a charge (electron-hole) separation. The RC binds quinone QB, reduces it to hydroquinone QBH2, and releases the latter. QBH2 is oxidized by the bc1 complex, which uses the exothermic reaction to pump protons across the membrane; electrons are shuttled back to the RC by the cytochrome c2 complex (blue) from the ubiquinone-cytochrome bc1 complex (yellow). The electron transfer across the membrane produces a large proton gradient that drives the synthesis of ATP from ADP by the ATPase (orange). Electron flow is represented in blue, proton flow in red, and quinone flow, likely confined to the intramembrane space, in black.

LH-II. The structure of LH-II from Rs. molischianum has been determined to 2.4 Å resolution (19) and is shown in Fig. 3a. The complex is an octameric aggregate of αβ-heterodimers; each heterodimer contains a pair of short peptides (α- and β-apoproteins) noncovalently binding three BChl a molecules and one lycopene (a specific type of carotenoid). Presumably, there exists a second lycopene for each αβ-heterodimer. The electron density map indeed contains a stretch of assignable density, but the stretch is not long enough to positively resolve the entire lycopene (19). Two concentric cylinders of α-helices, with the α-apoproteins inside and the β-apoproteins outside, form a scaffold for BChls and lycopenes. Fig. 3b depicts the 24 BChl molecules and 8 lycopene molecules in LH-II with all other components stripped away. Sixteen B850 BChl molecules form a continuous overlapping ring of 23 Å radius (based on central Mg atoms of BChls) with each BChl oriented perpendicular to the membrane plane. The Mg–Mg distance between neighboring B850a and B850b BChls is 9.2 Å (within an αβ-heterodimer) and between B850a′ and B850b is 8.9 Å (between heterodimers). Eight B800 BChls, forming another ring of 28 Å radius, are arranged with their tetrapyrrole rings nearly parallel to the membrane plane and exhibit a Mg–Mg distance of 22 Å between neighboring BChls, i.e., the BChls are coupled only weakly. The ligation sites for the B850 BChls are α-His-34 and β-His-35, and the B800 BChls ligate to α-Asp-6. Eight lycopene molecules span the transmembrane region; each makes contact with a B800 BChl and a B850a BChl.
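As a quick consistency check on the quoted geometry, each ring can be idealized as a regular polygon, in which case the distance between neighbors is the chord 2R sin(π/N). This is a simplification, since the real B850 ring is slightly dimerized, and the helper name below is hypothetical. Under that assumption the sketch gives about 9.0 Å for 16 BChls on a 23 Å ring and about 21.4 Å for 8 BChls on a 28 Å ring, close to the Mg–Mg distances of 8.9–9.2 Å and 22 Å quoted in the text.

```python
import math


def neighbor_distance(radius: float, n_sites: int) -> float:
    """Chord between adjacent vertices of a regular n-gon inscribed in a circle."""
    return 2.0 * radius * math.sin(math.pi / n_sites)


# Radii (in angstroms) and site counts are the values quoted in the text.
b850 = neighbor_distance(23.0, 16)
b800 = neighbor_distance(28.0, 8)
print(f"B850 neighbors: {b850:.2f} A, B800 neighbors: {b800:.2f} A")
```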

FIG. 2. Energy levels of the electronic excitations in the PSU of BChl a containing purple bacteria. The diagram illustrates a funneling of excitation energy toward the photosynthetic RC. The dashed lines indicate (vertical) intracomplex excitation transfer, and the solid lines (diagonal) indicate intercomplex excitation transfer. LH-I exists in all purple bacteria; LH-II exists in most species; LH-III arises in certain species only.

FIG. 3. The octameric LH-II complex from Rs. molischianum (19). (a) The α-helical segments are represented as cylinders with the α-apoproteins (inside) in blue and the β-apoproteins (outside) in magenta. The BChl molecules are in green with phytyl tails truncated for clarity. The lycopenes are in yellow. (b) Arrangement of chromophores with BChls represented as squares, and with carotenoids (lycopenes) in a licorice representation. Bars connected with the BChls represent the Qy transition dipole moments as defined by the vector connecting the N atom of pyrrole I and the N atom of pyrrole III (22). Representative distances between central Mg atoms of B800 BChl and B850 BChl are given in Å. The B850 BChls bound to the α-apoprotein and the β-apoprotein are denoted as B850a and B850b, respectively; BChl B850a′ is bound to the (left) neighboring heterodimer.

It is remarkable that LH-II results from the self-aggregation of a large number of identical, noncovalently bonded transmembrane helices, BChls, and carotenoids. With its simple, symmetric architecture, LH-II constitutes an ideal model system for studying aggregate formation and adhesive interactions of proteins. Mechanical models reveal perfect self-complementarity of the αβ-heterodimers that interlock with each other to form a circular aggregate (23).

LH-I-RC Complex. LH-I of Rb. sphaeroides has been modeled in ref. 21 as a hexadecamer of αβ-heterodimers; the modeling exploited a close homology of these heterodimers to those of LH-II of Rs. molischianum. The resulting LH-I structure yields an electron density projection map that is in agreement with an 8.5 Å resolution electron microscopy projection map for the highly homologous LH-I of Rs. rubrum (24). The LH-I complex contains a ring of 32 BChls referred to as B875 BChls according to their main absorption band. The Mg–Mg distance between neighboring B875 BChls is 9.2 Å within the αβ-heterodimer and 9.3 Å between neighboring heterodimers.

The modeled LH-I has been docked to the photosynthetic RC of Rb. sphaeroides by means of a constrained conformational search (21), employing for the latter the structure reported in ref. 17. Fig. 4a presents the LH-I-RC complex. The arrangement of the BChls in the LH-I-RC complex is depicted in Fig. 4b. One can discern the ring of B875 BChls of LH-I that surrounds the RC special pair (PA and PB) and the so-called accessory BChls (BA, BB). The closest distance between the central Mg atom of the RC’s special pair (BChls PA, PB) and the Mg atom of the BChls in LH-I is 42.6 Å. The distance between the Mg atom of the accessory BChl (BChls BA, BB) and the LH-I BChls is shorter, the nearest distance measuring 35.7 Å. Rb. sphaeroides contains an additional PufX gene of unknown function. It has been suggested that the PufX protein may replace one or more αβ-heterodimers of LH-I to open up the circular ring shown in Fig. 4a and thereby facilitate the flow of quinones (QB/QBH2) between the RC and the cytochrome bc1 complex (see Fig. 1) (4, 9).
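The size of the B875 ring can be estimated from the Mg–Mg spacings quoted for LH-I by treating the 32 BChls as a regular polygon, so that R = s / (2 sin(π/N)). Averaging the 9.2 Å and 9.3 Å spacings into a single value s is an assumption made here for simplicity, and the helper name is hypothetical. The resulting radius of roughly 47 Å is comparable to the 42.6 Å closest approach between the special pair and the LH-I BChls, which is what one would expect if the RC pigments sit near, though not exactly at, the center of the ring.

```python
import math


def ring_radius(mean_spacing: float, n_sites: int) -> float:
    """Circumradius of a regular n-gon with the given side length."""
    return mean_spacing / (2.0 * math.sin(math.pi / n_sites))


# Average of the intra- and inter-heterodimer Mg-Mg distances (angstroms).
mean_spacing = 0.5 * (9.2 + 9.3)
radius_b875 = ring_radius(mean_spacing, 32)
print(f"Estimated B875 ring radius: {radius_b875:.1f} A")
```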

FIG. 4. Structure of the LH-I-RC complex. (a) Side view of the LH-I-RC complex with three LH-I αβ-heterodimers on the front side removed to expose the RC in the interior. The α-helices are represented as cylinders with the L, M, and H subunits of the RC in yellow, red, and gray, and the α-apoprotein and the β-apoprotein of the LH-I in blue and magenta. BChls and bacteriopheophytins are represented as green and yellow squares, respectively. Carotenoids (spheroidenes) are in a yellow licorice representation, and quinone QB is rendered by gray van der Waals spheres. QB shuttles in and out (as QBH2) of the LH-I-RC complex as indicated in Fig. 1. (b) Arrangement of BChls in the LH-I-RC complex. The BChls are represented as squares with B875 BChls of LH-I in green, and the special pair (PA and PB) and the accessory BChls (BA and BB) of the RC in red and blue, respectively; cyan bars represent Qy transition moments of BChls. [Produced with the program VMD (25)].

The PSU. Fig. 5 presents a model of the PSU for Rb. sphaeroides. Only three LH-IIs are shown. The actual photosynthetic apparatus can contain up to about 10 LH-IIs around each LH-I. Because electron microscopy observations suggest that LH-II of Rb. sphaeroides contains nine αβ-heterodimers (J. Olsen, personal communication), instead of eight as in LH-II of Rs. molischianum, LH-II of Rb. sphaeroides, as shown in Fig. 5, has been constructed as a nonamer of αβ-heterodimers by means of homology modeling by using the αβ-heterodimer of LH-II from Rps. acidophila as a template. For this purpose, the modeling protocol developed and applied successfully in refs. 19–21 was used.

Two essential features of the pigment organization of the PSU, as depicted in Fig. 5, are (i) the ring-like aggregates of tightly coupled BChls within LH-I and LH-II, and (ii) the coplanar arrangement of these BChls and of the BChls in the RC. Analysis of the LH-I and LH-II structures as reported in refs. 21 and 26 indicates that each BChl of the B850 ring of LH-II and of the B875 ring of LH-I is noncovalently bound to three side-chain atoms of the α- or β-apoprotein such that the BChls are held in a rigid orientation. The planar organization of the BChls in the PSU is optimal for the transfer of electronic excitation to the RC.

Mechanisms of Excitation Transfer

Photosynthetic bacteria evolved a pronounced energetic hierarchy in the light-harvesting system. The hierarchy, as shown in Fig. 2, furnishes a cascade-like system of excited states that funnels electronic excitation from the outer LH-IIs through LH-I to the RC. The excitation transfer cascading into the RC involves intracomplex and intercomplex processes, defined as excitation transfer within each pigment-protein complex (LH-II, LH-I, RC) and between pigment-protein complexes (LH-II→LH-II, LH-II→LH-I, LH-I→RC), respectively. Intracomplex transfer, for the main part, occurs faster than intercomplex transfer. We will first discuss intercomplex excitation transfer, and then we will describe intracomplex excitation transfer.

FIG. 5. Arrangement of pigment-protein complexes in the modeled bacterial PSU of Rb. sphaeroides. The α-helices are represented as Cα-tracing tubes with α-apoproteins of both LH-I and LH-II in blue and β-apoproteins in magenta, and the L, M, and H subunits of RC in yellow, red, and gray, respectively. All the BChls are in green, and carotenoids are in yellow. [Produced with the program VMD (25)].

Exciton Migration. One of the most intriguing structural features of the bacterial light-harvesting complexes is the circular organization of BChl aggregates (2). To understand the primary processes of light absorption and the subsequent excitation transfer from LH-IIs, through LH-I, to the RC, it is essential to characterize the electronic properties of the excited states of the circular BChl aggregate. The close proximity of the B850 BChls in LH-II implies strong interactions, leading to coherent superpositions, termed excitons (27, 28), of the lower energy excited states of individual BChls, the Qy states, as demonstrated in INDO-CIS level quantum chemical calculations of the complete circular aggregates of 16 B850 BChls and 8 B800 BChls of LH-II from Rs. molischianum (29). As shown in Fig. 6, two bands of excitons arise, with the band splitting reflecting a weakly dimerized form of the aggregate (the BChl-BChl distances alternate slightly along the ring) (26). Only 2 of the 16 exciton states are optically allowed and thus carry an 8-fold enhanced oscillator strength (superradiance); the lowest excited state is optically forbidden and does not fluoresce, which may allow LH-II to preserve excitation energy, though disorder readily confers oscillator strength to this state (26).

FIG. 6. BChl-carotenoid interactions. (Upper Left) Exciton bands of the circular B850 BChl aggregate as determined by quantum chemical (INDO/S) calculations (29) based on coordinates of the crystal structure of LH-II from Rs. molischianum (19). The degenerate states that carry all the oscillator strength are highlighted by thickened lines. (Upper Right) Excitation energies of BChl and carotenoid states in LH-II of Rb. sphaeroides. Solid lines represent spectroscopically measured energy levels. The dashed line indicates the estimated (see refs. 44 and 47) energy for the optically forbidden S1 state of the carotenoid spheroidene. (Lower) Arrangement of spheroidene and the most proximate BChls based on the modeled structure of LH-II from Rb. sphaeroides. Close contacts between BChl and the carotenoid spheroidene are indicated by representative distances (in angstroms).

Other, less extensive calculations, ranging from an effective Hamiltonian representation based on the point dipole treatment (30, 31) to a point monopole treatment (32), and to the quantum mechanical consistent-force-field/π-electron (QCFF/PI) approach (33), yield a similar exciton band structure but differ in detailed exciton levels and band gaps. According to the INDO-CIS calculation (29), the lowest exciton state is significantly lowered in energy through level repulsion with charge resonance states, resulting in an energy gap Δ of 422 cm⁻¹ (see Fig. 6). The optically allowed exciton states should then be populated 9% at thermal equilibrium at room temperature.

To extend the quantum chemical calculations to LH-I and the complete PSU, an effective Hamiltonian Ĥ in the basis of single BChl Qy excitations was established in ref. 26. The matrix elements ⟨j|Ĥ|j+1⟩ describe couplings between neighboring Qy states, assuming values of v1 (v2) for odd (even) j. The diagonal elements ⟨j|Ĥ|j⟩ = ε account for the excitation energy of the Qy state of individual BChls. All other elements of Ĥ are approximated by dipole-dipole coupling terms

$$W_{jk} = C\left[\frac{\vec{d}_j\cdot\vec{d}_k}{r_{jk}^{3}} - \frac{3\,(\vec{r}_{jk}\cdot\vec{d}_j)(\vec{r}_{jk}\cdot\vec{d}_k)}{r_{jk}^{5}}\right],$$

where $\vec{d}_j$ are unit vectors describing the direction of the transition dipole moments of the ground state→Qy state transition of the j-th BChl, $\vec{r}_{jk}$ is the vector connecting the centers of BChl j and BChl k, and $r_{jk} = |\vec{r}_{jk}|$. The adjustable parameters of the effective Hamiltonian were determined in ref. 29 to reproduce the exciton spectrum in Fig. 6: ε = 13,242 cm⁻¹, v1 = 790 cm⁻¹, v2 = 369 cm⁻¹, and C = 505,644 Å³·cm⁻¹. The effective Hamiltonian was extended in ref. 34 to incorporate two exciton states as they arise in pump-probe spectroscopy (35).
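To illustrate how alternating couplings produce two exciton bands, the following Python sketch evaluates a deliberately simplified version of the effective Hamiltonian in which only the nearest-neighbor couplings v1 and v2 are kept. For such a dimerized ring of N heterodimers the levels are known in closed form, E = ε ± sqrt(v1² + v2² + 2 v1 v2 cos k) with k = 2πm/N. Because the long-range dipole-dipole terms are dropped, the numbers are not expected to reproduce the spectrum in Fig. 6 (for example the 422 cm⁻¹ gap above the lowest state). The sketch also includes the standard point-dipole coupling expression with the constant C from the text, shown here only for isolated pairs because the dipole orientations are not listed in the text. All function names are illustrative.

```python
import math

EPS = 13242.0   # site energy, cm^-1 (from the text)
V1 = 790.0      # coupling within a heterodimer, cm^-1 (from the text)
V2 = 369.0      # coupling between heterodimers, cm^-1 (from the text)
C = 505644.0    # dipole coupling constant, A^3 cm^-1 (from the text)
N_DIMERS = 8    # octameric LH-II


def nearest_neighbor_levels(eps, v1, v2, n_dimers):
    """Exciton energies of a ring with alternating couplings v1 and v2.

    Simplification: only nearest-neighbor couplings are kept, so the
    long-range dipole-dipole terms of the full model are ignored.
    """
    levels = []
    for m in range(n_dimers):
        k = 2.0 * math.pi * m / n_dimers
        w = math.sqrt(v1 * v1 + v2 * v2 + 2.0 * v1 * v2 * math.cos(k))
        levels.extend((eps - w, eps + w))
    return sorted(levels)


def dipole_coupling(d_j, d_k, r_jk, c=C):
    """Point-dipole coupling (cm^-1) for unit vectors d_j, d_k and separation r_jk (A)."""
    r = math.sqrt(sum(x * x for x in r_jk))
    dd = sum(a * b for a, b in zip(d_j, d_k))
    rj = sum(a * b for a, b in zip(r_jk, d_j))
    rk = sum(a * b for a, b in zip(r_jk, d_k))
    return c * (dd / r**3 - 3.0 * rj * rk / r**5)


levels = nearest_neighbor_levels(EPS, V1, V2, N_DIMERS)
# Gap between the top of the lower band and the bottom of the upper band.
band_gap = levels[N_DIMERS] - levels[N_DIMERS - 1]
print(f"{len(levels)} levels, lowest {levels[0]:.0f} cm^-1, band gap {band_gap:.0f} cm^-1")
```

In this reduced model the 16 levels fall into two bands of eight separated by 2|v1 − v2|, so the band splitting vanishes when the two couplings are equal, which is the sense in which the splitting reflects the dimerization of the ring.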

The effective Hamiltonian can be applied without further modification to describe the circular aggregate of 32 BChls in LH-I (26). The same characteristics of the exciton bands as in the B850 BChl aggregate of LH-II are found, i.e., the second and the third exciton states carry all the oscillator strength, with the lowest energy excitation state being optically forbidden.

The exciton states in LH-II and LH-I are completely delocalized over the ring-like B850 and B875 aggregates because of the assumption of perfect symmetry, i.e., absence of disorder. It is widely believed that the B850 BChl excited states, despite natural disorder, are delocalized, but the extent of delocalization has been debated (15). The estimate for the number of coherently coupled BChls ranges from two BChl molecules (36) to the entire length of the B850 BChl aggregate (37). In principle, the relative strengths of the disorder and of the coupling between BChls determine the delocalization length. According to the INDO-CIS calculation (29), the effective coupling between nearest neighbor BChls is 790 cm⁻¹ (v1) within the αβ-heterodimer and 369 cm⁻¹ (v2) between the αβ-heterodimers. The effect of static disorder has been modeled in ref. 26 by randomizing the diagonal elements of an effective Hamiltonian. By using a distribution consistent with the inhomogeneous broadening measured by hole-burning spectroscopy, the effect of diagonal disorder on exciton delocalization was found to be noticeable but small.

FIG. 7. Excitation transfer in the bacterial photosynthetic unit. LH-II contains two types of BChls, commonly referred to as B800 (dark blue) and B850 (green), which absorb at 800 nm and 850 nm, respectively. BChls in LH-I absorb at 875 nm and are labeled B875 (green). PA and PB refer to the RC special pair, and BA, BB refer to the accessory BChls in the RC. The figure demonstrates the coplanar arrangement of the B850 BChl ring in LH-II, the B875 BChl ring of LH-I, and the RC BChls PA, PB, BA, BB. [Produced with the program VMD (25)].

It has long been observed that excitation transfer LH-II→LH-I→RC occurs in the PSU in less than 100 ps and with about 95% efficiency (14). In this respect, it is interesting to note that the transition dipole moments of the Qy excitations of the B850 and B875 BChls are all oriented in the two-dimensional plane that encompasses the ring-like BChl aggregates of LH-II, LH-I, and the RC special pair, an arrangement that is optimally attuned to the desired flow of electronic excitation LH-II→LH-I→RC. There are many potential pathways for photons to be absorbed and for the subsequent excitations to reach the RC. A path may begin with absorption of an 800 nm photon by one of the B800 BChls in LH-II (see Fig. 7). At least three sequential steps are required for the B800 excitation to be transferred to the RC: B800 (LH-II)→B850 (LH-II)→LH-I→RC. Time-resolved picosecond and femtosecond spectroscopy revealed that the B800→B850 excitation transfer proceeds within about 700 fs (14, 38). Two-color pump-probe femtosecond measurements determined a time constant of 3–5 ps for the B850→LH-I step (39). The final LH-I→RC transfer step requires about 35 ps (40), i.e., this is the slowest step (11, 14). Intercomplex LH-II→LH-II transfer may occur, but a rate for this process has not yet been determined.
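For a chain of irreversible first-order steps, the mean time to reach the end of the chain is the sum of the individual time constants. Treating B800→B850→LH-I→RC in this way is a simplification that ignores back transfer and competing decay channels, and the names used below are illustrative. With the measured values quoted above, the total comes to roughly 40 ps, consistent with an overall transfer in less than 100 ps and dominated by the final LH-I→RC step.

```python
# Measured time constants in ps, as quoted in the text.
# Each entry is (lower bound, upper bound); the 3-5 ps range is kept as bounds.
STEPS_PS = {
    "B800->B850": (0.7, 0.7),
    "B850->LH-I": (3.0, 5.0),
    "LH-I->RC": (35.0, 35.0),
}


def total_time_bounds(steps):
    """Sum of time constants for a sequential, irreversible chain (ps)."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


low_ps, high_ps = total_time_bounds(STEPS_PS)
slowest = max(STEPS_PS, key=lambda name: STEPS_PS[name][1])
print(f"total: {low_ps:.1f}-{high_ps:.1f} ps, slowest step: {slowest}")
```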

The effective Hamiltonian for LH-II, as described above, was extended in refs. 26 and 34 to describe the exciton system of the entire aggregate shown in Fig. 7. One can determine the transfer rates between the different components, i.e., LH-II→LH-I and LH-I→RC, by using a perturbation scheme (34). The calculated time constants of 3.3 and 65 ps for the excitation transfer processes LH-II→LH-I and LH-I→RC in Rb. sphaeroides, respectively, are in agreement with experimental values of 3–5 ps and 35 ps (39, 40). A startling result from these calculations has been a suggested role of the accessory BChls as mediators of the excitation transfer from LH-I to the RC special pair: the calculated time for LH-I→RC transfer, in the absence of accessory BChls, is about 600 ps, which is an order of magnitude too long compared with observations; the accessory BChls in the RC provide a path for the excitation transfer that bridges the large distance of 42 Å or longer between LH-I BChls and the RC special pair.

Role of B800 BChls and Carotenoids. B800 BChls absorb light in a slightly higher spectral region than the B850 BChls and are oriented such that they absorb in a direction perpendicular to that of the B850 BChls. Quantum chemical calculations in ref. 29 have demonstrated that the B800 BChls are only weakly coupled with each other and with the B850 BChls. The individual B800 BChls transfer the resulting excitation energy to the B850 ring through the so-called Förster mechanism (41–43). The transfer proceeds within 700 fs (38). Quantum mechanical calculations show that this short transfer time results, to a large degree, from the exciton splitting of the accepting B850 exciton levels shown in Fig. 6; the exciton splitting greatly improves the resonance of the excitations of B800 and B850 BChls (44). Carotenoids absorb light at 500 nm into a strongly allowed state and transfer the excitation energy within 200 fs and with nearly 100% efficiency to the Qy exciton states of the B850 ring (45). The question arises by which pathways and by which mechanism such an efficient excitation transfer is achieved.

Fig. 6 also presents the excitation energies of the spheroidene and BChl states in LH-II of Rb. sphaeroides. Spheroidene features two low-lying singlet excited states. A strongly allowed state absorbing at 500 nm is labeled S2. It decays within <200 fs into an optically forbidden electronic state labeled S1, which has been characterized in refs. 46 and 47. The S1 state is in resonance with the accepting Qy exciton states and, thus, provides a possible gateway for transfer to the B850 ring.

The optically forbidden character of the S1 state of spheroidene precludes its coupling to the B850 ring through the Förster mechanism, thus limiting potential mechanisms to coupling through Coulomb interaction including higher-order multipoles (generalized Förster mechanism) or coupling through electron exchange [Dexter mechanism (48)]. The Dexter mechanism requires an overlap of donor and acceptor wave functions and, thus, is only efficient when donor and acceptor are in van der Waals contact. Because spheroidene and BChls are indeed in close contact, as shown in Fig. 6, one is tempted to suggest that the mechanism underlying singlet excitation transfer is electron exchange.

Recent calculations (44, 49), however, do not support this assumption. Based on the geometric arrangement of carotenoids and BChls in LH-II (Fig. 6) and on CI expansions of the electronic states of carotenoids and chlorophylls, calculations in ref. 44 showed that the generalized Förster mechanism governs the transfer of singlet excitations, resulting in a transfer time of 260 fs through the S1 (carotenoid)→B850 (exciton states) pathway.

The transfer through the optically forbidden S1 state is strongly accelerated by the splitting of the B850 exciton levels, as seen also in the case of the B800→B850 transfer (44). Without the exciton splitting, the calculated transfer time is as slow as 2.5 ps. This suggests that purple bacteria may have evolved the ring structure of LH-II to improve resonance between acceptor and donor systems. In addition to transfer through the forbidden S1 state of spheroidene, the absorbing S2 state of carotenoids is likely to transfer some excitation also directly to the Qx state of BChl, as suggested by the calculated transfer time of 330 fs (44) and the shortened (60 fs) in vivo lifetime of the S2 state (50).

In addition to the light-harvesting function, carotenoids protect the light-harvesting system from the damaging effect of BChl triplet states that arise with a small, but finite, probability and can generate highly reactive singlet oxygen according to the reaction ³O₂ + ³BChl* → ¹O₂* + ¹BChl. Carotenoids prevent this reaction by quenching the BChl triplet states through triplet excitation transfer from the BChls. This transfer involves a spin change and can only proceed through the electron exchange or Dexter mechanism (48). The triplet excitation transfer in LH-II of Rs. molischianum has been described in detail (44). The calculations showed that B850a and B800 are well protected by one of the eight lycopenes seen in the crystal structure of LH-II of Rs. molischianum (see Figs. 3 and 6), whereas B850b is not directly protected but can transfer triplet excitation within a few picoseconds to the well protected B850a BChl.

Other Photosynthetic Organisms

Photosynthetic organisms have developed from a few common components a rather divergent set of antenna systems. The divergence is demonstrated in Fig. 8, which compares antenna systems of green bacteria, cyanobacteria, dinoflagellates, and green plants; to these examples is to be added the apparatus of purple bacteria shown in Fig. 1.

FIG. 8. Schematic representation of proposed models of the PSUs in other photosynthetic systems. The figure displays intra- and extramembrane light-harvesting complexes, together with the RCs (RC in green bacteria, and PS-I and PS-II in cyanobacteria, dinoflagellates, and green plants). (a) Green bacteria: The major light-harvesting complex, the chlorosome, contains rod-like BChl c aggregates surrounded by a layer of protein-embedding lipids. Excitation energy harvested by the rod-like aggregates reaches the RC through a BChl a-containing baseplate and membrane-bound light-harvesting BChl a complexes. (b) Cyanobacteria: The dominant light-harvesting complex of cyanobacteria and red algae, the phycobilisome (PBS), is unique in choosing linear tetrapyrroles as pigments. Several types of disk-like pigment-protein complexes such as R-phycoerythrin (51) constitute the phycobilisome rods and core. (c) Dinoflagellates: The photosynthetic unit of dinoflagellates consists of several membrane-bound pigment-protein complexes and an extramembrane light-harvesting complex, the peridinin-chlorophyll-protein (PCP). (d) Green plants: Chloroplasts of green plants possess the chlorophyll-carotenoid-containing LHCII (6) as the most abundant light-harvesting complex. [Images of R-phycoerythrin and PCP were produced with the program VMD (25).]

Anoxygenic photosynthetic (purple and green) bacteria employ a single RC. Oxygen-evolving photosynthetic organisms, e.g., cyanobacteria, dinoflagellates, and plants, possess in their PSUs two RCs of different types, namely PS-I and PS-II (see Figs. 8 b, c, and d). PS-I shows similarity to the RCs of green sulfur bacteria, whereas PS-II is thought to be evolutionarily related to the RC of purple bacteria. PS-I and PS-II have integral light-harvesting pigments associated with them (5). Apart from those integral light-harvesting pigments, oxygen-evolving photosynthetic organisms possess additional light-harvesting complexes that display significant structural variability among species.

To illustrate the common components of the light-harvesting systems in Figs. 1 and 8, we summarize the properties of the antenna systems of purple bacteria as far as they are relevant to photosynthetic life forms in general.

The chromophores of purple bacteria, i.e., BChls and carotenoids, are attuned to their ambient light. In the case of lycopene/spheroidene and B800/B850 BChls, the combined absorption spectrum is complementary to that of chlorophyll a or b in green plants, i.e., adjusted to a habitat below plants. The purple bacteria exploit the low-lying excited states of polyenes (46, 47) to couple the carotenoid excitations to BChls. The carotenoids are entrusted with the excitation energy for only a few hundred femtoseconds, after which time BChls are the wardens of the energy.

The spectra of BChls are tuned only to a limited degree through interaction with the protein environment, e.g., through formylmethionine-Mg ligation in the case of B800 of LH-II from Rps. acidophila (18) or through an Asp-Mg ligation in the case of B800 of LH-II from Rs. molischianum (19); the observed spectra result mainly from intrinsic properties of BChls and excitonic interactions (26, 29). Excitonic coupling splits the excited state energies, thus improving the overlap between donor and acceptor spectra in the excitation cascade (26, 41, 44).

The BChls have the disadvantage that their lowest-energy triplet state lies high enough to excite molecular oxygen. Their companion carotenoids quench the triplet excitations of BChls.

The efficient flow of excitation through the chromophore system requires highly ordered aggregates, the geometry of which is adapted to the needed interactions; carotenoids must be in close (van der Waals) contact with BChls for triplet quenching and must be proximate within a few angstroms for transfer of optically forbidden excitations. Chlorophylls, to achieve significant exciton splitting, must have Mg–Mg distances of about 10 Å; for energy transfer on a picosecond time scale, Mg–Mg distances must be of the order of 20 Å. It is possible that BChls form aggregates to achieve coherence over many chromophores, such that the lowest-energy state becomes optically forbidden, increasing its lifetime.
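The distance requirements quoted above can be put on a rough relative scale. The sketch below assumes point-dipole coupling between chromophores (proportional to 1/R³) and a Förster-type transfer rate (proportional to 1/R⁶); both scaling laws are textbook approximations introduced here for illustration, not statements from the text, and the point-dipole picture is only approximate at separations near 10 Å. The function names are hypothetical, and only the ratios are meaningful.

```python
# Relative distance scaling for chromophore coupling and excitation transfer.
# Assumptions (not from the text): coupling ~ 1/R^3, transfer rate ~ 1/R^6.


def relative_dipole_coupling(r_angstrom, r_ref_angstrom=10.0):
    """Coupling strength relative to its value at r_ref (assumed 1/R^3 law)."""
    return (r_ref_angstrom / r_angstrom) ** 3


def relative_forster_rate(r_angstrom, r_ref_angstrom=10.0):
    """Transfer rate relative to its value at r_ref (assumed 1/R^6 law)."""
    return (r_ref_angstrom / r_angstrom) ** 6


if __name__ == "__main__":
    # Compare the two Mg-Mg distances mentioned in the text.
    for r in (10.0, 20.0):
        print(f"R = {r:4.1f} A: coupling x{relative_dipole_coupling(r):.3f}, "
              f"rate x{relative_forster_rate(r):.4f}")
```

Under these assumptions, doubling the Mg–Mg distance from 10 Å to 20 Å weakens the coupling eightfold and the transfer rate 64-fold, which is one way to read why the two regimes call for different spacings.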

A multiprotein architecture is necessary to provide a large enough scaffold for the number of chromophores employed in light harvesting. Because of this architecture, antenna systems employ a hierarchy of chromophore aggregates; the chromophores are closer and more tightly coupled in the individual pigment-protein complex, e.g., in LH-II, and more loosely coupled between different pigment-protein complexes. The control of the overall aggregation of the multiprotein system is in itself an impressive achievement worthy of study (23).

To direct the flow of excitation to the RC, the antenna system of purple bacteria assumes a spatial organization in which the BChls with lower-energy excitations are closer to the RC. Such an arrangement, as shown in Fig. 2, yields an energy funnel that prevents detours in the excitation flow, enhancing the overall efficiency of light harvesting as measured by the quantum yield for an absorbed photon to reach an RC.
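The directionality of such a funnel can be estimated with a simple detailed-balance argument. The sketch below is an illustration under stated assumptions: it takes the absorption maxima implied by the band names B800 and B850 (about 800 and 850 nm) as proxies for the excited-state energies, and it assumes that the ratio of uphill to downhill transfer rates equals the Boltzmann factor of the energy gap at 300 K. Neither assumption is made in the text, and the function names are hypothetical.

```python
import math

# Boltzmann constant in cm^-1 per kelvin.
K_B_CM = 0.6950348


def wavenumber_cm(wavelength_nm):
    """Convert a wavelength in nm to a wavenumber in cm^-1."""
    return 1.0e7 / wavelength_nm


def back_to_forward_ratio(donor_nm, acceptor_nm, temperature_k=300.0):
    """Assumed detailed-balance ratio k_uphill / k_downhill = exp(-gap / kT)."""
    gap_cm = wavenumber_cm(donor_nm) - wavenumber_cm(acceptor_nm)
    return math.exp(-gap_cm / (K_B_CM * temperature_k))


if __name__ == "__main__":
    gap = wavenumber_cm(800.0) - wavenumber_cm(850.0)
    ratio = back_to_forward_ratio(800.0, 850.0)
    print(f"B800 -> B850 gap: {gap:.0f} cm^-1, uphill/downhill ratio: {ratio:.3f}")
```

With these inputs the gap is several times the thermal energy, so back transfer is strongly disfavored, which is the sense in which a downhill arrangement "prevents detours."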

The features of light harvesting in purple bacteria can serve as a background to a comparison of the alternative antenna systems shown in Fig. 8. As their primary light-harvesting complexes, green bacteria use extramembrane sack-like aggregates of BChl c (d or e in some species) called chlorosomes (Fig. 8a). Chlorosomes consist of pigment oligomers which in some species appear to be rod-shaped aggregates of BChls. It has been suggested that the rod-shaped BChl aggregates are stabilized solely through pigment-pigment interactions between the BChls. Chlorosomes are positioned external to the membrane, on top of the RC as shown in Fig. 8a.

In cyanobacteria and red algae, the dominant light-harvesting complexes, shown in Fig. 8b, are extramembrane PBSs with discoidal pigment-protein complexes exhibiting an energy cascade from the outer rod disks toward the core and the RC.

The PSU of dinoflagellates, presented in Fig. 8c, contains, among other pigment-protein complexes, the extramembrane PCP. Recently, the structure of PCP has been solved at 2.0 Å resolution (52). PCP distinguishes itself from other light-harvesting complexes in using carotenoids as the predominant light absorbers, exhibiting a chlorophyll-to-carotenoid ratio of 1:4. Efficient excitation transfer between carotenoids and chlorophylls, based on the structure of the aggregate, i.e., close contacts between peridinins and chlorophylls, has been confirmed by quantum chemical calculations (T.R., A.D., and K.S., unpublished work).

The most abundant light-harvesting complex located in chloroplasts of green plants is LHCII, shown in Fig. 8d. The structure of LHCII, resolved at 3.4 Å (6), features two carotenoids, seven Chls a, and five Chls b as light-absorbing agents. LHCII is located within the thylakoid membrane in the vicinity of PS-II. It has been suggested that LHCII can, according to light conditions, physically move toward PS-I, thereby regulating the relative flow of energy into PS-II and PS-I.

The multiprotein photosynthetic apparatus as shown in Fig. 1 poses the challenge of eventually modeling the conversion of light into ATP in its entirety. Few would have predicted that the protein constituents of the photosynthetic apparatus would be structurally known in principle already today, but many expect that biologists will see more and more often entire protein systems engaged in complex overall functions resolved at atomic resolution. The questions posed by the photosynthetic apparatus will then be typical for biology of the 21st century: how are multiprotein systems genetically controlled, how do they physically aggregate, how did they evolve, and how do they compare between species? The PSU constitutes an ideal subsystem of the photosynthetic apparatus that, because of its smaller size, is more amenable to study while posing the same principal challenges: how do LH-I and LH-II form from their many independent components, what determines the ring size and stability, and how do the completed LH-IIs aggregate around the LH-I-RC complex? The function of the PSU emerges as a true system property, all components being designed to cooperate in absorbing light effectively and channeling its energy to the RC. The common origin of photosynthetic, respiratory, and other organisms makes the PSU and the photosynthetic apparatus a valuable model for understanding, at the level of multiprotein systems, not only photosynthesis but also life in general.

We acknowledge financial support from the National Institutes of Health [Grant P41RR05969], the National Science Foundation [Grants NSF BIR 9318159 and NSF BIR-94–23827(EQ)], and the Carver Charitable Trust.

1. Emerson, R. & Arnold, W. (1932) J. Gen. Physiol. 16, 191–205.
2. Hu, X. & Schulten, K. (1997) Physics Today 50, 28–34.
3. Duysens, L.N.M. (1952) Ph.D. thesis (Utrecht, The Netherlands).
4. Cogdell, R., Fyfe, P., Barrett, S., Prince, S., Freer, A., Isaacs, N., McGlynn, P. & Hunter, C. (1996) Photosynth. Res. 48, 55–63.
5. Krauss, N., Schubert, W.-D., Klukas, O., Fromme, P., Witt, H.T. & Saenger, W. (1996) Nat. Struct. Biol. 3, 965–973.
6. Kühlbrandt, W., Wang, D.-N. & Fujiyoshi, Y. (1994) Nature (London) 367, 614–621.
7. Zuber, H. & Brunisholz, R.A. (1991) in Chlorophylls, ed. Scheer, H. (CRC, Boca Raton, FL), pp. 627–692.
8. Miller, K. (1982) Nature (London) 300, 53–55.
9. Walz, T. & Ghosh, R. (1997) J. Mol. Biol. 265, 107–111.
10. Monger, T. & Parson, W. (1977) Biochim. Biophys. Acta 460, 393–407.
11. van Grondelle, R., Dekker, J., Gillbro, T. & Sundstrom, V. (1994) Biochim. Biophys. Acta 1187, 1–65.
12. Germeroth, L., Lottspeich, F., Robert, B. & Michel, H. (1993) Biochemistry 32, 5615–5621.
13. Aagaard, J. & Sistrom, W. (1972) Photochem. Photobiol. 15, 209–225.
14. Pullerits, T. & Sundstrom, V. (1996) Acc. Chem. Res. 29, 381–389.
15. Fleming, G.R. & van Grondelle, R. (1997) Curr. Opin. Struct. Biol. 7, 738–48.
16. Deisenhofer, J., Epp, O., Miki, K., Huber, R. & Michel, H. (1985) Nature (London) 318, 618–624.
17. Ermler, U., Fritzsch, G., Buchanan, S.K. & Michel, H. (1994) Structure 2, 925–936.
18. McDermott, G., Prince, S., Freer, A., Hawthornthwaite-Lawless, A., Papiz, M., Cogdell, R. & Isaacs, N. (1995) Nature (London) 374, 517–521.
19. Koepke, J., Hu, X., Münke, C., Schulten, K. & Michel, H. (1996) Structure 4, 581–597.
20. Hu, X., Xu, D., Hamer, K., Schulten, K., Koepke, J. & Michel, H. (1995) Protein Sci. 4, 1670–1682.
21. Hu, X. & Schulten, K. (1998) Biophys. J., in press.
22. Gouterman, M. (1961) J. Mol. Spectrosc. 6, 138–163.
23. Bailey, M., Schulten, K. & Johnson, J.E. (1998) Curr. Opin. Struct. Biol., in press.
24. Karrasch, S., Bullough, P.A. & Ghosh, R. (1995) EMBO J. 14, 631–638.
25. Humphrey, W.F., Dalke, A. & Schulten, K. (1996) J. Mol. Graphics 14, 33–38.
26. Hu, X., Ritz, T., Damjanović, A. & Schulten, K. (1997) J. Phys. Chem. B 101, 3854–3871.
27. Frenkel, J. (1931) Phys. Rev. 37, 17–44.
28. Knox, K. (1963) Theory of Excitons (Academic, New York).
29. Zerner, M.C., Cory, M.G., Hu, X. & Schulten, K. (1998) J. Phys. Chem. B, in press.
30. Dracheva, T.V., Novoderezhkin, V.I. & Razjivin, A. (1996) FEBS Lett. 387, 81–84.
31. Hu, X., Xu, D., Hamer, K., Schulten, K., Koepke, J. & Michel, H. (1995) in Biological Membranes: A Molecular Perspective from Computation and Experiment, eds. Merz, K. & Roux, B. (Birkhäuser, Cambridge, MA), pp. 503–533.
32. Sauer, K., Cogdell, R.J., Prince, S.M., Freer, A., Isaacs, N.W. & Scheer, H. (1996) Photochem. Photobiol. 64, 564–576.
33. Alden, R., Johnson, E., Nagarajan, V., Parson, W., Law, C. & Cogdell, R. (1997) J. Phys. Chem. B 101, 4667–4680.
34. Ritz, T., Hu, X., Damjanović, A. & Schulten, K. (1998) J. Lumin. 76–77, 310–321.
35. Pullerits, T. & Sundstrom, V. (1996) J. Phys. Chem. 100, 10787–10792.
36. Jimenez, R., Dikshit, S., Bradforth, S. & Fleming, G. (1996) J. Phys. Chem. 100, 6825–6834.
37. Wu, H.-M., Reddy, N.R.S. & Small, G.J. (1997) J. Phys. Chem. B 101, 651–656.
38. Shreve, A.P., Trautman, J.K., Frank, H.A., Owens, T.G. & Albrecht, A.C. (1991) Biochim. Biophys. Acta 1058, 280–288.
39. Hess, S., Chachisvilis, M., Timpmann, K., Jones, M.R., Fowler, G.J.S., Hunter, C.N. & Sundstrom, V. (1995) Proc. Natl. Acad. Sci. USA 92, 12333–12337.
40. Visscher, K.J., Bergstrom, H., Sundstrom, V., Hunter, C.N. & van Grondelle, R. (1989) Photosynth. Res. 22, 211–217.
41. Arnold, W. & Oppenheimer, J.R. (1950) J. Gen. Physiol. 33, 423–435.
42. Oppenheimer, J.R. (1941) Phys. Rev. 60, 158.
43. Förster, T. (1948) Ann. Phys. (Leipzig) 2, 55–75.
44. Damjanović, A., Ritz, T. & Schulten, K. (1998) Phys. Rev. E, in press.
45. Chadwick, B.W., Zhang, C., Cogdell, R.J. & Frank, H.A. (1987) Biochim. Biophys. Acta 893, 444–457.
46. Hudson, B.S., Kohler, B.E. & Schulten, K. (1982) in Excited States, ed. Lim, E.C. (Academic, New York), Vol. 6, pp. 1–95.
47. Tavan, P. & Schulten, K. (1987) Phys. Rev. B 36, 4337–4358.
48. Dexter, D.L. (1953) J. Chem. Phys. 21, 836–850.
49. Nagae, H., Kakitani, T., Katoh, T. & Mimuro, M. (1993) J. Chem. Phys. 98, 8012–8023.
50. Ricci, M., Bradforth, S.E., Jimenez, R. & Fleming, G.R. (1996) Chem. Phys. Lett. 259, 381–390.
51. Chang, W., Jiang, T., Wan, Z., Zhang, J., Yang, Z. & Liang, D. (1996) J. Mol. Biol. 262, 721–731.
52. Hofmann, E., Wrench, P., Sharples, F., Hiller, R., Welte, W. & Diederichs, K. (1996) Science 272, 1788–1791.


Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5942–5949, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

Electrostatic steering and ionic tethering in enzyme-ligand binding: Insights from simulations

REBECCA C. WADE*, RAZIF R. GABDOULLINE, SUSANNA K. LÜDEMANN, AND VALÈRE LOUNNAS

European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany

ABSTRACT To bind at an enzyme’s active site, a ligand must diffuse or be transported to the enzyme’s surface, and, if the binding site is buried, the ligand must diffuse through the protein to reach it. Although the driving force for ligand binding is often ascribed to the hydrophobic effect, electrostatic interactions also influence the binding process of both charged and nonpolar ligands. First, electrostatic steering of charged substrates into enzyme active sites is discussed. This is of particular relevance for diffusion-influenced enzymes. By comparing the results of Brownian dynamics simulations and electrostatic potential similarity analysis for triose-phosphate isomerases, superoxide dismutases, and β-lactamases from different species, we identify the conserved features responsible for the electrostatic substrate-steering fields. The conserved potentials are localized at the active sites and are the primary determinants of the bimolecular association rates. Then we focus on a more subtle effect, which we will refer to as “ionic tethering.” We explore, by means of molecular and Brownian dynamics simulations and electrostatic continuum calculations, how salt links can act as tethers between structural elements of an enzyme that undergo conformational change upon substrate binding, and thereby regulate or modulate substrate binding. This is illustrated for the lipase and cytochrome P450 enzymes. Ionic tethering can provide a control mechanism for substrate binding that is sensitive to the electrostatic properties of the enzyme’s surroundings even when the substrate is nonpolar.

Conceptually, the process of ligand-protein binding may be considered to consist of the following consecutive steps: 1, diffusion of the ligand to the entrance to the binding site on the protein surface; 2, diffusion of the ligand through the protein to the binding site; 3, rearrangement of the ligand in the binding site into its bound orientation (see Fig. 1). In some cases, the binding site is situated on the protein surface and diffusion through the protein is not necessary. Step 1 may involve diffusion in reduced dimensions or the ligand may be actively transported to the protein surface. However, in general the above three steps should be considered and electrostatic interactions can influence all three. Here, their role in the first two steps is considered.

The important influence of electrostatic interactions on the rates of diffusion of charged substrates toward the active sites of enzymes is now well established (1–3). Electrostatic steering is of greatest importance for diffusion-controlled enzymes because it is one of the main factors determining the catalytic rate. For these enzymes, the postdiffusional steps of the reaction have been so optimized that the diffusional association of substrate and protein has become the rate-limiting step. Enhancement of the diffusional association rates can be achieved by attractive electrostatic interactions between the substrate and the protein binding site. Here, we ask what enzyme features are necessary for electrostatic steering resulting in rapid ligand-protein association rates. By examining orthologs from different species, by means of Brownian dynamics (BD) simulations (4) and electrostatic potential similarity analysis, we identify common features important for their shared molecular function and find that these are confined to the close vicinity of the active site.

FIG. 1. Schematic diagram showing how electrostatic interactions can influence the binding of a ligand (shaded) to a protein (outline). Step 1, electrostatic forces and torques can steer the ligand into its binding site on the protein. Step 2, electrostatic interactions such as salt links can affect the protein dynamics necessary for ligand access to binding sites shielded from solvent in “gated” binding. Step 3, electrostatic interactions, particularly salt links and hydrogen bonds, between ligand and protein can contribute to binding affinity and specificity and to the structural binding mode of the complex formed.

The role of electrostatic interactions in the next two steps of the ligand-protein binding process is more complex. Here, we focus on one type of interaction, which we shall term “ionic tethering.” This entails the formation of salt links between charged residues in the protein that affect conformational changes in the protein associated with or necessary for ligand binding. Salt-link formation between charged groups in the ligand and the protein can also contribute to ligand binding, but this will not be considered here. Instead, uncharged ligands, which cannot themselves engage in salt links, will be examined. Nevertheless, we show the importance of electrostatic interactions for the binding of nonpolar ligands and for making binding sensitive to the surrounding environment. We discuss possible mechanisms in the light of two examples that have been the subjects of recent calculations.

*To whom reprint requests should be addressed. e-mail: [email protected].
© 1998 by The National Academy of Sciences 0027–8424/98/955942–8$2.00/0
PNAS is available online at http://www.pnas.org.
Abbreviations: BD, Brownian dynamics; BLAC, β-lactamase; SOD, superoxide dismutase; TIM, triose-phosphate isomerase.

ELECTROSTATIC STEERING

One way to identify the features important for the electrostatic steering of a substrate toward its binding site on an enzyme is by site-directed mutagenesis. This approach has been employed to examine the fast association rates of superoxide dismutase (SOD) and acetylcholinesterase with their substrates, and barnase with the inhibitor barstar.

• For human SOD, mutations were identified by using BD simulations to improve electrostatic steering, and, indeed, when the mutations were made, greater electrostatic enhancement of the rate was observed (5), resulting in a “superperfect” enzyme (6).

• For acetylcholinesterase, a large number of mutations of charged residues was made, and these were shown to have little effect on the rate of substrate binding (7). This result was interpreted as evidence of lack of diffusion control and electrostatic steering. However, the rates for the mutants could be well reproduced in BD simulations, which demonstrated enhancement of rates because of electrostatic steering of substrate toward and inside the substrate-binding gorge (8, 9).

• Barstar is the intracellular protein inhibitor of the extracellular ribonuclease barnase, and it binds very tightly with high on-rates. Even so, the rate of binding could be improved by mutation (10). The effects of mutations and ionic strength on the association rates could be well reproduced by BD simulations (11). The data show the dominance of certain residues on the protein binding faces in determining the electrostatic enhancement of the association rate.

Together, these results indicate that electrostatic enhancement of association rates arises mostly from the presence of a few charged residues close to the binding site.

Here, we take an alternative approach to site-directed mutagenesis, namely, comparison of diffusion-influenced enzymes from different species to find out what is required for electrostatically enhanced substrate binding rates. By relying on natural evolution, we are assured of examining fully functioning enzymes, although they may not be fully optimized for electrostatic enhancement of substrate on-rates or fast reaction, as this may not be desirable in their in vivo environment. We examine three families of diffusion-influenced proteins, triose-phosphate isomerases (TIM), Cu,Zn-superoxide dismutases (SOD), and class A β-lactamases (BLAC), for which crystal structures are available from several organisms and whose kinetic properties have been measured (see Table 1).

Diffusional Control of Catalytic Rates. The primary indicator for diffusion control of an enzyme reaction is a fast catalytic rate that is dependent on the viscosity and ionic strength of the solvent. Both TIM and SOD are extremely fast, efficient enzymes with the rate-limiting step of their reactions under physiological conditions being the diffusion of substrate, glyceraldehyde 3-phosphate and superoxide, respectively, to the active site. Indeed, TIM has been described as a “perfect enzyme” (12, 13). The catalytic rates measured for TIMs from more than five species are all about 10⁸ M⁻¹·s⁻¹ at 100 mM ionic strength (see ref. 14 and references in ref. 15), and viscosity dependence of the rates has been demonstrated (16). The catalytic rate has been measured for SODs from more than eight species, and all have rates of about 3×10⁹ M⁻¹·s⁻¹ at 20 mM ionic strength (see references in ref. 17). The rates of SODs exhibit ionic strength dependence and decrease as the ionic strength increases (18). BLACs have been characterized as fully efficient enzymes with no single rate-determining step (19). They are partly diffusion-controlled for good substrates, such as benzylpenicillin with a single negative charge, and most have catalytic rates of 10⁷ to 10⁸ M⁻¹·s⁻¹ for such substrates at 100 mM ionic strength (20, 21).
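Two back-of-the-envelope quantities help place these rates and ionic strengths in context. The sketch below computes (i) the Smoluchowski rate for a uniformly reactive sphere in the absence of forces, k = 4πDRN_A, and (ii) the Debye screening length of a 1:1 electrolyte. The diffusion coefficient, encounter radius, relative permittivity of 78.5, and temperature of 298.15 K are illustrative assumptions and do not come from the text; the function names are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23      # 1/mol
K_B = 1.380649e-23            # J/K
E_CHARGE = 1.602176634e-19    # C
EPS_0 = 8.8541878128e-12      # F/m


def smoluchowski_rate(diff_coeff_m2_s, encounter_radius_m):
    """Force-free diffusion-limited rate k = 4*pi*D*R*N_A in M^-1 s^-1."""
    # The factor 1e3 converts m^3/mol/s to L/mol/s.
    return 4.0 * math.pi * diff_coeff_m2_s * encounter_radius_m * AVOGADRO * 1.0e3


def debye_length_angstrom(ionic_strength_molar, eps_r=78.5, temperature_k=298.15):
    """Debye screening length in angstroms for a given ionic strength (mol/L)."""
    ionic_strength_si = ionic_strength_molar * 1.0e3  # mol/m^3
    kappa_sq = (2.0 * AVOGADRO * E_CHARGE ** 2 * ionic_strength_si
                / (EPS_0 * eps_r * K_B * temperature_k))
    return 1.0e10 / math.sqrt(kappa_sq)


if __name__ == "__main__":
    # Assumed values: D = 1e-9 m^2/s, encounter radius 5 A.
    print(f"Smoluchowski rate: {smoluchowski_rate(1.0e-9, 5.0e-10):.2e} M^-1 s^-1")
    for ionic_strength in (0.02, 0.1):
        print(f"I = {ionic_strength * 1e3:.0f} mM: "
              f"Debye length = {debye_length_angstrom(ionic_strength):.1f} A")
```

With these assumed inputs the force-free estimate falls in the 10⁹ M⁻¹·s⁻¹ range, and the screening length roughly halves between 20 mM and 100 mM, which is consistent with rates that fall as ionic strength rises when attraction contributes to association.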

BD Simulations. Experimental association rates were reproduced well for six variants of SOD (17) and four variants of TIM (15) by BD simulation. These results show that the main features influencing the catalytic rates are represented in the simulation model. The protein is represented by all atoms observed crystallographically plus modeled polar hydrogen atoms, with each atom assigned a partial charge and a van der Waals radius. The protein is immersed in a uniform solvent continuum. The electrostatic potential of the protein is computed from numerical solution of the finite-difference linearized Poisson-Boltzmann equation (22). The substrate is represented by a charged sphere (for SOD) or dumbbell (for TIM). The molecules are treated as rigid, and intermolecular hydrodynamic interactions are neglected. Comparison of simulations with and without a net charge on the substrate shows that electrostatic interactions enhance the association rates for all the enzyme variants studied.
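The core of such a simulation is a single Brownian dynamics displacement. The sketch below is a minimal illustration, not the authors' program: it uses the Ermak-McCammon update without hydrodynamic interactions for a spherical substrate, and it replaces the finite-difference Poisson-Boltzmann grid with a single screened (Debye-Hückel) point charge at the origin. The units (Å, ps, kcal/mol), the parameter values, the 5 Å reaction radius, and all function names are assumptions made for the example.

```python
import math
import random

COULOMB_KCAL = 332.06   # Coulomb constant in kcal*A/(mol*e^2)
KT_KCAL = 0.593         # thermal energy near 298 K in kcal/mol (assumed)


def screened_coulomb_force(pos, q_substrate, q_site, debye_len=9.6, eps_r=78.5):
    """Force (kcal/mol/A) on the substrate from a screened charge at the origin."""
    r = math.sqrt(sum(x * x for x in pos))
    pref = COULOMB_KCAL * q_substrate * q_site / eps_r
    # -dU/dr for U(r) = pref * exp(-r / debye_len) / r, directed along pos / r.
    magnitude = pref * math.exp(-r / debye_len) * (1.0 / r ** 2 + 1.0 / (debye_len * r))
    return tuple(magnitude * x / r for x in pos)


def bd_step(pos, force, diff_coeff, dt, kT=KT_KCAL, gauss=random.gauss):
    """One Brownian dynamics step: drift D*F*dt/kT plus Gaussian noise."""
    sigma = math.sqrt(2.0 * diff_coeff * dt)  # per-coordinate displacement width
    drift = diff_coeff * dt / kT
    return tuple(x + drift * f + gauss(0.0, sigma) for x, f in zip(pos, force))


if __name__ == "__main__":
    random.seed(0)
    pos = (30.0, 0.0, 0.0)
    for step in range(100000):
        if math.sqrt(sum(x * x for x in pos)) < 5.0:  # assumed reaction radius
            print(f"substrate reached the site after {step} steps")
            break
        force = screened_coulomb_force(pos, q_substrate=-1.0, q_site=+1.0)
        pos = bd_step(pos, force, diff_coeff=0.1, dt=1.0)
```

In a production calculation many such trajectories are run from a starting surface and the fraction that react is converted to a rate constant; that bookkeeping is omitted here.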

Electrostatic Potential Similarity Analysis. To quantify the common features in the electrostatic potentials of different variants of the enzymes, we carried out an electrostatic potential similarity analysis. The members of each family of enzymes were superimposed by matching α-carbons. Then average potentials, sign conservation, and similarity indices, as defined in the legend to Fig. 2, were computed as a function of position around the superimposed molecules. The results are shown in Fig. 2. The substrates considered are all negatively charged. The majority of the proteins are also net negatively charged. Thus, electrostatic enhancement of the association rates does not arise from nonspecific attraction between molecules due to monopole interactions. Instead, it arises from the nonuniform charge distribution of the proteins, which results in steering of the substrates toward the positively charged regions of the active sites.

Table 1. Properties of the diffusion-influenced enzymes triose-phosphate isomerase (TIM), superoxide dismutase (SOD), and β-lactamase (BLAC)

| Property | TIM | SOD | BLAC |
|---|---|---|---|
| Substrate | Glyceraldehyde 3-phosphate | Superoxide | Benzylpenicillin |
| Net charge of substrate, e | –1/–2 | –2 | –1 |
| No. of protein variants compared | 4 | 6 | 4 |
| Variants compared together with Protein Data Bank identifier code (listed in order of increasing net charge) | E. coli (1tre), yeast (1ypi), chicken muscle,* T. brucei (5tim) | Spinach (1srd), frog (1xso), yeast (1sdy), human (1spd), bovine (2sod), P. leiognathi (1yai) | E. coli (TEM-1) (1xpb), B. licheniformis (4blm), S. albus G,† S. aureus (3blm) |
| Net charges of proteins at neutral pH, e | –12, –6, –2, +12 | –8, –6, –4, –4, –2, +2 | –6, –6, –4, +16 |
| Range of sequence identity between proteins, % | ≈50 | ≈30–55 | ≈30–45 |
| Measured kcat/Km, M⁻¹·s⁻¹ × 10⁻⁸ | 1.0–8.4 | 25–39 | 0.03–0.8 |
| Ionic strength for rate measurement, mM | 100 | 20 | 100 |

E. coli, Escherichia coli; T. brucei, Trypanosoma brucei; P. leiognathi, Photobacterium leiognathi; B. licheniformis, Bacillus licheniformis; S. albus, Streptomyces albus; S. aureus, Staphylococcus aureus.
*Coordinates provided by P. Artymiuk.
†Coordinates provided by O. Dideberg.

FIG. 2. Electrostatic potential comparison for variants of TIM (Left), SOD (Center), and BLAC (Right). (Top) Average potential contoured at ±0.4 kcal·mol⁻¹·e⁻¹ (1 kcal = 4.184 kJ). (Middle) Similarity index with most conserved regions within contours at a level of 0.75 in all cases, except for the red contours in Center, which are at 0.85. (Bottom) Contours enclose regions where the sign of the electrostatic potential is conserved. In all cases, red represents regions of negative potential and blue represents regions of positive potential. Magenta solid spheres represent important active-site atoms: carboxylate oxygens of Glu-165 and amino nitrogen of Lys-13 in TIM, Cu and Zn ions in SOD, and the side chain of the catalytic Ser-70 in BLAC. The proteins are represented by ribbon plots of representative variants: chicken muscle TIM, bovine (yellow) and P. leiognathi (green) SOD, and TEM-1 BLAC. The dimers of all SODs studied except that from P. leiognathi superimpose well on the bovine SOD. Consequently, one monomer of P. leiognathi was superimposed on one monomer of bovine SOD instead of the complete dimer as done for the other SODs. Negative contours are not shown for the sign conservation plot in SOD (Center Bottom) for clarity. The similarity index, SI, is computed at points (i,j,k) around the proteins from the following formula, which is generalized to the comparison of N potentials, φ_l, l = 1, 2, …, N, from the Hodgkin formula for the comparison of two potentials (59): [equation missing in source]

SI = +1 when the potentials are all identical. SI = –1 when two potentials are opposite (N < 3). For small deviations, Δφ_n, from the average potential, the decrease in the SI from its maximum (=1) is proportional to (Δφ_n)². This is because the SI can be rewritten as: [equation missing in source]

For example, when SI = 0.85 for four potentials, [expression missing in source]. The SI (and the average potential and sign conservation) are computed outside the molecules’ combined van der Waals volume as defined with atomic radii set at twice their normal values.
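The three quantities mapped in Fig. 2 can be illustrated at a single grid point. Because the legend's formula is not reproduced in this text, the similarity index below is an assumed generalization, chosen only because it matches the stated properties: it reduces to the two-potential Hodgkin form 2φ₁φ₂/(φ₁² + φ₂²) for N = 2, equals +1 when all potentials are identical, and equals –1 for two opposite potentials. It may differ from the authors' expression, and the function names are hypothetical.

```python
def average_potential(potentials):
    """Mean of the potentials of all variants at one grid point."""
    return sum(potentials) / len(potentials)


def sign_conserved(potentials):
    """True when every variant has the same nonzero sign at this grid point."""
    return all(p > 0.0 for p in potentials) or all(p < 0.0 for p in potentials)


def similarity_index(potentials):
    """Assumed N-potential generalization of the pointwise Hodgkin index.

    SI = 2 * sum_{l<m} phi_l * phi_m / ((N - 1) * sum_l phi_l**2)
    """
    n = len(potentials)
    if n < 2:
        raise ValueError("need at least two potentials to compare")
    total = sum(potentials)
    sum_sq = sum(p * p for p in potentials)
    if sum_sq == 0.0:
        raise ValueError("similarity index is undefined when all potentials are zero")
    pair_sum = 0.5 * (total * total - sum_sq)  # sum over l < m of phi_l * phi_m
    return 2.0 * pair_sum / ((n - 1) * sum_sq)


if __name__ == "__main__":
    # Hypothetical potentials (kcal/mol/e) of four variants at one grid point.
    point = [0.50, 0.45, 0.60, 0.40]
    print(average_potential(point), sign_conserved(point), similarity_index(point))
```

Applied over a grid outside the superimposed molecules, these per-point values would give the average, sign-conservation, and similarity maps that the contours summarize.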

The TIM and SOD enzymes considered here are homodimeric enzymes with two active sites, whereas BLAC is a monomer with one active site. The average potentials (Fig. 2 Top) all show attractive positive regions of potential over the active sites. On average, the electrostatic potential of the TIMs confines the substrate to a ring around the protein that includes both active sites. For BLAC, the positive active site potential is on average surrounded by a ring of negative potential and isolated from other positively charged regions of the protein surface.

The similarity index plots (Fig. 2 Middle) show the most conserved regions of the potentials for the variants of each enzyme. The largest regions of positive potential are situated over the active sites for all enzymes. Regions of negative potential are also conserved away from the active sites but, as can be seen by comparison with the average potential maps, the potential in these regions is generally smaller in magnitude. Fig. 2 Bottom shows the regions where the sign of the potential is the same in all variants of each enzyme. This gives an indication of the extent of the attractive potential acting on the substrate.
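The sign-conservation map and the volumes quoted for the conserved regions can be illustrated with a short sketch. It assumes that the potentials of all variants have already been sampled on one common, regularly spaced grid after superposition, and that grid points inside the expanded van der Waals volume have been excluded beforehand. The data layout (one flat list of values per variant) and the function names are hypothetical.

```python
def sign_conserved(values):
    """Return +1 or -1 if every potential at a grid point has that sign, else 0."""
    if all(v > 0 for v in values):
        return 1
    if all(v < 0 for v in values):
        return -1
    return 0


def conserved_positive_volume(grids, spacing):
    """Volume (in cubic angstroms) of the region where all variants are positive.

    grids   -- list of equal-length flat lists, one per enzyme variant
    spacing -- grid spacing in angstroms (assumed uniform in x, y, and z)
    """
    voxel_volume = spacing ** 3
    # zip(*grids) yields, for each grid point, the tuple of potentials of all variants.
    n_points = sum(1 for point in zip(*grids) if sign_conserved(point) == 1)
    return n_points * voxel_volume


if __name__ == "__main__":
    # Three hypothetical variants sampled at four grid points, 2 Å spacing.
    variants = [
        [0.5, -0.2, 0.1, 0.3],
        [0.4, -0.1, -0.1, 0.2],
        [0.6, -0.3, 0.2, 0.1],
    ]
    print(conserved_positive_volume(variants, spacing=2.0))
```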

TIM. In the TIMs, the positive potential at the active site is mostly due to the conserved active site Lys-13 and the conserved Lys-237 (chicken muscle TIM numbering). This conserved region of positive potential has a volume of about 6,000 Å³ and extends about 30 Å from the amino group of Lys-13 at 100 mM ionic strength. The large pincer-like regions of conserved, but small in magnitude, negative potential are mostly due to Glu-23, which is conserved in all four TIMs examined but not in all known TIM sequences.
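As a rough point of reference for distances quoted at 100 mM ionic strength, the Debye screening length of a 1:1 electrolyte can be estimated from linearized Debye-Hückel theory. This is a sketch under assumed conditions (water at 298.15 K with a relative permittivity of 78.5) and is not the calculation used to produce the potentials in Fig. 2. The function name and default values are illustrative.

```python
import math

EPSILON_0 = 8.8541878128e-12   # vacuum permittivity, F/m
K_BOLTZMANN = 1.380649e-23     # J/K
AVOGADRO = 6.02214076e23       # 1/mol
E_CHARGE = 1.602176634e-19     # C


def debye_length_angstrom(ionic_strength_molar, temperature=298.15,
                          rel_permittivity=78.5):
    """Debye length in angstroms for a given ionic strength in mol/L."""
    if ionic_strength_molar <= 0.0:
        raise ValueError("ionic strength must be positive")
    conc_si = ionic_strength_molar * 1000.0  # mol/L -> mol/m^3
    numerator = EPSILON_0 * rel_permittivity * K_BOLTZMANN * temperature
    denominator = 2.0 * AVOGADRO * E_CHARGE ** 2 * conc_si
    return math.sqrt(numerator / denominator) * 1e10  # m -> Å


if __name__ == "__main__":
    print(debye_length_angstrom(0.1))
```

Under these assumptions the screening length at 100 mM is about 9.6 Å, so a conserved potential that extends 30 Å reaches roughly three screening lengths from the charged group.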

SOD. In the SODs, the positive potential over the active site channel is due primarily to the copper and zinc ions, the conserved Arg-141 (bovine SOD numbering), and a few other nearby positively charged residues that are not totally conserved. One of the SOD variants studied, that from the prokaryote Photobacterium leiognathi, displays a different dimerization mode from the other enzymes, which are eukaryotic (23), as shown in Fig. 2. It has a looser dimer interface than the other SODs, indicating that it may act partially as a monomer like the SOD from E. coli, whose structure was solved very recently (24). Nevertheless, the structures of the monomers of all SODs are similar, and thus we compare the potentials around a superimposed monomer of each dimer. The region of attractive potential for the substrate is elongated, roughly following the shape of the active site cleft. It has a volume of ≈3,000 Å³ and extends up to 35 Å from the copper ion. While there is a common region of conserved positive potential as shown by the similarity index plot, the regions of positive potential in each of the proteins do not superimpose exactly. In particular, the P. leiognathi enzyme has a loop insertion known as the SS loop (green loop in the middle of Fig. 2 Top Center) containing two lysine residues, an aspartic acid, and a glutamic acid, that may compensate for the deletion of the 7,8 loop that is present in the other enzymes and contains one lysine and two glutamic acids. This increases the attractive potential near the SS loop (which is present to a lesser extent in the other SODs, as can be seen from the sign conservation map). It should also be noted that the magnitude of the positive potential in the active site is greater for the P. leiognathi enzyme than for the bovine enzyme, although its rate is identical (25). Although mutations of charged residues in the SS loop have not yet been reported, several studies of mutations of the charged residues in the 7,8 loop (Asp-130, Glu-131, Lys-134) and the nearby Lys-120 (not conserved in prokaryotic SODs) and Arg-141 have been made. Arg-141 has been shown to be particularly important electrostatically and mechanistically (26). The other residues are of lesser but significant importance for the catalytic rate, but the relative importance of each of the residues has been shown to differ in the different variants (5, 27).

BLAC. In the BLACs, the conserved positive potential runs along the active site cleft and is due primarily to Lys-234, which is conserved in all four enzymes, and Arg-244, which is present in all but the Streptomyces albus G enzyme (ABL numbering scheme). However, the S. albus G BLAC has an arginine not present in the other three enzymes, Arg-220, whose guanidino group occupies a very similar position in the three-dimensional structure to that of Arg-244 in the other enzymes. While this appears to be a largely compensatory mutation, and is present in other BLAC sequences, it may be one of the reasons why the rate for the S. albus enzyme is lower than that of the other BLACs for benzylpenicillin. Lys-73, which is in the active site close to the catalytic Ser-70, is of less importance for the conserved attractive steering potential. The pKa of this residue is a subject of controversy (see ref. 28 and references therein), but here we note that neutralization of Lys-73 makes very little difference to the conserved positive potential region shown near the active site in Fig. 2 Middle Right. Lys-234, Arg-220, and Arg-244 have been the subject of mutational studies in several BLACs (20). These have shown that Lys-234 plays a role in both the initial recognition of the substrate and the stabilization of the transition state, with the latter being dominant. Mutation of Arg-244 or Arg-220 to a nonpolar residue in the corresponding protein had a deleterious effect on the catalytic rate for charged substrates and indicated an important role in substrate binding. The conserved attractive potential region over the active site is much smaller for BLAC than for the other enzymes: it has a volume of about 200 Å³ and extends only 14 Å from the hydroxyl group of Ser-70.

Mechanistic Implications. Overall, these data show that the largest conserved regions of substrate-attracting electrostatic potential are near the active sites in all of the enzymes studied. This fact implies that the localized potentials at the binding site are sufficient for efficient electrostatic steering of substrate into the binding site. This is consistent with recent energetic analysis showing that the rate enhancement due to electrostatic interactions can be approximately estimated from the Boltzmann weighted average of the interaction energy in the binding site (29–31). However, the conserved local potentials vary in size and extent between the enzymes studied. While the volume of conserved attractive potential at the active site is twice as large for TIM as for SOD, it is much smaller for BLAC than for the other enzymes, indicating less electrostatic enhancement of rates, which is consistent with the lower measured rates for the BLACs. The local attractive potentials are largely provided by a few charged residues that are mostly highly conserved between orthologs. This permits enzymes with the same activity to have very different net charges (ranging from –12 to +12 e for the TIMs and –6 to +16 e for the BLACs examined). Thus, they can be highly efficient enzymes and fulfill different secondary functions or survive in different cellular environments.
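The Boltzmann-weighted estimate mentioned above can be illustrated with a short sketch. It reads the estimate as the average of the Boltzmann factor exp(−U/k_BT) over sampled substrate positions in the binding region, which is one interpretation of the analysis in refs. 29–31 and is an assumption here. The sampled energies, the temperature, and the function name are hypothetical.

```python
import math

K_B_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol·K)


def rate_enhancement(energies_kcal, temperature=298.15):
    """Average Boltzmann factor over sampled binding-site interaction energies.

    energies_kcal -- electrostatic interaction energies (kcal/mol) of the
                     substrate at sampled positions in the binding region
    Returns the estimated ratio of the rate with and without electrostatics.
    """
    if not energies_kcal:
        raise ValueError("at least one energy sample is required")
    kt = K_B_KCAL * temperature
    factors = [math.exp(-u / kt) for u in energies_kcal]
    return sum(factors) / len(factors)


if __name__ == "__main__":
    # Hypothetical attractive energies sampled in a binding site.
    print(rate_enhancement([-1.2, -0.8, -1.0, -1.5]))
```

Attractive (negative) energies give a ratio above 1 and repulsive energies a ratio below 1, so a smaller or weaker attractive region corresponds to less electrostatic rate enhancement in this picture.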

IONIC TETHERING

Ionic Tethering Mechanisms. Ligand binding to proteins is frequently accompanied by conformational changes in the proteins such as loop motions, channel openings, or side-chain rotations. These may result in energy barriers to ligand binding and affect binding rates and their time dependence as described by gating theory (32, 33). We investigate the role of salt links as ionic tethers providing the protein with a means of controlling such conformational “gating.” Some of the possible mechanisms by which ionic tethers might act are as follows:

• by thermodynamic stabilization of a conformation of the protein;
• by kinetic stabilization of a conformation of the protein;
• as devices to allow a certain degree of flexibility in the protein structure;
• as devices dependent on pH, ionic strength, and dielectric of the environment; and
• as devices to ensure specific interactions.

Protein folding is generally considered to be driven by hydrophobic interactions, and it is these interactions rather than interactions between charged groups (whose desolvation is unfavorable) that are thought to stabilize the folded states of proteins (34, 35). Nevertheless, comparison of the crystal structures of a number of proteins from mesophilic and thermophilic organisms shows that the latter often contain significantly more hydrogen bonds and salt links (36). Moreover, calculations of the electrostatic free energy contribution of salt links to protein stability by using a classical continuum electrostatic model (34) show an overall tendency for salt links to be more stable in thermostable proteins than in their mesostable counterparts (I. Shrivastava, V.L., and R.C.W., unpublished data). Stabilization due to the formation of ionic networks is indicated by mutational data (37) and calculations (38) showing that removal of salt links by mutation of their side chains tends to be more destabilizing when they participate in salt-link networks. These data suggest that, under certain conditions, salt links can thermodynamically stabilize a folded conformation of the protein relative to its fully or partially unfolded state. However, their increased presence in proteins from thermophilic organisms has also been attributed to “resilience” (39) or kinetic stabilization (40), which increases the kinetic barrier to unfolding. That is, if the protein is perturbed from its equilibrium structure, the long-range nature of charge-charge interactions in salt links will facilitate return to the original structure. Charge-charge interactions can be thought of as providing a smoother energy landscape funnel than short-range hydrophobic interactions. Thus, they can allow greater structural flexibility in proteins than hydrophobic interactions, although this is combined with greater specificity in the actual interactions formed. Moreover, the strength of salt links is dependent on the physical properties of the environment. Thus, protein conformational changes controlled by ionic tethers may be triggered by a change in environment (e.g., pH, ionic strength, dielectric screening) that alters the strength of a salt link.
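The dependence of salt-link strength on the dielectric environment can be illustrated with a deliberately crude estimate: Coulomb's law for two point charges in a uniform dielectric. This is only a sketch and not the continuum electrostatic model used in the calculations cited above, which also accounts for desolvation and for the boundary between protein and solvent. The charges, the separation, and the dielectric constants below are hypothetical.

```python
COULOMB_KCAL = 332.06  # approximate conversion factor, kcal·Å/(mol·e²)


def coulomb_energy(q1, q2, distance, dielectric):
    """Coulomb energy (kcal/mol) of two point charges in a uniform dielectric.

    q1, q2     -- charges in units of the elementary charge
    distance   -- separation in angstroms
    dielectric -- relative dielectric constant of the (assumed uniform) medium
    """
    if distance <= 0.0 or dielectric <= 0.0:
        raise ValueError("distance and dielectric must be positive")
    return COULOMB_KCAL * q1 * q2 / (dielectric * distance)


if __name__ == "__main__":
    # A hypothetical Arg(+1)/Asp(-1) pair held 4 Å apart.
    for eps in (80.0, 20.0, 4.0):
        print(eps, round(coulomb_energy(+1.0, -1.0, 4.0, eps), 2))
```

In this simple picture, moving the same charge pair from a medium with dielectric constant 80 to one with dielectric constant 4 strengthens the interaction twentyfold, which conveys why a change in environment could trigger a tether-controlled conformational change.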

Consider two situations: one in which a salt link tethers and stabilizes an enzyme in its active conformation under certain environmental conditions, and the other in which a salt link affects the opening of substrate-access and product-exit channels, i.e., gating perturbations from the equilibrium structure. In the former case, the ionic tether stabilizes one of several low-energy conformations. In the latter case, the ionic tether acts as a control of structural deviations from a low-energy conformation. We will illustrate the former by interfacially activated lipases and the latter by cytochrome P450s, which have buried active sites.

Lipase. Lipases catalyze the hydrolysis of uncharged ester substrates. Lipases undergo interfacial activation; i.e., their activity is greatly increased when they act on substrate at a lipid/water interface. Crystallographic studies (41–43) have shown that lipases possess a surface loop or “lid” over the active site that, upon activation, opens up to permit the binding of substrate. Upon opening, the lid’s hydrophobic face is exposed and its hydrophilic side is buried.

The properties of the active-site lid are probably best characterized in the lipase from Rhizomucor miehei. Crystal structures (41, 44) show that the main difference between the open and closed forms is the displacement of a helical lid of about 12 residues (see Fig. 3). The lid contains two charged residues, Arg-86 and Asp-91. In the open, inhibitor-bound form, Arg-86 is close to Asp-61. Evidence that these residues form an ionic tether that stabilizes the open form relative to the closed form in certain environments is provided by the following theoretical and experimental studies.

Molecular and Brownian dynamics simulations. The opening of the active-site lid has been simulated by molecular and Brownian dynamics (BD) (45–47). The time scale of the opening means that the loop must be artificially guided from closed to open states during the molecular dynamics simulations, which give information about the relative energies of the open and closed states. On the other hand, BD simulations, in which the lid is modeled as a simple chain of spherical residues rather than with an all-atom model, can be carried out on the relevant time scales and permit opening times to be estimated. The simulations show that opening of the lid is facilitated when the dielectric constant and polarity of the environment are reduced, indicating that the open state is stabilized by electrostatic interactions. In BD simulations (46), the lid opens up in times on the order of 100 ns in a nonpolar low-dielectric medium, whereas the lid does not always open during simulations of 900 ns in a polar high-dielectric medium. BD simulations with Arg-86 and/or Asp-91 neutralized show their importance in activation, with the effect of Arg-86 being dominant. In the closed inactive conformation, Asp-91 experiences repulsive forces that tend to push the lid toward the active, open conformation. On opening, Arg-86 approaches Asp-61 to make a favorable ionic interaction stabilizing the open conformation. The crystal structure of the open form shows that the side chain of Arg-86 is disordered (41). Thus the interaction with Asp-61 does not result in the formation of a highly ordered hydrogen-bonded salt link but in a less specific charge-charge interaction. The model in the BD simulations does not explicitly represent the side-chain atoms of Arg-86 and shows, therefore, that such less specific interactions can stabilize the open conformation of the lid.
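For readers unfamiliar with the method, a single Brownian dynamics update for one spherical residue can be sketched as follows. It uses the standard Ermak-McCammon scheme without hydrodynamic interactions, in which the displacement is a force-driven drift plus a Gaussian random term with variance 2DΔt per coordinate. This is a generic illustration and not the chain model of ref. 46; the parameter values, units, and function name are assumptions.

```python
import math
import random


def bd_step(position, force, diffusion, kt, dt, rng=random):
    """Advance one bead by one Brownian dynamics step.

    position  -- list of coordinates (e.g., in Å)
    force     -- list of force components (energy unit per Å)
    diffusion -- translational diffusion coefficient D (Å^2 per time unit)
    kt        -- thermal energy k_B*T in the same energy unit as the force
    dt        -- time step
    rng       -- object with a gauss(mu, sigma) method (defaults to random)
    """
    sigma = math.sqrt(2.0 * diffusion * dt)  # std. dev. of the random displacement
    new_position = []
    for x, f in zip(position, force):
        drift = diffusion * f * dt / kt      # systematic displacement along the force
        new_position.append(x + drift + rng.gauss(0.0, sigma))
    return new_position


if __name__ == "__main__":
    rng = random.Random(42)
    bead = [0.0, 0.0, 0.0]
    for _ in range(1000):
        # Hypothetical constant attractive force along x (e.g., toward a counter-charge).
        bead = bd_step(bead, [0.5, 0.0, 0.0], diffusion=0.02, kt=0.593, dt=1.0, rng=rng)
    print(bead)
```

Because the electrostatic force scales inversely with the dielectric constant in a uniform medium, lowering the dielectric in such a model increases the drift term relative to the random term.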

FIG. 3. Part of the α-carbon trace of two crystal structures of the lipase from R. miehei showing the positions of the helical lid in open (black) and closed (gray) forms of the enzyme. In the crystal structure of the open form of the enzyme, an inhibitor is bound in the active site. All non-hydrogen atoms are shown for selected titratable residues (numbered) involved in electrostatic interactions affecting the position of the helical lid. In the open form, Arg-86 in the lid is close to Asp-61.

Chemical modification and inhibition assays. Experimentally, the activity of the R. miehei lipase has been shown (48) to be reduced by chemical modification of arginines and by the addition of guanidine before substrate. Chemical modification was shown to be greater for Arg-86 than for any other arginine in R. miehei. Inhibition by guanidine was not observed when the guanidine was added after addition of substrate, indicating that arginine residues are important only during activation. Inhibition experiments with guanidine also showed reduced activity for Humicola lanuginosa and porcine pancreas lipases, although the reduction was smaller than for R. miehei lipase (70–88% vs. 26% residual activity) (48). Both these enzymes have arginine residues in the active-site lids, although at different positions from Arg-86 in the R. miehei lipase. No reduction in activity was observed for lipases without arginine in the lid, e.g., that from Candida rugosa. The H. lanuginosa lipase has a glutamic acid (Glu-87) at the position equivalent to Arg-86 in the R. miehei lipase. Molecular dynamics simulations (47) indicate that this residue will make unfavorable electrostatic interactions in the open state that can be removed by neutralizing it. However, experiments show that the Glu-87→Ala mutant has decreased activity (35–70% residual activity) compared with wild type (49, 50). A possibility is that Arg-84 in the H. lanuginosa lipase approximately counters the effects of Glu-87 and makes favorable ionic interactions with the rest of the protein in the open state.

Role of ionic tethering in lipases. There are no strictly conserved charged residues in the lids that act as ionic tethers stabilizing the open form in the presence of a low-dielectric lipid and the closed form in high-dielectric aqueous solution. The lid residues, however, contribute to the different substrate specificities in different lipases. Thus, with the dual requirements of different substrate specificities and interfacial activation, lipases appear to have evolved alternative arrangements of charged residues (often arginines) in the lid to act as ionic tethers to control interfacial activation. Stabilization of the active lipase conformation is brought about, not by the formation of strong hydrogen bonds but by longer range precise charge-charge interactions, which may permit the enzyme more flexibility for efficient turnover. Such ionic tethers enable the binding of nonpolar substrates by the lipases to be sensitive to the electrostatic properties of their environment and contribute to the phenomenon of interfacial activation.

Cytochrome P450. Crystal structures show that the active site of cytochrome P450cam from Pseudomonas putida is buried in the protein, isolated from the solvent (51). Data for other cytochrome P450s (52) show that the active site is sometimes isolated from solvent and sometimes has an open channel to the active site lined on one side by a rather mobile F-G helix-loop-helix segment (see Fig. 4A). Clearly, in the case of cytochrome P450cam, protein motions are necessary for the substrate, camphor, to enter the active site. The binding of camphor to cytochrome P450cam can be considered as a two-step process: camphor first diffuses from the outside of the protein to the binding site, and then there is a low- to high-spin transition at the heme iron. Experiments show that the equilibrium constant for the diffusion step, and the accompanying enthalpy and entropy changes, are dependent on the dielectric constant and ionic strength of the surrounding solvent (53). The equilibrium constant is less sensitive to these properties when Asp-251 is mutated to Asn. Asp-251 participates in a tetrad of salt links in the crystal structure that connect the I helix, on which it sits, to the F helix (see Fig. 4A). This finding suggests that Asp-251 may influence substrate binding by participating in ionic tethers that regulate the opening and closing of the substrate access channel, which may involve motion of the F-G helix-loop-helix segment of the protein.

Electrostatic calculations of salt-link stability. To obtain an indicator of the energetic cost of perturbing the salt links to Asp-251, we computed their electrostatic contribution to protein folding stability by using a classical electrostatic continuum model (38). On average, the salt links in cytochrome P450cam [and in other proteins for which calculations have been done (34)] are neither stabilizing nor destabilizing. However, we found that the salt links to Asp-251 are exceptionally stable. The only salt links that were more stable were those to the propionate groups of the heme. This suggests that cytochrome P450cam has evolved particularly stable salt links to perform functional roles: keeping the heme group bound and regulating the opening and closing of the substrate/product access channel to the active site.

Thermal pathway analysis and molecular dynamics simulation. To probe the conformational changes for substrate access to and exit from the active site more explicitly, we performed two types of analysis: thermal pathway analysis and molecular dynamics simulation (54). In thermal pathway analysis, the measured temperature factors in the crystal structures are analyzed to identify flexible regions where ligand channels may open (55). This analysis for cytochrome P450cam indicates three particularly mobile regions of the protein as candidates for ligand channels. Similar regions were located as exit channels for expulsion of camphor from the binding site during the molecular dynamics simulations. Representative trajectories for each of the three main channels are shown in Fig. 4B. These simulations were performed for times of approximately 100 ps. This is orders of magnitude less than the time it would actually take for camphor to escape from the active site. Therefore, simulations were performed with an additional artificial randomly oriented force applied to camphor to improve its sampling and enable it to find an exit channel in a short simulation time (54). The simulations show that perturbation of the salt links to Asp-251 is not necessary for expulsion of substrate from the active site, as their geometry is perturbed in only about half the trajectories generated. Surprisingly, relatively small and localized displacements of protein atoms, involving ≈0.5- to 2-Å shifts of backbone atoms and rotation of a few side chains, are sufficient to permit camphor to escape from the protein.

FIG. 4. Ribbon diagram of the crystal structure (51) of cytochrome P450cam with the buried heme and camphor substrate shown in bold. (A) The salt-link tetrad of residues involving Asp-251 is shown in bold. The region where a channel has been proposed, on the basis of crystallographic data (51, 60), to open up to allow ligand access to the active site is indicated. This channel is lined by aromatic residues whose side chains are shown (Tyr-96, Phe-87, and Phe-193). (B) Three representative camphor exit pathways derived by molecular dynamics simulation (54) are shown by thick lines that follow the position of the center of mass of the camphor as it escapes from the active site during the trajectories. The other trajectories simulated are clustered in the vicinity of each of these trajectories.

Role of Asp-251 ionic tethers in substrate binding. What then is the reason for the dependence of camphor binding rates on salt links to Asp-251? Comparison of cytochrome P450cam with other cytochrome P450s indicates considerable flexibility in the F-G loop and the opening of a channel next to it as shown in Fig. 4A. This channel is the most often observed of the three classes of exit channel identified in the molecular dynamics simulations. Exit of camphor here during the simulations usually involves either perturbation of the F-G loop and Phe-193 or perturbation of the B' helix with rotation of Phe-87. Further evidence for this ligand channel comes from site-directed mutagenesis data, which show that mutation of Phe-87 to Trp reduces the camphor on-rate, mutation of Phe-193 to Cys increases the off-rate, and the introduction of Cys-Cys tethers to reduce the dynamics of the G helix affects on- and off-rates (56). A possible mechanism to explain the involvement of Asp-251 is that the salt links to Asp-251 regulate slower breathing motions of the protein by tethering the F-G helix-loop-helix to the I helix. These breathing motions allow for opening of the access channel, thus making the entrance and exit of substrate via the channel next to the F-G loop preferred over passage through the other channels identified in the calculations. Evidence for the modulation of the general dynamics of the protein by the salt links to Asp-251 also comes from photoacoustic calorimetry measurements of CO rebinding in cytochrome P450cam (57).
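The artificial, randomly oriented force used to expel camphor in the molecular dynamics simulations discussed above can be illustrated with a short sketch of how such a force vector might be generated. Only the direction sampling is shown; the force magnitude, the schedule on which a direction is kept or redrawn, and the function names are hypothetical and are not taken from ref. 54.

```python
import math
import random


def random_direction(rng=random):
    """Return a unit vector distributed uniformly over the sphere.

    Three independent Gaussian components give an isotropic vector,
    which is then normalized to unit length.
    """
    while True:
        vec = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in vec))
        if norm > 1e-12:  # guard against a (vanishingly unlikely) zero vector
            return [c / norm for c in vec]


def random_expulsion_force(magnitude, rng=random):
    """Force of fixed magnitude and random orientation to apply to the ligand."""
    return [magnitude * c for c in random_direction(rng)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical magnitude in arbitrary force units.
    print(random_expulsion_force(2.5, rng))
```

In a simulation loop, a vector like this would be added to the physical forces on the ligand's atoms or center of mass, and a new direction would be drawn whenever the ligand stops making progress.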

Role of ionic tethers in cytochrome P450s. Although residue 251 is conserved as Asp (or Glu) in cytochrome P450s, the salt links to Asp-251 are not conserved in the cytochrome P450s whose structures are available. However, in cytochrome P450eryF, Arg-185 in the G helix makes hydrogen bonds to the protein core that could affect the dynamics of the F-G flap in an analogous fashion to the salt links to Asp-251 in cytochrome P450cam. This observation suggests that the control of protein dynamics permitting substrate access to the active site is tuned in each cytochrome P450 according to the particular substrates it acts upon and the efficiency and regio- and stereoselectivity required. In cytochrome P450s that bind large substrates, it may be possible to achieve sufficient desolvation of the catalytic site without the need to isolate the active site from the solvent. In cytochrome P450cam, a mechanism to bury the active site is likely to be crucial to its ability to catalyze a highly regio- and stereospecific reaction with remarkably few uncoupling side reactions.

Insights into Ionic Tethering. The present examples demonstrate a role for ionic tethers in stabilizing an active conformation of an enzyme (lipase) under certain environmental conditions and in regulating deviations from an enzyme’s (cytochrome P450’s) equilibrium structure that affect substrate binding. They make ligand binding sensitive to the electrostatic properties of the protein’s surroundings even when the ligand is nonpolar. The preceding example shows that the way in which ionic tethers affect ligand binding may be rather subtle. How widespread the phenomenon of ionic tethering is in proteins remains to be investigated. Ionic tethers are not involved in all major protein conformational transitions on ligand binding, but they can play a role in the binding of charged as well as uncharged ligands. For example, for sulfate-binding protein, it has been observed that two salt links affect kinetic dissociation rate constants by stabilizing the closed liganded form and modulating the rate of cleft opening (58). Further experimental and theoretical studies are necessary to fully understand the mechanisms of ionic tethering and its relation to ligand binding to proteins.

We thank Drs. P.Artymiuk and O.Dideberg for provision of coordinate sets. This work was partially supported by the European Union(Biotech CT94–2060). S.K.L. acknowledges an ErwinSchrödinger Fellowship granted by the Austrian Fonds zur Förderung derWissenschaftlichen Forschung (JO1379-CHE).1. Davis, M.E., Madura, J.D., Sines, J., Luty, B.A., Allison, S.A. & McCammon, J.A. (1991) Methods Enzymol. 202, 473–497.2. Tan, R.C, Truong, T.N., McCammon, J.A. & Sussman, J.L. (1993) Biochemistry 32, 401–403.3. Wade, R.C. (1996) Biochem. Soc. Trans. 24, 254–259.4. Madura, J.D., Briggs, J.M., Wade, R.C. & Gabdoulline, R.R. (1998) in Encyclopedia of Computational Chemistry, eds. von Rague Schleyer, P.,

Allinger, N.L., Clark, T., Gasteiger, J., Kollman, P.A. & Schaefer, H.F. (Wiley, Chichester, U.K.), in press.5. Getzoff, E.D., Cabelli, D.E., Fisher, C.L., Parge, H.E., Viezzoli, M.S., Banci, L. & Hallewell, R.A. (1992) Nature (London) 358, 347–351.6. McCammon, J.A. (1992) Curr, Biol. 2, 585–586.7. Shafferman, A., Ordentlich, A., Barak, D., Kronman, C., Ber, R., Bino, T., Ariel, N., Osman, R. & Velan, B. (1994) EMBO J. 13, 3448–3455.8. Antosiewicz, J., McCammon, J.A., Wlodek, S.T. & Gilson, M.K. (1995) Biochemistry 34, 4211–4219.9. Antosiewicz, J., Wlodek, S.T. & McCammon, J.A. (1996) Biopolymers 39, 85–94.10. Schreiber, G. & Fersht, A.R. (1996) Nat. Struct. Biol. 3, 427–431.11. Gabdoulline, R.R. & Wade, R.C. (1997) Biophys. J. 72, 1917–1929.12. Albery, J.W. & Knowles, J.R. (1976) Biochemistry 15, 5631–5640.13. Knowles, J.R. (1991) Nature (London) 350, 121–124.14. Albery, W.J. & Knowles, J.R. (1976) Biochemistry 25, 5627–5631.15. Wade, R.C., Gabdoulline, R.R. & Luty, B.A. (1998) Proteins, in press.16. Blacklow, S.C., Raines, R.T., Lim, W.A., Zamore, P.D. & Knowles, J.R. (1988) Biochemistry 27, 1158–1167.17. Sergi, A., Ferrario, M., Polticelli, F., O’Neill, P. & Desideri, A. (1994) J. Phys. Chem. 98, 10554–10557.18. Argese, E., Viglino, P., Rotilio, G., Scarpa, M. & Rigo, A. (1987) Biochemistry 26, 3224–3228.19. Christensen, H., Martin, M.T. & Waley, S.G. (1990) Biochem. J. 266, 853–861.20. Matagne, A. & Frere, J. (1995) Biochim. Biophys. Acta 1246, 109–127.21. Qi, X. & Virden, R. (1996) Biochem. J. 315, 527–541.22. Davis, M.E. & McCammon, J.A. (1989) J. Comput. Chem. 10. 386–391.23. Bourne, Y., Redford, S.M., Steinman, H.M., Lepock, J.R., Tainer, J.A. & Getzoff, E.D. (1996) Proc. Natl. Acad. Sci. USA 93, 12774–12779.24. Pesce, A., Capasso, C., Battistoni, A., Folcarelli, S., Rotilio, G., Desideri, A. & Bolognesi, M. (1997) J. Mol. Biol. 274, 408–420.25. Foti, D., Curto, B.L., Cuzzocrea, G., Stroppolo, M.E., Polizio, F., Venanzi, M. & Desideri, A. 
(1997) Biochemistry 36, 7109–7113.26. Fisher, C.L., Cabelli, D.E., Tainer, J.A., Hallewell, R.A. & Getzoff, E.D. (1994) Proteins 19, 24–34.27. Polticelli, F., Bottaro, G., Battistoni, A., Carri, M.T., DijnovicCarugo, K., Bolognesi, M., O’Neill, P., Rotilio, G. & Desideri, A. (1995) Biochemistry

34, 6043–6049.28. Raquet, X., Lounnas, V., Lamotte-Brasseur, J., Frere, J.M. & Wade, R.C. (1997) Biophys. J. 73, 2416–2426.29. Zhou, H.-X. (1996) J. Chem. Phys. 105, 7235–7237.30. Zhou, H.-X., Briggs, J.M. & McCammon, J.A. (1996) J. Am. Chem. Soc. 118, 13069–13070.31. Zhou, H.-X., Wong, K.-Y. & Vijayakumar, M. (1997) Proc. Natl. Acad. Sci. USA 94, 12373–12377.32. McCammon, J.A. & Northrup, S.H. (1981) Nature (London) 293, 316–317.33. Northrup, S.H., Zarin, F. & McCammon, J.A. (1982) J. Phys. Chem 86, 2314–2321.34. Hendsch, Z.S. & Tidor, B. (1994) Protein Sci. 3, 211–226.35. Wimley, W.C., Gawrisch, K., Creamer, T.P. & White, S.H. (1996) Proc. Natl. Acad. Sci. USA 93, 2985–2990.36. Vogt, G., Woell, S. & Argos, P. (1997) J. MoL Biol. 269, 631–643.37. Waldburger, C.D., Schilbach, J.E. & Sauer, R.T. (1995) Nat. Struct. Biol. 2, 122–128.38. Lounnas, V. & Wade, R.C. (1997) Biochemistry 36, 5402–5417.39. Aguilar, C.F., Sanderson, I., Moracci, M., Ciaramella, M., Nucci, R., Rossi, M. & Pearl, L.H. (1997) J. Mol. Biol. 271, 789–802.40. Pappenberger, G., Schurig, H. & Jaenicke, R. (1997) J. Mol. Biol. 274, 676–683.

41. Brzozowski, A.M., Derewenda, U., Derewenda, Z.S., Dodson, G.G., Lawson, D.M., Turkenburg, J.P., Bjorkling, F., Huge-Jensen, B., Patkar, S.A. & Thim, L. (1991) Nature (London) 351, 491–494.
42. Derewenda, U., Brzozowski, A.M., Lawson, D.M. & Derewenda, Z.S. (1992) Biochemistry 31, 1532–1541.
43. van Tilbeurgh, H., Egloff, M.-P., Martinez, C., Rugani, N., Verger, R. & Cambillau, C. (1993) Nature (London) 362, 814–820.
44. Derewenda, Z.S., Derewenda, U. & Dodson, G.G. (1992) J. Mol. Biol. 227, 818–839.
45. Norin, M., Olson, O.H., Svendsen, A., Edholm, O. & Hult, K. (1993) Protein Eng. 6, 855–863.
46. Peters, G.H., Olsen, O.H., Svendsen, A. & Wade, R.C. (1996) Biophys. J. 71, 119–129.
47. Peters, G.H., Toxvaerd, S., Olsen, O.H. & Svendsen, A. (1997) Protein Eng. 10, 137–147.
48. Holmquist, M., Norin, M. & Hult, K. (1993) Lipids 28, 721–726.
49. Holmquist, M., Martinelle, M., Berglund, P., Clausen, I.G., Patkar, S., Svendsen, A. & Hult, K. (1993) J. Protein Chem. 12, 749–757.
50. Martinelle, M., Holmquist, M., Clausen, I.G., Patkar, S., Svendsen, A. & Hult, K. (1996) Protein Eng. 9, 519–524.
51. Poulos, T.L., Finzel, B.C. & Howard, A.J. (1987) J. Mol. Biol. 195, 687–700.
52. Hasemann, C.A., Kurumbail, R.G., Boddupalli, S., Peterson, J.A. & Deisenhofer, J. (1995) Structure 3, 41–62.
53. Deprez, E., Gerber, N.C., Di Primo, C., Douzou, P., Sligar, S.G. & Hui Bon Hoa, G. (1994) Biochemistry 33, 14464–14468.
54. Luedemann, S.K., Carugo, O. & Wade, R.C. (1997) J. Mol. Model. 3, 369–374.
55. Carugo, O. & Argos, P. (1998) Proteins, in press.
56. Sligar, S.G. (1995) in Cytochrome P450: Structure, Mechanism and Biochemistry, ed. Ortiz de Montellano, P.R. (Plenum, New York), pp. 83–124.
57. Di Primo, C., Deprez, E., Sligar, S.G. & Hui Bon Hoa, G. (1997) Biochemistry 36, 112–118.
58. Jacobson, B.L., He, J.J., Lemon, D.D. & Quiocho, F.A. (1992) J. Mol. Biol. 223, 27–30.
59. Hodgkin, E.E. & Richards, W.G. (1987) Int. J. Quantum Chem.: Quant. Biol. Symp. 14, 105–110.
60. Raag, R., Li, H., Jones, B.C. & Poulos, T.L. (1993) Biochemistry 32, 4571–4578.

Proc. Natl. Acad. Sci. USA
Vol. 95, pp. 5950–5955, May 1998
Colloquium Paper

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.

Computer simulations of enzyme catalysis: Finding out what has been optimized by evolution

ARIEH WARSHEL† AND JAN FLORIÁN

Department of Chemistry, University of Southern California, Los Angeles, CA 90089–1062

ABSTRACT The origin of the catalytic power of enzymes is discussed, paying attention to evolutionary constraints. It is pointed out that enzyme catalysis reflects energy contributions that cannot be determined uniquely by current experimental approaches without augmenting the analysis by computer simulation studies. The use of energy considerations and computer simulations allows one to exclude many of the popular proposals for the way enzymes work. It appears that the standard approaches used by organic chemists to catalyze reactions in solutions are not used by enzymes. This point is illustrated by considering the desolvation hypothesis and showing that it cannot account for a large increase in kcat relative to the corresponding kcage for the reference reaction in a solvent cage. The problems associated with other frequently invoked mechanisms also are outlined. Furthermore, it is pointed out that mutation studies are inconsistent with ground state destabilization mechanisms. After considering factors that were not optimized by evolution, we review computer simulation studies that reproduced the overall catalytic effect of different enzymes. These studies pointed toward electrostatic effects as the most important catalytic contributions. The nature of this electrostatic stabilization mechanism is far from being obvious because the electrostatic interaction between the reacting system and the surrounding area is similar in enzymes and in solution. However, the difference is that enzymes have a preorganized dipolar environment that does not have to pay the reorganization energy for stabilizing the relevant transition states. Apparently, the catalytic power of enzymes is stored in their folding energy in the form of the preorganized polar environment.

Enzymatic reactions are involved in the acceleration and control of most biological processes. Thus, the understanding of the origin of the enormous catalytic power of enzymes is one of the important goals in molecular biology. Unfortunately, despite the enormous progress in structural and biochemical studies of enzymes, we still cannot use direct experiments to determine uniquely what are the most important factors in enzyme catalysis. It is quite obvious that enzymes reduce the activation free energies of their reactions, but, as will be shown in this work, it does not follow that evolution can do “everything” and that all possible mechanisms (e.g., entropy, strain, dynamic effects, etc.) can provide effective ways of catalyzing enzymatic reactions. Finding out what free-energy factors can help in catalysis is far from trivial because no current experimental technique can provide direct correlation between the structure of an enzyme-substrate complex (ES) and the detailed contributions to its transition state energy. Such correlation can be established, at least in principle, by using computer simulation approaches.

This work will address the general problem of enzyme catalysis and the importance of using energy-based considerations for resolving this problem. It will be pointed out that many proposals about the catalytic power of enzymes cannot be addressed in a meaningful way without using the relevant thermodynamic cycles. This point will be illustrated by considering the desolvation hypothesis, showing that desolvation effects do not provide a useful catalytic advantage. We will also review energy considerations and computational studies of other catalytic proposals. Special attention will be given to electrostatic energies, emphasizing that such contributions appear to account for the catalytic effects of all of the enzymes that were examined by consistent computational studies. Finally, the nontrivial nature of the electrostatic catalysis will be discussed. It will be pointed out that enzymes stabilize transition states more than water does because their active sites contain dipoles that specifically have been ordered by the protein folding process. The presence of such preoriented dipoles in the enzyme active sites greatly reduces the destabilizing contribution of the so-called “reorganization energy” to the enzyme transition state binding energy.

Establishing the Key Problem in Understanding Enzyme Catalysis. To address the nature of enzyme catalysis, it is crucial to analyze the corresponding energetics in a clear way. In doing so, we start by considering the fact that most enzymes evolved to optimize kcat/Km. As shown in Fig. 1, this evolutionary constraint is equivalent to the requirement of reducing ∆g‡, which corresponds to the difference between the energy of ES‡ and E+S. A part of this reduction can be accomplished by binding the parts of the substrate that are far from the reacting region, thus stabilizing ES and ES‡, the activated complex, by the same amount. This binding effect has been obvious for a long time (see, e.g., refs. 1 and 2), but, in most instances, the binding contribution alone could not account for the large observed catalytic efficiencies of enzymes. It also has been obvious for >50 years (3, 4) that enzymes must reduce the activation barrier by interacting differently with the substrate in the ES and ES‡ states. What was not clear, and what is still one of the most fundamental problems in molecular biology, is the actual mechanism for the reduction of this activation barrier and whether this reduction involves ground state destabilization or transition state stabilization.

†To whom reprint requests should be addressed. e-mail: [email protected].
© 1998 by The National Academy of Sciences 0027–8424/98/955950–6$2.00/0
PNAS is available online at http://www.pnas.org.
Abbreviations: OMP, orotidylic acid; ODCase, orotidine monophosphate decarboxylase; ES, enzyme-substrate complex.

FIG. 1. A free-energy profile along a reaction coordinate for the enzymatic reaction in the regime of low substrate concentration. The reaction involves the formation of the ES from the free enzyme (E) and substrate (S) in solution and the formation of the activated complex [(ES)‡]. The symbols ∆g‡cat, ∆Gbind, ∆g‡, Km, and kcat/Km are, respectively, the activation, binding, and apparent activation free energies, the equilibrium constant for the dissociation of the ES complex, and the second-order rate constant for the enzyme-catalyzed reaction. The symbol A is the pre-exponential factor in the expression for the rate constant (5). The reason for using the lowercase “g” as a symbol for activation barriers is explained in ref. 5. The figure demonstrates that ∆g‡ is independent of the magnitude of the ground state destabilization (∆∆Gb).

The difficulty of finding logical explanations for the reduction of the activation barrier led‡ to many proposals that can be eliminated in hindsight by considering the evolutionary pressure on enzymes that evolved to optimize kcat/Km. That is, as seen from Fig. 1, no ground state destabilization (∆∆Gb) will help to reduce ∆g‡. Thus, it is not useful, at least from an evolutionary point of view, to use ground state destabilization mechanisms. This point can be verified easily by going backward in evolution and considering the effect of mutations on ∆g‡, kcat, and Km (see below).
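The argument of Fig. 1 can be checked numerically. The following minimal sketch assumes a rate expression of the form k = A exp(−∆g/RT) with a hypothetical pre-exponential factor A = 6.2 × 10^12 s^-1, treats Km as the dissociation constant of ES, and uses illustrative free energies that are not taken from any specific enzyme. The function names are introduced here for illustration only.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def rate_constant(barrier_kcal, prefactor=6.2e12, temperature=298.15):
    """Rate constant A*exp(-dg/RT); the prefactor is an assumed value (about kT/h)."""
    return prefactor * math.exp(-barrier_kcal / (R_KCAL * temperature))


def enzyme_kinetics(dg_bind, dg_apparent, temperature=298.15):
    """Return (Km, kcat, kcat/Km) for a two-step E + S -> ES -> activated complex scheme.

    dg_bind     : free energy of ES relative to E + S (negative when binding is favorable)
    dg_apparent : free energy of the activated complex relative to E + S
    """
    rt = R_KCAL * temperature
    k_m = math.exp(dg_bind / rt)        # dissociation constant of ES (1 M standard state)
    dg_cat = dg_apparent - dg_bind      # barrier measured from the ES ground state
    k_cat = rate_constant(dg_cat, temperature=temperature)
    return k_m, k_cat, k_cat / k_m


# Illustrative numbers: the same activated complex, with and without
# 3 kcal/mol of ground state destabilization.
km_0, kcat_0, eff_0 = enzyme_kinetics(dg_bind=-8.0, dg_apparent=10.0)
km_d, kcat_d, eff_d = enzyme_kinetics(dg_bind=-5.0, dg_apparent=10.0)

print(f"kcat ratio      : {kcat_d / kcat_0:.1f}")
print(f"Km ratio        : {km_d / km_0:.1f}")
print(f"kcat/Km ratio   : {eff_d / eff_0:.3f}")
```

Under these assumptions, destabilizing ES raises kcat and Km by the same factor, so kcat/Km is unchanged; only a change in the energy of the activated complex relative to E+S alters it.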

Before examining which mechanisms work and which do not work, it is important to realize that many of the outstanding questions in this field cannot be resolved uniquely by current experimental approaches. That is, enzyme transition states cannot be isolated experimentally and, although indirect experiments are very valuable, they cannot be interpreted without some model for structure-function correlation. In addition, it is important to realize that the issue of enzyme catalysis is an energy issue, and, as such, it cannot be resolved without the ability to dissect the observed energy into its individual contributions. Finally, in analyzing the effect of enzymes on the activation barrier, it is essential to focus on the proper reference state, thus avoiding considerations of irrelevant factors. One of the most effective ways of doing so involves comparison of the given assumed mechanism in the enzyme active site with the same mechanism in a solvent cage, where all of the reactants are at a contact distance (5) (Fig. 2). This definition allows one to avoid the rather trivial question associated with bringing the reactants to the same solvent cage (5) and to focus on the origin of the difference between kcat and kcage. In other words, such an analysis forces one to focus on the true reason for the fact that kcat is much larger than kcage. The rate constant kcage can be evaluated from experimental information about elementary reactions in solutions (5, 6) and/or ab initio calculations in solution (7, 8), but such studies are not practiced by most workers in the field, in part because of the difficulties in estimating the energetics of some reaction intermediates in aqueous solution and the frequent reluctance to ask quantitative questions about energetics. Thus, in many cases, the discussion of the catalytic power of enzymes overlooks the most important question: How large is the effect of the enzyme environment? Instructive works documented the large acceleration of the reaction rate in different enzymes (9, 10) by comparing kcat/Km to the second-order rate constant in water. However, such a comparison includes the effect of the binding energy (∆Gbind) and does not tell us about the effect of the enzyme environment on the activation barrier. For example, our recent analysis of the catalytic reaction of ribonuclease (T.M. Glennon and A.W., unpublished work) indicated that this enzyme provides a transition state stabilization as large as ≈24 kcal/mol. This fact (which is not mentioned in the vast literature about ribonucleases) presents a major theoretical challenge because it is hard to see how simple environmental effects can lead to such a large free-energy change. Trying to address such problems quantitatively forces one to quantify the effects of different catalytic factors and to offer a concrete explanation for the overall reduction of the activation barrier.
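To give a sense of scale for a barrier reduction of this size, the short sketch below converts a difference in activation free energy into a ratio of rate constants. It assumes that the enzymatic and cage reactions share the same pre-exponential factor and that T = 298.15 K; both are simplifying assumptions made here, and the function name is illustrative.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def rate_enhancement(ddg_kcal, temperature=298.15):
    """Ratio kcat/kcage for a barrier reduction ddg_kcal (kcal/mol), equal prefactors assumed."""
    return math.exp(ddg_kcal / (R_KCAL * temperature))


# Barrier reduction of 24 kcal/mol, the value quoted in the text for ribonuclease.
log10_ratio = math.log10(rate_enhancement(24.0))
print(f"kcat/kcage is about 10^{log10_ratio:.1f}")
```

With these assumptions a 24 kcal/mol stabilization corresponds to between 17 and 18 orders of magnitude in rate, since each 1.36 kcal/mol at room temperature is worth one factor of ten.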

FIG. 2. A comparison of the free-energy profiles for an enzymatic reaction and for a reaction proceeding via an identical mechanism in a reference solvent cage. The symbols E, S, and Saq designate the enzyme, the substrate, and the substrate in the bulk solvent, respectively. The activation free energy of the cage reaction corresponds to the same reaction mechanism assumed for the given enzymatic reaction (i.e., it does not necessarily correspond to the actual reaction in solution). This free energy can be determined by using experimental information for the related elementary reaction(s) or by using ab initio calculations.

‡Trying, for example, to explain the differential binding of the ground state and the transition state by van der Waals interactions can be accomplished only by invoking the repulsive part of these forces. Consequently, these forces can be involved only in ground state destabilization effects (which eventually were found to be inconsistent with the flexibility of proteins). As far as the van der Waals attraction between the enzyme and substrate is concerned, it is very similar for the ground and transition states. This insensitivity to the exact structure is caused by the nature of the London dispersion forces that are approximately proportional to the number of interacting atoms. Similarly, hydrophobic forces cannot provide large differential binding for the ground and transition state of the reacting fragments. Finally, even electrostatic effects, which do contribute to the transition state stabilization, accomplish this stabilization in a complex way (see below) that was not realized by early workers in the field.

Finding Out What Was Not Optimized by Evolution: Desolvation and Other Proposals. Numerous seemingly reasonable proposals for the origin of the catalytic power of enzymes have appeared in the literature (see, e.g., refs. 1 and 11–23). However, some of these proposals turn out to be ineffective after one considers the relevant thermodynamic cycle and when one uses computer simulation approaches. As an example of this point, we will first consider the desolvation hypothesis. This hypothesis, as introduced by Cohen et al. (13) and Crosby et al. (14), suggested that enzyme active sites become basically nonpolar after the removal of water molecules and that such nonpolar sites help in accelerating enzymatic reactions. Realizing that a polar substrate would not bind in a nonpolar cavity, these authors pointed out that the binding energy of the nonreacting part of the substrate could be used as a driving force for the ground state destabilization by desolvation. By comparing activation energies measured for the reaction catalyzed by lyophilized hydrogenase in dry state and in solution (24), they estimated the desolvation contribution to kcat to be ≥10³. Later, Jencks (1, 25) included the desolvation mechanism in the family of enzyme mechanisms that are based on the ground state destabilization concept. The “geometric destabilization” (strain, substrate distortion) and “induced destabilization” were considered to belong in the same class. The common denominator of these mechanisms was that they increased kcat by destabilizing the ground state (ES in Figs. 1 and 2) for the given enzymatic reaction (1, 25). Recently, large desolvation contributions to kcat were predicted by quantum mechanical calculations (26–29) that considered enzymatic reactions involving reactive ionized groups in their ground state. This finding led several research groups to the proposal that enzyme proficiency primarily is caused by the nonpolar enzyme active sites evolved to stabilize the gas-phase transition states (26–29).

Although it is possible that small ground state destabilization accompanied with the strong binding of a distant part of a substrate could in principle increase kcat by up to three orders of magnitude, it is unclear how the ground state destabilization could lead to the increase of kcat/Km and, consequently, to any evolutionary advantage. In addition to this general point, which pertains to all ground state destabilization mechanisms, there are problems related specifically to the desolvation hypothesis. These points were established clearly by the quantitative analysis of the reaction profile of amide hydrolysis in gas-phase, in aqueous solution, and in the enzyme (30). Here, we reiterate general problems associated with the idea of catalysis by nonpolar enzyme active sites and illustrate these concepts for the particular case of decarboxylation of orotidylic acid (OMP) to uridylic acid by orotidine monophosphate decarboxylase (ODCase) (28) and the hydrolysis of alkyl halides by haloalkane dehalogenase (29).

To explain the enormous proficiency of ODCase (10), which participates in the biosynthesis of pyrimidine nucleotides, Lee and Houk suggested that the enzymatic reaction involves a transformation from a ground state composed of a lysine⁺–OMP⁻ ion pair to a neutral transition state (Fig. 3) “in a nonpolar enzyme environment” (28). In the absence of the experimental structure of this enzyme, the proposal of Lee and Houk was based on the results of ab initio calculations that modeled the enzyme environment by a dielectric continuum model. To be more specific, they found that the decarboxylation reaction is barrierless in the gas phase and that the experimentally observed magnitude of kcat is reproduced by a model that considers the enzyme as a uniform medium with a dielectric constant ε=4 (Fig. 4). Consequently, they concluded that the ODCase works by providing a nonpolar environment for the decarboxylation reaction.

FIG. 3. The mechanism for the enzymatic decarboxylation of the OMP suggested by Lee and Houk (28). The ground and transition states for this reaction are denoted as EH+S– and ESH‡, respectively. The –NH3+ and –NH2 groups belong to the catalytic lysine residue in the hypothetical nonpolar active site of ODCase.

FIG. 4. An inconsistent free-energy diagram for the decarboxylation of OMP by ODCase in hypothetical active sites characterized by different values of the dielectric constant (ε). Note also that, in deriving this diagram, Lee and Houk modeled the EH+S– ground state as EH+ +S– (infinitely separated ions).

Unfortunately, the energy of the EH+S– state in ε=1 and ε=4 is not at all identical to the corresponding energy at ε=78.§ For example, the gas-phase energy should be pushed up by the corresponding absolute value of the solvation free energy. Considering the uninteracting enzyme and substrate molecules in aqueous solution as a correct reference state (or simply using any single correct reference state), one obtains a qualitatively different reaction profile (Fig. 5). Here, the highest barrier corresponds to the formation of the R–NH3+ +S– ion pair in a vacuum-like environment. This barrier and the related barrier at ε=2 reflect the fact that ion pairs are less stable in a nonpolar environment (32). In addition, the uncharged transition state (EHS‡) now has the same energy in all three environments. Thus, the transition state stabilization of Fig. 4 disappears. Furthermore, the proposal of Lee and Houk (28) is undermined by the unrealistically low ε needed to obtain the experimental kcat (Fig. 5) and the fact that the NH3+ group will be deprotonated in such a nonpolar environment. This unprotonated alternative (NH2 +S–) reference state is denoted in Fig. 5 as E+S–. In fact, even the NH3+S– ion pair will become a neutral pair (NH2SH) in a nonpolar environment (for the sake of simplicity, this neutral pair is not shown in Fig. 5). At any rate, the enormous catalytic effect found by Lee and Houk will disappear once fully consistent thermodynamic and electrostatic considerations are invoked.
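The role of a single consistent reference state can be illustrated with a rough continuum estimate. The sketch below is not the procedure used for Fig. 5; it is a simplified Born-type model, assumed here for illustration, in which the aqueous solvation free energies of the two ions (–68 and –76 kcal/mol, the values quoted in the legend of Fig. 5) are scaled by (1 – 1/ε) and the charge-charge interaction at 6 Å is screened by ε. The function names and the scaling rule are assumptions of this sketch.

```python
COULOMB = 332.06    # Coulomb constant in kcal*Angstrom/(mol*e^2)
EPS_WATER = 78.0    # dielectric constant taken for water


def pair_solvation(eps, dg_ions_water=(-68.0, -76.0), separation=6.0):
    """Approximate solvation free energy (kcal/mol) of a +1/-1 ion pair in a medium eps.

    Self terms: aqueous solvation energies rescaled by the Born factor (1 - 1/eps).
    Cross term: loss of Coulomb attraction on screening, +332/r * (1 - 1/eps).
    """
    born_factor = 1.0 - 1.0 / eps
    self_term = sum(dg_ions_water) * born_factor / (1.0 - 1.0 / EPS_WATER)
    cross_term = (COULOMB / separation) * born_factor
    return self_term + cross_term


def transfer_cost(eps):
    """Free energy of moving the ion pair from water into a uniform medium eps."""
    return pair_solvation(eps) - pair_solvation(EPS_WATER)


for eps in (78.0, 4.0, 2.0, 1.0):
    print(f"eps = {eps:5.1f}   transfer cost of the ion pair = {transfer_cost(eps):6.1f} kcal/mol")
```

In this simplified picture the ionic ground state is raised by tens of kcal/mol as ε decreases, whereas a neutral transition state is affected far less, so a barrier that looks small when each environment is given its own zero of energy grows once all states are referred to water. The gas-phase Coulomb energy of unit charges at 6 Å, 332/6 ≈ 55 kcal/mol, matches the magnitude of the interaction energy quoted in the legend of Fig. 5.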

FIG. 5. A consistent free-energy diagram for the decarboxylation of OMP by ODCase. Note that the free energies of the ground state (EH+S–) in each environment involve the contribution of the corresponding solvation free energies. [The magnitude of this contribution was determined by the polarized continuum model (PCM HF/6–31G*) implemented in the Gaussian 94 program (59). The default Pauling’s atomic van der Waals radii scaled by 1.2 were used. The calculated solvation free energies for orotate, CH3NH3+, carbene-methylamine complex, and CO2 were –68, –76, –13.8, and –1.3 kcal/mol, respectively. For the EH+S– state, in which the orotate and CH3NH3+ ions were assumed to lie 6 Å apart, solvation free energy was estimated by using the generalized Born formula, in which the gas phase interaction energy of –55 kcal/mol was assumed. In addition, experimental binding free energy of the ODCase-OMP complex of –9 kcal/mol was taken into account.] As is clear from the figure, the ground state energies are very different at different values of ε (in contrast to the free-energy diagram presented in Fig. 4). Now the transition state energy is nearly identical in different environments, and the experimental barrier of 16 kcal/mol that corresponds to kcat is obtained only for ε as low as 1.5 (this value is unrealistic and will not support an ionized NH3+ group).

Without having the crystal structure of this protein, we can only point out major inconsistencies in the model of Lee and Houk; we cannot show how the enzyme stabilizes its transition state in this specific case. However, every case that we have examined (see also below) was found to involve transition state stabilization by a very polar active site. In this respect, it is very instructive to consider the pioneering work of Lienhard and coworkers (14), who introduced the desolvation idea in their study of pyruvate decarboxylase. These authors postulated that the active site of the enzyme must be nonpolar, and this view was later adopted (1) as a major support of the desolvation hypothesis. Recently, the crystal structure of pyruvate decarboxylase (33) revealed that its active site contains many polar residues. In particular, the substrate binding site is located in the region of the contact of the a and b domains of the protein and contains the Glu91, His92, Cys221, and Cys222 residues. Although a part of the thiamin cofactor for this reaction is surrounded by nonpolar amino acid residues, the activation of this cofactor occurs via general base catalysis involving the water molecule and several polar residues. Moreover, the current mechanism of the substrate activation by the cofactor (34), which is supported by the available structural information (33), differs from the mechanism using the formation of the neutral pyruvic acid in the nonpolar environment, as suggested by Crosby et al. (14). Although the overall pyruvate decarboxylation reaction involves several steps for which the structures of the relevant intermediates in the enzyme have not yet been determined by x-ray crystallography, it is clear that the enzyme active site is very heterogeneous, with numerous preoriented polar and ionic groups.

Another theoretical study that invoked the desolvation hypothesis is a recent investigation of the SN2 displacement of Cl– from dichloroethane in the active site of haloalkane dehalogenase by Bruice and coworkers (29). In this case, unlike the ODCase discussed above, it was possible to base the theoretical analysis on the known crystal structures (at 1.9- and 2.4-Å resolution, respectively) of this enzyme (35) and its complex with dichloroethane (36). The assumed reaction mechanism involves the nucleophilic attack of the ionized aspartate residue (Asp124) on the carbon atom of the substrate, which results in the displacement of the Cl– ion and formation of the alkyl ester intermediate (Fig. 6). In the next reaction step, the product (chloroethanol) is formed by the nucleophilic attack of the water molecule on the alkyl intermediate, recovering the aspartate residue. The nucleophilic Asp124 is, together with the nearby His289 and Asp260 residues, situated in a cavity that also contains two Trp residues that form hydrogen bonds with the displaced chloride ion. The structure of the corresponding transition state was calculated by the semiempirical PM3 method for the reaction occurring in the gas phase, in a dielectric continuum (ε=80), and in the enzyme active site (29). The enzyme active site was modeled by 14 amino acids, including the Trp125 and Trp175 residues, and a catalytic water molecule. As in the case of ODCase, the nonenzymatic reaction (37) was calculated to be extremely slow in aqueous solution, and the gas-phase reaction was found to be very fast. Moreover, the geometry and the relative energy of the transition state inside the enzyme active site were found to be similar to the transition state energy and geometry obtained at the same computational level for the gas-phase reaction. Lightstone et al. (29) concluded that the hydrogen bonds of two Trp residues are important for stabilizing the transition state [note that such hydrogen bonds are the dipoles considered in our early studies (5, 22)]. They also pointed out the importance of small reorganization energy that will be discussed below. However, reviving the arguments of Dewar and Storch (26), Lightstone et al. (29) also concluded that the enzyme operates by a desolvation mechanism, destabilizing the reactants in a gas-phase-like environment.

FIG. 6. The proposed (29, 36) mechanism for the enzymatic dehalogenation of the dichloroethane. The ground and transition states for this reaction are denoted as ES and (ES)‡, respectively. The –COO– group belongs to the catalytic Asp residue in the active site of the haloalkane dehalogenase (29, 36).

Although the above proposal contains correct elements, it has several problems. First, a detailed examination of the active site of haloalkane dehalogenase reveals a very polar environment at the chemically relevant sites.¶ In fact, this environment is already entirely obvious from the inspection of the x-ray structure where two dipoles (two hydrogen bonds)

§Note that using macroscopic concepts in describing electrostatic effects in proteins is an inadequate approach (see, e.g., ref. 31). For example, the use of a uniform dielectric constant is inadequate as a measure of the polarity of the enzyme active site because an active site containing fixed dipoles is polar, but its dielectric constant can be rather small. However, Lee and Houk (28) clearly meant the low dielectric active site to be a homogeneous, nonpolar environment analogous to the nonpolar solvents used by organic chemists to accelerate chemical reactions. Here, we will invoke macroscopic concepts (i.e., dielectric constants) just to show that enzyme active sites do not use such an environment.

are provided by Trp125 and Trp175, two dipoles by the main-chain peptide bonds of Glu56 and Trp125, and two by the end of the α-helix (36). Second, the same energy considerations applied above for ODCase show that the desolvation hypothesis does not work in the present case. To be more specific, if the active site of haloalkane dehalogenase really were nonpolar, the nucleophile Asp124 would not be ionized in the ground state. An attack by the neutral Asp on the dichloroethane [not considered by Lightstone et al. (29)], which is possible in principle, would involve a zwitterionic transition state and, as such, would proceed more slowly in the nonpolar than in the polar environment. Third, the low activation barrier calculated for the gas-phase reaction reflects the assumption that the ground state of this system involves the ionized Asp124. Although Asp124 probably is ionized in the real enzyme, using an ionized nucleophile in the gas-phase calculation amounts to starting the reaction from a high energy intermediate (30). Finally, the correct inclusion of some of the active site dipoles (e.g., the Trp125 and Trp175 dipoles) amounts to a study of the given reaction in a polar environment and not in the gas phase. Saying that the enzyme works by using the gas-phase environment plus hydrogen bonds to stabilize the transition state is problematic. In the same way, one might include the entire enzyme explicitly while neglecting the solvent and state that the system works like a gas-phase system.

Apparently, the desolvation hypothesis reflects the realization that organic chemists can accelerate chemical reactions by moving them from polar to nonpolar solvents (thus, one would assume that enzymes can accomplish the same trick). However, what is missing in this analysis is the energy of moving the reactants from a test tube with a polar solvent to a test tube with a nonpolar solvent. This energy contribution is equivalent to the energy lost when the substrate is moved from water to the hypothetical nonpolar active site. Also overlooked is the absence of an evolutionary pressure to use binding energy in distant regions to destabilize the reacting parts in their ground state. To be more specific, one may design a catalytic antibody that will gain some increase in kcat by a desolvation mechanism, but enzymes do not have the incentive to do so (this mechanism will not help in increasing kcat/Km). Furthermore, as was mentioned above, enzymes cannot destabilize the ionized form of natural amino acids and still use them as nucleophiles because such ionized groups will become unionized in nonpolar sites.
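The bookkeeping behind this argument can be sketched with a minimal free energy cycle. The sketch below is an illustration only: the function names and all numbers are hypothetical, and it assumes the simplest scheme, E + S → ES → (ES)‡, in which the kcat barrier is measured from ES and the kcat/Km barrier is measured from the free enzyme and substrate.

```python
def barriers(dg_bind, dg_cat):
    """Return the apparent barriers (kcal/mol) for kcat and kcat/Km.

    dg_bind: free energy of ES relative to E + S (negative = favorable binding)
    dg_cat:  free energy of (ES)‡ relative to ES
    """
    barrier_kcat = dg_cat                    # measured from ES
    barrier_kcat_over_km = dg_bind + dg_cat  # measured from E + S
    return barrier_kcat, barrier_kcat_over_km


def destabilize_ground_state(dg_bind, dg_cat, penalty):
    """Raise ES by `penalty`, leaving E + S and (ES)‡ where they were."""
    return dg_bind + penalty, dg_cat - penalty


# Hypothetical numbers, chosen only for illustration.
native = barriers(dg_bind=-5.0, dg_cat=15.0)
strained = barriers(*destabilize_ground_state(-5.0, 15.0, penalty=3.0))
print(native)    # (15.0, 10.0)
print(strained)  # (12.0, 10.0)
```

In this toy cycle the destabilization lowers the kcat barrier by exactly the penalty paid in binding, so the barrier that controls kcat/Km does not move.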

In addition to the desolvation hypothesis, there are other mechanisms for reducing the activation barrier that can be excluded by using energy considerations and computer simulations. These include the following mechanisms: (i) The strain mechanism (1, 11). This mechanism involves a ground state destabilization caused by steric strain. However, force field calculations (5, 39, 40) have established that such a mechanism cannot account for a large reduction in the activation barrier. (ii) The orbital steering mechanism (16). Such a mechanism requires that the approach of the reacting molecules is restricted to a very narrow angular range by a steep potential energy surface in the direction perpendicular to the reaction coordinate. This proposal can be excluded once the actual energetics is estimated (5, 41). (iii) The low-barrier hydrogen bond mechanism (20, 21). This mechanism implies that hydrogen bonds catalyze reactions by forming partial covalent bonds to the corresponding transition state. This proposal has been shown to be anticatalytic (relative to the corresponding regular hydrogen bond) by considering the relationship between the solvation energy and charge delocalization (42). (iv) The idea that enzyme catalysis involves significant dynamic effects (17, 18) has been excluded by computer simulation studies (5, 43). Other proposals, such as the idea that ground state destabilization by entropic effects is the origin of the reduction in the activation barrier (1, 12), cannot yet be excluded by computer simulation studies because of convergence problems. However, qualitative estimates (5) do not support this idea (see also ref. 44). Of interest, none of the suggestions that involve ground state destabilization is supported by mutation experiments. That is, the ground state destabilization mechanisms require that at least some of the mutations that change the activity of the enzyme in a significant way will involve large ground state stabilization (this means that the native enzymes evolved by using the specific group to induce ground state destabilization). However, mutations that lead to a large loss in catalysis (see, e.g., refs. 45–47) involve mainly either a large reduction of kcat accompanied by small changes in Km (small changes in the ground state free energy) or an increase in Km with no change in kcat (equal destabilization of the ground and transition states).
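The translation from measured kinetic constants to the free energy changes discussed here can be sketched as follows. This is an illustration under stated assumptions, not a procedure taken from the cited mutation studies: it assumes a transition-state-theory-type relation in which a rate ratio maps to a barrier change through RT ln, it treats Km as an approximate dissociation constant, and the function names and the mutant values are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def barrier_change(kcat_mutant, kcat_native, temperature=298.15):
    """Change in activation free energy (kcal/mol) implied by a kcat ratio."""
    return -R_KCAL * temperature * math.log(kcat_mutant / kcat_native)


def ground_state_change(km_mutant, km_native, temperature=298.15):
    """Change in ES free energy relative to E + S (kcal/mol), assuming Km ~ Ks."""
    return R_KCAL * temperature * math.log(km_mutant / km_native)


# Hypothetical mutant: kcat drops 1000-fold, Km is essentially unchanged.
print(round(barrier_change(1.0e-3, 1.0), 2))    # 4.09
print(round(ground_state_change(1.1, 1.0), 2))  # 0.06
```

For such a mutant, nearly all of the roughly 4 kcal/mol effect appears in the barrier and almost none in the ground state, which is the pattern described above for mutations that cause a large loss in catalysis.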

Using Computer Modeling To Determine How Enzymes Really Work. As clarified in the previous section, it is possible to exclude some major catalytic proposals by simple energy considerations. Doing so, however, is not sufficient if one wishes to determine the real origin of enzyme catalysis. It seems to us that the only way of resolving this issue is to take crystal (or solution) structures of different enzymes and to reproduce the observed reduction in the activation barrier. Once this is accomplished, it is simple to examine which energy contributions are responsible for the overall effect. The selection of the proper computational strategy is not completely obvious. In principle, one can use the hybrid quantum mechanical/molecular mechanics approach introduced by Warshel and Levitt (40). However, despite the recent popularity of this approach (48–50), it does not yet provide sufficiently quantitative answers. The problems are that (i) current semi-empirical molecular orbital models are not accurate enough; (ii) most quantum mechanical/molecular mechanics approaches do not properly treat the complete enzyme-substrate environment; and (iii) ab initio quantum mechanical/molecular mechanics approaches are too time consuming and, with the exception of one approach (51), do not involve calculation of activation free energies.

At present, the most effective and consistent way of using computer simulations in studies of enzyme catalysis is provided by the empirical valence bond method (5, 52, 53). This method does not try to evaluate the energy surface of the substrate from first principles but rather focuses on the change in the free energy on moving from the reference solvent cage to the enzyme active site. Thus, this method focuses directly on the catalytic effect, that is, on the difference between the activation barriers in the enzyme and in solution. Extensive empirical valence bond studies of many enzymatic reactions have been reported in the literature (5, 6, 53–56). Many of these simulation studies provided quantitative or semiquantitative results, frequently reproducing the overall catalytic effect of the enzyme. The most significant finding of all of these studies is that the largest catalytic effect always is associated with electrostatic contributions. In other words, the electrostatic stabilization of the transition state is larger in the enzyme active site than in water.

The finding that enzymes provide large electrostatic stabilization is far from trivial and, in fact, seems at first sight to be inconsistent with all studies before the emergence of computer modeling. For example, studies of model compounds in solution have not reproduced large electrostatic effects even with covalently linked ionized groups that are aligned properly to stabilize ionic transition states (57). This fact can be rationalized by saying that electrostatic effects cannot be large in aqueous solution because the dielectric constant is large in such an environment even at a short interaction distance (22). It thus can be argued that protein active sites with low dielectric constant should be able to enhance electrostatic effects (58). However, ionized groups that were supposed to be the source of electrostatic effects in enzymes would not be ionized in low dielectric sites. Of course, the argument that electrostatic effects are small in high dielectric environments and cannot exist in low dielectric environments (which is the best that could have been concluded before the emergence of crystal structures of enzymes) is not correct. Protein active sites are neither homogeneous low dielectric nor homogeneous high dielectric media. They are usually very polar heterogeneous sites (22, 31). This fact, however, cannot by itself explain the finding that protein active sites provide larger stabilization to ionic transition states than water does. Here, the most puzzling problem is that the actual average electrostatic interaction between the transition state in an enzyme and the surrounding dipoles is not larger than the corresponding interaction with the dipoles of the reference solvent cage. The solution to this problem was given in an earlier work (22), which pointed out that, in polar solvents, about half of the charge-dipole interaction is spent on dipole-dipole interaction (<Vµµ>), so that the solvation free energy, ∆Gsol, amounts to only about half of the average charge-dipole interaction.

¶The statement that the active site is polar may sound unreasonable in view of the fact that the active site was found to contain mostly hydrophobic residues (35). However, considering the residue type without proper computational tools for structure-function correlation may be quite deceiving. As is clear now to most workers who are involved in studies of electrostatic effects in proteins (see, e.g., refs. 31 and 38), even hydrophobic residues have very polar main-chain dipoles. Thus, the main-chain dipoles of hydrophobic groups and the side-chain dipoles of a few selected residues are sufficient to create high polarity in the proper places. Therefore, whether the active site environment is polar should be determined by calculating the interaction energy of this environment with the particular substrate and not by counting amino acids.

In proteins, however, a significant part of <∆Vµµ> (or the corresponding reorganization energy) already is paid during the folding process, in which the folding energy is used to compensate for dipole-dipole repulsion and to align the active site dipoles in a way that will maximize ∆Gsol. The preoriented environment allows the protein to minimize the reorganization energy associated with the formation of the charged transition state (see Figure 9.7 of ref. 5).
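The statement that about half of the charge-dipole interaction is spent on dipole-dipole interaction, and that a protein can prepay part of this cost during folding, can be illustrated with a toy calculation. The sketch below is not the empirical valence bond treatment and is not taken from refs. 5 or 22: the function name, the `prepaid_fraction` parameter, and all numbers are assumptions introduced here to show the bookkeeping only.

```python
def solvation_free_energy(v_charge_dipole, prepaid_fraction=0.0):
    """Toy estimate of the solvation free energy of a charge (kcal/mol).

    v_charge_dipole:  average charge-dipole interaction (negative = stabilizing)
    prepaid_fraction: hypothetical share of the dipole reorientation cost that
                      was already paid elsewhere (e.g., during protein folding)
    """
    if not 0.0 <= prepaid_fraction <= 1.0:
        raise ValueError("prepaid_fraction must lie between 0 and 1")
    # Assumption: orienting the dipoles costs half of the interaction gained.
    reorganization_cost = -0.5 * v_charge_dipole
    return v_charge_dipole + (1.0 - prepaid_fraction) * reorganization_cost


# The same hypothetical charge-dipole interaction in both environments.
water = solvation_free_energy(-60.0)                          # nothing prepaid
protein = solvation_free_energy(-60.0, prepaid_fraction=0.5)  # half prepaid
print(water, protein)  # -30.0 -45.0
```

With identical charge-dipole interactions, the preoriented environment in this toy model still gives the larger net stabilization, because less of the interaction is consumed by reorienting the dipoles.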

The idea of preorganized dipoles as the source of the catalytic power of enzymes explains what really was done by evolution in optimizing enzymes. That is, the active site has to interact with the changes that occur in the substrate during the formation of the transition state to reduce the activation barrier.

The structural changes of the substrate during the given reaction could not be used effectively because enzymes are rather flexible; changes in entropy were not useful because they can only help in ground state destabilization. Thus, the most effective way was to interact with the changes in charges during the reaction. Here, the requirement of providing a large stabilization to the transition state charges forced the enzyme to be a better "solvent" for the transition state than aqueous solution (22), which can be accomplished by preorienting the active site dipoles. Note that the resulting environment is exactly the opposite of the nonpolar environment envisioned as a source of catalytic energy in the desolvation models above. Finally, the difficulty in realizing that enzymes use preorganized dipoles can be traced in part to the fact that the catalytic energy is already invested in the folding process and not in the enzyme-substrate interaction.

This work was supported by National Institutes of Health Grant GM24492.

1. Jencks, W.P. (1987) Catalysis in Chemistry and Biology (Dover, New York).
2. Fersht, A.R. (1985) Enzyme Structure and Mechanism (Freeman, New York).
3. Haldane, J.B.S. (1930) Enzymes (Longman, New York), p. 182.
4. Pauling, L. (1946) Chem. Eng. News 24, 1375.
5. Warshel, A. (1991) Computer Modeling of Chemical Reactions in Enzymes and Solutions (Wiley, New York).
6. Fuxreiter, M. & Warshel, A. (1998) J. Am. Chem. Soc. 120, 183–194.
7. Florián, J. & Warshel, A. (1997) J. Am. Chem. Soc. 119, 5473–5474.
8. Florián, J. & Warshel, A. (1998) J. Phys. Chem. B 102, 719–734.
9. Radzicka, A. & Wolfenden, R. (1996) J. Am. Chem. Soc. 118, 6105–6109.
10. Radzicka, A. & Wolfenden, R. (1995) Science 267, 90–93.
11. Phillips, D.C. (1966) Sci. Am. 215, 78–90.
12. Page, M.I. (1977) Angew. Chem. Int. Ed. Engl. 16, 449–459.
13. Cohen, S.G., Vaidya, V.M. & Schultz, R.M. (1970) Proc. Natl. Acad. Sci. USA 66, 249–256.
14. Crosby, J., Stone, R. & Lienhard, G.E. (1970) J. Am. Chem. Soc. 92, 2891–2900.
15. Menger, F.M. (1985) Acc. Chem. Res. 18, 128–134.
16. Storm, D.R. & Koshland, D.E. (1970) Proc. Natl. Acad. Sci. USA 66, 445–452.
17. Careri, G., Fasella, P. & Gratton, E. (1979) Annu. Rev. Biophys. Bioeng. 8, 69–97.
18. Gavish, B. & Werber, M.M. (1979) Biochemistry 18, 1269–1275.
19. McCammon, J.A., Wolynes, P.G. & Karplus, M. (1979) Biochemistry 18, 927–942.
20. Frey, P.A., Whitt, S.A. & Tobin, J.B. (1994) Science 264, 1927–1930.
21. Cleland, W.W. & Kreevoy, M.M. (1994) Science 264, 1887–1890.
22. Warshel, A. (1978) Proc. Natl. Acad. Sci. USA 75, 5250–5254.
23. Cha, Y., Murray, C.J. & Klinman, J.P. (1989) Science 243, 1325–1330.
24. Yagi, T., Tsuda, M., Mori, Y. & Inokuchi, H. (1969) J. Am. Chem. Soc. 91, 2801–2807.
25. Jencks, W.P. (1975) in Advances in Enzymology and Related Areas of Molecular Biology, ed. Meister, A. (Wiley, New York), Vol. 43, pp. 219–410.
26. Dewar, M.J.S. & Storch, D.M. (1985) Proc. Natl. Acad. Sci. USA 82, 2225–2229.
27. Dewar, M.J.S. & Dieter, K.M. (1988) Biochemistry 27, 3302–3308.
28. Lee, J.K. & Houk, K.N. (1997) Science 276, 942–945.
29. Lightstone, F.C., Zheng, Y.J., Maulitz, A.H. & Bruice, T.C. (1997) Proc. Natl. Acad. Sci. USA 94, 8417–8420.
30. Warshel, A., Åqvist, J. & Creighton, S. (1989) Proc. Natl. Acad. Sci. USA 86, 5820–5824.
31. Warshel, A. & Russell, S.T. (1984) Q. Rev. Biol. 17, 283–421.
32. Warshel, A. (1981) Biochemistry 20, 3167–3177.
33. Arjunan, P., Umland, T., Dyda, F., Swaminathan, S., Furey, W., Sax, M., Farrenkopf, B., Gao, Y., Zhang, D. & Jordan, F. (1996) J. Mol. Biol. 256, 590–600.
34. Metzler, D.F. (1977) Biochemistry (Academic, New York).
35. Verschueren, K.H.G., Franken, S.M., Rozenboom, H.J., Kalk, K.H. & Dijkstra, B.W. (1993) J. Mol. Biol. 1993, 856–872.
36. Verschueren, K.H.G., Seljee, F., Rozenboom, H.J., Kalk, K.H. & Dijkstra, B.W. (1993) Nature (London) 363, 69–698.
37. Maulitz, A.H., Lightstone, F.C., Zheng, Y.-J. & Bruice, T.C. (1997) Proc. Natl. Acad. Sci. USA 94, 6591–6595.
38. Sharp, K.A. & Honig, B. (1990) Annu. Rev. Biophys. Biophys. Chem. 19, 301–332.
39. Levitt, M. (1974) in Peptides, Polypeptides and Proteins, eds. Blout, E.R., Bovey, F.A., Goodman, M. & Lotan, N. (Wiley, New York), pp. 99–113.
40. Warshel, A. & Levitt, M. (1976) J. Mol. Biol. 103, 227–249.
41. Bruice, T.C., Brown, A. & Harris, D.O. (1971) Proc. Natl. Acad. Sci. USA 68, 658–661.
42. Warshel, A. & Papazyan, A. (1996) Proc. Natl. Acad. Sci. USA 98, 13665–13670.
43. Warshel, A., Sussman, F. & Hwang, J.-K. (1988) J. Mol. Biol. 201, 139–159.
44. Lightstone, F.C. & Bruice, T.C. (1996) J. Am. Chem. Soc. 118, 2595–2605.
45. Wilks, H.M., Hart, K.W., Feeney, R., Dunn, C.R., Muirhead, H., Chia, W.N., Barstow, D.A., Atkinson, T., Clarke, A.R. & Holbrook, J.J. (1988) Science 242, 1541–1544.
46. Leatherbarrow, R.J., Fersht, A.R. & Winter, G. (1985) Proc. Natl. Acad. Sci. USA 82, 7840–7844.
47. Carter, P. & Wells, J.A. (1990) Proteins 7, 335–342.
48. Waszkowycz, B., Hillier, I.H., Gensmantel, N. & Payling, D.W. (1990) J. Chem. Soc. Perkin Trans. 2, 1259–1264.
49. Bash, P.A., Field, M.J., Davenport, R.C., Petsko, G.A., Ringe, D. & Karplus, M. (1991) Biochemistry 30, 5826–5832.
50. Mulholland, A.J., Grant, G.H. & Richards, W.G. (1993) Protein Eng. 6, 133–147.
51. Bentzien, J., Muller, R.P., Florián, J. & Warshel, A. (1998) J. Phys. Chem. B 102, 2293–2301.
52. Warshel, A. & Weiss, R.M. (1980) J. Am. Chem. Soc. 102, 6218–6226.
53. Åqvist, J. & Warshel, A. (1993) Chem. Rev. (Washington, D.C.) 93, 2523–2544.
54. Åqvist, J. & Fothergill, M. (1996) J. Biol. Chem. 271, 10010–10016.
55. Fothergill, M., Goodman, M.F., Petruska, J. & Warshel, A. (1995) J. Am. Chem. Soc. 117, 11619–11627.
56. Hwang, J.K. & Warshel, A. (1996) J. Am. Chem. Soc. 118, 11745–11751.
57. Dunn, B.M. & Bruice, T.C. (1973) Adv. Enzymol. Relat. Areas Mol. Biol. 37, 1–60.
58. Fife, T.H., Jaffe, S.H. & Natarajan, R. (1991) J. Am. Chem. Soc. 113, 7646–7653.
59. Frisch, M.J., Trucks, G.W., Schlegel, H.B., Gill, P.M.W., Johnson, B.G., Robb, M.A., Cheeseman, J.R., Keith, T., Petersson, G.A., Montgomery, J.A., et al. (1995) Gaussian 94, Revision C.2 (Gaussian, Pittsburgh).
