Exploration of sequence space for protein engineering




Claes Gustafsson*, Sridhar Govindarajan and Robin Emig
Maxygen Inc., Galveston Drive 515, Redwood City, CA 94063, USA

The process of protein engineering is currently evolving towards a heuristic understanding of the sequence–function relationship. Improved DNA sequencing capacity, efficient protein function characterization and improved quality of data points, in conjunction with well-established statistical tools from other industries, are changing the protein engineering field. Algorithms capturing the heuristic sequence–function relationships will have a drastic impact on the field of protein engineering. In this review, several alternative approaches to quantitatively assess sequence space are discussed and the relatively few examples of wet-lab validation of statistical sequence–function characterization/correlation are described. Copyright © 2001 John Wiley & Sons, Ltd.

Keywords: directed evolution; neural network; genetic algorithm; NK model; folding landscape; in silico evolution

Received 11 June 2001; accepted 14 June 2001

INTRODUCTION

A major goal of biotechnology is to harness and engineer complex biological systems by means of manipulating naturally evolved DNA and protein sequences. The incomprehensibly vast size of the hyper-dimensional sequence space is a continuing challenge for protein engineers seeking to traverse this space in search of biological systems with new or improved functions. These problems are not unique to protein engineering. Many other areas are successfully dealing with datasets of similar size and complexity, and have been doing so for a significant time. Typical examples include marketing, manufacturing, database providers, government, the travel industry, the banking and financial industry, telecommunications and others. Examples can also be found within the biotechnology industry: pharmacogenomics, DNA arrays, HTP combinatorial-chemistry screening and clinical trials. In these and many other areas of interest, the amount of data is enormous, information is multi-dimensional and complex, and some or most of the data is incorrect or missing. Modeling these complex systems is a big challenge and any technology that reduces the noise in the system and makes better models using the vast reams of information collected will represent a competitive advantage. Development of generic data mining tools is a thriving industry today, encompassing more than 100 companies, ranging from IBM and the SAS Institute to Silicon Graphics and Oracle. Despite the major ongoing efforts in advanced data mining in all these fields, very little has been done using these approaches in protein engineering. The main reason for the limited development of multidimensional statistical tools for biopolymers has been the lack of large systematically varied datasets. We believe this is now changing.

DATA MINING IN BIOLOGICAL DATASETS

A plethora of tools have been developed to aid in identifying hidden patterns, predicting missing information and clustering samples according to information content. Tools such as neural networks, genetic algorithms, multiple linear regression and partial least squares have all been successfully applied to complex and large sets of data irrespective of type (numerical, free text, categorical). Classic examples include customer credit rating in the banking industry or gasoline formulation in the petroleum industry. Only a handful of examples exist in the literature of quantitative sequence-pattern analysis for DNA and proteins, where quantitative measures such as biological activity can be evaluated in the context of sequence.

The prerequisite for this type of analysis in protein engineering is to have data that relates biological activity to protein/DNA sequences. Such data must be generated from systematically varied biological macromolecules with defined characteristics of interest. Until recently, the cost of generating such libraries has been prohibitive. In the absence of valid datasets to perform statistical evaluation, scientists have embraced two alternative formats: rational design and directed evolution. The rational design approach is based on access to the complete information of a very defined and limited macromolecular interaction of interest, typically relating function to three-dimensional structure and making predictive changes accordingly. Directed evolution, in its many different formats, relies on a ‘blind’ search of sequence space and the iterative cycling of variation and selection until a desired solution is achieved. Both approaches have met with limited success. DNA shuffling was initially developed as a directed evolution method for searching high-quality sequence space accessed by iterative recombination of pre-selected functional diversity. The approach was able to overcome many of the limitations of both rational design and other directed evolution formats and has proven to be exceptionally successful in optimizing biological systems. Several reviews describing and comparing the alternative formats for protein design have been published (Tobin et al., 2000; Regenmortel, 2000).

JOURNAL OF MOLECULAR RECOGNITION
J. Mol. Recognit. 2001; 14: 308–314
DOI: 10.1002/jmr.543

Copyright © 2001 John Wiley & Sons, Ltd.

*Correspondence to: C. Gustafsson, Maxygen Inc., Galveston Drive 515, Redwood City, CA 94063, USA.
E-mail: [email protected]

Abbreviations used: PAM, point accepted mutation; HTP, high throughput; MIC, minimal inhibitory concentration.

Current advances in several different areas are now poised to enable the exploration of novel advanced quantitative database mining tools for protein engineering:

• The general mindset within biological research is rapidly moving away from pursuing one parameter at a time to attempting to understand and describe large and complex sets of parameters. The old dogma of ‘One gene, one graduate student’ is being replaced by ‘One genome, one graduate student’.

• The establishment of DNA shuffling and other combinatorial methods for exploring sequence space has facilitated the generation of large systematically varied libraries of samples. The systematically varied output of the recombination events is crucial for efficient searches of sequence space. The chimeric end products of a shuffling reaction are theoretically maximally distributed in sequence space, similar to what is produced in a library generated by experimental design.

• The cost of sequence characterization is continuously decreasing as a result of technology developed in the human genome sequencing projects. With the advent of modern capillary DNA sequencing instrumentation such as Applied Biosystems' 3700 and Amersham Pharmacia's MegaBACE, coupled with automated template preparation and efficient sequence management, it is now feasible to sequence 1000 clones or more from a library of protein variants. The cost and complexity will probably continue to decrease as the technology develops further.

• The cost of activity characterization of samples continues to decrease as many novel screening formats are developed through miniaturization and automation. Much of the development in screening technology is being driven by the pharmaceutical and combinatorial-chemistry industries' need for faster and cheaper evaluation of chemical libraries. The advent of proteomics as the next step in functional genomics has provided several methods to characterize proteins in high throughput.

• A multitude of well-established statistical techniques from other industries are available to analyze and use vast datasets. Many industries have already gone through the same evolution, in which accumulation of data increases exponentially, and have developed highly advanced tools to analyze the data.

This review attempts to provide a discussion of different formats for protein engineering using sequence space as a conceptual framework. This theoretical basis can be used to propose alternative solutions for how to navigate through the hyper-dimensions of protein sequence in order to identify the local or global optimum for any given function/trait. This review will discuss recent publications that have used experimental or theoretical approaches to reach an understanding of sequence space and to use that understanding to explain or predict experimental data.

CONCEPT OF SEQUENCE SPACE

John Maynard Smith first introduced the notion of sequence space for protein evolution (Smith, 1970). He compared protein sequence evolution to a word game: passing from one word to another through a series of intermediate words, each of which retains meaning (Fig. 1). Alphabets (available letters) correspond to amino acids, words (arrangements of those letters) correspond to proteins, and point mutations represent changes in the amino acid sequence of the protein with time. During evolution only changes that retain the function of the protein or that produce advantageous effects are retained, and others which are deleterious are eliminated. This selection is analogous to a word game in which the words have to be from an English dictionary; the evolutionary pressure in protein evolution is the requirement to retain function in every step. During protein evolution both neutral and advantageous mutations accumulate, so two sequences that share a common ancestor decrease in similarity with time. Smith proposed a ‘sequence space’ in which all possible proteins are arranged so that neighbors can be inter-converted by single point mutations.
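The word-game analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word list `WORDS` is a hypothetical toy dictionary, and breadth-first search is simply one convenient way to find a shortest chain of meaningful intermediates; it is not a claim about how evolution or any cited study searches the space.

```python
from collections import deque


def neighbors(word, dictionary):
    """Words in `dictionary` that differ from `word` at exactly one position."""
    return [
        w for w in dictionary
        if len(w) == len(word) and sum(a != b for a, b in zip(w, word)) == 1
    ]


def mutational_path(start, goal, dictionary):
    """Breadth-first search for a shortest chain of single-letter changes
    in which every intermediate is itself a dictionary word."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1], dictionary):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path of meaningful intermediates exists


# Hypothetical toy dictionary (the "functional" sequences).
WORDS = {"WORD", "WORE", "GORE", "GONE", "GENE", "WARD", "CORE"}
path = mutational_path("WORD", "GENE", WORDS)
```

Removing a single intermediate word from the dictionary can disconnect the start from the goal, which mirrors the requirement that function be retained at every step.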

[Figure 1]


In order to describe the evolutionary process mathematically, Sewall Wright (1932) introduced the concept of a fitness landscape while studying the roles of mutations, inbreeding and crossbreeding in evolution. He proposed that evolution could be considered as a walk in a ‘fitness landscape’ where the combinations of allele possibilities define the ‘sequence space’. Figure 2 represents his notion of sequence–function space. The marked regions correspond to functional proteins and the others to non-functional proteins. Mutations and recombinations that result in non-functional genes become dead ends, while the ones that retain function or produce advantageous changes are accepted. Evolution is then a walk along a network where all the sequences on the network meet selective pressures. Several researchers have extended these concepts and have studied in detail the nature of sequence space and the evolving protein in ‘fitness landscapes’ (Eigen, 1971; Kauffman and Levin, 1987; Kauffman and Weinberger, 1989; Schuster and Stadler, 1994; Govindarajan and Goldstein, 1997).

METRICS IN SEQUENCE SPACE

When defining any ‘space’, it is essential to describe the metric (distance measure) in that space. How do we measure the ‘distance’ between two sequences in a sequence space? The protein space concept introduced by Smith (1970), though simplistic, is well studied. Proteins are polymers, each protein consisting of a unique string with one of 20 possible monomers (residues) at each position in the string. Therefore, a protein of length N can be represented as a point in protein space, which has 20^N points. Each of the N residues in the protein can be mutated to any of the other 19 possibilities and, in this space, each point can be considered to have 19N neighbors. It is not possible to represent this hyper-dimensional protein space in two or three dimensions. Therefore we have to rely on effectively defining the space using consistent metrics in order to understand how to navigate this space. One commonly used distance metric is the ‘Hamming distance’, defined as the minimal number of amino acid changes required to go from sequence A to B (Smith, 1970). In evolutionary biologists' terms this is the parsimonious distance between A and B. This is analogous to the ‘as the crow flies’ distance in lay terms.
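The counting argument and the Hamming distance above are simple enough to state as code. The following is a minimal sketch; the function names are illustrative and not taken from any cited work, and the sequences are assumed to be pre-aligned and of equal length.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def sequence_space_size(length, alphabet_size=20):
    """Total number of points in sequence space: alphabet_size ** length."""
    return alphabet_size ** length


def neighbor_count(length, alphabet_size=20):
    """Number of single point mutants of one sequence: (alphabet_size - 1) * length."""
    return (alphabet_size - 1) * length
```

Even a pentapeptide already has 20^5 = 3,200,000 possible sequences but only 95 single-mutation neighbors, which illustrates how sparse any local exploration of the space is.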

Another metric, more often used in the field of molecular evolution, is the PAM (point accepted mutation) distance. This metric is a measure of the ‘evolutionary distance’ and was first formulated by Dayhoff and co-workers (Dayhoff and Eck, 1968). Dayhoff defined an accepted point mutation as the exchange of one amino acid for another, accepted by natural selection. Each amino acid has distinct physico-chemical properties, but some amino acids share aspects of these properties that make them more ‘similar’. This results in some mutations being accepted more frequently than others. Dayhoff and co-workers devised a matrix based on empirical sequence data for the probability of transitions between the amino acids. Two protein sequences can be compared and a score can be assigned using the Dayhoff matrix that reflects the probability that these two sequences share a common ancestor. This metric is very useful in studying protein evolution and in understanding the nature of selection pressure acting on sequences.

An intriguing visualization tool to display and quantitate the metrics of the multidimensionality of sequence space is derived from the neural network field, where self-organizing maps such as Kohonen maps and Sammon projections are often used for unsupervised nonlinear cluster analysis. Multiple alignments as well as domains and structural elements have successfully been incorporated into Kohonen maps (Hanke and Reich, 1996), and to a lesser extent Sammon projections (Agrafiotis, 1997). The advantage of this type of display and analysis tool for the subsequent application of statistical tools for analysis of function is that the metric is relational and inherent to the correlation of the displayed entities. The amino acids themselves can also be captured in a number of different ways. There are at least a dozen published scoring matrices for amino acids based on genetic codes, physico-chemical properties, observed frequency of mutations, secondary structural matching and structural properties (Johnson and Overington, 1993). Various attempts have been made to cluster amino acids based on these matrices, including dendrograms (Sneath, 1966), Venn diagrams (Taylor, 1986), principal components analysis (Sandberg et al., 1998), multi-dimensional projection (Taylor and Jones, 1993), Sammon's non-linear mapping (Agrafiotis, 1997) and artificial neural network based approaches (Wang et al., 1998).

[Figure 2]

A ‘tree’ diagram is traditionally the default visualization tool used to represent relationships between a group of proteins. The tree is usually a binary tree and reflects the evolutionary relationship between the sequences. This tree, although not strictly a phylogenetic tree, is normally based on the PAM distance between sequences, and the topology indicates the most recent common ancestry between any pair of sequences. This is a useful way of representing a large number of sequences in sequence space but is restricted to sequences derived by point mutations. Chimeric sequences resulting from recombination events rather than sequential point mutation are not captured by phylogenetic-like trees for protein sequences and thus cannot be meaningfully displayed in this way (Schierup and Hein, 2000). Increasing sequence data from the completed genomes of prokaryotes and archaebacteria indicates that inter- and intra-species horizontal gene transfer contributes significantly to protein evolution in nature (Levin and Bergstrom, 2000; Ochman and Moran, 2001). This suggests that, even in the study of naturally occurring proteins, recombination as an evolutionary mechanism may limit the relational information between species and genes that can be extracted using phylogenetic trees. In the laboratory, combinatorial recombination-based directed evolution schemes such as DNA shuffling, where each progeny is derived from an abundance of parental sequences, completely obliterate the analytical value of a phylogenetic tree to display and quantitate the sequence space metrics.

EVOLVING IN COMPLEX LANDSCAPE

In the previous sections we reviewed the concept of sequence space. Our main motivation for defining sequence space is to understand how nature evolves protein sequences and how we can apply molecular evolution to artificially evolve genes of interest. Which areas of sequence space are functional? This information would be very valuable if it allowed us to make precise sequence–function correlations. The activity of any protein is a function of its primary nucleotide sequence. The primary sequence encodes all information that is necessary to perform the given function. Structure, alternative splicing, post-translational modification etc. are all attributes of the primary sequence, either directly or indirectly. Our goal then is to map the sequence and its corresponding activity so that we can alter the sequence in a preferred direction. Unfortunately, the nature of the landscape is not simple. The fitness landscape is not one smooth peak in sequence space, which can be climbed gradually by simple hill-climbing approaches. The sequence space is enormous and dense and is impossible to explore exhaustively. The underlying landscape can be described by the degree of correlation between the variables defining the sequence (amino acids) and the function (activity). A completely uncorrelated landscape is impossible to explore except by complete sampling. It is unlikely that protein sequences occupy such a chaotic landscape. We know, for example, from the empirically derived Dayhoff relationships that there is some systematic tolerance of proteins for amino acid substitutions. Equally, the difficulties and frequent failures of structure-based protein engineering tell us that proteins do not occupy a simply explored, completely correlated landscape. Such a simple landscape would assume that most variables are either invariant or irrelevant. Stuart Kauffman proposed an intermediate model, ‘the NK model’ (Kauffman and Weinberger, 1989), to describe any complex landscape, where the complexity is a function of the ‘dependency’ of the variables that describe the space, and the constant K is somewhere between 0 (completely correlated) and N − 1 (completely uncorrelated). His studies on adaptive maturation of the immune response show correlation, and he has since extended his theories to include other molecular systems, populations, and social and economic systems (Kauffman, 1995).

By modeling fitness landscapes on an NK model, we can assume that sequence space within biological systems is tractable, as is the case in other massive and complex datasets. We can therefore explore the relationships between systematically varied proteins. By using such metrics as Hamming distance, Dayhoff matrix, PAM distance or similar, it should be possible to capture correlations between function and sequence. Modeling of the heuristically derived sequence–function correlation data using any of a number of statistical tools can subsequently be done. These models can be tested experimentally. By identifying an empirically proven algorithm correlating sequence with function for a given protein trait, the protein optimization problem would be transformed from a deterministic rational design problem, in which detailed understanding of function is critical, to a heuristic rational design problem, obviating the need for preconceived mechanistic models.
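The effect of K on landscape ruggedness can be illustrated with a small simulation. The sketch below makes several simplifying assumptions that are not from the text: a binary alphabet instead of 20 amino acids, each site interacting with the next K sites (wrapping around), and uniformly random contribution tables. The names `make_nk_landscape` and `count_local_optima` are illustrative.

```python
import itertools
import random


def make_nk_landscape(n, k, seed=0):
    """Return a fitness function for a random NK landscape over binary genomes.

    Assumption: site i depends on itself and the next k sites (cyclically).
    """
    rng = random.Random(seed)
    neighbors = [[(i + j) % n for j in range(k + 1)] for i in range(n)]
    # One random contribution per site and per state of its k + 1 inputs.
    tables = [
        {states: rng.random() for states in itertools.product((0, 1), repeat=k + 1)}
        for _ in range(n)
    ]

    def fitness(genome):
        total = sum(
            tables[i][tuple(genome[j] for j in neighbors[i])] for i in range(n)
        )
        return total / n

    return fitness


def count_local_optima(n, fitness):
    """Count genomes that no single point mutation can improve."""
    optima = 0
    for genome in itertools.product((0, 1), repeat=n):
        mutants = (genome[:i] + (1 - genome[i],) + genome[i + 1:] for i in range(n))
        if all(fitness(m) <= fitness(genome) for m in mutants):
            optima += 1
    return optima


smooth = count_local_optima(8, make_nk_landscape(8, 0))  # K = 0: additive
rugged = count_local_optima(8, make_nk_landscape(8, 7))  # K = N - 1: uncorrelated
```

With K = 0 every site contributes independently, so hill climbing always reaches the single peak; as K approaches N − 1 the number of local optima typically grows and simple hill climbing tends to get trapped.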

QUANTITATIVE SEQUENCE–ACTIVITY MODELING ON RESTRICTED SEQUENCE SPACE

Biological sequences (DNA, RNA or proteins) carry two complementary pieces of information, one related to the absence of variation (conserved evolution), and the other based on systematic variance (sequence–activity relationship). Traditionally DNA or protein sequence is analyzed by qualitative pattern recognition; classification is through homology (relatedness). This is routinely done by aligning all sequences presumed to be related and identifying patterns that are shared among all or most of the related sequences. These common motifs can then be used to determine qualitatively whether a new sequence belongs to the group or not. Sequence homology, however, does not carry any quantitative functional information. For example, just because a DNA sequence is identical to the consensus of E. coli transcriptional promoters does not imply it is the strongest promoter in the group. It is much more likely to be a promoter of average strength, since equal weight is given to sequence information from promoters of all strengths present in the set.
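The point that a consensus ignores activity can be seen directly in how a consensus is computed. The following minimal sketch takes a majority vote per alignment column; the three short sequences in the usage line are hypothetical and stand in for an aligned promoter set.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Most frequent symbol at each column of equal-length aligned sequences.

    Every sequence gets one vote per column, regardless of measured activity.
    """
    lengths = {len(s) for s in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_sequences)
    )


# Hypothetical aligned fragments; no activity data enters the calculation.
motif = consensus(["TATAAT", "TATGAT", "TACAAT"])
```

Because no activity value appears anywhere in the calculation, a strong and a weak promoter influence the result equally, which is why the consensus tends towards average rather than maximal strength.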

Multivariate characterization of biological macromolecules was first done on small synthetic penta-peptides, where the model generated through principal component analysis and partial least squares was used to make predictions within the defined sequence space (Hellberg et al., 1986). The dataset consisted of 15 bradykinin potentiating peptides with modifications in all five positions (Ufkes et al., 1978). Hellberg et al. identified three descriptors for each amino acid, derived from the principal components of 20 properties for all 20 coded amino acids. A matrix consisting of 15 rows (peptides) and 15 columns (5 amino acids × 3 descriptors) was built and a partial least squares model generated using the 15 peptides as a training set. Subsequent work by Ufkes et al. identified an additional 15 active and one inactive penta-peptide in the same system (Ufkes et al., 1982). The model derived from the initial 15 peptides demonstrated significant predictive capability when predicted vs observed activity was compared for the second set of 16 peptides.

In a more recent example, a 15-residue antibacterial peptide was optimized by designing a training set comprising 60 analogs generated through D-optimal design (Mee et al., 1997). The amino acid descriptors used were very similar to those derived in the bradykinin example. A partial least squares model derived from the measured activity of the training set was applied to identify a set of 39 peptides with predicted high antibacterial potency. Within this set, the most potent analog was experimentally shown to have a mean MIC value on a panel of 24 bacteria of approximately half that of the initial lead 15-mer peptide, i.e. it was twice as potent as the lead molecule. Seven of the 15 amino acids were changed in the optimized peptide.

Using similar methods, DNA sequence–activity relationships can be modeled as well. In a paper by Jonsson et al., the partial least squares approach was applied to a set of E. coli transcriptional promoters (Jonsson et al., 1993). A set of 25 related promoters (here defined as a 68 base-pair fragment) had been generated and experimentally characterized in a series of papers from the laboratory of Bujard and co-workers (Brunner and Bujard, 1987; Knaus and Bujard, 1988; Lanzer and Bujard, 1988). The published data was used to build a partial least squares model capturing 85% of the information in the data. Jonsson and co-workers subsequently measured the transcriptional activity of six representatives of the original 25 promoters in a defined system and used this data in a partial least squares model to predict two improved promoters extrapolated from the training set. These two promoters were synthesized and measured in the same system. Both were determined to be roughly 1.5-fold stronger than the best promoter present in the reference set. The model here used the qualitative nucleotide assignments as descriptors instead of the physico-chemical properties used in the peptide examples above.
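Using qualitative nucleotide assignments as descriptors amounts to turning each sequence position into indicator variables. The sketch below shows one common way to do this (one-hot encoding); it is an assumption that this matches the encoding used by Jonsson et al., and the function name `one_hot` is illustrative. The resulting rows could then be passed to any regression method, such as partial least squares.

```python
NUCLEOTIDES = "ACGT"


def one_hot(sequence):
    """Encode a DNA sequence as a flat list of 0/1 indicators, four per position."""
    row = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected symbol: {base!r}")
        row.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return row


# One row of the descriptor matrix for a two-base fragment.
example_row = one_hot("AC")
```

Under this encoding a 68 base-pair fragment becomes a row of 68 × 4 = 272 descriptor columns, far more columns than the 25 promoters available, which is one reason a latent-variable method such as partial least squares is a natural fit for this kind of data.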

NEURAL NETWORKS AND GENETIC ALGORITHMS

Neural networks are often described as the second best way to solve any problem; the best way being to know the solution itself. As the typical neural network is based on recognizing taught events and learning by experience, it is an attractive tool to apply to heuristically derived sequence–function relationships. One of the first examples of de novo protein design using neural networks was a simple genetic algorithm that performed point mutations, but not crossovers (recombinations). A modular neural network system employing physico-chemical sequence descriptions was trained to recognize characteristic peptidase I cleavage site features. The network was used to simulate recognition of a cleavage site in the evolved peptide. This was done by selecting a sample peptide, making random variations, screening the resultant library using the training-derived fitness function to identify the best variants for the next round and repeating the process (Schneider and Wrede, 1994). Although the method has been the subject of controversy (Darius and Rojas, 1994), Paul Wrede and co-workers have since followed up the predictive studies with experimental validation using the peptidase I cleavage site as a model (Wrede et al., 1998). They tested two predicted sequences from the neural network/evolutionary search. In both cases the sequences had measured activities comparable to both the predicted activity and the reference (known) sequence activity. Notably, the neural network/evolutionary search started with random sequences and only used the trained neural network as a guide to find the final best sequences. In a parallel study a neural network was trained on the binding of a test set of 90 decamer peptides. The neural network could subsequently be used to predict peptides with better activity than the best in the training set, which was confirmed experimentally (Schneider et al., 1998).
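The mutate, screen and repeat cycle described above can be sketched as a mutation-only search. This is a simplified illustration, not the published method: `toy_fitness` is a hypothetical stand-in for a trained neural network, the target string is arbitrary, and the library size and number of rounds are made-up parameters.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_fitness(peptide, target="MKTAYIAKQR"):
    """Hypothetical scoring function: fraction of positions matching a target.
    In the studies above this role is played by a trained neural network."""
    return sum(a == b for a, b in zip(peptide, target)) / len(target)


def point_mutant(peptide, rng):
    """Replace one randomly chosen position with a random amino acid."""
    pos = rng.randrange(len(peptide))
    return peptide[:pos] + rng.choice(AMINO_ACIDS) + peptide[pos + 1:]


def evolve(start, fitness, rounds=100, library_size=50, seed=0):
    """Point mutations only, no crossover: keep the best variant of each round."""
    rng = random.Random(seed)
    best = start
    for _ in range(rounds):
        library = [point_mutant(best, rng) for _ in range(library_size)]
        candidate = max(library, key=fitness)
        if fitness(candidate) >= fitness(best):
            best = candidate
    return best


rng = random.Random(42)
random_start = "".join(rng.choice(AMINO_ACIDS) for _ in range(10))
evolved = evolve(random_start, toy_fitness)
```

Because a variant is only accepted when it scores at least as well as its parent, the score never decreases from round to round; the quality of the final sequence depends entirely on how well the scoring function reflects real activity.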

Though the Wrede/Schneider method does ‘evolve’ a peptide, rather than search through all of sequence space, genetic algorithms with crossovers are more efficient at searching complex landscapes and are well suited to this particular problem. A genetic algorithm based approach was pursued in a pioneering study by Patel et al. to identify antibacterial peptides (Patel et al., 1998). They used genetic algorithms and physico-chemical criteria for the final peptide to meet. Patel et al. followed the flowchart (Fig. 3) using neural networks and genetic algorithms to design novel bactericidal peptides approximately 17 amino acids long, based on the input from 29 known measured peptides. They also compared genetic algorithms, Monte Carlo (similar to the initial Wrede/Schneider method) and a random method for searching sequence space. They found that genetic algorithms were an order of magnitude more effective than Monte Carlo as measured by identified predicted hits. Monte Carlo in turn was far better than a random search. After running the in silico search, they chose five out of 400 positive sequences by analyzing their principal components and maximizing for diversity. All five peptides were measured and found to have activities near the predicted values.

DNA SHUFFLING IS A GENETIC ALGORITHM

DNA shuffling has recently become popular due to its ability to rapidly evolve genes in a desired functional direction. The idea of recombination between genotypes and selection of the fittest individuals has been around since the prehistoric dawn of agriculture. However, it was not until 1975 that the underlying search algorithm of breeding was understood in the context of an optimization process in high-dimensional sequence space. That year John Holland’s ‘Adaptation in Natural and Artificial Systems’ outlined the concept of the genetic algorithm (Holland, 1975). Holland showed computationally that this model of searching complex sequence–function space was not only very efficient but also very robust. One can create thousands of variants on the computer, score them, select the best variants and iterate in minutes. This speed, combined with the robustness of problem solving that genetic algorithms offer, has made the method very popular for searching the sequence space of many problems. Genetic algorithms have proven effective in a multitude of very different computational optimization problems. With the advent of DNA shuffling (Stemmer, 1994), genetic algorithms came back to genetics. DNA shuffling uses the power of recombination to enable the synthesis of a vast number of variants related in sequence and function. These variants are screened, and the best individuals are selected and used as parents to create the next set of variants. The scoring function used in computational genetic algorithms is here replaced by measured protein activity. DNA shuffling is thus a genetic algorithm performed in a wet-lab environment. Since genetic algorithms have been shown to be very robust at searching complex function landscapes, it follows that DNA shuffling is as well. It can be used to search through the function–sequence space of DNA, proteins or other ordered polymer-derived molecules to find the local or global optimum that fulfills the criteria needed to reach the functional goal.
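The cycle of screening, selection and recombination can be written down as a short computational genetic algorithm. The following is a sketch under stated assumptions, not a model of the shuffling chemistry: single-point crossover is a simplification (shuffled genes typically carry several crossovers), the names `crossover`, `mutate`, `genetic_algorithm`, `toy_score` and `TARGET` are hypothetical, and `toy_score` stands in for what would be a measured protein activity in the wet lab.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TARGET = "MKTAYIAKQRQISFVK"  # hypothetical optimum, used only by the toy scorer


def toy_score(seq):
    """Stand-in for an assay readout: number of positions matching TARGET."""
    return sum(a == b for a, b in zip(seq, TARGET))


def crossover(parent_a, parent_b, rng):
    """Single-point recombination of two equal-length sequences."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(seq, rate, rng):
    """Resample each position independently with probability `rate`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq)


def genetic_algorithm(population, score, generations=40, n_parents=10, rate=0.02, rng=None):
    """Screen, select the best variants as parents, recombine, and iterate."""
    rng = rng or random.Random()
    size = len(population)
    for _ in range(generations):
        # Screen the library and keep the top-scoring variants as parents.
        parents = sorted(population, key=score, reverse=True)[:n_parents]
        children = []
        while len(children) < size - n_parents:
            a, b = rng.sample(parents, 2)  # two distinct parents per child
            children.append(mutate(crossover(a, b, rng), rate, rng))
        population = parents + children  # parents carry over unchanged
    return max(population, key=score)


rng = random.Random(1)
initial = ["".join(rng.choice(AMINO_ACIDS) for _ in TARGET) for _ in range(100)]
best = genetic_algorithm(initial, toy_score, rng=rng)
print(best, toy_score(best))
```

Passing the scoring function as an argument mirrors the point made in the text: the search procedure stays the same whether the score comes from a computed function or from a measured activity.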

Acknowledgements

We would like to thank Dr Jeremy Minshull for linguistic improvement of the manuscript.

REFERENCES

���%��� &'� ())*� � ��� ����� ��� ���+��� �������� ����� ���������� �� �� "���� ��� ��������� �,�! ,-*.,)/�

0������ # � 0�1� 2� ()-*� 3������� ����������� � �������� ������� �� ��� ��������� ���� ����� �� � �! /(/)./(44�

&��� 5 � 6�1 6� ())4� 7"������ �������� ���������8 ������������������ ������9 ������ � ��:�! ,(,;.,(,,�

&����� #< � ��= 5>� ()?-� � �� �� �� ����������� ����� ���������� ����� ������� ������� ����� �! //.4(�

����� #� ()*(� "��� �����+���� �� ����� � ��� ��������� ����������� ������������� ������������������ ��(;�!4?:.:,/�

@���� �1� " � @�� ���� 6�� ())*� ��������� �� �� ��������� �� ��� ������ �� ���� �������� ��4�! 4?(.4??�

2�=� A � 6���� A@� ())?� '�������� �����+���� ������� ��� ���� �� ������� � �����! �������� ��������� ���� � ������ �� ���� �� ��������� ���������� ���� ��?�! 44*.4:4�

2������� "� "1�B ���B� # � C�� "� ()-?� ��� ��� ������ ��

�� �=���� ����������� ������� �� ��������� �� �� ������� �� ����� � ��������� ��������.������� �������������� ���� ���� ��! (/:.(4;�

2���� A2� ()*:� ���������� �� ������� ��� �������� ����������� D�������� �� #������! #�������

A����� #" � <��������� A3� ())/� � �������� �� ���� ����� ��������� �� �������� �� ������ ����� �������� � ��� ��� ���4�! *(?.*/-�

A���� A� E������ �� ����� F� @����� � � C�� "�())/� G��������� � ������������� �� �� �G"�#�H ������� � ����� ����� ����� ���� �� �! *//.*/)�

'����� "� ()):� �� !��� �� ��� "��#����� <���� D��������3��! E�� I��=�

'����� " � F���� "� ()-*� ���� ������ ������ �� ����� ��= �� ����� �� ���� � $���� ��� ���(�!((.4:�

'����� "� � C��������� �&� ()-)� ��� E'�� �� �� ����� %��� �� ��� � �� ��������� �� �������� �� ��������� ������� � $���� ��� ����,�! ,((.,4:�

'�� 6 � 0�1� 2� ()--� 3F �� �������� ��� ! �

������ �� ��� ����������� �������� ���� �� ����� ������= ���������� ��������� ���� ���� ��� �������� ��� ���� ������� ����%��� ���� �� �� �������� � �� ����������� �������� ���� ��%��� ��� �������� ���� �� #���� ���� ��������� �������� ����


��������� ������� ��� � ��%����� ��������� �� � � �!,)().,),/�

F�+�� # � 0�1� 2� ()--� 3������� ������ �������� �����%������ �� ������� ������ ��� ���� ��� �� "�� � !-)*/.-)**�

F���� 06 � 0������� ��� ,;;;� 0����� �� ��������!���������� �������������� ���������� � ����������� ��� ������� �� ����� ��������� �� ���=��������� ���� ��� �� "�� ��! ?)-(.?)-:�

#�� 63� ����� �6 � #���� 3A� ())*� &���� �� ����������� �� (:���� �� ����� � ���� &������� �����G"�6 � ����������� ���� ��������� � ���� �� ��!-).(;,�

<���� 2 � #��� E�� ,;;(� @��� ��� � ���� ���� !��������� �� ������� ��������� � ������� �������::()�! (;)?.(;))�

3��� "� "���� J3� 0�=��# � ������ 3� ())-� 3������� ��������� ����� ����� �� � ������ ����� ��� %�� ��?�! :4/.::?�

"� ���� #� ���=�� F� A���� A� "1�B ���B� # � C�� "�())-� E�� ������� �������� ������� ��� ��� ���� ������������� ����� ����� �� � ���������� ��������+������ -* ���� �� � � ��� ���� ��! ,4-(.,4)(�

"������� #2 � 2��� A� ,;;;� ���� ����� �� �������������� �� ������ ������������ ����� &������ � �,�! -*).-)(�

"����� �� @ � C�� � 3� ())4� ��� ������ ���� �� ������ � ����� �� ���%��� ����� ������= � ������ �������� ���������! � ���� ���� �� � � ���+� �� ������� � ������ ���� ������ � �, 3� (�! //:./44�

"����� �� @� "���� � C� C���=� @� #����� A� E��� ��6�����= C� C�� � 3 � '��+� 6� ())-� 3���� � ������ ���%��� ����� ������= � ����������� ������������ ����� ��� ���� ��� �� "�� � �,(�! (,(*).(,(-4�

"������ 3 � "� ��� 35� ())4� F� ���! ������� ������+����� ������� � ���������� ��������� ������ ����

���/�! ,):./,4�"���� A#� ()*;� E���� �������� � ��� ������� �� �������

���� ������ �,/,�! :?/.:?4�"���� 32� ()??� 6������ ������� ������� �������� �

��������� ������� �� ����� �� � $���� ��� ��,�! (:*.():�"������ C3� ())4� 6�� ��������� �� ������� �� ����� �� &E�

���K���� ������ ����?4--�! /-)./)(������ C6� ()-?� J ����%����� �� ������� � ����� �������� ��

������ ������� ��������� � ��� ��� ����,�! ,//.,:-������ C6 � A��� &�� ())/� &������� � ���� �� �����

������ � $���� ��� ���(�! ?:.-/������ #0� @����� � � 2���� @C� ,;;;� &������

���������! ��� L������M �� ��� L��������M ����� �������� ����� ��� ��! 4,(.4,*�

D�=� A@� >��� 0A� 2����� @ � >� �� #��� �� ()*-�"���������������� ���������� �� �� �=���� ���������������� �� ��� � ������ ��,�! (().(,,�

D�=� A@� >��� 0A� 2����� @� C���� 2A � >� �� #��� ��()-,� 5������ �� �� �� ��� ���������������� ���������� ���� �=���������������� ����� �� ��� � ������ ��! (::.(:-�

�� 6���������� #2� ,;;;� ��� ����� ��� ������ ������������� ��� ��������� ����������� ����� ��������!������ ���� � �������� ��������9 � ��� ��'����(�! (.4�

C�� 2�� &��+� A �� �� ())-� "���������+��� ���� �������������= ��� �������� ���� �� � ������������ ���4�!/*?./**�

C�� � 3� F� � <� '��� "� 5���� �� 2�� D � "����� �� @�())-� 3���� � ���� � � �� ����� ������=! ���������������� �� ���%��� ���� ����� � J ������ ���� ��(�������� ��! /:--./:)/�

C����� "� ()/,� ��� ���� �� �������� ������ ���� ������� ���� �������� �� ���������� ��������'� �� )�� *��������������������� �� &������� >��� (N /:?./??�
