Putting engineering back into protein engineering: bioinformatic approaches to catalyst design

Claes Gustafsson*, Sridhar Govindarajan and Jeremy Minshull

Complex multivariate engineering problems are commonplace and not unique to protein engineering. Mathematical and data-mining tools developed in other fields of engineering have now been applied to analyze sequence–activity relationships of peptides and proteins and to assist in the design of proteins and peptides with specified properties. Decreasing costs of DNA sequencing in conjunction with methods to quickly synthesize statistically representative sets of proteins allow modern heuristic statistics to be applied to protein engineering. This provides an alternative approach to expensive assays or unreliable high-throughput surrogate screens.

Addresses

DNA 2.0, Inc., 1455 Adams Drive, Menlo Park, CA 94025, USA

*e-mail: [email protected]

Current Opinion in Biotechnology 2003, 14:366–370

This review comes from a themed issue on

Protein technologies and commercial enzymes

Edited by Gjalt Huisman and Stephen Sligar

0958-1669/$ – see front matter

© 2003 Elsevier Ltd. All rights reserved.

DOI 10.1016/S0958-1669(03)00101-0

Abbreviations

NK1 neurokinin 1

NP non-polynomial

PLS partial least squares

Introduction

Protein engineering has classically been approached from two diametrically opposed directions: rational design and directed evolution. Rational design, in the tradition of Descartes and Leibniz, attempts to understand protein structure and function at a complete mechanistic level so that any desired change can be effected by calculation from first principles. Directed evolution, in the tradition of John Locke and other empiricists, attempts to find a desired solution by testing many different variants, typically using various evolution-based algorithms. Both rational design and directed evolution in their many alternative formats have shortcomings and advantages that have been discussed and compared elsewhere [1–3].

Modern heuristics applied to protein engineering is a synthesis of empirical data and a rational analysis of that information. The very first paper describing chemical synthesis of a gene proposed that systematic variation of amino acids would enable an understanding of the relationships between the sequence of a protein and its structure, physical behavior and activity [4]. Soon after that, Svante Wold’s group developed and applied multivariate data analysis techniques to peptide design and suggested that ‘the rapid development of protein engineering may then make it possible to produce designed sets of mature proteins and enzymes for QSAR studies’ [5,6]. This review summarizes recent publications in which modern heuristics have been applied to protein engineering and describes technological advances that are enabling Wold’s vision.

Protein optimization from an engineering perspective

When faced with solving a difficult problem it can be enlightening to see if a similar type of problem has been solved before. Many disciplines and industries face the same challenges of high system complexity and abundant variables that confront protein engineering [7]. In some industries increasing complexity is intentional, as in the addition of new control parameters for a car’s combustion engine. Sometimes it is inherent to the system itself, for example, in clinical drug trials. The common challenge in car manufacturing, clinical trials and protein engineering is to account for as much of this complexity as possible when describing the relationship between input variables (e.g. piston angle and temperature for car engines, age and medical history for patients or amino acid residues available at each position for protein engineering [8]) and output variables (e.g. exhaust levels and fuel efficiency for cars, side effects and survival rate for patients or the desired commercial properties such as catalytic activity, thermostability, substrate specificity and immunogenicity for protein engineering). Measured output variables may in turn result from combinations of properties that are not explicitly measured; for protein engineering, these may include expression levels and protein solubility [9]. Like small-molecule quantitative structure–activity relationships (QSAR), which have enjoyed much success in pharmaceutical development, heuristic protein engineering aims to identify the relationship between input and output variables to create biological macromolecules with defined properties. For reasons described below, more work has been published optimizing peptides than proteins using engineering concepts. We therefore use peptide examples to describe some of the principles before describing how the same engineering tools are used to optimize proteins.

Navigating in protein sequence space

Protein engineering can be divided into two subtasks: defining the solution space and defining the search algorithm.

Define the solution space

The total possible number of proteins encoded by a 1 kb gene is 20^333 (20 alternative amino acids at each position in a string of 333 residues), or roughly 10^430. This is an unfeasibly large number of variants to screen. Fortunately, not all possible sequences need be considered, as naturally occurring proteins can usually be relied on to provide a starting point for engineering efforts. Active point-mutants [10], phylogenetic substitutions [11••], structural modeling [12,13] and known immunogenic constraints [14] are well-explored methods of targeting specific regions of a protein for change.
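
As a quick check on the size of this space, the count can be worked through on a log scale. The sketch below assumes only the figures quoted in the text (20 amino acids, 333 positions); the constant and variable names are illustrative.

```python
import math

ALPHABET_SIZE = 20   # alternative amino acids per position (from the text)
N_RESIDUES = 333     # residues encoded by a 1 kb gene (from the text)

# 20**333 is too large to read comfortably, so work with its base-10 logarithm.
log10_variants = N_RESIDUES * math.log10(ALPHABET_SIZE)

print(f"20^{N_RESIDUES} is about 10^{log10_variants:.0f}")
```

The exponent comes out a little above 433, in line with the rounded order of magnitude quoted in the text; either way, exhaustive screening is out of reach.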

Define the search algorithm

Protein engineering is a non-polynomial (NP)-complete problem [15,16], meaning that the problem scales non-polynomially with increasing complexity and no known algorithm can guarantee determining the optimal solution without evaluating all possible solutions. Empirical protein engineers have largely limited themselves to addressing the NP-complete problem with exhaustive searches using ultra-high-throughput phage and ribosome display screens [17,18] or evolutionary methods [1–3,19]. By contrast, the wider engineering community has exploited genetic algorithms as well as regression-based algorithms, neural nets, clustering, and several other tools as alternative techniques to address NP-complete problems [20].

Statistical targeting of amino acid changes

Comparisons of natural protein and DNA sequences, particularly those using the powerful technique of principal component analysis, can be used to identify residues that are important for specific functionality within a protein [21,22•,23•,24,25]. Natural substitution patterns can also be used to infer which changes are likely to be acceptable within functional proteins. For example, a recent study of subtilisin variants found that all 52 of the amino acid variations found in 15 homologs were active within the context of at least one backbone; their incorporation produced proteases with varying catalytic properties [26•]. In another set of experiments, all of the active-site residues from one fungal phytase were replaced with those from another; again, the result was an active protein with altered catalytic properties [11••]. By incorporating small numbers of changes identified from alignments of naturally occurring sequences, it has also been possible to increase the thermostability of a fungal phytase by over 30°C [27]. Substitution matrices derived from synonymous and non-synonymous substitution rates can also be used to choose reasonable amino acid changes if there is insufficient phylogenetic data to use sequence alignments [28–30,31•].
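
Reading candidate substitutions off an alignment of homologs can be sketched in a few lines. The example below is only an illustration: the six-residue sequences are invented toy data, and the function names `column_variation` and `candidate_substitutions` are hypothetical, not taken from any of the cited studies.

```python
from collections import Counter

def column_variation(alignment):
    """Return one Counter per alignment column with the residues seen there."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    return [Counter(seq[i] for seq in alignment) for i in range(length)]

def candidate_substitutions(reference, alignment):
    """List (position, reference residue, alternative) observed in homologs."""
    subs = []
    for pos, counts in enumerate(column_variation(alignment), start=1):
        ref_residue = reference[pos - 1]
        for residue in sorted(counts):
            # Skip the reference residue itself and alignment gaps.
            if residue != ref_residue and residue != "-":
                subs.append((pos, ref_residue, residue))
    return subs

# Hypothetical toy alignment; the first sequence is treated as the backbone.
homologs = ["MKTAYV", "MKSAYV", "MRTAFV", "MKTAYI"]
subs = candidate_substitutions(homologs[0], homologs)

for pos, ref_residue, alt in subs:
    print(f"{ref_residue}{pos}{alt}")
```

Each reported change is one that nature has already accepted in at least one homolog, which is the rationale the text gives for expecting such substitutions to be tolerated.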

Multivariate design of improved polypeptides

Figure 1 shows a procedure for peptide optimization derived from the one used by Norinder et al. [32] to design analogs of the neuropeptide substance P with increased affinity for the neurokinin 1 (NK1) receptor. These authors used partial least squares (PLS) regression [33,34] to correlate the sequences of 36 substance P analogs with their activities. They used this model to identify the positions and amino acid properties in substance P that had the largest effects on NK1 binding. The authors designed, synthesized and tested six new peptides that the model predicted to be improved NK1 binders. All six were shown to be highly active. Their sequence–activity data were added to those of the first 36 peptides to build a second-generation PLS model, which was used to design a further three variants. One of these had an IC50 of 5 pM, 300-fold better than the wild-type peptide and 45-fold better than the best of the original 36 variants [32]. It is striking that an extremely small number of variants (45) was made and tested to achieve very significant improvements in the desired function.
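
To make the modeling step concrete, the following is a minimal sketch of a single-component PLS fit (PLS1) in plain Python. It is not the model of Norinder et al.: the two-position variants, the −1/+1 descriptor coding and the activity values are all invented for illustration, and the function names are hypothetical.

```python
import math

def pls1_one_component(X, y):
    """Fit a one-component PLS1 model; returns (column means, y mean, coefficients)."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximal covariance between descriptors and activity.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: projection of each variant onto the weight vector.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regress activity on the scores, then express the fit per descriptor.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return col_means, y_mean, [q * wj for wj in w]

def predict(model, x):
    """Predict the activity of one variant from its descriptor vector."""
    col_means, y_mean, coefs = model
    return y_mean + sum(c * (xj - m) for c, xj, m in zip(coefs, x, col_means))

# Hypothetical data: two positions, each coded -1 or +1 for an amino acid property.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [3.5, 2.5, -0.5, -1.5]

model = pls1_one_component(X, y)
coefs = model[2]
ranked_positions = sorted(range(len(coefs)), key=lambda j: -abs(coefs[j]))
```

Ranking positions by the size of their coefficients mirrors the step in which the model is used to identify the positions with the largest effects; real studies use many more descriptors per position, several components and cross-validation.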

The same techniques have also been applied to proteins. In one particularly informative example, Bucht and colleagues optimized a complex protein phenotype: the activity of human acetylcholinesterase expressed on the surface of COS-1 cells. Display of acetylcholinesterase on the cell surface occurs as a result of glycosyl phosphatidylinositol modification at the C terminus of the protein. The authors identified two amino acids in the signal peptide region of the protein, the identity of which affected cell-surface localization of the protein. They synthesized eight variant genes, tested the surface expression of the eight encoded proteins and used PLS to model the sequence–activity relationship. The authors then constructed an additional 27 variants in this same region of the protein, using them to test and refine the model, thereby identifying the optimal sequence for cell-surface expression of acetylcholinesterase [35••]. Modeling sequence–activity relationships to identify optimal protein variants has not been limited to amino acids localized to a small region of a protein. Statistical analysis of mutations distributed throughout several enzymes has been used to identify the contributions of those changes to function of the protein [36] and to predict the sequence with best function [37]. Mathematical sequence–activity modeling has thus been validated at many scales of complexity: from small molecules to peptides to localized regions of proteins to changes spread throughout entire proteins.

Figure 1. Polypeptide optimization using mathematical models. The process is that used by Norinder et al. [32] for the optimization of the neuropeptide substance P. Steps in the cycle: (1) create an initial set of variants and measure the desired phenotype; (2) build a sequence–activity model; (3) design new variants based on model predictions for high-performing sequences; (4) synthesize and test the new variants; (5) add the new data to refine the sequence–activity model and repeat.

Although there is a growing body of work in which sequence–activity relationships are used to design improved peptides [5,6,38,39], application of the same methods to protein/biocatalyst engineering is still in its infancy. One reason for this has been the difficulty in producing large numbers of modified molecules [40•]; in contrast to peptides, proteins cannot easily be synthesized directly. As technology improves, the synthesis of individually designed genes becomes increasingly cost-effective [41•,42]. Testing variants taken from libraries that are even cheaper to produce is also likely to produce useful sequence–activity relationships [43•].

Experimental design of maximally informative datasets

Another useful statistical tool with its origins in other engineering disciplines is that of experimental design. This is a technique by which a variant set is designed to contain the maximum amount of information for subsequent analysis of sequence–activity data [44]. Using D-optimal design, Mee et al. [45] designed, synthesized and tested a training set of 60 analogs of a 15 amino acid antibacterial peptide. A regression-based model derived from the sequence–activity correlation of the 60 data points was used to design and synthesize 39 new peptides predicted to have improved activity. The best designed peptide was twice as potent as the best one in the training set. In their selection of acetylcholinesterase variants, Bucht et al. [35••] also used experimental design to choose the eight gene variants that would best represent the sequence variation they were exploring.
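
The idea behind D-optimal selection can be illustrated on a very small case. The sketch below assumes a hypothetical system of three positions, each with two alternative residues coded −1 and +1, and picks the four variants that maximize det(XᵀX) for a main-effects model. Exhaustive search is used here only because the candidate set is tiny; practical D-optimal software typically relies on exchange algorithms instead. All names are illustrative.

```python
from itertools import combinations, product

def det(m):
    """Determinant by Laplace expansion; adequate for the tiny matrices used here."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, value in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * value * det(minor)
    return total

def information_matrix(rows):
    """X^T X for a main-effects model: an intercept plus one column per position."""
    x = [[1] + list(r) for r in rows]
    k = len(x[0])
    return [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]

def d_optimal(candidates, n_runs):
    """Exhaustively choose the n_runs candidates that maximize det(X^T X)."""
    return max(combinations(candidates, n_runs),
               key=lambda rows: det(information_matrix(rows)))

# All 2^3 = 8 possible variants of the hypothetical three-position system.
candidates = list(product((-1, 1), repeat=3))
design = d_optimal(candidates, 4)
best_det = det(information_matrix(design))

for variant in design:
    print(variant)
```

The four variants chosen form a balanced half-fraction of the eight possibilities: every position is sampled equally at both levels, so each main effect can be estimated with the least variance for that number of tests.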

Accounting for amino acid interactions

If an amino acid change at one position affects the functional consequences of changing other amino acids in a protein, predictive sequence–function models must account for this. To achieve the same quality of model, one that incorporates amino acid interactions requires more data than one that assumes that the amino acids act independently [40•,46••]. In studies of antigen–antibody binding [40•] and ligand–receptor binding [47••], researchers found that very few interaction terms (and thus very little additional data) were needed to produce accurate descriptions of the sequence–activity relationship.

Recent work from Husimi’s group suggests that this result is also true for proteins. Individual amino acid changes contributing to specific properties of dihydrofolate reductase [36], thermolysin and prolyl endopeptidase [37] are approximately independent. Of particular interest is a recent study in which only two of 14 randomly generated mutations that increased prolyl endopeptidase thermostability appeared to be interdependent. The authors’ model contained a single interaction term to account for this residue pair. A gene variant containing the pair predicted to interact was synthesized and tested; its activity was shown to be as predicted by the model. Only 45 gene variants were needed to accurately model the activities of 16 384 possible sequence combinations [46••].
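
The saving that follows from near-independence can be made concrete by counting model terms. The arithmetic below uses only the figures quoted in the text (14 mutations, one interaction term, 45 variants); treating each mutation as simply present or absent, with one intercept, is an assumption made for illustration.

```python
from math import comb

N_MUTATIONS = 14        # mutations combined in the study (from the text)
VARIANTS_TESTED = 45    # gene variants measured (from the text)

# Each mutation is either present or absent, so the space is 2^14.
possible_sequences = 2 ** N_MUTATIONS

# Additive model: one intercept plus one term per mutation.
additive_terms = 1 + N_MUTATIONS

# Adding the single interacting pair reported in the text.
terms_with_one_interaction = additive_terms + 1

# For comparison: a model with every possible pairwise interaction.
terms_with_all_pairs = additive_terms + comb(N_MUTATIONS, 2)

print(possible_sequences, terms_with_one_interaction, terms_with_all_pairs)
```

With 16 terms to estimate, 45 measurements leave ample degrees of freedom, whereas a model carrying all 91 pairwise interactions would need more than twice as many variants before it could be fitted at all.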

Heuristic methods are becoming more widespread

Other successful examples of heuristic approaches to analyze and optimize biological systems include the optimization of peptidase I using neural networks [48], calculations of individual amino acid contributions to serine protease inhibitor activity [49••], PLS-based prediction of the determinants of protein localization [50,51], and protein contact map and interaction site prediction using neural networks [52]. In work complementing modeling to assess the contributions of small numbers of changes at many positions, sequence–activity relationships have been derived using PLS to quantitate the effects of multiple amino acid substitutions at single positions in haloalkane dehalogenase, T4 lysozyme, subtilisin and tryptophan synthase. These methods have also been used to determine the physicochemical properties required at identified positions to confer specific enzyme properties [53]. Furthermore, the same tools have been used to systematically characterize the substrates for a set of haloalkane dehalogenase variants to determine the effects of amino acid changes on substrate specificity of the enzyme [54].

Conclusions: drivers for change

By casting the protein engineering problem as an optimization problem common to other engineering disciplines, we are able to exploit many different problem-solving algorithms. Gone are the technological barriers to synthesizing statistically representative datasets. As Wold predicted in 1986, the capture of protein sequence–activity relationships now permits the design of optimized proteins.

There are several drivers for applying modern engineering tools to protein engineering. Firstly, the human genome project, microarrays and other recent large scientific endeavours have changed biology from a ‘one variable at a time’ science to a science engulfed in variables. Secondly, statistical tools developed and deployed in a variety of engineering areas can now be operated by non-statisticians from any desktop computer. Finally, the cost of generating and sequencing statistically representative sets of genes is continuously decreasing.

It is striking that, by measuring the contribution of amino acid variations to the function of a protein, sequence–activity modeling requires orders of magnitude fewer variants to be tested to design improved sequences than the numbers screened using widespread directed evolution techniques. This is important because methodologies that rely upon screening large sample sets are vulnerable to a common weakness: high-throughput screens often turn out to have limited ability to measure the protein properties that are really important [2,19,40•]. Heuristic methodologies may therefore permit protein engineers to test fewer variants under conditions that more closely approximate their final intended applications and reduce the time and resources that are often spent in building and implementing imprecise high-throughput screens.

Acknowledgements

One of us (CG) began this manuscript while employed at Maxygen Inc. We thank Maxygen for their support.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

• of special interest
•• of outstanding interest

1. Tobin MB, Gustafsson C, Huisman GW: Directed evolution: the ‘rational’ basis for ‘irrational’ design. Curr Opin Struct Biol 2000, 10:421-427.

2. van Regenmortel MH: Are there two distinct research strategies for developing biologically active molecules: rational design and empirical selection? J Mol Recognit 2000, 13:1-4.

3. Ryu DD, Nam DH: Recent progress in biomolecular engineering. Biotechnol Prog 2000, 16:2-16.

4. Nambiar KP, Stackhouse J, Stauffer DM, Kennedy WP, Eldredge JK, Benner SA: Total synthesis and cloning of a gene coding for the ribonuclease S protein. Science 1984, 223:1299-1301.

5. Hellberg S: A Multivariate Approach to QSAR. PhD thesis. Umea, Sweden: University of Umea: 1986.

6. Hellberg S, Sjostrom M, Skagerberg B, Wold S: Peptide quantitative structure-activity relationships, a multivariate approach. J Med Chem 1987, 30:1126-1135.

7. Gustafsson C, Govindarajan S, Emig R: Exploration of sequence space for protein engineering. J Mol Recognit 2001, 14:308-314.

8. Sandberg M, Eriksson L, Jonsson J, Sjostrom M, Wold S: New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J Med Chem 1998, 41:2481-2491.

9. Lin Z, Thorsen T, Arnold FH: Functional expression of horseradish peroxidase in E. coli by directed evolution. Biotechnol Prog 1999, 15:467-471.

10. Glieder A, Farinas ET, Arnold FH: Laboratory evolution of a soluble, self-sufficient, highly active alkane hydroxylase. Nat Biotechnol 2002, 20:1135-1139.

11. •• Lehmann M, Lopez-Ulibarri R, Loch C, Viarouge C, Wyss M, van Loon AP: Exchanging the active site between phytases for altering the functional properties of the enzyme. Protein Sci 2000, 9:1866-1872.
Demonstration that residues identified as functionally important (in this case the entire active site) can be moved from one protein backbone to another, leading to functionally novel catalysts.

12. Looger LL, Dwyer MA, Smith JJ, Hellinga HW: Computational design of receptor and sensor proteins with novel functions. Nature 2003, 423:185-190.

13. Kwasigroch JM, Gilis D, Dehouck Y, Rooman M: PoPMuSiC, rationally designing point mutations in protein structures. Bioinformatics 2002, 18:1701-1702.

14. Tangri S, LiCalsi C, Sidney J, Sette A: Rationally engineered proteins or antibodies with absent or reduced immunogenicity. Curr Med Chem 2002, 9:2191-2199.

15. Pierce NA, Winfree E: Protein design is NP-hard. Protein Eng 2002, 15:779-782.

16. Lathrop RH: The protein threading problem with sequence amino acid interaction preferences is NP-complete. Protein Eng 1994, 7:1059-1068.

17. Hanes J, Pluckthun A: In vitro selection and evolution of functional proteins by using ribosome display. Proc Natl Acad Sci USA 1997, 94:4937-4942.

18. Wells JA, Lowman HB: Rapid evolution of peptide and protein binding properties in vitro. Curr Opin Biotechnol 1992, 3:355-362.

19. Ness JE, del Cardayre SB, Minshull J, Stemmer WP: Molecular breeding: the natural approach to protein design. Adv Protein Chem 2000, 55:261-292.

20. Johnson DS, McGeoch LA: The traveling salesman problem: a case study in local optimization. In Local Search in Combinatorial Optimization. Edited by Aarts EHL, Lenstra JK, Aarts EL: John Wiley & Sons Ltd; 1997:215-310.

21. Casari G, Sander C, Valencia A: A method to predict functional residues in proteins. Nat Struct Biol 1995, 2:171-178.

22. • del Sol Mesa A, Pazos F, Valencia A: Automatic methods for predicting functionally important residues. J Mol Biol 2003, 326:1289-1302.
Excellent comparison of methods available to identify residues that contribute to protein function.

23. • Gogos A, Jantz D, Senturker S, Richardson D, Dizdaroglu M, Clarke ND: Assignment of enzyme substrate specificity by principal component analysis of aligned protein sequences: an experimental test using DNA glycosylase homologs. Proteins 2000, 40:98-105.
Principal component analysis of small numbers of proteins used to identify residues likely to be involved in substrate specificity determination.

24. Suzuki Y, Gojobori T: A method for detecting positive selection at single amino acid sites. Mol Biol Evol 1999, 16:1315-1328.

25. Jonsson J, Norberg T, Carlsson L, Gustafsson C, Wold S: Quantitative sequence-activity models (QSAM) – tools for sequence design. Nucleic Acids Res 1993, 21:733-739.

26. • Govindarajan S, Ness JE, Kim S, Mundorff EC, Minshull J, Gustafsson C: Systematic variation of amino acid substitutions for stringent assessment of pairwise covariation. J Mol Biol 2003, 328:1061-1069.
Fifty-two phylogenetically identified substitutions in subtilisins are accepted into one enzyme backbone, modifying its activity. Most natural changes that occur together are shown to be a result of descent from a common ancestor and not a result of functional constraints.

27. Lehmann M, Loch C, Middendorf A, Studer D, Lassen SF, Pasamontes L, van Loon AP, Wyss M: The consensus concept for thermostability engineering of proteins: further proof of concept. Protein Eng 2002, 15:403-411.

28. Benner SA, Cohen MA, Gonnet GH: Amino acid substitution during functionally constrained divergent evolution of protein sequences. Protein Eng 1994, 7:1323-1332.

29. Wu TD, Brutlag DL: Discovering empirically conserved amino acid substitution groups in databases of protein families. Proc Int Conf Intell Syst Mol Biol 1996, 4:230-240.

30. Adenot M, Sarrauste de Menthiere C, Chavanieu A, Calas B, Grassy G: Peptides quantitative structure-function relationships: an automated mutation strategy to design peptides and pseudopeptides from substitution matrices. J Mol Graph Model 1999, 17:292-309.

31. • Dimmic MW, Rest JS, Mindell DP, Goldstein RA: rtREV: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny. J Mol Evol 2002, 55:65-73.
A substitution matrix for maximum likelihood phylogenetic analysis is developed that is optimized on a subset of sequences. Substitution matrices are unique for each sequence subset.

32. Norinder U, Rivera C, Unden A: A quantitative structure-activity relationship study of some substance P-related peptides. A multivariate approach using PLS and variable selection. J Pept Res 1997, 49:155-162.

33. Sandberg M: Deciphering Sequence Data, a Multivariate Approach. PhD thesis. Umea: Umea University: 1997.

34. Geladi P, Kowalski BR: Partial least squares regression: a tutorial. Anal Chim Acta 1986, 186:1-17.

35. •• Bucht G, Wikstrom P, Hjalmarsson K: Optimising the signal peptide for glycosyl phosphatidylinositol modification of human acetylcholinesterase using mutational analysis and peptide-quantitative structure-activity relationships. Biochim Biophys Acta 1999, 1431:471-482.
PLS and experimental design are used to optimize acetylcholinesterase, increasing its surface expression on cells threefold.

36. Aita T, Iwakura M, Husimi Y: A cross-section of the fitness landscape of dihydrofolate reductase. Protein Eng 2001, 14:633-638.

37. Aita T, Uchiyama H, Inaoka T, Nakajima M, Kokubo T, Husimi Y: Analysis of a local fitness landscape with a model of the rough Mt. Fuji-type landscape: application to prolyl endopeptidase and thermolysin. Biopolymers 2000, 54:64-79.

38. Strom MB, Haug BE, Rekdal O, Skar ML, Stensen W, Svendsen JS: Important structural features of 15-residue lactoferricin derivatives and methods for improvement of antimicrobial activity. Biochem Cell Biol 2002, 80:65-74.

39. Eriksson L, Jonsson J, Hellberg S, Lindgren F, Skagerberg B, Sjostrom M, Wold S: Peptide QSAR on substance P analogues, enkephalins and bradykinins containing L- and D-amino acids. Acta Chem Scand A 1990, 44:50-55.

40. • Choulier L, Andersson K, Hamalainen MD, van Regenmortel MH, Malmqvist M, Altschuh D: QSAR studies applied to the prediction of antigen-antibody interaction kinetics as measured by BIAcore. Protein Eng 2002, 15:373-382.
Multivariate analysis applied to sequence optimization and reaction conditions.

41. • Hoover DM, Lubkowski J: DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Res 2002, 30:e43.
The shape of things to come. Gene synthesis gets cheaper and easier.

42. Holowachuk EW, Ruhoff MS: Efficient gene synthesis by Klenow assembly/extension-Pfu polymerase amplification (KAPPA) of overlapping oligonucleotides. PCR Methods Appl 1995, 4:299-302.

43. • Abecassis V, Pompon D, Truan G: High efficiency family shuffling based on multi-step PCR and in vivo DNA recombination in yeast: statistical and functional analysis of a combinatorial library between human cytochrome P450 1A1 and 1A2. Nucleic Acids Res 2000, 28:E88.
One of many library synthesis methods. Interesting analysis of variants in which hybridization signals instead of known sequence changes are used as input variables for modeling.

44. Hellberg S, Eriksson L, Jonsson J, Lindgren F, Sjostrom M, Skagerberg B, Wold S, Andrews P: Minimum analogue peptide sets (MAPS) for quantitative structure-activity relationships. Int J Pept Protein Res 1991, 37:414-424.

45. Mee RP, Auton TR, Morgan PJ: Design of active analogues of a 15-residue peptide using D-optimal design, QSAR and a combinatorial search algorithm. J Pept Res 1997, 49:89-102.

46. •• Aita T, Hamamatsu N, Nomiya Y, Uchiyama H, Shibanaka Y, Husimi Y: Surveying a local fitness landscape of a protein with epistatic sites for the study of directed evolution. Biopolymers 2002, 64:95-105.
A model of only 45 prolyl endopeptidase variants accurately predicts the activities of combinations of 14 different mutations. Only one interaction term is required in the model.

47. •• Prusis P, Lundstedt T, Wikberg JE: Proteo-chemometrics analysis of MSH peptide binding to melanocortin receptors. Protein Eng 2002, 15:305-311.
Statistically representative sets of melanocortin peptide and chimeric receptors were analyzed. Models incorporated linear and interaction terms; predictions were externally validated.

48. Schneider G, Schrodl W, Wallukat G, Muller J, Nissen E, Ronspeck W, Wrede P, Kunze R: Peptide design by artificial neural networks and computer-based evolutionary search. Proc Natl Acad Sci USA 1998, 95:12179-12184.

49. •• Lu SM, Lu W, Qasim MA, Anderson S, Apostol I, Ardelt W, Bigler T, Chiang YW, Cook J, James MN et al.: Predicting the reactivity of proteins from their sequence alone: Kazal family of protein inhibitors of serine proteinases. Proc Natl Acad Sci USA 2001, 98:1410-1415.
The conclusion of an heroic 20 year study. By synthesizing and testing <200 variants, activities of many natural proteinases can be accurately predicted.

50. Sjostrom M, Wold S, Wieslander A, Rilfors L: Signal peptide amino acid sequences in Escherichia coli contain information related to final protein localization. A multivariate data analysis. EMBO 1987, 6:823-831.

51. Schein AI, Kissinger JC, Ungar LH: Chloroplast transit peptide prediction: a peek inside the black box. Nucleic Acids Res 2001, 29:E82.

52. Fariselli P, Pazos F, Valencia A, Casadio R: Prediction of protein–protein interaction sites in heterocomplexes with neural networks. Eur J Biochem 2002, 269:1356-1361.

53. Damborsky J: Quantitative structure-function and structure-stability relationships of purposely modified proteins. Protein Eng 1998, 11:21-30.

54. Marvanova S, Nagata Y, Wimmerova M, Sykorova J, Hynkova K, Damborsky J: Biochemical characterization of broad-specificity enzymes using multivariate experimental design and a colorimetric microplate assay: characterization of the haloalkane dehalogenase mutants. J Microbiol Methods 2001, 44:149-157.
