


Classification of Proteins Based on Minimal Modular Repeats:

Lessons from Nature in Protein Design

Brett M. Barney*

Department of Chemistry and Biochemistry, 0300 Old Main Hill, Utah State University, Logan, Utah 84322

Received April 14, 2005

Proteins containing internal repeats within their primary sequence have received increased attention recently, as the extent of their presence in various organisms is recognized more fully, and their role in evolution is more thoroughly studied. Presented here is a technique used to detect and classify proteins based on a modular evolutionary phenomenon that results in a series of small internal repeats. The parameters chosen are based on a minimum segment of seven residues that result in simple functional scaffolds. The genomes and corresponding proteomes of a variety of eubacteria and archaea have been analyzed using an algorithm that searches prokaryotic genomes for proteins containing small conserved repeats assembled in a modular fashion similar to a recently characterized protein from the organism Nitrosomonas europaea. This analysis has revealed additional proteins present in N. europaea with similar modular characteristics. A further survey of a variety of organisms demonstrates that this evolutionary pathway has been utilized in other organisms as well, to yield a broad assortment of small modular proteins. A thorough description of the sequential characteristics of these modular proteins follows, along with a selection and discussion of the various proteins uncovered through this expanded search and analysis. Several databases of the proteins uncovered from this work and the program used to perform the search are available.

Keywords: internal protein repeats • modular design • algorithm • small proteins • database

Introduction

Evolution is a process of selective improvement based on inheritance, mutation, and reorganization of genes to meet changing needs. In the microbial world, genes are readily transferred between species through a variety of processes,1,2 and transferred genes that become beneficial to the survival of a rapidly evolving organism are often incorporated directly into the genome. Many genes evolve from previous functional scaffolds, which already exist as another protein within the genome. The diversity of this process is broad, and enhances the ability of organisms to adapt to a variety of environments. The incidence of genes that have borrowed significant sequence similarity from other genes is illustrated by the extensive number of proteins that have high similarity to other proteins with separate functions within a single genome. These can occur either through duplications of entire genes as protein paralogs,3 or as repeated sequences of amino acids within the same gene (internal repeat proteins)2-4 as discussed here.

In contrast to the many families of proteins with similarities in either related or unrelated species, many of the genes that are uncovered with each new genome sequencing project are unique hypothetical proteins, or singletons, that have no known homologues, even in closely related species.2 In some cases, as described previously and addressed here, these unique hypothetical proteins also contain internal repeats,5 resulting in a new class of internal repeat protein.

Protein internal repeats come in a variety of forms, ranging from single residue repeats, through small tandem repeats, up to large domain repeats of hundreds of residues.6 The concept of protein evolution occurring by duplication of specific sequences within genes goes back several decades7 and cellular mechanisms resulting in repetitive proteins are well established.1,2,4,8 The topic of sequence repeats within single proteins has been reviewed in recent years,2,9 and a variety of algorithms3,10-13 have been developed to search for regions of high similarity, not only between different proteins,14 but also within individual proteins.3,10,13 Classes of protein repeats such as leucine-rich and ankyrin repeats are well established in the literature,9,15,16 as well as the relationships of many repeat proteins to different cellular functions.8 Repeating sequences have also been implicated as a major feature in certain classes of proteins, such as intrinsically unstructured proteins.17 In recent years, the topic of repeating sequences has become relevant for proteins related to various diseases,4,6 including the prion proteins.17-19

Estimates based on previous assessments show that the existence of repeats in the primary structure of proteins from several species is significant (about 14%).3,8 With the advent of modern whole genome sequencing projects, a great deal of information is now available for large scale data mining, and has been used to make many general and specific observations about the processes used throughout evolution to transfer information and develop new proteins.

* To whom correspondence should be addressed. Tel: (435) 797-7392. Fax: (435) 797-3390. E-mail: [email protected].

10.1021/pr050103m CCC: $33.50 2006 American Chemical Society. Journal of Proteome Research 2006, 5, 473-482. Published on Web 02/08/2006

Previous work with a small metal binding protein (SmbP) (NCBI protein accession number NP_842452) revealed a pattern of repeats based on a seven-residue motif with high conservation of specific residues.5 As a model, the seven-residue repeat unit is a common feature of coiled-coils,17,20,21 though coiled-coils are not believed to be a feature of SmbP. Seven-residue repeats have also been shown to be a feature of the transmembrane regions from a number of viral proteins.22,23 In this work, the modular features of SmbP were used to define the criteria for a simple modular protein, and to construct an algorithm which would analyze the primary sequences of small proteins to identify other proteins sharing similar features of modular design. This effort has resulted in a simple and easily amenable algorithm to search for similar modular proteins within the host organism, Nitrosomonas europaea. The search was further expanded to include a broad selection of prokaryotic genomes. A summary of these findings along with a database of proteins identified is provided. The program used to perform the searches is also freely available at http://cc.usu.edu/~bbarney/homepage.htm.

Materials and Methods

Protein Identification and Analysis Software. All proteins are cross-referenced in this work according to the National Center for Biotechnology Information (NCBI) accession numbers. The translated protein sequences in FASTA format for individual prokaryotic genomes were obtained through the NCBI microbial genomes website (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi), and were subsequently used as the query database for each genome. In addition, sequences from the Research Collaboratory for Structural Bioinformatics (RCSB) protein data bank24 were obtained as the raw FASTA protein primary sequence data, and modified to remove redundant sequences. Individual structures for a selection of the proteins identified were then obtained and viewed for comparison and determinations of the roles repeats played in each structure. Multiple alignments of protein sequences were done using Multalin,25 while BLAST14 searches were accomplished using the NCBI web-based search engines.

Identification of Proteins with Simple Modular Characteristics. The aim of this study was to search the large database of genome sequences for proteins with a primary sequence containing modular characteristics similar to that revealed in the protein SmbP.5 To perform such a search, the characteristics that describe the features of the primary sequence of this modular protein had to be defined. Four characteristics were selected as search criteria which define the modular nature of this protein.

First, SmbP is a small protein of 117 residues (93 following removal of an N-terminus leader sequence which directs it to the periplasm).5 Size was also selected as a search constraint because increased size leads to an increased probability that additional sequences with similar compositions might occur simply by chance. The size requirement was thus a factor of the search, since small proteins with multiple repeats of high similarity were the target of this work. So as not to bias the search unintentionally, and still include the parameter of size as a restriction, a limit of 250 residues was selected in this initial assessment to meet the requirement of the protein being small (300 residues was used as the criterion for proteins from the RCSB protein data bank to expand slightly the selection of proteins identified).

The next characteristic of this modular protein was the seven-residue repeat. The Internal Repeat Search Program is built around a simple algorithm that analyzes the primary sequence of a protein in increments of seven tandem residue windows, each of which could serve as the parent sequence. The rest of the protein is then compared to the parent seven-residue sequence, searching for other sets of seven tandem residues within the protein with a high degree of similarity to the parent sequence, and quantifying a similarity score using the values obtained from either the BLOSUM45 or BLOSUM62 Scoring Matrix.26,27 Following each analysis, a score is generated for the specific parent seven-residue sequence window, and the parent sequence window shifts down one residue to repeat the entire analysis, so that the highest scoring window of seven residues for each protein can be determined (see example in Figure 1).
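The sliding parent-window comparison described above can be sketched as follows. This is a minimal illustration and not the Internal Repeat Search Program itself: the function names are hypothetical, positions are 0-based (the figures use 1-based residue numbers), and `toy_pair_score` is a stand-in for a BLOSUM45 or BLOSUM62 lookup (assumed here to be +4 for identical residues and -1 otherwise). A real substitution matrix could be passed in its place.

```python
from typing import Callable, List, Optional, Tuple

WINDOW = 7  # repeat length used as the search unit in the text


def toy_pair_score(a: str, b: str) -> int:
    """Stand-in for a BLOSUM lookup (assumption): +4 identity, -1 otherwise."""
    return 4 if a == b else -1


def segment_score(parent: str, other: str,
                  pair_score: Callable[[str, str], int] = toy_pair_score) -> int:
    """Sum the residue-by-residue scores of two equal-length segments."""
    return sum(pair_score(a, b) for a, b in zip(parent, other))


def scan_parent(seq: str, parent_start: int, min_score: int,
                pair_score: Callable[[str, str], int] = toy_pair_score
                ) -> List[Tuple[int, int]]:
    """Return (start, score) for every other window meeting min_score."""
    parent = seq[parent_start:parent_start + WINDOW]
    hits = []
    for start in range(len(seq) - WINDOW + 1):
        if start == parent_start:
            continue  # the parent is not scored against itself
        score = segment_score(parent, seq[start:start + WINDOW], pair_score)
        if score >= min_score:
            hits.append((start, score))
    return hits


def best_parent(seq: str, min_score: int,
                pair_score: Callable[[str, str], int] = toy_pair_score
                ) -> Optional[Tuple[int, int]]:
    """Shift the parent window one residue at a time; keep the best total."""
    best = None
    for p in range(len(seq) - WINDOW + 1):
        total = sum(s for _, s in scan_parent(seq, p, min_score, pair_score))
        if best is None or total > best[1]:
            best = (p, total)
    return best
```

For a made-up sequence such as `"ACDEFGH" + "KLM" + "ACDEFGH" + "NPQ" + "ACDEFGH"`, the first motif copy is reported as the parent, with the two later copies as its hits. This sketch deliberately omits the tandem bonus and the overlap handling, which the text treats as separate criteria.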

In addition to SmbP containing a seven-residue repeat, it was also evident that some of these repeats occurred in tandem with the next repeat, and could include two or three sets of seven-residue repeats, along with linker regions which also contain similarities to other linker regions in the protein (Figure 2A). On the basis of this attribute, an additional score was given to each repeat followed immediately by an additional seven-residue repeat to bias the selection of repeats which were arranged in a tandem manner. The algorithm may be biased in this manner, which adds to the score of the current set of seven-tandem residues. The program allows this feature to be selected, as was done in this work.

Figure 1. Description of Program Algorithm. Above is an example of the calculations made using the Internal Repeat Search Program for the hypothetical protein NP_870412.1 from Rhodopirellula baltica. The top shows the protein sequence with specific residues labeled. The algorithm is performed using the specific parameters of score (based on the BLOSUM62 matrix in this example) using the seven residue sequence starting at residue 74 as the parent sequence (red), and listing any other seven residue segments before or after this sequence that meet a minimum score (here selected as 13). In addition, sequences followed immediately by an additional sequence meeting the same score requirement get an additional score (here 10, given to segments beginning at 11, 94, and 74, the parent segment, which does not get scored against itself, but in this instance, receives the additional score for the segment at 81). Segments such as CCRTCEQ starting at 97 with a score of 17 are thrown out due to overlap of a higher scoring sequence (starting at residue 94 with a score of 34). The algorithm performs this calculation for each segment of seven residues, starting at 1, and ending here at 104, and generates the highest total scoring window of seven residues. Proteins meeting a minimal total score (150 using BLOSUM45 and 120 using BLOSUM62 in this work) were selected for further analysis.
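One way to picture the tandem bias is as a bonus added after the qualifying segments are known. The sketch below is an assumption-laden illustration: `total_with_tandem_bonus` is a hypothetical name, positions are 0-based, and it assumes that only a scored segment can act as the "following" repeat, while the parent itself may receive the bonus when a scored segment starts exactly seven residues after it (as Figure 1 describes for the parent segment). The positions and scores in the usage note are invented.

```python
from typing import Dict

WINDOW = 7  # repeat length used as the search unit in the text


def total_with_tandem_bonus(hits: Dict[int, int], parent_start: int,
                            bonus: int = 10) -> int:
    """Sum segment scores, adding a bonus for each tandem arrangement.

    hits maps the start of each accepted segment to its similarity score.
    A segment (or the parent) earns the bonus when another accepted
    segment begins immediately after it, WINDOW residues downstream.
    """
    total = sum(hits.values())
    for start in list(hits) + [parent_start]:
        if start + WINDOW in hits:
            total += bonus
    return total
```

With hypothetical hits `{3: 20, 10: 15, 40: 30}` and a parent starting at 33, the segment at 3 is followed by the one at 10 and the parent is followed by the one at 40, so two bonuses of 10 are added to the raw sum of 65.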

The final criteria were chosen to eliminate overlap and multiple scoring arising from proteins with long sequences of low complexity (polyalanine regions or incremental repeats such as GAGAGAGA), which do not fit the desired unique seven residue repeats found in SmbP. This problem was a common occurrence within many of the proteins found based on the initial three selection criteria, and included the additional problem associated with single-residue repeating regions, or proteins consisting primarily of only two or three residues. While the algorithm does not ignore this possibility, it does eliminate the multiple-scoring problem that is a consequence of this problem. This is accomplished by scoring the highest frame first, and then eliminating any score for the six frames of seven-residues prior to, and following each scored seven-residue segment, and retaining only those sequences which do not overlap one another. An example of how a protein is scored for an actual protein sequence using this technique is illustrated in Figure 1.
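The overlap rule lends itself to a greedy selection, sketched below under stated assumptions: `drop_overlaps` is a hypothetical helper, candidates are `(start, score)` pairs that already meet the minimum score, and ties are broken in favor of the earlier start (the text does not say how ties are handled).

```python
from typing import List, Tuple

WINDOW = 7  # repeat length used as the search unit in the text


def drop_overlaps(candidates: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep the highest-scoring frames and discard any frame that overlaps one.

    Frames are visited from highest to lowest score. A frame is accepted
    only if it starts at least WINDOW residues away from every accepted
    frame, which removes the six frames before and after each accepted one.
    """
    accepted: List[Tuple[int, int]] = []
    for start, score in sorted(candidates, key=lambda c: (-c[1], c[0])):
        if all(abs(start - kept) >= WINDOW for kept, _ in accepted):
            accepted.append((start, score))
    return sorted(accepted)
```

Using the two values quoted in the Figure 1 caption, a frame at 97 scoring 17 is discarded because it overlaps the frame at 94 scoring 34. For a homopolymer in which every frame scores the same, the function keeps one frame out of every seven, so a long low-complexity run is still counted but is no longer counted at every position.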

Software and Platform. The algorithm was designed for use on a single PC computer as a stand-alone application using Visual Basic (Microsoft) and the .net framework. Text files of the translated protein sequences from each of the genomes analyzed in this work were obtained through the NCBI genomes site (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). Specific genome sequences analyzed in this work are presented in the supporting information. Sequences in the FASTA protein format were first processed by a format correction algorithm, to correct for any misuse of the “]” symbol, which is used by the search algorithm to denote the end of the protein title, and the beginning of the coding sequence. Once processed and formatted, the entire proteome was analyzed based on the selection criteria described above. The software used to perform this analysis is available at http://cc.usu.edu/~bbarney/homepage.htm. In addition, all proteins uncovered in the current database arranged in FASTA format are also available at the same site.
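As a rough illustration of the input step, the sketch below reads FASTA-formatted text and keeps only proteins under the size limit. It is not a port of the Visual Basic program: it splits records on the conventional `>` header line rather than on the “]” symbol, and the names `read_fasta` and `small_proteins` are hypothetical.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def small_proteins(records: Dict[str, str], max_len: int = 250) -> Dict[str, str]:
    """Apply the size criterion: keep proteins of at most max_len residues."""
    return {h: s for h, s in records.items() if len(s) <= max_len}
```

A proteome file could be passed in as `read_fasta(open(path))`, where `path` is whatever file was downloaded; the 300-residue limit used for the RCSB sequences would be `small_proteins(records, max_len=300)`.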

Statistical Analysis. The statistical significance of protein similarity can be calculated by a number of techniques.28-30 One preferred technique is to generate multiple random sequences of the appropriate length and composition. This method was selected here, to determine the likelihood that multiple sets of sequences with similarity to the sequence identified as the parent seven-residue sequence from each protein would occur simply by random events. To analyze the likelihood that this is the case, the Z-score for each identified parent sequence was calculated from a specific number of randomizations of the initial protein sequence. The search for sequences with a similarity to the parent sequence determined from the actual high score of the original sequence was done for each of the randomized sequences. In this manner, the composition and size of each analysis was maintained. The Z-score was calculated using the common statistical equation and the parent sequence found for the actual sequence. The same parent sequence was used to calculate a score for each of the randomizations, which were then used to calculate a final Z-score. For the statistics calculated by the program, 100 randomizations were performed, and the data were plotted so that the percent of proteins above a specific Z-score in the two databases generated here could be compared for the analysis using either BLOSUM45 or BLOSUM62.
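The randomization test can be sketched as below, taking the "common statistical equation" to be Z = (observed score - mean of randomized scores) / standard deviation of randomized scores. Several details are assumptions: the sample standard deviation is used (the text does not say which form was applied), `score_fn` is any hypothetical function that scores a sequence against a fixed parent, and the case where every randomization gives the same score is handled by returning an infinite or zero value rather than dividing by zero.

```python
import math
import random
import statistics
from typing import Callable, Sequence


def z_score(observed: float, random_scores: Sequence[float]) -> float:
    """Z = (observed - mean) / standard deviation of the randomized scores."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)  # sample SD (assumption)
    if sd == 0:
        if observed == mean:
            return 0.0
        return math.inf if observed > mean else -math.inf
    return (observed - mean) / sd


def shuffled(seq: str, rng: random.Random) -> str:
    """Randomize residue order while keeping length and composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def repeat_z_score(seq: str, parent: str,
                   score_fn: Callable[[str, str], float],
                   n: int = 100, seed: int = 0) -> float:
    """Score the real sequence, then n shuffles, against the same parent."""
    rng = random.Random(seed)
    observed = score_fn(seq, parent)
    randomized = [score_fn(shuffled(seq, rng), parent) for _ in range(n)]
    return z_score(observed, randomized)
```

Because each shuffle reuses the residues of the original protein, size and composition are held constant, which is the property the text relies on when comparing the observed score with the randomized ones.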

Results and Discussion

Algorithm Parameters and Development. The aim of this work was to analyze the features of the primary sequence of a recently characterized modular protein, and use these features as a model to develop a search algorithm to uncover other proteins with similar modular characteristics. The physical properties of the modular protein used as the model for this search have been reported,5 and characterization of this unique protein is continuing, while the interesting features of the primary sequence offer further avenues of study. To search for additional proteins with small modular characteristics, the characteristics were first defined in terms of general criteria, to develop an algorithm to search the extensive protein databases for other proteins with similar features.

At first glance, the model protein SmbP can be described as a small protein with a unique seven-residue motif with an absolutely conserved histidine in the fourth position (Figure 2A). This motif is found 10 times in the primary sequence, existing in tandem in several locations. A closer examination of the repeats, aligning each repeat below the next and considering residues outside the seven-residue repeat region, reveals similarities between these “linker regions” as well. Viewed in the context of modular design, SmbP might rather be defined as a simple protein made up of seven or twelve residue modules, with each module containing the same level of conservation within the first seven residues (Figure 2A). The algorithm used in this work was designed to search for repeats of seven-residue length, based on this minimal sized module from the model sequence on which it was based.

Figure 2. Proteins from Nitrosomonas europaea with features similar to SmbP. This figure shows SmbP (A), and a selection of three other proteins uncovered from Nitrosomonas europaea using the algorithm described in this work. Shown are SmbP (NP_842452.1) (A), a hypothetical protein (NP_842243.1) (B), a hypothetical protein (NP_842232.1) (C), and a hypothetical protein (NP_842440.1) (D). Additional sequence regions not shown are depicted as (..) for clarity. The seven-residue sequence(s) used as the parent sequence which other segments were scored against are shown in red, as are all absolutely conserved residues. All were scored using the BLOSUM45 matrix.

As described under the Materials and Methods section, the selection criteria in this work are based on four specific requirements: (1) a maximum size of 250 residues for proteins surveyed in this work from various genome sequencing projects (300 residues from the RCSB protein data bank24), (2) a similarity comparison of seven-residue segments based on either the BLOSUM45 or BLOSUM62 matrix26,27 to calculate a simple score for each seven-residue segment, (3) an added score for sequences meeting the minimal score that are followed immediately by another set of seven-residues that meet the same minimal score requirement, and (4) selection of only the highest scoring sequence of seven residues when multiple frames meet the minimum score requirement and overlap one another. This final requirement selects the highest scoring repeat within each thirteen-residue frame, and eliminates excessive scores for regions of low complexity (e.g., AAAGGGGGGAA). These selection rules allow the algorithm to generate a score for each protein in a genome, and narrow the analysis down to the few proteins which fit the numerical scoring requirements in this classification. A more extensive analysis as well as statistical comparisons for the significance of the repeat identified may then proceed on each individual basis.

Database Construction From Prokaryotic Genomes. Once the algorithm was developed and demonstrated to be capable of identifying the repeats in the model sequence, the first aim was to search for additional proteins from Nitrosomonas europaea with similar modular features to those found in the model sequence of SmbP, as specified from the criteria stated above. Though the algorithm is biased for repeats of seven residues, it is also capable of uncovering proteins with larger repeating units, though the scoring in this work is based only on the first seven. The specifics of the seven-residue segment analysis make it inferior in determining segments larger than 10, where other programs such as RADAR or TRUST10,13 perform much better, though in some instances, it will find these as well.

An analysis of known and putative proteins from the genome of N. europaea revealed a number of proteins with small modular features. Two of these hypothetical proteins (Figure 2B,C) are similar in nature, with the former containing a highly conserved DQRF sequence within the repeat, and the latter containing a highly conserved DKRF sequence, and both are assembled in modules of seven or eleven residues. The other hypothetical protein (Figure 2D) is an assembly of 9, 10, or 11 residues, with varied degrees of conservation at each of the residues. In total, nine proteins were found in the N. europaea genome that met the score requirements used for the initial scan of all genomes, based on similarities as defined by the BLOSUM45 Matrix.26,27

Following the analysis of the N. europaea genome, a broadened search of other completed genomes was initiated to develop a more extensive database of these modular proteins containing similar features for further characterization and comparison. The complete database compiled from the analysis of more than one hundred bacterial and archaeal genomes is available online (http://cc.usu.edu/~bbarney/homepage.htm), and a list of the genomes searched in this work is available as supplementary information. This work yielded a large variety of proteins with vastly different characteristics, and further classification could be handled in a variety of ways.

Determination of Repeat Significance. There are a number of different techniques that are used by BLAST searches for determining the significance of alignments.28-30 Several different approaches were applied to the databases generated here to assess the significance of the repeats found. The problem of bias is of great concern, especially for a protein sequence consisting primarily of repeating units, where the complexity of the sequence (here defined as the percent of various amino acid residues in the specific protein versus relative percents in all proteins as a whole) can in many cases be weighted very heavily in favor of only a few specific residues. The approach used here to assess the significance of repeats was to first determine the seven-residue “parent sequence” from each protein with the highest similarity score based on either the BLOSUM45 or BLOSUM62 similarity matrix, and then randomize the sequence and determine the similarity score of the same “parent sequence” again for the new sequence following randomization. In each case, the score was calculated using all the same parameters. Based on this analysis, performed 100 times for each sequence meeting the minimum score requirements, a Z-score was calculated. One flaw in this calculation is that the original “parent sequence” does not get scored against itself (though additional absolutely conserved seven-residue segments are scored). However, in the randomizations, the entire sequence is scored against the parental sequence. This statistical check was favored as a means of investigating sequences which might arise by random events due to low sequence diversity.

The Z-score allows one to estimate the probability that a value is significantly different from another value. In this work, the value obtained based on the parental sequence for the actual protein, and the mean and standard deviation for the scores obtained from 100 randomizations scored again versus the parental sequence, are used to give a numeric value which can be used to approximate probabilities of the same score being obtained by random chance. In this manner, higher Z-scores indicate a higher confidence for the significance of the repeat. From the plot shown in Figure 3, it is clear that the majority of proteins from the database generated using BLOSUM62 have a much higher Z-score than those obtained using BLOSUM45. This is largely related to the lower overall score given to similar, though not identical, residues from the two matrices, resulting in a large number of the proteins from the BLOSUM45 database being discarded following reanalysis using BLOSUM62. Though both databases contain some proteins with Z-scores below 1.0, many of these contained larger repeats, for which residues beyond the first seven were not accounted for in the statistical calculation. For this reason, the statistical test was not applied here to screen and filter out such proteins, in favor of retaining the database for further analysis once the algorithm is modified and improved upon to include other sizes between 5 and 10 residues, and once more extensive statistical tests are developed based on these improved algorithms. As an example of this point, the hypothetical protein from Pseudomonas aeruginosa (NP_250880.1, not shown) yielded a very low Z-score based only on these seven residues of the repeat segment. A simple visual inspection indicated that this protein contained larger repeats, as identified by both the RADAR13 and TRUST10 programs. When the same calculation was performed based on a 10 residue segment, the Z-score improved dramatically.


One of the primary problems associated with defining repeats is the determination of sequence complexity. Though techniques exist to deal with such complexity, these may be highly biased and subjective. This problem is circumvented by some web-based BLAST searches by ignoring regions of low complexity through the use of filters. Attempts were made to remove proteins from these databases based on low complexity (here defined as proteins consisting of 50% or greater content of three or fewer specific amino acids), but visual inspection and the Z-score determined for many proteins with low complexity supported the belief that for many of these proteins, the repeats are significant enough as to not have been formed by simple random events, and are instead the result of a modular mechanism.
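The low-complexity definition given above (50% or greater content of three or fewer amino acids) is simple enough to express directly. The following is a minimal sketch with a hypothetical function name, applied to made-up sequences rather than to any protein from the databases.

```python
from collections import Counter


def is_low_complexity(seq: str, top_n: int = 3, threshold: float = 0.5) -> bool:
    """True when the top_n most common residues make up >= threshold of seq."""
    if not seq:
        return False
    counts = Counter(seq)
    top_total = sum(count for _, count in counts.most_common(top_n))
    return top_total / len(seq) >= threshold
```

For example, `is_low_complexity("GGGGGGSSAAKLMN")` is true because glycine, serine, and alanine account for 10 of 14 residues, while a sequence containing each of the 20 standard residues once is not flagged. As the text notes, such a flag was ultimately not used to remove proteins, since many flagged proteins still showed significant repeats.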

Examples of low complexity proteins are shown in Figure 4, and include the glycine-rich RNA binding region of RNP-1 (Figure 4A), which has a total content as translated of almost 40% glycine, and nearly 58% glycine in the region shown. Many of the glycine residues in the protein occur in tandem with other runs of glycine, resulting in a protein that lacks significant complexity, and could easily be aligned in many ways to appear as though it contains repeats. These properties would make such a protein a prime target for removal by use of a filter. However, these low-complexity proteins also include the morphogenetic protein from Bacillus subtilis (Figure 4C), which contains a series of thirteen-residue and seven-residue modular repeats with very high conservation, even though nearly 59% of the total protein consists of lysine, histidine and serine residues. Histone protein H1 from Xanthomonas axonopodis (Figure 4D) and a hypothetical protein from Mycobacterium leprae (Figure 4B) are also shown, and likewise exhibit modular characteristics despite their low complexity. All of these repeats had high Z-scores, and the scores would likely be much higher if the repeats were scored as nine-, six-, or eight-residue repeats. They were also identified by either the RADAR13 or TRUST10 programs, though RADAR always identified these as larger repeats composed of several tandem segments and TRUST only identified a portion of the repeats identified by this algorithm. These results should not be surprising, if one again considers SmbP, the model used to develop this algorithm. Following cleavage of the leader sequence,5 more than 47% of SmbP is composed of only alanine, glutamate and histidine, and the protein is completely devoid of six of the 20 standard residues. This simplicity is an overall theme of such a modular design for repeating sequences in proteins.
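The Z-scores referred to here compare the score of the real sequence with scores obtained from 100 randomizations of the same primary sequence (Figure 3). A minimal sketch of that comparison follows. The scoring function is a deliberately simplified stand-in based on identities (an assumption, not the BLOSUM-based scoring of the published algorithm), and the identity cutoff, function names, and toy sequence are likewise hypothetical.

```python
import random
import statistics

WINDOW = 7          # seven-residue unit discussed in the text
MIN_IDENTITIES = 4  # assumed cutoff for counting a window as similar


def identities(a: str, b: str) -> int:
    """Number of positions at which two equal-length segments are identical."""
    return sum(x == y for x, y in zip(a, b))


def repeat_score(seq: str) -> int:
    """Stand-in score: take each seven-residue window as the parent, sum its
    identities with every non-overlapping window that passes the cutoff,
    and return the best total over all parents."""
    windows = [seq[i:i + WINDOW] for i in range(len(seq) - WINDOW + 1)]
    best = 0
    for p, parent in enumerate(windows):
        total = 0
        for q, other in enumerate(windows):
            if abs(q - p) < WINDOW:  # skip the parent and overlapping windows
                continue
            same = identities(parent, other)
            if same >= MIN_IDENTITIES:
                total += same
        best = max(best, total)
    return best


def z_score(seq: str, n_shuffles: int = 100, seed: int = 0) -> float:
    """Z = (observed - mean of shuffled scores) / standard deviation."""
    rng = random.Random(seed)
    observed = repeat_score(seq)
    residues = list(seq)
    shuffled = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same composition, randomized order
        shuffled.append(repeat_score("".join(residues)))
    mean = statistics.mean(shuffled)
    sd = statistics.stdev(shuffled)
    if sd == 0:
        return 0.0 if observed == mean else float("inf")
    return (observed - mean) / sd


# Hypothetical perfect tandem repeat of a seven-residue module.
toy = "AEHHKDE" * 8
print(repeat_score(toy))  # 49: seven other in-frame copies, seven identities each
print(z_score(toy))
```

Because shuffling preserves composition, a low-complexity sequence only earns a high Z-score when the order of its residues, not merely their abundance, produces the repeats.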

Figure 3. Statistical Z-Scores. Shown above is a curve representing the percent of proteins from the databases generated using either the BLOSUM45 (Blue) or the BLOSUM62 (Red) similarity matrix and the corresponding Z-Score calculated (for the identified seven-residue parent sequence) from 100 randomizations of the same primary sequence. The number of sequences in the current database is shown for each on the graph.

Figure 4. Proteins with Low Complexity. Shown above are proteins with a high percentage of only three residues. The proteins presented are the RNA-binding region RNP-1 (RNA recognition motif) from Synechococcus sp. (NP_896112.1) (A), a hypothetical protein from Mycobacterium leprae (NP_301480.1) (B), a morphogenetic protein from Bacillus subtilis (NP_391488.1) (C), and histone H1 from Xanthomonas axonopodis (NP_643367.1) (D). Additional sequence regions not shown are depicted as (..) for clarity. The seven-residue sequence(s) used as the parent sequence against which other segments were scored are shown in red, as are all absolutely conserved residues. All were scored using the BLOSUM45 matrix.

The Seven Residue Repeat Unit. An initial goal of this work was to specifically investigate the utility of the seven-residue unit, and the prominence of this specifically sized repeat unit in various proteins. The simplest modular protein of seven residues would be a tandem repeat of seven residues, for which the algorithm may be, and in this work was, purposely biased. Though there were many proteins containing sets of tandem repeats, several found exhibited a sequence predominantly of seven-residue repeats (Figure 5A-C), illustrating the utility of the simple seven-residue repeat. Larger repeats of 8 (Figure 5D), 9, 10, and 11 residues were also found (not shown, but easily located within the database generated here). One disappointing feature of these extensive tandem repeats is that all are classified as hypothetical proteins, indicating that no known functions for these have been established, and it is as yet uncertain whether such proteins are actually expressed by the cells. The seven-residue repeat has been found in coiled-coils17,20,21 and also in transmembrane regions from some viral proteins,22,23 and was recently shown to be utilized in a metal binding protein.5 It is likely that further functions for this repeat size will eventually be determined as well.

Predominance of Repeats in Certain Species. The broadened search of genomes was performed to build a database of proteins with modular properties, and examine the extent of small modular proteins in various organisms. This analysis yielded many interesting features of these modular proteins in regard to their predominance throughout the eubacterial and archaeal worlds. In many organisms, such as Escherichia coli, proteins meeting these scoring requirements represented less than 1% of the total proteins smaller than 250 residues (16 of 1852 for Escherichia coli K-12), whereas in others, such as several species of Mycoplasma, these proteins account for 5-8% of the proteins smaller than 250 residues (16 of 287 in Mycoplasma pneumoniae, and 34 of 405 in Mycoplasma penetrans), and yet, in other species of Mycoplasma, these are far less prevalent.
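The percentages quoted above follow directly from the counts given in parentheses. The short calculation below simply reproduces that arithmetic; the dictionary layout and function name are illustrative choices, and only the counts come from the text.

```python
# (proteins meeting the scoring requirements, total proteins < 250 residues),
# as quoted in the text.
counts = {
    "Escherichia coli K-12": (16, 1852),
    "Mycoplasma pneumoniae": (16, 287),
    "Mycoplasma penetrans": (34, 405),
}


def percent(hits: int, total: int) -> float:
    """Share of small proteins that meet the repeat scoring requirements."""
    return 100.0 * hits / total


percentages = {name: percent(*pair) for name, pair in counts.items()}

for name, value in percentages.items():
    print(f"{name}: {value:.2f}%")
# Escherichia coli K-12: 0.86%
# Mycoplasma pneumoniae: 5.57%
# Mycoplasma penetrans: 8.40%
```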

For species such as Mycoplasma, which have among the smallest genome sizes and physical dimensions of any known cells, and are of great interest from the perspective of the minimum genome concept,31 the role of internal-repeat proteins is of considerable interest. Several lipoproteins with internal-repeat sequences from Mycoplasma have been shown to play a role in ciliary binding,32 and have been linked to the virulence of these strains. Virulence in some Mycoplasmas can be attenuated by successive passage in laboratory media,33-35 and extensive passes can lead to significant reductions in cilia adherence.36 Other findings for a region of tandem repeats from a 94 kDa antigen from Mycoplasma hyopneumoniae, which is purportedly not necessary for ciliary binding but will raise antibodies, have led to speculation that such repeats might serve as immune decoys.32 More recent work has linked variation in lengths of repeat regions of Mycoplasma proteins to virulence and resistance to complement killing,37,38 though it has been stated that the functional consequences of the variation in numbers of repeats remain relatively unclear,39 and the role that internal-repeat regions play in pathogenicity remains a significant avenue for future study.

Part of the intrigue of small genomes is linked to the evolution of Mycoplasma and similar organisms, which are believed to have evolved from more complex bacteria through the loss of genes of unnecessary functions, resulting in a minimal set of essential genes.31 Genome sequencing projects have become an invaluable source of information for understanding the required components to sustain life, and a large number of genome sequencing projects have been completed, including 10 for Mycoplasmas, or related organisms. Yet, it is obvious that the raw genomic information alone is not sufficient to fully characterize a complete genome. This is perhaps best illustrated by examining the recently sequenced Mycoplasma hyopneumoniae, where only 44% of the predicted proteins could be assigned functions, and 18% of the hypothetical proteins found are unique to this specific species.40 Thus, unique hypothetical and conserved hypothetical proteins, including those containing internal repeats, are likely to play an important role in a genome specialized to function on a minimal set of genes.

Several examples of small proteins containing internal repeats from Mycoplasma species are shown in Figure 6. All of these proteins contain extensive internal repeats, similar to the tandem repeats reported for various lipoproteins discussed above. All of the proteins presented in Figure 6 fit the requirements of a modular design using either 7 or 14 residue modules in various degrees to arrive at the final primary sequence. The specific cellular location or functions of these hypothetical proteins are difficult to ascertain with any confidence using standard signaling motifs, as many species of Mycoplasma contain abbreviated membrane protein secretory systems,40 and may contain additional, as yet uncharacterized systems of protein trafficking. Further characterization of the internal repeat proteins described here will likely require biochemical techniques. However, in view of the needs of an organism which must evade or fight off an immunologic attack from the host in which it resides, the presence of such repeats is an intriguing feature.

Figure 5. Extensive Tandem Proteins. Shown above are small proteins dominated by a simple tandem repeated sequence. The proteins presented are a hypothetical protein from Xylella fastidiosa (NP_297916.1) (A), a conserved hypothetical protein from Thermoanaerobacter tengcongensis (AAM23913.1) (B), a hypothetical protein from Yersinia pestis (NP_670964.1) (C), and a hypothetical protein from Xylella fastidiosa (NP_297344.1) (D). Additional sequence regions not shown are depicted as (..) for clarity. The seven-residue sequence(s) used as the parent sequence against which other segments were scored are shown in red, as are all absolutely conserved residues. All were scored using the BLOSUM45 matrix.

Dilemma of Proteins with Unknown Function. A significant number of the proteins uncovered during this work fit within the category of protein singletons (proteins with no other homologous protein in the current databases), and are often labeled “unique hypothetical” or simply “hypothetical” versus “conserved hypothetical” proteins, where homologues have been found in other species. While these findings are always susceptible to change, especially in light of the many genome sequence projects underway, it is likely that many of the proteins evolved within specific species, indicating that the mechanism of modular protein design based on repeats is utilized throughout the microbial world, and can arrive at unique and functional proteins within individual species.

Many of the proteins (slightly more than half of those included in the database using the BLOSUM45 matrix for similarity determinations) uncovered during this analysis are hypothetical or conserved hypothetical proteins, illustrating how little is known about the functions of proteins containing modular repeats. Like many of the proteins highlighted in this work, the model protein used to develop the algorithm had been characterized as a hypothetical protein prior to its isolation and characterization.5 It is likely that many of these hypothetical proteins described here are also actual functional proteins with a unique role in the life of the organism from which they originate. Further analysis of the proteins found here based on this modular design, and focusing on the highly conserved residues of various repeats, could assist in assigning probable functions for individual proteins. This is well illustrated with the hypothetical protein YhjQ (NP_388941.1) from Bacillus subtilis, which contains a repeat with conserved cysteine or methionine residues at positions one and five (not shown). Many examples of proteins containing this repeating set of sulfur-containing residues were found, and align together with YhjQ, including a proposed ferredoxin from Clostridium acetobutylicum (NP_346720.1) (Figure 7H). Though the level of conservation in many of these proteins was fairly low for residues besides these cysteines, the multiple alignment showed many similarities, indicating divergence from a common ancestor, while also providing a possible function for proteins such as YhjQ. The investigation of proteins from the RCSB protein data bank24 found an example of cysteines spaced in a similar manner in a modular protein, the trypsin inhibitor from barley (RCSB protein data bank file 1C2A; ref 41), though these cysteines are not involved in the binding of any small molecules, but instead are part of an extensive set of disulfide bridges.
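A repeat with cysteine or methionine at positions one and five of a seven-residue module, as described for YhjQ, can be expressed as a simple pattern. The sketch below scans a sequence for tandem runs of such modules using a regular expression. It is an illustration only: the minimum number of copies, the function name, and the toy sequence are assumptions, and the actual search in this work was based on similarity scoring rather than pattern matching.

```python
import re

# Seven-residue module: C or M at positions one and five (1-based),
# any residue at the other five positions.
MODULE = r"[CM].{3}[CM].{2}"


def sulfur_repeat_runs(seq: str, min_copies: int = 3):
    """Return (start index, number of copies) for each tandem run of at
    least min_copies consecutive modules."""
    pattern = re.compile(rf"(?:{MODULE}){{{min_copies},}}")
    return [(m.start(), len(m.group()) // 7) for m in pattern.finditer(seq)]


# Hypothetical toy sequence: three leading residues, four modules, three trailing.
toy = "MKT" + "CAAGCPE" + "CSDGMPK" + "MTEGCAQ" + "CPNGCGE" + "LLK"
print(sulfur_repeat_runs(toy))  # [(3, 4)]
```

A pattern of this kind only tests the two conserved positions, which matches the observation that conservation at the remaining positions was fairly low.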

Known Proteins with Modular Features. The broad variety of proteins with primary sequences found to contain this modular design of small component pieces (available within the database) consists predominantly of hypothetical or conserved hypothetical proteins. Though functions are not assigned to the majority of these proteins, other proteins with known roles in the cell were also uncovered. In addition to proteins from the microbial species selected in this work, the search was also performed on nonredundant sequences from the RCSB protein data bank.24 It should be clarified that this search includes eukaryotes as well as prokaryotes, and that the search parameters for this database were broadened to include proteins of a slightly larger size (300 residues). The analysis of proteins with known functions provides information on the variety of roles for proteins containing modular repeats. In addition, for proteins where a structure is available, the structure can provide an indication of whether the repeats serve a structural or functional role in the individual proteins. Table 1 provides examples of proteins with known function that were identified as containing small repeats using this algorithm, including some proteins for which a structure has been solved. A review of these proteins, highlighting the repeats, illustrates that these repeats can serve a role in binding, as found for the choline binding proteins,42 or serve a structural role, such as in transmembrane regions, as is found in reaction centers.43-45 There were no specific distinctions between secondary structural elements of the many structures surveyed as part of this analysis, with repeats found in both α-helices and β-sheets. A more thorough discussion of all the proteins found is beyond the scope of this paper.

Figure 6. Proteins from Mycoplasma with modular features. Shown above are small proteins dominated by a simple tandem repeated sequence. The proteins presented are a hypothetical protein from Mycoplasma pneumoniae (NP_109825.1) (A), a hypothetical protein from Mycoplasma pneumoniae (NP_109788.1) (B), a conserved hypothetical protein from Mycoplasma penetrans (NP_758139.1) (C), a hypothetical protein from Mycoplasma pneumoniae (NP_109826.1) (D), a hypothetical protein from Mycoplasma pneumoniae (NP_110212.1) (E), and a conserved hypothetical protein from Mycoplasma pneumoniae (NP_110153.1) (F). Additional sequence regions not shown are depicted as (..) for clarity. The seven-residue sequence(s) used as the parent sequence against which other segments were scored are shown in red, as are all absolutely conserved residues. All were scored using the BLOSUM45 matrix.

Versatility and Implications of Modular Proteins and Small Repeat Units. There are a number of classifications of repeat containing proteins,2,8,9,17,46 and new families of repeats are being discovered routinely.47 In many cases, these internal repeats are large, and often referred to as domain repeats, though smaller tandem repeats of four or five residues have also been described.47 Repeat proteins have received greater attention in recent years,3,4,8,10 particularly in the role of small molecule binding.9 The search criteria described and utilized here are direct, based on several key parameters that define this classification in regard to the modular nature of simple repeating sequence proteins, and allow one to search through an entire genome for proteins satisfying these requirements. The algorithm uses the BLOSUM45 or BLOSUM62 scoring matrixes,26 which were selected based on the ability to identify the repeats of the model protein. While this yielded the repeats in the desired manner, future efforts may be aimed at applying or developing other matrixes, which would be more specific for small modular repeat searches, or for more specific sequences (e.g., repeats containing tyrosine). The program is easily amenable to such an analysis, allowing the user to bias the matrix for specific residues.
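The idea of biasing the scoring matrix for specific residues can be sketched as follows. This is a hedged illustration, not the program's actual interface: a toy identity matrix stands in for BLOSUM45 or BLOSUM62, and the function names, score values, bonus, and segments are all hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_matrix(match: int = 5, mismatch: int = -1) -> dict:
    """Toy identity matrix used here in place of a BLOSUM matrix (assumption)."""
    return {(a, b): (match if a == b else mismatch)
            for a in AMINO_ACIDS for b in AMINO_ACIDS}


def bias_matrix(matrix: dict, residues: str, bonus: int) -> dict:
    """Return a copy of the matrix in which identical pairs of the chosen
    residues score higher, so repeats conserving them are favored."""
    biased = dict(matrix)
    for r in residues:
        biased[(r, r)] += bonus
    return biased


def segment_score(parent: str, segment: str, matrix: dict) -> int:
    """Score one segment against the parent, position by position."""
    return sum(matrix[(a, b)] for a, b in zip(parent, segment))


base = toy_matrix()
tyr_biased = bias_matrix(base, "Y", bonus=4)

# Hypothetical seven-residue parent and segment sharing two tyrosines.
print(segment_score("AYGDYKE", "SYGNYKE", base))        # 23
print(segment_score("AYGDYKE", "SYGNYKE", tyr_biased))  # 31
```

Returning a modified copy rather than editing the matrix in place keeps the unbiased scores available for comparison.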

One resulting bias of the algorithm developed in this work is that in order for the score requirement to be met, the search must typically find several regions of high similarity (usually at least 4-5), versus algorithms which might overlook this level of similarity for smaller repeating modules in favor of larger domain repeats. A survey of proteins identified using this algorithm, when analyzed using several established algorithms for analyzing repeat containing proteins,3,10,12,13 revealed some of the differences and benefits of applying multiple algorithms for the analysis of these repeat containing proteins. The primary benefit of this algorithm was in identifying the smallest basic repeat unit, whereas other algorithms often combined these smaller units into several larger single repeat units (14 versus the 7 used here).
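The distinction between a reported 14-residue unit and its underlying 7-residue module can be illustrated with a small check that asks whether a larger unit splits into similar sub-units. This is a sketch under stated assumptions: the identity threshold of 0.7, the function names, and the toy units are hypothetical and do not reproduce the scoring used by this algorithm or by the other programs cited.

```python
def identities(a: str, b: str) -> int:
    """Number of positions at which two equal-length segments are identical."""
    return sum(x == y for x, y in zip(a, b))


def smallest_unit(unit: str, min_size: int = 7, min_identity: float = 0.7) -> int:
    """Return the smallest sub-unit length (>= min_size) that evenly divides
    the unit and whose copies all resemble the first copy; otherwise return
    the full length of the unit."""
    n = len(unit)
    for size in range(min_size, n):
        if n % size:
            continue
        parts = [unit[i:i + size] for i in range(0, n, size)]
        if all(identities(parts[0], p) / size >= min_identity for p in parts[1:]):
            return size
    return n


# Hypothetical 14-residue units.
print(smallest_unit("AEHHKDEAEHHRDE"))  # 7  (halves share 6 of 7 positions)
print(smallest_unit("MKTAYIAKQRQISF"))  # 14 (halves share no positions)
```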

For proteins to reach function in a rapid manner, even over evolutionary time scales, is a daunting task. Evidence in support of a modular mechanism in protein evolution versus simple sets of highly conserved tandem repeats (Figure 5) is provided by the existence of proteins with interchanged sets of modules of different sizes. In Figure 7, a number of proteins with various degrees and combinations of modular segments are presented. On the basis of the vast variety of proteins that could be assembled from a sequence of 100 residues with a possibility of 20 (or more) different residues at each position, it seems logical that nature would have chosen a pathway where sequences build and expand based on templates or segments which function well for one reason or another. The small, highly conserved fragments might then be replicated throughout the protein sequence as it evolves, reaching functionality far faster than could be achieved by simple trial and error approaches, or single residue modifications.
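The size of the sequence space mentioned above is easy to make concrete. The first number below follows directly from the text (100 positions, 20 possible residues each). The second is an illustrative assumption added here for contrast: the number of distinct seven-residue modules, which is all that must be explored if a protein is assembled by copying a single module.

```python
RESIDUES = 20  # standard amino acids

# Every one of 100 positions chosen independently.
full_space = RESIDUES ** 100

# One seven-residue module chosen, then duplicated (illustrative assumption).
module_space = RESIDUES ** 7

print(len(str(full_space)))  # 131 decimal digits, i.e. about 1.27e130 sequences
print(module_space)          # 1280000000
```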

The concept of modular protein design discussed here is certainly not novel.7,10,15-17 It has been stated previously that the existence of repeating segments of residues should not be surprising.4 Though this duplication of sequences within a protein is proposed as a likely route to achieving rapid protein evolution, the extent to which this occurs in nature is difficult to assess. A goal of this work is to devise techniques to search for, and then determine the significance of, such modular features in proteins. Once identified, such modular design features could serve as templates for the design of new de novo proteins, or possibly reveal more fundamental features of the early events in evolution. Many of the proteins shown in Figure 7 reveal patterns by which modules might be combined. As the databases containing such examples expand, further lessons in how modular proteins are assembled might be revealed. The algorithm and databases described here highlight specific proteins utilizing this design structure, and could have further utility in the search for functions in unknown or hypothetical proteins, as well as in the development and design of techniques for directed evolution studies.

Table 1. Examples of Proteins Identified with Known Functions

| role | BLOSUM45 similarities26 | RCSB Protein Data Bank24 |
|---|---|---|
| Proteins Binding Small Molecules | Secreted Calcium Binding Protein; Internal Calcium Binding Protein; Divalent Heavy-Metal Cation Transporter; Chromate Transport Protein; Zinc Uptake Protein; Glycerol Uptake Facilitator Protein; Choline Binding Protein; Ferric Siderophore Transport System Periplasmic Binding Protein; Various DNA Binding Proteins | Choline Binding Protein42 (RCSB pdb file 1HCX) |
| Structural Proteins | Aquaporin; Permeases; Photosynthetic Reaction Centers; Lipoproteins; Inner and Outer Spore Coat Proteins | Photosynthetic Reaction Centers43-45 (RCSB pdb file 1YST, 1DXR and 1EYS) |
| Nucleotide Associated Proteins | Histones; Ribosomal Proteins; Subunits of ATP Synthase; RNA Polymerase Subunits and Factors; ATPase; DNA Topoisomerase | |
| Proteins Associated with Environmental Responses | Antifreeze Proteins; Morphogenetic Protein; General Stress Proteins | Antifreeze Protein48 (RCSB pdb file 1L1I) |
| Catabolism and Pathogenicity | Proteases; Hemolysins; Xylanase; Urease Accessory Protein | |
| Redox Proteins | Cytochromes; Formate-Dependent Nitrite Reductase; NADH-Ubiquinone Oxidoreductase; Thiol:Disulfide Interchange Proteins | Nine-Heme Cytochrome49 (RCSB pdb file 19HC) |


It is clear that the number of repeats or frequency of repeats that are found within a set of proteins or a full genome is highly biased by the technique used to define the repeat. A variety of algorithms are already available for fast analysis of protein sequences to find regions of repeating sequences.3,10,12,13 In this work, the extent of small repeating segments based on a minimal size of seven residues has been analyzed in a variety of eubacterial and archaeal species. Seven residue repeats have been highlighted as a feature of coiled-coils,17 and also for their role in the transmembrane regions from a number of viral proteins.22,23

The seven residue similarity comparison used here demonstrates some utility for larger repeats as well, though obvious improvements could be made. Future work will be directed at differentiating among other segment sizes, from 5 to 10 residues, to improve the scoring for these varied sizes of repeats, and increase the confidence in the reported results, while also broadening the database of proteins further. The maximum size of proteins analyzed might also be expanded, and more rigorous testing of significance implemented. Finally, new matrices designed around improving the confidence in the significance of the repeat might also be developed. The database of proteins obtained from this work provides a valuable first step in such efforts.

Acknowledgment. I wish to thank Robert Y. Igarashi and Wilson A. Francisco for reviews of this manuscript, Wilson A. Francisco for support in the preceding project that directed me to this work, Michael C. Minnotte for helpful discussions regarding statistical tests of the data generated by this algorithm, and Nita Deshpande for assistance obtaining sequences from the RCSB protein data bank.

Figure 7. Modular Proteins from Various Organisms. Shown above are small proteins with a modular architecture. The proteins presented are a hypothetical protein from Clostridium tetani (NP_782132.1) (A), a hypothetical protein from Archaeoglobus fulgidus (NP_068842.1) (B), a hypothetical protein from Deinococcus radiodurans (NP_293831.1) (C), a hypothetical protein from Photorhabdus luminescens (NP_929061.1) (D), a hypothetical protein from Synechocystis (NP_440405.1) (E), a hypothetical protein from Yersinia pestis (NP_670964.1) (F), a hypothetical protein from Chromobacterium violaceum (NP_900902.1) (G), ferredoxin from Clostridium acetobutylicum (NP_346720.1) (H), a hypothetical protein from Xylella fastidiosa (NP_297971.1) (I), and a hypothetical protein from Bacillus licheniformis (YP_091925.1) (J). Additional sequence regions not shown are depicted as (..) for clarity. The seven-residue sequence(s) used as the parent sequence against which other segments were scored are shown in red, as are all absolutely conserved residues. All were scored using the BLOSUM45 matrix.


Supporting Information Available: Specific genome sequences analyzed in this work. Databases of sequences meeting the internal repeat requirements based on the BLOSUM45 or BLOSUM62 scoring matrices, as obtained from this analysis. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Li, Y.-C.; Korol, A. B.; Fahima, T.; Beiles, A.; Nevo, E. Mol. Ecol. 2002, 11, 2453-2465.
(2) Soding, J.; Lupas, A. N. Bioessays 2003, 25, 837-846.
(3) Pellegrini, M.; Marcotte, E. M.; Yeates, T. O. Proteins: Struct., Funct., Genet. 1999, 35, 440-446.
(4) Wootton, J. C. Curr. Opin. Struct. Biol. 1994, 4, 413-421.
(5) Barney, B. M.; LoBrutto, R.; Francisco, W. A. Biochemistry 2004, 43, 11206-11213.
(6) Heringa, J. Curr. Opin. Struct. Biol. 1998, 8, 338-345.
(7) McLachlan, A. D. J. Mol. Biol. 1972, 64, 417-437.
(8) Marcotte, E. M.; Pellegrini, M.; Yeates, T. O.; Eisenberg, D. J. Mol. Biol. 1999, 293, 151-160.
(9) Andrade, M. A.; Perez-Iratxeta, C.; Ponting, C. P. J. Struct. Biol. 2001, 134, 117-131.
(10) Szklarczyk, R.; Heringa, J. Bioinformatics 2004, 20 Suppl 1, I311-I317.
(11) Marcotte, E. M.; Pellegrini, M.; Thompson, M. J.; Yeates, T. O.; Eisenberg, D. Nature 1999, 402, 83-86.
(12) Kreil, D. P.; Ouzounis, C. A. Bioinformatics 2003, 19, 1672-1681.
(13) Heger, A.; Holm, L. Proteins: Struct., Funct., Genet. 2000, 41, 224-237.
(14) Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J. H.; Zhang, Z.; Miller, W.; Lipman, D. J. Nucleic Acids Res. 1997, 25, 3389-3402.
(15) Tripp, K. W.; Barrick, D. J. Mol. Biol. 2004, 344, 169-178.
(16) Stumpp, M. T.; Forrer, P.; Binz, H. K.; Pluckthun, A. J. Mol. Biol. 2003, 332, 471-487.
(17) Tompa, P. Bioessays 2003, 25, 847-855.
(18) Marcotte, E. M.; Eisenberg, D. Biochemistry 1999, 38, 667-676.
(19) Garnett, A. P.; Viles, J. H. J. Biol. Chem. 2003, 278, 6795-6802.
(20) Groves, M. R.; Barford, D. Curr. Opin. Struct. Biol. 1999, 9, 383-389.
(21) Lupas, A. N.; Ponting, C. P.; Russell, R. B. J. Struct. Biol. 2001, 134, 191-203.
(22) Kingsley, D. H.; Behbahani, A.; Rashtian, A.; Blissard, G. W.; Zimmerberg, J. Mol. Biol. Cell 1999, 10, 4191-4200.
(23) Netter, R. C.; Amberg, S. M.; Balliet, J. W.; Biscone, M. J.; Vermeulen, A.; Earp, L. J.; White, J. M.; Bates, P. J. Virol. 2004, 78, 13430-13439.
(24) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. Nucleic Acids Res. 2000, 28, 235-242.
(25) Corpet, F. Nucleic Acids Res. 1988, 16, 10881-10890.
(26) Henikoff, S.; Henikoff, J. G. Proc. Natl. Acad. Sci. U.S.A. 1992, 89, 10915-10919.
(27) Luthy, R.; Xenarios, I.; Bucher, P. Protein Sci. 1994, 3, 139-146.
(28) Altschul, S. F.; Erickson, B. W. Mol. Biol. Evol. 1985, 2, 526-538.
(29) Lipman, D. J.; Wilbur, W. J.; Smith, T. F.; Waterman, M. S. Nucleic Acids Res. 1984, 12, 215-226.
(30) Fitch, W. M. J. Mol. Biol. 1983, 163, 171-176.
(31) Hutchison, C. A., III; Montague, M. G. Mycoplasmas and the Minimal Genome Concept. In Molecular Biology and Pathogenicity of Mycoplasmas; Razin, S., Herrmann, R., Eds.; Kluwer Academic/Plenum Publishers: New York, 2002.
(32) Wilton, J. L.; Scarman, A. L.; Walker, M. J.; Djordjevic, S. P. Microbiology 1998, 144 (Pt 7), 1931-1943.
(33) Collier, A. M.; Hu, P. C.; Clyde, W. A., Jr. Diagn. Microbiol. Infect. Dis. 1985, 3, 321-328.
(34) Zielinski, G. C.; Ross, R. F. Am. J. Vet. Res. 1993, 54, 1262-1269.
(35) Zielinski, G. C.; Ross, R. F. Am. J. Vet. Res. 1990, 51, 344-348.
(36) Zhang, Q.; Young, T. F.; Ross, R. F. Infect. Immun. 1994, 62, 1616-1622.
(37) Tu, A. H.; Clapper, B.; Schoeb, T. R.; Elgavish, A.; Zhang, J.; Liu, L.; Yu, H.; Dybvig, K. Infect. Immun. 2005, 73, 245-249.
(38) Simmons, W. L.; Denison, A. M.; Dybvig, K. Infect. Immun. 2004, 72, 6846-6851.
(39) Yogev, D.; Browning, G. F.; Wise, K. S. Genetic Mechanisms of Surface Variation. In Molecular Biology and Pathogenicity of Mycoplasmas; Razin, S., Herrmann, R., Eds.; Kluwer Academic/Plenum Publishers: New York, 2002.
(40) Minion, F. C.; Lefkowitz, E. J.; Madsen, M. L.; Cleary, B. J.; Swartzell, S. M.; Mahairas, G. G. J. Bacteriol. 2004, 186, 7123-7133.
(41) Song, H. K.; Kim, Y. S.; Yang, J. K.; Moon, J.; Lee, J. Y.; Suh, S. W. J. Mol. Biol. 1999, 293, 1133-1144.
(42) Fernandez-Tornero, C.; Lopez, R.; Garcia, E.; Gimenez-Gallego, G.; Romero, A. Nat. Struct. Biol. 2001, 8, 1020-1024.
(43) Arnoux, B.; Gaucher, J. F.; Ducruix, A.; Reiss-Husson, F. Acta Crystallogr. Sect. D Biol. Crystallogr. 1995, 51, 368-379.
(44) Lancaster, C. R.; Bibikova, M. V.; Sabatino, P.; Oesterhelt, D.; Michel, H. J. Biol. Chem. 2000, 275, 39364-39368.
(45) Nogi, T.; Fathir, I.; Kobayashi, M.; Nozawa, T.; Miki, K. Proc. Natl. Acad. Sci. U.S.A. 2000, 97, 13561-13566.
(46) Katti, M. V.; Sami-Subbu, R.; Ranjekar, P. K.; Gupta, V. S. Protein Sci. 2000, 9, 1203-1209.
(47) Adindla, S.; Inampudi, K. K.; Guruprasad, K.; Guruprasad, L. Comp. Funct. Genomics 2004, 5, 2-16.
(48) Daley, M. E.; Spyracopoulos, L.; Jia, Z.; Davies, P. L.; Sykes, B. D. Biochemistry 2002, 41, 5515-5525.
(49) Matias, P. M.; Coelho, R.; Pereira, I. A.; Coelho, A. V.; Thompson, A. W.; Sieker, L. C.; Gall, J. L.; Carrondo, M. A. Struct. Folding Design 1999, 7, 119-130.

PR050103M


482 Journal of Proteome Research • Vol. 5, No. 3, 2006