bioinformatics vol.21suppl.12005,pagesi328–i337talp/publications/patchfinder.pdf · “bti1023”...

ldquobti1023rdquo mdash 2005610 mdash page 328 mdash 1

BIOINFORMATICS Vol 21 Suppl 1 2005 pages i328ndashi337doi101093bioinformaticsbti1023

In silico identification of functional regions inproteins

Guy Nimrod1 Fabian Glaser2 David Steinberg3 Nir Ben-Tal1lowastand Tal Pupko4

1Department of Biochemistry George S Wise Faculty of Life Sciences Tel AvivUniversity Ramat Aviv 69978 Israel 2European Bioinformatics Institute Wellcome TrustGenome Campus Cambridge CB10 1SD UK 3Department of Statistics andOperations Research Raymond and Beverly Sackler Faculty of Exact Sciences and4Department of Cell Research and Immunology George S Wise Faculty of LifeSciences Tel Aviv University Ramat Aviv 69978 Israel

Received on January 15 2005 accepted on March 27 2005

ABSTRACTMotivation In silico prediction of functional regions on proteinsurfaces ie sites of interaction with DNA ligands sub-strates and other proteins is of utmost importance in variousapplications in the emerging fields of proteomics and struc-tural genomics When a sufficient number of homologs isfound powerful prediction schemes can be based on theobservation that evolutionarily conserved regions are oftenfunctionally important typically only the principal functionallyimportant region of the protein is detected while second-ary functional regions with weaker conservation signals areoverlooked Moreover it is challenging to unambiguouslyidentify the boundaries of the functional regionsMethods We present a new methodology called PatchFinderthat automatically identifies patches of conserved residuesthat are located in close proximity to each other on the pro-tein surface PatchFinder is based on the following steps(1) Assignment of conservation scores to each amino acidposition on the protein surface (2) Assignment of a scoreto each putative patch based on its likelihood to be func-tionally important The patch of maximum likelihood is con-sidered to be the main functionally important region and thesearch is continued for non-overlapping patches of secondaryimportanceResults We examined the accuracy of the method using theIGPS enzyme the SH2 domain and a benchmark set of 112proteins These examples demonstrated that PatchFinder iscapable of identifying both the main and secondary functionalpatchesAvailability The PatchFinder program is available at httpashtorettauacilsimnimrodgContact NirBtauextauacil

lowastTo whom correspondence should be addressed

1 INTRODUCTIONDetection of the amino acid positions that are essential foractivities such as catalysis proteinndashprotein interactions orproteinndashligand interactions is a critical step in the study of thebiological function of proteins (Lichtarge and Sowa 2002)This task is especially important for proteins with a knownstructure and unknown function Detection of key amino acidpositions that are functionally important is also essential fordrug design studies for protein classification and annotationand for evolutionary studies (Rost 2002) This need led tothe development of a wide variety of computational methodsfor the identification of functional regions (Aloyet al 2001Amitai et al 2004 del Sol Mesaet al 2003 Friedberg andMargalit 2002 Jones and Thornton 1997 Inniset al 2004Ma et al 2003 Madabushiet al 2002 Neuvirthet al 2004Panchenkoet al 2004 Pazos and Sternberg 2004)

Functionally important positions are often conserved andmany methods exploit the evolutionary history of the pro-tein and its homologs Dean and Golding (2000) developed amaximum likelihood (ML)-based methodology to discrimin-ate between slowly and rapidly evolving regions of a proteinUnlike other methods their algorithm does not assign a con-servation score for each amino acid position but rather assignsan estimated rate of evolution to a group of amino acids thatare included in a sphere of variable diameter The strengthof their approach is the use of an explicit stochastic processto model protein evolution However by considering onlyspherical shapes patches of shapes that are far from sphericalare likely to be overlooked Furthermore the method doesnot distinguish between exposed and buried amino acid pos-itions Hence the predicted functional sphere may includesome amino acid positions that are buried in the protein andare most probably structurally rather than functionally import-ant Dean and Golding further used a statistical test to checkthe rate significance for each sphere comparing it with therate distribution of a random population

i328 copy The Author 2005 Published by Oxford University Press All rights reserved For Permissions please email journalspermissionsoupjournalsorg


PatchFinder

Aloy et al (2001) used a method based on EvolutionaryTrace (ET Lichtargeet al 1996) to assign conservationscores to each position The ET method uses a phylogen-etic tree of the query proteinrsquos family and detects residues thatshow evolutionary conservation in at least a subgroup (branch)of the tree ie trace residues Aloyet al consider as candid-ates for functional patches only sites with amino acids thatare polar and totally conserved (or with one lsquoresidue sequenceerrorrsquo) This definition suffers from low sensitivity an aminoacid position is either lsquoconservedrsquo or not In practice thereare many levels of conservation and the exclusion of posi-tions that are not totally conserved or not polar may obscurefunctional regions Moreover the method is very sensitive tothe choice of the homologous protein sequences that are usedthe addition of more homologs is expected to reduce the num-ber of putative positions thereby counter-intuitively reducingthe power of the method Aloyet alrsquos algorithm also assumesthat the shape of the patch is spherical which may not alwaysbe suitable

The algorithm developed by Madabushiet al (2002) forthe automated structure-based prediction of functional sitesis also based on ET A patch is defined as a collection ofspatially close trace residues Sets of patches are detected anda statistical test is used to assign their collective significancebased on the likelihood to obtain at random a cluster of traceresidues of the size of the largest patch (Yaoet al 2003)Alternatively the statistical significance is assigned based onthe total number of patches of the trace residues that werefound The outcome is a set of key amino acids some of whichare buried and presumably important for the maintenance ofthe proper protein fold Others are exposed and comprise thevarious functional sites

We have developed the Rate4Site algorithm (Pupkoet al2002) which is a rigorous statistical method for inferringthe level of conservation at each amino acid position takinginto account the phylogenetic relations between the sequencesand the stochastic process underlying their evolution WhenRate4Sitersquos conservation scores are mapped onto the proteinsurface one or more surface patches formed by spatiallyclose and conserved residues often appear These surfacepatches are potentially important functional regions We sub-sequently launched the ConSurf webserver that implementsthis algorithm (Glaseret al 2003)

The automatic nature of the ConSurf server makes it easyto use but the interpretation of the results is subjective thereis no stringent criterion for determining the amino acid pos-itions that compose a conserved region Furthermore whilethe largest and most conserved surface patch is usually eas-ily detected by the naked eye it might be difficult to identifysecondary functional regions that are smaller in size andorinclude residues that are less conserved and distinguish themfrom the background noise Thus it is essential to discrimin-ate between a random cluster of conserved amino acids andconserved patches that are statistically significant

In this work we present a novel methodology for theautomatic identification of functionally important regions inproteins with a known 3D structure This methodology whichis referred to as PatchFinder is based on three features FirstPatchFinder permits patches of any shape and size on theprotein surface rather than limiting the patch to any pre-setshape (eg sphere) around a particular amino acid Secondamino acid positions that are on the protein surface are distin-guished from those that are buried in the protein core (Jonesand Thornton 1997) This allows us to approximately distin-guish between amino acids that are conserved due to structuralconstraints (Schueler-Furman and Baker 2003) and thosethat are conserved because they are functionally importantThird PatchFinder assigns a probability to each inferred patchto be functionally important and searches for the patch withthe highest probability Typically this patch corresponds tothe main conservation signal and is relatively easy to detectThe following non-overlapping patches are associated withlower likelihood values and are much more difficult to detecteven though they are often biologically important

2 ALGORITHMPatchFinderrsquos input includes the 3D structure of a query pro-tein and a multiple-sequence alignment (MSA) of the proteinand its sequence homologs

21 Assignment of conservation scoresThe first step in the PatchFinder algorithm is the assignmentof an evolutionary conservation score to each amino acid pos-ition using the Rate4Site algorithm (Pupkoet al 2002) asimplemented in the ConSurf webserver (Glaseret al 2003)In brief Rate4Site produces a ML estimate of the score at eachposition based on a MSA a phylogenetic tree and a model ofsequence evolution eg JTT (Joneset al 1992)

22 Identification of exposed and buried residuesIn order to approximately distinguish between positions thatare conserved due to structural constraints and those thatare conserved due to their functional importance the patchsearch procedure is limited to positions that are exposedto the solvent Discrimination between lsquoexposedrsquo and lsquobur-iedrsquo positions is based on the solvent accessible surface area(ASA) computed using the Surface Racer program (Tsodikovet al 2002) The computation is done on the 3D structureof the query protein using a probe sphere of radius 14 Aringcorresponding to a water molecule

PatchFinder further defines a residue as lsquoexposedrsquo if its rel-ative accessible surface area (RSA) is greater than a fixedvalue the default is 1 The RSA for each residue is definedas the fraction of its ASA relative to its maximal ASA Themaximal ASA of a residue is calculated in an extended GXGtripeptide where G is glycine and X is the residue in question(Miller et al 1987)

i329


GNimrod et al

23 Identification of patches of conservedresidues

Our algorithm aims to find the most significant primary andsecondary patches containing conserved and exposed aminoacids The challenge is to distinguish between a random clus-tering of a few conserved amino acids and a conserved patchthat is functionally important Based on previous studies (Aloyet al 2001 Dean and Golding 2000 Landgrafet al 2001Madabushiet al 2002 Valdar and Thornton 2001ab) ourintuition is that the functionality of a patch often correlateswith two main factors the number of amino acids that com-prise it and their average conservation The assumption hereis that a significant cluster of conserved residues on the pro-tein surface is usually indicative of an evolutionary pressurepresumably to maintain function In summary the largerand more conserved a patch is the more likely it is to bea functional region

231 Formal definition of terms A patch is defined as acluster of spatially close amino acids Clustering is based ona default cutoff distance of 4 Aring between any two heavy atoms(Valdar and Thornton 2001a) The patch size is defined interms of the number of amino acids that comprise it We definethe conservation of the patch as the average conservation scoreof its residues

232 The overall search procedure Our objective is tosearch the space of possible patches and to find the patchwith the lowest probability to occur by chance We use thefollowing search procedure we set a minimal average con-servation (MAC) cutoff for a patch search for the biggestpatches with an average conservation equal to or higher thanthe cutoff and then compute its probability Thus for eachpossible MAC cutoff we obtain a candidate patch and itsprobability We finally choose the value of MAC cutoff thatresults with the patch with the highest probability or in otherwords the ML patch

In practice the time complexity of a complete search pro-cedure is exponential (in sequence length) and therefore notapplicable for proteins We used a heuristic greedy search pro-cedure because it is much faster and results in a satisfactoryperformance The search procedure has two stages initiationand extension In the initiation stage each of the 10 residueswith the highest conservation score on the protein surface isselected as a starting point In the extension stage the mosthighly conserved of the neighboring residues is added to theexisting patch and the average conservation of the new patchis calculated If the average conservation is higher than thecutoff the new residue is accepted and the extension stage isrepeated The search procedure ends when the average con-servation of the patch drops below the cutoff Subsequentlya likelihood score is assigned to each patch (see below)Thus for each set of possible MAC cutoffs a list of patchesand their corresponding likelihoods are obtained The patch

with the highest likelihood is defined as the primary (main)patch

233 Assigning likelihood to each conserved patch Eachof the patches that were found in the previous stage is assigneda probability to be functional This is achieved by shuff-ling the conservation scores (Madabushiet al 2002 Valdarand Thornton 2001a Yaoet al 2003) between the exposedresidues and performing the clustering identification stageagain for each MAC cutoff This procedure is repeatedlycarried outN times (the default is 50 000) In this way adistribution of patch sizes is obtained for each MAC cutoffThe probability of obtaining a cluster of sizeX and conser-vation scoreY equals the probability of obtaining a patch ofsizeX or more with average conservationgeY This valueis approximated by measuring the frequency of patch of sizeX or bigger for the same MAC cutoff used for the estimatedpatch The probability assigned to the patch would be thecomplementary event

234 Use of conservation percentiles to set the MAC cutoffsIn the search procedure described above a patch is found foreach MAC cutoff and the patch with the highest likelihood isselected In PatchFinder the MAC cutoffs are chosen based onthe distribution of the conservation scores of all the exposedamino acids The interval between query MAC cutoff valuesis 1 The default search was limited to the 60ndash100 percent-ile interval with band size 1 (100 99 hellip 60) because ourresults showed that the likelihood generally drops drasticallyfor percentileslt60

In some cases two patches are assigned to very close prob-ability scores It is thus also recommended to inspect over-lapping patches that were ranked highly by the PatchFinderThese alternative patches can serve as a lsquoconfidence inter-valrsquo around the patch found Patches with identical computedlikelihood (generally 100) are ranked by theirZ-scores (iethe size of the patch minus the average size divided by thestandard deviation)

24 Identification of secondary patchesThe methodology described above yields a primary functionalpatch with a likelihood value that is oftensim100 This patchis most likely to be the main functional region of the proteinHowever in many cases we find that there are additionalnon-overlapping patches with high likelihood to be functionalThese patches may represent secondary functional regionsOnce the primary patch with the highest likelihood on the listis found we begin to search for significant patches limitingthe search to only those residues that are not found in theprimary patch Also at this stage the likelihood of the patch isdetermined using a random shuffling procedure in which foreach randomization we discard the most conserved patch withsize that is equal to that of the principal patch of the previousstage(s) This procedure is performed by the hill climbingheuristic

i330


PatchFinder

The calculated likelihood values are used only to rank theputative patches according to their statistical significance Inmost cases the actual probability of the most significant patchto be functional is reflected by its average conservation

25 Final outputThe final output includes up to three non-overlapping patchesThe likelihood and size of each patch and the identities of theresidues that compose it are reported

3 IMPLEMENTATIONThe algorithm was implemented using C++ and PERL andis available at httpashtorettauacilsimnimrodg

31 Evaluation of sensitivity and specificityWe evaluated the performance of PatchFinder in terms ofsensitivity (the fraction of functional residues identified) andspecificity (the fraction of functionally important residuesidentified out of the total number of residues in the patch)For example consider a case where the true patch was com-posed of 9 amino acids and PatchFinder inferred a patch ofsize 8 of which 7 amino acids were part of the true patch Insuch a case the sensitivity is 79 and the specificity 78

32 In-depth analysis IGPS and the SH2 domainThe results were evaluated using previously known data andSURFV (Sridharanet al 1992) analysis The gold standardactive-site residues were considered to be amino acids knownin the literature to be catalytically important as well as thosewith exposure levels that decreased bygt10 when comparingthe protein chain alone to the proteinndashligand complex Similarresults were obtained where the gold standard was based onthe LIGPLOT (Wallaceet al 1995) data

Src was divided into domains as follows SH2 Trp148 toPro246 SH3 Met82 to Glu147 Kinase Thr247 to Glu524pTyr527 and the three residues following it were consideredas the phosphopeptide (Xuet al 1997)

Conservation analysis for IGPS was carried out using anMSA of 153 homologs from the HSSP database (Sander andSchneider 1991) For the SH2 domain an MSA of residuesTrp148 to Pro246 was used (Pupkoet al 2002) It contained34 Src-like SH2 domains with sequence identity ofgt60The MSAs are available as Supplementary material

33 Benchmarking331 Dataset Benchmarking was based on a non-redundant set of 112 single-chain proteins with functionalresidues that are documented in the PDB (del Sol Mesaet al2003) The documentation is of varied quality Neverthelesssince the researchers who determined the structures manuallyannotated it we assume that the annotations are reliable andrepresent at least portion of the functional positions

332 Evaluation of the benchmarking results The activesite residues documentation is in many cases partial so the

SITE residues cannot be treated as a perfect gold standard Wetherefore used a statistical test similar to the one presentedby del Sol Mesaet al (2003) to assess the performance ofPatchFinder results The test examines the null hypothesisthat PatchFinder locates the patch at random regardless offunctionality and is based on the distance between the residuethat is closest to the patch center and the residue that is closestto the SITErsquos center Under the null hypothesis the patchcenter is a random choice from among the exposed residuesand theP -value of the test for a single protein is computedfrom the resulting distribution of distances from the SITErsquoscenter The full collection ofP -values is then compared to auniform distribution

4 RESULTS41 The IGPS Enzyme411 General description Indole-3-Glycerol PhosphateSynthase (IGPS) catalyses one of the reactions in the path-way of biosynthesis of tryptophan The structure of the IGPSfrom Sulfolobus solfataricus was determined with both CdRP(PDB ID 1lbl) and IGP (PDB ID 1a53) (Henniget al 2002)This allows a comprehensive examination of the active site

412 Patch analysis PatchFinder analysis of this structure(PDB ID 1lbl) yielded a few overlapping patches each withlikelihood of 100 Of these the one that is composed of19 residues and was assigned the bestZ-score is presented inTable 1 The experimental data indicate that six residues ofIGPS are directly involved in ligand binding (Darimontet al1998) and the patch includes them all A total of 23 residuesare in contact with either IGP or CdRP 18 of which are inthe patch (Table 1) The other five residues (Leu83 Asp111Ile213 Ile232 and Ser234) were detected in patches that areassigned lower MAC cutoffs and with high likelihood valuesOne residue Gly 91 was in the patch but does not contact anyof the ligands examined Nevertheless it shows a high levelof conservation

In this example the PatchFinder analysis yields a singlefunctional patch that corresponds approximately to the ligand-binding site (Fig 1) The sensitivity in this case is 18 out of23 amino acids (sim78) and the specificity is 18 out of 19residues in the patch (sim95)

42 The SH2 domain421 General description Src is a tyrosine kinase thatfunctions in receptor-mediated signal transduction It isan intracellular protein composed of three main domainskinase SH2 and SH3 Regulation of the kinase activityis achieved through the interactions among these domains(Brown and Cooper 1996) In the basal state Srcrsquos Tyr527which is located on the C-tail is phosphorylated the tail bindsto a phosphopeptide-binding pocket in the SH2 domain andthe enzyme is inactive The exchange of the C tail with anexternal phosphopeptide may lead to Src activation

i331


GNimrod et al

Table 1 PatchFinder results for IGPS

Residuea IGPb CdRPc

False Leu 83 +negatives Asp 111 +

Ile 213 + +Ile 232 + +Ser 234 + +

Glu 51 +Lys 53 + +Ser 56 +Pro 57 +Phe 89 + +Lys 110 + +Phe 112 + +Leu 131 +

True Ile 133 +Glu 159 + +Asn 180 + +Arg 182 + +Leu 184 + +Glu 210 + +Ser 211 + +Gly 212 + +Leu 231 + +Gly 233 + +

False positives Gly 91

The residues were partitioned into three groups True false negatives and false positivesaThe amino acid type and number Amino acids marked with lsquorsquo were confirmed exper-imentally as catalytically important residues (Darimontet al 1998)bThe lsquo+rsquo sign marks amino acids that were found to be in contact with the IGP substrate(PDB ID 1a53)cThe lsquo+rsquo sign marks amino acids that were found to be in contact with the CdRP substrate(PDB ID 1lbl)

Here we present an analysis of Srcrsquos SH2 domain This regu-latory domain containssim100 amino acids and is very commonin signaling proteins (Xuet al 1997) For the analysis weused the crystal structure of Xuet al (1997) (PDB ID 1fmk)where Src is in its inactive state

422 Patch analysis PatchFinder analysis of the domainyielded patches as described in Table 2 and Figure 2 Thefirst patch (patch 1 Table 2) includes 14 highly conservedresidues This is the peptide-binding groove Using SURFV11 residues were inferred to be a part of the binding siteEight of them are in the patch identified by PatchFinderone (Thr215) is identified in the next best patch and two(Thr179 and Cys185) are located on the boundary of thebinding groove and assigned conservation scores that aresignificantly lower than the average conservation of thepatch

We next focused on SH2rsquos interactions with the SH3 andkinase domains within the intact Src protein SH2rsquos interfaceswith these domains were identified based on SURFV analysis

of the difference in the water-accessible surface area of eachSH2 residue with and without the other domains The res-ults yielded the interface between the domains However theinterfaces can only be viewed as an approximated gold stand-ard since the functionally important residues are only partof the interface (DeLano 2002) Thus the SURFV analysiscan only be used to confirm true positive hits and point to apossible function

Of the secondary patches the one that was assigned the MLhas three invariable residues (Gly210 Gly211 and Phe220)These three residues constitute the core of a larger patch of 16residues which was ranked second in its likelihood (patch 2Table 2) The latter patch corresponds to the interface betweenthe SH2 and SH3 domains and to a part of the interfacebetween the SH2 and kinase domains

The third non-overlapping patch (patch 3 Table 2) is com-posed of three amino acids Glu159 Arg160 and Leu163 thatare mainly located at the interface between the kinase and SH2domains

Overall 33 amino acids were detected in the 3 patches Asdemonstrated in Figure 2 they correspond well to the peptide-binding groove and the interfaces of the SH2 domain withthe rest of intact Src The sensitivity in the peptide-bindinggroove in this case was 811 (sim72) and the specificity 814(sim57)

The SH2 domain examined here was also crystallized with ahigh-affinity phosphotyrosyl peptide (Waksmanet al 1993)In this complex three extra residues were added to those iden-tified as the C-tail contacts two of which (Gly236 and Leu237)were in the PatchFinder result The differences between thetwo experimental results demonstrate some of the problemsassociated with the definition of a gold standard in this andsimilar cases

43 Benchmarking431 General description Analysis of 112 protein struc-tures that contain 931 annotated SITE residues was conduc-ted in order to assess the overall performance of PatchFinderThe data are available on the accompanied website A totalof 3221 residues including 431 of the annotated SITEswere found in the 112 main patches that were detected byPatchFinder In 85 of the cases the first patch is composedof at least one of the SITE residues and in 63 of the casesat least half of these residues are in the patch

A simple approach to the detection of functional sitesinvolves mapping of all the surface residues that are evolu-tionarily conserved regardless of whether they are located inspatial proximity to each other Our survey included all theexposed residues for which the conservation scores were atleast as high as the minimal score in the ML patch In compar-ison with the PatchFinder results the number of SITE residuesthat are detected correctly this way is 476 ie asim10 increasein the detection rate However the total number of predictedfunctional residues is 4548 ie asim41 increase Thus the

i332


PatchFinder

A B

Fig 1 The active site found by PatchFinder in the IGPS enzyme (PDB ID 1lbl) (A) ConSurf conservation profile of the enzyme mapped ontothe space-filling representation of the protein The conservation color-coding plate is presented below and the ligand is colored yellow (B)The same view as in A with the residues identified by PatchFinder color-coded by conservation while the rest of the molecule is gray Onlyresidues with a RSAgt01 were considered in the search procedure The picture was produced using RASMOL (Sayle and Milner-White1995)

results suggest that PatchFinderrsquos search for patches of spa-tially close residues helps to reduce the rate of false positivepredictions

432 P -value assessment Because of the incompletenature of PDB annotation of the active sites we consider theP -value measure based on the lsquocentersrsquo of the patch and theactive site to be particularly suitable for this statistical sur-vey In 48 of the proteins (sim43) the main patch that wasdetected has aP -valuelt005 Sixty nine results (62) have aP -valuelt01 SimilarP -values were obtained for both smalland large proteins

433 A closer look at failures We examined about 20 ofthe cases that were assigned the highestP -values in searchfor the main reasons for failures Our analysis revealed fourmain categories (1) The MSA included only several pro-teins that were too closely related to each other In suchcases it is difficult to distinguish between amino acid pos-itions that are conserved due to the evolutionary pressureand those that appear to be conserved due to insufficientevolutionary time (2) Most of the SITE residues are lessconserved than the patches that were detected (or not con-served at all) For example the serine protease of PDB ID1 thm (Teplyakovet al 1990) includes 17 SITE residuesand PatchFinder detected an evolutionarily conserved patch of14 residues However the vast majority of the SITE residueswere assigned a conservation score that was lower than theaverage conservation of the patch residues Thus only threeof the SITE residues were included in the patch Interestinglyeight of the remaining patch residues are in contact with theproteinrsquos inhibitor (Groset al 1989) which suggests that

PatchFinderrsquos patch is functionally important nevertheless (3)In PDB ID 1mup (Bocskeiet al 1992) and other cases mostof the SITE residues are buried (4) PatchFinderrsquos patch ismuch bigger than the documented active site this may indic-ate a partial documentation of the SITE residues Pyruvatedehydrogenase (PDB ID 1iyu Berget al 1996) which hasonly one documented SITE residue is a very likely exampleof such a case

In the first two categories evolutionary conservation isobviously unsuitable for functional site inference and com-plementary approaches should be used

5 DISCUSSIONIn this work we presented a novel procedure for the iden-tification of functional regions of proteins with known 3Dstructures The procedure identifies the boundaries sizes andstatistical significance of evolutionarily conserved patches ofresidues on the protein surface In the following we will relatethe current methodology to similar approaches

Unlike the previously described methods PatchFinder sep-arates exposed and buried residues as an approximate meansto distinguish between functionally and structurally import-ant residues The importance of this feature was demonstratedin test calculations that were carried out without taking intoaccount the buriedexposed nature of the amino acids Therates of false-negative and -positive predictions in thesecalculations (data not shown) were significantly higher thanthe rates reported above Moreover analysis of the benchmarkdataset showed thatsim86 of the residues that are documentedas SITEs are exposed In comparison onlysim64 of theconserved residues are exposed Our survey (Supplementary

i333


GNimrod et al

Table 2 PatchFinder results for the SH2 domain

Interface Residuea PatchFinderb

Phosphopeptide Arg 155 1Arg 175 1Ser 177 1Thr 179Cys 185Asp 190 1Lys 200 1His 201 1Tyr 202 1Lys 203 1Arg 205 2Tyr 213 1Ile 214 1Thr 215 2Leu 237 1Cys 238 1

SH3 Trp 148 2Tyr 149 2Tyr 184Leu 223Gln 224Val 227 2Val 244

Kinase Phe 150 2Arg 155 1Arg 156Glu 157Glu 159 3Arg 160 3Leu 161Leu 163 3Asn 164Glu 178Thr 179Cys 245 2Pro 246

Others Leu 207 2Asp 208 2Gly 210 2Gly 211 2Arg 217 2Gln 219 2Phe 220 2Tyr 229 2Asp 235 2Gly 236 1Leu 241 1

Residues are segregated into four classes lsquoPhosphopeptidersquo those that comprise theinterface with the phosphopeptide lsquoSH3rsquo the SH3 domain lsquoKinasersquo the kinase domainsand lsquoOthersrsquo residues that are not in any of these interfaces It is noteworthy that a residuecan belong to more than one categoryaThe amino acid type and number are partitioned according to their contacts with therelevant domain Amino acids marked with lsquorsquo were not detected as contacts but are inclose proximity (lt4 Aring) to the ones that compose the patchbThe three patches found by PatchFinder with numbers corresponding to the iterationwhere the patch was located

material) showed a factor of 29-enrichment in SITE residuesin the conserved-and-exposed amino acids compared with theconserved-and-buried

Similar to Madabushiet al (2002) PatchFinder does notimpose any initial constraints on the shape of the functionalpatches Like Dean and Golding (2000) the patch cluster-ing procedure is flexible and allows inclusion of moderatelyconserved residues by the use of the average conservation ofthe patch PatchFinder is also capable of identifying second-ary functional regions with a weaker conservation signal thanthe main one

PatchFinder depends on the quality of the input MSAprovided When the alignment is unreliable or not diverseenough the inferred conservation scores are incorrect and thesearch for functional regions may become meaningless

The accuracy of the calculations also depends on the input3D structure of the protein Some proteins are known tofluctuate between two or more biologically relevant conform-ations for example active and inactive states of an enzymeThis variability might change the level of exposure of aminoacids and their mutual distances both quantities are exploitedby PatchFinder Using only a single structure might increasethe rate of false-negative predictions

In the two examples that were presented in detail (Figs 1and 2 Tables 1 and 2) we show the potency and advantagesof this new methodology As other algorithms (Madabushiet al 2002 Aloy et al 2001) PatchFinder successfullyidentified many of the reportedinferred functional residuesof the primary site on the protein surface while the predictedfunctional patches included a few potentially non-functionalresidues Studies performed on the benchmark set yielded sim-ilar results (see Supplementary material) Thus PatchFindercan be used for automatic high-throughput detection ofthe approximate location of the main functional regions ofproteins

The detection of secondary patches is more challengingbecause of their weaker conservation signal andor smallersize PatchFinder detected such patches in the SH2 domainand assigned them with reasonably high average conserva-tion scores (Fig 2 Table 2) However they were assignedwith low likelihood values thus complicating their automaticidentification

There are several features and improvements that couldbe introduced into PatchFinder (1) PatchFinder currentlysearches for patches of highly conserved positions while insome cases the functionally important region may even behyper-variable as the peptide-binding groove of the MHCclass I heavy chain (see the GALLERY of the ConSurf serverat httpconsurftauacil) (2) The inclusion of informationabout the sequential location of the conserved amino acidsmay improve PatchFinderrsquos sensitivity (Mihaleket al 2003)(3) Currently PatchFinder does not take into account specificattributes of different kinds of functional sites for examplethe residues that compose catalytic sites are generally partially

i334


PatchFinder

Fig 2 The functional regions found by PatchFinder in the SH2 domain of human Src (PDB ID 1fmk Xuet al 1997) The kinase domain iscolored yellow the SH3 domain is colored orange and the C-tail is in bright green The SH2 domain is shown in a space filled representationcolored as follows (A) and (C) colored by the conservation scale of Figure 1 (B) and (D) patch residues are colored by conservation and therest of the amino acids are gray Only residues with a RSAgt1 were considered in the search procedure The picture was produced usingRASMOL (Sayle and Milner-White 1995)

buried and highly conserved while proteinndashprotein inter-action regions are typically found in larger cavities withlower average conservation This suggests the adjustment ofparameters when searching for a patch of a specific type Infact we took a first step in this direction by using differ-ent criteria for discrimination between buried and exposedresidues in active sites (a cutoff of 01) and in inter-proteininterfaces (a cutoff of 1) Similarly the accuracy of detec-tion of enzyme active sites can be enhanced by the addition

of a requirement that at least one of the residues be polar(Aloy et al 2001) (4) We could also consider other proper-ties of the patch such as the curvature the identity and natureof the amino acids comprising it and their class specificity(Chakrabarti and Jannin 2002 Shanahanet al 2004) Thisis likely to improve PatchFinderrsquos accuracy in the detectionof functional regions Moreover the improved procedure islikely to provide information on the function of the region(proteinndashprotein interface ligand- or DNA-binding etc)

i335


GNimrod et al

which is completely missing now Finally the combinationof some of these measures in the clustering procedure mayalso improve the accuracy and sensitivity of the calculations(Mihalek et al 2004 Oliveiraet al 2003)

In conclusion we believe that PatchFinderrsquos capacity forhigh-throughput screening of functional regions in proteinsof known 3D structure will be useful in the context of theproteomics and structural genomics initiatives and may beused to further characterize known functional regions as wellas to reveal conserved regions that are as yet unknown

ACKNOWLEDGEMENTSWe thank Itay Mayrose and Sarel Fleishman for helpful dis-cussions and comments on the manuscript This work wassupported by the Israel Cancer Association grant to NB-TThe research was initiated when TP was doing his Post-doctorate with Dr Dave Swofford He thanks Dave Swoffordfor his support and for helpful discussion TP was supportedby a grant in Complexity Science from the Yeshaia HorvitzAssociation

REFERENCESAloyP QuerolE AvilesFX and SternbergMJ (2001) Auto-

mated structure-based prediction of functional sites in proteinsapplications to assessing the validity of inheriting protein func-tion from homology in genome annotation and to protein dockingJ Mol Biol 311 395ndash408

AmitaiG ShemeshA SitbonE ShklarM NetanelyDVengerI and PietrokovskiS (2004) Network analysis of pro-tein structures identifies functional residuesJ Mol Biol 3441135ndash1146

BergA VervoortJ and de KokA (1996) Solution structure of thelipoyl domain of the 2-oxoglutarate dehydrogenase complex fromAzotobacter vinelandii J Mol Biol 261 432ndash442

BocskeiZ GroomCR FlowerDR WrightCE PhillipsSECavaggioniA FindlayJB and NorthAC (1992) Pheromonebinding to two rodent urinary proteins revealed by X-ray crystal-lographyNature 360 186ndash188

BrownMT and CooperJA (1996) Regulation substrates andfunctions of srcBiochim Biophys Acta 1287 121ndash149

Chakrabarti P and Janin J (2002) Dissecting proteinndashproteinrecognition sitesProteins 47 334ndash343

DarimontB StehlinC SzadkowskiH and KirschnerK (1998)Mutational analysis of the active site of indoleglycerolphosphate synthase fromEscherichia coli Protein Sci 71221ndash1232

DeanAM and GoldingGB (2000) Enzyme evolution explained(sort of )Pac Symp Biocomput 6ndash17

del Sol MesaA PazosF and ValenciaA (2003) Automaticmethods for predicting functionally important residuesJ MolBiol 326 1289ndash1302

DeLanoWL (2002) Unraveling hot spots in binding interfacesprogress and challengesCurr Opin Struct Biol 12 14ndash20

FriedbergI and MargalitH (2002) Persistently conserved posi-tions in structurally similar sequence dissimilar proteins roles inpreserving protein fold and functionProtein Sci 11 350ndash360

GlaserF PupkoT PazI BellRE Bechor-ShentalD MartzEand Ben-TalN (2003) ConSurf identification of functionalregions in proteins by surface-mapping of phylogenetic informa-tion Bioinformatics 19 163ndash164

GrosP FujinagaM DijkstraBW KalkKH and HolWG(1989) Crystallographic refinement by incorporation of molecu-lar dynamics thermostable serine protease thermitase complexedwith eglin cActa Crystallogr B 45 488ndash499

HennigM DarimontBD JansoniusJN and KirschnerK (2002)The catalytic mechanism of indole-3-glycerol phosphate syn-thase crystal structures of complexes of the enzyme fromSulfolo-bus solfataricus with substrate analogue substrate and productJ Mol Biol 319 757ndash766

InnisCA AnandAP and SowdhaminiR (2004) Prediction offunctional sites in proteins using conserved functional groupanalysisJ Mol Biol 337 1053ndash1068

JonesDT TaylorWR and ThorntonJM (1992) The rapid gener-ation of mutation data matrices from protein sequencesComputAppl Biosci 8 275ndash282

JonesS and ThorntonJM (1997) Prediction of proteinndashproteininteraction sites using patch analysisJ Mol Biol 272 133ndash143

LandgrafR XenariosI and EisenbergD (2001) Three-dimensional cluster analysis identifies interfaces and functionalresidue clusters in proteinsJ Mol Biol 307 1487ndash1502

LichtargeO BourneHR and CohenFE (1996) An evolutionarytrace method defines binding surfaces common to protein familiesJ Mol Biol 257 342ndash358

LichtargeO and SowaME (2002) Evolutionary predictions ofbinding surfaces and interactionsCurr Opin Struct Biol 1221ndash27

MaB ElkayamT WolfsonH and NussinovR (2003) Proteinndashprotein interactions structurally conserved residues distinguishbetween binding sites and exposed protein surfacesProc NatlAcad Sci USA 100 5772ndash5777

MadabushiS YaoH MarshM KristensenDM PhilippiASowaME and LichtargeO (2002) Structural clusters of evol-utionary trace residues are statistically significant and common inproteinsJ Mol Biol 316 139ndash154

MihalekI ResI and LichtargeO (2004) A family of evolution-entropy hybrid methods for ranking protein residues by import-anceJ Mol Biol 336 1265ndash1282

MihalekI ResI YaoH and LichtargeO (2003) Combining infer-ence from evolution and geometric probability in protein structureevaluationJ Mol Biol 331 263ndash279

MillerS JaninJ LeskAM and ChothiaC (1987) Interiorand surface of monomeric proteinsJ Mol Biol 196641ndash656

NeuvirthH RazR and SchreiberG (2004) ProMate a structurebased prediction program to identify the location of proteinndashprotein binding sitesJ Mol Biol 338 181ndash199

OliveiraL PaivaPB PaivaAC and VriendG (2003)Identification of functionally conserved residues with theuse of entropy-variability plotsProteins 52 544ndash552

PanchenkoAR KondrashovF and BryantS (2004) Prediction offunctional sites by analysis of sequence and structure conserva-tion Protein Sci 13 884ndash892

PazosF and SternbergMJ (2004) Automated prediction of proteinfunction and detection of functional sites from structureProcNatl Acad Sci USA 101 14754ndash14759

i336


PatchFinder

PupkoT BellRE MayroseI GlaserF and Ben-TalN (2002)Rate4Site an algorithmic tool for the identification of func-tional regions in proteins by surface mapping of evolutionarydeterminants within their homologuesBioinformatics 18 (suppl)S71ndashS77

RostB (2002) Enzyme function less conserved than anticipatedJ Mol Biol 318 595ndash608

SanderC and SchneiderR (1991) Database of homology-derivedprotein structures and the structural meaning of sequence align-mentProteins 9 56ndash68

SayleRA and Milner-WhiteEJ (1995) RASMOL biomoleculargraphics for allTrends Biochem Sci 20 374ndash376

Schueler-FurmanO and BakerD (2003) Conserved residue clus-tering and protein structure predictionProteins 52 225ndash235

ShanahanHP GarciaMA JonesS and ThorntonJM (2004)Identifying DNA-binding proteins using structural motifs and theelectrostatic potentialNucleic Acids Res 32 4732ndash4741

SridharanS NichollsA and HonigB (1992) A new vertexalgorithm to calculate solvent accessible surface areaBiophysJ 61 A174

TeplyakovAV KuranovaIP HarutyunyanEH VainshteinBKFrommelC HohneWE and WilsonKS (1990) Crystal struc-ture of thermitase at 14 Aring resolutionJ Mol Biol 214 261ndash279

TsodikovOV RecordMT Jr and SergeevYV (2002) Novelcomputer program for fast exact calculation of accessible andmolecular surface areas and average surface curvatureJ ComptChem 23 600ndash609

ValdarWS and ThorntonJM (2001a) Conservation helps toidentify biologically relevant crystal contactsJ Mol Biol 313399ndash416

ValdarWS and ThorntonJM (2001b) Proteinndashprotein interfacesanalysis of amino acid conservation in homodimersProteins 42108ndash124

WaksmanG ShoelsonSE PantN CowburnD and KuriyanJ(1993) Binding of a high affinity phosphotyrosyl peptide to the SrcSH2 domain crystal structures of the complexed and peptide-freeformsCell 72 779ndash790

WallaceAC LaskowskiRA and ThorntonJM (1995) LIGPLOTa program to generate schematic diagrams of proteinndashligandinteractionsProtein Eng 8 127ndash134

XuW HarrisonSC and EckMJ (1997) Three-dimensional struc-ture of the tyrosine kinase c-SrcNature 385 595ndash602

YaoH KristensenDM MihalekI SowaME ShawCKimmelM KavrakiL and LichtargeO (2003) An accuratesensitive and scalable method to identify functional sites inprotein structuresJ Mol Biol 326 255ndash261

i337


PatchFinder

Aloy et al (2001) used a method based on EvolutionaryTrace (ET Lichtargeet al 1996) to assign conservationscores to each position The ET method uses a phylogen-etic tree of the query proteinrsquos family and detects residues thatshow evolutionary conservation in at least a subgroup (branch)of the tree ie trace residues Aloyet al consider as candid-ates for functional patches only sites with amino acids thatare polar and totally conserved (or with one lsquoresidue sequenceerrorrsquo) This definition suffers from low sensitivity an aminoacid position is either lsquoconservedrsquo or not In practice thereare many levels of conservation and the exclusion of posi-tions that are not totally conserved or not polar may obscurefunctional regions Moreover the method is very sensitive tothe choice of the homologous protein sequences that are usedthe addition of more homologs is expected to reduce the num-ber of putative positions thereby counter-intuitively reducingthe power of the method Aloyet alrsquos algorithm also assumesthat the shape of the patch is spherical which may not alwaysbe suitable

The algorithm developed by Madabushiet al (2002) forthe automated structure-based prediction of functional sitesis also based on ET A patch is defined as a collection ofspatially close trace residues Sets of patches are detected anda statistical test is used to assign their collective significancebased on the likelihood to obtain at random a cluster of traceresidues of the size of the largest patch (Yaoet al 2003)Alternatively the statistical significance is assigned based onthe total number of patches of the trace residues that werefound The outcome is a set of key amino acids some of whichare buried and presumably important for the maintenance ofthe proper protein fold Others are exposed and comprise thevarious functional sites

We have developed the Rate4Site algorithm (Pupkoet al2002) which is a rigorous statistical method for inferringthe level of conservation at each amino acid position takinginto account the phylogenetic relations between the sequencesand the stochastic process underlying their evolution WhenRate4Sitersquos conservation scores are mapped onto the proteinsurface one or more surface patches formed by spatiallyclose and conserved residues often appear These surfacepatches are potentially important functional regions We sub-sequently launched the ConSurf webserver that implementsthis algorithm (Glaseret al 2003)

The automatic nature of the ConSurf server makes it easyto use but the interpretation of the results is subjective thereis no stringent criterion for determining the amino acid pos-itions that compose a conserved region Furthermore whilethe largest and most conserved surface patch is usually eas-ily detected by the naked eye it might be difficult to identifysecondary functional regions that are smaller in size andorinclude residues that are less conserved and distinguish themfrom the background noise Thus it is essential to discrimin-ate between a random cluster of conserved amino acids andconserved patches that are statistically significant

In this work we present a novel methodology for theautomatic identification of functionally important regions inproteins with a known 3D structure This methodology whichis referred to as PatchFinder is based on three features FirstPatchFinder permits patches of any shape and size on theprotein surface rather than limiting the patch to any pre-setshape (eg sphere) around a particular amino acid Secondamino acid positions that are on the protein surface are distin-guished from those that are buried in the protein core (Jonesand Thornton 1997) This allows us to approximately distin-guish between amino acids that are conserved due to structuralconstraints (Schueler-Furman and Baker 2003) and thosethat are conserved because they are functionally importantThird PatchFinder assigns a probability to each inferred patchto be functionally important and searches for the patch withthe highest probability Typically this patch corresponds tothe main conservation signal and is relatively easy to detectThe following non-overlapping patches are associated withlower likelihood values and are much more difficult to detecteven though they are often biologically important

2 ALGORITHMPatchFinderrsquos input includes the 3D structure of a query pro-tein and a multiple-sequence alignment (MSA) of the proteinand its sequence homologs

21 Assignment of conservation scoresThe first step in the PatchFinder algorithm is the assignmentof an evolutionary conservation score to each amino acid pos-ition using the Rate4Site algorithm (Pupkoet al 2002) asimplemented in the ConSurf webserver (Glaseret al 2003)In brief Rate4Site produces a ML estimate of the score at eachposition based on a MSA a phylogenetic tree and a model ofsequence evolution eg JTT (Joneset al 1992)

22 Identification of exposed and buried residuesIn order to approximately distinguish between positions thatare conserved due to structural constraints and those thatare conserved due to their functional importance the patchsearch procedure is limited to positions that are exposedto the solvent Discrimination between lsquoexposedrsquo and lsquobur-iedrsquo positions is based on the solvent accessible surface area(ASA) computed using the Surface Racer program (Tsodikovet al 2002) The computation is done on the 3D structureof the query protein using a probe sphere of radius 14 Aringcorresponding to a water molecule

PatchFinder further defines a residue as lsquoexposedrsquo if its rel-ative accessible surface area (RSA) is greater than a fixedvalue the default is 1 The RSA for each residue is definedas the fraction of its ASA relative to its maximal ASA Themaximal ASA of a residue is calculated in an extended GXGtripeptide where G is glycine and X is the residue in question(Miller et al 1987)

i329


GNimrod et al











i330


PatchFinder















i331


GNimrod et al


Residuea IGPb CdRPc


Ile 213 + +Ile 232 + +Ser 234 + +















i332


PatchFinder

A B









i333


GNimrod et al















i334


PatchFinder




i335


GNimrod et al


































i336


PatchFinder
















i337


GNimrod et al











i330


PatchFinder















i331


GNimrod et al


Residuea IGPb CdRPc


Ile 213 + +Ile 232 + +Ser 234 + +















i332


PatchFinder

A B









i333


GNimrod et al















i334


PatchFinder




i335


GNimrod et al


































i336


PatchFinder
















i337


PatchFinder















i331


GNimrod et al


Residuea IGPb CdRPc


Ile 213 + +Ile 232 + +Ser 234 + +















i332


PatchFinder

A B









i333


GNimrod et al















i334


PatchFinder




i335


GNimrod et al


































i336


PatchFinder
















i337


GNimrod et al


Residuea IGPb CdRPc


Ile 213 + +Ile 232 + +Ser 234 + +















i332


PatchFinder

A B









i333


GNimrod et al















i334


PatchFinder




i335


GNimrod et al


































i336


PatchFinder
















i337


PatchFinder

A B









i333


GNimrod et al















i334


PatchFinder




i335


GNimrod et al


































i336


PatchFinder
















i337


GNimrod et al















i334


PatchFinder




i335


GNimrod et al


































i336


PatchFinder
















i337


PatchFinder




i335


GNimrod et al


































i336


PatchFinder
















i337


GNimrod et al


































i336


PatchFinder
















i337


PatchFinder
















i337

bioinformatics vol.21suppl.12005,pagesi328–i337talp/publications/patchfinder.pdf · “bti1023”...

Documents