
PROTEINS: Structure, Function, and Bioinformatics

Sequence-based protein domain boundary prediction using BP neural network with various property profiles

Lei Ye,1 Ting Liu,1 Zhaohui Wu,1 and Ruhong Zhou2,3*

1Department of Computer Science, Zhejiang University, Hangzhou, China

2 IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598

3Department of Chemistry, Columbia University, New York, New York 10027

INTRODUCTION

Identifying protein domain boundaries is important in many protein studies; however, predicting the domain boundary from the one-dimensional (1D) amino acid sequence remains one of the most challenging problems in molecular biology. A protein domain is often considered the fundamental element of protein structure, function, and

evolution.1–3 It is believed that a protein domain can fold independently

or semi-independently into a stable and compact structure, which might

exhibit a rich evolutionary history and a specialized molecular function.4

A typical protein may be comprised of a single domain or several

domains, which are not necessarily contiguous. Given the observed ran-

dom distribution of hydrophobic residues in protein sequences, domain

formation appears to be the optimal solution for a large protein to bury

its hydrophobic residues while keeping hydrophilic ones near the surface.5

Accurately defining protein structural domains from the three-dimensional (3D) tertiary structure is itself a difficult problem,6–8 and is currently best done manually by experts, with the SCOP domain classification as an excellent example.9 Predicting the domain boundary from 1D

sequences alone is even more challenging. Many methods have been pro-

posed to address this important problem. According to Nagarajan and

Yona,10 there are five different categories of such methods. Here, we only

briefly list these methods and interested readers can refer to Ref. 10 for

more details.

i. Methods based on similarity search. These methods, such as

MKDOM,11 Domainer,12 and DOMO,13 either do an all-versus-all

BLAST14 search to identify segment pairs with high degree of homol-

ogy11,12 or cluster sequences into groups by comparing their amino

acid and dipeptide composition.13 They are often used to partition all

proteins within a database into domains, but these are, in general, less

accurate because of their heuristic nature.

ii. Methods based on expert knowledge. These methods rely on expert

knowledge of protein families to construct models like hidden Markov

models (HMM) and artificial neural networks to identify other mem-

*Correspondence to: Ruhong Zhou, Computational Biology Center, 1101 Kitchawan Road, Yorktown

Heights, NY 10598. E-mail: [email protected]

Received 12 February 2007; Revised 12 June 2007; Accepted 21 July 2007

Published online 11 October 2007 in Wiley InterScience (www.interscience.wiley.com).

DOI: 10.1002/prot.21745

ABSTRACT

Given the rapid growth in the number of

sequences without known structures, it is

becoming increasingly important to not only

accurately define protein structural domains

but also predict domain boundaries from the

amino-acid sequence alone. In this article, we

present a Back-Propagation (BP) neural net-

work method using 9 different sequence pro-

files, based on chemical, physical, and statisti-

cal properties, to predict the domain boundary

of two-domain proteins from one dimensional

sequences. We have achieved an accuracy of

69% with a 10-fold cross validation on a 238

nonredundant two-domain protein dataset

that we built based on a common set from both

SCOP and CATH classifications. The method

has also been applied to a larger third-party

dataset with 522 proteins; and an accuracy of

62% has been achieved. Our prediction results

on both datasets are found to be significantly

better than those from some other methods,

such as DomCut and DGS on the same data-

sets, and also comparable to that from the

PPRODO method, upon which the larger data-

set was based. Our cross validation results are

also noticeably better than previous ones from

other BP neural network methods, probably

because we have used more property descriptors

with significantly more training nodes in our

neural network. The integration with PPRODO

method also indicates that the information

obtained from our current approach is comple-

mentary to that available through multiple

sequence alignments. Moreover, the relative im-

portance of each property profile has been ana-

lyzed in detail.

Proteins 2008; 71:300–307. © 2007 Wiley-Liss, Inc.

Key words: protein domains; domain boundary

prediction; neural network; sequence profiling.


bers of the family. PFam A,15 TigrFam,16 and

SMART17 fall in this category. These methods,

though more accurate for certain families, are often

limited by their ability to predict less known or

unknown families.

iii. Methods based on predicted 3D information. These

methods try to predict 3D tertiary structures first

and then assign domain boundaries. RosettaDOM,18

SnapDragon,19 and Rigden’s covariance analysis20

are examples of this approach. These methods use 3D

structural information but are often computationally

more expensive.

iv. Methods based on multiple sequence alignment. These

methods, such as PASS,21 Domination,22 and Yona’s

hybrid method,10 use multiple sequence alignments

to predict domain boundaries.

v. Other methods. These are methods that do not fall

into previous categories, such as domain guess by

size (DGS).23

Despite the large number of studies from several

groups, the domain boundary prediction using 1D

sequence alone is still an open problem. Most of these

methods have unsatisfactory prediction accuracies (more

later). The expert knowledge based methods, such as

TigrFam16 and SMART,17 might be more accurate for

certain families, but they often require careful manual

inspections and are generally useful for subsets of pro-

teins. In this article, we propose a Back-Propagation (BP)

neural network method based on sequence property pro-

files to predict the domain boundary. This method will

loosely fall into the first category by Nagarajan and

Yona’s classification (i.e., methods based on similarity

search). However, the similarity search is not purely based

on the amino acid sequence matching, but on the

sequence profile mapping using properties from chemi-

cal, physical, and statistical features, with significantly

higher accuracy (see later). Similar methods have been

proposed previously as well, such as DomCut24 based on

a single domain linker index, the entropy method based

on a sequence entropy profile,25 and the CHOPnet26

method based on secondary structure, solvent accessibility,

flexibility, and protein length, etc. The current method

uses nine different property descriptors based on chemical,

physical, and statistical properties with a total of 169 input

nodes in BP neural network training.

The article is organized as follows. The Methods section presents the details of the method, including the protein dataset preparation, the description of the input nodes to the neural network, and the architecture of the BP neural network. The Results and Discussion section presents and discusses the results. Finally, the Conclusion section gives concluding remarks. Given the difficulty of this

domain boundary prediction problem, we focus our cur-

rent efforts on the two-domain proteins only, to see if we

can improve the accuracy from previous methods (as

well as simple benchmarks like the Equal-split method).

Very encouraging results have been achieved on both

datasets used in our study: one built by ourselves using a

common set from both SCOP9 and CATH27 classifica-

tions and the other from the PPRODO method by Lee

and coworkers.28 We also analyzed in detail the relative

importance of each descriptor in their contribution to

the overall prediction accuracy. As shown later, the BP

neural network provides a powerful learning vehicle, given

the appropriate sequence profiles of properties related to

protein domain structures.

METHODS

Data preparation

The dataset used for training and testing in this article

consists of 238 two-domain proteins (contiguous do-

mains), which was built from a common set of both

SCOP (Structural Classification of Proteins)9 and CATH

(Class Architecture Topology Homology)27 classifications. SCOP and CATH may assign slightly different domain boundaries to the same protein, and the agreement between the two databases is only about 80%.29 A two-step process is followed here to

obtain this common dataset. First, all protein chains

with two contiguous domains are extracted from SCOP9

(http://scop.mrc-lmb.cam.ac.uk/scop/, version 1.69) and

CATH27 (http://www.biochem.ucl.ac.uk/bsm/cath/cath.

html, version 2.6.0), respectively. Following previous

studies,24,25,28 proteins with a domain size <40 or >500 residues are eliminated because of their excessively small or large sizes (there are very few large-sized domains anyway). The second step is to extract those common proteins shared by both SCOP and CATH whose domain boundaries are "close" enough. Here, we define two domain assignments as close if a given protein is classified in SCOP as $(a,b)(c,d)$ and in CATH as $(a',b')(c',d')$, where $|b - b'| < 5$ and $|c - c'| < 5$. Approximately 3500 proteins pass the above two filters. Then, the

program uniqueProt,30 is used to reduce the sequence

redundancy in the dataset (with the HSSP-threshold set

to 5).31 A HSSP-threshold value of five corresponds

roughly to less than 25% sequence identity in a global

alignment of a length of 250 residues.31 It should be

noted that the current dataset (and method) is address-

ing a problem of two-domain proteins only, which is a

step away from the more general problem of predicting

domain boundaries in arbitrary proteins (addressed in a

follow-up study). Nevertheless, it is still a non-trivial

problem to predict the correct domain boundary for a

known two-domain protein from its sequence. Finally,

our dataset contains 238 unique two-domain protein

chains. The distribution of the protein chain length in

our dataset is summarized in Table I.
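The two filters described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the tuple representation of a domain as (start, end) residue numbers, and the inclusive reading of the 40 to 500 residue size limits are all assumptions.

```python
from typing import Tuple

Domain = Tuple[int, int]  # hypothetical representation: (start, end), inclusive


def domain_size_ok(dom: Domain, min_size: int = 40, max_size: int = 500) -> bool:
    """Keep domains whose size is neither below min_size nor above max_size."""
    size = dom[1] - dom[0] + 1
    return min_size <= size <= max_size


def assignments_close(scop: Tuple[Domain, Domain],
                      cath: Tuple[Domain, Domain],
                      tol: int = 5) -> bool:
    """SCOP (a,b)(c,d) and CATH (a',b')(c',d') are close if
    |b - b'| < tol and |c - c'| < tol."""
    (_, b), (c, _) = scop
    (_, b2), (c2, _) = cath
    return abs(b - b2) < tol and abs(c - c2) < tol
```

A protein would enter the common set only if both of its domains pass `domain_size_ok` in each database and `assignments_close` holds; redundancy reduction with uniqueProt is a separate step not shown here.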


Inputs of the neural network

There are a total of nine sequence profile descriptors

(indices) based on physical, chemical, and statistical prop-

erties in this study. We used a sliding-window size of 11

residues to smooth out the property profiles along the

amino acid sequence. Since the domain boundary typically does not appear near the N- or C-terminals, the central residue of the window starts at the 26th residue from the N-terminal (i.e., the first 25 residues are ignored) and ends at the 26th residue from the C-terminal. For

each residue in this local sequence window, we calculated

the following 8 descriptors first (see below for the 9th

one): secondary structure (3 nodes, representing helix,

strand, and coil), relative solvent accessibility (1 node),

domain linker index, and averaged domain linker index (2

nodes), flexibility index and averaged flexibility index (2

nodes), hydrophobicity index and averaged hydrophobic-

ity index (2 nodes), entropy index (based on side-chain

entropy—it is the physical entropy not the statistical en-

tropy used in multiple sequence alignments) and averaged

entropy index (2 nodes), averaged hydrophobicity of resi-

dues near N- and C-terminals (2 nodes), and relative posi-

tion probability index (1 node). There are a total of 15

nodes from these eight descriptors for each residue in the

11-residue-sized window sliding average.
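The per-residue node count and the range of window centers can be checked with a short sketch. The dictionary keys and helper names are illustrative assumptions; the node counts, the window size of 11, and the 25-residue margin come from the text.

```python
# Nodes contributed by each of the first eight descriptors, per residue.
NODES_PER_RESIDUE = {
    "secondary structure": 3,
    "relative accessibility": 1,
    "linker": 2,
    "flexibility": 2,
    "hydrophobicity": 2,
    "entropy": 2,
    "AHNC": 2,
    "RPPI": 1,
}
WINDOW = 11   # sliding-window size in residues
MARGIN = 25   # residues ignored at each terminal


def window_centers(chain_length: int) -> range:
    """1-based central residues: 26th from the N-terminal up to the
    26th from the C-terminal."""
    return range(MARGIN + 1, chain_length - MARGIN + 1)


def window_positions(center: int, window: int = WINDOW) -> list:
    """1-based residue positions covered by the window around `center`."""
    half = window // 2
    return list(range(center - half, center + half + 1))
```

With this margin, every window lies fully inside the chain, because the first center (residue 26) only reaches back to residue 21.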

For better training, we used the observed secondary

structures and solvent accessibilities for training instead

of the predicted ones, even though the current method

does not require structural data in actual prediction—

only the sequence data is needed. Fairly accurate (~80%)

methods exist for the prediction of secondary struc-

tures32 and solvent accessibilities33 from amino acid

sequences. Here, the secondary structures and solvent

accessibilities are calculated from the PDB files using the

program STRIDE34 in our training and testing. Those

residues with missing coordinates in PDB files are

removed from the protein chain. Three nodes encode the

secondary structures as "helix", "strand", or "coil," and one node encodes the relative solvent accessibility as "buried" or "exposed" (under or over 20% of the total

surface area of each residue).34,35 The domain linker

index presented in DomCut,24 the flexibility index by

Vihinen,36 the hydrophobicity index and the entropy

index by Armadillo37 (which combines the indices from

the work of Kyte38 and Galzitskaya25), are employed in

our method. The averaged hydrophobicities of the residues near the N- and C-terminals (AHNC) are defined as follows: $AH_N(i) = \left(\sum_{n=1}^{i-1} h(n)\right)/(i-1)$ and $AH_C(i) = \left(\sum_{n=i+1}^{L} h(n)\right)/(L-i)$, where i is the position of the residue in the protein chain, L is the length of the protein chain, and h(n) is the hydrophobicity index of the nth residue. The relative position probability index (RPPI) indicates the probability of a relative position, 0 (N-terminal) to 1 (C-terminal), being the domain boundary.
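The AHNC definition translates directly into code. The sketch below assumes the hydrophobicity values are held in a plain Python list and that positions are 1-based as in the formulas; the function name is hypothetical.

```python
def ahnc(h: list, i: int) -> tuple:
    """Averaged hydrophobicity on the N- and C-terminal sides of residue i.

    h : per-residue hydrophobicity indices, h[0] being residue 1
    i : 1-based residue position, with 1 < i < L so both sides are non-empty
    """
    L = len(h)
    if not 1 < i < L:
        raise ValueError("i must satisfy 1 < i < L")
    ah_n = sum(h[:i - 1]) / (i - 1)   # residues 1 .. i-1
    ah_c = sum(h[i:]) / (L - i)       # residues i+1 .. L
    return ah_n, ah_c
```

For example, with `h = [1.0, 2.0, 3.0, 4.0, 5.0]` and `i = 3`, the N-side average is 1.5 and the C-side average is 4.5.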

To describe the RPPI more accurately and to avoid dominance by proteins of one particular size, we generate the RPPI separately for different size groups. We subdivided the protein size space (50–800 residues) equally into 15 groups, each spanning 50 residues. The 238 proteins in the dataset are then binned

into these 15 groups based on their sizes (See Table I).

Each group’s normalized range (0,1) is further binned

into 20 subsections, with each protein falling into one of

these subsections. The normalized domain boundary

position for each protein can then be calculated as

(boundary position)/(chain length) and binned into these

20 subsections in its respective size group (the exact bin

size doesn’t matter much, and we have tried 15 bins and

the results do not change much). The relative position

probability index RPPI of the ith subsection in the nth group is thus defined as $RPPI(i,n) = N(i)/TN(n)$, where N(i) is the number of proteins in the group whose relative boundary position falls into the ith subsection, and TN(n) is the total number of proteins in this group. Part

of the RPPI extracted from the 238-protein dataset is dis-

played in Figure 1. This figure indicates that although a

large portion of the proteins have a domain boundary

near the middle of their chain lengths, there are still

many proteins whose domain boundary positions are not

near the middle of their chain lengths (RPPI <0.3 or >0.7). This can also be seen from a simple equal-split prediction (assuming the domain boundary is in the center of the protein sequence, thus splitting the protein into two equal parts): only ~50% accuracy is achieved, while our current method can achieve an accuracy of about 69% (see later). We have also tried subdividing the size space (50–800) into other numbers of groups, such as 1, 8, or 30, and the final accuracy results show only small differences, with the 15 groups displaying slightly better results than the others. Thus, in the following results section, we will use 15 groups for the RPPI generation, while for all the other descriptors, only one group (i.e., with all the proteins) is used.

Table I. The Distribution of the Protein Chain Length in Our 238 Nonredundant Two-Domain Protein Database

Group   Chain length range   Number of proteins
1       50–100               2
2       100–150              29
3       150–200              52
4       200–250              55
5       250–300              20
6       300–350              26
7       350–400              18
8       400–450              18
9       450–500              9
10      500–550              6
11      550–600              2
12      600–650              0
13      650–700              0
14      700–750              1
15      750–800              0

The chain length space (50–800) is subdivided equally into 15 groups, each having a size range of 50. Each protein in the dataset is then binned into these 15 groups based on its chain length.
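The RPPI construction described above can be sketched as a two-level histogram. Several details are assumptions made for illustration: the half-open treatment of group edges (a 100-residue chain is placed in group 2), the use of integer arithmetic for binning, and the function names.

```python
from collections import defaultdict


def size_group(chain_length: int, lo: int = 50, width: int = 50,
               n_groups: int = 15) -> int:
    """1-based size group; group edges are treated as half-open (assumption)."""
    g = (chain_length - lo) // width + 1
    return min(max(g, 1), n_groups)


def build_rppi(proteins, n_bins: int = 20) -> dict:
    """proteins: iterable of (chain_length, boundary_position) pairs.

    Returns {group: [RPPI(i, group) for each of the n_bins subsections]},
    where RPPI(i, n) = N(i) / TN(n).
    """
    counts = defaultdict(lambda: [0] * n_bins)
    totals = defaultdict(int)
    for length, boundary in proteins:
        g = size_group(length)
        # Subsection of the normalized position boundary/length in (0, 1).
        b = min(boundary * n_bins // length, n_bins - 1)
        counts[g][b] += 1
        totals[g] += 1
    return {g: [c / totals[g] for c in counts[g]] for g in counts}
```

Each group's values sum to 1, so the RPPI of a subsection can be read as an empirical probability that the boundary falls there for proteins of that size.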

The last (ninth) descriptor, called HSNC, is computed for the central residue of the window only; it is the percentage of helix and strand residues from the N- and C-terminals (4 nodes). The four values are calculated as follows: $\mathrm{Helix}_N(i) = H_N(i)/(i-1)$, $\mathrm{Helix}_C(i) = H_C(i)/(L-i)$, $\mathrm{Strand}_N(i) = S_N(i)/(i-1)$, and $\mathrm{Strand}_C(i) = S_C(i)/(L-i)$, where $H_N(i)$ and $S_N(i)$ are the numbers of helix and strand residues in the region from the N-terminal to the (i − 1)th residue, and $H_C(i)$ and $S_C(i)$ are the numbers of helix and strand residues in the region from the (i + 1)th residue to the C-terminal. Of the nine descriptors, the secondary structure index, relative accessibility index, linker index, flexibility index, hydrophobicity index, and entropy index, or their variations, have been used in previous studies, either singly24,25 or in combinations of a few,26 while the RPPI, AHNC, and HSNC indices are newly designed to capture the underlying physics of the structural features of the domain boundary. The RPPI index measures the relative size or balance between the two domains, the AHNC index indicates the fluctuation of the average hydrophobicity measured from both terminals, and the HSNC index measures the percentage of α-helical and β-strand residues from both terminals. Table II summarizes all nine descriptors used in this study.
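The four HSNC values can be computed from a per-residue secondary structure string. The one-letter encoding ('H' helix, 'E' strand, 'C' coil) and the function name are assumptions made for this sketch.

```python
def hsnc(ss: str, i: int) -> tuple:
    """Fractions of helix and strand residues on each side of residue i.

    ss : secondary structure per residue, 'H' helix, 'E' strand, 'C' coil
    i  : 1-based residue position, with 1 < i < L
    Returns (Helix_N, Helix_C, Strand_N, Strand_C).
    """
    L = len(ss)
    if not 1 < i < L:
        raise ValueError("i must satisfy 1 < i < L")
    n_side = ss[:i - 1]   # residues 1 .. i-1
    c_side = ss[i:]       # residues i+1 .. L
    return (
        n_side.count("H") / (i - 1),
        c_side.count("H") / (L - i),
        n_side.count("E") / (i - 1),
        c_side.count("E") / (L - i),
    )
```

For the toy string `"HHHHCEEEEC"` and `i = 5`, the N-terminal side is all helix and four of the five C-terminal residues are strand.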

Neural network architecture

The standard Back-Propagation feed-forward artificial neural network is used in our method. The network has $15 \times 11 + 4 = 169$ input nodes (15 nodes from the first eight descriptors for each residue in the 11-residue-sized window, plus 4 HSNC nodes from the ninth descriptor for the central residue), a single hidden layer of 5 nodes, and 1 node in the output layer. A schema of the BP neural network architecture is shown in Figure 2. The output node indicates whether or not the central residue in the window is a boundary residue. The residue with the maximum output score is classified as the boundary of the protein chain. Similar to the criterion used in previous studies,10,26,28 any prediction within ±20 residues of the true domain boundary residue is considered a success.

Table II. The List of All Descriptors Employed in This Work

Descriptor               Nodes   Remark
Secondary structure      3       Helix, strand, or coil
Relative accessibility   1       Exposed or buried
Linker index             2       Linker index w/ or w/o average
Flexibility              2       Flexibility index w/ or w/o average
Hydrophobicity           2       Hydrophobicity index w/ or w/o average
Entropy                  2       Entropy index w/ or w/o average
AHNC                     2       Averaged hydrophobicity of residues near the N- and C-terminals
RPPI                     1       Probability of a relative position being domain boundary
HSNC                     4       Percentage of helix and strand residues from the N- and C-terminals

The BP network has a total of $15 \times 11 + 4 = 169$ input nodes (15 nodes from the first eight descriptors for each residue in the 11-residue-sized window, plus 4 HSNC nodes from the ninth descriptor for the central residue only).

Figure 1. The distribution of proteins versus the relative position probability index (RPPI). Three representative size groups, groups 3, 4, and 5, from our 238-protein dataset, are shown. It indicates that even though a large portion of the proteins have a domain boundary near the middle of their chain lengths, there are still many proteins (total 26) whose domain boundary positions are not near the middle of their chain lengths (RPPI <0.3 or >0.7).

Figure 2. Architecture of the BP neural network. With a window size of 11, each residue in the window has 15 nodes (11 × 15 nodes), and the central residue of the window has an additional four nodes, which gives a total of 169 nodes in the neural network.
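The layer sizes, the choice of the boundary residue, and the success criterion can be illustrated with a small forward-pass sketch. The sigmoid activation, the weight initialization range, and all function names are assumptions; the text specifies only the 169-5-1 layout, the maximum-score rule, and the ±20 residue criterion. Training by back-propagation is not shown.

```python
import math
import random

N_INPUT = 15 * 11 + 4   # 169 input nodes
N_HIDDEN = 5            # single hidden layer


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def init_network(seed: int = 0):
    """Random small weights and zero biases (illustrative initialization)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(N_INPUT)]
          for _ in range(N_HIDDEN)]
    b1 = [0.0] * N_HIDDEN
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)]
    b2 = 0.0
    return w1, b1, w2, b2


def forward(net, x) -> float:
    """Output score for one window encoded as a 169-element vector."""
    w1, b1, w2, b2 = net
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)


def predict_boundary(scores: dict) -> int:
    """scores maps each central residue to its output score;
    the residue with the maximum score is the predicted boundary."""
    return max(scores, key=scores.get)


def is_success(predicted: int, true_boundary: int, tol: int = 20) -> bool:
    """A prediction within +/- tol residues of the true boundary counts."""
    return abs(predicted - true_boundary) <= tol
```

In use, `forward` would be evaluated once per window center along the chain, the scores collected into a dictionary, and `predict_boundary` applied to that dictionary.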

To evaluate the performance of our method based on

the BP network, a 10-fold cross validation is performed.

The dataset is divided into 10 subsets randomly: 9 sets

for training and 1 set for testing (jackknife test). Ten in-

dependent calculations are performed so that each subset

is used as the testing set once. Since the neural network is initialized with random weights and biases, up to 20 different neural networks are trained for each independent cross-validation calculation as a robustness test, and the best set of trained weights (those corresponding to the best prediction accuracy) is taken as the neural network for prediction.
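The random 10-fold partition can be sketched with the standard library alone. The function name and the fixed seed are illustrative; how the authors actually shuffled and assigned proteins to subsets is not stated.

```python
import random


def k_fold_splits(n_items: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test
```

For the 238-protein dataset this gives test sets of 23 or 24 proteins, and every protein appears in exactly one test set.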

RESULTS AND DISCUSSION

Performance on the common SCOP and CATH dataset

In this study, a successful domain-boundary prediction

means that the predicted domain boundary residue is within a ±20 residue window of the "correct" domain boundary, which is assigned by the SCOP9 and CATH27 classifications (as mentioned above, we chose only those proteins with a common or close-enough assignment in both classifications). The 10-fold cross-validation

prediction results for our 238 protein data set are shown

in Table III. We have achieved a 69.3% accuracy with a

window size 11 for the 238 protein set. We have also tried

window sizes of 7, 15, 19, as well as 23, and similar 10-

fold cross validation results are summarized in Table III as

well. It shows that window size 11 has the best overall per-

formance. Of course, these results are not that much dif-

ferent across all the window sizes tested here, indicating

that the results are reasonably robust with regard to differ-

ent window sizes. This 69.3% accuracy is noticeably

higher than previous BP neural network results, such as

about 50% accuracy in the CHOPnet26 for a similar data-

set (see later). The reason for this could be the fact that

we have used more property descriptors with significantly

more training nodes in our neural network (169 nodes vs.

57 nodes in CHOPnet26). This accuracy is also signifi-

cantly higher than that from a simple equal-split predic-

tion, i.e., assuming the domain boundary is in the center

of the protein sequence, thus splitting the protein into

two equal parts. As mentioned above, only ~50% accuracy is achieved by this equal-split method (with similar results for the 522-protein dataset below).
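The equal-split baseline and the accuracy measure are simple enough to state in code. The function names are hypothetical and the four chains in the usage note are made-up toy data, not proteins from either dataset.

```python
def equal_split_prediction(chain_length: int) -> int:
    """Baseline: place the boundary at the center of the chain."""
    return chain_length // 2


def accuracy(predictions, truths, tol: int = 20) -> float:
    """Percentage of predictions within +/- tol residues of the truth."""
    hits = sum(abs(p - t) <= tol for p, t in zip(predictions, truths))
    return 100.0 * hits / len(truths)
```

For toy chains of length 200, 300, 150, and 400 with true boundaries at 95, 100, 80, and 210, the baseline predicts 100, 150, 75, and 200 and is correct for three of the four.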

As for comparison, we also applied the DomCut24 and

DGS23 methods to our current dataset (these are the

ones freely available to us on the web). In general, it is

exceptionally difficult to compare accuracies across differ-

ent methods published in literature, given the differences

in datasets, domain linker definitions, and evaluation cri-

teria. Here, for the DomCut, the predicted boundary is

the residue with the lowest value in its linker preference

profile as recommended, and the linker preference profile

result comes from the DomCut server (http://www.bork.embl-heidelberg.de/~suyama/domcut/). The prediction accuracy of DomCut is only 30.67% with the same ±20 criterion (given its simplicity with only one linker index, the results are not that bad). The predicted boundary of DGS is taken from its first prediction that assigns the protein as a two-continuous-domain protein. The DGS program is downloaded from NCBI (ftp://ftp.ncbi.nih.gov/pub/wheelan/). The prediction accuracy of DGS is 41.60%, again with the ±20 criterion. The accuracies from these

methods are significantly lower than our current BP net-

work method. Our higher performance than that of the

DomCut method is not too surprising, since only the

linker index is used in the DomCut method, while eight

more descriptors are used in our current method in

addition to the linker index. The relatively low perform-

ance of the DGS method, on the other hand, is probably

related to the simple approach used in DGS—only a distri-

bution of domain lengths is used.23 To some extent, DGS

is similar to our RPPI index in the underlying physics and

chemistry. Given the seemingly random distribution of

hydrophobic and hydrophilic residues in the sequence, it

takes some balance and certain size for a protein to form

individual domains by burying its hydrophobic residues in

the core while exposing the hydrophilic residues to the sur-

face at the same time (although it has low accuracy, it is a

neat idea).

Table III. Accuracy from the 10-Fold Cross-Validation Calculations on the 238 Protein Dataset (Built from the Common Sets from Both SCOP and CATH Classifications)

Window size   Accuracy (%)
7             67.59
11            69.27
15            66.34
19            66.34
23            66.32

The results from other window sizes are also shown.

Performance on the PPRODO dataset

To further evaluate the performance of our current method, we then carried out the same 10-fold cross-validation calculations on another, larger third-party dataset, which was proposed recently by Sim et al.28 along with the PPRODO method. The PPRODO method is based on the hypothesis that domain boundaries can be detected by investigating the sequence evolutionary information throughout the process of gene–exon shuffling. It utilizes the position-specific scoring matrix (PSSM) generated from a PSI-BLAST14 search to train its neural network.28

The associated dataset consists of 522 two-domain pro-

teins, which are extracted from SCOP9 database with less

than 30% sequence identities. It was reported that the pre-

diction accuracies of PPRODO, DGS, and DomCut on this

dataset are 65.5%, 41.7%, and 27.1% respectively.28 Again,

with a window size of 11 and a ±20 criterion, we have obtained an accuracy of 62.0% for the PPRODO dataset.

Our prediction accuracy is slightly lower than that of

PPRODO. It should be noted that although PPRODO has

obtained a slightly higher accuracy of 65.5% on this 522

protein dataset, this high accuracy probably relies heavily

on the PSSM matrix from the expensive multiple sequence

alignments. However, no multiple sequence alignment is

needed or used in our method. Our future work might

include this multiple sequence alignment information as

well, to further improve the accuracy.

Again, the DomCut24 and DGS23 methods achieved significantly lower performances than our current method, with accuracies of 27.1% and 41.7%, respectively, versus our 62.0%.

Another interesting idea is to combine the PPRODO method and our current neural network method to take advantage of both worlds: the physical/chemical properties and the multiple sequence alignment. The results show that we can indeed improve the prediction accuracy by combining these two methods. We downloaded the PPRODO program from the website http://gene.kias.re.kr/jlee/pprodo/ (as well as the PPRODO dataset mentioned above). A simple approach is used for the combination: (i) if both methods predict the same boundary (within the criterion used, ±20 residues), we take our prediction as is; and (ii) if the two methods predict different boundaries, we take the average position of the two. The reasoning is simple: if the two predicted boundaries are the same or close enough, there is a high probability that each method gets it right; on the other hand, if they are very different, it is more likely that both are wrong, so we take the average of the two to improve the odds. For our 238-protein dataset, the final accuracy improves from 69.3% to 74.2% with this combined approach, and for the 522-protein PPRODO dataset, it improves from 62.0% to 69.2%. These results indicate that a combined approach does take advantage of both methods.
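The combination rule can be written in a few lines. This sketch assumes that "the same boundary" means the two predictions differ by at most 20 residues and that the average is rounded down to an integer position; the function name is hypothetical.

```python
def combine_predictions(ours: int, pprodo: int, tol: int = 20) -> int:
    """Combine two boundary predictions.

    (i)  If they agree within tol residues, keep our prediction.
    (ii) Otherwise, take the average position of the two.
    """
    if abs(ours - pprodo) <= tol:
        return ours
    return (ours + pprodo) // 2
```

For example, predictions of 100 and 115 agree and give 100, while predictions of 100 and 160 disagree and give 130.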

Relative importance of descriptors

As mentioned earlier, we employed nine descriptors to

predict the domain boundary in a protein chain. It is of

great interest to investigate what the relative importance is

for each descriptor and which ones contribute most to the

final accuracy. We thus perform another nine similar 10-

fold cross validations but with one less descriptor each

time (i.e., removing one descriptor from the total nine).

The final prediction accuracy results are summarized in

Table IV. Figure 3 also shows some detailed results (as well

as statistical variations, more later) of the 10-fold validation calculations. The HSNC (percentage of α-helix and β-strand residues from the N- and C-terminals), RPPI (the relative position probability index), and the relative solvent accessibility are found to be the top three descriptors. The importance of the RPPI index and solvent accessibility index might make sense, since for a two-domain protein to show a stable and well-defined structure, the relative domain sizes might be somewhat balanced, and the inter-domain region will likely be buried from the solvent. However, the underlying physics of the importance of the HSNC index is not immediately clear; perhaps the number of residues in well-defined secondary structure (α-helix or β-sheet) also needs to be balanced somehow in two-domain protein structures. The final prediction accuracy drops by about 5.5 percentage points (from 69.3% to 63.8%) if HSNC is removed from the input data. However, the differences among these descriptors are not that large, with the drop in accuracy ranging from 0.8% (removing flexibility) to 5.5% (removing HSNC). These results indicate that the domain boundary information (and maybe other structural information as well) is shared among many of these descriptors; for example, the hydrophobicity index and relative solvent accessibility might be closely correlated. To further complicate the situation, the slight differences among these descriptors might be buried in the noise from the random initialization and the training mechanism of the BP neural network. These results also indicate that, in order to further improve the accuracy, more and better descriptors are still in great demand.

Table IV. The Analysis of the Relative Importance of the Nine Descriptors

Descriptor removed         Cross-validation (238 dataset, %)   Cross-validation (PPRODO dataset, %)
None                       69.27                               62.01
- flexibility              68.44                               60.47
- entropy                  68.02                               60.43
- hydrophobicity           68.01                               60.03
- secondary structure      67.23                               59.27
- linker                   66.76                               61.21
- AHNC                     65.51                               59.85
- relative accessibility   65.11                               57.53
- RPPI                     64.24                               59.28
- HSNC                     63.82                               57.72

These 10-fold cross-validation results are obtained by removing one and only one descriptor from the input data each time.

Figure 3. The statistical variation of the prediction accuracy in the 10-fold cross validation, when RPPI, HSNC, secondary structure, or none, is removed from input data. Again, this is for our 238-protein dataset. The relatively large statistical variation (5–8% standard deviations) is related to the small size of the test set (an average of 24 proteins); a single protein mis-prediction will result in a 4.2% drop in the accuracy. Thus, the 5–8% standard deviation indicates a 1–2 protein variation in the total number of correctly predicted proteins, which is not too bad. See text for more discussions.

Does this relative importance play a similar role in the

larger PPRODO dataset? To address this question, we

have also performed the same nine cross-validation tests on this 522-protein dataset, removing one descriptor at a time. The final prediction accuracies of these cross-validation calculations are also summarized in

Table IV. Similarly, the relative accessibility, HSNC and

RPPI are found to be the top three descriptors, with the

prediction accuracy dropping from 62.0% to 57.5%,

57.7%, and 59.8%, respectively, once the descriptor is

removed from the input.
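The leave-one-descriptor-out comparison can be tabulated in a few lines. The following Python sketch uses only the three PPRODO accuracies quoted in the text (the values for the remaining descriptors are in Table IV and are not reproduced here). The function name `rank_by_accuracy_drop` and the data layout are hypothetical and are not part of the original implementation.

```python
# Minimal sketch: rank descriptors by the accuracy lost when each one is removed.
# The numbers are the PPRODO-dataset values quoted in the text; the function
# name and the dictionary layout are illustrative assumptions.

def rank_by_accuracy_drop(baseline, accuracy_without):
    """Return (descriptor, drop) pairs sorted from largest to smallest drop."""
    drops = {
        name: round(baseline - acc, 1)
        for name, acc in accuracy_without.items()
    }
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


baseline_accuracy = 62.0  # accuracy with all descriptors, 522-protein dataset (%)
accuracy_without = {
    "relative accessibility": 57.5,
    "HSNC": 57.7,
    "RPPI": 59.8,
}

ranking = rank_by_accuracy_drop(baseline_accuracy, accuracy_without)
for name, drop in ranking:
    print(f"{name}: accuracy drops by {drop} percentage points")
```

A larger drop suggests a more informative descriptor, but drops that differ by only a few tenths of a point lie well within the fold-to-fold variation discussed in the text.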

Finally, it should be pointed out that the commonly used 10-fold cross-validation generates a large statistical variation (5–8% standard deviation) in our prediction accuracy, particularly for our smaller 238-protein dataset, as shown in Figure 3. With a 10-fold cross-validation, the test set has only about 24 proteins, so a single mispredicted protein results in a 4.2% drop in accuracy. The 5–8% standard deviation seen in Figure 3 therefore corresponds to a deviation of 1–2 proteins in the total number of correctly predicted proteins, which is acceptable. For further validation, we performed a five-fold cross-validation. As expected, the statistical variation becomes much smaller: the five-fold prediction accuracies are 62.50%, 57.45%, 62.50%, 63.83%, and 64.58%, respectively, which gives an average accuracy of 62.2% with a standard deviation of 2.4%. In addition, we trained the network on the PPRODO dataset (522 proteins) and tested it on our 238-protein dataset. There are 82 proteins common to both datasets, so we removed these common proteins from the training set (522 − 82 = 434 proteins, while the test set remains the same with 238 proteins) to avoid an artificially high accuracy. A decent accuracy of 65.4% was achieved for this much larger test set, which indicates that our neural network is fairly robust.
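The fold-size arithmetic in this paragraph can be checked directly. The Python sketch below recomputes the accuracy lost per mispredicted protein in a 24-protein test fold, converts the 5–8% spread into a number of proteins, and summarizes the five quoted five-fold accuracies. The text does not state which standard deviation estimator was used, so both the population and the sample forms are computed here as an assumption; the variable names are illustrative only.

```python
# Minimal sketch of the fold-size arithmetic and the five-fold summary.
# Only numbers quoted in the text are used; the variable names are assumptions.
import statistics

n_proteins = 238
test_size_10fold = round(n_proteins / 10)   # about 24 proteins per test fold
drop_per_miss = 100.0 / test_size_10fold    # accuracy lost per mispredicted protein (%)

# A 5-8% standard deviation expressed as a number of proteins in one fold.
proteins_low = 5.0 / drop_per_miss
proteins_high = 8.0 / drop_per_miss

five_fold = [62.50, 57.45, 62.50, 63.83, 64.58]  # accuracies quoted in the text (%)
mean_acc = statistics.mean(five_fold)
pop_sd = statistics.pstdev(five_fold)     # population form (divide by N)
sample_sd = statistics.stdev(five_fold)   # sample form (divide by N - 1)

print(f"one miss in {test_size_10fold} proteins costs {drop_per_miss:.1f}%")
print(f"5-8% corresponds to {proteins_low:.1f}-{proteins_high:.1f} proteins")
print(f"five-fold mean = {mean_acc:.1f}%")
print(f"population SD = {pop_sd:.2f}%, sample SD = {sample_sd:.2f}%")
```

The mean reproduces the quoted 62.2%. The population form of the standard deviation comes out at about 2.5% and the sample form at about 2.8%, both close to the quoted 2.4% and much smaller than the 5–8% seen with 10-fold cross-validation.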

CONCLUSION

We have presented a BP neural network based method to identify the domain boundary of two-domain proteins. We achieved a prediction accuracy of 69% (with the commonly used ±20 criterion) in 10-fold cross-validation on a 238-protein dataset that we built from a common set of the SCOP and CATH classifications. The method was then applied to a larger third-party dataset of 522 proteins, where an accuracy of 62% was achieved. Our prediction results on both datasets are significantly better than those from some other methods, such as DomCut and DGS, on the same datasets, and are comparable to those from the PPRODO method, upon which the larger dataset is based. Our cross-validation results are also noticeably better than previous results from other BP neural network implementations, probably because we used more property descriptors and significantly more training nodes in our network. Furthermore, our relative importance analysis reveals that HSNC (percentage of helix and strand residues from the N- and C-termini), RPPI (the relative position probability index), and the relative accessibility are the top three descriptors, even though the differences among these descriptors are not large. These results also indicate that the domain boundary information (and perhaps other structural information as well) is often shared among many of these descriptors. Thus, more and better descriptors are still needed to further improve the accuracy.
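The ±20 criterion is usually taken to mean that a predicted boundary counts as correct when it lies within 20 residues of the annotated boundary. The Python sketch below scores predictions under that assumption; the function names and the example residue positions are hypothetical and serve only to illustrate the scoring rule, not to reproduce the original evaluation code.

```python
# Minimal sketch of scoring with a +/-20 residue tolerance.
# Function names and example positions are illustrative assumptions.

def is_correct(predicted, annotated, tolerance=20):
    """True if the predicted boundary is within `tolerance` residues."""
    return abs(predicted - annotated) <= tolerance


def accuracy(predictions, annotations, tolerance=20):
    """Percentage of proteins whose predicted boundary meets the criterion."""
    if not predictions or len(predictions) != len(annotations):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(
        is_correct(p, a, tolerance) for p, a in zip(predictions, annotations)
    )
    return 100.0 * hits / len(predictions)


# Hypothetical boundary positions (residue indices) for four two-domain proteins.
predicted = [118, 240, 75, 160]
annotated = [130, 205, 95, 161]

print(f"accuracy = {accuracy(predicted, annotated):.1f}%")
```

In this toy example three of the four predictions fall within 20 residues of the annotated boundary, so the accuracy is 75%.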

Future work will include the extension of the current method to multi-domain proteins and the design of new independent, orthogonal property descriptors (not included in the current set). Future work will also investigate the possible improvement in accuracy from adding similarity searches and multiple sequence alignments.

ACKNOWLEDGMENTS

The authors thank Jingyuan Li for many helpful discussions and Huajun Chen for help with the BP neural network implementation.

REFERENCES

1. Rose GD. Hierarchic organization of domains in globular proteins. J Mol Biol 1979;134:447–470.
2. Kong L, Ranganathan S. Delineation of modular proteins: domain boundary prediction from sequence information. Brief Bioinform 2004;5:179–192.
3. Zhang Y, Chandonia J-M, Ding C, Holbrook SR. Comparative mapping of sequence-based and structure-based protein domains. BMC Bioinform 2005;6:77–92.
4. Ponting CP, Russell RR. The natural history of protein domains. Ann Rev Biophys Biomol Struct 2002;31:45–71.
5. George RA, Lin K, Heringa J. Scooby-domain: prediction of globular domains in protein sequence. Nucleic Acids Res 2005;33:W160–163.


6. Xu Y, Xu D, Gabow HN. Protein domain decomposition using a graph-theoretic approach. Bioinformatics 2000;16:1091–1104.
7. Pugalenthi G, Archunan G, Sowdhamini R. Dial: a web-based server for the automatic identification of structural domains in proteins. Nucleic Acids Res 2005;33:W130–132.
8. Taylor WR. Protein structural domain identification. Prot Eng 1999;12:203–216.
9. Murzin AG, Brenner SE, Hubbard T, Chothia C. Scop: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995;247:536–540.
10. Nagarajan N, Yona G. Automatic prediction of protein domains from sequence information using a hybrid learning system. Bioinformatics 2004;20:1335–1360.
11. Gouzy J, Corpet F, Kahn D. Whole genome protein domain analysis using a new method for domain clustering. Comput Chem 1999;23:333–340.
12. Sonnhammer EL, Kahn D. Modular arrangement of proteins as inferred from analysis of homology. Prot Sci 1994;3:482–492.
13. Gracy J, Argos P. Automated protein sequence database classification. Bioinformatics 1998;14:164–173.
14. Altschul S, Madden T, Shaffer A, Zhang J, Zhang Z. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–3402.
15. Sonnhammer ELL, Eddy SR, Birney E, Bateman A, Durbin R. Pfam: multiple sequence alignments and hmm-profiles of protein domains. Nucleic Acids Res 1998;26:320–322.
16. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O. Tigrfams: a protein family resource for the functional identification of proteins. Nucleic Acids Res 2001;29:41–43.
17. Ponting CP, Schultz J, Milpetz F, Bork P. Smart: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res 1999;27:229–232.
18. Kim DE, Chivian D, Malmstrom L, Baker D. Automated prediction of domain boundaries in casp6 targets using ginzu and rosettadom. Proteins 2005;S7:193–200.
19. George RA, Heringa J. Snapdragon: a method to delineate protein structural domains from sequence data. J Mol Biol 2002;316:839–851.
20. Rigden DJ. Use of covariance analysis for the prediction of structural domain boundaries from multiple protein sequence alignments. Prot Eng 2002;15:65–77.
21. Kuroda Y, Matsuo Y, Yokoyama S. Automated search of natively folded protein fragments for high-throughput structure determination in structural genomics. Prot Sci 2000;9:2313–2321.
22. George RA, Heringa J. Protein domain identification and improved sequence similarity searching using psi-blast. Proteins 2002;48:672–681.
23. Wheelan SJ, Marchler-Bauer A, Bryant SH. Domain size distribution can predict domain boundaries. Bioinformatics 2000;16:613–618.
24. Suyama M, Ohara O. Domcut: prediction of inter-domain linker regions in amino acid sequences. Bioinformatics 2003;19:673–674.
25. Galzitskaya OV, Melnik BS. Prediction of protein domain boundaries from sequence alone. Prot Sci 2003;12:696–701.
26. Liu J, Rost B. Sequence-based prediction of protein domains. Nucleic Acids Res 2004;32:3522–3530.
27. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. Cath–a hierarchic classification of protein domain structures. Structure 1997;5:1093–1108.
28. Sim J, Kim S-Y, Lee J. Pprodo: prediction of protein domain boundaries using neural networks. Proteins 2005;59:627–632.
29. Day R, Beck DA, Armen RS, Daggett V. A consensus view of fold space: combining scop, cath, and the dali domain dictionary. Prot Sci 2003;12:2150–2160.
30. Mika S, Rost B. UniqueProt: creating representative protein sequence sets. Nucleic Acids Res 2003;31:3789–3791.
31. Cheng J, Sweredoski MJ, Baldi P. Dompro: protein domain prediction using profiles, secondary structure, relative solvent accessibility, and recursive neural networks. Data Min Knowl Discov 2005;13:1–10.
32. Cuff JA, Clamp ME, Siddiqui AS, Finlay M, Barton GJ. Jpred: a consensus secondary structure prediction server. Bioinformatics 1998;14:892–893.
33. Chen H, Zhou HX. Prediction of solvent accessibility and sites of deleterious mutations from protein sequence. Nucleic Acids Res 2005;33:3193–3199.
34. Frishman D, Argos P. Knowledge-based secondary structure assignment. Proteins 1995;23:566–579.
35. Hirakawa H, Muta S, Kuhara S. The hydrophobic cores of proteins predicted by wavelet analysis. Bioinformatics 1999;15:141–148.
36. Vihinen M, Torkkila E, Riikonen P. Accuracy of protein flexibility predictions. Proteins 1994;19:141–149.
37. Dumontier M, Yao R, Feldman HJ, Hogue CWV. Armadillo: domain boundary prediction by amino acid composition. J Mol Biol 2005;350:1061–1073.
38. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982;157:105–132.

