PROTEINS: STRUCTURE • FUNCTION • BIOINFORMATICS
Sequence-based protein domain boundary prediction using BP neural network with various property profiles
Lei Ye,1 Ting Liu,1 Zhaohui Wu,1 and Ruhong Zhou2,3*
1Department of Computer Science, Zhejiang University, Hangzhou, China
2 IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598
3Department of Chemistry, Columbia University, New York, New York 10027
*Correspondence to: Ruhong Zhou, Computational Biology Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598. E-mail: [email protected]
Received 12 February 2007; Revised 12 June 2007; Accepted 21 July 2007
Published online 11 October 2007 in Wiley InterScience (www.interscience.wiley.com).
DOI: 10.1002/prot.21745

ABSTRACT
Given the rapid growth in the number of sequences without known structures, it is becoming increasingly important not only to accurately define protein structural domains but also to predict domain boundaries from the amino-acid sequence alone. In this article, we present a Back-Propagation (BP) neural network method using 9 different sequence profiles, based on chemical, physical, and statistical properties, to predict the domain boundary of two-domain proteins from one-dimensional sequences. We have achieved an accuracy of 69% with a 10-fold cross validation on a 238 nonredundant two-domain protein dataset that we built based on a common set from both SCOP and CATH classifications. The method has also been applied to a larger third-party dataset with 522 proteins, and an accuracy of 62% has been achieved. Our prediction results on both datasets are found to be significantly better than those from some other methods, such as DomCut and DGS, on the same datasets, and also comparable to that from the PPRODO method, upon which the larger dataset was based. Our cross validation results are also noticeably better than previous ones from other BP neural network methods, probably because we have used more property descriptors with significantly more training nodes in our neural network. The integration with the PPRODO method also indicates that the information obtained from our current approach is complementary to that available through multiple sequence alignments. Moreover, the relative importance of each property profile has been analyzed in detail.
Proteins 2008; 71:300–307. © 2007 Wiley-Liss, Inc.
Key words: protein domains; domain boundary prediction; neural network; sequence profiling.

INTRODUCTION
Identification of protein domain boundaries is very important in many protein studies; however, domain boundary prediction from the one-dimensional (1D) amino acid sequence is still one of the most challenging problems remaining in molecular biology. A protein domain is often considered the fundamental element of protein structure, function, and evolution.1–3 It is believed that a protein domain can fold independently or semi-independently into a stable and compact structure, which might exhibit a rich evolutionary history and a specialized molecular function.4 A typical protein may comprise a single domain or several domains, which are not necessarily contiguous. Given the observed random distribution of hydrophobic residues in protein sequences, domain formation appears to be the optimal solution for a large protein to bury its hydrophobic residues while keeping hydrophilic ones near the surface.5
Accurately defining protein structural domains based on the three-dimensional (3D) tertiary structure itself is a difficult problem,6–8 and is currently best done manually by experts, with the SCOP domain classification as an excellent example.9 Predicting the domain boundary from 1D sequences alone is even more challenging. Many methods have been proposed to address this important problem. According to Nagarajan and Yona,10 there are five different categories of such methods. Here, we only briefly list these methods; interested readers can refer to Ref. 10 for more details.
i. Methods based on similarity search. These methods, such as MKDOM,11 Domainer,12 and DOMO,13 either do an all-versus-all BLAST14 search to identify segment pairs with a high degree of homology11,12 or cluster sequences into groups by comparing their amino acid and dipeptide composition.13 They are often used to partition all proteins within a database into domains, but they are, in general, less accurate because of their heuristic nature.
ii. Methods based on expert knowledge. These methods rely on expert knowledge of protein families to construct models like hidden Markov models (HMM) and artificial neural networks to identify other members of the family. PFam A,15 TigrFam,16 and SMART17 fall in this category. These methods, though more accurate for certain families, are often limited in their ability to predict less known or unknown families.
iii. Methods based on predicted 3D information. These methods try to predict 3D tertiary structures first and then assign domain boundaries. RosettaDOM,18 SnapDragon,19 and Rigden’s covariance analysis20 are examples of this approach. These methods use 3D structural information but are often computationally more expensive.
iv. Methods based on multiple sequence alignment. These methods, such as PASS,21 Domination,22 and Yona’s hybrid method,10 use multiple sequence alignments to predict domain boundaries.
v. Other methods. These are methods that do not fall into the previous categories, such as domain guess by size (DGS).23
Despite the large number of studies from several
groups, the domain boundary prediction using 1D
sequence alone is still an open problem. Most of these
methods have unsatisfactory prediction accuracies (more
later). The expert knowledge based methods, such as
TigrFam16 and SMART,17 might be more accurate for
certain families, but they often require careful manual
inspections and are generally useful for subsets of pro-
teins. In this article, we propose a Back-Propagation (BP)
neural network method based on sequence property pro-
files to predict the domain boundary. This method will
loosely fall into the first category by Nagarajan and
Yona’s classification (i.e., methods based on similarity
search). However, the similarity search is not purely based
on the amino acid sequence matching, but on the
sequence profile mapping using properties from chemi-
cal, physical, and statistical features, with significantly
higher accuracy (see later). Similar methods have been
proposed previously as well, such as DomCut24 based on
a single domain linker index, the entropy method based
on a sequence entropy profile,25 and the CHOPnet26
method based on secondary structure, solvent accessibility,
flexibility, and protein length, etc. The current method
uses nine different property descriptors based on chemical,
physical, and statistical properties with a total of 169 input
nodes in BP neural network training.
The article is organized as follows. The Methods section presents the details of the method, including the protein dataset preparation, the description of the input nodes to the neural network, and the architecture of the BP neural network. The Results and Discussion section presents and discusses the results. Finally, the Conclusion section gives concluding remarks. Given the difficulty of this
domain boundary prediction problem, we focus our cur-
rent efforts on the two-domain proteins only, to see if we
can improve the accuracy from previous methods (as
well as simple benchmarks like the Equal-split method).
Very encouraging results have been achieved on both
datasets used in our study: one built by ourselves using a
common set from both SCOP9 and CATH27 classifica-
tions and the other from the PPRODO method by Lee
and coworkers.28 We also analyzed in detail the relative
importance of each descriptor in their contribution to
the overall prediction accuracy. As shown later, the BP
neural network provides a powerful learning vehicle, given
the appropriate sequence profiles of properties related to
protein domain structures.
METHODS
Data preparation
The dataset used for training and testing in this article
consists of 238 two-domain proteins (contiguous do-
mains), which was built from a common set of both
SCOP (Structural Classification of Proteins)9 and CATH
(Class Architecture Topology Homology)27 classifications. SCOP and CATH may assign slightly different domain boundaries to the same protein, and the agreement between these two databases is only about 80%.29 A two-step process is followed here to obtain this common dataset. First, all protein chains with two contiguous domains are extracted from SCOP9 (http://scop.mrc-lmb.cam.ac.uk/scop/, version 1.69) and CATH27 (http://www.biochem.ucl.ac.uk/bsm/cath/cath.html, version 2.6.0), respectively. Following previous studies,24,25,28 proteins with a domain size ≤40 or ≥500 are eliminated because of their excessively small or large sizes (there are very few large-sized domains anyway). The second step is to extract those proteins shared by both SCOP and CATH whose domain boundaries are “close” enough. Here, we define two domain assignments as close if a given protein is classified in SCOP as $(a,b)(c,d)$ and in CATH as $(a',b')(c',d')$, where $|b - b'| < 5$ and $|c - c'| < 5$. Approximately 3500 proteins pass the above two filters. Then, the program uniqueProt30 is used to reduce the sequence redundancy in the dataset (with the HSSP-threshold set to 5).31 An HSSP-threshold value of five corresponds roughly to less than 25% sequence identity in a global alignment of a length of 250 residues.31 It should be
noted that the current dataset (and method) is address-
ing a problem of two-domain proteins only, which is a
step away from the more general problem of predicting
domain boundaries in arbitrary proteins (addressed in a
follow-up study). Nevertheless, it is still a non-trivial
problem to predict the correct domain boundary for a
known two-domain protein from its sequence. Finally,
our dataset contains 238 unique two-domain protein
chains. The distribution of the protein chain length in
our dataset is summarized in Table I.
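The two filters described above (domain-size limits and SCOP/CATH boundary agreement) can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names, the tuple representation of a domain as (first residue, last residue), and the reading of the size cut-offs as inclusive (sizes of 40 or less and 500 or more are dropped) are all assumptions.

```python
from typing import Tuple

Domain = Tuple[int, int]              # hypothetical representation: (first, last) residue, inclusive
TwoDomains = Tuple[Domain, Domain]    # ((a, b), (c, d))

def domain_size_ok(domain: Domain, min_size: int = 40, max_size: int = 500) -> bool:
    """Keep a domain only if its size lies strictly between the two cut-offs (assumed inclusive cut-offs)."""
    start, end = domain
    size = end - start + 1
    return min_size < size < max_size

def boundaries_close(scop: TwoDomains, cath: TwoDomains, tol: int = 5) -> bool:
    """SCOP (a,b)(c,d) and CATH (a',b')(c',d') agree if |b - b'| < tol and |c - c'| < tol."""
    (_, b), (c, _) = scop
    (_, b2), (c2, _) = cath
    return abs(b - b2) < tol and abs(c - c2) < tol

def keep_protein(scop: TwoDomains, cath: TwoDomains) -> bool:
    """Apply both filters: every domain has an acceptable size and the two assignments agree."""
    return all(domain_size_ok(d) for d in (*scop, *cath)) and boundaries_close(scop, cath)
```

For example, a chain assigned as (1,100)(101,220) in SCOP and (1,103)(104,220) in CATH would pass, since both boundary differences are 3 residues.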
Inputs of the neural network
There are a total of nine sequence profile descriptors
(indices) based on physical, chemical, and statistical prop-
erties in this study. We used a sliding-window size of 11
residues to smooth out the property profiles along the
amino acid sequence. Since the domain boundary typically does not appear near the N- or C-terminus, the central residue of the window starts at the 26th residue from the N-terminus (i.e., the first 25 residues are ignored) and ends at the 26th residue from the C-terminus. For
each residue in this local sequence window, we calculated
the following 8 descriptors first (see below for the 9th
one): secondary structure (3 nodes, representing helix,
strand, and coil), relative solvent accessibility (1 node),
domain linker index, and averaged domain linker index (2
nodes), flexibility index and averaged flexibility index (2
nodes), hydrophobicity index and averaged hydrophobic-
ity index (2 nodes), entropy index (based on side-chain
entropy—it is the physical entropy not the statistical en-
tropy used in multiple sequence alignments) and averaged
entropy index (2 nodes), averaged hydrophobicity of resi-
dues near N- and C-terminals (2 nodes), and relative posi-
tion probability index (1 node). There are a total of 15
nodes from these eight descriptors for each residue in the
11-residue-sized window sliding average.
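As a rough illustration of the windowing described above, the following Python sketch smooths a per-residue property profile with an 11-residue window and enumerates the allowed window centers. The function names are hypothetical, and two details are assumptions rather than statements from the text: that the "averaged" indices are plain centered window means, and that the window is simply truncated at the chain ends.

```python
from typing import List

def sliding_average(profile: List[float], window: int = 11) -> List[float]:
    """Average each residue's value over a centered window, truncated at the chain ends (assumed)."""
    half = window // 2
    smoothed = []
    for i in range(len(profile)):
        lo = max(0, i - half)
        hi = min(len(profile), i + half + 1)
        smoothed.append(sum(profile[lo:hi]) / (hi - lo))
    return smoothed

def central_positions(chain_length: int, skip: int = 25) -> range:
    """1-based window centers: from residue 26 to the 26th residue counted from the C-terminus."""
    return range(skip + 1, chain_length - skip + 1)
```

For a 100-residue chain, `central_positions(100)` covers residues 26 through 75, i.e. 50 candidate boundary positions.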
For better training, we used the observed secondary
structures and solvent accessibilities for training instead
of the predicted ones, even though the current method
does not require structural data in actual prediction—
only the sequence data is needed. Fairly accurate (~80%)
methods exist for the prediction of secondary struc-
tures32 and solvent accessibilities33 from amino acid
sequences. Here, the secondary structures and solvent
accessibilities are calculated from the PDB files using the
program STRIDE34 in our training and testing. Those
residues with missing coordinates in PDB files are
removed from the protein chain. Three nodes encode the
secondary structures as ‘‘helix’’, ‘‘strand’’, or ‘‘coil,’’ and
one node encodes the relative solvent accessibility as
‘‘buried’’ or ‘‘exposed’’ (under or over 20% of the total
surface area of each residue).34,35 The domain linker
index presented in DomCut,24 the flexibility index by
Vihinen,36 the hydrophobicity index and the entropy
index by Armadillo37 (which combines the indices from
the work of Kyte38 and Galzitskaya25), are employed in
our method. The averaged hydrophobicities of the residues near the N- and C-termini (AHNC) are defined as follows: $AH_N(i) = \frac{1}{i-1}\sum_{n=1}^{i-1} h(n)$ and $AH_C(i) = \frac{1}{L-i}\sum_{n=i+1}^{L} h(n)$, where $i$ is the index of the $i$th residue in the protein chain, $L$ is the length of the protein chain, and $h(n)$
is the hydrophobicity index of the nth residue. The rela-
tive position probability index (RPPI) indicates the prob-
ability of a relative position, 0 (N-terminal) to 1 (C-ter-
minal), being the domain boundary.
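The AHNC definition above amounts to two simple means over the residues on either side of position i. A minimal Python sketch follows; the function name and the list-based representation of the hydrophobicity profile are assumptions made for illustration.

```python
from typing import List, Tuple

def ahnc(h: List[float], i: int) -> Tuple[float, float]:
    """Return (AH_N(i), AH_C(i)) for the 1-based residue index i.

    h[n-1] holds the hydrophobicity index h(n) of the nth residue.
    """
    length = len(h)
    if not 1 < i < length:
        raise ValueError("i must have at least one residue on each side")
    ah_n = sum(h[: i - 1]) / (i - 1)     # residues 1 .. i-1
    ah_c = sum(h[i:]) / (length - i)     # residues i+1 .. L
    return ah_n, ah_c
```

Because window centers start at the 26th residue and end at the 26th residue from the C-terminus, both denominators are always at least 25 in practice.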
To describe the RPPI more accurately and to avoid dominance by proteins of a single size, we generated the RPPI separately for different size groups. We subdivided the protein size space (50–800) equally into 15 groups, each spanning 50 residues. The 238 proteins in the dataset are then binned into these 15 groups based on their sizes (see Table I).
Each group’s normalized range (0,1) is further binned
into 20 subsections, with each protein falling into one of
these subsections. The normalized domain boundary
position for each protein can then be calculated as
(boundary position)/(chain length) and binned into these
20 subsections in its respective size group (the exact number of bins matters little; we also tried 15 bins and the results changed only slightly). The relative position probability index of the $i$th subsection in the $n$th group is thus defined as $\mathrm{RPPI}(i,n) = N(i)/TN(n)$, where $N(i)$ is the number of proteins in that group whose relative boundary position falls into the $i$th subsection, and $TN(n)$ is the total number of proteins in this group. Part
of the RPPI extracted from the 238-protein dataset is dis-
played in Figure 1. This figure indicates that although a
large portion of the proteins have a domain boundary
near the middle of their chain lengths, there are still
many proteins whose domain boundary positions are not
near the middle of their chain lengths (RPPI ≤0.3 or >0.7). This can also be seen from a simple equal-split prediction (assuming the domain boundary is in the center of the protein sequence, thus splitting the protein into two equal parts): only ~50% accuracy is achieved, while our current method achieves an accuracy of about 69% (see later). We have also tried subdividing the size space (50–800) into several other groups,
such as 1, 8, or 30 groups, and the final accuracy results show only small differences, with the 15 groups displaying slightly better results than the others. Thus, in the following results section, we will use 15 groups for the RPPI generation, while for all the other descriptors, only one group (i.e., with all the proteins) is used.

Table I. The Distribution of the Protein Chain Length in Our 238 Nonredundant Two-Domain Protein Database

Group   Chain length range   Number of proteins
1       50–100               2
2       100–150              29
3       150–200              52
4       200–250              55
5       250–300              20
6       300–350              26
7       350–400              18
8       400–450              18
9       450–500              9
10      500–550              6
11      550–600              2
12      600–650              0
13      650–700              0
14      700–750              1
15      750–800              0

The chain length space (50–800) is subdivided equally into 15 groups, each having a size range of 50. Each protein in the dataset is then binned into these 15 groups based on its chain length.
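The RPPI construction described above (size groups, 20 relative-position bins, and per-group normalization) can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: the function names are hypothetical, groups are indexed from 0, and a chain length that falls exactly on a group edge is assigned to the upper group, which the text does not specify.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def size_group(chain_length: int, low: int = 50, width: int = 50, n_groups: int = 15) -> int:
    """0-based size group: lengths 50-99 -> 0, 100-149 -> 1, ..., 750-800 -> 14 (edge handling assumed)."""
    group = (chain_length - low) // width
    return min(max(group, 0), n_groups - 1)

def rppi_table(proteins: Iterable[Tuple[int, int]], n_bins: int = 20) -> Dict[int, List[float]]:
    """Build RPPI(i, n) = N(i) / TN(n) from (boundary position, chain length) pairs."""
    counts: Dict[int, List[int]] = defaultdict(lambda: [0] * n_bins)
    for boundary, length in proteins:
        # Integer arithmetic avoids floating-point surprises at bin edges.
        bin_index = min(boundary * n_bins // length, n_bins - 1)
        counts[size_group(length)][bin_index] += 1
    return {group: [c / sum(row) for c in row] for group, row in counts.items()}
```

Each row of the resulting table sums to 1, so the value looked up for a residue is the observed fraction of same-sized training proteins whose boundary fell in that relative-position bin.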
The last descriptor (the 9th), called HSNC, is computed for the central residue of the window only; it is the percentage of helix and strand residues on the N- and C-terminal sides (4 nodes). The four values are calculated as follows: $\mathrm{Helix}_N(i) = H_N(i)/(i-1)$, $\mathrm{Helix}_C(i) = H_C(i)/(L-i)$, $\mathrm{Strand}_N(i) = S_N(i)/(i-1)$, and $\mathrm{Strand}_C(i) = S_C(i)/(L-i)$, where $H_N(i)$ and $S_N(i)$ are the numbers of helix and strand residues in the region from the N-terminus to the $(i-1)$th residue, and $H_C(i)$ and $S_C(i)$ are the numbers of helix and strand residues in the region from the $(i+1)$th residue to the C-terminus. Out of the nine descriptors, the secondary structure index, relative accessibility
index, linker index, flexibility index, hydrophobicity
index, and entropy index or their variations have been
used in previous studies, with either one index24,25 or
some combinations of a few,26 while the current RPPI
index, AHNC index, and HSNC index are newly designed to capture the underlying physics in the structural features of the domain boundary. The RPPI index measures the relative size or balance between the two domains, the AHNC index indicates the fluctuation of the average hydrophobicity measured from both termini, and the HSNC index measures the percentage of α-helical and β-strand residues from both termini. Table II summarizes all nine descriptors used in this study.
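The four HSNC values can be computed directly from a per-residue secondary-structure string. The sketch below is illustrative only: the function name and the one-letter encoding ('H' for helix, 'E' for strand, anything else for coil) are assumptions, not part of the original method description.

```python
from typing import Tuple

def hsnc(ss: str, i: int) -> Tuple[float, float, float, float]:
    """Return (Helix_N, Helix_C, Strand_N, Strand_C) for the 1-based residue index i.

    ss is a secondary-structure string with one letter per residue
    (assumed encoding: 'H' = helix, 'E' = strand, other = coil).
    """
    length = len(ss)
    if not 1 < i < length:
        raise ValueError("i must have at least one residue on each side")
    n_side, c_side = ss[: i - 1], ss[i:]          # residues 1..i-1 and i+1..L
    return (
        n_side.count("H") / (i - 1),
        c_side.count("H") / (length - i),
        n_side.count("E") / (i - 1),
        c_side.count("E") / (length - i),
    )
```

For the toy string `"HHEECHHHEE"` and `i = 5`, the N-terminal side is half helix and half strand, while the C-terminal side is three-fifths helix and two-fifths strand.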
Neural network architecture
The standard Back-Propagation feed-forward artificial neural network is used in our method. The network has 15 × 11 + 4 = 169 input nodes (15 nodes from the first eight descriptors for each residue in the 11-residue-sized window, plus 4 HSNC nodes from the ninth descriptor for the central residue), a single hidden layer of 5 nodes, and 1 node in the output layer. A schema of the BP neural network architecture is shown in Figure 2. The output node indicates whether or not the central residue in the window is a boundary residue. The residue with the maximum output score is classified as the boundary of the protein chain. Similar to the criterion used in previous studies,10,26,28 any prediction within ±20 residues of the true domain boundary residue is considered a success.

Table II. The List of All Descriptors Employed in This Work

Descriptor               Nodes   Remark
Secondary structure      3       Helix, strand, or coil
Relative accessibility   1       Exposed or buried
Linker index             2       Linker index w/ or w/o average
Flexibility              2       Flexibility index w/ or w/o average
Hydrophobicity           2       Hydrophobicity index w/ or w/o average
Entropy                  2       Entropy index w/ or w/o average
AHNC                     2       Averaged hydrophobicity of residues near the N- and C-terminals
RPPI                     1       Probability of a relative position being domain boundary
HSNC                     4       Percentage of helix and strand residues from the N- and C-terminals

The BP network has a total of 15 × 11 + 4 = 169 input nodes (15 nodes from the first eight descriptors for each residue in the 11-residue-sized window, plus 4 HSNC nodes from the ninth descriptor for the central residue only).

Figure 1. The distribution of proteins versus the relative position probability index (RPPI). Three representative size groups, groups 3, 4, and 5, from our 238-protein dataset, are shown. It indicates that even though a large portion of the proteins have a domain boundary near the middle of their chain lengths, there are still many proteins (total 26) whose domain boundary positions are not near the middle of their chain lengths (RPPI ≤0.3 or ≥0.7).

Figure 2. Architecture of the BP neural network. With a window size of 11, each residue in the window has 15 nodes (11 × 15 nodes), and the central residue of the window has an additional four nodes, which gives a total of 169 nodes in the neural network.
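To make the architecture and the success criterion described above concrete, here is a minimal pure-Python sketch of a 169-5-1 feed-forward pass, the choice of the highest-scoring residue, and the ±20-residue check. It is not the authors' implementation: the function names are hypothetical, the sigmoid activation is an assumption (the text only says "standard" BP network), and training by back-propagation is not shown.

```python
import math
from typing import Dict, List, Sequence

N_INPUT = 15 * 11 + 4   # 169 input nodes
N_HIDDEN = 5            # single hidden layer

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def forward(x: Sequence[float], w_hidden: List[List[float]], b_hidden: List[float],
            w_out: List[float], b_out: float) -> float:
    """One forward pass: 169 inputs -> 5 hidden nodes -> 1 output score (sigmoid assumed)."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

def predict_boundary(scores: Dict[int, float]) -> int:
    """The residue position with the maximum output score is taken as the boundary."""
    return max(scores, key=scores.get)

def is_success(predicted: int, true_boundary: int, tol: int = 20) -> bool:
    """A prediction within +/- tol residues of the true boundary counts as a success."""
    return abs(predicted - true_boundary) <= tol
```

In use, `forward` would be evaluated once per allowed window center, the scores collected in a dictionary keyed by residue position, and `predict_boundary` applied to that dictionary.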
To evaluate the performance of our method based on
the BP network, a 10-fold cross validation is performed.
The dataset is divided randomly into 10 subsets: 9 sets for training and 1 set for testing (jackknife test). Ten independent calculations are performed so that each subset is used as the testing set once. Since the neural network is initialized with random weights and biases, up to 20 different neural networks are trained in each independent cross-validation calculation as a robustness test, and the best set of weights (those corresponding to the best prediction accuracy) is taken as the neural network for prediction.
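The cross-validation protocol above can be outlined in Python as follows. This is a hedged sketch: `ten_fold_splits` and `best_of_restarts` are hypothetical names, `train_fn` and `score_fn` are placeholder callbacks standing in for network training and accuracy evaluation, and the text does not state on which data the "best prediction accuracy" is measured, so that choice is left to the caller.

```python
import random
from typing import Callable, Iterator, List, Tuple, TypeVar

Model = TypeVar("Model")

def ten_fold_splits(n_items: int, n_folds: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Randomly partition item indices into folds; yield (train, test) with each fold used once for testing."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]

def best_of_restarts(train_fn: Callable[[random.Random], Model],
                     score_fn: Callable[[Model], float],
                     n_restarts: int = 20, seed: int = 0) -> Tuple[float, Model]:
    """Train up to n_restarts randomly initialized models and keep the best-scoring one."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        model = train_fn(rng)          # hypothetical: trains one network from random weights
        score = score_fn(model)        # hypothetical: prediction accuracy of that network
        if best is None or score > best[0]:
            best = (score, model)
    return best
```

With 238 proteins, this split yields eight test folds of 24 proteins and two of 23.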
RESULTS AND DISCUSSION
Performance on the common SCOP and CATH dataset
In this study, a successful domain-boundary prediction means that the predicted domain boundary residue is within a ±20 residue window of the “correct” domain boundary, which is assigned by the SCOP9 and CATH27 classifications (as aforementioned, we chose only those proteins with a common or close-enough assignment from both classifications). The 10-fold cross-validation
prediction results for our 238 protein data set are shown
in Table III. We have achieved a 69.3% accuracy with a
window size 11 for the 238 protein set. We have also tried
window sizes of 7, 15, 19, as well as 23, and similar 10-
fold cross validation results are summarized in Table III as
well. It shows that window size 11 has the best overall per-
formance. Of course, these results are not that much dif-
ferent across all the window sizes tested here, indicating
that the results are reasonably robust with regard to differ-
ent window sizes. This 69.3% accuracy is noticeably
higher than previous BP neural network results, such as
about 50% accuracy in the CHOPnet26 for a similar data-
set (see later). The reason for this could be the fact that
we have used more property descriptors with significantly
more training nodes in our neural network (169 nodes vs.
57 nodes in CHOPnet26). This accuracy is also signifi-
cantly higher than that from a simple equal-split predic-
tion, i.e., assuming the domain boundary is in the center
of the protein sequence, thus splitting the protein into
two equal parts. As aforementioned, only ~50% accuracy is achieved by this equal-split method (with similar results for the 522-protein dataset below).
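The equal-split baseline mentioned above is easy to state in code. The sketch below is illustrative: the function names are hypothetical, the midpoint is rounded down (a detail the text does not specify), and the toy numbers in the usage note are made up for demonstration rather than taken from either dataset.

```python
from typing import Sequence

def equal_split_prediction(chain_length: int) -> int:
    """Predict the boundary at the middle of the chain (rounded down; rounding is an assumption)."""
    return chain_length // 2

def accuracy(predictions: Sequence[int], truths: Sequence[int], tol: int = 20) -> float:
    """Percentage of predictions within +/- tol residues of the true boundary."""
    hits = sum(abs(p - t) <= tol for p, t in zip(predictions, truths))
    return 100.0 * hits / len(truths)
```

For three toy chains of lengths 200, 300, and 100 with true boundaries at 110, 100, and 50, the baseline predicts 100, 150, and 50, and two of the three predictions fall within the ±20 window.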
As for comparison, we also applied the DomCut24 and
DGS23 methods to our current dataset (these are the
ones freely available to us on the web). In general, it is
exceptionally difficult to compare accuracies across differ-
ent methods published in literature, given the differences
in datasets, domain linker definitions, and evaluation criteria. Here, for DomCut, the predicted boundary is the residue with the lowest value in its linker preference profile, as recommended, and the linker preference profile comes from the DomCut server (http://www.bork.embl-heidelberg.de/~suyama/domcut/). The prediction accuracy of DomCut is only 30.67% with the same ±20 criterion (given its simplicity, with only one linker index, the results are not that bad). The predicted boundary of DGS is taken from its first prediction that assigns the protein as a two-continuous-domain protein. The DGS program was downloaded from NCBI (ftp://ftp.ncbi.nih.gov/pub/wheelan/). The prediction accuracy of DGS is 41.60%, again with the ±20 criterion. The accuracies of these methods are significantly lower than that of our current BP network method. Our higher performance than that of the
DomCut method is not too surprising, since only the
linker index is used in the DomCut method, while eight
more descriptors are used in our current method in
addition to the linker index. The relatively low perform-
ance of the DGS method, on the other hand, is probably
related to the simple approach used in DGS—only a distri-
bution of domain lengths is used.23 To some extent, DGS
is similar to our RPPI index in the underlying physics and
chemistry. Given the seemingly random distribution of
hydrophobic and hydrophilic residues in the sequence, it
takes some balance and certain size for a protein to form
individual domains by burying its hydrophobic residues in
the core while exposing the hydrophilic residues to the sur-
face at the same time (although it has low accuracy, it is a
neat idea).
Table III. Accuracy from the 10-Fold Cross-Validation Calculations on the 238 Protein Dataset (Built from the Common Sets from Both SCOP and CATH Classifications)

Window size   Accuracy (%)
7             67.59
11            69.27
15            66.34
19            66.34
23            66.32

The results from other window sizes are also shown.

Performance on the PPRODO dataset
To further evaluate the performance of our current method, we then carried out the same 10-fold cross validation calculations on another larger, third-party dataset, which was proposed recently by Sim et al.28 along with the PPRODO method. This PPRODO method is based on the hypothesis that the domain boundaries can be detected by investigating the sequence evolutionary information throughout the process of gene–exon shuffling. It utilizes the position-specific scoring matrix (PSSM) generated
from PSI-BLAST14 search to train their neural network.28
The associated dataset consists of 522 two-domain pro-
teins, which are extracted from SCOP9 database with less
than 30% sequence identities. It was reported that the pre-
diction accuracies of PPRODO, DGS, and DomCut on this
dataset are 65.5%, 41.7%, and 27.1% respectively.28 Again,
with a window size of 11 and a ±20 criterion, we have
obtained the accuracy of 62.0% for the PPRODO dataset.
Our prediction accuracy is slightly lower than that of
PPRODO. It should be noted that although PPRODO has
obtained a slightly higher accuracy of 65.5% on this 522
protein dataset, this high accuracy probably relies heavily
on the PSSM matrix from the expensive multiple sequence
alignments. However, no multiple sequence alignment is
needed or used in our method. Our future work might
include this multiple sequence alignment information as
well, to further improve the accuracy.
Again, the DGS23 and DomCut24 methods achieved significantly lower performance than our current method, with accuracies of 41.7% and 27.1%, respectively, versus our 62.0%.
Another interesting idea is to combine the PPRODO method with our current neural network method to get the best of both worlds: the benefit of the physical/chemical properties and that of the multiple sequence alignment. The results show that we can indeed improve the prediction accuracy by combining these two methods. We downloaded the PPRODO program from the website, http://gene.kias.re.kr/jlee/pprodo/ (as well as the aforementioned PPRODO dataset). A simple approach is used for the combination: (i) if both methods predict the same boundary (within the criterion used, ±20 residues), we take our prediction as is; and (ii) if the two methods predict different boundaries, we take the average position of the two. The reasoning is simple: if the two predicted boundaries are the same or close enough, there is a high probability that each method gets it right; on the other hand, if they are very different, it is more likely that both are wrong, so we take the average of the two to improve the odds. For our 238-protein dataset, the final accuracy improved from 69.3% to 74.2% with this combined approach, and for the 522-protein PPRODO dataset, it improved from 62.0% to 69.2%. These results indicate that a combined approach does take advantage of both methods.
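The combination rule described above reduces to a few lines. In this sketch the function name is hypothetical, the agreement window is treated as inclusive (a difference of exactly 20 residues counts as agreement), and the averaged position is rounded down; neither detail is specified in the text.

```python
def combine_predictions(ours: int, pprodo: int, tol: int = 20) -> int:
    """Combine two boundary predictions.

    (i) If the two predictions agree within +/- tol residues, keep our prediction.
    (ii) Otherwise, take the average position of the two (rounded down; an assumption).
    """
    if abs(ours - pprodo) <= tol:
        return ours
    return (ours + pprodo) // 2
```

For example, predictions of 100 and 115 give 100, whereas predictions of 100 and 160 give 130.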
Relative importance of descriptors
As mentioned earlier, we employed nine descriptors to
predict the domain boundary in a protein chain. It is of
great interest to investigate what the relative importance is
for each descriptor and which ones contribute most to the
final accuracy. We thus perform another nine similar 10-
fold cross validations but with one less descriptor each
time (i.e., removing one descriptor from the total nine).
The final prediction accuracy results are summarized in
Table IV. Figure 3 also shows some detailed results (as well
as statistical variations, more later) of the 10-fold valida-
tion calculations. The HSNC (percentage of a-helix and
b-strand residues from the N- and C-terminals), RPPI
(the relative position probability index), and the relative
solvent accessibility are found to be the top three descrip-
tors. The importance of the RPPI index and solvent acces-
sibility index might make sense, since for a two-domain
protein to show a stable and well-defined structure, the
relative domain sizes might be somewhat balanced, and
the inter-domain region will likely be buried from the sol-
vent. However, the underlying physics of the importance
Table IVThe Analysis of the Relative Importance of the Nine Descriptors
Descriptor removedCross-validation(238 dataset %)
Cross-validation(PPRODO dataset %)
None 69.27 62.01- flexibility 68.44 60.47- entropy 68.02 60.43- hydrophobicity 68.01 60.03- secondary structure 67.23 59.27- linker 66.76 61.21- AHNC 65.51 59.85- relative accessibility 65.11 57.53- RPPI 64.24 59.28- HSNC 63.82 57.72
These 10-fold cross-validation results are obtained by removing one and only one
descriptor from the input data each time.
Figure 3The statistical variation of the prediction accuracy in the 10-fold cross
validation, when RPPI, HSNC, secondary structure, or none, is removed from
input data. Again, this is for our 238-protein dataset. The relatively large
statistical variation (5–8% standard deviations) is related to the small size of
the test set (an average of 24 proteins)—a single protein mis-prediction will
result in a 4.2% drop in the accuracy. Thus, the 5–8% standard deviation
indicates a 1–2 proteins variation in the total number of correctly predicted
proteins, which is not too bad. See text for more discussions.
Sequence-based Protein Domain Boundary Prediction
PROTEINS 305
![Page 7: Sequence-based protein domain boundary prediction using BP neural network with various property profiles](https://reader031.vdocuments.us/reader031/viewer/2022020517/5750251d1a28ab877eb236e7/html5/thumbnails/7.jpg)
of the HSNC index (percentage of a-helix and b-strandresidues from the N- and C-terminals) is not immediately
clear—maybe the number of well-defined secondary struc-
ture residues (a-helix or b-sheet) need to be balanced
somehow in the two-domain protein structures as well.
The final prediction accuracy drops about 5.5% (from
69.3% to 63.8%) if HSNC is removed from the input
data. However, the differences among these descriptors are
not that large, with the drop in accuracy ranging from
0.8% (removing flexibility) to 5.5% (removing HSNC).
These results indicate that the domain boundary information
(and maybe other structural information as well) is
shared among many of these descriptors; for example, the
hydrophobicity index and relative solvent accessibility
might be closely correlated. To further complicate the
situation, the slight differences among these descriptors
might be buried in the noise from the random initialization
and the training mechanism of the BP neural network. These
results also indicate that, in order to further improve the
accuracy, more and better descriptors are still needed.
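The ranking discussed above follows directly from the 238-dataset column of Table IV. As a minimal sketch (the names `BASELINE`, `ACCURACY_WITHOUT`, and `accuracy_drops` are illustrative and not part of the original work), the drop in accuracy for each removed descriptor can be tabulated as follows:

```python
# Cross-validation accuracies (%) on the 238-protein dataset, copied from Table IV.
# BASELINE is the accuracy with all nine descriptors present ("None" removed).
BASELINE = 69.27
ACCURACY_WITHOUT = {
    "flexibility": 68.44,
    "entropy": 68.02,
    "hydrophobicity": 68.01,
    "secondary structure": 67.23,
    "linker": 66.76,
    "AHNC": 65.51,
    "relative accessibility": 65.11,
    "RPPI": 64.24,
    "HSNC": 63.82,
}


def accuracy_drops(baseline, accuracy_without):
    """Return (descriptor, drop) pairs sorted from the largest to the smallest drop."""
    drops = {
        name: round(baseline - acc, 2) for name, acc in accuracy_without.items()
    }
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


ranked = accuracy_drops(BASELINE, ACCURACY_WITHOUT)
for name, drop in ranked:
    # Drop is in percentage points relative to the full nine-descriptor input.
    print(f"{name:<24s} {drop:5.2f}")
```

The sorted list places HSNC, RPPI, and relative accessibility first and flexibility last, which is the 5.5% to 0.8% range quoted in the text. Since these drops are comparable to the fold-to-fold variation, the ordering of closely spaced descriptors should be read with caution.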
Does this relative importance carry over to the
larger PPRODO dataset? To address this question, we
have also performed the same nine cross-validation
tests on this 522-protein dataset by removing one
descriptor at a time. The final prediction accuracies of
these cross-validation calculations are also summarized in
Table IV. Similarly, the relative accessibility, HSNC, and
RPPI are found to be the top three descriptors, with the
prediction accuracy dropping from 62.0% to 57.5%,
57.7%, and 59.3%, respectively, once the descriptor is
removed from the input.
Finally, it should be pointed out that the commonly
used 10-fold cross-validation generates a large statistical
variation (5–8% standard deviation) in our prediction
accuracy, particularly for our smaller 238-protein dataset, as
shown in Figure 3. With a 10-fold cross-validation,
the test set has only about 24 proteins, so a single
protein mis-prediction can result in a 4.2% drop in the
accuracy. The 5–8% standard deviation seen in Figure 3
therefore corresponds to a deviation of 1–2 proteins in the
total number of correctly predicted proteins, which is
acceptable. For further validation, we have performed a
five-fold cross-validation. As expected, the statistical
variation gets much smaller: the five-fold prediction
accuracies are found to be 62.50%, 57.45%, 62.50%, 63.83%,
and 64.58%, respectively, which gives an average accuracy
of 62.2% with a standard deviation of 2.4%. In addition, we
have trained the network on the PPRODO dataset (522 proteins)
and tested it on our 238-protein dataset. There are 82
proteins common to both datasets, so we have removed
these common proteins from the training set (522 − 82 = 434 proteins, while the test set remains the same with 238
proteins) to avoid an artificially high accuracy. A decent
accuracy of 65.4% has been achieved for this much larger
test set, which indicates that our neural network is fairly robust.
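The fold-size argument above can be checked with a few lines of arithmetic. The sketch below uses only the five-fold accuracies and dataset size quoted in the text; the helper name `per_protein_step` is illustrative. The text does not state which standard deviation convention was used, so both the population and the sample form are computed here.

```python
import statistics

# Five-fold cross-validation accuracies (%) as quoted in the text.
FIVE_FOLD = [62.50, 57.45, 62.50, 63.83, 64.58]


def per_protein_step(n_proteins, n_folds):
    """Accuracy change (%) caused by one protein in an average-sized test fold."""
    fold_size = n_proteins / n_folds
    return 100.0 / fold_size


mean_acc = statistics.mean(FIVE_FOLD)
pop_sd = statistics.pstdev(FIVE_FOLD)    # population form, divides by n
sample_sd = statistics.stdev(FIVE_FOLD)  # sample form, divides by n - 1

print(f"mean accuracy           : {mean_acc:.2f}%")
print(f"population / sample SD  : {pop_sd:.2f}% / {sample_sd:.2f}%")
print(f"one protein, 10-fold    : {per_protein_step(238, 10):.2f}%")
print(f"one protein, 5-fold     : {per_protein_step(238, 5):.2f}%")
```

With 238 proteins, halving the number of folds doubles the test-fold size, so each mis-predicted protein moves the fold accuracy by roughly half as much. This is consistent with the smaller spread reported for the five-fold run.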
CONCLUSION
We have presented a BP neural network based method
to identify the domain boundary of two-domain proteins.
We have achieved a prediction accuracy of 69%
(with the commonly used ±20 criterion) from the 10-fold
cross validation on a 238-protein dataset that we
built based on a common set from both SCOP and
CATH classifications. The method is then applied to a
larger third-party dataset with 522 proteins, and an
accuracy of 62% has been achieved. Our prediction results on
both datasets are found to be significantly better than
those from some other methods, such as DomCut and
DGS on the same datasets, and also comparable to that
from the PPRODO method upon which the larger data-
set is based. Our cross validation results are also notice-
ably better than previous results from other BP neural
network implementations, probably because we have
used more property descriptors with significantly more
training nodes in our network. Furthermore, our relative
importance analysis reveals that the HSNC (percentage of
helix and strand residues from the N- and C-terminals),
RPPI (the relative position probability index), and the
relative accessibility are the top three descriptors, even
though the differences among these descriptors are not
that large. These results also indicate that the domain
boundary information (and maybe other structural
information as well) is often shared among many of
these descriptors. Thus, in order to further improve the
accuracy, more and better descriptors are still needed.
Future work will include the extension of the current
method to multi-domain proteins, and the design of
new independent, orthogonal property descriptors (not
included in the current ones). Future work will also
investigate the possible accuracy improvement from the
further addition of similarity search and multiple sequence
alignments.
ACKNOWLEDGMENTS
The authors thank Jingyuan Li for many helpful dis-
cussions and Huajun Chen for help with the BP neural
network implementation.
REFERENCES
1. Rose GD. Hierarchic organization of domains in globular proteins.
J Mol Biol 1979;134:447–470.
2. Kong L, Ranganathan S. Delineation of modular proteins: domain
boundary prediction from sequence information. Brief Bioinform
2004;5:179–192.
3. Zhang Y, Chandonia J-M, Ding C, Holbrook SR. Comparative
mapping of sequence-based and structure-based protein domains.
BMC Bioinform 2005;6:77–92.
4. Ponting CP, Russell RR. The natural history of protein domains.
Ann Rev Biophys Biomol Struct 2002;31:45–71.
5. George RA, Lin K, Heringa J. Scooby-domain: prediction of globular
domains in protein sequence. Nucleic Acids Res 2005;33:W160–163.
6. Xu Y, Xu D, Gabow HN. Protein domain decomposition using a
graph-theoretic approach. Bioinformatics 2000;16:1091–1104.
7. Pugalenthi G, Archunan G, Sowdhamini R. Dial: a web-based server
for the automatic identification of structural domains in proteins.
Nucleic Acids Res 2005;33:W130–132.
8. Taylor WR. Protein structural domain identification. Prot Eng
1999;12:203–216.
9. Murzin AG, Brenner SE, Hubbard T, Chothia C. Scop: a structural
classification of proteins database for the investigation of sequences
and structures. J Mol Biol 1995;247:536–540.
10. Nagarajan N, Yona G. Automatic prediction of protein domains
from sequence information using a hybrid learning system. Bioin-
formatics 2004;20:1335–1360.
11. Gouzy J, Corpet F, Kahn D. Whole genome protein domain analysis
using a new method for domain clustering. Comput Chem 1999;23:
333–340.
12. Sonnhammer EL, Kahn D. Modular arrangement of proteins as
inferred from analysis of homology. Prot Sci 1994;3:482–492.
13. Gracy J, Argos P. Automated protein sequence database classifica-
tion. Bioinformatics 1998;14:164–173.
14. Altschul S, Madden T, Shaffer A, Zhang J, Zhang Z. Gapped blast
and psi-blast: a new generation of protein database search programs.
Nucleic Acids Res 1997;25:3389–3402.
15. Sonnhammer ELL, Eddy SR, Birney E, Bateman A, Durbin R.
Pfam: multiple sequence alignments and hmm-profiles of protein
domains. Nucleic Acids Res 1998;26:320–322.
16. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT,
White O. Tigrfams: a protein family resource for the functional
identification of proteins. Nucleic Acids Res 2001;29:41–43.
17. Ponting CP, Schultz J, Milpetz F, Bork P. Smart: identification and
annotation of domains from signalling and extracellular protein
sequences. Nucleic Acids Res 1999;27:229–232.
18. Kim DE, Chivian D, Malmstrom L, Baker D. Automated prediction
of domain boundaries in casp6 targets using ginzu and rosettadom.
Proteins 2005;S7:193–200.
19. George RA, Heringa J. Snapdragon: a method to delineate pro-
tein structural domains from sequence data. J Mol Biol 2002;316:
839–851.
20. Rigden DJ. Use of covariance analysis for the prediction of struc-
tural domain boundaries from multiple protein sequence align-
ments. Prot Eng 2002;15:65–77.
21. Kuroda Y, Matsuo Y, Yokoyama S. Automated search of natively
folded protein fragments for high-throughput structure determina-
tion in structural genomics. Prot Sci 2000;9:2313–2321.
22. George RA, Heringa J. Protein domain identification and improved
sequence similarity searching using psi-blast. Proteins 2002;48:672–
681.
23. Wheelan SJ, Marchler-Bauer A, Bryant SH. Domain size distribu-
tion can predict domain boundaries. Bioinformatics 2000;16:613–
618.
24. Suyama M, Ohara O. Domcut: prediction of inter-domain linker
regions in amino acid sequences. Bioinformatics 2003;19:673–
674.
25. Galzitskaya OV, Melnik BS. Prediction of protein domain bounda-
ries from sequence alone. Prot Sci 2003;12:696–701.
26. Liu J, Rost B. Sequence-based prediction of protein domains.
Nucleic Acids Res 2004;32:3522–3530.
27. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thorn-
ton JM. Cath–a hierarchic classification of protein domain struc-
tures. Structure 1997;5:1093–1108.
28. Sim J, Kim S-Y, Lee J. Pprodo: prediction of protein domain boun-
daries using neural networks. Proteins 2005;59:627–632.
29. Day R, Beck DA, Armen RS, Daggett V. A consensus view of fold
space: combining scop, cath, and the dali domain dictionary. Prot
Sci 2003;12:2150–2160.
30. Mika S, Rost B. UniqueProt: creating representative protein
sequence sets. Nucleic Acids Res 2003;31:3789–3791.
31. Cheng J, Sweredoski MJ, Baldi P. Dompro: protein domain predic-
tion using profiles, secondary structure, relative solvent accessibility,
and recursive neural networks. Data Min Knowl Discov 2005;13:1–
10.
32. Cuff JA, Clamp ME, Siddiqui AS, Finlay M, Barton GJ. Jpred: a consensus
secondary structure prediction server. Bioinformatics 1998;14:892–893.
33. Chen H, Zhou HX. Prediction of solvent accessibility and sites of deleterious
mutations from protein sequence. Nucleic Acids Res 2005;33:3193–3199.
34. Frishman D, Argos P. Knowledge-based secondary structure assignment.
Proteins 1995;23:566–579.
35. Hirakawa H, Muta S, Kuhara S. The hydrophobic cores of proteins
predicted by wavelet analysis. Bioinformatics 1999;15:141–148.
36. Vihinen M, Torkkila E, Riikonen P. Accuracy of protein flexibility
predictions. Proteins 1994;19:141–149.
37. Dumontier M, Yao R, Feldman HJ, Hogue CWV. Armadillo: do-
main boundary prediction by amino acid composition. J Mol Biol
2005;350:1061–1073.
38. Kyte J, Doolittle RF. A simple method for displaying the hydro-
pathic character of a protein. J Mol Biol 1982;157:105–132.