Support vector machines for classification of homo-oligomeric
proteins by incorporating subsequence distributions
Jie Song (a,b,*), Huanwen Tang (a)
(a) Department of Applied Mathematics, Institute of Computational Biology and Bioinformatics, Dalian University of Technology, Dalian 116025, People's Republic of China
(b) Department of Mathematics, Shaoguan University, Shaoguan 512005, People's Republic of China
Received 24 July 2004; accepted 4 February 2005
Available online 16 March 2005
Abstract
The support vector machine approach is applied to classifying protein homo-oligomers from the primary structure. For training and testing protein primary sequences, their subsequence distributions act as input vectors of the support vector machine, so the information in the protein sequences is taken into account sufficiently. Our tests demonstrate that the residue order along protein sequences plays an important role in the recognition of homo-oligomers, and that the support vector machine method is an effective tool for the prediction of protein multimeric states. It was also confirmed that the protein primary sequence encodes quaternary structure information.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Protein quaternary structure; Homo-oligomers; Support vector machine; Subsequence distribution; Classification
1. Introduction
Protein structure plays a key role in cell biology,
biochemistry and molecular biology. Protein structure is
organized hierarchically from so-called primary structure to
quaternary structure. Primary structure is the sequence of
residues in the polypeptide chain. Secondary structure refers
to regular, repeated patterns of folding of the protein
backbone. Tertiary structure is the full three-dimensional
folded structure of the polypeptide chain. Quaternary
structure, the focus of this paper, arises when a protein
consists of more than one polypeptide chain. With multiple
polypeptide chains, quaternary structure describes the
spatial organization of the chains. The structure of a protein
can be determined by physical methods. But this is a slow
and expensive process. Owing to the dramatic increase in
the numbers of proteins sent to the public data bank during
the past few years, it is highly desirable to develop some
effective computational methods to predict the structure of
0166-1280/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.theochem.2005.02.002
Journal of Molecular Structure: THEOCHEM 722 (2005) 97–101
* Corresponding author. Tel.: +86 411 8470 0927; fax: +86 411 8470 9304.
E-mail address: [email protected] (J. Song).
new proteins so as to expedite the process of deducing their
function. The primary structure is unique to each protein. It is generally accepted that a protein's primary structure is enough to determine how it will fold and combine with other proteins to form the appropriate secondary, tertiary and quaternary structure [1,2]. Prediction of protein
structure from amino acid sequences is still a difficult
problem and an active study area of contemporary
molecular biology.
Proteins with quaternary structure are said to be
oligomeric (or multimeric), and the individual chains are
called subunits. A considerable range of oligomers is found in proteins, from dimeric creatine kinase to octameric tryptophanase and ribulose diphosphate carboxylase, which has 16 subunits. Oligomeric proteins are either homo-oligomeric, consisting of identical subunits, or hetero-oligomeric, consisting of different subunits. The arrangement of subunits in the oligomeric structure can also vary. An oligomeric protein is more than the sum of its parts and has important properties not shared with its separated subunits.
A variety of bonding interactions, including hydrogen bonding, salt bridges, and disulfide bonds, hold the subunits in a particular geometry. Klotz et al. [3] reviewed a number of quaternary structure properties such as stoichiometric constitution, the geometric arrangements of
the subunits, the assembly energetics, intersubunit com-
munication, and their functional aspects. Some recent works
have paid more attention to analyzing protein–protein
interactions and predicting interaction sites [4–8]. Garian built a rule-based classifier, using a decision tree model and amino acid indices, to discriminate between the primary sequences of homodimers and non-homodimers [9], which confirmed that the protein primary sequence encodes quaternary structure information.
In the present work, we use a new machine learning method, the support vector machine (SVM) [10,11], to discriminate between homodimers, homotrimers, homotetramers and homohexamers from primary homo-oligomeric protein sequences. In this new method, we introduce the subsequence distributions of primary protein sequences and take them as feature vectors. In the last
few years, SVM has been introduced to solve many
biological pattern recognition problems such as micro-
array data analysis [12], protein fold recognition [13],
prediction of protein–protein interaction [6], prediction of
subcellular location [14] and prediction of protein
secondary structure [15]. The amino acid composition of a protein sequence has been used in many studies of proteins [16–19]. The subsequence distribution of a primary sequence, a generalization of its amino acid composition, makes fuller use of the information in the sequence, and has been successfully applied to the study of phylogeny [20,21] and the prediction of protein structural classes [22]. Combining SVM with subsequence distributions leads to good predictive results for the problem of classifying homo-oligomeric proteins.
2. Data and methods
2.1. Datasets
Robert Garian selected a data set of homo-oligomeric
sequences from Release 34 of the SWISS-PROT database
[9]. It was limited to the prokaryotic, cytosolic subset of
homo-oligomers in the database in order to eliminate
membrane proteins and other specialized proteins. We
selected a subset, R1568, of Robert Garian's dataset as our first dataset. It consisted of 1568 homo-oligomeric protein sequences, of which 914 were homodimers (2EM), 139 homotrimers (3EM), 407 homotetramers (4EM) and 108 homohexamers (6EM). In the present work, we first
use this dataset to test the effectiveness of our method. In
order to investigate the influence of the dataset size and the
sample unbalance between the four classes, we randomly
extracted 108 sequences from each class (2EM, 3EM, 4EM
and 6EM) of the dataset R1568, and formed them into the
dataset R432 consisting of 432 homo-oligomeric protein
sequences.
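As a concrete illustration of how R432 was formed, a balanced subset can be drawn by random sampling from each class. The helper below is a sketch of this step only; the names and the fixed seed are ours, not from the original implementation:

```python
import random

def balanced_subset(sequences_by_class, n, seed=0):
    """Draw n sequences at random from each class, as done to build
    R432 from R1568 (n = 108 from each of 2EM, 3EM, 4EM and 6EM)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return {label: rng.sample(seqs, n)
            for label, seqs in sequences_by_class.items()}
```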
2.2. Subsequence distribution
In our method we describe homo-oligomeric protein
sequences by their subsequence distributions [20], and take
the subsequence distributions as input vectors of SVM. In
the following, we first give the concept of the subsequence distribution, and then briefly discuss it.
Let Σ be an alphabet of m symbols, and suppose that S is a sequence formed from Σ. All the distinct contiguous sequences of length l formed from Σ are grouped into a set Q_l, so the number of sequences in Q_l equals m^l. For a sequence S, let L be its length, and let n_i^l (l ≤ L) denote the number of contiguous subsequences in S that match the i-th sequence in Q_l. Obviously, the total number of subsequences of length l from S is

∑_{i=1}^{m^l} n_i^l = L − l + 1

for each l ≤ L. Define p_i^l = n_i^l / (L − l + 1), so we obtain a subsequence distribution:

U_S^l = (p_1^l, p_2^l, …, p_{m^l}^l)

Thus, for each sequence S of length L, there is a unique set of distributions

{U_S^1, U_S^2, …, U_S^L}

which contains all the composition information of the sequence S; it is called the complete information set of the sequence S [20]. Any two different sequences have different complete information sets, and vice versa.
For protein sequences, the 20 amino acids form the alphabet Σ = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. Suppose S is a protein sequence. When l = 1, U_S^1 = (p_1^1, p_2^1, …, p_{20}^1) becomes the conventional amino acid composition. When l = 2, U_S^2 = (p_1^2, p_2^2, …, p_{400}^2) includes all the information of the first-order coupled composition introduced by Liu and Chou [18] in secondary structure content prediction. By the construction of complete information sets, the longer the subsequence is, the more information it includes. When l = 1, no information about the residue order is considered, and a great number of sequences share the same amino acid composition. But when l ≥ 2, the residue order along a sequence is contained in its subsequence distribution set, and the occurrence of different sequences with the same subsequence distribution is largely ruled out, because adjoining subsequences have to overlap each other in l − 1 residues. Obviously, any sequence can be uniquely recognized by increasing the subsequence length.
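The construction above translates directly into code. The sketch below (our own illustration, not the authors' implementation) enumerates the m^l words of Q_l in lexicographic order and returns U_S^l:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the alphabet Sigma of m = 20 symbols

def subsequence_distribution(seq, l, alphabet=AMINO_ACIDS):
    """Return U_S^l as a vector of dimension m^l: the i-th component is
    n_i^l / (L - l + 1), the frequency of the i-th length-l word of Q_l
    among the overlapping length-l subsequences of seq."""
    words = ["".join(w) for w in product(alphabet, repeat=l)]
    index = {w: i for i, w in enumerate(words)}
    total = len(seq) - l + 1  # number of overlapping length-l subsequences
    counts = [0] * len(words)
    for j in range(total):
        counts[index[seq[j:j + l]]] += 1
    return [c / total for c in counts]
```

For l = 1 this reduces to the conventional amino acid composition; the dimension m^l grows quickly with l, which is one reason only l = 1–4 is considered here.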
2.3. Support vector machine
SVM is a new pattern recognition tool theoretically founded on Vapnik's statistical learning theory [10]. SVM, originally designed for binary classification, employs supervised learning to find the optimal separating hyperplane between two groups of data. Having found such a plane, SVM can then predict the classification of an unlabeled example by asking on which side of the separating plane the example lies. SVM acts as a linear classifier in a high-dimensional feature space obtained by a projection of the original input space; the resulting classifier is in general non-linear in the input space, and it achieves good generalization performance by maximizing the margin between the two classes. In the following, we give a short outline of the construction of an SVM.
Consider a set of training examples

{(x_i, y_i)},  x_i ∈ R^n,  y_i ∈ {+1, −1},  i = 1, …, m

where the x_i are real n-dimensional pattern vectors and the y_i are dichotomous labels.
SVM maps the pattern vectors x ∈ R^n into a possibly higher-dimensional feature space (z = φ(x)) and constructs an optimal hyperplane w·z + b = 0 in feature space to separate the examples of the two classes. For an SVM with the L1 soft-margin formulation, this is done by solving the primal optimization problem

min_{w,b,ξ}  (1/2)||w||² + C ∑_{i=1}^{m} ξ_i
s.t.  y_i(w·z_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, 2, …, m

where C is a regularization parameter that decides the trade-off between the training error and the margin, and the ξ_i, i = 1, 2, …, m, are slack variables.
The above problem is solved computationally via its dual form

max_α  ∑_{i=1}^{m} α_i − (1/2) ∑_{i,j=1}^{m} α_i α_j y_i y_j k(x_i, x_j)
s.t.  ∑_{i=1}^{m} α_i y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, …, m

where k(x_i, x_j) = φ(x_i)·φ(x_j) is the kernel function that implicitly defines the mapping φ.
The resulting decision function is

f(x) = sgn( ∑_{i=1}^{m} α_i y_i k(x_i, x) + b )

All kernel functions have to fulfill Mercer's theorem (see [11], p. 33). The most commonly used kernel functions are the

polynomial kernel  k(x_i, x_j) = (a(x_i·x_j) + b)^d
radial basis function kernel  k(x_i, x_j) = exp(−γ||x_i − x_j||²)
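Given multipliers α_i and bias b produced by some solver of the dual problem, the decision function with the RBF kernel can be evaluated directly. The snippet below is a sketch of that evaluation only; the support vectors, α_i and labels are assumed to come from an already-trained solver:

```python
import numpy as np

def rbf_kernel(xi, xj, gamma):
    # k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    d = xi - xj
    return np.exp(-gamma * np.dot(d, d))

def decide(x, support_vectors, alphas, labels, b, gamma):
    # f(x) = sgn( sum_i alpha_i * y_i * k(x_i, x) + b )
    s = sum(a * y * rbf_kernel(sv, x, gamma)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + b >= 0 else -1
```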
The SVM learning algorithm requires solving a quadratic programming problem. This can be accomplished with general quadratic programming packages, or with specific methods such as sequential minimal optimization [23]. In practice, one can use open software available on the web. In this paper, we use the software LIBSVM (version 2.5), which is an integrated package for SVM classification and regression and is very easy to operate [24].
In this paper, the classification of homodimers, homotrimers, homotetramers and homohexamers is a four-class problem. For multiclass SVM methods, either several binary classifiers have to be constructed or a larger optimization problem has to be solved. The software we used adopts the 'one-against-one' method based on binary classification [25].
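The 'one-against-one' scheme trains a binary classifier for each of the k(k − 1)/2 class pairs (six classifiers for our four classes) and predicts by majority vote. The sketch below shows the voting step; `binary_predict` is a hypothetical stand-in for the trained pairwise SVMs, not LIBSVM's API:

```python
from itertools import combinations

def one_against_one(x, classes, binary_predict):
    """Predict the class of x by majority vote over all pairwise binary
    classifiers; binary_predict(a, b, x) is assumed to return whichever
    of the classes a, b wins on input x."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[binary_predict(a, b, x)] += 1
    # ties are broken in favor of the earlier class in `classes`
    return max(classes, key=lambda c: votes[c])
```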
3. Results and discussion
3.1. Classification assessment
In order to examine the predictive quality of the present
prediction method, we carry out both re-substitution test
and jackknife test with different subsequence lengths from
1 to 4 on the two datasets R1568 and R432. A re-substitution
test is usually taken as an examination of a method's self-consistency, and a prediction method is not considered a good one if its self-consistency is poor. A jackknife test, also called the leave-one-out test, is deemed the most effective cross-validation test; the memorization effect included in the re-substitution test is removed in a jackknife test.
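The jackknife procedure can be written generically: hold out one example, train on the rest, and score the held-out prediction. A minimal, classifier-agnostic sketch, where `fit` and `predict` are placeholders for the SVM training and prediction steps:

```python
def jackknife_accuracy(xs, ys, fit, predict):
    """Leave-one-out accuracy: for each i, train on all examples except
    the i-th and test on the i-th."""
    correct = 0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        if predict(model, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)
```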
Two performance measures are used to assess the ability of the SVM classifier on the testing data: the overall accuracy (Q) and the prediction accuracy for each class (Q_i), defined as

Q = ∑_{i=1}^{k} p(i) / N,  Q_i = p(i) / N_i

where N is the total number of sequences, i is the class number, p(i) is the number of correctly predicted sequences of class-i homo-oligomers, and N_i is the number of sequences observed in class i.
3.2. SVM parameter selection
For a given dataset, only the regularization parameter C and the kernel function must be selected to specify an SVM. In our experiments using LIBSVM, we select the radial basis function kernel, and choose the regularization parameter C from {10^0, 10^1, …, 10^5} and the kernel parameter γ from {2^−10, 2^−9, …, 2^10} to get the best accuracy. Through tuning, we finally set C = 1000 for both datasets and all four subsequence lengths; we set γ = 2^9, 2^8, 2^6 and 2^0, respectively, for subsequence lengths l = 1–4 on the dataset R1568, and γ = 2^8, 2^7, 2^3 and 2^2, respectively, for subsequence lengths l = 1–4 on the dataset R432.
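This tuning amounts to an exhaustive search over the two grids. A sketch of such a search, where the scoring function `evaluate` (e.g. a cross-validated accuracy) is an assumed input rather than part of the paper's method:

```python
from itertools import product

def grid_search(evaluate):
    """Return the (C, gamma) pair with the best score over the grids
    C in {10^0, ..., 10^5} and gamma in {2^-10, ..., 2^10}."""
    cs = [10 ** k for k in range(0, 6)]
    gammas = [2.0 ** k for k in range(-10, 11)]
    return max(product(cs, gammas), key=lambda cg: evaluate(*cg))
```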
3.3. The re-substitution and jackknife testing results
of SVMs with different subsequence lengths
Our method was first tested by the re-substitution
operation for different subsequence lengths from 1 to 4.
Table 1
Results of jackknife tests of the SVM method with different subsequence lengths based on the dataset R1568 (RBF kernel function, C = 1000; γ = 2^9, 2^8, 2^6 and 2^0, respectively, for subsequence lengths l = 1–4). The 2EM–6EM columns give the prediction accuracy for each class, Q_i (%); the last column gives the overall accuracy Q (%).

Subsequence length   2EM     3EM     4EM     6EM     Q (%)
l = 1                93.11   53.96   61.67   40.74   77.87
l = 2                95.51   57.55   68.06   50.93   81.95
l = 3                96.83   62.59   71.74   58.33   84.63
l = 4                97.05   63.31   72.24   61.11   85.14
As a result, the SVM method correctly predicts all homo-oligomers for all four subsequence lengths on the two datasets, so every performance measure reaches 100%. We do not present these results in a table because the accuracies are all the same.
Table 1 gives the results of jackknife tests of the SVM method with subsequence lengths from 1 to 4 on the dataset R1568. When l = 4, the SVM method performs best: the overall accuracy reaches 85.14%. It is observed that the predictive accuracies improve quickly as the subsequence length increases from 1 to 3, and that the rate of improvement begins to fall off at l = 4. It is also noticed that the homodimers are easier to recognize than the non-homodimers.
Table 2 gives the results of jackknife tests of the SVM method with subsequence lengths from 1 to 4 on the dataset R432. When l = 4, the SVM method performs best: the overall accuracy reaches 82.87%. It is seen that the predictive accuracies improve more quickly as the subsequence length increases from 1 to 3 on R432 than on R1568, and that the rate of improvement likewise falls off at l = 4. In addition, the differences in predictive accuracy between the four classes decrease on R432.
3.4. Discussion
From Tables 1 and 2, we can see that the dataset size and the sample imbalance between classes have a great influence on the prediction accuracy. In general, increasing the size of the training set and decreasing the sample imbalance between classes can improve the predictive performance.
For re-substitution tests with different subsequence lengths, all the predictive accuracies reach 100%. This indicates that, after training, the hyperplane output by the SVM has captured the complicated relationship between the composition information of a protein sequence and its multimeric state, and that SVM can be applied to predict the number of subunits.

Table 2
Results of jackknife tests of the SVM method with different subsequence lengths based on the dataset R432 (RBF kernel function, C = 1000; γ = 2^8, 2^7, 2^3 and 2^2, respectively, for subsequence lengths l = 1–4). The 2EM–6EM columns give the prediction accuracy for each class, Q_i (%); the last column gives the overall accuracy Q (%).

Subsequence length   2EM     3EM     4EM     6EM     Q (%)
l = 1                61.11   79.63   59.26   65.74   66.90
l = 2                63.89   82.41   68.52   75.93   72.69
l = 3                78.70   87.96   78.70   81.48   81.71
l = 4                79.63   88.89   80.56   82.41   82.87
In the method of this paper, the coupling effect of the closest residues is taken into account by decomposing a sequence into subsequences. Moreover, the effect of residue order along the sequence is captured through the overlaps of subsequences; the longer the subsequence is, the more information about the original sequence it includes.
From the jackknife test results, we notice that as the subsequence length increases from 1 to 4, the correctness rate tends to rise in all tests. This trend suggests that prediction quality can be improved by increasing the subsequence length. It does not mean, however, that the longer the subsequence, the better the result: once the subsequence length grows beyond some point, computational errors contribute most of the differences in the classification process. It is usually appropriate to take l = 3 or 4 for protein sequences, so we need not worry about difficulties for the SVM caused by the high dimension of the input vectors.
4. Conclusion
In this paper, the support vector machine method is applied to the classification of protein homo-oligomers from the primary structure, where subsequence distributions of primary sequences act as the input vectors of the SVM, by which the effects of residue order along sequences are taken into account. The tests show that the new method has good predictive quality and that the residue order plays an important role in the recognition of protein multimeric states. Our experiment also confirms that the protein primary sequence
encodes quaternary structure information. In order to further improve the predictive accuracy, we may try to extract other sequence descriptors from the primary protein sequence as feature vectors of the SVM, descriptors that grasp the most essential information reflecting quaternary structure. In addition, it is anticipated that the SVM method may be combined with other prediction methods to become a very useful tool for predicting protein multimeric states.
Acknowledgements
This work has been supported by the Chinese NSF under grant 90103033. The authors would like to thank
Dr Robert Garian (School of Computational Sciences,
George Mason University, USA) for sending the dataset.
References
[1] C.B. Anfinsen, E. Haber, M. Sela, F.H. White, The kinetics of
formation of native ribonuclease during oxidation of the reduced
polypeptide chain, Proc. Natl Acad. Sci. USA 47 (1961) 1309–1314.
[2] C.B. Anfinsen, Principles that govern the folding of protein chains,
Science 181 (1973) 223–230.
[3] I.M. Klotz, D.M. Darnall, N.R. Langerman, Quaternary structure of
proteins, in: H. Neurath, R.L. Hill (Eds.), The Proteins vol. 1,
Academic Press, New York, 1975, pp. 293–411.
[4] E.M. Marcotte, M. Pellegrini, H.L. Ng, D.W. Rice, T.O. Yeates,
D. Eisenberg, Detecting protein function and protein–protein
interactions from genome sequences, Science 285 (1999) 751–753.
[5] F. Glaser, D.M. Steinberg, I.A. Vakser, N. Ben-Tal, Residue
frequencies and pairing preference at protein–protein interfaces,
Proteins: Struct. Funct. Genet. 43 (2001) 89–102.
[6] J.R. Bock, D.A. Gough, Predicting protein–protein interactions from
primary structure, Bioinformatics 17 (2001) 455–460.
[7] I.M.A. Nooren, J.M. Thornton, Structural characterization and
functional significance of transient protein–protein interactions,
J. Mol. Biol. 325 (2003) 991–1018.
[8] Y. Ofran, B. Rost, Analysing six types of protein–protein interfaces,
J. Mol. Biol. 325 (2003) 377–387.
[9] R. Garian, Prediction of quaternary structure from primary structure,
Bioinformatics 17 (2001) 551–556.
[10] V. Vapnik, Statistical learning theory, Wiley, New York, 1998.
[11] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector
Machines, Cambridge University Press, Cambridge, 2000.
[12] T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, Support vector
machine classification and validation of cancer tissue samples using
microarray expression data, Bioinformatics 16 (2000) 906–914.
[13] C.H.Q. Ding, I. Dubchak, Multi-class protein fold recognition using
support vector machines and neural networks, Bioinformatics 17
(2001) 349–358.
[14] Y.D. Cai, X.J. Liu, X.B. Xu, K.C. Chou, Support vector machines for
prediction of protein subcellular location by incorporating quasi-
sequence-order effect, J. Cell Biochem. 84 (2002) 343–348.
[15] S. Hua, Z. Sun, A novel method of protein secondary structure
prediction with high segment overlap measure: support vector
machine approach, J. Mol. Biol. 308 (2001) 397–407.
[16] K.C. Chou, Prediction of cellular attributes using pseudo-amino acid composition, Proteins: Struct. Funct. Genet. 43 (2001) 246–255.
[17] I. Dubchak, I. Muchnik, C. Mayor, S.-H. Kim, Recognition of a
protein fold in the context of the SCOP classification, Proteins: Struct.
Funct. Genet. 35 (1999) 401–407.
[18] W.M. Liu, K.C. Chou, Prediction of protein secondary structure
content, Protein Eng. 12 (1999) 1041–1050.
[19] C.T. Zhang, K.C. Chou, An optimization approach to predicting
protein structural class from amino acid composition, Protein Sci. 1
(1992) 401–408.
[20] W. Fang, F.S. Roberts, Z. Ma, A measure of discrepancy of multiple
sequences, Inform. Sci. 137 (2001) 75–102.
[21] J. Wang, W. Fang, L. Ling, R. Chen, Gene’s functional arrangement
as a measure of the phylogenetic relationships of microorganisms,
J. Biol. Phys. 28 (2002) 55–62.
[22] L. Jin, W. Fang, H. Tang, Prediction of protein structural classes by a new measure of information discrepancy, Comput. Biol. Chem. 27 (2003) 373–380.
[23] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, 1998, pp. 185–208.
[24] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[25] U. Kreßel, Pairwise classification and support vector machines, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, 1998, pp. 255–268.