Support vector machines for classification of homo-oligomeric
proteins by incorporating subsequence distributions
Jie Song (a,b,*), Huanwen Tang (a)
(a) Department of Applied Mathematics, Institute of Computational Biology and Bioinformatics, Dalian University of Technology, Dalian 116025, People's Republic of China
(b) Department of Mathematics, Shaoguan University, Shaoguan 512005, People's Republic of China
Received 24 July 2004; accepted 4 February 2005
Available online 16 March 2005
Abstract
The support vector machine approach is applied to classifying protein homo-oligomers from the primary structure. For training and testing protein primary sequences, their subsequence distributions act as input vectors of the support vector machine, so the information in the protein sequences is taken into account sufficiently. Our tests demonstrate that the residue order along protein sequences plays an important role in the recognition of homo-oligomers, and that the support vector machine method is an effective tool for the prediction of protein multimeric states. It was also confirmed that the protein primary sequence encodes quaternary structure information.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Protein quaternary structure; Homo-oligomers; Support vector machine; Subsequence distribution; Classification
1. Introduction
Protein structure plays a key role in cell biology,
biochemistry and molecular biology. Protein structure is
organized hierarchically from so-called primary structure to
quaternary structure. Primary structure is the sequence of
residues in the polypeptide chain. Secondary structure refers
to regular, repeated patterns of folding of the protein
backbone. Tertiary structure is the full three-dimensional
folded structure of the polypeptide chain. Quaternary
structure, the focus of this paper, arises when a protein
consists of more than one polypeptide chain. With multiple
polypeptide chains, quaternary structure describes the
spatial organization of the chains. The structure of a protein
can be determined by physical methods. But this is a slow
and expensive process. Owing to the dramatic increase in
the numbers of proteins sent to the public data bank during
the past few years, it is highly desirable to develop some
effective computational methods to predict the structure of
0166-1280/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.theochem.2005.02.002
Journal of Molecular Structure: THEOCHEM 722 (2005) 97–101
* Corresponding author. Tel.: +86 411 8470 0927; fax: +86 411 8470 9304.
E-mail address: [email protected] (J. Song).
new proteins so as to expedite the process of deducing their
function. The primary structure is unique to each protein. It is generally accepted that a protein's primary structure is enough to determine how it will fold and combine with other proteins to form the appropriate secondary, tertiary and quaternary structure [1,2]. Prediction of protein
structure from amino acid sequences is still a difficult
problem and an active study area of contemporary
molecular biology.
Proteins with quaternary structure are said to be
oligomeric (or multimeric), and the individual chains are
called subunits. A considerable range of oligomers is found in proteins, from dimeric creatine kinase to octameric tryptophanase and ribulose diphosphate carboxylase, which has 16 subunits. Oligomeric proteins are either homo-oligomeric, consisting of identical subunits, or hetero-oligomeric, consisting of different subunits. The arrangement of subunits in the oligomeric structure can also vary. An oligomeric protein is more than the sum of its parts and has important properties not shared with its separated subunits.
A variety of bonding interactions, including hydrogen bonding, salt bridges, and disulfide bonds, hold the subunits in a particular geometry. Klotz et al. [3] reviewed a number of quaternary structure properties such as stoichiometric constitution, the geometric arrangements of
the subunits, the assembly energetics, intersubunit com-
munication, and their functional aspects. Some recent works
have paid more attention to analyzing protein–protein
interactions and predicting interaction sites [4–8]. Garian built a rule-based classifier, using a decision tree model and amino acid indices, to discriminate between the primary sequences of homodimers and non-homodimers [9], which confirmed that the protein primary sequence encodes quaternary structure information.
In the present work, we use a new machine learning method, the support vector machine (SVM) [10,11], to discriminate between homodimers, homotrimers, homotetramers and homohexamers from primary homo-oligomeric protein sequences. In this new method, we introduce the subsequence distributions of primary protein sequences and take them as feature vectors. In the last
few years, SVM has been introduced to solve many
biological pattern recognition problems such as micro-
array data analysis [12], protein fold recognition [13],
prediction of protein–protein interaction [6], prediction of
subcellular location [14] and prediction of protein
secondary structure [15]. The amino acid composition of a protein sequence has been used in many studies of proteins [16–19]. The subsequence distribution of a primary sequence, a generalization of its amino acid composition, makes fuller use of the information in the sequence, and has been successfully applied to the study of phylogeny [20,21] and the prediction of protein structural classes [22]. Combining SVM with subsequence distributions leads to good predictive results for the problem of classifying homo-oligomeric proteins.
2. Data and methods
2.1. Datasets
Robert Garian selected a data set of homo-oligomeric
sequences from Release 34 of the SWISS-PROT database
[9]. It was limited to the prokaryotic, cytosolic subset of
homo-oligomers in the database in order to eliminate
membrane proteins and other specialized proteins. We
selected a subset, R1568, of Robert Garian's dataset as our first dataset. It consisted of 1568 homo-oligomeric protein sequences, of which 914 were homodimers (2EM), 139 homotrimers (3EM), 407 homotetramers (4EM) and 108 homohexamers (6EM). In the present work, we first
use this dataset to test the effectiveness of our method. In
order to investigate the influence of the dataset size and the
sample unbalance between the four classes, we randomly
extracted 108 sequences from each class (2EM, 3EM, 4EM
and 6EM) of the dataset R1568, and formed them into the
dataset R432 consisting of 432 homo-oligomeric protein
sequences.
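As a concrete illustration of how R432 was formed, a balanced subset can be drawn by random sampling from each class. The helper below is a sketch of this step only; the names and the fixed seed are ours, not from the original implementation:

```python
import random

def balanced_subset(sequences_by_class, n, seed=0):
    """Draw n sequences at random from each class, as done to build
    R432 from R1568 (n = 108 from each of 2EM, 3EM, 4EM and 6EM)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return {label: rng.sample(seqs, n)
            for label, seqs in sequences_by_class.items()}
```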
2.2. Subsequence distribution
In our method we describe homo-oligomeric protein
sequences by their subsequence distributions [20], and take
the subsequence distributions as input vectors of SVM. In
the following, we first give the concept of the subsequence distribution, and then briefly discuss it.
Let Σ be an alphabet of m symbols, and suppose that S is a sequence formed from Σ. All the distinct contiguous sequences of length l formed from Σ are grouped into a set Q_l, so the number of sequences in Q_l equals m^l. For a sequence S, let L be its length, and let n_i^l (l ≤ L) denote the number of contiguous subsequences in S that match the i-th sequence in Q_l. Obviously, the total number of subsequences of length l from S is

∑_{i=1}^{m^l} n_i^l = L − l + 1

for each l ≤ L. Define p_i^l = n_i^l / (L − l + 1), so we obtain a subsequence distribution:

U_S^l = (p_1^l, p_2^l, …, p_{m^l}^l)

Thus, for each sequence S of length L, there is a unique set of distributions

{U_S^1, U_S^2, …, U_S^L}

which contains all the composition information of the sequence S; it is called the complete information set of the sequence S [20]. Any two different sequences have different complete information sets, and vice versa.
For protein sequences, the 20 amino acids form the alphabet Σ = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. Suppose S is a protein sequence. When l = 1, U_S^1 = (p_1^1, p_2^1, …, p_{20}^1) becomes the conventional amino acid composition. When l = 2, U_S^2 = (p_1^2, p_2^2, …, p_{400}^2) includes all the information of the first-order coupled composition introduced by Liu and Chou [18] in secondary structure content prediction. By the construction of complete information sets, the longer the subsequence is, the more information it includes. When l = 1, no information about the residue order is considered, and a great number of sequences share the same amino acid composition. But when l ≥ 2, the residue order along a sequence is contained in its subsequence distribution set, and the occurrence of different sequences with the same subsequence distribution is largely ruled out, because adjoining subsequences have to overlap each other in l − 1 residues. Obviously, any sequence can be uniquely recognized by increasing the subsequence length.
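The construction above translates directly into code. The sketch below (our own illustration, not the authors' implementation) enumerates the m^l words of Q_l in lexicographic order and returns U_S^l:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the alphabet Sigma of m = 20 symbols

def subsequence_distribution(seq, l, alphabet=AMINO_ACIDS):
    """Return U_S^l as a vector of dimension m^l: the i-th component is
    n_i^l / (L - l + 1), the frequency of the i-th length-l word of Q_l
    among the overlapping length-l subsequences of seq."""
    words = ["".join(w) for w in product(alphabet, repeat=l)]
    index = {w: i for i, w in enumerate(words)}
    total = len(seq) - l + 1  # number of overlapping length-l subsequences
    counts = [0] * len(words)
    for j in range(total):
        counts[index[seq[j:j + l]]] += 1
    return [c / total for c in counts]
```

For l = 1 this reduces to the conventional amino acid composition; the dimension m^l grows quickly with l, which is one reason only l = 1–4 is considered here.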
2.3. Support vector machine
SVM is a new pattern recognition tool theoretically founded on Vapnik's statistical learning theory [10]. SVM, originally designed for binary classification, employs supervised learning to find the optimal separating hyperplane between two groups of data. Having found such a plane, SVM can then predict the classification of an unlabeled example by asking on which side of the separating plane the example lies. SVM acts as a linear classifier in a high-dimensional feature space obtained by a projection of the original input space; the resulting classifier is in general non-linear in the input space, and it achieves good generalization performance by maximizing the margin between the two classes. In the following, we give a short outline of the construction of an SVM.
Consider a set of training examples

{(x_i, y_i)},  x_i ∈ R^n,  y_i ∈ {+1, −1},  i = 1, …, m

where the x_i are real n-dimensional pattern vectors and the y_i are dichotomous labels.
SVM maps the pattern vectors x ∈ R^n into a possibly higher-dimensional feature space (z = φ(x)) and constructs an optimal hyperplane w·z + b = 0 in feature space to separate the examples of the two classes. For an SVM with the L1 soft-margin formulation, this is done by solving the primal optimization problem

min_{w,b,ξ}  (1/2)||w||² + C ∑_{i=1}^{m} ξ_i
s.t.  y_i(w·z_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, 2, …, m

where C is a regularization parameter that decides the trade-off between the training error and the margin, and the ξ_i, i = 1, 2, …, m, are slack variables.
The above problem is solved computationally via its dual form

max_α  ∑_{i=1}^{m} α_i − (1/2) ∑_{i,j=1}^{m} α_i α_j y_i y_j k(x_i, x_j)
s.t.  ∑_{i=1}^{m} α_i y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, …, m

where k(x_i, x_j) = φ(x_i)·φ(x_j) is the kernel function that implicitly defines the mapping φ.
The resulting decision function is

f(x) = sgn( ∑_{i=1}^{m} α_i y_i k(x_i, x) + b )

All kernel functions have to fulfill Mercer's theorem (see [11], p. 33). The most commonly used kernel functions are the

polynomial kernel  k(x_i, x_j) = (a(x_i·x_j) + b)^d
radial basis function kernel  k(x_i, x_j) = exp(−γ||x_i − x_j||²)
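Given multipliers α_i and bias b produced by some solver of the dual problem, the decision function with the RBF kernel can be evaluated directly. The snippet below is a sketch of that evaluation only; the support vectors, α_i and labels are assumed to come from an already-trained solver:

```python
import numpy as np

def rbf_kernel(xi, xj, gamma):
    # k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    d = xi - xj
    return np.exp(-gamma * np.dot(d, d))

def decide(x, support_vectors, alphas, labels, b, gamma):
    # f(x) = sgn( sum_i alpha_i * y_i * k(x_i, x) + b )
    s = sum(a * y * rbf_kernel(sv, x, gamma)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + b >= 0 else -1
```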
The SVM learning algorithm requires solving a quadratic programming problem. This can be accomplished with general quadratic programming packages, or with specific methods such as sequential minimal optimization [23]. In practice, one can use open software available on the web. In this paper, we use the software LIBSVM (version 2.5), which is an integrated package for SVM classification and regression and is very easy to operate [24].
In this paper, the classification of homodimers, homotrimers, homotetramers and homohexamers is a four-class problem. For multiclass SVM methods, either several binary classifiers have to be constructed or a larger optimization problem has to be solved. The software we used adopts the 'one-against-one' method based on binary classification [25].
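The 'one-against-one' scheme trains a binary classifier for each of the k(k − 1)/2 class pairs (six classifiers for our four classes) and predicts by majority vote. The sketch below shows the voting step; `binary_predict` is a hypothetical stand-in for the trained pairwise SVMs, not LIBSVM's API:

```python
from itertools import combinations

def one_against_one(x, classes, binary_predict):
    """Predict the class of x by majority vote over all pairwise binary
    classifiers; binary_predict(a, b, x) is assumed to return whichever
    of the classes a, b wins on input x."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[binary_predict(a, b, x)] += 1
    # ties are broken in favor of the earlier class in `classes`
    return max(classes, key=lambda c: votes[c])
```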
3. Results and discussion
3.1. Classification assessment
In order to examine the predictive quality of the present
prediction method, we carry out both re-substitution test
and jackknife test with different subsequence lengths from
1 to 4 on the two datasets R1568 and R432. A re-substitution
test is usually taken as an examination of a method's self-consistency, and a prediction method is not considered a good one if its self-consistency is poor. A jackknife test, also called the leave-one-out test, is deemed the most effective cross-validation test; the memorization effect included in the re-substitution test is removed in a jackknife test.
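The jackknife procedure can be written generically: hold out one example, train on the rest, and score the held-out prediction. A minimal, classifier-agnostic sketch, where `fit` and `predict` are placeholders for the SVM training and prediction steps:

```python
def jackknife_accuracy(xs, ys, fit, predict):
    """Leave-one-out accuracy: for each i, train on all examples except
    the i-th and test on the i-th."""
    correct = 0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        if predict(model, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)
```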
Two performance measures are used to assess the ability of the SVM classifier on the testing data: the overall accuracy (Q) and the prediction accuracy for each class (Q_i), defined as

Q = ∑_{i=1}^{k} p(i) / N,  Q_i = p(i) / N_i

where N is the total number of sequences, i is the class number, p(i) is the number of correctly predicted sequences of class-i homo-oligomers, and N_i is the number of sequences observed in class i.
3.2. SVM parameter selection
For a given dataset, only the regularization parameter C and the kernel function must be selected to specify an SVM. In our experiments using LIBSVM, we select the radial basis function kernel, and choose the regularization parameter C from {10^0, 10^1, …, 10^5} and the kernel parameter γ from {2^−10, 2^−9, …, 2^10} to get the best accuracy. Through tuning, we finally set C = 1000 for both datasets and all four subsequence lengths; we set γ = 2^9, 2^8, 2^6 and 2^0, respectively, for subsequence lengths l = 1–4 on the dataset R1568, and γ = 2^8, 2^7, 2^3 and 2^2, respectively, for subsequence lengths l = 1–4 on the dataset R432.
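This tuning amounts to an exhaustive search over the two grids. A sketch of such a search, where the scoring function `evaluate` (e.g. a cross-validated accuracy) is an assumed input rather than part of the paper's method:

```python
from itertools import product

def grid_search(evaluate):
    """Return the (C, gamma) pair with the best score over the grids
    C in {10^0, ..., 10^5} and gamma in {2^-10, ..., 2^10}."""
    cs = [10 ** k for k in range(0, 6)]
    gammas = [2.0 ** k for k in range(-10, 11)]
    return max(product(cs, gammas), key=lambda cg: evaluate(*cg))
```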
3.3. The re-substitution and jackknife testing results
of SVMs with different subsequence lengths
Our method was first tested by the re-substitution
operation for different subsequence lengths from 1 to 4.
Table 1
Results of jackknife tests of the SVM method with different subsequence lengths based on the dataset R1568 (RBF kernel function, C = 1000; γ = 2^9, 2^8, 2^6 and 2^0, respectively, for subsequence lengths l = 1–4). The 2EM–6EM columns give the prediction accuracy for each class, Q_i (%); the last column gives the overall accuracy Q (%).

Subsequence length   2EM     3EM     4EM     6EM     Q (%)
l = 1                93.11   53.96   61.67   40.74   77.87
l = 2                95.51   57.55   68.06   50.93   81.95
l = 3                96.83   62.59   71.74   58.33   84.63
l = 4                97.05   63.31   72.24   61.11   85.14
As a result, the SVM method correctly predicts all homo-oligomers for all four subsequence lengths on the two datasets, so every performance measure reaches 100%. We do not present these results in a table because the accuracies are all the same.
Table 1 gives the results of jackknife tests of the SVM method with subsequence lengths from 1 to 4 on the dataset R1568. When l = 4, the SVM method performs best: the overall accuracy reaches 85.14%. It is observed that the predictive accuracies improve quickly as the subsequence length increases from 1 to 3, and that the rate of improvement begins to fall off at l = 4. It is also noticed that the homodimers are easier to recognize than the non-homodimers.
Table 2 gives the results of jackknife tests of the SVM method with subsequence lengths from 1 to 4 on the dataset R432. When l = 4, the SVM method performs best: the overall accuracy reaches 82.87%. It is seen that the predictive accuracies improve more quickly as the subsequence length increases from 1 to 3 on R432 than on R1568, and that the rate of improvement likewise falls off at l = 4. In addition, the differences in predictive accuracy between the four classes decrease on R432.
3.4. Discussion
From Tables 1 and 2, we can see that the dataset size and the sample imbalance between classes have a great influence on the prediction accuracy. In general, increasing the size of the training set and decreasing the sample imbalance between classes can improve the predictive performance.
For re-substitution tests with different subsequence lengths, all the predictive accuracies reach 100%. This indicates that, after training, the hyperplane output by the SVM has captured the complicated relationship between the composition information of a protein sequence and its multimeric state, and that SVM can be applied to predict the number of subunits.

Table 2
Results of jackknife tests of the SVM method with different subsequence lengths based on the dataset R432 (RBF kernel function, C = 1000; γ = 2^8, 2^7, 2^3 and 2^2, respectively, for subsequence lengths l = 1–4). The 2EM–6EM columns give the prediction accuracy for each class, Q_i (%); the last column gives the overall accuracy Q (%).

Subsequence length   2EM     3EM     4EM     6EM     Q (%)
l = 1                61.11   79.63   59.26   65.74   66.90
l = 2                63.89   82.41   68.52   75.93   72.69
l = 3                78.70   87.96   78.70   81.48   81.71
l = 4                79.63   88.89   80.56   82.41   82.87
In the method of this paper, the coupling effect of the closest residues is taken into account by decomposing a sequence into subsequences. Moreover, the effect of residue order along the sequence is captured through the overlaps of subsequences; the longer the subsequence is, the more information about the original sequence it includes.
From the jackknife test results, we notice that as the subsequence length increases from 1 to 4, the correctness rate tends to rise in all tests. This trend suggests that prediction quality can be improved by increasing the subsequence length. It does not mean, however, that the longer the subsequence, the better the result: once the subsequence length grows beyond some point, computational errors contribute most of the differences in the classification process. It is usually appropriate to take l = 3 or 4 for protein sequences, so we need not worry about difficulties for the SVM caused by the high dimension of the input vectors.
4. Conclusion
In this paper, the support vector machine method is applied to the classification of protein homo-oligomers from the primary structure, where subsequence distributions of primary sequences act as the input vectors of the SVM, by which the effects of residue order along sequences are taken into account. The tests show that the new method has good predictive quality and that the residue order plays an important role in the recognition of protein multimeric states. Our experiment also confirms that the protein primary sequence
encodes quaternary structure information. In order to further improve the predictive accuracy, we may try to extract other sequence descriptors from the primary protein sequence as feature vectors of the SVM, descriptors that grasp the most essential information reflecting quaternary structure. In addition, it is anticipated that the SVM method may be combined with other prediction methods to become a very useful tool for predicting protein multimeric states.
Acknowledgements
This work has been supported by the Chinese NSF under grant 90103033. The authors would like to thank
Dr Robert Garian (School of Computational Sciences,
George Mason University, USA) for sending the dataset.
References
[1] C.B. Anfinsen, E. Haber, M. Sela, F.H. White, The kinetics of
formation of native ribonuclease during oxidation of the reduced
polypeptide chain, Proc. Natl Acad. Sci. USA 47 (1961) 1309–1314.
[2] C.B. Anfinsen, Principles that govern the folding of protein chains,
Science 181 (1973) 223–230.
[3] I.M. Klotz, D.M. Darnall, N.R. Langerman, Quaternary structure of
proteins, in: H. Neurath, R.L. Hill (Eds.), The Proteins vol. 1,
Academic Press, New York, 1975, pp. 293–411.
[4] E.M. Marcotte, M. Pellegrini, H.L. Ng, D.W. Rice, T.O. Yeates,
D. Eisenberg, Detecting protein function and protein–protein
interactions from genome sequences, Science 285 (1999) 751–753.
[5] F. Glaser, D.M. Steinberg, I.A. Vakser, N. Ben-Tal, Residue
frequencies and pairing preference at protein–protein interfaces,
Proteins: Struct. Funct. Genet. 43 (2001) 89–102.
[6] J.R. Bock, D.A. Gough, Predicting protein–protein interactions from
primary structure, Bioinformatics 17 (2001) 455–460.
[7] I.M.A. Nooren, J.M. Thornton, Structural characterization and
functional significance of transient protein–protein interactions,
J. Mol. Biol. 325 (2003) 991–1018.
[8] Y. Ofran, B. Rost, Analysing six types of protein–protein interfaces,
J. Mol. Biol. 325 (2003) 377–387.
[9] R. Garian, Prediction of quaternary structure from primary structure,
Bioinformatics 17 (2001) 551–556.
[10] V. Vapnik, Statistical learning theory, Wiley, New York, 1998.
[11] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector
Machines, Cambridge University Press, Cambridge, 2000.
[12] T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, Support vector
machine classification and validation of cancer tissue samples using
microarray expression data, Bioinformatics 16 (2000) 906–914.
[13] C.H.Q. Ding, I. Dubchak, Multi-class protein fold recognition using
support vector machines and neural networks, Bioinformatics 17
(2001) 349–358.
[14] Y.D. Cai, X.J. Liu, X.B. Xu, K.C. Chou, Support vector machines for
prediction of protein subcellular location by incorporating quasi-
sequence-order effect, J. Cell Biochem. 84 (2002) 343–348.
[15] S. Hua, Z. Sun, A novel method of protein secondary structure
prediction with high segment overlap measure: support vector
machine approach, J. Mol. Biol. 308 (2001) 397–407.
[16] K.C. Chou, Prediction of cellular attributes using pseudo-amino acid composition, Proteins: Struct. Funct. Genet. 43 (2001) 246–255.
[17] I. Dubchak, I. Muchnik, C. Mayor, S.-H. Kim, Recognition of a
protein fold in the context of the SCOP classification, Proteins: Struct.
Funct. Genet. 35 (1999) 401–407.
[18] W.M. Liu, K.C. Chou, Prediction of protein secondary structure
content, Protein Eng. 12 (1999) 1041–1050.
[19] C.T. Zhang, K.C. Chou, An optimization approach to predicting
protein structural class from amino acid composition, Protein Sci. 1
(1992) 401–408.
[20] W. Fang, F.S. Roberts, Z. Ma, A measure of discrepancy of multiple
sequences, Inform. Sci. 137 (2001) 75–102.
[21] J. Wang, W. Fang, L. Ling, R. Chen, Gene’s functional arrangement
as a measure of the phylogenetic relationships of microorganisms,
J. Biol. Phys. 28 (2002) 55–62.
[22] L. Jin, W. Fang, H. Tang, Prediction of protein structural classes by a new measure of information discrepancy, Comput. Biol. Chem. 27 (2003) 373–380.
[23] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, 1998, pp. 185–208.
[24] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[25] U. Kreßel, Pairwise classification and support vector machines, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, 1998, pp. 255–268.