svm dbeth

Human pathogenic Bacterial ExoToxin Identification using Improved Feature Vector and Support Vector Machine

Why Bacterial exotoxin identification?

Major cause of disease, leading to symptoms and lesions during infection

• So it becomes important to study their mechanism in order to fight them

Their toxins are specific to a species

• So species-specific information is needed

Exotoxins in particular, though completely neutralized in vivo, are only partially inhibited in vitro

• Implying they are also regulated by environmental signals, so the study of properties that interact with the environment becomes important

Most bacteria become resistant to antibiotics because of mutation or genetic recombination

• Requires identification of new sequences

Further, inactive exotoxins that form toxoids, still retaining their antigenic properties, can be used to cure certain diseases

Support Vector Machine?

Introduced by Vapnik in 1992.

Set of related supervised learning methods that analyze and recognize patterns

Used for classification and regression analysis

Non-probabilistic binary linear classifier

Based on statistical learning and optimization theories

Can handle multiple continuous as well as categorical variables

Principle

• Representation of examples as points in space

• Mapped such that examples of separate categories are divided by a gap as wide as possible

• Constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space

• Such that the hyperplane is at maximum distance from the nearest data point of either class

Working: Given a training set of instance-label pairs $(x_i, y_i)$, $i = 1, \dots, n$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{1, -1\}$:

[Figure: separating hyperplane $w^T x + b = 0$ with unit normal $w/\|w\|$ and margin $m$; training points labeled $(x_1, 1)$ through $(x_n, -1)$]

Maximize the margin $m$ to the nearest data points of either class. For those points $y_i (w^T x_i + b) = 1$, so $m = y_i (w^T x_i + b)/\|w\| = 1/\|w\|$

The original problem in a finite-dimensional space may not be linearly separable, so it is mapped to a higher-dimensional space

A kernel function is introduced to make computations in the higher-dimensional space easier
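The margin relation can be checked numerically. The sketch below is only an illustration: the four points, the weight vector `w`, and the offset `b` are hypothetical toy values chosen so that the nearest points satisfy $y_i (w^T x_i + b) = 1$.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w

def geometric_margin(w, b, points, labels):
    """Smallest value of y_i (w.x_i + b) / ||w|| over the training set."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

# Hypothetical toy data: two points per class in the plane.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Hypothetical hyperplane in canonical form: the nearest points give y (w.x + b) = 1.
w, b = (0.5, 0.5), -1.0

margin = geometric_margin(w, b, points, labels)
inverse_norm = 1.0 / math.sqrt(sum(wi * wi for wi in w))
print(margin, inverse_norm)  # the two values agree for a canonical hyperplane
```

Because the hyperplane is in canonical form, the smallest normalized distance equals $1/\|w\|$, which is why maximizing the margin amounts to minimizing $\|w\|$.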

Optimization problem

Training an SVM requires the solution of the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \; \xi_i \ge 0$$

where

• $\phi$: function mapping from input space to feature space

• $C > 0$: penalty parameter of the error term

• $\xi_i$: error (slack) term introduced

The dual solution of the optimization problem, found using Lagrange's theorem, depends only on the inner products between the support vectors and the new vector $x$ whose class is to be determined.

The kernel function, given by $K(x, z) = \phi(x) \cdot \phi(z)$, lets the SVM learn in the high-dimensional feature space without having to compute $\phi(x)$ explicitly.
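A small worked case may make the identity $K(x, z) = \phi(x) \cdot \phi(z)$ concrete. The sketch below assumes a degree-2 homogeneous polynomial kernel on 2-D inputs, for which the explicit map $\phi$ is known. The kernel choice and the two sample vectors are illustrative assumptions, not taken from the slides.

```python
import math

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for K(x, z) = (x.z)^2 on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Same quantity computed in input space, without building phi."""
    return dot(x, z) ** 2

# Hypothetical sample vectors.
x, z = (1.0, 2.0), (3.0, -1.0)

explicit = dot(phi(x), phi(z))   # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)    # kernel evaluation in the 2-D input space
print(explicit, implicit)
```

Both routes give the same number, but the kernel route never constructs the higher-dimensional vectors.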

Kernel Function

A valid kernel function must satisfy Mercer's theorem, which requires the corresponding kernel matrix $K$ to be symmetric positive semi-definite ($z^T K z \ge 0$ for every vector $z$). The following are commonly used kernel functions:

• linear: $K(x_i, x_j) = x_i^T x_j$

• polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$

• radial basis function (RBF): $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, $\gamma > 0$

• sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$

The effectiveness of an SVM depends on the selection of the kernel, the kernel parameters, and the soft-margin parameter $C$.
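The four kernels can be written down directly. The sketch below is a plain-Python illustration, not LIBSVM code. The default values for `gamma`, `r`, and `d`, and the three sample points, are arbitrary placeholders. The kernel (Gram) matrix is built for the RBF kernel so that its symmetry can be inspected.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear(x, z):
    return dot(x, z)

def polynomial(x, z, gamma=1.0, r=0.0, d=3):
    return (gamma * dot(x, z) + r) ** d

def rbf(x, z, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def sigmoid(x, z, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(x, z) + r)

def gram_matrix(kernel, samples):
    """Kernel matrix K with K[i][j] = kernel(samples[i], samples[j])."""
    return [[kernel(a, b) for b in samples] for a in samples]

def quadratic_form(K, z):
    """z^T K z, which must be >= 0 for a positive semi-definite K."""
    return sum(z[i] * K[i][j] * z[j] for i in range(len(z)) for j in range(len(z)))

# Hypothetical sample points.
samples = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
K = gram_matrix(rbf, samples)
print(quadratic_form(K, (1.0, -1.0, 0.0)))
```

Evaluating the quadratic form for one vector `z` does not prove positive semi-definiteness; it is only a spot check of the condition stated above.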

Data Collection

To model an SVM that classifies human pathogenic bacterial toxins from nontoxins, 2 major databases were compiled: one of bacterial toxins and one of nontoxins.

294 bacterial toxin sequences were taken from the Bacterial Toxin Database at http://www.hpppi/iicb.res.in/btox

• It contained representative protein sequences from 24 different genera of human pathogenic bacteria in FASTA format

• This database was created by evaluating and processing over 4750 toxin sequences from 24 different genera, retrieved from NCBI (www.ncbi.nlm.nih.gov), to remove the redundancies and obtain the representatives

Next, 2940 nontoxin sequences were manually assembled from NCBI,

• selecting protein sequences significant to metabolic processes and others

• and then removing the sequences with more than 90% sequence identity using CD-HIT

Of the 294 toxin (positive samples) and 2940 nontoxin (negative samples) sequences,

• 44 toxin and 440 nontoxin were set apart for testing

• the remaining 250 toxin and 2500 nontoxin feature vectors were used for training
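The 294/2940 split into a test set of 44 toxin and 440 nontoxin sequences can be sketched as a simple hold-out. The slides do not say how the held-out sequences were chosen, so the seeded random selection below is an assumption, and the identifiers are placeholders for the real sequences.

```python
import random

def hold_out(items, n_test, seed=0):
    """Randomly set aside n_test items; return (train, test) lists."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder identifiers standing in for the real sequences.
toxins = [f"toxin_{i}" for i in range(294)]
nontoxins = [f"nontoxin_{i}" for i in range(2940)]

toxin_train, toxin_test = hold_out(toxins, 44)
nontoxin_train, nontoxin_test = hold_out(nontoxins, 440)

print(len(toxin_train), len(toxin_test), len(nontoxin_train), len(nontoxin_test))
```

Splitting each class separately keeps the 1:10 toxin-to-nontoxin ratio in both the training and the test portion.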

Feature Extraction

Twelve physicochemical properties have been employed to describe each protein

• These include Hydrophobicity, Contact Features, Absolute Entropy, Hydration Potential, Isoelectric Point, Net Charge, Normalised Flexibility Parameters, Relative Mutability, Side-chain Orientational Preference, Occurrence Frequency, pKa-RCOOH, and Polarity

The ith feature in the feature vector of the jth protein sequence, for $i = 1, 2, \dots, 12$, is given by

$$F_j(i) = \frac{1}{N} \sum_{k=1}^{20} \mathrm{prp}_k(i) \, N_k$$

where

• $\mathrm{prp}_k(i)$: ith property of the kth amino acid, for all $k = 1, 2, \dots, 20$

• $N_k$: number of residues of the kth amino acid in the sequence

• $N$: length of the sequence
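The feature formula is a composition-weighted average of one property over the sequence. The sketch below computes it for a single property. The slides do not name the hydrophobicity scale used, so the Kyte-Doolittle hydropathy values here are a stand-in assumption, and the short peptide is a made-up example.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values, used only as a stand-in property scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def property_feature(sequence, scale):
    """F_j(i) = sum over amino acids k of prp_k(i) * N_k, divided by N."""
    counts = Counter(sequence)          # N_k for each residue type present
    n = len(sequence)                   # N, the sequence length
    return sum(scale[aa] * count for aa, count in counts.items()) / n

# Made-up five-residue peptide.
feature = property_feature("MKVLA", KYTE_DOOLITTLE)
print(feature)
```

Repeating the call with twelve different scales would give the twelve property features of one protein. Residues outside the 20 standard amino acids raise a `KeyError` in this sketch and would need explicit handling.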

Dipeptide and tripeptide composition: to reduce the dimensionality of the feature space, amino acids were grouped according to their properties into 11 groups:

• FWY, R, K, DE, H, M, QN, ST, C, and AGILVP
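Grouped dipeptide and tripeptide composition can be sketched as k-mer counting over a reduced alphabet. The grouping below copies the list on the slide verbatim. The normalization by the number of windows and the example peptide are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Groups exactly as listed on the slide.
GROUPS = ["FWY", "R", "K", "DE", "H", "M", "QN", "ST", "C", "AGILVP"]
GROUP_OF = {aa: idx for idx, members in enumerate(GROUPS) for aa in members}

def grouped_kmer_composition(sequence, k=2):
    """Frequencies of all length-k patterns over the grouped alphabet."""
    reduced = [GROUP_OF[aa] for aa in sequence]          # residue -> group index
    windows = [tuple(reduced[i:i + k]) for i in range(len(reduced) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    # One entry per possible pattern, in a fixed order, so vectors are comparable.
    return [counts[p] / total for p in product(range(len(GROUPS)), repeat=k)]

# Made-up peptide; k=2 gives dipeptides, k=3 gives tripeptides.
dipeptides = grouped_kmer_composition("MKVLAFDE", k=2)
tripeptides = grouped_kmer_composition("MKVLAFDE", k=3)
print(len(dipeptides), len(tripeptides))
```

With G groups the vector has G^k entries, compared with 20^k for the full amino acid alphabet, which is the dimensionality reduction the grouping is meant to provide.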

LIBSVM tool

svmtrain: prepares models (classifiers) trained from training sets

svmpredict: predicts the class of the test or experimental samples

Steps followed before applying the svmtrain module:

• Run checkdata.py from the tools folder in the package to check that the data instances are in an acceptable format

• Apply subset.py from the tools folder to split the data instances into 80% for training and the remaining 20% for testing

• Scale the data using svmscale

• Apply grid.py, again from the tools folder, to select optimal values for the kernel function parameter and the parameter C
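Two of the preparation steps can be illustrated in plain Python: writing instances in the sparse `label index:value` text format that LIBSVM reads, and scaling each feature linearly. This is a sketch of the idea, not the LIBSVM tools themselves. The target range of [-1, 1], the handling of constant columns, and the sample rows are assumptions.

```python
def scale_columns(rows, lower=-1.0, upper=1.0):
    """Linearly rescale each feature column to the range [lower, upper]."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, mins, maxs):
            if hi == lo:
                new_row.append(0.0)   # constant column: arbitrary choice in this sketch
            else:
                new_row.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled

def to_libsvm_line(label, features):
    """Format one instance as '<label> <index>:<value> ...' with 1-based indices.

    Zero-valued features are omitted, as the sparse format allows.
    """
    pairs = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0)
    return f"{label} {pairs}".rstrip()

# Hypothetical raw feature rows (two features, three instances).
rows = [[1.0, 10.0], [3.0, 20.0], [2.0, 15.0]]
scaled = scale_columns(rows)
lines = [to_libsvm_line(label, feats) for label, feats in zip([1, -1, -1], scaled)]
print("\n".join(lines))
```

In practice the minimum and maximum found on the training data should be reused when scaling the test data, so that both sets go through the same transformation.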

The values for g and C were incremented stepwise (step 1 in the exponent) through combinations of:

• powers of 2 from -11 through +3 for g, and

• powers of 2 from -9 through +5 for C

using the tool grid.py, which used 5-fold cross-validation accuracy to select the optimal parameter set.
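The search grid itself is easy to enumerate. The sketch below builds the (C, g) pairs from the exponent ranges given above and picks the pair with the highest score. The scoring function `toy_cv_accuracy` is a hypothetical stand-in for a real 5-fold cross-validation run, with a peak placed arbitrarily, so only the mechanics are shown.

```python
import math
from itertools import product

def parameter_grid(log2c=(-9, 5), log2g=(-11, 3), step=1):
    """All (C, g) pairs on the log2 grid, end points included."""
    c_exponents = range(log2c[0], log2c[1] + 1, step)
    g_exponents = range(log2g[0], log2g[1] + 1, step)
    return [(2.0 ** c, 2.0 ** g) for c, g in product(c_exponents, g_exponents)]

def grid_search(cv_accuracy, grid):
    """Return the (C, g) pair with the best cross-validation score."""
    return max(grid, key=lambda pair: cv_accuracy(*pair))

def toy_cv_accuracy(c, g):
    """Hypothetical score surface, NOT real cross-validation accuracy.

    Peaks at C = 2^1 and g = 2^-3, values chosen arbitrarily for the demo.
    """
    return -((math.log2(c) - 1) ** 2 + (math.log2(g) + 3) ** 2)

grid = parameter_grid()
best_c, best_g = grid_search(toy_cv_accuracy, grid)
print(len(grid), best_c, best_g)
```

With 15 exponents for each parameter the grid holds 225 combinations, each of which costs one full 5-fold cross-validation in a real run.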

LIBSVM also provides a tool, fselect.py, to remove possibly redundant features from the original feature set.

fselect.py ranks the features by assigning each of them an F-score value. The higher the value, the more significant the feature is in predicting the classes.
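A per-feature F-score can be sketched as follows. This assumes the commonly used definition: the squared deviations of the two class means from the overall mean, divided by the sum of the within-class sample variances. Whether fselect.py computes exactly this quantity is an assumption here, and the feature values are made up.

```python
def mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def f_score(pos_values, neg_values):
    """F-score of one feature from its values in the positive and negative class."""
    overall = mean(pos_values + neg_values)
    between = (mean(pos_values) - overall) ** 2 + (mean(neg_values) - overall) ** 2
    within = sample_variance(pos_values) + sample_variance(neg_values)
    return between / within

# Made-up feature values: the first feature separates the classes, the second does not.
separating = f_score([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])
uninformative = f_score([1.0, 5.0, 9.0], [2.0, 5.0, 8.0])
print(separating, uninformative)
```

Sorting features by this score and keeping the top-ranked ones is one simple way to arrive at a reduced feature set. A feature that is constant within both classes gives a zero denominator and needs special handling.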

Performance Evaluation

• Accuracy = (TP + TN)/(TP + TN + FP + FN)

• Balanced Accuracy, BAC = (Specificity + Sensitivity)/2, where

◦ Specificity = TN/(TN + FP)

◦ Sensitivity = TP/(TP + FN)

• AUC: area under the curve of sensitivity against (1 − specificity)

• Matthews correlation coefficient [1], MCC = (TP*TN − FP*FN)/((TN + FN)*(TN + FP)*(TP + FP)*(TP + FN))^(1/2)
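These measures follow directly from the four confusion-matrix counts. In the sketch below, specificity is computed with its standard definition, TN/(TN + FP). The counts passed in are hypothetical and are not results from the study.

```python
import math

def evaluate(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, balanced accuracy, and MCC."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    bac = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0   # 0.0 when undefined
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "bac": bac,
        "mcc": mcc,
    }

# Hypothetical counts for a 1:10 toxin to nontoxin test set.
metrics = evaluate(tp=40, tn=430, fp=10, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On a 1:10 class ratio, accuracy alone can look high even when many toxins are missed, which is why balanced accuracy and MCC are reported alongside it.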

Result

• 92.27% average accuracy and 0.998 area under curve (AUC) were obtained when all the features (298) were utilized, whereas

• 91.16% accuracy and 0.94 AUC were achieved with an optimized set of 114 features (supplementary file 2)

• Much higher accuracies were achieved (98.13% and 97.92% for 298 and 114 features, respectively) when an absolutely separate test set consisting of 39 toxins and 390 non-toxins (1:10 ratio) was used for testing

Conclusion

• The top features can be studied to identify the important functionalities of the toxic proteins

• The approach is effective in identifying bacterial toxins without being computationally intensive

Thank You
