ivan dimitrov

Ivan Dimitrov. Copyright © 1997 Ivo Ivanov. School of Pharmacy, Medical University of Sofia. Application of machine learning techniques for allergenicity prediction. 2nd Regional Conference “Supercomputing Applications in Science and Industry”, Rodopi Hotel, Sunny Beach, Bulgaria, September 20-21, 2011



TRANSCRIPT

Page 1: Ivan Dimitrov

Ivan Dimitrov

Copyright © 1997 Ivo Ivanov

School of Pharmacy, Medical University of Sofia

Application of machine learning techniques for allergenicity prediction

2nd Regional Conference “Supercomputing Applications in Science and Industry”, Rodopi Hotel, Sunny Beach, Bulgaria, September 20-21, 2011

Page 2: Ivan Dimitrov

Allergen processing pathways

C. M. Hawrylowicz & A. O'Garra, Nature Reviews Immunology 2005, 271-283

Page 3: Ivan Dimitrov

FAO and WHO Codex Alimentarius guidelines for evaluating potential allergenicity of novel proteins

A query protein is potentially allergenic if, when compared with known allergens, it:

has > 35% sequence similarity over a window of 80 amino acids

or

has an identity of 6 to 8 contiguous amino acids.

Codex Principles and Guidelines on Foods Derived from Biotechnology. 2003 Rome, Italy: Codex Alimentarius Commission, Joint FAO/WHO Food Standards Programme, Food and Agriculture Organization.
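
The two criteria above can be sketched in code. This is only an illustration under simplifying assumptions: the guidelines are normally applied with a sequence-alignment tool, whereas this sketch compares ungapped windows and uses percent identity as a stand-in for similarity. The function names and the default values k=6, window=80 and threshold=35.0 are chosen for the example.

```python
def shares_exact_kmer(query, allergen, k=6):
    """Return True if any k contiguous residues of query occur verbatim in allergen."""
    if len(query) < k or len(allergen) < k:
        return False
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return any(query[i:i + k] in allergen_kmers for i in range(len(query) - k + 1))


def best_window_identity(query, allergen, window=80):
    """Highest percent identity between any two ungapped segments of length window.

    Assumption: no gaps are allowed, which is stricter than a real alignment.
    Returns 0.0 when either sequence is shorter than the window.
    """
    best = 0.0
    for qi in range(len(query) - window + 1):
        q_seg = query[qi:qi + window]
        for ai in range(len(allergen) - window + 1):
            a_seg = allergen[ai:ai + window]
            matches = sum(x == y for x, y in zip(q_seg, a_seg))
            best = max(best, 100.0 * matches / window)
    return best


def is_potential_allergen(query, known_allergens, k=6, window=80, threshold=35.0):
    """Flag the query if either criterion holds against any known allergen."""
    return any(
        shares_exact_kmer(query, a, k)
        or best_window_identity(query, a, window) > threshold
        for a in known_allergens
    )
```

The brute-force window comparison is quadratic in sequence length, which is acceptable for a demonstration but not for screening against a full allergen database.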

Page 4: Ivan Dimitrov

Bioinformatics approaches to allergen prediction

1. Sequence-alignment search of query protein

Extensive databases of known allergen proteins and the FAO/WHO guidelines:
- Structural Database of Allergenic Proteins (SDAP)
- Allermatch

Characteristics:
- High sensitivity (true positives/(true positives + false negatives))
- Many false positives and therefore low precision (true positives/(true positives + false positives))
- Discovery of novel antigens is restricted by their lack of similarity to known allergens.

Ivanciuc et al. Nucleic Acids Res. 2003, 31, 359–362
Fiers et al. BMC Bioinformatics 2004, 5, 133

Page 5: Ivan Dimitrov

Bioinformatics approaches to allergen prediction

2. Identification of conserved allergenicity-related linear motifs

- Comparing allergens to non-allergens by MEME motif discovery tool
- Clustering of known allergens, wavelet analysis and hidden Markov model
- Automated Selection of Allergen-Representative Peptides (DASARP)
- Motif search by Support Vector Machines (SVM), MEME/MAST, IgE epitopes and Allergen-Representative Peptides (ARP)
- Iterative pairwise sequence similarity encoding scheme with SVM as the discriminating engine

Both approaches are based on the assumption that allergenicity is a linearly encoded property.

Stadler and Stadler FASEB J. 2003, 17, 1141-1143
Saha and Raghava Nucleic Acids Research 2006, 34, 202-209
Li et al. Bioinformatics 2004, 20, 2572-2578
Muh et al. PLoS ONE 2009, 4 (6), art. no. e5861
Björklund et al. Bioinformatics 2005, 21, 39–50

Page 6: Ivan Dimitrov

AIM of the study

To create an alignment-free method for in silico identification of allergens based on the main chemical properties of amino acid sequences and to implement it as a web server.

Obstacles:

Allergens are proteins of different lengths.

The choice of appropriate descriptors to represent the physicochemical properties of amino acid sequences.

Page 7: Ivan Dimitrov

The z-scales

z1: hydrophobicity, z2: molecular size, z3: polarity

…Phe – Arg – Trp…

        z1      z2      z3
Phe   -4.22    1.94    1.08
Arg    3.62    2.60   -3.60
Trp   -4.36    3.94    0.69

Hellberg et al. J. Med. Chem. 1987; 30, 1126-1135

Page 8: Ivan Dimitrov

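ACC transformation

Auto-covariance:

$$ACC_{jj}(lag) = \sum_{i=1}^{n-lag} \frac{Z_{j,i} \times Z_{j,i+lag}}{n - lag}$$

Cross-covariance:

$$ACC_{jk}(lag) = \sum_{i=1}^{n-lag} \frac{Z_{j,i} \times Z_{k,i+lag}}{n - lag}$$

j, k are the z-scales (j, k = 1, 2, 3; j ≠ k for the cross-covariance); i is the amino acid position; n is the number of amino acids in the sequence; lag is the distance between the two positions being multiplied.

Example: protein Phe – Arg – Trp – Phe – Arg – Trp (n = 6, lag = 1)

ACC11(1): the z1 value of each residue is multiplied by the z1 value of the next residue; the 5 products are summed and divided by 5.

ACC13(1): the z1 value of each residue is multiplied by the z3 value of the next residue; the 5 products are summed and divided by 5.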

Wold et al. Anal. Chim. Acta 1993, 277, 239-253
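
A minimal sketch of the ACC transformation follows, using the slide's example sequence. The z values for Phe, Arg and Trp are the ones shown on the z-scales slide; a real implementation would need the full table for all 20 amino acids. The function names are hypothetical, scale indices are 0-based in the code (index 0 is z1), and reading the 45 variables as 3 × 3 scale pairs times lags 1 to 5 is an assumption.

```python
# z1, z2, z3 values for the three residues shown on the z-scales slide.
Z_SCALES = {
    "F": (-4.22, 1.94, 1.08),   # Phe
    "R": (3.62, 2.60, -3.60),   # Arg
    "W": (-4.36, 3.94, 0.69),   # Trp
}


def acc(sequence, j, k, lag, scales=Z_SCALES):
    """ACC_jk(lag); j == k gives auto-covariance, j != k cross-covariance.

    j and k are 0-based scale indices (0 -> z1, 1 -> z2, 2 -> z3).
    """
    n = len(sequence)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < sequence length")
    z = [scales[residue] for residue in sequence]
    total = sum(z[i][j] * z[i + lag][k] for i in range(n - lag))
    return total / (n - lag)


def acc_vector(sequence, max_lag=5, scales=Z_SCALES):
    """Fixed-length vector of 3 * 3 * max_lag values, whatever the sequence length."""
    return [
        acc(sequence, j, k, lag, scales)
        for lag in range(1, max_lag + 1)
        for j in range(3)
        for k in range(3)
    ]


acc11_lag1 = acc("FRWFRW", 0, 0, 1)   # ACC11(1) from the slide example
acc13_lag1 = acc("FRWFRW", 0, 2, 1)   # ACC13(1) from the slide example
vector = acc_vector("FRWFRW")          # 45 values for max_lag = 5
```

Because the vector length depends only on the number of scales and lags, proteins of different lengths map to vectors of the same size, which addresses the first obstacle listed in the aim of the study.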

Page 9: Ivan Dimitrov

Preliminary study

595 food allergens from CSL allergen database (http://allergen.csl.gov.uk)
595 non-allergens from NCBI database (http://www.ncbi.nlm.nih.gov/)

Training set: 475 food allergens, 475 non-allergens

Test set: 120 food allergens, 120 non-allergens

ACC transformation of z descriptors: matrix with 45 variables (3² × 5) and 950 observations

Statistical methods, machine learning:
- PLS - discriminant analysis
- Logistic regression
- Naïve - Bayes algorithm
- Decision tree algorithm
- k Nearest Neighbours

External validation: sensitivity, specificity, accuracy

Page 10: Ivan Dimitrov

Results from preliminary study

$$sensitivity = \frac{TP}{TP + FN}$$

$$specificity = \frac{TN}{TN + FP}$$

$$accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$

TP – true positive, FP – false positive, TN – true negative, FN – false negative

[Figure: bar chart of sensitivity, specificity and accuracy (%) by algorithm: PLS-DA, Logistic regression, Decision tree, Naïve-Bayes, kNN (k=3), kNN (k=5)]
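
The three validation measures can be computed directly from the confusion counts, as in the sketch below. The function name is hypothetical and the counts in the usage line are toy numbers chosen for illustration, not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),                # allergens correctly recognised
        "specificity": tn / (tn + fp),                # non-allergens correctly recognised
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # all correct predictions
    }


# Toy counts for illustration only (not data from the presentation).
metrics = classification_metrics(tp=8, fp=3, tn=7, fn=2)
```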

Page 11: Ivan Dimitrov

Web servers on the test set

Test set: 120 food allergens, 120 non-allergens

Servers:
- AlgPred: SVM with single amino acid composition; SVM with dipeptide composition
- Evaller
- APPEL (http://jing.cz3.nus.edu.sg/cgi-bin/APPEL)
- Allerhunter

[Figure: bar chart of sensitivity, specificity and accuracy (%) by server: AlgPred (SVM, single amino acid composition), AlgPred (SVM, dipeptide composition), Evaller, APPEL, Allerhunter, kNN (k=5)]

Saha and Raghava Nucleic Acids Research 2006, 34, 202-209
Barrio et al. Nucleic Acids Research 2007, 35, 694-700
Muh et al. PLoS ONE 2009, 4 (6), art. no. e5861

Page 12: Ivan Dimitrov

Conclusions from the preliminary study

1. The model developed by the k Nearest Neighbours method shows the best performance on the test set compared with the other methods. It has a good balance between specificity and sensitivity, and the highest accuracy. kNN was used further in the study.

2. The server Allerhunter is the best performing among the known servers for allergen prediction. kNN needs further improvement.

3. A large imbalance exists between sensitivity and specificity for almost all servers. This indicates that the dataset needs some improvement too.

Page 13: Ivan Dimitrov

The kNN algorithm

Training set (475 allergens, 475 non-allergens): ACC transformation of z descriptors gives a matrix of 45 variables (3² × 5) and 950 observations.

Unknown protein: ACC transformation of z descriptors gives a vector with 45 variables (3² × 5).

1. Calculate the Euclidean distance between the vector and each observation.

2. Sort the distances in ascending order.

3. Determine the k nearest neighbours.

4. Determine the class of the unknown protein according to the majority of nearest neighbours.
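
The four kNN steps can be sketched as follows. This is an illustration, not the code behind the presented models: the function name is hypothetical, and the toy vectors have 2 components, whereas the vectors in the study have 45.

```python
import math
from collections import Counter


def knn_classify(train_vectors, train_labels, query_vector, k=3):
    """Classify query_vector by majority vote among its k nearest training vectors."""
    # Step 1: Euclidean distance between the query vector and each observation.
    distances = [
        (math.dist(vector, query_vector), label)
        for vector, label in zip(train_vectors, train_labels)
    ]
    # Step 2: sort the distances in ascending order.
    distances.sort(key=lambda pair: pair[0])
    # Step 3: keep the labels of the k nearest neighbours.
    nearest_labels = [label for _, label in distances[:k]]
    # Step 4: the majority class among the neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-component vectors for illustration only.
TRAIN_VECTORS = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
TRAIN_LABELS = ["allergen"] * 3 + ["non-allergen"] * 3

prediction = knn_classify(TRAIN_VECTORS, TRAIN_LABELS, (0.15, 0.15), k=3)
```

With two classes, an odd k avoids tied votes, which is consistent with the odd values of k used throughout the slides.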

Page 14: Ivan Dimitrov

Next: Extend the data sets

Allergens: CSL allergen database, FARRP allergen database, SDAP database, ADFS database

684 food, 1157 inhalant, 553 toxin, venom or salivary allergens

Non-allergens: NCBI database

1. Allergen species

2. Proteins from allergen species

3. Create local database

4. BLAST search against all allergens

684 non-allergens of food origin, 1157 non-allergens of inhalant origin, 553 non-allergens from species with toxin, venom or salivary allergens

http://allergen.csl.gov.uk
http://www.allergenonline.org/
http://fermi.utmb.edu/SDAP/
http://allergen.nihs.go.jp/ADFS/index.jsp
http://www.ncbi.nlm.nih.gov/

Page 15: Ivan Dimitrov

Next: kNN optimization

684 food allergens, 684 non-allergens

Training set: 528 allergens, 528 non-allergens

Test set: 156 allergens, 156 non-allergens

Machine learning: k nearest neighbours

External validation: sensitivity, specificity, accuracy

[Figure: sensitivity, specificity and accuracy (%, axis from 50 to 100) plotted against the number of nearest neighbours, k = 3, 5, 7, 9, 11, 13, 15, 17, 19]
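
The optimization over k can be sketched as a scan over odd values, scoring each on a held-out test set. The function names and the one-dimensional toy data are assumptions made for the example. The slide reports sensitivity and specificity as well, while this sketch scores accuracy only to stay short.

```python
import math
from collections import Counter


def knn_predict(train, x, k):
    """train is a list of (vector, label) pairs; returns the majority label."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy_by_k(train, test, ks=range(3, 20, 2)):
    """Test-set accuracy for each candidate k (default: odd k from 3 to 19)."""
    results = {}
    for k in ks:
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        results[k] = correct / len(test)
    return results


# Toy one-dimensional data for illustration only.
TRAIN = [((v,), "allergen") for v in (0.0, 0.1, 0.2, 0.3, 0.4)]
TRAIN += [((v,), "non-allergen") for v in (1.0, 1.1, 1.2, 1.3, 1.4)]
TEST = [((0.25,), "allergen"), ((1.15,), "non-allergen")]

scores = accuracy_by_k(TRAIN, TEST, ks=(3, 5, 7, 9))
best_k = max(scores, key=scores.get)   # ties resolve to the smallest k scanned first
```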

Page 16: Ivan Dimitrov

kNN models

Food model: 684 food allergens, 684 non-allergens

Training set: 528 allergens, 528 non-allergens

Test set: 156 allergens, 156 non-allergens

kNN, k = 3, external validation

Inhalant model: 1157 inhalant allergens, 1157 non-allergens

Training set: 933 allergens, 933 non-allergens

Test set: 224 allergens, 224 non-allergens

kNN, k = 3, external validation

External validation: sensitivity, specificity, accuracy

Page 17: Ivan Dimitrov

kNN models

[Figure: bar chart of sensitivity, specificity and accuracy (%) for five models: kNN, food training and test set; kNN, food training set on inhalant test set; kNN, inhalant training and test set; kNN, inhalant training set on food test set; kNN, aggregated training and test set]

Page 18: Ivan Dimitrov

AllerTOP web tool for allergenicity prediction

Training set: 1952 food, inhalant and other allergens and 1952 non-allergens

ACC transformation of z descriptors

kNN model

External validation

AllerTOP

http://www.pharmfac.net/allertop

Page 19: Ivan Dimitrov

Server performance on the united test set

Two of the servers from the preliminary study, APPEL and Evaller, were not available during the recent study. The results for the Allerhunter server were obtained with a smaller test set because it cannot process short sequences (<21 amino acids).

United test set of 441 food and inhalant allergens and 441 non-allergens

[Figure: bar chart of sensitivity, specificity and accuracy (%) by server: AllerTOP (kNN, k=3), Allerhunter, AlgPred (SVM, amino acid composition), AlgPred (SVM, dipeptide composition), AlgPred (ARP)]

Page 20: Ivan Dimitrov

Conclusions

1. An alignment-free method for in silico prediction of allergens based on the main physicochemical properties of proteins was developed.

2. The method uses z descriptors to represent the amino acids in the protein sequences and the ACC transformation to convert proteins into uniform vectors.

3. The k Nearest Neighbours classification method outperformed the other classification algorithms tested in the study: PLS - discriminant analysis, Logistic regression, Naïve - Bayes and Decision Tree algorithm.

4. The kNN algorithm was optimized and its performance was compared with the freely available web servers for prediction of allergens.

5. The kNN algorithm was implemented on a web server, freely available at: http://www.pharmfac.net/allertop

Page 21: Ivan Dimitrov

Acknowledgements

Drug Design Group, School of Pharmacy, Medical University of Sofia: Irini Doytchinova, Ivan Dimitrov, Mariyana Atanasova, Panaiot Garnev

Darren R. Flower, Aston University, Birmingham, UK

Funding: National Research Fund, Ministry of Education and Science, Bulgaria, Grant 02-1/2009