T cell epitope predictions using bioinformatics

(Neural networks and hidden Markov models)

Morten Nielsen, CBS, BioCentrum,

Processing of intracellular proteins

http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm

MHC binding

What makes a peptide a potential and effective epitope?
• Part of a pathogen protein
• Successful processing
– Proteasome cleavage
– TAP binding
• Binds to MHC molecule
• Protein function
– Early in replication
• Sequence conservation in evolution

SARS virus

From proteins to immunogens

Lauemøller et al., 2000

20% processed, 0.5% bind MHC, 50% CTL response

=> 1/2000 peptides are immunogenic
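The 1/2000 estimate follows from multiplying the three fractions on the slide. A minimal sketch of that arithmetic, assuming the three steps act as independent filters applied one after the other (the variable names are chosen here for illustration):

```python
from fractions import Fraction

# Fractions quoted on the slide (Lauemøller et al., 2000).
processed = Fraction(20, 100)      # peptides that are successfully processed
bind_mhc = Fraction(5, 1000)       # 0.5% bind MHC
ctl_response = Fraction(50, 100)   # binders that give a CTL response

# Assumption: the steps are independent, so the fractions multiply.
immunogenic = processed * bind_mhc * ctl_response
print(immunogenic)  # 1/2000
```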

Location of class I epitopes

GP1200 protein structure (1GM9)

MHC class I with peptide

http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm

Anchor positions

Prediction of HLA binding specificity

• Simple motifs
– Allowed/non allowed amino acids
• Extended motifs
– Amino acid preferences (SYFPEITHI)
– Anchor/Preferred/other amino acids
• Hidden Markov models
– Peptide statistics from sequence alignment
• Neural networks
– Can take sequence correlations into account

Where to get data?
• SYFPEITHI database
– 3500 peptides known to bind to MHC class I and II
– Only published data
• MHCpep
– 13000 peptides known to bind to MHC class I and II
– Published data and direct submission
– No update since 1998
• Binding affinity assays
– Quantitative data. How strongly does a peptide bind to the MHC molecule?
– Costly, and people do not publish negative results.

Databases and web resources

• HLA Informatics Group, ANRI (HLA sequence database)
• IMGT/HLA Database (HLA sequence database)
• SYFPEITHI (Database of HLA Class I and II peptides)
• MHCPEP (Database of HLA Class I and II peptides)
• BIMAS (HLA Class I predictor)
• SYFPEITHI (HLA Class I predictor)
• NetMHC (HLA Class I prediction)

Sequence information

SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV
LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL
HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI
ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL
LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV
PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV
ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM
KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV
KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV
SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV
ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV
TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA
GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA
KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV
AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV
GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV
VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC
ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA
YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI
NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV
VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ
GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL
EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV
YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL
FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKL
AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI
AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV

Sequence logo

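• Height of a column equals its information content, log 20 + Σ p log p (summed over the 20 amino acids, with p the frequency of each amino acid in the column)

• Relative height of a letter within its column is its frequency p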

• Highly useful tool to visualize sequence motifs
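The column and letter heights of a logo can be computed directly from the amino acid frequencies. The sketch below assumes base-2 logarithms (heights in bits) and uses hypothetical example columns; the function names are illustrative only.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Column height: log2(20) + sum(p * log2(p)) over observed amino acids."""
    total = len(column)
    plogp = sum((n / total) * math.log2(n / total)
                for n in Counter(column).values())
    return math.log2(alphabet_size) + plogp

def letter_heights(column, alphabet_size=20):
    """Each letter gets the share p of the column height."""
    info = column_information(column, alphabet_size)
    total = len(column)
    return {aa: info * n / total for aa, n in Counter(column).items()}

conserved = "LLLLLLLLLL"  # hypothetical fully conserved column
mixed = "AAAAAAGGTK"      # hypothetical column: A x6, G x2, T x1, K x1

print(round(column_information(conserved), 2))  # 4.32, i.e. log2(20)
print(round(column_information(mixed), 2))
print(letter_heights(mixed))
```

A fully conserved column reaches the maximum height log2(20); the more evenly the amino acids are spread, the lower the column.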

High information positions

MHC class I HLA-A0201

http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html

Characterizing a binding motif

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

10 peptides known to bind MHC. What can we learn?

1. A at P1 favors binding?
2. I is not allowed at P9?
3. K at P4 favors binding?

Sequence information

• Description of binding motif
• Example (amino acid frequencies at P1)
PA = 6/10
PG = 2/10
PT = PK = 1/10
PC = PD = … = PV = 0
• Problems
– Few data
– Data redundancy/duplication
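The example frequencies can be reproduced by counting amino acids column by column in the ten binding peptides. A minimal sketch (the helper name `position_frequencies` is made up for this illustration):

```python
from collections import Counter
from fractions import Fraction

# The ten 9-mer peptides from the "Characterizing a binding motif" slide.
peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def position_frequencies(seqs, position):
    """Raw frequency of each amino acid at a 1-based motif position."""
    counts = Counter(seq[position - 1] for seq in seqs)
    return {aa: Fraction(n, len(seqs)) for aa, n in counts.items()}

p1 = position_frequencies(peptides, 1)
p9 = position_frequencies(peptides, 9)

print({aa: str(f) for aa, f in p1.items()})  # A is 3/5 (= 6/10), G is 1/5, T and K are 1/10
print("I" in p9)  # False: I is never observed at P9
```

Raw counting gives any unobserved amino acid a frequency of exactly zero, and the five nearly identical ALAKAAAA peptides dominate the P1 estimate. Those are the two problems listed above.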

Sequence information: raw sequence counting

Pseudo-count and sequence weighting

• Poor or biased sampling of sequence space

• I is not found at position P9. Does this mean that I is forbidden?

• No! Use Blosum substitution matrix to estimate pseudo frequency of I at P9

Similar sequences: weight 1/5

The Blosum matrix

Sequence weighting

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
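One simple way to arrive at a weight of 1/5 for the five near-identical peptides is to cluster similar sequences and give each sequence the weight 1/(cluster size). The sketch below is an assumption, not necessarily the scheme used for the slides: the 60% identity threshold and the single-linkage clustering are chosen only for illustration.

```python
from collections import Counter

peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def identity(a, b):
    """Fraction of identical positions between two equal-length peptides."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_weights(seqs, threshold=0.6):
    """Weight = 1 / size of the sequence's single-linkage cluster."""
    cluster_of = list(range(len(seqs)))  # every sequence starts alone
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if identity(seqs[i], seqs[j]) >= threshold:
                old, new = cluster_of[j], cluster_of[i]
                cluster_of = [new if c == old else c for c in cluster_of]
    sizes = Counter(cluster_of)
    return [1 / sizes[c] for c in cluster_of]

def weighted_frequencies(seqs, weights, position):
    """Weighted amino acid frequencies at a 1-based position."""
    totals = Counter()
    for seq, w in zip(seqs, weights):
        totals[seq[position - 1]] += w
    norm = sum(weights)
    return {aa: t / norm for aa, t in totals.items()}

weights = cluster_weights(peptides)
print(weights)
print(weighted_frequencies(peptides, weights, 1))
```

With these weights the frequency of A at P1 drops from 6/10 to about 1/3, because the five ALAKAAAA peptides together count as a single observation.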

Pseudo counts

• Sequence weighting and pseudo count
– Prediction accuracy
• Motif found on all data (485)
– Prediction accuracy

Weight matrices

• Estimate amino acid frequencies from alignment

• Now a weight matrix is given as
Wij = log(pij/qj)
– Here i is a position in the motif, and j an amino acid. qj is the background frequency for amino acid j.
• W is an L x 20 matrix, where L is the motif length
• Score sequences to weight matrix by looking up and adding L values from the matrix
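A minimal sketch of building W from an alignment and scoring a peptide against it. Several choices here are assumptions made for illustration: a uniform background qj = 1/20, the natural logarithm, and a small background-proportional pseudocount that stands in for the Blosum-based pseudo frequencies so that pij is never zero.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def weight_matrix(seqs, background=None, pseudo=1.0):
    """Return W as a list of L dicts with W[i][aa] = log(p_ij / q_j)."""
    if background is None:
        # Assumption: uniform background frequencies.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    matrix = []
    for i in range(len(seqs[0])):
        counts = Counter(seq[i] for seq in seqs)
        row = {}
        for aa in AMINO_ACIDS:
            # Simple pseudocount (not Blosum-based) to avoid log(0).
            p = (counts[aa] + pseudo * background[aa]) / (len(seqs) + pseudo)
            row[aa] = math.log(p / background[aa])
        matrix.append(row)
    return matrix

def score(peptide, matrix):
    """Look up and add one value per position (L values in total)."""
    return sum(matrix[i][aa] for i, aa in enumerate(peptide))

W = weight_matrix(peptides)
print(round(score("ALAKAAAAV", W), 2))
print(round(score("WWWWWWWWW", W), 2))
```

Amino acids that are more frequent than the background at a position get positive weights, and rare ones get negative weights, so a peptide resembling the training set scores high.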

Scoring sequences to a weight matrix

      A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
1   0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
2  -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
3   0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
4  -0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
5  -1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
6  -0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
7   1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
8  -2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
9  -0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5

ILYQVPFSV
ALPYWNFAT
MTAQWWLDA

Which peptide is most likely to bind? Which peptide second?

ILYQVPFSV 15.0
ALPYWNFAT -3.4
MTAQWWLDA 0.8
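The three scores can be checked by looking up one matrix value per position and adding them. The sketch below uses the weight matrix values exactly as printed on the slide; the names `MATRIX` and `score` are illustrative.

```python
AMINO_ACIDS = "A R N D C Q E G H I L K M F P S T W Y V".split()

# Rows P1..P9 of the weight matrix, copied from the slide.
ROWS = """
0.6 0.4 -3.5 -2.4 -0.4 -1.9 -2.7 0.3 -1.1 1.0 0.3 0.0 1.4 1.2 -2.7 1.4 -1.2 -2.0 1.1 0.7
-1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3 1.0 5.1 -3.7 3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8 0.4
0.2 -1.3 0.1 1.5 0.0 -1.8 -3.3 0.4 0.5 -1.0 0.3 -2.5 1.2 1.0 -0.1 -0.3 -0.5 3.4 1.6 0.0
-0.1 -0.1 -2.0 2.0 -1.6 0.5 0.8 2.0 -3.3 0.1 -1.7 -1.0 -2.2 -1.6 1.7 -0.6 -0.2 1.3 -6.8 -0.7
-1.6 -0.1 0.1 -2.2 -1.2 0.4 -0.5 1.9 1.2 -2.2 -0.5 -1.3 -2.2 1.7 1.2 -2.5 -0.1 1.7 1.5 1.0
-0.7 -1.4 -1.0 -2.3 1.1 -1.3 -1.4 -0.2 -1.0 1.8 0.8 -1.9 0.2 1.0 -0.4 -0.6 0.4 -0.5 -0.0 2.1
1.1 -3.8 -0.2 -1.3 1.3 -0.3 -1.3 -1.4 2.1 0.6 0.7 -5.0 1.1 0.9 1.3 -0.5 -0.9 2.9 -0.4 0.5
-2.2 1.0 -0.8 -2.9 -1.4 0.4 0.1 -0.4 0.2 -0.0 1.1 -0.5 -0.5 0.7 -0.3 0.8 0.8 -0.7 1.3 -1.1
-0.2 -3.5 -6.1 -4.5 0.7 -0.8 -2.5 -4.0 -2.6 0.9 2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6 4.5
""".strip().splitlines()

# MATRIX[i][aa] is the weight of amino acid aa at position i + 1.
MATRIX = [dict(zip(AMINO_ACIDS, map(float, row.split()))) for row in ROWS]

def score(peptide):
    """Sum of the per-position weights for a 9-mer peptide."""
    return sum(MATRIX[i][aa] for i, aa in enumerate(peptide))

for peptide in ("ILYQVPFSV", "ALPYWNFAT", "MTAQWWLDA"):
    print(peptide, round(score(peptide), 1))
```

Summing the printed values gives 15.0 for ILYQVPFSV and -3.4 for ALPYWNFAT. For MTAQWWLDA the sum is about 0.7 rather than the 0.8 shown, presumably because the slide scores were computed from unrounded matrix entries. The ranking is unaffected: ILYQVPFSV first, MTAQWWLDA second.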

How to predict
• The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations).
– Two adjacent amino acids may, for example, compete for the space in a pocket in the MHC molecule.
• Artificial neural networks (ANN) are ideally suited to take such correlations into account

Neural networks
• Neural networks can learn higher order correlations!
– What does this
A A => 0
A C => 1
C A => 1
C C => 0

No linear function can learn this pattern

Neural networks


XOR(x1,x2) = (x1 + x2) − 2 ⋅ x1 ⋅ x2 = y − z

y = x1 + x2

z = 2 ⋅ x1 ⋅ x2
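The XOR pattern can be computed by a network with one hidden layer, while a single linear threshold unit cannot reproduce it. The sketch below encodes A as 0 and C as 1 and uses hand-picked weights rather than trained ones; the step activation and the weight grid searched for the linear unit are assumptions made for illustration.

```python
from itertools import product

def step(x):
    """Threshold activation: 1 if the input is positive, else 0."""
    return 1 if x > 0 else 0

def xor_formula(x1, x2):
    """XOR(x1, x2) = (x1 + x2) - 2 * x1 * x2 for binary inputs."""
    return (x1 + x2) - 2 * x1 * x2

def xor_network(x1, x2):
    """Two hidden units and one output unit with hand-picked weights."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    return step(h_or - 2 * h_and - 0.5)

def linear_unit_can_fit(grid):
    """Search a weight grid for one unit step(w1*x1 + w2*x2 + b) = XOR."""
    inputs = list(product((0, 1), repeat=2))
    for w1, w2, b in product(grid, repeat=3):
        if all(step(w1 * a + w2 * c + b) == xor_formula(a, c)
               for a, c in inputs):
            return True
    return False

grid = [i / 2 for i in range(-4, 5)]  # weights from -2.0 to 2.0
for x1, x2 in product((0, 1), repeat=2):
    print(x1, x2, xor_network(x1, x2))
print(linear_unit_can_fit(grid))  # False
```

The hidden units play the roles of y and z: one responds to x1 + x2, the other to the product term, and the output unit subtracts the second from the first.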

Evaluation of prediction accuracy

True positive proportion = TP/AP (true positives over actual positives)

False positive proportion = FP/AN (false positives over actual negatives)

ROC curves (figure: example curves with Aroc = 0.5 and Aroc = 0.8)

Pearson correlation
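Both accuracy measures are short to compute. The sketch below builds the ROC curve from the true and false positive proportions, takes Aroc as the area under it (trapezoid rule), and computes the Pearson correlation coefficient. The scores and labels are hypothetical, and the ROC code assumes no tied scores.

```python
import math

def roc_points(scores, labels):
    """(FPP, TPP) points obtained by lowering the threshold step by step."""
    actual_pos = sum(labels)
    actual_neg = len(labels) - actual_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / actual_neg, tp / actual_pos))
    return points

def aroc(scores, labels):
    """Area under the ROC curve by the trapezoid rule."""
    pts = roc_points(scores, labels)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical predictions: 1 = binder, 0 = non-binder.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(aroc(scores, labels))
print(pearson(scores, labels))
```

An Aroc of 0.5 corresponds to a random ranking and 1.0 to a perfect separation of binders from non-binders.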

Epitope predictions: sequence motif and HMMs

Sequence motif: cc = 0.76, Aroc = 0.92
HMM: cc = 0.80, Aroc = 0.95

Epitope prediction: neural networks

cc = 0.91, Aroc = 0.98

Evaluation of prediction accuracy

(Figure: Pearson correlation and Aroc for the Motif, HMM and ANN methods)

Hepatitis C virus. Epitope predictions

Proteasomal cleavage

• NetChop (http://www.cbs.dtu.dk/services/NetChop-3.0/)
– Epitopes have strong C terminal cleavage
– Epitopes can have strong internal cleavage
• Selection strategy
– High binding peptides
– High cleavage probability at C terminal

NMVPFFPPV
..S.....S
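The cleavage part of the selection strategy can be checked from a per-residue prediction string. The sketch below assumes that the example above is a peptide followed by a mask of the same length, and that "S" marks a predicted cleavage site and "." a non-site; both are assumptions about the output format, and the function names are made up for this illustration.

```python
def cleavage_sites(peptide, mask, site_char="S"):
    """1-based positions whose mask character marks predicted cleavage."""
    if len(peptide) != len(mask):
        raise ValueError("peptide and mask must have the same length")
    return [i + 1 for i, c in enumerate(mask) if c == site_char]

def has_c_terminal_cleavage(peptide, mask, site_char="S"):
    """True if the last residue is a predicted cleavage site."""
    return len(peptide) in cleavage_sites(peptide, mask, site_char)

peptide, mask = "NMVPFFPPV", "..S.....S"
print(cleavage_sites(peptide, mask))           # [3, 9]
print(has_c_terminal_cleavage(peptide, mask))  # True
```

Under these assumptions the example peptide has a predicted cleavage site at its C terminal (position 9) and one internal site (position 3).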

What now?

• March 29. Introduction to hidden Markov models and weight matrices
• April 5. Introduction to neural networks
• April 12. Introduction to the project
• May 10. Hand in the project
