sequence motifs, information content, logos, and weight matrices morten nielsen, cbs, biocentrum,...
TRANSCRIPT
![Page 1: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/1.jpg)
Sequence motifs, information content,
logos, and Weight matrices
Morten Nielsen,CBS, BioCentrum,
DTU
![Page 2: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/2.jpg)
Objectives
• Visualization of binding motifs• Construction of sequence logos
• Understand the concepts of weight matrix construction• One of the most important methods of
bioinformatics
![Page 3: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/3.jpg)
Outline
• Pattern recognition• Regular expressions and
probabilities
• Information content• Sequence logos• Kullback-Leibler
• Multiple alignment and sequence motifs
• Weight matrix construction• Sequence weighting• Low (pseudo) counts
• Calculate a weight matrix from a sequence alignment
• Examples from the real world
• MHC class II binding• Gibbs sampling
![Page 4: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/4.jpg)
Host/virus interaction. Sars infection
![Page 5: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/5.jpg)
Immune system
![Page 6: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/6.jpg)
http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm
Processing of intracellular proteins
![Page 7: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/7.jpg)
Encounter with death
QuickTime™ and aSorenson Video decompressorare needed to see this picture.
![Page 8: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/8.jpg)
Anchor positions
Binding Motif. MHC class I with peptide
![Page 9: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/9.jpg)
SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAVLLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTLHLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTIILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSLLERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGVPLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGVILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQMKLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSVKTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKVSLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYVILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLVTGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAAGAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLAKARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIVAVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVVGLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLVVLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQCISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGAYTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYINMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTVVVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQGLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYLEAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAVYLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRLFLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKLAAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYIAAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV
Sequence information
![Page 10: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/10.jpg)
Sequence Information
• Calculate pa at each position• Entropy
• Information content
• Conserved positions• PV=1, P!v=0 => S=0,
I=log(20)• Mutable positions
• Paa=1/20 => S=log(20), I=0
€
S = − pa
a
∑ log(pa )
€
I = log(20) + pa
a
∑ log(pa )
• Say that a peptide must have L at P2 in order to bind, and that A,F,W,and Y are found at P1. Which position has most information? • How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2?
• P1: 4 questions (at most)• P2: 1 question (L or not)• P2 has the most information
![Page 11: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/11.jpg)
Information content
A R N D C Q E G H I L K M F P S T W Y V S I1 0.10 0.06 0.01 0.02 0.01 0.02 0.02 0.09 0.01 0.07 0.11 0.06 0.04 0.08 0.01 0.11 0.03 0.01 0.05 0.08 3.96 0.372 0.07 0.00 0.00 0.01 0.01 0.00 0.01 0.01 0.00 0.08 0.59 0.01 0.07 0.01 0.00 0.01 0.06 0.00 0.01 0.08 2.16 2.163 0.08 0.03 0.05 0.10 0.02 0.02 0.01 0.12 0.02 0.03 0.12 0.01 0.03 0.05 0.06 0.06 0.04 0.04 0.04 0.07 4.06 0.264 0.07 0.04 0.02 0.11 0.01 0.04 0.08 0.15 0.01 0.10 0.04 0.03 0.01 0.02 0.09 0.07 0.04 0.02 0.00 0.05 3.87 0.455 0.04 0.04 0.04 0.04 0.01 0.04 0.05 0.16 0.04 0.02 0.08 0.04 0.01 0.06 0.10 0.02 0.06 0.02 0.05 0.09 4.04 0.286 0.04 0.03 0.03 0.01 0.02 0.03 0.03 0.04 0.02 0.14 0.13 0.02 0.03 0.07 0.03 0.05 0.08 0.01 0.03 0.15 3.92 0.407 0.14 0.01 0.03 0.03 0.02 0.03 0.04 0.03 0.05 0.07 0.15 0.01 0.03 0.07 0.06 0.07 0.04 0.03 0.02 0.08 3.98 0.348 0.05 0.09 0.04 0.01 0.01 0.05 0.07 0.05 0.02 0.04 0.14 0.04 0.02 0.05 0.05 0.08 0.10 0.01 0.04 0.03 4.04 0.289 0.07 0.01 0.00 0.00 0.02 0.02 0.02 0.01 0.01 0.08 0.26 0.01 0.01 0.02 0.00 0.04 0.02 0.00 0.01 0.38 2.78 1.55
€
I = log(20) + pa
a
∑ log(pa )
€
S = − pa
a
∑ log(pa )
![Page 12: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/12.jpg)
Sequence logos
• Height of a column equal to I• Relative height of a letter is p• Highly useful tool to visualize sequence motifs
High information positions
HLA-A0201
![Page 13: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/13.jpg)
Kullback Leibler Logo
€
I = pa
a
∑ log(pa
qa
)
![Page 14: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/14.jpg)
Characterizing a binding motif from small data sets
What can we learn?
1. A at P1 favors binding?
2. I is not allowed at P9? 3. K at P4 favors binding?4. Which positions are
important for binding?
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
10 MHC restricted peptides
![Page 15: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/15.jpg)
Simple motifs Yes/No rules
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
10 MHC restricted peptides
€
[AGTK]1[LMIV ]2[ANLV ]3 ...[MNRTVL]9
• Only 11 of 212 peptides identified!• Need more flexible rules
•If not fit P1 but fit P2 then ok• Not all positions are equally important
•We know that P2 and P9 determines binding more than other positions
•Cannot discriminate between good and very good binders
![Page 16: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/16.jpg)
Simple motifsYes/No rules
€
[AGTK]1[LMIV ]2[ANLV ]3 ...[AIFKLV ]7...[MNRTVL]9
• Example
•Two first peptides will not fit the motif. They are all good binders (aff< 500nM)
RLLDDTPEV 84 nMGLLGNVSTV 23 nMALAKAAAAL 309 nM
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
10 MHC restricted peptides
![Page 17: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/17.jpg)
Extended motifs
• Fitness of aa at each position given by P(aa)
• Example P1PA = 6/10
PG = 2/10
PT = PK = 1/10
PC = PD = …PV = 0
• Problems• Few data• Data redundancy/duplication
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
RLLDDTPEV 84 nMGLLGNVSTV 23 nMALAKAAAAL 309 nM
![Page 18: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/18.jpg)
Sequence informationRaw sequence counting
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
![Page 19: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/19.jpg)
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Sequence weighting
•Poor or biased sampling of sequence space•Example P1
PA = 2/6
PG = 2/6
PT = PK = 1/6
PC = PD = …PV = 0
}Similar sequencesWeight 1/5
RLLDDTPEV 84 nMGLLGNVSTV 23 nMALAKAAAAL 309 nM
![Page 20: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/20.jpg)
Sequence weighting
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
![Page 21: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/21.jpg)
Pseudo counts
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
•I is not found at position P9. Does this mean that I is forbidden (P(I)=0)?•No! Use Blosum substitution matrix to estimate pseudo frequency of I at P9
![Page 22: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/22.jpg)
A R N D C Q E G H I L K M F P S T W Y V A 0.29 0.03 0.03 0.03 0.02 0.03 0.04 0.08 0.01 0.04 0.06 0.04 0.02 0.02 0.03 0.09 0.05 0.01 0.02 0.07 R 0.04 0.34 0.04 0.03 0.01 0.05 0.05 0.03 0.02 0.02 0.05 0.12 0.02 0.02 0.02 0.04 0.03 0.01 0.02 0.03 N 0.04 0.04 0.32 0.08 0.01 0.03 0.05 0.07 0.03 0.02 0.03 0.05 0.01 0.02 0.02 0.07 0.05 0.00 0.02 0.03 D 0.04 0.03 0.07 0.40 0.01 0.03 0.09 0.05 0.02 0.02 0.03 0.04 0.01 0.01 0.02 0.05 0.04 0.00 0.01 0.02 C 0.07 0.02 0.02 0.02 0.48 0.01 0.02 0.03 0.01 0.04 0.07 0.02 0.02 0.02 0.02 0.04 0.04 0.00 0.01 0.06 …. Y 0.04 0.03 0.02 0.02 0.01 0.02 0.03 0.02 0.05 0.04 0.07 0.03 0.02 0.13 0.02 0.03 0.03 0.03 0.32 0.05 V 0.07 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.16 0.13 0.03 0.03 0.04 0.02 0.03 0.05 0.01 0.02 0.27
What is a pseudo count?
• Say V is observed at P1• Knowing that V at P1 binds, what is the probability
that a peptide could have I at P1?• P(I|V) = 0.16
![Page 23: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/23.jpg)
• Calculate observed amino acids frequencies fa
• Pseudo frequency for amino acid b
• Example€
gb = fa
a
∑ ⋅qb |a
€
gI = 0.2 ⋅qI |M + 0.1⋅qI |N + ...+ 0.3⋅qI |V + 0.1⋅qI |L
gI = 0.2 ⋅0.04 + 0.1⋅0.01+ ...+ 0.3⋅0.18 + 0.1⋅0.17 ≈ 0.09
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Pseudo count estimation
![Page 24: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/24.jpg)
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Weight on pseudo count
• Pseudo counts are important when only limited data is available
• With large data sets only “true” observation should count
• is the effective number of sequences (N-1), is the weight on prior€
pa =α ⋅ fa + β ⋅ga
α + β
![Page 25: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/25.jpg)
• Example
• If large, p ≈ f and only the observed data defines the motif
• If small, p ≈ g and the pseudo counts (or prior) defines the motif
• is [50-200] normally
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Weight on pseudo count
€
pa =α ⋅ fa + β ⋅ga
α + β
![Page 26: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/26.jpg)
Sequence weighting and pseudo counts
RLLDDTPEV 84nMGLLGNVSTV 23nMALAKAAAAL 309nM
P7P and P7S > 0
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
![Page 27: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/27.jpg)
Position specific weighting
• We know that positions 2 and 9 are anchor positions for most MHC binding motifs• Increase weight on high
information positions
• Motif found on large data set
![Page 28: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/28.jpg)
Weight matrices
• Estimate amino acid frequencies from alignment including sequence weighting and pseudo count
• What do the numbers mean?• P2(V)>P2(M). Does this mean that V enables binding more than
M.• In nature not all amino acids are found equally often
• In nature V is found more often than M, so we must somehow rescale with the background
• qM = 0.025, qV = 0.073• Finding 5% V is hence not significant, but 5% M highly
significant
A R N D C Q E G H I L K M F P S T W Y V1 0.08 0.06 0.02 0.03 0.02 0.02 0.03 0.08 0.02 0.08 0.11 0.06 0.04 0.06 0.02 0.09 0.04 0.01 0.04 0.082 0.04 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.11 0.44 0.02 0.06 0.03 0.01 0.02 0.05 0.00 0.01 0.103 0.08 0.04 0.05 0.07 0.02 0.03 0.03 0.08 0.02 0.05 0.11 0.03 0.03 0.06 0.04 0.06 0.05 0.03 0.05 0.074 0.08 0.05 0.03 0.10 0.01 0.05 0.08 0.13 0.01 0.05 0.06 0.05 0.01 0.03 0.08 0.06 0.04 0.02 0.01 0.055 0.06 0.04 0.05 0.03 0.01 0.04 0.05 0.11 0.03 0.04 0.09 0.04 0.02 0.06 0.06 0.04 0.05 0.02 0.05 0.086 0.06 0.03 0.03 0.03 0.03 0.03 0.04 0.06 0.02 0.10 0.14 0.04 0.03 0.05 0.04 0.06 0.06 0.01 0.03 0.137 0.10 0.02 0.04 0.04 0.02 0.03 0.04 0.05 0.04 0.08 0.12 0.02 0.03 0.06 0.07 0.06 0.05 0.03 0.03 0.088 0.05 0.07 0.04 0.03 0.01 0.04 0.06 0.06 0.03 0.06 0.13 0.06 0.02 0.05 0.04 0.08 0.07 0.01 0.04 0.059 0.08 0.02 0.01 0.01 0.02 0.02 0.03 0.02 0.01 0.10 0.23 0.03 0.02 0.04 0.01 0.04 0.04 0.00 0.02 0.25
![Page 29: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/29.jpg)
Weight matrices• A weight matrix is given as
Wij = log(pij/qj)• where i is a position in the motif, and j an amino acid. q j is the background frequency for amino acid j.
• W is a L x 20 matrix, L is motif length
A R N D C Q E G H I L K M F P S T W Y V 1 0.6 0.4 -3.5 -2.4 -0.4 -1.9 -2.7 0.3 -1.1 1.0 0.3 0.0 1.4 1.2 -2.7 1.4 -1.2 -2.0 1.1 0.7 2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3 1.0 5.1 -3.7 3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8 0.4 3 0.2 -1.3 0.1 1.5 0.0 -1.8 -3.3 0.4 0.5 -1.0 0.3 -2.5 1.2 1.0 -0.1 -0.3 -0.5 3.4 1.6 0.0 4 -0.1 -0.1 -2.0 2.0 -1.6 0.5 0.8 2.0 -3.3 0.1 -1.7 -1.0 -2.2 -1.6 1.7 -0.6 -0.2 1.3 -6.8 -0.7 5 -1.6 -0.1 0.1 -2.2 -1.2 0.4 -0.5 1.9 1.2 -2.2 -0.5 -1.3 -2.2 1.7 1.2 -2.5 -0.1 1.7 1.5 1.0 6 -0.7 -1.4 -1.0 -2.3 1.1 -1.3 -1.4 -0.2 -1.0 1.8 0.8 -1.9 0.2 1.0 -0.4 -0.6 0.4 -0.5 -0.0 2.1 7 1.1 -3.8 -0.2 -1.3 1.3 -0.3 -1.3 -1.4 2.1 0.6 0.7 -5.0 1.1 0.9 1.3 -0.5 -0.9 2.9 -0.4 0.5 8 -2.2 1.0 -0.8 -2.9 -1.4 0.4 0.1 -0.4 0.2 -0.0 1.1 -0.5 -0.5 0.7 -0.3 0.8 0.8 -0.7 1.3 -1.1 9 -0.2 -3.5 -6.1 -4.5 0.7 -0.8 -2.5 -4.0 -2.6 0.9 2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6 4.5
![Page 30: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/30.jpg)
• Score sequences to weight matrix by looking up and adding L values from the matrix
A R N D C Q E G H I L K M F P S T W Y V 1 0.6 0.4 -3.5 -2.4 -0.4 -1.9 -2.7 0.3 -1.1 1.0 0.3 0.0 1.4 1.2 -2.7 1.4 -1.2 -2.0 1.1 0.7 2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3 1.0 5.1 -3.7 3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8 0.4 3 0.2 -1.3 0.1 1.5 0.0 -1.8 -3.3 0.4 0.5 -1.0 0.3 -2.5 1.2 1.0 -0.1 -0.3 -0.5 3.4 1.6 0.0 4 -0.1 -0.1 -2.0 2.0 -1.6 0.5 0.8 2.0 -3.3 0.1 -1.7 -1.0 -2.2 -1.6 1.7 -0.6 -0.2 1.3 -6.8 -0.7 5 -1.6 -0.1 0.1 -2.2 -1.2 0.4 -0.5 1.9 1.2 -2.2 -0.5 -1.3 -2.2 1.7 1.2 -2.5 -0.1 1.7 1.5 1.0 6 -0.7 -1.4 -1.0 -2.3 1.1 -1.3 -1.4 -0.2 -1.0 1.8 0.8 -1.9 0.2 1.0 -0.4 -0.6 0.4 -0.5 -0.0 2.1 7 1.1 -3.8 -0.2 -1.3 1.3 -0.3 -1.3 -1.4 2.1 0.6 0.7 -5.0 1.1 0.9 1.3 -0.5 -0.9 2.9 -0.4 0.5 8 -2.2 1.0 -0.8 -2.9 -1.4 0.4 0.1 -0.4 0.2 -0.0 1.1 -0.5 -0.5 0.7 -0.3 0.8 0.8 -0.7 1.3 -1.1 9 -0.2 -3.5 -6.1 -4.5 0.7 -0.8 -2.5 -4.0 -2.6 0.9 2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6 4.5
Scoring a sequence to a weight matrix
RLLDDTPEVGLLGNVSTVALAKAAAAL
Which peptide is most likely to bind?Which peptide second?
11.9 14.7 4.3
84nM 23nM 309nM
![Page 31: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/31.jpg)
An example!!(See handout)
![Page 32: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/32.jpg)
fa ga pa qa wa A R N D C Q E G H I L K M F P S T W Y V
![Page 33: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/33.jpg)
A R N D C Q E G H I L K M F P S T W Y V A 0.29 0.03 0.03 0.03 0.02 0.03 0.04 0.08 0.01 0.04 0.06 0.04 0.02 0.02 0.03 0.09 0.05 0.01 0.02 0.07 R 0.04 0.34 0.04 0.03 0.01 0.05 0.05 0.03 0.02 0.02 0.05 0.12 0.02 0.02 0.02 0.04 0.03 0.01 0.02 0.03 N 0.04 0.04 0.32 0.08 0.01 0.03 0.05 0.07 0.03 0.02 0.03 0.05 0.01 0.02 0.02 0.07 0.05 0.00 0.02 0.03 D 0.04 0.03 0.07 0.40 0.01 0.03 0.09 0.05 0.02 0.02 0.03 0.04 0.01 0.01 0.02 0.05 0.04 0.00 0.01 0.02 C 0.07 0.02 0.02 0.02 0.48 0.01 0.02 0.03 0.01 0.04 0.07 0.02 0.02 0.02 0.02 0.04 0.04 0.00 0.01 0.06 Q 0.06 0.07 0.04 0.05 0.01 0.21 0.10 0.04 0.03 0.03 0.05 0.09 0.02 0.01 0.02 0.06 0.04 0.01 0.02 0.04 E 0.06 0.05 0.04 0.09 0.01 0.06 0.30 0.04 0.03 0.02 0.04 0.08 0.01 0.02 0.03 0.06 0.04 0.01 0.02 0.03 G 0.08 0.02 0.04 0.03 0.01 0.02 0.03 0.51 0.01 0.02 0.03 0.03 0.01 0.02 0.02 0.05 0.03 0.01 0.01 0.02 H 0.04 0.05 0.05 0.04 0.01 0.04 0.05 0.04 0.35 0.02 0.04 0.05 0.02 0.03 0.02 0.04 0.03 0.01 0.06 0.02 I 0.05 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.27 0.17 0.02 0.04 0.04 0.01 0.03 0.04 0.01 0.02 0.18 L 0.04 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.12 0.38 0.03 0.05 0.05 0.01 0.02 0.03 0.01 0.02 0.10 K 0.06 0.11 0.04 0.04 0.01 0.05 0.07 0.04 0.02 0.03 0.04 0.28 0.02 0.02 0.03 0.05 0.04 0.01 0.02 0.03 M 0.05 0.03 0.02 0.02 0.02 0.03 0.03 0.03 0.02 0.10 0.20 0.04 0.16 0.05 0.02 0.04 0.04 0.01 0.02 0.09 F 0.03 0.02 0.02 0.02 0.01 0.01 0.02 0.03 0.02 0.06 0.11 0.02 0.03 0.39 0.01 0.03 0.03 0.02 0.09 0.06 P 0.06 0.03 0.02 0.03 0.01 0.02 0.04 0.04 0.01 0.03 0.04 0.04 0.01 0.01 0.49 0.04 0.04 0.00 0.01 0.03 S 0.11 0.04 0.05 0.05 0.02 0.03 0.05 0.07 0.02 0.03 0.04 0.05 0.02 0.02 0.03 0.22 0.08 0.01 0.02 0.04 T 0.07 0.04 0.04 0.04 0.02 0.03 0.04 0.04 0.01 0.05 0.07 0.05 0.02 0.02 0.03 0.09 0.25 0.01 0.02 0.07 W 0.03 0.02 0.02 0.02 0.01 0.02 0.02 0.03 0.02 0.03 0.05 0.02 0.02 0.06 0.01 0.02 0.02 0.49 0.07 0.03 Y 0.04 0.03 0.02 0.02 0.01 0.02 0.03 0.02 0.05 0.04 0.07 0.03 0.02 0.13 0.02 0.03 0.03 0.03 0.32 0.05 V 0.07 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.16 0.13 0.03 0.03 0.04 0.02 0.03 0.05 0.01 0.02 0.27
The Blosum matrix
€
P(A | C) = 0.07
![Page 34: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/34.jpg)
fa ga pa qa wa A 0 0.06 0.03 0.074 -2.61 R 0 0.054 0.027 0.052 -1.89 N 0 0.04 0.02 0.045 -2.34 D 0 0.082 0.041 0.054 -0.79 C 0 0.01 0.005 0.025 -4.64 Q 0.2 0.09 0.145 0.034 4.18 E 0.8 0.21 0.505 0.054 6.45 G 0 0.04 0.02 0.074 -3.78 H 0 0.03 0.015 0.026 -1.59 I 0 0.022 0.011 0.068 -5.26 L 0 0.042 0.021 0.099 -4.47 K 0 0.082 0.041 0.058 -1.00 M 0 0.012 0.006 0.025 -4.12 F 0 0.018 0.09 0.047 1.87 P 0 0.028 0.014 0.039 -2.96 S 0 0.06 0.03 0.057 -1.85 T 0 0.04 0.02 0.051 -2.70 W 0 0.01 0.005 0.013 -2.76 Y 0 0.02 0.01 0.032 -3.36 V 0 0.032 0.016 0.073 -4.38
![Page 35: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/35.jpg)
Example from real life
• 10 peptides from MHCpep database
• Bind to the MHC complex
• Relevant for immune system recognition
• Estimate sequence motif and weight matrix
• Evaluate motif “correctness” on 528 peptides
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
![Page 36: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/36.jpg)
Prediction accuracy
Pearson correlation 0.45
Prediction score
Measu
red
affi
nit
y
![Page 37: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/37.jpg)
Predictive performance
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Pearsons correlation
CC 0.45 0.5 0.6 0.65 0.79
Simple Seq.W Seq.W+SCSeq.W+SC+PW
Large dataset
![Page 38: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/38.jpg)
Class II MHC binding
• MHC class II binds peptides in the class II antigen presentation pathway• Binds peptides of length 9-18 (even whole proteins can bind!)• Binding cleft is open• Binding core is 9 aa
![Page 39: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/39.jpg)
Gibbs samplerwww.cbs.dtu.dk/biotools/EasyGibbs
100 10mer peptides2100~1030 combinations
€
E = Cpa logppa
qap,aa
∑
Monte Carlo simulations can do
it
![Page 40: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/40.jpg)
Gibbs sampler. Prediction accuracy
![Page 41: Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062320/56649c785503460f9492d2e3/html5/thumbnails/41.jpg)
Summary
• Weight matrices can accurately describe a sequence motif like MHC class I• Use sequence weighting to remove data
redundancy• Use pseudo count to compensate for few
data points
• Gibbs sampling can detect MHC class II binding motif