gibbs sampling - dtu health tech · 2011. 10. 27. · gibbs sampling a special kind of monte carlo...
TRANSCRIPT
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 1
Gibbs sampling
Massimo Andreatta Center for Biological Sequence Analysis
Technical University of Denmark [email protected]
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 2
Monte Carlo simulations
MC methods use repeated random sampling to numerically approximate solutions to problems
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 3
Monte Carlo simulations
A simple example: computing π with sampling
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 4
Monte Carlo simulations
A simple example: computing π with sampling
€
Ac = πr2
r
€
As = 2r( )2
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 5
Monte Carlo simulations
A simple example: computing π with sampling
€
Ac = πr2
r
€
As = 2r( )2
€
AcAs
=πr2
4r2=π4
€
π = 4 AcAs
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 6
Monte Carlo simulations
A simple example: computing π with sampling
€
π = 4 AcAs
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 7
X
Monte Carlo simulations
A simple example: computing π with sampling
€
π = 4 AcAs
X
X
Throw darts randomly
€
hit circlehit square
=hit
hit +miss=AcAs
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 8
Monte Carlo simulations
A simple example: computing π with sampling
X
€
π = 4 AcAs
hit=0for N iterations x = random(-1,1) y = random(-1,1) dist=sqrt(x2+y2)
if (dist
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 9
Monte Carlo simulations
A simple example: computing π with sampling
X
hit=0for N iterations x = random(-1,1) y = random(-1,1) dist=sqrt(x2+y2)
if (dist
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 10
Monte Carlo simulations
A simple example: computing π with sampling
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 11
Monte Carlo simulations
A simple example: computing π with sampling
- More iterations more accurate estimate - After 1,000,000 iterations I got pi ≈ 3,14182...
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 12
Gibbs sampling
A special kind of Monte Carlo method (Markov Chain Monte Carlo, or MCMC) - estimates a distribution by sampling from it - the samples are taken with pseudo-random steps - stepping to the next state only depends on the current state (memory-less chain)
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 13
Gibbs sampling
Stochastic search
Z
f(Z)
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 14
Gibbs sampling
Stochastic search
Zi = current state of the system P = probability of accepting the move T = a scalar lowered during the search
€
P =min 1,exp dET
€
dE = f (Zi) − f (Zi−1)
Z
f(Z)
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 15
Gibbs sampling - down to biology
Sequence alignment
Zi = current state of the system P = probability of accepting the move T = a scalar lowered during the search
€
P =min 1,exp dET
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
€
dE = f (Zi) − f (Zi−1)
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 16
Gibbs sampling - down to biology
Sequence alignment
Zi = current state of the system P = probability of accepting the move T = a scalar lowered during the search
€
P =min 1,exp dET
€
dE = Ei − Ei−1
€
E = Cp,ap,a∑ log
pp,aqa
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
€
dE = f (Zi) − f (Zi−1)
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 17
Gibbs sampling - sequence alignment
State transition
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
move to state +1
€
dE = Ei − Ei−1
€
E = Cp,ap,a∑ log
pp,aqa
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 18
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
move to state +1
€
P =min 1,exp dET
Accept or reject the move?
Note that the probability of going to the new state only depends on the previous state
Gibbs sampling - sequence alignment
State transition
€
dE = Ei − Ei−1
€
E = Cp,ap,a∑ log
pp,aqa
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 19
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
€
Ei = 2.52
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
move to state +1
Gibbs sampling - sequence alignment
Numerical example - 1
€
Ei−1 = 2.44
€
P = min 1,exp 0.080.2
= min 1 , 1.49[ ] =1 Accept move with
Prob = 100%
€
T = 0.2
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 20
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
move to state +1
Gibbs sampling - sequence alignment
Numerical example - 2
€
P = min 1,exp −0.090.2
= min 1 , 0.638[ ] = 0.638
€
T = 0.2
Accept move with Prob = 63.8%
€
Ei = 2.35
€
Ei−1 = 2.44
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 21
Gibbs sampling - sequence alignment
Now, one thing at a time
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 22
Gibbs sampling - sequence alignment
What is the MC temperature?
iteration
T
it’s a scalar decreased during the simulation
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 23
Gibbs sampling - sequence alignment
What is the MC temperature?
iteration
T
it’s a scalar decreased during the simulation E.g. same dE=-0.3 but at different temperatures
€
P(t1) =min 1,expdEt1
=min 1,exp
−0.30.4
= 0.47
t1=0.4
t3=0.02
€
P(t3) =min 1,exp−0.30.02
≈ 0
t2=0.1
€
P(t2) =min 1,exp−0.30.1
= 0.05
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 24
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 25
Z
f(Z)
Move freely around states when the system is “warm”, then cool it off to force it into a state of high fitness
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 26
Gibbs sampling - sequence alignment
Why sampling? SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
............
DFAAQVDYPSTGLY
50 sequences 12 amino acids long
try all possible combinations with a 9-mer overlap
450 ~ 1030 possible combinations
...computationally unfeasible
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 27
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
move to state +1
€
P =min 1,exp dET
Accept or reject the move?
Single sequence move
€
dE = Ei − Ei−1
€
E = Cp,ap,a∑ log
pp,aqa
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 28
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT
move to state +1
€
P =min 1,exp dET
Accept or reject the move?
Phase shift move
€
dE = Ei − Ei−1
€
E = Cp,ap,a∑ log
pp,aqa
shift all sequences
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 29
A sketch for the alignment algorithm
• Start from a random alignment • Set initial temperature • For N iterations
• pick a random sequence • suggest a shift move • accept or reject the move depending on • every Psh moves, attempt a phase shift move • decrease temperature
€
P =min 1,exp dET
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 30
Does it work?
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 31
Gibbs sequence alignment - performance
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 32
Aligning scoring matrices
More Gibbs sampling
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 33
Alignment of scoring matrices
4 networks trained on HLA*DRB1-0401
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 34
Alignment of scoring matrices
Combined logo
Equally valid solutions, but with different core registers
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 35
The PSSM-align algorithm
L
20
Individual PSSM
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 36
The PSSM-align algorithm
LIndividual PSSM
1. Extend matrix with BG frequencies
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 37
LAll individual PSSMs
1. Extend matrix with BG frequencies
The PSSM-align algorithm
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 38
LAll individual PSSMs
1. Extend matrix with BG frequencies
The PSSM-align algorithm
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 39
LAll individual PSSMs
1. Extend matrix with BG frequencies
2. Apply random shift
The PSSM-align algorithm
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 40
1. Extend matrix with BG frequencies
2. Apply random shift 3. Do Gibbs sampling for
many iterations core
Maximize combined Information Content of the core
The PSSM-align algorithm
€
P =min 1,exp dET
Accept moves with probability:
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 41
1. Extend matrix with BG frequencies
2. Apply random shift 3. Do Gibbs sampling for
many iterations core
Avg matrix
Offset2
-300
-803
Maximize combined Information Content of the core
The PSSM-align algorithm
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 42
Alignment of scoring matrices
after alignment before alignment
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 43
And more Gibbs sampling
Clustering peptide data
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 44
Gibbs clustering
--ELLEFHYYLSSKLNK----------LNKFISPKSVAGRFAESLHNPYPDYHWLRT-------NKVKSLRILNTRRKL-------MMGMFNMLSTVLGVS----AKSSPAYPSVLGQTI--------RHLIFCHSKKKCDELAAK-
----SLFIGLKGDIRESTV----DGEEEVQLIAAVPGK----------VFRLKGGAPIKGVTF---SFSCIAIGIITLYLG-------IDQVTIAGAKLRSLN--WIQKETLVTFKNPHAKKQDV-------KMLLDNINTPEGIIP
Cluster 2
Cluster 1 SLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTFSFSCIAIGIITLYLGIDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDVKMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFAESLHNPYPDYHWLRTNKVKSLRILNTRRKLMMGMFNMLSTVLGVSAKSSPAYPSVLGQTIRHLIFCHSKKKCDELAAK
Multiple motifs!
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 45
Gibbs clustering - the algorithm
FIGLKGDIREEEVQLIAARLKGGAPIKSCIAIGIITQVTIAGAKLQKETLVTFKLLDNINTPELEFHYYLSSKFISPKSVALHNPYPDYHVKSLRILNTGMFNMLSTVSSPAYPSVLLIFCHSKKK
1. List of peptides
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 46
FIGLKGDIREEEVQLIAARLKGGAPIKSCIAIGIITQVTIAGAKLQKETLVTFKLLDNINTPELEFHYYLSSKFISPKSVALHNPYPDYHVKSLRILNTGMFNMLSTVSSPAYPSVLLIFCHSKKK
1. List of peptides 2. create N random groups
-----QVTIAGAKL----------QKETLVTFK----------LEFHYYLSS----------GMFNMLSTV----------SSPAYPSVL-----
-----SLFIGLKGD----------SFSCIAIGI----------KMLLDNINT----------KYVHGTWRS----------NKVKSLRIL-----
-----LHNPYPDYH----------LIFCHSKKK----------RLKGGAPIK----------KFISPKSVA----------EEEVQLIAA-----
g1 g2 gN
Gibbs clustering - the algorithm
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 47
FIGLKGDIREEEVQLIAARLKGGAPIKSCIAIGIITQVTIAGAKLQKETLVTFKLLDNINTPELEFHYYLSSKFISPKSVALHNPYPDYHVKSLRILNTGMFNMLSTVSSPAYPSVLLIFCHSKKK
1. List of peptides 2. create N random groups
-----QVTIAGAKL----------QKETLVTFK----------LEFHYYLSS----------GMFNMLSTV----------SSPAYPSVL-----
-----SLFIGLKGD----------SFSCIAIGI----------KMLLDNINT----------KYVHGTWRS----------NKVKSLRIL-----
-----LHNPYPDYH----------LIFCHSKKK----------RLKGGAPIK----------KFISPKSVA----------EEEVQLIAA-----
g1 g2 gN
GMFNMLSTV
3 Move sequence
Gibbs clustering - the algorithm
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 48
4b. Remove peptide from its group I
FIGLKGDIREEEVQLIAARLKGGAPIKSCIAIGIITQVTIAGAKLQKETLVTFKLLDNINTPELEFHYYLSSKFISPKSVALHNPYPDYHVKSLRILNTGMFNMLSTVSSPAYPSVLLIFCHSKKK
1. List of peptides 2. create N random groups
-----QVTIAGAKL----------QKETLVTFK----------LEFHYYLSS----------SSPAYPSVL-----
-----SLFIGLKGD----------SFSCIAIGI----------KMLLDNINT----------KYVHGTWRS----------NKVKSLRIL-----
-----LHNPYPDYH----------LIFCHSKKK----------RLKGGAPIK----------KFISPKSVA----------EEEVQLIAA-----
g1 g2 gN
GMFNMLSTV
3 Move sequence
Gibbs clustering - the algorithm
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 49
4b. Remove peptide from its group I
5b. Score peptide to a new random group R and in its original group I
€
dE = SR − SI
FIGLKGDIREEEVQLIAARLKGGAPIKSCIAIGIITQVTIAGAKLQKETLVTFKLLDNINTPELEFHYYLSSKFISPKSVALHNPYPDYHVKSLRILNTGMFNMLSTVSSPAYPSVLLIFCHSKKK
1. List of peptides 2. create N random groups
-----QVTIAGAKL----------QKETLVTFK----------LEFHYYLSS----------SSPAYPSVL-----
-----SLFIGLKGD----------SFSCIAIGI----------KMLLDNINT----------KYVHGTWRS----------NKVKSLRIL-----
-----LHNPYPDYH----------LIFCHSKKK----------RLKGGAPIK----------KFISPKSVA----------EEEVQLIAA-----
g1 g2 gN
GMFNMLSTV
3 Move sequence
Gibbs clustering - the algorithm
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 50
€
P =min 1,exp dET
6b. Accept or reject move
4b. Remove peptide from its group I
5b. Score peptide to a new random group R and in its original group I
€
dE = SR − SI
FIGLKGDIREEEVQLIAARLKGGAPIKSCIAIGIITQVTIAGAKLQKETLVTFKLLDNINTPELEFHYYLSSKFISPKSVALHNPYPDYHVKSLRILNTGMFNMLSTVSSPAYPSVLLIFCHSKKK
1. List of peptides 2. create N random groups
-----QVTIAGAKL----------QKETLVTFK----------LEFHYYLSS----------SSPAYPSVL-----
-----SLFIGLKGD----------SFSCIAIGI----------KMLLDNINT----------KYVHGTWRS----------NKVKSLRIL-----
-----LHNPYPDYH----------LIFCHSKKK----------RLKGGAPIK----------KFISPKSVA----------EEEVQLIAA----------GMFNMLSTV-----
g1 g2 gN
GMFNMLSTV
3 Move sequence
Gibbs clustering - the algorithm
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 51
€
P =min 1,exp dET
6b. Accept or reject move
4b. Remove peptide from its group I
5b. Score peptide to a new random group R and in its original group I
€
dE = SR − SI
FIGLKGDIREEEVQLIAARLKGGAPIKSCIAIGIITQVTIAGAKLQKETLVTFKLLDNINTPELEFHYYLSSKFISPKSVALHNPYPDYHVKSLRILNTGMFNMLSTVSSPAYPSVLLIFCHSKKK
1. List of peptides 2. create N random groups
-----QVTIAGAKL----------QKETLVTFK----------LEFHYYLSS----------SSPAYPSVL-----
-----SLFIGLKGD----------SFSCIAIGI----------KMLLDNINT----------KYVHGTWRS----------NKVKSLRIL-----
-----LHNPYPDYH----------LIFCHSKKK----------RLKGGAPIK----------KFISPKSVA----------EEEVQLIAA----------GMFNMLSTV-----
g1 g2 gN
GMFNMLSTV
3 Move sequence
Gibbs clustering - the algorithm
And iterate many times, gradually decreasing T
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 52
Two MHC class I alleles: HLA-A*0101 and HLA-B*4402
Does it work ?
Mixture of 100 binders for the two alleles
ATDKAAAAY A*0101EVDQTKIQY A*0101AETGSQGVY B*4402ITDITKYLY A*0101AEMKTDAAT B*4402FEIKSAKKF B*4402LSEMLNKEY A*0101GELDRWEKI B*4402LTDSSTLLV A*0101FTIDFKLKY A*0101TTTIKPVSY A*0101EEKAFSPEV B*4402AENLWVPVY B*4402
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 53
Two MHC class I alleles: HLA-A*0101 and HLA-B*4402
A0101 B4402
G 1
G 2
Mixed
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 54
Two MHC class I alleles: HLA-A*0101 and HLA-B*4402
97 3
3 97
A0101 B4402
G 1
G 2
Resolved
Mixed
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 55
Five MHC class I alleles A0101 A0201 A0301 B0702 B4402
G 0
G 1
G 2
G 3
G 4
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 56
Five MHC class I alleles A0101
0 1 76 1 0
2 4 0 0 95
5 87 5 1 0
93 2 19 0 2
0 6 0 98 3
A0201 A0301 B0702 B4402
G 0
G 1
G 2
G 3
G 4
HLA-B4402 94%
HLA-B0702 92%
HLA-A0201 89%
HLA-A0101 80%
HLA-A0301 97%
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 57
HLA-A*02:01 sub-motifs
= 10 nM = 4 hours
= 10 nM = 1.5 hours
666 peptide binders (aff < 500 nM)
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 58
Splitting with Gibbs clustering
= 10 nM = 3.5 hours
= 10 nM = 2.25 hours
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 59
Gibbs clustering
And what if we don’t know a priori the number of clusters?
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 60
How many clusters?
We could run the algorithm with different number of clusters k and choose the k with highest information content
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 61
How many clusters?
We could run the algorithm with different number of clusters k and choose the k with highest information content
What’s going on ?
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 62
How many clusters?
We could run the algorithm with different number of clusters k and choose the k with highest information content
What’s going on ?
smaller groups tend to have higher information content
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 63
How many clusters?
Let’s look back at the Energy function
€
E = Cp,ap,a∑ log
pp,aqa
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 64
How many clusters?
Let’s look back at the Energy function
€
E = Cp,ap,a∑ log
pp,aqa
This is equivalent to scoring each sequence S to its matrix
€
E = logpp,aqap,a
∑S∑
L
20
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 65
How many clusters?
Let’s look back at the Energy function
€
E = Cp,ap,a∑ log
pp,aqa
This is equivalent to scoring each sequence S to its matrix
€
E = logpp,aqap,a
∑S∑
What is the problem? Overfitting. S was also used to calculate the log-odds matrix!
The contribution of S on the matrix will be larger if the cluster is small.
L
20
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 66
How many clusters?
Let’s look back at the Energy function
€
E = Cp,ap,a∑ log
pp,aqa
This is equivalent to scoring each sequence S to its matrix
€
E = logpp,aqap,a
∑S∑
What is the problem? Overfitting. S was also used to calculate the log-odds matrix!
The contribution of S on the matrix will be larger if the cluster is small.
L
20
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 67
How many clusters?
€
E = logpp,aqap,a
∑S∑
What is the problem? Overfitting. S was also used to calculate the log-odds matrix!
The contribution of S on the matrix will be larger if the cluster is small.
€
E = logpp,aS−
qap,a∑
S∑
Before scoring S, remove it and update the matrix
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 68
How many clusters?
€
E = logpp,aqap,a
∑S∑
€
E = logpp,aS−
qap,a∑
S∑
Is this so important..?
YQAFRTKVHSPRTLNAWVYALTVVWLLLSSIGIPAYAVAKCNLNHTPYDINQMLLLMMTLPSIKELENEYYFIENATFFIFAEMLASIDL...
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 69
How many clusters?
€
E = logpp,aqap,a
∑S∑
€
E = logpp,aS−
qap,a∑
S∑
Is this so important..? YES
YQAFRTKVHSPRTLNAWVYALTVVWLLLSSIGIPAYAVAKCNLNHTPYDINQMLLLMMTLPSIKELENEYYFIENATFFIFAEMLASIDL...
SCORE Num of sequences in the cluster
100 20 3
w/o removing
5.52 10.42 26.78
removing 4.11 2.57 0.05
Score YALTVVWLL to a matrix, including vs. excluding YALTVVWLL in the matrix construction
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 70
How many clusters?
Quality of clustering is not only determined by information content of individual clusters (intra-cluster distance), but also by the ability of different groups to discriminate (inter-cluster distance)
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 71
How many clusters?
Quality of clustering is not only determined by information content of individual clusters (intra-cluster distance), but also by the ability of different groups to discriminate (inter-cluster distance)
€
E = logpp,aS−
qap,a∑
S∑
€
E = logpp,aS−
qp,aS
p,a∑
S∑
position and cluster-specific background (the background is calculated on all groups not containing S, it accounts for inter-cluster distance)
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 72
How many clusters?
€
E = logpp,aS−
qp,aS
p,a∑
S∑ −λn
One last thing and we are ready.
A parameter λ to modulate the ‘tightness’ of the clustering (n is the number of clusters)
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 73
How many clusters?
€
E = logpp,aS−
qp,aS
p,a∑
S∑ −λn
One last thing and we are ready.
A parameter λ to modulate the ‘tightness’ of the clustering (n is the number of clusters) position and cluster-specific background
(the background is calculated on all groups not containing S, it accounts for inter-cluster distance)
frequencies are calculated by removing the sequence being scored S
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 74
How many clusters?
!
!
!!
!!
! ! !
!! !
3.0
3.4
3.8
4.2
2 alleles ! lambda=0.02
Groups
KLD
sum
2 4 6 8 10 12
!
!
!
!
! ! ! !!
! ! !
3.2
3.4
3.6
3.8
4.0
3 alleles ! lambda=0.02
Groups
KLD
sum
2 4 6 8 10 12
!
!
!
!
!
!!
!! !
!!
3.4
3.8
4.2
4 alleles ! lambda=0.02
Groups
KLD
sum
2 4 6 8 10 12
!
! !
!
!
!
!
!! !
!
!
3.4
3.8
4.2
5 alleles ! lambda=0.02
Groups
KLD
sum
2 4 6 8 10 12
!
!
!
! !
!
!
!
!
!
!
!
3.3
3.5
3.7
3.9
6 alleles ! lambda=0.02
Groups
KLD
sum
2 4 6 8 10 12
!
!
!
!
! !
!!
!!
! !
2.8
3.2
3.6
7 alleles ! lambda=0.02
Groups
KLD
sum
2 4 6 8 10 12
!
!
!
!
!
! !
! !
! !!
2.8
3.2
3.6
8 alleles ! lambda=0.02
Groups
KLD
sum
2 4 6 8 10 12
!
!
!
!
!
!
!
! !
!!
!
3.0
3.4
3.8
9 alleles ! lambda=0.02
Groups
KLD
sum
2 4 6 8 10 12
Binders for 2 to 9 MHC class I alleles
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 75
How many clusters?
2 3 4 5 6 7 8
34
56
78
9
10 random allele combinations
Alleles
Nu
mb
er
of
clu
ste
rs
!
!
!
! !
! !
! !
!
!
!
!
!
! !
!
!
!
!
!
Lambda penalty
0
0.02
0.04
!
!
2 3 4 5 6 7 8
24
68
10
Lambda = 0.000
Alleles
Nu
mb
er
of
clu
ste
rs
!
!
2 3 4 5 6 7 8
24
68
10
Lambda = 0.020
Alleles
Nu
mb
er
of
clu
ste
rs
!
!
! !
!
2 3 4 5 6 7 8
23
45
67
89
Lambda = 0.040
Alleles
Nu
mb
er
of
clu
ste
rs
-
CEN
TER FO
R B
IOLO
GIC
AL SEQ
UEN
CE A
NA
LYSIS
Department of Systems Biology Technical University of Denmark 76
In conclusion
• Sampling methods can solve problems where the search space is too large to be exhaustively explored
• Gibbs sampling can detect even weak motifs in a sequence alignment (e.g. MHC class II)
• More than 1,000 papers in PubMed using Gibbs sampling methods
• Transcription start-sites • Receptor binding sites • Acceptor:Donor sites • ...