Expectation Maximization and Gibbs Sampling


Page 1: Expectation Maximization and Gibbs Sampling

6.096 – Algorithms for Computational Biology

Expectation Maximization and Gibbs Sampling

Lecture 1 - Introduction
Lecture 2 - Hashing and BLAST
Lecture 3 - Combinatorial Motif Finding
Lecture 4 - Statistical Motif Finding

Page 2: Expectation Maximization and Gibbs Sampling

Challenges in Computational Biology

1. Gene Finding
2. Sequence alignment
3. Database lookup
4. Genome Assembly
5. Regulatory motif discovery
6. Comparative Genomics
7. Evolutionary Theory
8. Gene expression analysis
9. Cluster discovery
10. Gibbs sampling
11. Protein network analysis
12. Regulatory network inference
13. Emerging network properties

Page 3: Expectation Maximization and Gibbs Sampling

Challenges in Computational Biology

5. Regulatory motif discovery

[Figure: the DNA of a group of co-regulated genes shares a common subsequence (the motif)]

Page 4: Expectation Maximization and Gibbs Sampling

Overview

Introduction
– Bio review: Where do ambiguities come from?
– Computational formulation of the problem

Combinatorial solutions
– Exhaustive search
– Greedy motif clustering
– Wordlets and motif refinement

Probabilistic solutions
– Expectation maximization
– Gibbs sampling

Page 5: Expectation Maximization and Gibbs Sampling

Sequence Motifs

• what is a sequence motif?
  – a sequence pattern of biological significance

• examples
  – protein binding sites in DNA
  – protein sequences corresponding to common functions or conserved pieces of structure

Page 6: Expectation Maximization and Gibbs Sampling

Motifs and Profile Matrices

• given a set of aligned sequences, it is straightforward to construct a profile matrix characterizing a motif of interest

[Figure: a shared motif aligned across sequence positions 1–8, together with the corresponding profile matrix giving the probability of A, C, G, and T at each of the 8 positions]

Page 7: Expectation Maximization and Gibbs Sampling

Motifs and Profile Matrices

• how can we construct the profile if the sequences aren't aligned?
  – in the typical case we don't know what the motif looks like

• use an Expectation Maximization (EM) algorithm

Page 8: Expectation Maximization and Gibbs Sampling

The EM Approach

• EM is a family of algorithms for learning probabilistic models in problems that involve hidden state

• in our problem, the hidden state is where the motif starts in each training sequence

Page 9: Expectation Maximization and Gibbs Sampling

The MEME Algorithm

• Bailey & Elkan, 1993
• uses the EM algorithm to find multiple motifs in a set of sequences
• first EM approach to motif discovery: Lawrence & Reilly, 1990

Page 10: Expectation Maximization and Gibbs Sampling

Representing Motifs

• a motif is assumed to have a fixed width, W
• a motif is represented by a matrix of probabilities: p_{c,k} represents the probability of character c in column k

• example: DNA motif with W = 3

  p =      1    2    3
      A   0.1  0.5  0.2
      C   0.4  0.2  0.1
      G   0.3  0.1  0.6
      T   0.2  0.2  0.1

Page 11: Expectation Maximization and Gibbs Sampling

Representing Motifs

• we will also represent the "background" (i.e. outside the motif) probability of each character
• p_{c,0} represents the probability of character c in the background

• example:

  p_0 =  A 0.26
         C 0.24
         G 0.23
         T 0.27

Page 12: Expectation Maximization and Gibbs Sampling

Basic EM Approach

• the element Z_{ij} of the matrix Z represents the probability that the motif starts in position j in sequence i

• example: given 4 DNA sequences of length 6, where W = 3

  Z =        1    2    3    4
      seq1  0.1  0.1  0.2  0.6
      seq2  0.4  0.2  0.1  0.3
      seq3  0.3  0.1  0.5  0.1
      seq4  0.1  0.5  0.1  0.3

Page 13: Expectation Maximization and Gibbs Sampling

Basic EM Approach

given: length parameter W, training set of sequences
set initial values for p
do
  re-estimate Z from p (E-step)
  re-estimate p from Z (M-step)
until change in p < ε
return: p, Z
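A minimal Python sketch of this loop, with the two update rules passed in as callables; e_step and m_step are placeholders for the formulas developed on the following slides, and the array layout is an assumption for illustration only.

```python
import numpy as np

def em_motif(seqs, W, p_init, e_step, m_step, eps=1e-4, max_iter=200):
    """Basic EM loop for motif finding.

    seqs   : list of equal-length sequences
    W      : motif width
    p_init : (4, W+1) array; column 0 = background, columns 1..W = motif
    e_step : callable(seqs, W, p) -> Z   (re-estimate Z from p)
    m_step : callable(seqs, W, Z) -> p   (re-estimate p from Z)
    """
    p = np.asarray(p_init, dtype=float)
    Z = None
    for _ in range(max_iter):
        Z = e_step(seqs, W, p)                 # E-step
        p_new = m_step(seqs, W, Z)             # M-step
        if np.max(np.abs(p_new - p)) < eps:    # until change in p < eps
            return p_new, Z
        p = p_new
    return p, Z
```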

Page 14: Expectation Maximization and Gibbs Sampling

Basic EM Approach

• we'll need to calculate the probability of a training sequence given a hypothesized starting position:

  \Pr(X_i \mid Z_{ij} = 1, p) = \prod_{k=1}^{j-1} p_{c_k, 0} \; \prod_{k=j}^{j+W-1} p_{c_k, k-j+1} \; \prod_{k=j+W}^{L} p_{c_k, 0}

  (first product: before the motif; second: the motif; third: after the motif)

  where X_i is the ith sequence, Z_{ij} is 1 if the motif starts at position j in sequence i, and c_k is the character at position k in sequence i

Page 15: Expectation Maximization and Gibbs Sampling

Example

X_i = G C T G T A G

  p =      0     1    2    3
      A   0.25  0.1  0.5  0.2
      C   0.25  0.4  0.2  0.1
      G   0.25  0.3  0.1  0.6
      T   0.25  0.2  0.2  0.1

  \Pr(X_i \mid Z_{i3} = 1, p) = p_{G,0} \times p_{C,0} \times p_{T,1} \times p_{G,2} \times p_{T,3} \times p_{A,0} \times p_{G,0}
                              = 0.25 \times 0.25 \times 0.2 \times 0.1 \times 0.1 \times 0.25 \times 0.25
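The same calculation in Python, assuming the profile is stored as a small dictionary of per-character probability lists (index 0 = background, indices 1..W = motif columns); this encoding is illustrative, not part of the lecture.

```python
# Profile from the example above: column 0 is the background, columns 1..3 the motif.
p = {
    "A": [0.25, 0.1, 0.5, 0.2],
    "C": [0.25, 0.4, 0.2, 0.1],
    "G": [0.25, 0.3, 0.1, 0.6],
    "T": [0.25, 0.2, 0.2, 0.1],
}
W = 3

def prob_seq_given_start(x, j, p, W):
    """Pr(X | Z_j = 1, p): background columns before and after the motif,
    motif columns 1..W inside it. j is a 1-based start position, as in the slides."""
    prob = 1.0
    for k, c in enumerate(x, start=1):
        if j <= k < j + W:
            prob *= p[c][k - j + 1]   # position k falls inside the motif
        else:
            prob *= p[c][0]           # background position
    return prob

print(prob_seq_given_start("GCTGTAG", 3, p, W))
# = 0.25 * 0.25 * 0.2 * 0.1 * 0.1 * 0.25 * 0.25 ≈ 7.8e-06
```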

Page 16: Expectation Maximization and Gibbs Sampling

The E-step: Estimating Z

• to estimate the starting positions in Z at step t:

  Z_{ij}^{(t)} = \frac{ \Pr(X_i \mid Z_{ij} = 1, p^{(t)}) \, \Pr(Z_{ij} = 1) }{ \sum_{k=1}^{L-W+1} \Pr(X_i \mid Z_{ik} = 1, p^{(t)}) \, \Pr(Z_{ik} = 1) }

• this comes from Bayes' rule applied to \Pr(Z_{ij} = 1 \mid X_i, p^{(t)})

Page 17: Expectation Maximization and Gibbs Sampling

The E-step: Estimating Z

• assume that it is equally likely that the motif will start in any position; the prior terms then cancel:

  Z_{ij}^{(t)} = \frac{ \Pr(X_i \mid Z_{ij} = 1, p^{(t)}) }{ \sum_{k=1}^{L-W+1} \Pr(X_i \mid Z_{ik} = 1, p^{(t)}) }

Page 18: Expectation Maximization and Gibbs Sampling

Example: Estimating Z

  p =      0     1    2    3
      A   0.25  0.1  0.5  0.2
      C   0.25  0.4  0.2  0.1
      G   0.25  0.3  0.1  0.6
      T   0.25  0.2  0.2  0.1

  X_i = G C T G T A G

  Z_{i1} = 0.3 \times 0.2 \times 0.1 \times 0.25 \times 0.25 \times 0.25 \times 0.25
  Z_{i2} = 0.25 \times 0.4 \times 0.2 \times 0.6 \times 0.25 \times 0.25 \times 0.25
  ...

• then normalize so that \sum_{j=1}^{L-W+1} Z_{ij} = 1
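Continuing the illustrative encoding above and re-using prob_seq_given_start and the p dictionary from the earlier sketch, the E-step for one sequence under the uniform-prior assumption is just a normalization.

```python
def e_step_one_seq(x, p, W):
    """Z_j ∝ Pr(x | Z_j = 1, p) under a uniform prior on start positions,
    normalized so that the Z_j (j = 1 .. L - W + 1) sum to 1."""
    L = len(x)
    raw = [prob_seq_given_start(x, j, p, W) for j in range(1, L - W + 2)]
    total = sum(raw)
    return [r / total for r in raw]

Z_i = e_step_one_seq("GCTGTAG", p, 3)
# raw weights start with 0.3*0.2*0.1*0.25**4 (j = 1) and 0.25*0.4*0.2*0.6*0.25**3 (j = 2)
```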

Page 19: Expectation Maximization and Gibbs Sampling

The M-step: Estimating p

  p_{c,k}^{(t+1)} = \frac{ n_{c,k} + d_{c,k} }{ \sum_b \left( n_{b,k} + d_{b,k} \right) }

  n_{c,k} = \begin{cases} \sum_i \sum_{\{ j \,\mid\, X_{i,j+k-1} = c \}} Z_{ij} & k > 0 \\[4pt] n_c - \sum_{j=1}^{W} n_{c,j} & k = 0 \end{cases}

  where the d_{c,k} are pseudo-counts and n_c is the total number of c's in the data set

• recall that p_{c,k} represents the probability of character c in position k; values for position 0 represent the background
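A sketch of this M-step in Python, using the same dictionary layout for p as in the earlier snippets and a single pseudo-count value for all d_{c,k}; both choices are illustrative assumptions.

```python
from collections import Counter

ALPHABET = "ACGT"

def m_step(seqs, Zs, W, pseudo=1.0):
    """Re-estimate p from Z (M-step) with pseudo-counts d_{c,k} = pseudo.

    seqs : list of sequences over ACGT
    Zs   : Zs[i][j] = probability the motif starts at position j+1 of seqs[i]
    Returns p as {char: [p_c0, p_c1, ..., p_cW]} with column 0 = background.
    """
    # n_{c,k} for motif columns k = 1..W
    n = {c: [0.0] * (W + 1) for c in ALPHABET}
    for x, Z in zip(seqs, Zs):
        for j, z in enumerate(Z):              # j is a 0-based start position
            for k in range(1, W + 1):
                n[x[j + k - 1]][k] += z
    # n_{c,0}: total count of c minus the counts attributed to motif columns
    totals = Counter("".join(seqs))
    for c in ALPHABET:
        n[c][0] = totals[c] - sum(n[c][1:])
    # normalize each column with pseudo-counts
    p = {c: [0.0] * (W + 1) for c in ALPHABET}
    for k in range(W + 1):
        denom = sum(n[b][k] + pseudo for b in ALPHABET)
        for c in ALPHABET:
            p[c][k] = (n[c][k] + pseudo) / denom
    return p
```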

Page 20: Expectation Maximization and Gibbs Sampling

Example: Estimating p

  A C A G C A     Z_{1,1} = 0.1, Z_{1,2} = 0.7, Z_{1,3} = 0.1, Z_{1,4} = 0.1
  A G G C A G     Z_{2,1} = 0.4, Z_{2,2} = 0.1, Z_{2,3} = 0.1, Z_{2,4} = 0.4
  T C A G T C     Z_{3,1} = 0.2, Z_{3,2} = 0.6, Z_{3,3} = 0.1, Z_{3,4} = 0.1

  p_{A,1} = \frac{ Z_{1,1} + Z_{1,3} + Z_{2,1} + Z_{3,3} + 1 }{ Z_{1,1} + Z_{1,2} + \ldots + Z_{3,4} + 4 }

Page 21: Expectation Maximization and Gibbs Sampling

The EM Algorithm

• EM converges to a local maximum in the likelihood of the data given the model:

  \prod_i \Pr(X_i \mid p)

• usually converges in a small number of iterations
• sensitive to the initial starting point (i.e. values in p)

Page 22: Expectation Maximization and Gibbs Sampling

Overview

Introduction
– Bio review: Where do ambiguities come from?
– Computational formulation of the problem

Combinatorial solutions
– Exhaustive search
– Greedy motif clustering
– Wordlets and motif refinement

Probabilistic solutions
– Expectation maximization
– MEME extensions
– Gibbs sampling

Page 23: Expectation Maximization and Gibbs Sampling

MEME Enhancements to the Basic EM Approach

• MEME builds on the basic EM approach in the following ways:
  – trying many starting points
  – not assuming that there is exactly one motif occurrence in every sequence
  – allowing multiple motifs to be learned
  – incorporating Dirichlet prior distributions

Page 24: Expectation Maximization and Gibbs Sampling

Starting Points in MEME

• for every distinct subsequence of length W in the training set
  – derive an initial p matrix from this subsequence
  – run EM for 1 iteration

• choose the motif model (i.e. p matrix) with the highest likelihood

• run EM to convergence

Page 25: Expectation Maximization and Gibbs Sampling

Using Subsequences as Starting Points for EM

• set values corresponding to letters in the subsequence to X

• set other values to (1-X)/(M-1) where M is the length of the alphabet

• example: for the subsequence TAT with X = 0.5

  p =      1     2     3
      A   0.17  0.5   0.17
      C   0.17  0.17  0.17
      G   0.17  0.17  0.17
      T   0.5   0.17  0.5
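A small helper that builds such a starting-point matrix; the function name and return layout are illustrative, and the background column is handled separately.

```python
def init_from_subsequence(sub, X=0.5, alphabet="ACGT"):
    """Derive an initial p matrix from a length-W subsequence:
    probability X for the observed letter in each column,
    (1 - X) / (M - 1) for the other letters, M = alphabet size.
    Returns {char: [p_1, ..., p_W]} for the motif columns only."""
    M = len(alphabet)
    other = (1 - X) / (M - 1)
    return {c: [X if c == s else other for s in sub] for c in alphabet}

print(init_from_subsequence("TAT"))
# T gets 0.5 in columns 1 and 3, A gets 0.5 in column 2, everything else ≈ 0.17
```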

Page 26: Expectation Maximization and Gibbs Sampling

The ZOOPS Model

• the approach as we've outlined it assumes that each sequence contains exactly one motif occurrence; this is the OOPS (one occurrence per sequence) model

• the ZOOPS model assumes zero or one occurrences per sequence

Page 27: Expectation Maximization and Gibbs Sampling

E-step in the ZOOPS Model

• we need to consider another alternative: the ith sequence doesn't contain the motif
• we add another parameter λ (and a quantity γ derived from it)

  λ = prior probability that any given position in a sequence is the start of a motif
  γ = (L − W + 1) λ = prior probability that a sequence contains a motif

Page 28: Expectation Maximization and Gibbs Sampling

E-step in the ZOOPS Model

  Z_{ij}^{(t)} = \frac{ \Pr(X_i \mid Z_{ij} = 1, p^{(t)}) \, \lambda^{(t)} }{ \Pr(X_i \mid Q_i = 0, p^{(t)}) \, (1 - \gamma^{(t)}) + \sum_{k=1}^{L-W+1} \Pr(X_i \mid Z_{ik} = 1, p^{(t)}) \, \lambda^{(t)} }

• here Q_i = \sum_{j=1}^{L-W+1} Z_{ij} is a random variable that takes the value 0 to indicate that the ith sequence doesn't contain a motif occurrence

Page 29: Expectation Maximization and Gibbs Sampling

M-step in the ZOOPS Model

• update p the same as before
• update λ and γ as follows:

  \gamma^{(t+1)} = (L - W + 1) \, \lambda^{(t+1)}

  \lambda^{(t+1)} = \frac{1}{n (L - W + 1)} \sum_{i=1}^{n} \sum_{j=1}^{L-W+1} Z_{ij}^{(t)}

• i.e., λ is the average of Z_{ij}^{(t)} across all sequences and positions
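This λ/γ update is a one-liner in code; a minimal sketch, assuming Zs[i] holds the Z_{ij} values for sequence i and all sequences have length L.

```python
def update_lambda_gamma(Zs, L, W):
    """ZOOPS M-step for lambda and gamma:
    lambda = average of Z_ij over all sequences and start positions,
    gamma  = (L - W + 1) * lambda."""
    n = len(Zs)
    m = L - W + 1
    lam = sum(sum(Z) for Z in Zs) / (n * m)
    return lam, m * lam
```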

Page 30: Expectation Maximization and Gibbs Sampling

The TCM Model

• the TCM (two-component mixture) model assumes zero or more motif occurrences per sequence

Page 31: Expectation Maximization and Gibbs Sampling

Likelihood in the TCM Model

• the TCM model treats each length-W subsequence independently
• to determine the likelihood of such a subsequence:

  \Pr(X_{ij} \mid Z_{ij} = 1, p) = \prod_{k=j}^{j+W-1} p_{c_k, k-j+1}     (assuming a motif starts there)

  \Pr(X_{ij} \mid Z_{ij} = 0, p) = \prod_{k=j}^{j+W-1} p_{c_k, 0}          (assuming a motif doesn't start there)

Page 32: Expectation Maximization and Gibbs Sampling

E-step in the TCM Model

  Z_{ij}^{(t)} = \frac{ \Pr(X_{ij} \mid Z_{ij} = 1, p^{(t)}) \, \lambda^{(t)} }{ \Pr(X_{ij} \mid Z_{ij} = 0, p^{(t)}) \, (1 - \lambda^{(t)}) + \Pr(X_{ij} \mid Z_{ij} = 1, p^{(t)}) \, \lambda^{(t)} }

  (the two terms in the denominator correspond to "subsequence isn't a motif" and "subsequence is a motif", respectively)

• M-step same as before

Page 33: Expectation Maximization and Gibbs Sampling

Finding Multiple Motifs

• basic idea: discount the likelihood that a new motif starts in a given position if this motif would overlap with a previously learned one
• when re-estimating Z_{ij}, multiply by \Pr(V_{ij} = 1), where

  V_{ij} = 1 if there are no previous motifs in [X_{i,j}, ..., X_{i,j+W-1}], and 0 otherwise

• V_{ij} is estimated using the Z_{ij} values from previous passes of motif finding

Page 34: Expectation Maximization and Gibbs Sampling

Overview

Introduction
– Bio review: Where do ambiguities come from?
– Computational formulation of the problem

Combinatorial solutions
– Exhaustive search
– Greedy motif clustering
– Wordlets and motif refinement

Probabilistic solutions
– Expectation maximization
– MEME extensions
– Gibbs sampling

Page 35: Expectation Maximization and Gibbs Sampling

Gibbs Sampling

• a general procedure for sampling from the joint distribution \Pr(U_1, ..., U_n) of a set of random variables by iteratively sampling from \Pr(U_j \mid U_1, ..., U_{j-1}, U_{j+1}, ..., U_n) for each j
• application to motif finding: Lawrence et al. 1993
• can view it as a stochastic analog of EM for this task
• less susceptible to local minima than EM

Page 36: Expectation Maximization and Gibbs Sampling

Gibbs Sampling Approach

• in the EM approach we maintained a distribution Z_i over the possible motif starting points for each sequence
• in the Gibbs sampling approach, we'll maintain a specific starting point a_i for each sequence, but we'll keep resampling these

Page 37: Expectation Maximization and Gibbs Sampling

Gibbs Sampling Approach

given: length parameter W, training set of sequences
choose random positions for a
do
  pick a sequence X_i
  estimate p given the current motif positions a, using all sequences but X_i (update step)
  sample a new motif position a_i for X_i (sampling step)
until convergence
return: p, a
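A skeleton of this procedure in Python; estimate_p and sample_position stand for the update and sampling steps (the sampling weights are given on the next slide), and a fixed iteration count replaces the convergence test. Names and calling conventions are illustrative.

```python
import random

def gibbs_motif(seqs, W, estimate_p, sample_position, n_iter=1000):
    """Skeleton of the Gibbs sampling procedure above.

    estimate_p(seqs, a, W, exclude=i) -> p   # update step, leaving out sequence i
    sample_position(x, p, W)          -> j   # sampling step (weights on the next slide)
    """
    a = [random.randint(1, len(x) - W + 1) for x in seqs]   # random initial positions (1-based)
    p = None
    for _ in range(n_iter):
        i = random.randrange(len(seqs))                     # pick a sequence X_i
        p = estimate_p(seqs, a, W, exclude=i)               # estimate p from all but X_i
        a[i] = sample_position(seqs[i], p, W)               # resample a_i
    return p, a
```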

Page 38: Expectation Maximization and Gibbs Sampling

Sampling New Motif Positions

• for each possible starting position a_i = j, compute a weight

  A_j = \prod_{k=j}^{j+W-1} \frac{ p_{c_k, k-j+1} }{ p_{c_k, 0} }

• randomly select a new starting position a_i according to these weights
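A sketch of the weight computation and sampling step, using the same p layout (column 0 = background) as the earlier snippets; the function names are illustrative.

```python
import random

def position_weights(x, p, W):
    """A_j = product over the window of p_{c_k, k-j+1} / p_{c_k, 0},
    with p laid out as in the earlier sketches (column 0 = background)."""
    L = len(x)
    weights = []
    for j in range(1, L - W + 2):             # 1-based candidate start positions
        w = 1.0
        for k in range(j, j + W):
            c = x[k - 1]
            w *= p[c][k - j + 1] / p[c][0]
        weights.append(w)
    return weights

def sample_new_position(x, p, W):
    """Randomly select a new start position a_i in proportion to the weights A_j."""
    weights = position_weights(x, p, W)
    return random.choices(range(1, len(weights) + 1), weights=weights)[0]
```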

Page 39: Expectation Maximization and Gibbs Sampling

Gibbs Sampling (AlignACE)

• Given:
  – sequences x_1, ..., x_N
  – motif length K
  – background model B

• Find:
  – Model M
  – Locations a_1, ..., a_N in x_1, ..., x_N

maximizing the log-odds likelihood ratio:

  \sum_{i=1}^{N} \sum_{k=1}^{K} \log \frac{ M(k, x_{a_i + k}) }{ B(x_{a_i + k}) }
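The objective can be scored directly; a minimal sketch, assuming M is a list of per-column letter-probability dictionaries, B is a background letter-probability dictionary, and start positions are 0-based. These representations are assumptions for illustration.

```python
import math

def log_odds_score(seqs, a, M, B, K):
    """Log-odds likelihood ratio of the chosen motif locations:
    sum over sequences i and motif columns k of log( M(k, letter) / B(letter) ).

    M : list of K dicts, M[k][c] = motif probability of letter c in column k+1
    B : dict, B[c] = background probability of letter c
    a : list of 0-based motif start positions, one per sequence
    """
    score = 0.0
    for x, ai in zip(seqs, a):
        for k in range(K):
            c = x[ai + k]
            score += math.log(M[k][c] / B[c])
    return score
```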

Page 40: Expectation Maximization and Gibbs Sampling

Gibbs Sampling (AlignACE)

• AlignACE: first statistical motif finder
• BioProspector: improved version of AlignACE

Algorithm (sketch):

1. Initialization:
   a. Select random locations in sequences x_1, ..., x_N
   b. Compute an initial model M from these locations

2. Sampling Iterations:
   a. Remove one sequence x_i
   b. Recalculate the model
   c. Pick a new location of the motif in x_i according to the probability that the location is a motif occurrence

Page 41: Expectation Maximization and Gibbs Sampling

Gibbs Sampling (AlignACE)

Initialization:
• Select random locations a_1, ..., a_N in x_1, ..., x_N
• For these locations, compute M:

  M_{kj} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(x_{a_i + k} = j)

• That is, M_{kj} is the number of occurrences of letter j in motif position k, over the total number of motifs N

Page 42: Expectation Maximization and Gibbs Sampling

Gibbs Sampling (AlignACE)

Predictive Update:

• Select a sequence x = x_i
• Remove x_i and recompute the model:

  M_{kj} = \frac{1}{N - 1 + B} \left( \sum_{s=1,\, s \neq i}^{N} \mathbf{1}(x_{a_s + k} = j) + \beta_j \right)

  where the β_j are pseudocounts to avoid zeros, and B = \sum_j \beta_j

Page 43: Expectation Maximization and Gibbs Sampling

Gibbs Sampling (AlignACE)

Sampling:
For every K-long word x_j, ..., x_{j+K-1} in x:

  Q_j = Prob[ word | motif ]      = M(1, x_j) × ... × M(K, x_{j+K-1})
  P_j = Prob[ word | background ] = B(x_j) × ... × B(x_{j+K-1})

Let

  A_j = \frac{ Q_j / P_j }{ \sum_{j=1}^{|x| - K + 1} Q_j / P_j }

Sample a random new position a_i according to the probabilities A_1, ..., A_{|x|-K+1}.

Page 44: Expectation Maximization and Gibbs Sampling

Gibbs Sampling (AlignACE)

Running Gibbs Sampling:

1. Initialize

2. Run until convergence

3. Repeat 1,2 several times, report common motifs

Page 45: Expectation Maximization and Gibbs Sampling

Advantages / Disadvantages

• Very similar to EM

Advantages:
• Easier to implement
• Less dependent on initial parameters
• More versatile, easier to enhance with heuristics

Disadvantages:
• More dependent on all sequences exhibiting the motif
• Less systematic search of the initial parameter space

Page 46: Expectation Maximization and Gibbs Sampling

Repeats, and a Better Background Model

• Repeat DNA can be confused with a motif
  – especially low-complexity repeats such as CACACA…, AAAAA…, etc.

Solution: a more elaborate background model

  0th order: B = { p_A, p_C, p_G, p_T }
  1st order: B = { P(A|A), P(A|C), ..., P(T|T) }
  ...
  Kth order: B = { P(X | b_1 ... b_K);  X, b_i ∈ {A, C, G, T} }

Has been applied to EM and Gibbs (up to 3rd order)
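A sketch of training and using such a kth-order background model; the function names, pseudo-count handling, and the uniform fallback for the first k letters of a word are illustrative assumptions.

```python
from collections import defaultdict

def train_markov_background(seqs, order=1, alphabet="ACGT", pseudo=1.0):
    """Estimate an order-k Markov background model P(X | b_1 ... b_k)
    from counts of (k+1)-mers, smoothed with pseudo-counts."""
    counts = defaultdict(lambda: defaultdict(float))
    for x in seqs:
        for i in range(order, len(x)):
            counts[x[i - order:i]][x[i]] += 1.0
    model = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values()) + pseudo * len(alphabet)
        model[ctx] = {c: (nxt.get(c, 0.0) + pseudo) / total for c in alphabet}
    return model

def background_prob(word, model, order, uniform=0.25):
    """Probability of a word under the order-k background; the first k letters
    fall back to a uniform probability for simplicity."""
    p = uniform ** min(order, len(word))
    for i in range(order, len(word)):
        ctx = word[i - order:i]
        p *= model.get(ctx, {}).get(word[i], uniform)
    return p
```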

Page 47: Expectation Maximization and Gibbs Sampling

Example Application: Motifs in Yeast

Group:

Tavazoie et al. 1999, G. Church’s lab, Harvard

Data:

• Microarrays on 6,220 mRNAs from yeast Affymetrix chips (Cho et al.)

• 15 time points across two cell cycles

Page 48: Expectation Maximization and Gibbs Sampling

Processing of Data

1. Selection of 3,000 genes
   • Genes with the most variable expression were selected

2. Clustering according to common expression
   • K-means clustering
   • 30 clusters, 50-190 genes/cluster
   • Clusters correlate well with known function

3. AlignACE motif finding
   • 600-long upstream regions
   • 50 regions/trial

Page 49: Expectation Maximization and Gibbs Sampling

Motifs in Periodic Clusters

Page 50: Expectation Maximization and Gibbs Sampling

Motifs in Non-periodic Clusters

Page 51: Expectation Maximization and Gibbs Sampling

Overview

Introduction
– Bio review: Where do ambiguities come from?
– Computational formulation of the problem

Combinatorial solutions
– Exhaustive search
– Greedy motif clustering
– Wordlets and motif refinement

Probabilistic solutions
– Expectation maximization
– MEME extensions
– Gibbs sampling