
Page 1: Hidden Markov Models For Genetic Linkage Analysis

Lecture #4

Prepared by Dan Geiger

Page 2: Hidden Markov Models in General

[Figure: an HMM chain of hidden states S1, S2, …, SL, each state Si emitting an observation Ri (elsewhere denoted Xi)]

The network depicts the factorization:

P(s1,…,sL, r1,…,rL) = P(s1) P(r1 | s1) ∏_{i=2..L} P(si | si-1) P(ri | si)
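As a concrete illustration beyond the slide itself, here is a minimal Python/NumPy sketch that evaluates this joint probability for a toy two-state HMM; the parameters pi, T, and E are made-up example values, not taken from the lecture.

    import numpy as np

    # Toy two-state HMM (hypothetical example parameters)
    pi = np.array([0.6, 0.4])            # pi[a]   = P(s1 = a)
    T = np.array([[0.9, 0.1],            # T[a, b] = P(s_i = b | s_{i-1} = a)
                  [0.2, 0.8]])
    E = np.array([[0.7, 0.3],            # E[a, r] = P(r_i = r | s_i = a)
                  [0.1, 0.9]])

    def joint_prob(states, obs):
        """P(s1..sL, r1..rL) computed directly from the factorization above."""
        p = pi[states[0]] * E[states[0], obs[0]]
        for i in range(1, len(states)):
            p *= T[states[i - 1], states[i]] * E[states[i], obs[i]]
        return p

    print(joint_prob([0, 0, 1], [0, 1, 1]))   # 0.6 * 0.7 * 0.9 * 0.3 * 0.1 * 0.9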

Page 3: Hidden Markov Model in Our Case

[Figure: the same chain with hidden inheritance vectors S1, …, SL and observed marker data X1, …, XL; the observation at the disease locus is denoted Y]

The compounded variable Si = (Si,1,…,Si,2n) is called the inheritance vector. It has 2^(2n) states, where n is the number of persons in the pedigree that have parents (the non-founders). The compounded variable Xi = (Xi,1,…,Xi,2n) is the data regarding locus i. Similarly, for the disease locus we use Yi.

To specify the HMM we need to write down the transition matrices from Si-1 to Si and the matrices P(xi|Si). Note that these quantities have already been implicitly defined.

Page 4: Queries of Interest (MAP)

[Figure: HMM chain of hidden states H1, …, HL with observations X1, …, XL]

The Maximum A Posteriori query:

(h1*,…,hL*) = argmax_{(h1,…,hL)} P(h1,…,hL | x1,…,xL)

An efficient solution, assuming local probability tables (“the parameters”) are known, is called the Viterbi Algorithm.

The same problem results if we instead maximize the joint distribution p(h1,…,hL, x1,…,xL).

An answer to this query gives the most probable inheritance vector for all locations.

Page 5: Queries of Interest (Belief Update / Posterior Decoding)

1. Compute the posterior belief in Hi (for a specific i) given the evidence {x1,…,xL}, for each of Hi's values hi; namely, compute p(hi | x1,…,xL).

2. Do the same computation for every Hi but without repeating the first task L times.


Local probability tables are assumed to be known. An answer to this query gives the probability distribution of inheritance vectors at an arbitrary location.

Page 6: Decomposing the Computation of Belief Update (Posterior Decoding)

P(x1,…,xL,hi) = P(x1,…,xi,hi) P(xi+1,…,xL | x1,…,xi,hi)


Equality due to Ind({xi+1,…,xL}; {x1,…,xi} | Hi):

= P(x1,…,xi,hi) P(xi+1,…,xL | hi) = f(hi) b(hi)

Belief update: P(hi | x1,…,xL) = (1/K) P(x1,…,xL,hi), where K = Σ_{hi} P(x1,…,xL,hi).

Page 7: The Forward Algorithm

The task: compute f(hi) = P(x1,…,xi,hi) for i = 1,…,L (namely, considering the evidence up to time slot i).

P(x1, h1) = P(h1) P(x1 | h1)   {basis step}

P(x1, x2, h2) = Σ_{h1} P(x1, h1, h2, x2)   {second step}

             = Σ_{h1} P(x1, h1) P(h2 | x1, h1) P(x2 | x1, h1, h2)

             = Σ_{h1} P(x1, h1) P(h2 | h1) P(x2 | h2)   {last equality due to conditional independence}

P(x1,…,xi, hi) = Σ_{hi-1} P(x1,…,xi-1, hi-1) P(hi | hi-1) P(xi | hi)   {step i}
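The same recursion as a short NumPy sketch (illustrative only; pi, T, and E are the toy parameters from the earlier example, and the table f is indexed from 0):

    def forward(pi, T, E, obs):
        """f[i, h] = P(x_1..x_{i+1}, H_{i+1} = h), i.e. the forward table in 0-based indexing."""
        L, k = len(obs), len(pi)
        f = np.zeros((L, k))
        f[0] = pi * E[:, obs[0]]                    # basis step
        for i in range(1, L):
            f[i] = (f[i - 1] @ T) * E[:, obs[i]]    # step i: sum over h_{i-1}
        return f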

Page 8: The Backward Algorithm

The task: Compute b(hi) = P(xi+1,…,xL|hi) for i=L-1,…,1 (namely, considering evidence after time slot i).

b(hL-1) = P(xL | hL-1) = Σ_{hL} P(xL, hL | hL-1) = Σ_{hL} P(hL | hL-1) P(xL | hL-1, hL)

        = Σ_{hL} P(hL | hL-1) P(xL | hL)   {first step; last equality due to conditional independence}

b(hi) = P(xi+1,…,xL | hi) = Σ_{hi+1} P(hi+1 | hi) P(xi+1 | hi+1) P(xi+2,…,xL | hi+1)

      = Σ_{hi+1} P(hi+1 | hi) P(xi+1 | hi+1) b(hi+1)   {step i}
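And the corresponding backward pass, continuing the same sketch (b is again 0-based, with b[L-1] set to 1 as the basis):

    def backward(pi, T, E, obs):
        """b[i, h] = P(x_{i+2}..x_L | H_{i+1} = h), filled from i = L-1 down to 0."""
        L, k = len(obs), len(pi)
        b = np.zeros((L, k))
        b[L - 1] = 1.0                                # basis: b(hL) = 1
        for i in range(L - 2, -1, -1):
            b[i] = T @ (E[:, obs[i + 1]] * b[i + 1])  # step i: sum over h_{i+1}
        return b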

Page 9: The Combined Answer

1. To compute the posterior belief in Hi (for a specific i) given the evidence {x1,…,xL}: run the forward algorithm to compute f(hi) = P(x1,…,xi,hi) and the backward algorithm to compute b(hi) = P(xi+1,…,xL|hi); the product f(hi)b(hi), normalized by K = Σ_{hi} f(hi)b(hi), is the answer (for every possible value hi).

2. To compute the posterior belief for every Hi, simply run the forward and backward algorithms once, storing f(hi) and b(hi) for every i (and every value hi), and compute f(hi)b(hi) for every i.
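Combining the two passes, as described above, gives posterior decoding in one sweep (a sketch reusing the hypothetical forward and backward functions from the previous slides):

    def posterior_decode(pi, T, E, obs):
        """P(H_i = h | x1..xL) for every position i, from one forward and one backward pass."""
        f = forward(pi, T, E, obs)
        b = backward(pi, T, E, obs)
        joint = f * b                                    # joint[i, h] = P(x1..xL, H_i = h)
        return joint / joint.sum(axis=1, keepdims=True)  # normalize by K at each position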


Page 10: Consequence I: Likelihood of Evidence

1. To compute the likelihood of evidence P(x1,…,xL), do one more step in the forward algorithm, namely,

P(x1,…,xL) = Σ_{hL} f(hL) = Σ_{hL} P(x1,…,xL, hL)

2. Alternatively, do one more step in the backward algorithm, namely,

P(x1,…,xL) = Σ_{h1} b(h1) P(h1) P(x1|h1) = Σ_{h1} P(x2,…,xL|h1) P(h1) P(x1|h1)

Evaluate the likelihood of the evidence for all candidate locations of the disease marker, and choose the best.
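In the same sketch notation, both routes give the same number (the assert is only a sanity check of the two alternatives):

    def likelihood(pi, T, E, obs):
        """P(x1..xL) via the forward route; the backward route must agree."""
        f = forward(pi, T, E, obs)
        b = backward(pi, T, E, obs)
        forward_route = f[-1].sum()
        backward_route = (b[0] * pi * E[:, obs[0]]).sum()
        assert np.isclose(forward_route, backward_route)
        return forward_route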

Page 11: Consequence II: The E-step


Recall that belief update has been computed via P(x1,…,xL,hi) = P(x1,…,xi,hi) P(xi+1,…,xL | hi) = f(hi) b(hi).

Now we wish to compute (for the E-step):

p(x1,…,xL, hi, hi+1) = p(x1,…,xi, hi) p(hi+1 | hi) p(xi+1 | hi+1) p(xi+2,…,xL | hi+1)

                     = f(hi) p(hi+1 | hi) p(xi+1 | hi+1) b(hi+1)
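As a sketch (again reusing the hypothetical forward/backward functions above), the pairwise quantity for a given position i is a k × k table:

    def pairwise_joint(pi, T, E, obs, i):
        """p(x1..xL, H_i = h, H_{i+1} = h') for a given 0-based i, returned as a k x k array."""
        f = forward(pi, T, E, obs)
        b = backward(pi, T, E, obs)
        return f[i][:, None] * T * (E[:, obs[i + 1]] * b[i + 1])[None, :]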

Page 12: The EM Algorithm for Finding a Local Maximum

[Figure: the HMM chain with the disease locus inserted at position i+1; its observation is denoted Yi+1]

The Expectation Maximization algorithm

Iterate the two steps:

E-step: compute p(x1,…,xL, hi, hi+1), where i+1 is the disease locus.

M-step: re-estimate the recombination fraction θ between the marker locus and the disease locus:

θ ← [ p(x1,…,xL, Hi=0, Hi+1=1) + p(x1,…,xL, Hi=1, Hi+1=0) ]
    / [ p(x1,…,xL, Hi=0, Hi+1=0) + p(x1,…,xL, Hi=0, Hi+1=1) + p(x1,…,xL, Hi=1, Hi+1=0) + p(x1,…,xL, Hi=1, Hi+1=1) ]

Repeat until convergence.

Comment: use random restarts, other convergence criteria, or other ending schemes.
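A rough sketch of one EM update for θ, under the simplifying assumption of a single binary inheritance bit and the symmetric transition matrix [[1-θ, θ], [θ, 1-θ]] (this simplification and the helper functions are from the earlier sketches, not from the slides):

    def em_update_theta(theta, E, obs, i):
        """One EM step for the recombination fraction between locus i and the disease locus i+1."""
        pi = np.array([0.5, 0.5])
        T = np.array([[1 - theta, theta],
                      [theta, 1 - theta]])
        J = pairwise_joint(pi, T, E, obs, i)   # E-step: p(x1..xL, H_i, H_{i+1})
        return (J[0, 1] + J[1, 0]) / J.sum()   # M-step: recombinant mass / total mass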

Page 13: Time and Space Complexity of the Forward/Backward Algorithms

Time complexity is linear in the length of the chain, provided the number of states of each variable is a constant. More precisely, time complexity is O(k²L), where k is the maximum domain size of each variable.


Space complexity is also O(k²L).

Page 14: The MAP Query in HMM


1. Recall that the likelihood-of-evidence query is to compute

P(x1,…,xL) = Σ_{(h1,…,hL)} P(x1,…,xL, h1,…,hL)

2. Now we wish to compute a similar quantity:

P*(x1,…,xL) = max_{(h1,…,hL)} P(x1,…,xL, h1,…,hL)

And, of course, we wish to find a MAP assignment (h1*,…,hL*) that brings about this maximum.

Page 15: Example: Revisiting Likelihood of Evidence

[Figure: a three-locus chain H1 → H2 → H3 with observations X1, X2, X3]

P(x1,x2,x3) = Σ_{h1} P(h1)P(x1|h1) Σ_{h2} P(h2|h1)P(x2|h2) Σ_{h3} P(h3|h2)P(x3|h3)

            = Σ_{h1} P(h1)P(x1|h1) Σ_{h2} b(h2) P(h2|h1)P(x2|h2)

            = Σ_{h1} b(h1) P(h1)P(x1|h1)

Page 16: Example: Computing the MAP Assignment


Replace sums with maximization:

maximum = max_{h1} P(h1)P(x1|h1) max_{h2} P(h2|h1)P(x2|h2) max_{h3} P(h3|h2)P(x3|h3)

        = max_{h1} P(h1)P(x1|h1) max_{h2} b_{h3}(h2) P(h2|h1)P(x2|h2)

        = max_{h1} b_{h2}(h1) P(h1)P(x1|h1)

{Finding the maximum}

h1* = argmax_{h1} b_{h2}(h1) P(h1)P(x1|h1)

{Finding the MAP assignment}

h2* = x*_{h2}(h1*);  h3* = x*_{h3}(h2*)

where x*_{h2}(h1) and x*_{h3}(h2) record the maximizing values of h2 and h3, respectively.

Page 17: Viterbi's Algorithm

Backward phase (storing the best value as a function of the parent's value):

b_{hL+1}(hL) = 1

For i = L-1 downto 1 do
    b_{hi+1}(hi) = max_{hi+1} P(hi+1 | hi) P(xi+1 | hi+1) b_{hi+2}(hi+1)
    x*_{hi+1}(hi) = argmax_{hi+1} P(hi+1 | hi) P(xi+1 | hi+1) b_{hi+2}(hi+1)

Forward phase (tracing the MAP assignment):

h1* = argmax_{h1} P(h1) P(x1 | h1) b_{h2}(h1)

For i = 1 to L-1 do
    hi+1* = x*_{hi+1}(hi*)
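A compact NumPy sketch of the whole algorithm (illustrative; same toy parameter conventions as the earlier sketches):

    import numpy as np

    def viterbi(pi, T, E, obs):
        """MAP assignment (h1*, ..., hL*) via a max-product backward pass and a forward trace."""
        L, k = len(obs), len(pi)
        b = np.ones((L, k))                    # b[i, h]: best value of the suffix after position i
        best = np.zeros((L, k), dtype=int)     # best[i, h]: maximizing value of h_{i+1}
        for i in range(L - 2, -1, -1):
            scores = T * (E[:, obs[i + 1]] * b[i + 1])[None, :]
            b[i] = scores.max(axis=1)
            best[i] = scores.argmax(axis=1)
        path = [int(np.argmax(pi * E[:, obs[0]] * b[0]))]   # h1*
        for i in range(L - 1):
            path.append(int(best[i][path[-1]]))              # h_{i+1}* = x*(h_i*)
        return path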

Page 18: Summary of HMM


1. Belief update = posterior decoding: the forward-backward algorithm.

2. Maximum A Posteriori assignment: the Viterbi algorithm.

Page 19: The Forward Algorithm for Genetic Linkage Analysis


P(x1,…,xi,hi) = Σ_{hi-1} P(x1,…,xi-1, hi-1) P(hi | hi-1) P(xi | hi)   {step i}

Note that in step i of the forward algorithm, we multiply a transition matrix of size 2^(2n) × 2^(2n) with vectors of size 2^(2n).

The transition matrix P(hi | hi-1 ) has a special form so we can multiply it by a vector faster than for arbitrary matrices.

The vector P(xi | hi) is not given explicitly, so we will see an efficient way to compute it.

Page 20: The Transition Matrix

Recall that, for each meiosis t ∈ {m, f}, the single-locus transition is

P(s_{2,3t} | s_{1,3t}) = M(θ), where M(θ) = | 1-θ   θ  |
                                            |  θ   1-θ |

Note that θ depends on i (the interval between loci), but this dependence is omitted. In our example, where we have one non-founder (n = 1), the transition probability table has size 4 × 4 = 2^(2n) × 2^(2n), encoding the four options of recombination/non-recombination for the two parental meioses:

P(s_{2,3m}, s_{2,3f} | s_{1,3m}, s_{1,3f}) = M(θm) ⊗ M(θf)   (the Kronecker product)

  = | (1-θm)(1-θf)   (1-θm)θf       θm(1-θf)       θmθf         |
    | (1-θm)θf       (1-θm)(1-θf)   θmθf           θm(1-θf)     |
    | θm(1-θf)       θmθf           (1-θm)(1-θf)   (1-θm)θf     |
    | θmθf           θm(1-θf)       (1-θm)θf       (1-θm)(1-θf) |

When θm = θf = θ, for n non-founders the transition matrix is the Kronecker product of the 2n per-meiosis matrices:

M^{⊗2n}(θ) = M(θ) ⊗ M(θ) ⊗ ... (2n times) ... ⊗ M(θ)
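A minimal NumPy sketch of this construction (illustrative; it assumes the same θ for every meiosis and builds the full matrix explicitly, which is only feasible for small n):

    import numpy as np
    from functools import reduce

    def meiosis_matrix(theta):
        """The 2 x 2 per-meiosis transition matrix M(theta)."""
        return np.array([[1 - theta, theta],
                         [theta, 1 - theta]])

    def transition_matrix(theta, n):
        """The 2^(2n) x 2^(2n) inheritance-vector transition matrix: Kronecker product of 2n copies of M."""
        return reduce(np.kron, [meiosis_matrix(theta)] * (2 * n))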

Page 21: Efficient Product

Write the transition matrix as M(θ) ⊗ A, where A is the Kronecker product of the remaining per-meiosis matrices, and split the vector of size 2^(2n) into two halves [x1, x2]. Then

[x1, x2] (M(θ) ⊗ A) = [ (1-θ) x1A + θ x2A ,  θ x1A + (1-θ) x2A ]

so one product with the full matrix reduces to two products with the smaller matrix A, plus element-wise scaling and addition.

So, if we start with a matrix of size 2^(2n) × 2^(2n), we need 2^(2n) multiplications once the products with the matrix A are in hand. Continuing recursively, at most 2n times, yields a complexity of O(2n · 2^(2n)), far less than the O(2^(4n)) needed for multiplication by an arbitrary matrix.

With n = 10 non-founders, this drops the computation from the infeasible range to the feasible one.
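The recursion can be coded directly, continuing the sketch above (it multiplies a row vector f of length 2^bits by the bits-fold Kronecker power of M(θ) without ever forming the full matrix):

    def kron_multiply(f, theta, bits):
        """Compute f @ (M(theta) ⊗ ... ⊗ M(theta)), with `bits` factors, in O(bits * len(f)) time."""
        if bits == 0:
            return f
        half = len(f) // 2
        x1A = kron_multiply(f[:half], theta, bits - 1)   # x1 A
        x2A = kron_multiply(f[half:], theta, bits - 1)   # x2 A
        return np.concatenate([(1 - theta) * x1A + theta * x2A,
                               theta * x1A + (1 - theta) * x2A])

For an inheritance vector over 2n meioses, the call would be kron_multiply(f, theta, 2 * n) with len(f) == 2**(2 * n).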

Page 22: Probability of the Data at One Marker Locus Given an Inheritance Vector

[Figure: model for locus 2, a Bayesian network with founder alleles L21m, L21f, L22m, L22f, child alleles L23m, L23f, selector (inheritance) variables S23m, S23f, and observed genotypes X21, X22, X23]

P(x21, x22, x23 | s23m, s23f) =

  = Σ_{l21m, l21f, l22m, l22f, l23m, l23f} P(l21m) P(l21f) P(l22m) P(l22f) P(x21 | l21m, l21f) P(x22 | l22m, l22f) P(x23 | l23m, l23f) P(l23m | l21m, l21f, s23m) P(l23f | l22m, l22f, s23f)

The last five terms are always zero-or-one, namely, indicator functions.

Page 23: Efficient Computation


Assume only individual 3 is genotyped, with genotype x23 = {A1, A2}. For the inheritance vector with s23m = 1 and s23f = 0, the founder alleles L21m and L22f are not restricted by the data, while (L21f, L22m) have only two possible joint assignments, (A1, A2) or (A2, A1):

p(x21, x22, x23 | s23m = 1, s23f = 0) = p(A1) p(A2) + p(A2) p(A1)

In general, every inheritance vector defines a subgraph of the Bayesian network above, from which we build a founder graph.

Page 24: Efficient Computation

[Figure: the locus-2 network again, with the subgraph selected by the inheritance vector (s23m = 1, s23f = 0) highlighted, given x23 = {A1, A2}]

In general, every inheritance vector defines a subgraph, as indicated above. Construct a founder graph whose vertices are the founder variables and in which there is an edge between two vertices if they have a common typed descendant. The label of an edge is the constraint dictated by the common typed descendant. Now find all consistent assignments for every connected component.

[Founder graph: vertices L21m, L21f, L22m, L22f; one edge between L21f and L22m labeled {A1, A2}]

Page 25: A Larger Example

[Figure: a larger pedigree whose typed individuals have genotypes a,b / a,b / a,b / a,b / a,c / b,d; below it, the corresponding descent graph and founder graph over founder-allele nodes 1-8, with edges labeled {a,b}, {a,b}, {a,b}, {b,d}, and {a,c}]

Descent graph

Founder graph (An example of a constraint satisfaction graph)

Connect two nodes if they have a common typed descendant.

Page 26: The Constraint Satisfaction Problem


The number of possible consistent alleles per non-isolated node is 0, 1, or 2. For example, node 2 can take any allele, node 6 can only be b, and node 3 can be assigned either a or b, namely, the intersection of the labels of its adjacent edges.

For each non-singleton connected component: start with an arbitrary node and pick one of its values; this dictates all other values in the component. Repeat with the other value if the node has one.

So each non-singleton component yields at most two solutions.
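A sketch of this propagation in Python (the data structures are hypothetical, not from the slides: the founder graph is a dict mapping each founder-allele node to a list of (neighbor, (a1, a2)) edges, where (a1, a2) is the genotype of the common typed descendant):

    def component_assignments(graph, start, start_value):
        """Propagate one allele choice at `start` through its connected component.

        Returns a {node: allele} assignment, or None if the choice is inconsistent."""
        assignment = {start: start_value}
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor, (a1, a2) in graph[node]:
                if assignment[node] not in (a1, a2):
                    return None                      # node's allele violates the edge constraint
                value = a2 if assignment[node] == a1 else a1
                if neighbor in assignment:
                    if assignment[neighbor] != value:
                        return None                  # conflict with a value forced by another edge
                else:
                    assignment[neighbor] = value
                    stack.append(neighbor)
        return assignment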

Page 27: Solution of the CSP


Since each non-singleton component yields at most two solutions, the likelihood is simply a product of sums, each containing at most two terms. Each component contributes one factor; singleton components contribute the factor 1.

In our example: 1 * [ p(a)p(b)p(a) + p(b)p(a)p(b)] * p(d)p(b)p(a)p(c).

Complexity: building the founder graph takes O(f² + n) time.

By contrast, solving general CSPs is NP-hard.
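Continuing the sketch, the locus probability is then a product over connected components of the (at most two) consistent-assignment probabilities; here allele_freq is a hypothetical dict of population allele frequencies, and components is the list of connected components of the founder graph:

    def locus_probability(components, graph, allele_freq):
        """Product over components of the sum of probabilities of their consistent assignments."""
        total = 1.0
        for comp in components:                      # comp is a list of founder-allele nodes
            if len(comp) == 1:
                continue                             # singleton components contribute the factor 1
            start = comp[0]
            candidates = {a for _, (a1, a2) in graph[start] for a in (a1, a2)}
            term = 0.0
            for value in candidates:                 # at most two of these survive propagation
                assignment = component_assignments(graph, start, value)
                if assignment is not None:
                    p = 1.0
                    for allele in assignment.values():
                        p *= allele_freq[allele]
                    term += p
            total *= term
        return total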