Learning – EM in The ABO locus. Tutorial #8. © Ilan Gronau. Based on original slides of Ydo Wexler & Dan Geiger

Post on 18-Dec-2015


Page 2

Genotype statistics

Mendelian Genetics:
• locus - a particular location on a chromosome (genome)
  - Each locus has two copies – alleles (one paternal and one maternal)
  - Each copy has several relevant states - genotypes
• locus genotype is determined by the combined genotype of both copies.
• locus genotype yields phenotype (physical features)

We wish to estimate the distribution of all possible genotypes.

Suppose we randomly sample N individuals and find that $N_{s,t}$ of them have locus genotype s/t.

The MLE is given by: $\theta_{s,t} = \dfrac{N_{s,t}}{N}$

Sampling genotypes is costly.
Sampling phenotypes is cheap.
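
As a small illustration of the MLE above, the sketch below computes $N_{s,t}/N$ from a list of sampled locus genotypes. The function name `genotype_mle` and the toy sample are assumptions made for this example only; they are not data from the tutorial.

```python
from collections import Counter

def genotype_mle(sample):
    """Return the MLE theta[s/t] = N[s/t] / N for a list of locus genotypes."""
    counts = Counter(sample)   # N[s/t] for every genotype seen in the sample
    total = len(sample)        # N, the number of sampled individuals
    return {genotype: c / total for genotype, c in counts.items()}

# Hypothetical sample of N = 10 genotyped individuals (illustrative only).
toy_sample = ["a/o", "b/o", "b/o", "o/o", "a/b", "b/b", "b/o", "a/o", "o/o", "b/o"]
print(genotype_mle(toy_sample))
```

Genotypes that never appear in the sample simply get no entry, which corresponds to an estimate of zero.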

Page 3

Example: The ABO locus

• ABO locus determines blood-type
• It has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}.
• They lead to four possible phenotypes: {A, B, AB, O}

We wish to estimate the proportion in a population of the 6 genotypes.
- Sample genotype – sequence a genomic region
- Sample phenotype – check for the presence of antibodies (simple blood test)

Problem: phenotype doesn't reveal genotype (in the case of phenotypes A and B)
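
The ambiguity can be made concrete with a lookup table. The sketch below encodes the genotype-to-phenotype map implied by the slides (a and b each mask o); the names `PHENOTYPE_OF` and `genotypes_for` are hypothetical helpers introduced here.

```python
# Genotype -> phenotype map for the ABO locus (alleles a and b each mask o).
PHENOTYPE_OF = {
    "a/a": "A", "a/o": "A",
    "b/b": "B", "b/o": "B",
    "a/b": "AB",
    "o/o": "O",
}

def genotypes_for(phenotype):
    """List the locus genotypes consistent with an observed phenotype."""
    return sorted(g for g, p in PHENOTYPE_OF.items() if p == phenotype)

for phenotype in ["A", "B", "AB", "O"]:
    print(phenotype, genotypes_for(phenotype))
```

Phenotypes A and B each map back to two genotypes, which is exactly why the genotype has to be treated as a hidden variable.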

Page 4

Example: The ABO locus

Problem: phenotype doesn't reveal genotype

Assuming allele genotypes are distributed independently with probabilities $\theta_a$, $\theta_b$, $\theta_o$, we determine probabilities for locus genotypes (Hardy-Weinberg equilibrium):

• $\theta_{a/b} = 2\theta_a\theta_b$ ; $\theta_{a/o} = 2\theta_a\theta_o$ ; $\theta_{b/o} = 2\theta_b\theta_o$
• $\theta_{a/a} = \theta_a^2$ ; $\theta_{b/b} = \theta_b^2$ ; $\theta_{o/o} = \theta_o^2$

$\Theta$ - model parameter set - $\Theta = \{\theta_a, \theta_b, \theta_o\}$

X – (hidden) genotype variable - $\Pr[X = x \mid \Theta] = \theta_x$

P – (observed) phenotype variable - $\Pr[P = p \mid \Theta] = \sum_{x \to p} \theta_x$ (the sum runs over the genotypes x that yield phenotype p)

e.g. $\Pr[P = A \mid \Theta] = \theta_{a/a} + \theta_{a/o} = \theta_a^2 + 2\theta_a\theta_o$
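
The phenotype probabilities can be sketched directly from these Hardy-Weinberg formulas. The function name `phenotype_probs` is an assumption of this example; the parameter values {0.2, 0.2, 0.6} are the initial guess used later in the tutorial.

```python
def phenotype_probs(theta_a, theta_b, theta_o):
    """Phenotype probabilities under Hardy-Weinberg equilibrium."""
    return {
        "A": theta_a**2 + 2 * theta_a * theta_o,   # genotypes a/a and a/o
        "B": theta_b**2 + 2 * theta_b * theta_o,   # genotypes b/b and b/o
        "AB": 2 * theta_a * theta_b,               # genotype a/b
        "O": theta_o**2,                           # genotype o/o
    }

probs = phenotype_probs(0.2, 0.2, 0.6)
print(probs)
print(sum(probs.values()))  # close to 1 whenever the three parameters sum to 1
```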

Page 5

Example: The ABO locus

Given a population phenotype sample:

Data = {B, A, B, B, O, A, B, A, O, B, AB}

the likelihood of our parameter set $\Theta = \{\theta_a, \theta_b, \theta_o\}$ is:

$\Pr[Data \mid \Theta] = (\theta_a^2 + 2\theta_a\theta_o)^3 \, (\theta_b^2 + 2\theta_b\theta_o)^5 \, (2\theta_a\theta_b)^1 \, (\theta_o^2)^2$

(the four factors correspond to phenotypes A, B, AB and O, in that order)

• Maximum of this function yields the MLE
• Use EM to obtain this
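
A minimal sketch of this likelihood, evaluated in log space to avoid underflow on larger samples. The name `log_likelihood` and the evaluation point (0.2, 0.2, 0.6) are choices made for this example; the code assumes every observed phenotype has non-zero probability under the given parameters.

```python
import math
from collections import Counter

def log_likelihood(theta, data):
    """log Pr[Data | Theta] for a list of observed ABO phenotypes."""
    theta_a, theta_b, theta_o = theta
    probs = {
        "A": theta_a**2 + 2 * theta_a * theta_o,
        "B": theta_b**2 + 2 * theta_b * theta_o,
        "AB": 2 * theta_a * theta_b,
        "O": theta_o**2,
    }
    counts = Counter(data)  # the exponents of the four factors
    return sum(n * math.log(probs[p]) for p, n in counts.items())

data = ["B", "A", "B", "B", "O", "A", "B", "A", "O", "B", "AB"]
print(Counter(data))                         # B: 5, A: 3, O: 2, AB: 1
print(log_likelihood((0.2, 0.2, 0.6), data))
```

The phenotype counts printed first are the exponents 3, 5, 1 and 2 that appear in the likelihood formula.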

Page 6

EM algorithm

Start with some set of parameters $\Theta$.

Iterate until convergence:
• E-step: calculate expectations of hidden variables implied by data and $\Theta$
• M-step: for every hidden variable X, use expectations as statistics to yield MLE $\Theta'$ given $\Theta$

Hidden variables – allele genotypes
• If we knew the count of each allele genotype we could calculate the MLE of $\Theta$ (= $\{\theta_a, \theta_b, \theta_o\}$)
• In the M-step we use the expected counts of allele genotypes (given $\Theta$)

Page 7

EM algorithm – ABO example

E-step:

E[#(x)] – the expected count of genotype x in the maternal allele of each locus.

If the dataset has n phenotypes $p_1, \dots, p_n$ then: $\#(x) = \sum_i \mathbf{1}(X_i = x)$, where $\mathbf{1}(\cdot)$ is the indicator function.

By linearity of expectation: $E[\#(x)] = \sum_i E[\mathbf{1}(X_i = x)]$

$E[\mathbf{1}(X_i = x) \mid p_i] = \Pr[X_i = x \mid p_i] = \dfrac{\Pr[X_i = x, p_i]}{\sum_y \Pr[X_i = y, p_i]}$

(here $X_i$ is the hidden genotype and $p_i$ is the observed phenotype)

M-step:

$\theta'_x = \dfrac{E[\#(x) \mid \Theta]}{\sum_y E[\#(y) \mid \Theta]} = \dfrac{E[\#(x) \mid \Theta]}{n}$

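Page 8

E-step calculations

• E-step: compute $\Pr[X_i = x, p_i]$, then normalise: $\Pr[X_i = x \mid p_i] = \dfrac{\Pr[X_i = x, p_i]}{\sum_y \Pr[X_i = y, p_i]}$

X is the hidden genotype (of the paternal allele); P is the observed phenotype.

Phenotype AB:
$\Pr[X=o, P=AB] = 0$ ; $\Pr[X=o \mid P=AB] = 0$
$\Pr[X=a, P=AB] = \theta_a\theta_b$ ; $\Pr[X=a \mid P=AB] = \tfrac{1}{2}$
$\Pr[X=b, P=AB] = \theta_b\theta_a$ ; $\Pr[X=b \mid P=AB] = \tfrac{1}{2}$

Phenotype O:
$\Pr[X=o, P=O] = \theta_o^2$ ; $\Pr[X=o \mid P=O] = 1$
$\Pr[X=a, P=O] = 0$ ; $\Pr[X=a \mid P=O] = 0$
$\Pr[X=b, P=O] = 0$ ; $\Pr[X=b \mid P=O] = 0$

Phenotype A:
$\Pr[X=o, P=A] = \theta_o\theta_a$ ; $\Pr[X=o \mid P=A] = \dfrac{\theta_o}{\theta_a + 2\theta_o}$
$\Pr[X=a, P=A] = \theta_a(\theta_a + \theta_o)$ ; $\Pr[X=a \mid P=A] = \dfrac{\theta_a + \theta_o}{\theta_a + 2\theta_o}$
$\Pr[X=b, P=A] = 0$ ; $\Pr[X=b \mid P=A] = 0$

Phenotype B:
$\Pr[X=o, P=B] = \theta_o\theta_b$ ; $\Pr[X=o \mid P=B] = \dfrac{\theta_o}{\theta_b + 2\theta_o}$
$\Pr[X=a, P=B] = 0$ ; $\Pr[X=a \mid P=B] = 0$
$\Pr[X=b, P=B] = \theta_b(\theta_b + \theta_o)$ ; $\Pr[X=b \mid P=B] = \dfrac{\theta_b + \theta_o}{\theta_b + 2\theta_o}$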
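
The normalisation step can be sketched as follows: fill in the joint table $\Pr[X = x, P = p]$ and divide each row by its sum. The function names are hypothetical, and the sketch assumes all three allele probabilities are strictly positive so that no row sums to zero.

```python
def joint_allele_phenotype(theta_a, theta_b, theta_o):
    """Pr[X = x, P = p] for one fixed allele X of an individual and phenotype P."""
    return {
        "AB": {"a": theta_a * theta_b, "b": theta_b * theta_a, "o": 0.0},
        "O": {"a": 0.0, "b": 0.0, "o": theta_o**2},
        "A": {"a": theta_a * (theta_a + theta_o), "b": 0.0, "o": theta_o * theta_a},
        "B": {"a": 0.0, "b": theta_b * (theta_b + theta_o), "o": theta_o * theta_b},
    }

def posterior_allele_given_phenotype(theta_a, theta_b, theta_o):
    """Pr[X = x | P = p], obtained by normalising each row of the joint table."""
    joint = joint_allele_phenotype(theta_a, theta_b, theta_o)
    posterior = {}
    for phenotype, row in joint.items():
        total = sum(row.values())  # equals Pr[P = phenotype]
        posterior[phenotype] = {x: value / total for x, value in row.items()}
    return posterior

post = posterior_allele_given_phenotype(0.2, 0.2, 0.6)
for phenotype, row in post.items():
    print(phenotype, {x: round(v, 4) for x, v in row.items()})
```

Each row sum of the joint table equals the corresponding phenotype probability, so the closed-form fractions in the table drop out of the division.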

Page 9

EM algorithm – ABO example

Data:
type   #people
A      100
B      200
AB     50
O      50

• $\Theta = \{\theta_a, \theta_b, \theta_o\}$ is the parameter set we need to estimate
• Our initial guess is $\Theta_0 = \{0.2, 0.2, 0.6\}$

Page 10

EM algorithm – ABO example

Data: A 100, B 200, AB 50, O 50

$\Theta_0 = \{0.2, 0.2, 0.6\}$
n = 400 (data size)

E-step (1st iteration):

$E[\#(a)] = \sum_{i=1}^{n} \Pr[X_i = a \mid p_i] = \sum_{p \in \{A, B, AB, O\}} n_p \Pr[X = a \mid p]$

$= 100 \cdot \dfrac{\theta_a + \theta_o}{\theta_a + 2\theta_o} + 200 \cdot 0 + 50 \cdot \tfrac{1}{2} + 50 \cdot 0 \approx 82.15$

(the four terms correspond to phenotypes A, B, AB and O, in that order)

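Page 11

EM algorithm – ABO example

Data: A 100, B 200, AB 50, O 50

$\Theta_0 = \{0.2, 0.2, 0.6\}$
n = 400 (data size)

E-step (1st iteration), with the four terms ordered A, B, AB, O:

$E[\#(a)] = 100 \cdot \dfrac{\theta_a + \theta_o}{\theta_a + 2\theta_o} + 200 \cdot 0 + 50 \cdot \tfrac{1}{2} + 50 \cdot 0 \approx 82.15$

$E[\#(b)] = 100 \cdot 0 + 200 \cdot \dfrac{\theta_b + \theta_o}{\theta_b + 2\theta_o} + 50 \cdot \tfrac{1}{2} + 50 \cdot 0 \approx 139.29$

$E[\#(o)] = 100 \cdot \dfrac{\theta_o}{\theta_a + 2\theta_o} + 200 \cdot \dfrac{\theta_o}{\theta_b + 2\theta_o} + 50 \cdot 0 + 50 \cdot 1 \approx 178.57$

M-step (1st iteration):

$\theta'_a = \dfrac{82.15}{400} \approx 0.205$ ; $\theta'_b = \dfrac{139.29}{400} \approx 0.348$ ; $\theta'_o = \dfrac{178.57}{400} \approx 0.447$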

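Page 12

EM algorithm – ABO example

Data: A 100, B 200, AB 50, O 50

$\Theta_1 = \{0.205, 0.348, 0.447\}$
n = 400 (data size)

E-step (2nd iteration), with the four terms ordered A, B, AB, O:

$E[\#(a)] = 100 \cdot \dfrac{\theta_a + \theta_o}{\theta_a + 2\theta_o} + 200 \cdot 0 + 50 \cdot \tfrac{1}{2} + 50 \cdot 0 \approx 84.33$

$E[\#(b)] = 100 \cdot 0 + 200 \cdot \dfrac{\theta_b + \theta_o}{\theta_b + 2\theta_o} + 50 \cdot \tfrac{1}{2} + 50 \cdot 0 \approx 153.02$

$E[\#(o)] = 100 \cdot \dfrac{\theta_o}{\theta_a + 2\theta_o} + 200 \cdot \dfrac{\theta_o}{\theta_b + 2\theta_o} + 50 \cdot 0 + 50 \cdot 1 \approx 162.65$

M-step (2nd iteration):

$\theta'_a = \dfrac{84.33}{400} \approx 0.211$ ; $\theta'_b = \dfrac{153.02}{400} \approx 0.383$ ; $\theta'_o = \dfrac{162.65}{400} \approx 0.406$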

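Page 13

EM algorithm – ABO example

Data:
type   #people
A      $n_A$
B      $n_B$
AB     $n_{AB}$
O      $n_O$

E-step + M-step, general update formula:

$\theta'_a = \dfrac{1}{n}\left( n_A \dfrac{\theta_a + \theta_o}{\theta_a + 2\theta_o} + \tfrac{1}{2} n_{AB} \right)$

$\theta'_b = \dfrac{1}{n}\left( n_B \dfrac{\theta_b + \theta_o}{\theta_b + 2\theta_o} + \tfrac{1}{2} n_{AB} \right)$

$\theta'_o = \dfrac{1}{n}\left( n_A \dfrac{\theta_o}{\theta_a + 2\theta_o} + n_B \dfrac{\theta_o}{\theta_b + 2\theta_o} + n_O \right)$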
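
The general update formula translates almost line by line into code. The sketch below is one possible rendering; the name `em_update` and the fixed budget of 20 iterations are assumptions of this example. Because it carries unrounded values from one iteration to the next, its later iterates can differ in the third decimal from the hand calculations, which plug in rounded parameters.

```python
def em_update(theta, counts):
    """One combined E-step + M-step of the general update formula."""
    theta_a, theta_b, theta_o = theta
    n_a, n_b, n_ab, n_o = counts["A"], counts["B"], counts["AB"], counts["O"]
    n = n_a + n_b + n_ab + n_o
    new_a = (n_a * (theta_a + theta_o) / (theta_a + 2 * theta_o) + 0.5 * n_ab) / n
    new_b = (n_b * (theta_b + theta_o) / (theta_b + 2 * theta_o) + 0.5 * n_ab) / n
    new_o = (n_a * theta_o / (theta_a + 2 * theta_o)
             + n_b * theta_o / (theta_b + 2 * theta_o)
             + n_o) / n
    return (new_a, new_b, new_o)

counts = {"A": 100, "B": 200, "AB": 50, "O": 50}
theta = (0.2, 0.2, 0.6)  # initial guess from the tutorial
for iteration in range(1, 21):
    theta = em_update(theta, counts)
    print(iteration, [round(t, 3) for t in theta])
```

In practice the loop would stop once successive parameter vectors differ by less than some tolerance rather than after a fixed number of iterations.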

Page 14

EM algorithm – ABO example

Data: A 100, B 200, AB 50, O 50

[Figure: $\theta_a$, $\theta_b$, $\theta_o$ plotted against learning iteration; values marked on the plot: 0.20, 0.38, 0.42]

Page 15

EM algorithm – ABO example

Data: A 100, B 200, AB 50, O 50

[Figure: the same plot of $\theta_a$, $\theta_b$, $\theta_o$ against learning iteration, annotated "good convergence"]

Page 16

Gene Counting

Current formulation: hidden variables correspond to single allele genotype

Gene-counting: hidden variables correspond to whole locus genotype

• If we know the number of locus genotypes $n_{a/a}, n_{a/o}, n_{a/b}, n_{b/b}, n_{b/o}, n_{o/o}$, we can estimate all parameters:

$\theta_a = \dfrac{2n_{a/a} + n_{a/o} + n_{a/b}}{2n}$

$\theta_b = \dfrac{2n_{b/b} + n_{b/o} + n_{a/b}}{2n}$

$\theta_o = \dfrac{2n_{o/o} + n_{b/o} + n_{a/o}}{2n}$

(here $n_{a/b} = n_{AB}$ and $n_{o/o} = n_O$ are observed directly)

• Instead, we estimate the number of such counts given some initial $\Theta$.
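
To illustrate the counting formulas: each individual contributes two alleles, so a homozygote counts twice for its allele and the total is 2n. The function name and the genotype counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def allele_freqs_from_genotypes(g):
    """Allele frequencies from fully observed locus-genotype counts."""
    n = sum(g.values())  # number of individuals; each carries 2 alleles
    theta_a = (2 * g["a/a"] + g["a/o"] + g["a/b"]) / (2 * n)
    theta_b = (2 * g["b/b"] + g["b/o"] + g["a/b"]) / (2 * n)
    theta_o = (2 * g["o/o"] + g["b/o"] + g["a/o"]) / (2 * n)
    return (theta_a, theta_b, theta_o)

# Hypothetical genotype counts for n = 200 individuals (illustrative only).
toy_counts = {"a/a": 10, "a/o": 40, "a/b": 20, "b/b": 30, "b/o": 60, "o/o": 40}
print(allele_freqs_from_genotypes(toy_counts))
```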

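Page 17

E-step calculations

• E-step: compute $\Pr[X_i = x, p_i]$, then normalise to get $\Pr[X_i = x \mid p_i]$

X is the hidden (locus) genotype; P is the observed phenotype.

$\Pr[X=a/b, P=AB] = 2\theta_a\theta_b$ ; $\Pr[X=a/b \mid P=AB] = 1$
$\Pr[X=o/o, P=O] = \theta_o^2$ ; $\Pr[X=o/o \mid P=O] = 1$
$\Pr[X=a/o, P=A] = 2\theta_o\theta_a$ ; $\Pr[X=a/o \mid P=A] = \dfrac{2\theta_o}{\theta_a + 2\theta_o}$
$\Pr[X=a/a, P=A] = \theta_a^2$ ; $\Pr[X=a/a \mid P=A] = \dfrac{\theta_a}{\theta_a + 2\theta_o}$
$\Pr[X=b/o, P=B] = 2\theta_o\theta_b$ ; $\Pr[X=b/o \mid P=B] = \dfrac{2\theta_o}{\theta_b + 2\theta_o}$
$\Pr[X=b/b, P=B] = \theta_b^2$ ; $\Pr[X=b/b \mid P=B] = \dfrac{\theta_b}{\theta_b + 2\theta_o}$

Expected genotype counts:

$n_{a/a} = n_A \dfrac{\theta_a}{\theta_a + 2\theta_o}$ ; $n_{a/o} = n_A \dfrac{2\theta_o}{\theta_a + 2\theta_o}$ ; $n_{b/b} = n_B \dfrac{\theta_b}{\theta_b + 2\theta_o}$ ; $n_{b/o} = n_B \dfrac{2\theta_o}{\theta_b + 2\theta_o}$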
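
The expected genotype counts can be sketched as a function that splits $n_A$ and $n_B$ between their two possible genotypes, while $n_{AB}$ and $n_O$ pass through unchanged. The name `expected_genotype_counts` is an assumption of this example; the data and parameters are the ones used earlier in the tutorial.

```python
def expected_genotype_counts(theta, counts):
    """E-step of gene counting: expected locus-genotype counts given theta."""
    theta_a, theta_b, theta_o = theta
    denom_a = theta_a + 2 * theta_o
    denom_b = theta_b + 2 * theta_o
    return {
        "a/a": counts["A"] * theta_a / denom_a,
        "a/o": counts["A"] * 2 * theta_o / denom_a,
        "b/b": counts["B"] * theta_b / denom_b,
        "b/o": counts["B"] * 2 * theta_o / denom_b,
        "a/b": counts["AB"],  # phenotype AB determines the genotype
        "o/o": counts["O"],   # phenotype O determines the genotype
    }

counts = {"A": 100, "B": 200, "AB": 50, "O": 50}
expected = expected_genotype_counts((0.2, 0.2, 0.6), counts)
print({g: round(v, 2) for g, v in expected.items()})
```

The expected counts for a/a and a/o always add up to $n_A$, and those for b/b and b/o add up to $n_B$.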

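Page 18

Gene Counting

EM algorithm for ABO:

E-step:

$n_{a/a} = n_A \dfrac{\theta_a}{\theta_a + 2\theta_o}$ ; $n_{a/o} = n_A \dfrac{2\theta_o}{\theta_a + 2\theta_o}$ ; $n_{b/b} = n_B \dfrac{\theta_b}{\theta_b + 2\theta_o}$ ; $n_{b/o} = n_B \dfrac{2\theta_o}{\theta_b + 2\theta_o}$

M-step:

$\theta'_a = \dfrac{2n_{a/a} + n_{a/o} + n_{AB}}{2n}$

$\theta'_b = \dfrac{2n_{b/b} + n_{b/o} + n_{AB}}{2n}$

$\theta'_o = \dfrac{2n_O + n_{b/o} + n_{a/o}}{2n}$

Substituting the E-step counts into the M-step gives:

$\theta'_a = \dfrac{1}{n}\left( n_A \dfrac{\theta_a + \theta_o}{\theta_a + 2\theta_o} + \tfrac{1}{2} n_{AB} \right)$

$\theta'_b = \dfrac{1}{n}\left( n_B \dfrac{\theta_b + \theta_o}{\theta_b + 2\theta_o} + \tfrac{1}{2} n_{AB} \right)$

$\theta'_o = \dfrac{1}{n}\left( n_A \dfrac{\theta_o}{\theta_a + 2\theta_o} + n_B \dfrac{\theta_o}{\theta_b + 2\theta_o} + n_O \right)$

Same as slide 13.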
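
A minimal sketch of one full gene-counting iteration, written as an explicit E-step followed by an explicit M-step. The function name `gene_counting_step` and the budget of 50 iterations are assumptions of this example. Starting from {0.2, 0.2, 0.6} on the tutorial data, the first step should reproduce the first-iteration values computed with the allele-based formulation, up to rounding.

```python
def gene_counting_step(theta, counts):
    """One EM iteration with whole-locus genotypes as the hidden variables."""
    theta_a, theta_b, theta_o = theta
    n = counts["A"] + counts["B"] + counts["AB"] + counts["O"]

    # E-step: expected genotype counts given the current parameters.
    n_aa = counts["A"] * theta_a / (theta_a + 2 * theta_o)
    n_ao = counts["A"] * 2 * theta_o / (theta_a + 2 * theta_o)
    n_bb = counts["B"] * theta_b / (theta_b + 2 * theta_o)
    n_bo = counts["B"] * 2 * theta_o / (theta_b + 2 * theta_o)

    # M-step: count alleles among the 2n allele copies.
    new_a = (2 * n_aa + n_ao + counts["AB"]) / (2 * n)
    new_b = (2 * n_bb + n_bo + counts["AB"]) / (2 * n)
    new_o = (2 * counts["O"] + n_bo + n_ao) / (2 * n)
    return (new_a, new_b, new_o)

counts = {"A": 100, "B": 200, "AB": 50, "O": 50}
theta = (0.2, 0.2, 0.6)
for _ in range(50):
    theta = gene_counting_step(theta, counts)
print([round(t, 4) for t in theta])
```

Keeping the two steps separate costs a few extra lines but mirrors the derivation; the closed-form update is the same computation with the intermediate counts substituted away.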