Learning – EM in the ABO locus
Tutorial #8
© Ilan Gronau.
Based on original slides of Ydo Wexler & Dan Geiger
2
Genotype statistics
Mendelian Genetics:
• locus - a particular location on a chromosome (genome)
- Each locus has two copies – alleles (one paternal and one maternal)
- Each copy has several relevant states - genotypes
• locus genotype is determined by the combined genotype of both copies.
• locus genotype yields phenotype (physical features)
We wish to estimate the distribution of all possible genotypes.
Suppose we randomly sample N individuals and find that $N_{s,t}$ of them have locus genotype s/t.
The MLE is given by: $\theta_{s,t} = \frac{N_{s,t}}{N}$
Sampling genotypes is costly
Sampling phenotypes is cheap
3
Example: The ABO locus
• ABO locus determines blood-type
• It has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}.
• They lead to four possible phenotypes: {A, B, AB, O}
We wish to estimate the proportion in a population of the 6 genotypes.
- Sample genotype - sequence a genomic region
- Sample phenotype - check for the presence of antibodies (simple blood test)
Problem: phenotype doesn’t reveal genotype (in case of A,B)
4
Example: The ABO locus
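Problem: phenotype doesn’t reveal genotype
Assuming allele genotypes are distributed independently w.p. $\theta_a$, $\theta_b$, $\theta_o$,
we determine probabilities for locus genotypes (Hardy-Weinberg equilibrium):
• $\theta_{a/b} = 2\theta_a\theta_b$ ; $\theta_{a/o} = 2\theta_a\theta_o$ ; $\theta_{b/o} = 2\theta_b\theta_o$
• $\theta_{a/a} = \theta_a^2$ ; $\theta_{b/b} = \theta_b^2$ ; $\theta_{o/o} = \theta_o^2$
$\Theta$ - model parameter set - $\Theta = \{\theta_a, \theta_b, \theta_o\}$
X – (hidden) genotype variable - $\Pr[X = x \mid \Theta] = \theta_x$
P – (observed) phenotype variable - $\Pr[P = p \mid \Theta] = \sum_{x \to p} \theta_x$ (sum over the genotypes x that yield phenotype p)
e.g. $\Pr[P = A \mid \Theta] = \theta_{a/a} + \theta_{a/o} = \theta_a^2 + 2\theta_a\theta_o$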
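The genotype and phenotype probabilities above can be sketched in a few lines of Python. The function names and the dictionary layout are illustrative choices, not part of the original slides.

```python
# Sketch: locus-genotype and phenotype probabilities under Hardy-Weinberg
# equilibrium. Names (genotype_probs, phenotype_probs) are illustrative.

# Which phenotype each locus genotype yields.
PHENOTYPE_OF = {
    "a/a": "A", "a/o": "A",
    "b/b": "B", "b/o": "B",
    "a/b": "AB",
    "o/o": "O",
}

def genotype_probs(theta_a, theta_b, theta_o):
    """Probability of each of the six locus genotypes."""
    return {
        "a/a": theta_a ** 2,
        "b/b": theta_b ** 2,
        "o/o": theta_o ** 2,
        "a/b": 2 * theta_a * theta_b,
        "a/o": 2 * theta_a * theta_o,
        "b/o": 2 * theta_b * theta_o,
    }

def phenotype_probs(theta_a, theta_b, theta_o):
    """Pr[P = p | Theta]: sum of the genotype probabilities that yield p."""
    probs = {"A": 0.0, "B": 0.0, "AB": 0.0, "O": 0.0}
    for genotype, pr in genotype_probs(theta_a, theta_b, theta_o).items():
        probs[PHENOTYPE_OF[genotype]] += pr
    return probs

example = phenotype_probs(0.2, 0.2, 0.6)
```

Because the three allele probabilities sum to 1, the six genotype probabilities (and hence the four phenotype probabilities) also sum to 1.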
5
Example: The ABO locus
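Given a population phenotype sample: Data = {B,A,B,B,O,A,B,A,O,B,AB}
the likelihood of our parameter set $\Theta = \{\theta_a, \theta_b, \theta_o\}$ is:
$\Pr[Data \mid \Theta] = (\theta_a^2 + 2\theta_a\theta_o)^3 \, (\theta_b^2 + 2\theta_b\theta_o)^5 \, (2\theta_a\theta_b)^1 \, (\theta_o^2)^2$
(the four factors correspond to phenotypes A, B, AB, O, observed 3, 5, 1 and 2 times)
• Maximum of this function yields the MLE
Use EM to obtain this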
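As a rough illustration, the likelihood of this 11-person sample can be evaluated numerically. The sketch below works with the log-likelihood and assumes all three parameters are strictly positive; the function names are illustrative.

```python
import math
from collections import Counter

# The phenotype sample from the slide.
DATA = ["B", "A", "B", "B", "O", "A", "B", "A", "O", "B", "AB"]

def phenotype_probs(a, b, o):
    """Pr[P = p | Theta] under Hardy-Weinberg equilibrium."""
    return {"A": a * a + 2 * a * o, "B": b * b + 2 * b * o,
            "AB": 2 * a * b, "O": o * o}

def log_likelihood(data, a, b, o):
    """log Pr[Data | Theta], assuming a, b, o > 0 so every log is defined."""
    probs = phenotype_probs(a, b, o)
    counts = Counter(data)  # here: A -> 3, B -> 5, AB -> 1, O -> 2
    return sum(n * math.log(probs[p]) for p, n in counts.items())

ll = log_likelihood(DATA, 0.2, 0.2, 0.6)
```

Comparing `log_likelihood` at different parameter settings shows which one explains the sample better; EM is a way to climb this function without evaluating it on a grid.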
6
EM algorithm
Start with some set of parameters $\Theta$.
Iterate until convergence:
• E-step:
calculate expectations of hidden variables implied by data and $\Theta$
• M-step:
For every hidden variable X:
• Use expectations as statistics to yield MLE $\Theta'$ given $\Theta$
Hidden variables – allele genotypes
• If we knew the count of each allele genotype we could calculate the MLE of $\Theta$ ($= \{\theta_a, \theta_b, \theta_o\}$)
• In the M-step we use the expected counts of allele genotypes (given $\Theta$)
7
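EM algorithm – ABO example
E-step:
$E[\#(x)]$ – the expected number of counts of genotype x in the maternal allele of each locus.
If the dataset has n phenotypes $p_1 \dots p_n$ then: $\#(x) = \sum_i \mathbb{1}(X_i = x)$, where $\mathbb{1}(\cdot)$ is an indicator, $X_i$ is the hidden genotype and $p_i$ is the observed phenotype.
By linearity of expectation: $E[\#(x)] = \sum_i E[\mathbb{1}(X_i = x)]$
• $E[\mathbb{1}(X_i = x) \mid p_i] = \Pr[X_i = x \mid p_i] = \frac{\Pr[X_i = x, p_i]}{\sum_y \Pr[X_i = y, p_i]}$
M-step:
• $\theta'_x = \frac{E[\#(x) \mid \Theta]}{\sum_y E[\#(y) \mid \Theta]} = \frac{E[\#(x) \mid \Theta]}{n}$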
8
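• E-step: compute $\Pr[X_i = x, p_i]$, then normalize: $\Pr[X_i = x \mid p_i] = \frac{\Pr[X_i = x, p_i]}{\sum_y \Pr[X_i = y, p_i]}$
E-step calculations (X: hidden genotype of the paternal allele; P: observed phenotype)
Phenotype AB:
$\Pr[X=o, P=AB] = 0$ ; $\Pr[X=o \mid P=AB] = 0$
$\Pr[X=a, P=AB] = \theta_a\theta_b$ ; $\Pr[X=a \mid P=AB] = \frac{1}{2}$
$\Pr[X=b, P=AB] = \theta_b\theta_a$ ; $\Pr[X=b \mid P=AB] = \frac{1}{2}$
Phenotype A:
$\Pr[X=o, P=A] = \theta_o\theta_a$ ; $\Pr[X=o \mid P=A] = \frac{\theta_o}{\theta_a + 2\theta_o}$
$\Pr[X=a, P=A] = \theta_a(\theta_a + \theta_o)$ ; $\Pr[X=a \mid P=A] = \frac{\theta_a + \theta_o}{\theta_a + 2\theta_o}$
$\Pr[X=b, P=A] = 0$ ; $\Pr[X=b \mid P=A] = 0$
Phenotype B:
$\Pr[X=o, P=B] = \theta_o\theta_b$ ; $\Pr[X=o \mid P=B] = \frac{\theta_o}{\theta_b + 2\theta_o}$
$\Pr[X=a, P=B] = 0$ ; $\Pr[X=a \mid P=B] = 0$
$\Pr[X=b, P=B] = \theta_b(\theta_b + \theta_o)$ ; $\Pr[X=b \mid P=B] = \frac{\theta_b + \theta_o}{\theta_b + 2\theta_o}$
Phenotype O:
$\Pr[X=o, P=O] = \theta_o^2$ ; $\Pr[X=o \mid P=O] = 1$
$\Pr[X=a, P=O] = 0$ ; $\Pr[X=a \mid P=O] = 0$
$\Pr[X=b, P=O] = 0$ ; $\Pr[X=b \mid P=O] = 0$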
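The normalization step above can be written as a small function: build the joint probabilities for one phenotype, then divide by their sum. This is a minimal sketch; the name `allele_posterior` and the dictionary-based parameter format are assumptions made for illustration.

```python
def allele_posterior(phenotype, theta):
    """Pr[X = x | P = phenotype] for a single allele X.

    theta is a dict with keys "a", "b", "o" (allele probabilities, all > 0).
    """
    a, b, o = theta["a"], theta["b"], theta["o"]
    # Joint probabilities Pr[X = x, P = p]; the other allele is summed out.
    joint_table = {
        "A":  {"a": a * (a + o), "b": 0.0,         "o": o * a},
        "B":  {"a": 0.0,         "b": b * (b + o), "o": o * b},
        "AB": {"a": a * b,       "b": b * a,       "o": 0.0},
        "O":  {"a": 0.0,         "b": 0.0,         "o": o * o},
    }
    joint = joint_table[phenotype]
    total = sum(joint.values())  # equals Pr[P = phenotype]
    return {x: pr / total for x, pr in joint.items()}

post_A = allele_posterior("A", {"a": 0.2, "b": 0.2, "o": 0.6})
```

Each posterior sums to 1 over x, which is what lets the expected allele counts add up to the sample size n.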
9
EM algorithm – ABO example
Data
type #people
A 100
B 200
AB 50
O 50
$\Theta = \{\theta_a, \theta_b, \theta_o\}$ is the parameter set we need to estimate
• Our initial guess is $\Theta_0 = \{0.2, 0.2, 0.6\}$
10
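EM algorithm – ABO example
Data
type #people
A 100
B 200
AB 50
O 50
$\Theta_0 = \{0.2, 0.2, 0.6\}$
n=400 (data size)
E-step (1st iteration):
$E[\#(a)] = \sum_{i=1}^{n} \Pr[X_i = a \mid p_i] = \sum_{p \in \{A,B,AB,O\}} n_p \Pr[X = a \mid p]$
$= 100 \cdot \frac{\theta_a + \theta_o}{\theta_a + 2\theta_o} + 200 \cdot 0 + 50 \cdot \frac{1}{2} + 50 \cdot 0 = 82.14$
(the four terms correspond to phenotypes A, B, AB, O)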
11
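EM algorithm – ABO example
Data
type #people
A 100
B 200
AB 50
O 50
$\Theta_0 = \{0.2, 0.2, 0.6\}$
n=400 (data size)
E-step (1st iteration), with one term per phenotype in the order A, B, AB, O:
$E[\#(a)] = 100 \cdot \frac{\theta_a + \theta_o}{\theta_a + 2\theta_o} + 200 \cdot 0 + 50 \cdot \frac{1}{2} + 50 \cdot 0 = 82.14$
$E[\#(b)] = 100 \cdot 0 + 200 \cdot \frac{\theta_b + \theta_o}{\theta_b + 2\theta_o} + 50 \cdot \frac{1}{2} + 50 \cdot 0 = 139.29$
$E[\#(o)] = 100 \cdot \frac{\theta_o}{\theta_a + 2\theta_o} + 200 \cdot \frac{\theta_o}{\theta_b + 2\theta_o} + 50 \cdot 0 + 50 \cdot 1 = 178.57$
M-step (1st iteration):
$\theta'_a = \frac{82.14}{400} = 0.205$ ; $\theta'_b = \frac{139.29}{400} = 0.348$ ; $\theta'_o = \frac{178.57}{400} = 0.447$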
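The first iteration can be reproduced with a short script. This is a sketch under the slide's setup (counts 100/200/50/50, initial guess 0.2/0.2/0.6); the helper names are illustrative. With exact arithmetic the expected counts come out at about 82.14, 139.29 and 178.57, so third decimals of the normalized parameters may differ slightly from hand-rounded values.

```python
# Phenotype counts from the slide.
COUNTS = {"A": 100, "B": 200, "AB": 50, "O": 50}

def conditional(phenotype, a, b, o):
    """Pr[X = x | P = phenotype] for one allele X, in closed form."""
    if phenotype == "A":
        return {"a": (a + o) / (a + 2 * o), "b": 0.0, "o": o / (a + 2 * o)}
    if phenotype == "B":
        return {"a": 0.0, "b": (b + o) / (b + 2 * o), "o": o / (b + 2 * o)}
    if phenotype == "AB":
        return {"a": 0.5, "b": 0.5, "o": 0.0}
    return {"a": 0.0, "b": 0.0, "o": 1.0}  # phenotype O

def e_step(counts, a, b, o):
    """Expected allele counts E[#(x)] = sum over phenotypes of n_p * Pr[X=x | p]."""
    expected = {"a": 0.0, "b": 0.0, "o": 0.0}
    for phenotype, n_p in counts.items():
        for x, pr in conditional(phenotype, a, b, o).items():
            expected[x] += n_p * pr
    return expected

def m_step(expected):
    """New parameters: expected counts divided by their total (which equals n)."""
    n = sum(expected.values())
    return {x: c / n for x, c in expected.items()}

expected_counts = e_step(COUNTS, 0.2, 0.2, 0.6)
theta_1 = m_step(expected_counts)
```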
12
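EM algorithm – ABO example
Data
type #people
A 100
B 200
AB 50
O 50
$\Theta_1 = \{0.205, 0.348, 0.447\}$
n=400 (data size)
E-step (2nd iteration), with one term per phenotype in the order A, B, AB, O:
$E[\#(a)] = 100 \cdot \frac{\theta_a + \theta_o}{\theta_a + 2\theta_o} + 200 \cdot 0 + 50 \cdot \frac{1}{2} + 50 \cdot 0 = 84.33$
$E[\#(b)] = 100 \cdot 0 + 200 \cdot \frac{\theta_b + \theta_o}{\theta_b + 2\theta_o} + 50 \cdot \frac{1}{2} + 50 \cdot 0 = 153.02$
$E[\#(o)] = 100 \cdot \frac{\theta_o}{\theta_a + 2\theta_o} + 200 \cdot \frac{\theta_o}{\theta_b + 2\theta_o} + 50 \cdot 0 + 50 \cdot 1 = 162.65$
M-step (2nd iteration):
$\theta'_a = \frac{84.33}{400} = 0.211$ ; $\theta'_b = \frac{153.02}{400} = 0.383$ ; $\theta'_o = \frac{162.65}{400} = 0.406$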
13
EM algorithm – ABO example
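Data
type #people
A $n_A$
B $n_B$
AB $n_{AB}$
O $n_O$
E-step + M-step:
General update formula (with $n = n_A + n_B + n_{AB} + n_O$):
$\theta'_a = \frac{1}{n}\left[ n_A \frac{\theta_a + \theta_o}{\theta_a + 2\theta_o} + \frac{1}{2} n_{AB} \right]$
$\theta'_b = \frac{1}{n}\left[ n_B \frac{\theta_b + \theta_o}{\theta_b + 2\theta_o} + \frac{1}{2} n_{AB} \right]$
$\theta'_o = \frac{1}{n}\left[ n_A \frac{\theta_o}{\theta_a + 2\theta_o} + n_B \frac{\theta_o}{\theta_b + 2\theta_o} + n_O \right]$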
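The combined update can be iterated until the parameters stop changing. The sketch below assumes a simple stopping rule (largest parameter change below a tolerance); the tolerance, the iteration cap and the function names are illustrative choices rather than anything prescribed by the slides.

```python
def em_update(theta, n_A, n_B, n_AB, n_O):
    """One combined E-step + M-step using the general update formula."""
    a, b, o = theta
    n = n_A + n_B + n_AB + n_O
    new_a = (n_A * (a + o) / (a + 2 * o) + 0.5 * n_AB) / n
    new_b = (n_B * (b + o) / (b + 2 * o) + 0.5 * n_AB) / n
    new_o = (n_A * o / (a + 2 * o) + n_B * o / (b + 2 * o) + n_O) / n
    return (new_a, new_b, new_o)

def run_em(theta, counts, tol=1e-10, max_iter=1000):
    """Iterate em_update until the largest parameter change is below tol.

    Returns the final parameters and the number of updates performed.
    """
    for iteration in range(1, max_iter + 1):
        new_theta = em_update(theta, **counts)
        change = max(abs(x - y) for x, y in zip(new_theta, theta))
        theta = new_theta
        if change < tol:
            return theta, iteration
    return theta, max_iter

counts = {"n_A": 100, "n_B": 200, "n_AB": 50, "n_O": 50}
theta_hat, n_updates = run_em((0.2, 0.2, 0.6), counts)
```

Each update keeps the three parameters summing to 1, so no separate renormalization is needed.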
14
EM algorithm – ABO example
Data
type #people
A 100
B 200
AB 50
O 50
[Plot: estimates of $\theta_a$, $\theta_b$, $\theta_o$ against learning iteration; values marked on the plot: 0.20, 0.38, 0.42]
15
EM algorithm – ABO example
Data
type #people
A 100
B 200
AB 50
O 50
[Plot: estimates of $\theta_a$, $\theta_b$, $\theta_o$ against learning iteration; values marked on the plot: 0.20, 0.38, 0.42]
good convergence
16
Gene Counting
Current formulation:
hidden variables correspond to single allele genotype
Gene-counting:
hidden variables correspond to whole locus genotype
• If we know the number of locus genotypes: $n_{a/a}$, $n_{a/o}$, $n_{a/b}$, $n_{b/b}$, $n_{b/o}$, $n_{o/o}$,
we can estimate all parameters:
$\theta_a = \frac{2n_{a/a} + n_{a/o} + n_{a/b}}{2n}$
$\theta_b = \frac{2n_{b/b} + n_{b/o} + n_{a/b}}{2n}$
$\theta_o = \frac{2n_{o/o} + n_{a/o} + n_{b/o}}{2n}$
(here $n_{a/b} = n_{AB}$ and $n_{o/o} = n_O$ are observed directly)
• Instead, we estimate the number of such counts given some initial $\Theta$.
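When locus-genotype counts are known, the three estimates above amount to counting alleles: each individual contributes two. A minimal sketch follows; the genotype counts in the example are made up purely for illustration.

```python
def allele_freqs_from_genotype_counts(g):
    """Allele frequencies from locus-genotype counts (each person has 2 alleles).

    g maps "a/a", "a/o", "a/b", "b/b", "b/o", "o/o" to counts.
    """
    n = sum(g.values())  # number of individuals
    theta_a = (2 * g["a/a"] + g["a/o"] + g["a/b"]) / (2 * n)
    theta_b = (2 * g["b/b"] + g["b/o"] + g["a/b"]) / (2 * n)
    theta_o = (2 * g["o/o"] + g["a/o"] + g["b/o"]) / (2 * n)
    return theta_a, theta_b, theta_o

# Hypothetical genotype counts, not taken from the slides.
example_counts = {"a/a": 10, "a/o": 30, "a/b": 15, "b/b": 20, "b/o": 40, "o/o": 25}
freqs = allele_freqs_from_genotype_counts(example_counts)
```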
17
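• E-step: compute $\Pr[X_i = x, p_i]$, then normalize to get $\Pr[X_i = x \mid p_i]$
E-step calculations (X: hidden locus genotype; P: observed phenotype)
$\Pr[X=a/b, P=AB] = 2\theta_a\theta_b$ ; $\Pr[X=a/b \mid P=AB] = 1$
$\Pr[X=o/o, P=O] = \theta_o^2$ ; $\Pr[X=o/o \mid P=O] = 1$
$\Pr[X=a/o, P=A] = 2\theta_o\theta_a$ ; $\Pr[X=a/o \mid P=A] = \frac{2\theta_o}{\theta_a + 2\theta_o}$
$\Pr[X=a/a, P=A] = \theta_a^2$ ; $\Pr[X=a/a \mid P=A] = \frac{\theta_a}{\theta_a + 2\theta_o}$
$\Pr[X=b/o, P=B] = 2\theta_o\theta_b$ ; $\Pr[X=b/o \mid P=B] = \frac{2\theta_o}{\theta_b + 2\theta_o}$
$\Pr[X=b/b, P=B] = \theta_b^2$ ; $\Pr[X=b/b \mid P=B] = \frac{\theta_b}{\theta_b + 2\theta_o}$
Expected genotype counts:
$n_{a/a} = n_A \frac{\theta_a}{\theta_a + 2\theta_o}$ ; $n_{a/o} = n_A \frac{2\theta_o}{\theta_a + 2\theta_o}$ ; $n_{b/b} = n_B \frac{\theta_b}{\theta_b + 2\theta_o}$ ; $n_{b/o} = n_B \frac{2\theta_o}{\theta_b + 2\theta_o}$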
18
Gene Counting
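EM algorithm for ABO:
E-step:
$n_{a/a} = n_A \frac{\theta_a}{\theta_a + 2\theta_o}$ ; $n_{a/o} = n_A \frac{2\theta_o}{\theta_a + 2\theta_o}$ ; $n_{b/b} = n_B \frac{\theta_b}{\theta_b + 2\theta_o}$ ; $n_{b/o} = n_B \frac{2\theta_o}{\theta_b + 2\theta_o}$
M-step:
$\theta'_a = \frac{2n_{a/a} + n_{a/o} + n_{AB}}{2n}$
$\theta'_b = \frac{2n_{b/b} + n_{b/o} + n_{AB}}{2n}$
$\theta'_o = \frac{n_{a/o} + n_{b/o} + 2n_O}{2n}$
Substituting the E-step counts into the M-step gives:
$\theta'_a = \frac{1}{n}\left[ n_A \frac{\theta_a + \theta_o}{\theta_a + 2\theta_o} + \frac{1}{2} n_{AB} \right]$
$\theta'_b = \frac{1}{n}\left[ n_B \frac{\theta_b + \theta_o}{\theta_b + 2\theta_o} + \frac{1}{2} n_{AB} \right]$
$\theta'_o = \frac{1}{n}\left[ n_A \frac{\theta_o}{\theta_a + 2\theta_o} + n_B \frac{\theta_o}{\theta_b + 2\theta_o} + n_O \right]$
Same as slide 13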
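A single gene-counting iteration can be sketched as follows: split the A and B phenotype counts into expected locus-genotype counts, then count alleles. The function name and argument order are illustrative; the example reuses the counts 100/200/50/50 and the initial guess 0.2/0.2/0.6 from the earlier slides.

```python
def gene_counting_step(theta, n_A, n_B, n_AB, n_O):
    """One EM iteration with whole-locus genotypes as hidden variables."""
    a, b, o = theta
    n = n_A + n_B + n_AB + n_O

    # E-step: expected locus-genotype counts given the current parameters.
    n_aa = n_A * a / (a + 2 * o)
    n_ao = n_A * 2 * o / (a + 2 * o)
    n_bb = n_B * b / (b + 2 * o)
    n_bo = n_B * 2 * o / (b + 2 * o)

    # M-step: count alleles; every individual carries two of them.
    new_a = (2 * n_aa + n_ao + n_AB) / (2 * n)
    new_b = (2 * n_bb + n_bo + n_AB) / (2 * n)
    new_o = (n_ao + n_bo + 2 * n_O) / (2 * n)
    return new_a, new_b, new_o

step_1 = gene_counting_step((0.2, 0.2, 0.6), 100, 200, 50, 50)
```

The result agrees with the allele-level update, which is the point of the slide: both formulations lead to the same iteration.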