TRANSCRIPT
An animal breeders introduction to HMM
www.alphagenes.roslin.ed.ac.uk · @hickeyjohn
Relationships between haplotypes
Hidden Markov models
• Let's think of genetics
– Alleles (A=0, a=1)
– Genotype data (0, 1, or 2)
10100111011100111001110011
11110222111111111111121021
01010111100011000110011010
The phasing problem – split diplotype into a pair of haplotypes
The founder haplotype mosaic problem
Missing pedigree
• One parent may be known and genotyped
– Use heuristics
• The other parent may not be
– Use an HMM, as it is pedigree-free
• The logic can be extended to missing grandparents
We start with a simple HMM
• HMM – Hidden Markov Model
– Hidden = it models processes that are not directly observed
– Markov = given time n-1, what exists at time n is conditionally independent of everything that went on before time n-1
– Model = we can make predictions on the basis of the model
• Genetics is a perfect fit
– Gametes are hidden
– Markers are linked to the ones that went before
– Lots of things to predict (e.g. missing markers, QTL)
HMM
• In genetics
– Gary Churchill
– Beagle
– fastPHASE
– Impute2
– AlphaPhase (combined with heuristics)
– AlphaImpute (combined with heuristics)
– MaCH
• In other fields
– Speech recognition
– Weather
– Detecting cheating casinos
In speech recognition
Real improvements in accuracy of modeling weather
HMM - Weather example
• I am locked in an office without any windows
• I want to predict what the weather is each day
• Each day my office mate “Andreas” comes in
– But we don't talk
• Can I extract information from the behavior of Andreas?
– Andreas likes ice-cream
– He eats a different number of ice-creams each day
– Could I use that to predict the weather outside?
Reality underlying the data
• Data
– For 30 days I record the number of ice-creams Andreas eats
• “Biological” knowledge
– Just two weather states (Sunny or Cloudy)
– Weather today is correlated with weather tomorrow
– The correlation dissipates over time
– Ice-cream intake is correlated with weather
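The setup above is a generative model, and it helps to see it written as one. A minimal sketch, using the transition matrix and the starting emission values quoted later in the talk; the function and variable names are mine, not from any of the software mentioned:

```python
import random

# Generative sketch of the weather/ice-cream HMM. Transition and emission
# values are the ones quoted later in the talk; names are illustrative.
TRANSITION = {"C": {"C": 0.9, "S": 0.1},    # row = weather today, col = tomorrow
              "S": {"C": 0.7, "S": 0.3}}
EMISSION = {"C": {1: 0.7, 2: 0.2, 3: 0.1},  # P(ice-creams eaten | weather)
            "S": {1: 0.1, 2: 0.2, 3: 0.7}}

def simulate(days, seed=1):
    """Draw a hidden weather sequence and the observed ice-cream counts."""
    rng = random.Random(seed)
    state = rng.choice(["C", "S"])           # uniform initial state
    weather, ice_creams = [], []
    for _ in range(days):
        weather.append(state)
        counts = list(EMISSION[state])
        ice_creams.append(rng.choices(
            counts, weights=[EMISSION[state][c] for c in counts])[0])
        nxt = list(TRANSITION[state])
        state = rng.choices(
            nxt, weights=[TRANSITION[state][s] for s in nxt])[0]
    return weather, ice_creams

weather, ice_creams = simulate(30)
```

The HMM's whole job is to run this machine in reverse: given only `ice_creams`, recover something about `weather`.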
Weather example
• A hidden Markov process
• Markov – the state at time n is conditionally dependent only on the state at time n-1
• Hidden
– We see the ice-cream, but we are really modeling the weather
Discrete Markov process
• A system which may be described at any time as being in one of K distinct states (k1, k2, …, kK)
• At regularly spaced discrete times, the system changes state according to a set of probabilities
• Time = n
• Actual state at time n = xn
• Transition probabilities = aij
Weather example
• States
– kC = Cloudy
– kS = Sunny
• Time = days
• Transition matrix A (rows = today, columns = tomorrow):

        Cloudy  Sunny
Cloudy    0.9     0.1
Sunny     0.7     0.3
p(xn = ki | xn−1 = kj, xn−2 = kh, …)
  = p(xn = ki | xn−1 = kj)
  = ai,j
Observable Markov process
• We have been describing an observable Markov process
• Output of the process is a set of states at each instant of time
• Each state relates directly to a physical (observable) event– Andreas eats x ice-creams each day
We can enumerate
• For 7 consecutive days the weather is:
– S, C, S, C, C, C, S (S = Sunny = kS; C = Cloudy = kC)
– More formally:
• O = observation sequence
• O = {kS, kC, kS, kC, kC, kC, kS}
• What is the probability of this happening?
– Pr(O | Model) = Pr(kS, kC, kS, kC, kC, kC, kS | Model)
• Take our transition matrix
• Initial state probabilities π = (kS = 0.5, kC = 0.5)
• πS × aS→C × aC→S × aS→C × aC→C × aC→C × aC→S
• 0.5 × 0.7 × 0.1 × 0.7 × 0.9 × 0.9 × 0.1 = 0.00198
• The probability of any single realization is low (imagine how low it would be with 1 million SNP), but we are concerned only with relative probabilities
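The enumeration above translates directly into code. A small sketch (names mine) that reproduces the 0.00198 figure:

```python
# Probability of an observed weather sequence under the transition matrix
# and a uniform initial state, exactly as enumerated on the slide.
A = {("C", "C"): 0.9, ("C", "S"): 0.1,   # (today, tomorrow) -> probability
     ("S", "C"): 0.7, ("S", "S"): 0.3}
pi = {"C": 0.5, "S": 0.5}

def sequence_probability(states):
    """P(state sequence) = pi(first state) * product of transitions."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[(prev, cur)]
    return p

print(round(sequence_probability("SCSCCCS"), 5))  # -> 0.00198
```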
        Cloudy  Sunny
Cloudy    0.9     0.1
Sunny     0.7     0.3
Hidden Markov models
• Rather than observing each state directly
• Extend so that each observed state is a probabilistic function of an unobserved event
– i.e. Andreas eats ice-cream
– But not perfectly correlated with the weather
– Latent variables versus observed variables
Emission probabilities
Transition probabilities
Hidden Markov models
• For 7 consecutive days I observe Andreas eating:
– 3, 1, 3, 1, 2, 1, 3 (the number of ice-creams eaten each day)
– More formally:
• O = observation sequence
• O = {x1=3, x2=1, x3=3, x4=1, x5=2, x6=1, x7=3}
• These are called the observed variables
• The weather is:
– S, C, S, C, C, C, S (S = Sunny = k1; C = Cloudy = k2)
– More formally:
• St = state sequence, the latent variables
• St = {z1=S, z2=C, z3=S, z4=C, z5=C, z6=C, z7=S}
• Emission probabilities
– How many ice-creams Cloudy emits
– How many ice-creams Sunny emits
Let's show this visually
Genetic interpretation
Elements of a HMM
• K – the number of states in the model
• M – the number of distinct observation symbols per state
– Can be discrete (e.g. 1, 2, or 3 ice-creams)
• A – the transition probabilities
• Φ – parameters controlling the distribution of the emission probabilities
– Emission probabilities = P(xn | zn, Φ)
– i.e. the probability of emitting a certain symbol at time n, given the model is in state kj
• π - the initial state probabilities
• λ = (A, Φ, π)
3 aspects to be solved
• Given the observation sequence (O = O1, O2, …, OT) and the model (λ = (A, Φ, π))
– How do we efficiently compute P(O|λ), the probability of the observation sequence given the model?
– Solved via the Forward-Backward algorithm
• Given the observation sequence (O = O1, O2, …, OT) and the model (λ = (A, Φ, π))
– How do we choose a corresponding state sequence (Q = q1, q2, …, qT) which is optimal in some meaningful way?
– Solved via the Viterbi algorithm
• How do we adjust the model parameters (λ = (A, Φ, π)) to maximize P(O|λ)?
– Solved via the Baum-Welch algorithm
– (An instance of the EM algorithm)
Unfold A to get Trellis
Trellis structure gives computational efficiency
Forward-Backward Algorithm
[Trellis diagram: forward probabilities α computed left to right, backward probabilities β computed right to left, combined with the emission probabilities and the observed data]
Compute P(O|λ)
• Naive approach: enumerate every possible state sequence of length n
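Enumerating all K^T state sequences is exponential; the forward recursion on the trellis computes the same quantity in O(T·K²). A sketch using the talk's weather transition matrix and starting emission values (function names are mine):

```python
# Forward algorithm sketch: computes P(O | lambda) by a left-to-right
# recursion over the trellis instead of enumerating all state sequences.
STATES = ("C", "S")
A = {"C": {"C": 0.9, "S": 0.1}, "S": {"C": 0.7, "S": 0.3}}
B = {"C": {1: 0.7, 2: 0.2, 3: 0.1}, "S": {1: 0.1, 2: 0.2, 3: 0.7}}
pi = {"C": 0.5, "S": 0.5}

def forward(obs):
    """alpha[t][s] = P(o_1..o_t, state at time t = s)."""
    alpha = [{s: pi[s] * B[s][obs[0]] for s in STATES}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: B[s][o] * sum(prev[r] * A[r][s] for r in STATES)
                      for s in STATES})
    return alpha

def likelihood(obs):
    """P(O | lambda) = sum of the final forward column."""
    return sum(forward(obs)[-1].values())
```

Summing the final trellis column gives exactly the same number as brute-force enumeration; the backward pass reuses the same idea from the other end of the sequence.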
Viterbi Algorithm
• The Forward algorithm sums over all paths
• The Viterbi algorithm finds the most likely path
– The most likely path is the shortest route through the trellis (the smallest sum, in negative-log terms)
– I am not a fan of Viterbi; I prefer to use all paths weighted by their probability
– In genetics, Viterbi gives the most likely haplotype
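A Viterbi sketch for the same toy model: a max over paths where the Forward algorithm takes a sum. Parameter values are the talk's weather matrix and starting emission values; the names are mine:

```python
# Viterbi sketch: the single most likely hidden weather path for an
# observed ice-cream sequence.
STATES = ("C", "S")
A = {"C": {"C": 0.9, "S": 0.1}, "S": {"C": 0.7, "S": 0.3}}
B = {"C": {1: 0.7, 2: 0.2, 3: 0.1}, "S": {1: 0.1, 2: 0.2, 3: 0.7}}
pi = {"C": 0.5, "S": 0.5}

def viterbi(obs):
    """Return the most likely hidden state sequence for obs."""
    # delta[s] = probability of the best path so far ending in state s.
    delta = {s: pi[s] * B[s][obs[0]] for s in STATES}
    back = []                                # back-pointers per time step
    for o in obs[1:]:
        ptr, new = {}, {}
        for s in STATES:
            prev, p = max(((r, delta[r] * A[r][s]) for r in STATES),
                          key=lambda rp: rp[1])
            ptr[s], new[s] = prev, p * B[s][o]
        back.append(ptr)
        delta = new
    # Trace back from the best final state.
    state = max(delta, key=delta.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

path = viterbi([3, 1, 3, 1, 2, 1, 3])
```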
Starting values
Emission probabilities:

          1 ice-cream  2 ice-creams  3 ice-creams
Cloudy        0.7           0.2           0.1
Sunny         0.1           0.2           0.7

Transition matrix:

        Cloudy  Sunny
Cloudy   0.49    0.51
Sunny    0.51    0.49
Back to the ice-cream
• Data: for 30 days, the number of ice-creams Andreas eats each day

Day:        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Ice-creams: 2  3  3  2  3  2  3  2  2  3  1  3  3  1  1  1  2  1  1  1  2  1  1  1  2  3  3  2  3  2
Results

Day  Ice-creams  ProbC  ProbS
  1      2        0.06   0.94
  2      3        0.00   1.00
  3      3        0.00   1.00
  4      2        0.00   1.00
  5      3        0.00   1.00
  6      2        0.00   1.00
  7      3        0.00   1.00
  8      2        0.01   0.99
  9      2        0.01   0.99
 10      3        0.00   1.00
 11      1        0.10   0.90
 12      3        0.00   1.00
 13      3        0.00   1.00
 14      1        0.92   0.08
 15      1        0.99   0.01
 16      1        1.00   0.00
 17      2        0.98   0.02
 18      1        1.00   0.00
 19      1        1.00   0.00
 20      1        1.00   0.00
 21      2        0.98   0.02
 22      1        1.00   0.00
 23      1        0.99   0.01
 24      1        0.95   0.05
 25      2        0.33   0.67
 26      3        0.00   1.00
 27      3        0.00   1.00
 28      2        0.00   1.00
 29      3        0.00   1.00
 30      2        0.04   0.96
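The ProbC/ProbS columns are posterior state probabilities, i.e. P(weather on day t | all 30 observations), obtained from the forward-backward algorithm. A sketch of that posterior decoding step, using the fitted parameters reported in the talk and an assumed uniform initial-state distribution (so the exact numbers depend on which parameter set is plugged in):

```python
# Posterior decoding sketch: combine forward (alpha) and backward (beta)
# probabilities to get per-day weather posteriors.
STATES = ("C", "S")
A = {"C": {"C": 0.89, "S": 0.11}, "S": {"C": 0.07, "S": 0.93}}
B = {"C": {1: 0.79, 2: 0.21, 3: 0.00}, "S": {1: 0.06, 2: 0.37, 3: 0.57}}
pi = {"C": 0.5, "S": 0.5}                       # assumed uniform
obs = [2, 3, 3, 2, 3, 2, 3, 2, 2, 3, 1, 3, 3, 1, 1, 1, 2, 1, 1, 1,
       2, 1, 1, 1, 2, 3, 3, 2, 3, 2]

def posteriors(obs):
    # Forward pass: alpha[t][s] = P(o_1..o_t, state t = s).
    alpha = [{s: pi[s] * B[s][obs[0]] for s in STATES}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: B[s][o] * sum(prev[r] * A[r][s] for r in STATES)
                      for s in STATES})
    # Backward pass: beta[t][s] = P(o_{t+1}..o_T | state t = s).
    beta = [{s: 1.0 for s in STATES}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(A[s][r] * B[r][o] * nxt[r] for r in STATES)
                        for s in STATES})
    like = sum(alpha[-1].values())
    # Posterior: alpha * beta, normalised by the total likelihood.
    return [{s: alpha[t][s] * beta[t][s] / like for s in STATES}
            for t in range(len(obs))]

post = posteriors(obs)
```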
Parameters
Emission probabilities:

          1 ice-cream  2 ice-creams  3 ice-creams
Cloudy        0.79          0.21          0.00
Sunny         0.06          0.37          0.57

Transition matrix:

        Cloudy  Sunny
Cloudy   0.89    0.11
Sunny    0.07    0.93
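The fitted values above come from re-estimating the parameters against the 30-day record with the Baum-Welch (EM) algorithm mentioned earlier. A compact didactic sketch (no log-space arithmetic, a fixed iteration count, and an assumed uniform starting π); the values it converges to will be close to, but not necessarily identical to, the slide's, since they depend on the starting point and stopping rule:

```python
# Baum-Welch (EM) sketch for the two-state ice-cream model.
STATES = ("C", "S")
obs = [2, 3, 3, 2, 3, 2, 3, 2, 2, 3, 1, 3, 3, 1, 1, 1, 2, 1, 1, 1,
       2, 1, 1, 1, 2, 3, 3, 2, 3, 2]

# Starting values from the talk.
A = {"C": {"C": 0.49, "S": 0.51}, "S": {"C": 0.51, "S": 0.49}}
B = {"C": {1: 0.7, 2: 0.2, 3: 0.1}, "S": {1: 0.1, 2: 0.2, 3: 0.7}}
pi = {"C": 0.5, "S": 0.5}

def forward_backward():
    alpha = [{s: pi[s] * B[s][obs[0]] for s in STATES}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: B[s][o] * sum(prev[r] * A[r][s] for r in STATES)
                      for s in STATES})
    beta = [{s: 1.0 for s in STATES}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(A[s][r] * B[r][o] * nxt[r] for r in STATES)
                        for s in STATES})
    return alpha, beta, sum(alpha[-1].values())

for _ in range(50):                              # EM iterations
    alpha, beta, like = forward_backward()
    # E-step: state posteriors (gamma) and transition posteriors (xi).
    gamma = [{s: alpha[t][s] * beta[t][s] / like for s in STATES}
             for t in range(len(obs))]
    xi = [{(i, j): alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                   * beta[t + 1][j] / like
           for i in STATES for j in STATES} for t in range(len(obs) - 1)]
    # M-step: normalised expected counts.
    A = {i: {j: sum(x[(i, j)] for x in xi) / sum(g[i] for g in gamma[:-1])
             for j in STATES} for i in STATES}
    B = {s: {k: sum(g[s] for g, o in zip(gamma, obs) if o == k) /
                sum(g[s] for g in gamma) for k in (1, 2, 3)} for s in STATES}
    pi = dict(gamma[0])
```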
How is this imputation?
• The same 30-day record and posterior probabilities as in the Results table, but with the ice-cream count on day 15 masked (#):

Day  Ice-creams  ProbC  ProbS
 14      1        0.92   0.08
 15      #        0.99   0.01
 16      1        1.00   0.00
(all other days unchanged)
Imputation
Emission probabilities (fitted):

          1 ice-cream  2 ice-creams  3 ice-creams
Cloudy        0.79          0.21          0.00
Sunny         0.06          0.37          0.57

Day  ProbC  ProbS
 15   0.99   0.01

Expected number of ice-creams Andreas eats on day 15:

0.99 × [(0.79×1) + (0.21×2) + (0.00×3)] + 0.01 × [(0.06×1) + (0.37×2) + (0.57×3)] = 1.22

True value = 1 ice-cream on day 15
• We predict on the basis of the hidden states
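The day-15 arithmetic above, as code: weight each weather state's expected ice-cream count by its posterior probability.

```python
# Imputing day 15: posterior-weighted expectation over the two weather
# states, using the fitted emission table from the talk.
prob = {"C": 0.99, "S": 0.01}                   # posterior for day 15
emission = {"C": {1: 0.79, 2: 0.21, 3: 0.00},
            "S": {1: 0.06, 2: 0.37, 3: 0.57}}

expected = sum(prob[s] * sum(n * p for n, p in emission[s].items())
               for s in ("C", "S"))
print(round(expected, 2))  # -> 1.22
```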
HMM for Genetics
• fastPHASE
• Beagle
• MaCH
• Impute2
• AlphaPhase (combined with heuristics)
• AlphaImpute (combined with heuristics)
• Haplotyping and imputation
fastPHASE
• The model
– Haploid gametes underlie diploid genotypes
• Hidden Markov process
– Haploid gametes in the present population derive from ancient founder haplotypes
• Hidden process
– Founder haplotypes can be considered to be cluster medoids
– fastPHASE can be considered analogous to a mixture model
Now put in the genetics
– Alleles (A=0, a=1)
– Genotype data (0, 1, or 2)
10100111011100111001110011
11110222111111111111121021
01010111100011000110011010
The phasing problem – split diplotype into a pair of haplotypes
The founder haplotype mosaic problem
The genetics
• Alleles are correlated along the haploid gametes
– Closer alleles are more correlated
• fastPHASE is an IBD probability model
– Each allele of each gamete has a probability of deriving from each founder haplotype
• With this information we can do lots of things
– Phase
– Impute
– Build genomic relationship matrices
30 animal example
          M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
Founder1   0  0  0  0  0  0  0  0  0   0
Founder2   1  1  1  1  1  1  1  1  1   1
Founder3   1  0  1  0  1  0  1  0  1   0
Founder4   0  1  0  1  0  1  0  1  0   1
     M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
1     1  1  2  2  2  2  2  2  2   1
2     0  0  1  1  1  1  1  1  1   0
3     0  0  0  0  0  0  0  0  0   0
4     0  1  0  1  0  1  0  0  0   0
5     0  2  0  2  0  2  0  1  0   1
6     2  2  2  2  2  2  2  2  1   1
7     0  1  0  1  0  1  0  1  0   1
8     0  2  0  2  0  2  0  1  0   0
9     0  2  0  1  0  1  0  0  0   0
10    0  1  1  1  1  1  1  1  1   0
11    0  1  0  1  0  1  0  1  0   1
12    0  1  0  1  0  1  0  0  0   0
13    1  1  1  1  1  1  1  1  1   1
14    0  1  0  2  0  1  0  0  0   0
15    0  1  0  1  0  1  0  0  0   0
16    0  0  0  0  0  0  0  0  0   0
17    1  1  1  1  1  1  1  1  0   0
18    0  0  0  0  0  0  0  0  0   0
19    1  2  1  2  1  2  1  2  1   1
20    0  1  0  1  0  1  0  1  0   1
21    1  2  1  2  1  2  1  2  0   1
22    0  0  0  1  0  1  0  1  0   1
23    1  2  1  2  1  2  0  1  0   0
24    0  1  0  0  0  2  1  1  0   0
25    2  2  2  2  2  2  1  2  1   2
26    1  1  1  1  1  1  1  1  0   0
27    0  1  0  1  0  1  1  2  0   1
28    0  0  0  0  0  0  0  0  0   0
29    0  0  0  0  0  0  0  1  0   1
30    0  2  0  2  0  2  0  1  0   0
    M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
1    0  0  1  1  1  1  1  1  1   1
1    1  1  1  1  1  1  1  1  1   0
2    0  0  0  0  0  0  0  0  0   0
2    0  0  1  1  1  1  1  1  1   0
3    0  0  0  0  0  0  0  0  0   0
3    0  0  0  0  0  0  0  0  0   0
4    0  1  0  1  0  1  0  0  0   0
4    0  0  0  0  0  0  0  0  0   0
The parameters
• Slightly different from a standard HMM: α, r, Θ
• α and r are partitions of the transition matrix
– r = the recombination rate between markers
• i.e. the probability of a transition happening
– α = the frequency of each founder haplotype at each marker
• i.e. given there is a transition, where do I transition to
• Θ = the emission probabilities
– The frequency of allele 1 at position k in founder haplotype j
– (i.e. the probability of emitting a 1 as opposed to a 0)
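The α/r partition described above can be sketched as a stay-or-jump transition structure: between adjacent markers a gamete stays on its current founder haplotype with probability (1 − r), and with probability r it recombines, landing on founder j with frequency α_j. (The exact fastPHASE parameterisation differs in detail; this only illustrates the structure, with names of mine.)

```python
# Build the founder-to-founder transition matrix between two adjacent
# markers from a recombination rate r and founder frequencies alpha.
def transition_matrix(r, alpha):
    """a[i][j] = P(founder j at the next marker | founder i at this one)."""
    K = len(alpha)
    return [[(1 - r) * (i == j) + r * alpha[j] for j in range(K)]
            for i in range(K)]

# Example: the marker-1 alpha column and the r between markers 1 and 2
# from the talk's parameter tables.
A12 = transition_matrix(0.0004, [0.24, 0.25, 0.28, 0.23])
```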
HMM parameters
Theta (one row per founder haplotype, one column per marker):
0.00 0.80 0.00 0.83 0.00 0.80 0.17 0.94 0.00 0.70
0.79 0.86 1.00 1.00 1.00 1.00 0.86 1.00 0.57 0.50
0.00 0.04 0.00 0.04 0.00 0.07 0.00 0.04 0.00 0.00
0.00 0.99 0.00 0.96 0.00 1.00 0.03 0.24 0.00 0.01

Alpha (founder haplotype frequencies, one row per founder, one column per marker):
0.24 0.01 0.00 0.00 0.14 0.18 0.88 0.90 0.07 0.06
0.25 0.00 0.09 0.00 0.00 0.00 0.00 0.04 0.00 0.11
0.28 0.97 0.90 0.99 0.61 0.10 0.02 0.01 0.27 0.26
0.23 0.02 0.01 0.01 0.25 0.72 0.10 0.05 0.66 0.58

R (recombination rates between the 9 pairs of adjacent markers):
0.0004 0.0004 0.0005 0.0001 0.0004 0.0017 0.0035 0.0004 0.0002
Ancestral haplotypes
Theta (each estimated row labelled with the matching true founder haplotype):
F4  0.00 0.80 0.00 0.83 0.00 0.80 0.17 0.94 0.00 0.70
F2  0.79 0.86 1.00 1.00 1.00 1.00 0.86 1.00 0.57 0.50
F1  0.00 0.04 0.00 0.04 0.00 0.07 0.00 0.04 0.00 0.00
F4  0.00 0.99 0.00 0.96 0.00 1.00 0.03 0.24 0.00 0.01
M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
Founder1 0 0 0 0 0 0 0 0 0 0
Founder2 1 1 1 1 1 1 1 1 1 1
Founder3 1 0 1 0 1 0 1 0 1 0
Founder4 0 1 0 1 0 1 0 1 0 1
Impute missing marker
• Combine the output probabilities with the parameters of the model
– i.e. with Theta, the emission probabilities
Founder probabilities for gamete 1 of individual 29 (columns: ID, founder, then one probability per marker):

29 1  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
29 2  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
29 3  1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
29 4  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Founder probabilities for gamete 2 of individual 29:

29 1  0.79 0.79 0.79 0.79 0.79 0.79 0.85 0.98 0.98 0.98
29 2  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01
29 3  0.21 0.21 0.21 0.21 0.21 0.21 0.15 0.01 0.01 0.01
29 4  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
F4 0.00 0.80 0.00 0.83 0.00 0.80 0.17 0.94 0.00 0.70F2 0.79 0.86 1.00 1.00 1.00 1.00 0.86 1.00 0.57 0.50F1 0.00 0.04 0.00 0.04 0.00 0.07 0.00 0.04 0.00 0.00F4 0.00 0.99 0.00 0.96 0.00 1.00 0.03 0.24 0.00 0.01
29  0 0 0 0 0 0 0 1 0 1
• Genotype of individual 29 at marker 1
• Gamete 1 comes from founder 3
• Gamete 2 is a combination of founders 1 and 3
• Founders 1 and 3 emit a 0
• True genotype is a 0
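The imputation step just described can be sketched as code: the expected allele at marker 1 for individual 29 is each gamete's founder-probability-weighted sum of Theta (the probability of emitting allele 1), summed over the two gametes. The Theta row order is assumed here to follow founders 1-4; the variable names are mine.

```python
# Impute marker 1 of individual 29 from founder probabilities and Theta.
theta_m1 = [0.00, 0.79, 0.00, 0.00]   # P(emit allele 1 | founder), marker 1
gamete1  = [0.00, 0.00, 1.00, 0.00]   # founder probabilities, gamete 1
gamete2  = [0.79, 0.00, 0.21, 0.00]   # founder probabilities, gamete 2

dosage = (sum(p * t for p, t in zip(gamete1, theta_m1)) +
          sum(p * t for p, t in zip(gamete2, theta_m1)))
print(dosage)  # -> 0.0, matching the true genotype of 0
```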
The HMM of fastPHASE in a nutshell
[Figure key: column index, ID, paternal gamete, maternal gamete, probabilities for marker 1]
For individual 29 it is highly probable that its two gametes derive from founder haplotype 2