TRANSCRIPT
An animal breeders introduction to HMM
www.alphagenes.roslin.ed.ac.uk · @hickeyjohn
Relationships between haplotypes
Hidden Markov models
• Let's think of genetics
– Alleles (A=0, a=1)
– Genotype data (0, 1, or 2)
10100111011100111001110011
11110222111111111111121021
01010111100011000110011010
The phasing problem – split diplotype into a pair of haplotypes
The founder haplotype mosaic problem
Missing pedigree
• One parent may be known and genotyped
– Use heuristics
• The other parent may not be
– Use an HMM, as it is pedigree-free
• The logic can be extended to missing grandparents
We start with a simple HMM
• HMM – Hidden Markov Model
– Hidden = it models processes that are not directly observed
– Markov = given time n-1, what exists at time n is conditionally independent of everything that went on before time n-1
– Model = we can make predictions on the basis of the model
• Genetics is a perfect fit
– Gametes are hidden
– Markers are linked to the ones that went before
– Lots of things to predict (e.g. missing markers, QTL)
HMM
• In genetics
– Gary Churchill
– Beagle
– fastPHASE
– Impute2
– AlphaPhase (combined with heuristics)
– AlphaImpute (combined with heuristics)
– MaCH
• In other fields
– Speech recognition
– Weather
– Detecting cheating casinos
In speech recognition
Real improvements in accuracy of modeling weather
HMM - Weather example
• I am locked in an office without any windows
• I want to predict what the weather is each day
• Each day my office mate “Andreas” comes in
– But we don't talk
• Can I extract information from the behavior of Andreas?
– Andreas likes ice-cream
– He eats a different number of ice-creams each day
– Could I use that to predict the weather outside?
Reality underlying the data
• Data
– For 30 days I record the number of ice-creams Andreas eats
• “Biological” knowledge
– Just two weather states (Sunny or Cloudy)
– Weather today is correlated with weather tomorrow
– The correlation dissipates over time
– Ice-cream intake is correlated with weather
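The setup above is a generative model, and it helps to see it written as one. A minimal sketch, using the transition matrix and the starting emission values quoted later in the talk; the function and variable names are mine, not from any of the software mentioned:

```python
import random

# Generative sketch of the weather/ice-cream HMM. Transition and emission
# values are the ones quoted later in the talk; names are illustrative.
TRANSITION = {"C": {"C": 0.9, "S": 0.1},    # row = weather today, col = tomorrow
              "S": {"C": 0.7, "S": 0.3}}
EMISSION = {"C": {1: 0.7, 2: 0.2, 3: 0.1},  # P(ice-creams eaten | weather)
            "S": {1: 0.1, 2: 0.2, 3: 0.7}}

def simulate(days, seed=1):
    """Draw a hidden weather sequence and the observed ice-cream counts."""
    rng = random.Random(seed)
    state = rng.choice(["C", "S"])           # uniform initial state
    weather, ice_creams = [], []
    for _ in range(days):
        weather.append(state)
        counts = list(EMISSION[state])
        ice_creams.append(rng.choices(
            counts, weights=[EMISSION[state][c] for c in counts])[0])
        nxt = list(TRANSITION[state])
        state = rng.choices(
            nxt, weights=[TRANSITION[state][s] for s in nxt])[0]
    return weather, ice_creams

weather, ice_creams = simulate(30)
```

The HMM's whole job is to run this machine in reverse: given only `ice_creams`, recover something about `weather`.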
Weather example
• A hidden Markov process
• Markov – the state at time n is conditionally dependent only on the state at time n-1
• Hidden
– We see the ice-cream, but we are really modeling the weather
Discrete Markov process
• A system which may be described at any time as being in one of K distinct states (k1, k2, …, kK)
• At regularly spaced discrete times, the system changes state according to a set of probabilities
• Time = n
• Actual state at time n = xn
• Transition probabilities = aij
Weather example
• States
– kC = Cloudy
– kS = Sunny
• Time = days
• Transition matrix A (rows = today, columns = tomorrow):

        Cloudy  Sunny
Cloudy    0.9     0.1
Sunny     0.7     0.3
p(xn = ki | xn−1 = kj, xn−2 = kh, …)
  = p(xn = ki | xn−1 = kj)
  = ai,j
Observable Markov process
• We have been describing an observable Markov process
• Output of the process is a set of states at each instant of time
• Each state relates directly to a physical (observable) event– Andreas eats x ice-creams each day
We can enumerate
• For 7 consecutive days the weather is:
– S, C, S, C, C, C, S (S = Sunny = kS; C = Cloudy = kC)
– More formally:
• O = observation sequence
• O = {kS, kC, kS, kC, kC, kC, kS}
• What is the probability of this happening?
– Pr(O | Model) = Pr(kS, kC, kS, kC, kC, kC, kS | Model)
• Take our transition matrix
• Initial state probabilities π = (kS = 0.5, kC = 0.5)
• πS × aS→C × aC→S × aS→C × aC→C × aC→C × aC→S
• 0.5 × 0.7 × 0.1 × 0.7 × 0.9 × 0.9 × 0.1 = 0.00198
• The probability of any single realization is low (imagine how low it would be with 1 million SNP), but we are concerned only with relative probabilities
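The enumeration above translates directly into code. A small sketch (names mine) that reproduces the 0.00198 figure:

```python
# Probability of an observed weather sequence under the transition matrix
# and a uniform initial state, exactly as enumerated on the slide.
A = {("C", "C"): 0.9, ("C", "S"): 0.1,   # (today, tomorrow) -> probability
     ("S", "C"): 0.7, ("S", "S"): 0.3}
pi = {"C": 0.5, "S": 0.5}

def sequence_probability(states):
    """P(state sequence) = pi(first state) * product of transitions."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[(prev, cur)]
    return p

print(round(sequence_probability("SCSCCCS"), 5))  # -> 0.00198
```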
        Cloudy  Sunny
Cloudy    0.9     0.1
Sunny     0.7     0.3
Hidden Markov models
• Rather than observing each state directly
• Extend so that each observed state is a probabilistic function of an unobserved event
– i.e. Andreas eats ice-cream
– But not perfectly correlated with the weather
– Latent variables versus observed variables
Emission probabilities
Transition probabilities
Hidden Markov models
• For 7 consecutive days I observe Andreas eating:
– 3, 1, 3, 1, 2, 1, 3 (the number of ice-creams eaten each day)
– More formally:
• O = observation sequence
• O = {x1=3, x2=1, x3=3, x4=1, x5=2, x6=1, x7=3}
• These are called the observed variables
• The weather is:
– S, C, S, C, C, C, S (S = Sunny = k1; C = Cloudy = k2)
– More formally:
• St = state sequence, the latent variables
• St = {z1=S, z2=C, z3=S, z4=C, z5=C, z6=C, z7=S}
• Emission probabilities
– How many ice-creams Cloudy emits
– How many ice-creams Sunny emits
Let's show this visually
Genetic interpretation
Elements of a HMM
• K – the number of states in the model
• M – the number of distinct observation symbols per state
– Can be discrete (e.g. 1, 2, or 3 ice-creams)
• A – the transition probabilities
• Φ – parameters controlling the distribution of the emission probabilities
– Emission probabilities = P(xn | zn, Φ)
– i.e. the probability of emitting a certain symbol at time n, given the model is in state kj
• π - the initial state probabilities
• λ = (A, Φ, π)
3 aspects to be solved
• Given the observation sequence (O = O1, O2, …, OT) and the model (λ = (A, Φ, π))
– How do we efficiently compute P(O|λ), the probability of the observation sequence given the model?
– Solved via the Forward-Backward algorithm
• Given the observation sequence (O = O1, O2, …, OT) and the model (λ = (A, Φ, π))
– How do we choose a corresponding state sequence (Q = q1, q2, …, qT) which is optimal in some meaningful way?
– Solved via the Viterbi algorithm
• How do we adjust the model parameters (λ = (A, Φ, π)) to maximize P(O|λ)?
– Solved via the Baum-Welch algorithm
– (An instance of the EM algorithm)
Unfold A to get Trellis
Trellis structure gives computational efficiency
Forward-Backward Algorithm
[Trellis diagram: forward probabilities α computed left to right, backward probabilities β computed right to left, combined with the emission probabilities and the observed data]
Compute P(O|λ)
• Naive approach: enumerate every possible state sequence of length n
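Enumerating all K^T state sequences is exponential; the forward recursion on the trellis computes the same quantity in O(T·K²). A sketch using the talk's weather transition matrix and starting emission values (function names are mine):

```python
# Forward algorithm sketch: computes P(O | lambda) by a left-to-right
# recursion over the trellis instead of enumerating all state sequences.
STATES = ("C", "S")
A = {"C": {"C": 0.9, "S": 0.1}, "S": {"C": 0.7, "S": 0.3}}
B = {"C": {1: 0.7, 2: 0.2, 3: 0.1}, "S": {1: 0.1, 2: 0.2, 3: 0.7}}
pi = {"C": 0.5, "S": 0.5}

def forward(obs):
    """alpha[t][s] = P(o_1..o_t, state at time t = s)."""
    alpha = [{s: pi[s] * B[s][obs[0]] for s in STATES}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: B[s][o] * sum(prev[r] * A[r][s] for r in STATES)
                      for s in STATES})
    return alpha

def likelihood(obs):
    """P(O | lambda) = sum of the final forward column."""
    return sum(forward(obs)[-1].values())
```

Summing the final trellis column gives exactly the same number as brute-force enumeration; the backward pass reuses the same idea from the other end of the sequence.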
Viterbi Algorithm
• The Forward algorithm sums over all paths
• The Viterbi algorithm finds the most likely path
– The most likely path is the shortest route through the trellis (the smallest sum, in negative-log terms)
– I am not a fan of Viterbi; I prefer to use all paths weighted by their probability
– In genetics, Viterbi gives the most likely haplotype
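A Viterbi sketch for the same toy model: a max over paths where the Forward algorithm takes a sum. Parameter values are the talk's weather matrix and starting emission values; the names are mine:

```python
# Viterbi sketch: the single most likely hidden weather path for an
# observed ice-cream sequence.
STATES = ("C", "S")
A = {"C": {"C": 0.9, "S": 0.1}, "S": {"C": 0.7, "S": 0.3}}
B = {"C": {1: 0.7, 2: 0.2, 3: 0.1}, "S": {1: 0.1, 2: 0.2, 3: 0.7}}
pi = {"C": 0.5, "S": 0.5}

def viterbi(obs):
    """Return the most likely hidden state sequence for obs."""
    # delta[s] = probability of the best path so far ending in state s.
    delta = {s: pi[s] * B[s][obs[0]] for s in STATES}
    back = []                                # back-pointers per time step
    for o in obs[1:]:
        ptr, new = {}, {}
        for s in STATES:
            prev, p = max(((r, delta[r] * A[r][s]) for r in STATES),
                          key=lambda rp: rp[1])
            ptr[s], new[s] = prev, p * B[s][o]
        back.append(ptr)
        delta = new
    # Trace back from the best final state.
    state = max(delta, key=delta.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

path = viterbi([3, 1, 3, 1, 2, 1, 3])
```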
Starting values
Emission probabilities:

          1 ice-cream  2 ice-creams  3 ice-creams
Cloudy        0.7           0.2           0.1
Sunny         0.1           0.2           0.7

Transition matrix:

        Cloudy  Sunny
Cloudy   0.49    0.51
Sunny    0.51    0.49
Back to the ice-cream
• Data: for 30 days, the number of ice-creams Andreas eats each day

Day:        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Ice-creams: 2  3  3  2  3  2  3  2  2  3  1  3  3  1  1  1  2  1  1  1  2  1  1  1  2  3  3  2  3  2
Results

Day  Ice-creams  ProbC  ProbS
  1      2        0.06   0.94
  2      3        0.00   1.00
  3      3        0.00   1.00
  4      2        0.00   1.00
  5      3        0.00   1.00
  6      2        0.00   1.00
  7      3        0.00   1.00
  8      2        0.01   0.99
  9      2        0.01   0.99
 10      3        0.00   1.00
 11      1        0.10   0.90
 12      3        0.00   1.00
 13      3        0.00   1.00
 14      1        0.92   0.08
 15      1        0.99   0.01
 16      1        1.00   0.00
 17      2        0.98   0.02
 18      1        1.00   0.00
 19      1        1.00   0.00
 20      1        1.00   0.00
 21      2        0.98   0.02
 22      1        1.00   0.00
 23      1        0.99   0.01
 24      1        0.95   0.05
 25      2        0.33   0.67
 26      3        0.00   1.00
 27      3        0.00   1.00
 28      2        0.00   1.00
 29      3        0.00   1.00
 30      2        0.04   0.96
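The ProbC/ProbS columns are posterior state probabilities, i.e. P(weather on day t | all 30 observations), obtained from the forward-backward algorithm. A sketch of that posterior decoding step, using the fitted parameters reported in the talk and an assumed uniform initial-state distribution (so the exact numbers depend on which parameter set is plugged in):

```python
# Posterior decoding sketch: combine forward (alpha) and backward (beta)
# probabilities to get per-day weather posteriors.
STATES = ("C", "S")
A = {"C": {"C": 0.89, "S": 0.11}, "S": {"C": 0.07, "S": 0.93}}
B = {"C": {1: 0.79, 2: 0.21, 3: 0.00}, "S": {1: 0.06, 2: 0.37, 3: 0.57}}
pi = {"C": 0.5, "S": 0.5}                       # assumed uniform
obs = [2, 3, 3, 2, 3, 2, 3, 2, 2, 3, 1, 3, 3, 1, 1, 1, 2, 1, 1, 1,
       2, 1, 1, 1, 2, 3, 3, 2, 3, 2]

def posteriors(obs):
    # Forward pass: alpha[t][s] = P(o_1..o_t, state t = s).
    alpha = [{s: pi[s] * B[s][obs[0]] for s in STATES}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: B[s][o] * sum(prev[r] * A[r][s] for r in STATES)
                      for s in STATES})
    # Backward pass: beta[t][s] = P(o_{t+1}..o_T | state t = s).
    beta = [{s: 1.0 for s in STATES}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(A[s][r] * B[r][o] * nxt[r] for r in STATES)
                        for s in STATES})
    like = sum(alpha[-1].values())
    # Posterior: alpha * beta, normalised by the total likelihood.
    return [{s: alpha[t][s] * beta[t][s] / like for s in STATES}
            for t in range(len(obs))]

post = posteriors(obs)
```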
Parameters
Emission probabilities:

          1 ice-cream  2 ice-creams  3 ice-creams
Cloudy        0.79          0.21          0.00
Sunny         0.06          0.37          0.57

Transition matrix:

        Cloudy  Sunny
Cloudy   0.89    0.11
Sunny    0.07    0.93
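The fitted values above come from re-estimating the parameters against the 30-day record with the Baum-Welch (EM) algorithm mentioned earlier. A compact didactic sketch (no log-space arithmetic, a fixed iteration count, and an assumed uniform starting π); the values it converges to will be close to, but not necessarily identical to, the slide's, since they depend on the starting point and stopping rule:

```python
# Baum-Welch (EM) sketch for the two-state ice-cream model.
STATES = ("C", "S")
obs = [2, 3, 3, 2, 3, 2, 3, 2, 2, 3, 1, 3, 3, 1, 1, 1, 2, 1, 1, 1,
       2, 1, 1, 1, 2, 3, 3, 2, 3, 2]

# Starting values from the talk.
A = {"C": {"C": 0.49, "S": 0.51}, "S": {"C": 0.51, "S": 0.49}}
B = {"C": {1: 0.7, 2: 0.2, 3: 0.1}, "S": {1: 0.1, 2: 0.2, 3: 0.7}}
pi = {"C": 0.5, "S": 0.5}

def forward_backward():
    alpha = [{s: pi[s] * B[s][obs[0]] for s in STATES}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: B[s][o] * sum(prev[r] * A[r][s] for r in STATES)
                      for s in STATES})
    beta = [{s: 1.0 for s in STATES}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(A[s][r] * B[r][o] * nxt[r] for r in STATES)
                        for s in STATES})
    return alpha, beta, sum(alpha[-1].values())

for _ in range(50):                              # EM iterations
    alpha, beta, like = forward_backward()
    # E-step: state posteriors (gamma) and transition posteriors (xi).
    gamma = [{s: alpha[t][s] * beta[t][s] / like for s in STATES}
             for t in range(len(obs))]
    xi = [{(i, j): alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                   * beta[t + 1][j] / like
           for i in STATES for j in STATES} for t in range(len(obs) - 1)]
    # M-step: normalised expected counts.
    A = {i: {j: sum(x[(i, j)] for x in xi) / sum(g[i] for g in gamma[:-1])
             for j in STATES} for i in STATES}
    B = {s: {k: sum(g[s] for g, o in zip(gamma, obs) if o == k) /
                sum(g[s] for g in gamma) for k in (1, 2, 3)} for s in STATES}
    pi = dict(gamma[0])
```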
How is this imputation?
• The same 30-day record and posterior probabilities as in the Results table, but with the ice-cream count on day 15 masked (#):

Day  Ice-creams  ProbC  ProbS
 14      1        0.92   0.08
 15      #        0.99   0.01
 16      1        1.00   0.00
(all other days unchanged)
Imputation
Emission probabilities (fitted):

          1 ice-cream  2 ice-creams  3 ice-creams
Cloudy        0.79          0.21          0.00
Sunny         0.06          0.37          0.57

Day  ProbC  ProbS
 15   0.99   0.01

Expected number of ice-creams Andreas eats on day 15:

0.99 × [(0.79×1) + (0.21×2) + (0.00×3)] + 0.01 × [(0.06×1) + (0.37×2) + (0.57×3)] = 1.22

True value = 1 ice-cream on day 15
• We predict on the basis of the hidden states
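The day-15 arithmetic above, as code: weight each weather state's expected ice-cream count by its posterior probability.

```python
# Imputing day 15: posterior-weighted expectation over the two weather
# states, using the fitted emission table from the talk.
prob = {"C": 0.99, "S": 0.01}                   # posterior for day 15
emission = {"C": {1: 0.79, 2: 0.21, 3: 0.00},
            "S": {1: 0.06, 2: 0.37, 3: 0.57}}

expected = sum(prob[s] * sum(n * p for n, p in emission[s].items())
               for s in ("C", "S"))
print(round(expected, 2))  # -> 1.22
```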
HMM for Genetics
• fastPHASE
• Beagle
• MaCH
• Impute2
• AlphaPhase (combined with heuristics)
• AlphaImpute (combined with heuristics)
• Haplotyping and imputation
fastPHASE
• The model
– Haploid gametes underlie diploid genotypes
• Hidden Markov process
– Haploid gametes in the present population derive from ancient founder haplotypes
• Hidden process
– Founder haplotypes can be considered to be cluster medoids
– fastPHASE can be considered analogous to a mixture model
Now put in the genetics
– Alleles (A=0, a=1)
– Genotype data (0, 1, or 2)
10100111011100111001110011
11110222111111111111121021
01010111100011000110011010
The phasing problem – split diplotype into a pair of haplotypes
The founder haplotype mosaic problem
The genetics
• Alleles are correlated along the haploid gametes
– Closer alleles are more correlated
• fastPHASE is an IBD probability model
– Each allele of each gamete has a probability of deriving from each founder haplotype
• With this information we can do lots of things
– Phase
– Impute
– Build genomic relationship matrices
30 animal example
          M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
Founder1   0  0  0  0  0  0  0  0  0   0
Founder2   1  1  1  1  1  1  1  1  1   1
Founder3   1  0  1  0  1  0  1  0  1   0
Founder4   0  1  0  1  0  1  0  1  0   1
     M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
1     1  1  2  2  2  2  2  2  2   1
2     0  0  1  1  1  1  1  1  1   0
3     0  0  0  0  0  0  0  0  0   0
4     0  1  0  1  0  1  0  0  0   0
5     0  2  0  2  0  2  0  1  0   1
6     2  2  2  2  2  2  2  2  1   1
7     0  1  0  1  0  1  0  1  0   1
8     0  2  0  2  0  2  0  1  0   0
9     0  2  0  1  0  1  0  0  0   0
10    0  1  1  1  1  1  1  1  1   0
11    0  1  0  1  0  1  0  1  0   1
12    0  1  0  1  0  1  0  0  0   0
13    1  1  1  1  1  1  1  1  1   1
14    0  1  0  2  0  1  0  0  0   0
15    0  1  0  1  0  1  0  0  0   0
16    0  0  0  0  0  0  0  0  0   0
17    1  1  1  1  1  1  1  1  0   0
18    0  0  0  0  0  0  0  0  0   0
19    1  2  1  2  1  2  1  2  1   1
20    0  1  0  1  0  1  0  1  0   1
21    1  2  1  2  1  2  1  2  0   1
22    0  0  0  1  0  1  0  1  0   1
23    1  2  1  2  1  2  0  1  0   0
24    0  1  0  0  0  2  1  1  0   0
25    2  2  2  2  2  2  1  2  1   2
26    1  1  1  1  1  1  1  1  0   0
27    0  1  0  1  0  1  1  2  0   1
28    0  0  0  0  0  0  0  0  0   0
29    0  0  0  0  0  0  0  1  0   1
30    0  2  0  2  0  2  0  1  0   0
    M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
1    0  0  1  1  1  1  1  1  1   1
1    1  1  1  1  1  1  1  1  1   0
2    0  0  0  0  0  0  0  0  0   0
2    0  0  1  1  1  1  1  1  1   0
3    0  0  0  0  0  0  0  0  0   0
3    0  0  0  0  0  0  0  0  0   0
4    0  1  0  1  0  1  0  0  0   0
4    0  0  0  0  0  0  0  0  0   0
The parameters
• Slightly different from a standard HMM: α, r, Θ
• α and r are partitions of the transition matrix
– r = the recombination rate between markers
• i.e. the probability of a transition happening
– α = the frequency of each founder haplotype at each marker
• i.e. given there is a transition, where do I transition to
• Θ = the emission probabilities
– The frequency of allele 1 at position k in founder haplotype j
– (i.e. the probability of emitting a 1 as opposed to a 0)
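The α/r partition described above can be sketched as a stay-or-jump transition structure: between adjacent markers a gamete stays on its current founder haplotype with probability (1 − r), and with probability r it recombines, landing on founder j with frequency α_j. (The exact fastPHASE parameterisation differs in detail; this only illustrates the structure, with names of mine.)

```python
# Build the founder-to-founder transition matrix between two adjacent
# markers from a recombination rate r and founder frequencies alpha.
def transition_matrix(r, alpha):
    """a[i][j] = P(founder j at the next marker | founder i at this one)."""
    K = len(alpha)
    return [[(1 - r) * (i == j) + r * alpha[j] for j in range(K)]
            for i in range(K)]

# Example: the marker-1 alpha column and the r between markers 1 and 2
# from the talk's parameter tables.
A12 = transition_matrix(0.0004, [0.24, 0.25, 0.28, 0.23])
```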
HMM parameters
Theta (one row per founder haplotype, one column per marker):
0.00 0.80 0.00 0.83 0.00 0.80 0.17 0.94 0.00 0.70
0.79 0.86 1.00 1.00 1.00 1.00 0.86 1.00 0.57 0.50
0.00 0.04 0.00 0.04 0.00 0.07 0.00 0.04 0.00 0.00
0.00 0.99 0.00 0.96 0.00 1.00 0.03 0.24 0.00 0.01

Alpha (founder haplotype frequencies, one row per founder, one column per marker):
0.24 0.01 0.00 0.00 0.14 0.18 0.88 0.90 0.07 0.06
0.25 0.00 0.09 0.00 0.00 0.00 0.00 0.04 0.00 0.11
0.28 0.97 0.90 0.99 0.61 0.10 0.02 0.01 0.27 0.26
0.23 0.02 0.01 0.01 0.25 0.72 0.10 0.05 0.66 0.58

R (recombination rates between the 9 pairs of adjacent markers):
0.0004 0.0004 0.0005 0.0001 0.0004 0.0017 0.0035 0.0004 0.0002
Ancestral haplotypes
Theta (each estimated row labelled with the matching true founder haplotype):
F4  0.00 0.80 0.00 0.83 0.00 0.80 0.17 0.94 0.00 0.70
F2  0.79 0.86 1.00 1.00 1.00 1.00 0.86 1.00 0.57 0.50
F1  0.00 0.04 0.00 0.04 0.00 0.07 0.00 0.04 0.00 0.00
F4  0.00 0.99 0.00 0.96 0.00 1.00 0.03 0.24 0.00 0.01
M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
Founder1 0 0 0 0 0 0 0 0 0 0
Founder2 1 1 1 1 1 1 1 1 1 1
Founder3 1 0 1 0 1 0 1 0 1 0
Founder4 0 1 0 1 0 1 0 1 0 1
Impute missing marker
• Combine the output probabilities with the parameters of the model
– i.e. with Theta, the emission probabilities
Founder probabilities for gamete 1 of individual 29 (columns: ID, founder, then one probability per marker):

29 1  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
29 2  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
29 3  1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
29 4  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Founder probabilities for gamete 2 of individual 29:

29 1  0.79 0.79 0.79 0.79 0.79 0.79 0.85 0.98 0.98 0.98
29 2  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01
29 3  0.21 0.21 0.21 0.21 0.21 0.21 0.15 0.01 0.01 0.01
29 4  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
F4 0.00 0.80 0.00 0.83 0.00 0.80 0.17 0.94 0.00 0.70F2 0.79 0.86 1.00 1.00 1.00 1.00 0.86 1.00 0.57 0.50F1 0.00 0.04 0.00 0.04 0.00 0.07 0.00 0.04 0.00 0.00F4 0.00 0.99 0.00 0.96 0.00 1.00 0.03 0.24 0.00 0.01
29  0 0 0 0 0 0 0 1 0 1
• Genotype of individual 29 at marker 1
• Gamete 1 comes from founder 3
• Gamete 2 is a combination of founders 1 and 3
• Founders 1 and 3 emit a 0
• True genotype is a 0
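The imputation step just described can be sketched as code: the expected allele at marker 1 for individual 29 is each gamete's founder-probability-weighted sum of Theta (the probability of emitting allele 1), summed over the two gametes. The Theta row order is assumed here to follow founders 1-4; the variable names are mine.

```python
# Impute marker 1 of individual 29 from founder probabilities and Theta.
theta_m1 = [0.00, 0.79, 0.00, 0.00]   # P(emit allele 1 | founder), marker 1
gamete1  = [0.00, 0.00, 1.00, 0.00]   # founder probabilities, gamete 1
gamete2  = [0.79, 0.00, 0.21, 0.00]   # founder probabilities, gamete 2

dosage = (sum(p * t for p, t in zip(gamete1, theta_m1)) +
          sum(p * t for p, t in zip(gamete2, theta_m1)))
print(dosage)  # -> 0.0, matching the true genotype of 0
```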
The HMM of fastPHASE in a nutshell
[Figure key: column index, ID, paternal gamete, maternal gamete, probabilities for marker 1]
For individual 29 it is highly probable that its two gametes derive from founder haplotype 2