![Page 1: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/1.jpg)
Hidden Markov Models of Haplotype
Diversity and Applications in
Genetic Epidemiology
Ion Mandoiu, University of Connecticut
![Page 2: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/2.jpg)
Outline
- HMM model of haplotype diversity
- Applications
  - Phasing
  - Error detection
  - Imputation
  - Genotype calling from low-coverage sequencing data
- Conclusions
![Page 3: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/3.jpg)
Single Nucleotide Polymorphisms
- Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)
- High density in the human genome: $1 \times 10^7$ SNPs out of a total of $3 \times 10^9$ base pairs

… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcCgtatacacgggTctata …
![Page 4: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/4.jpg)
Haplotypes and Genotypes
- Diploids: two homologous copies of each autosomal chromosome
  - One inherited from mother and one from father
- Haplotype: description of SNP alleles on a chromosome
  - 0/1 vector: 0 for major allele, 1 for minor
- Genotype: description of alleles on both chromosomes
  - 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles

Example (two haplotypes per individual, combined into one genotype):

011100110 + 001000010 = 021200210
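The 0/1/2 coding above can be illustrated with a minimal Python sketch. The function name `combine_haplotypes` is hypothetical and only mirrors the rule stated on the slide: equal alleles keep their value, different alleles become 2.

```python
def combine_haplotypes(h1: str, h2: str) -> str:
    """Return the 0/1/2 genotype string for two 0/1 haplotype strings."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    genotype = []
    for a, b in zip(h1, h2):
        if a == b:
            genotype.append(a)    # homozygous: '0' (major) or '1' (minor)
        else:
            genotype.append("2")  # heterozygous: the two chromosomes differ
    return "".join(genotype)


# The example from the slide.
print(combine_haplotypes("011100110", "001000010"))  # prints 021200210
```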
![Page 5: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/5.jpg)
Sources of Haplotype Diversity: Mutation
The International HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437, 1299-1320. 2005.
![Page 6: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/6.jpg)
Sources of Haplotype Diversity: Recombination
![Page 7: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/7.jpg)
Haplotype Structure in Human Populations
![Page 8: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/8.jpg)
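HMM Model of Haplotype Frequencies
- $F_i$ = founder haplotype at locus $i$, $H_i$ = observed allele at locus $i$
- $P(F_i)$, $P(F_i \mid F_{i-1})$ and $P(H_i \mid F_i)$ estimated from reference genotype or haplotype data
- For given haplotype $h$, $P(H=h \mid M)$ can be computed in $O(nK^2)$ using the forward algorithm ($n$ = number of SNPs, $K$ = number of founder haplotypes)
- Similar models proposed in [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06]

Model structure: hidden chain $F_1 \to F_2 \to \dots \to F_n$, where each $F_i$ emits the observed allele $H_i$.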
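As a rough illustration of the $O(nK^2)$ forward computation of $P(H=h \mid M)$, here is a minimal pure-Python sketch. The array layout (`init`, `trans`, `emit`) and the toy numbers are assumptions made for this example, not the parameterization of any particular software package.

```python
from itertools import product


def haplotype_probability(h, init, trans, emit):
    """P(H = h | M) by the forward algorithm, O(n K^2) time.

    h     : observed alleles (0/1), one per locus (0-based loci)
    init  : init[k] = P(founder at locus 0 is k)
    trans : trans[i][j][k] = P(founder at locus i+1 is k | founder at locus i is j)
    emit  : emit[i][k][a] = P(allele a at locus i | founder at locus i is k)
    """
    K = len(init)
    # alpha[k] = P(h[0..i], founder at locus i is k)
    alpha = [init[k] * emit[0][k][h[0]] for k in range(K)]
    for i in range(1, len(h)):
        alpha = [
            emit[i][k][h[i]] * sum(alpha[j] * trans[i - 1][j][k] for j in range(K))
            for k in range(K)
        ]
    return sum(alpha)


# Toy model with K = 2 founders and n = 3 loci (all numbers are made up).
init = [0.6, 0.4]
trans = [[[0.9, 0.1], [0.2, 0.8]]] * 2
emit = [[[0.99, 0.01], [0.01, 0.99]]] * 3

print(haplotype_probability([0, 0, 0], init, trans, emit))

# Sanity check: the probabilities of all 2^3 haplotypes sum to 1.
total = sum(
    haplotype_probability(list(h), init, trans, emit)
    for h in product([0, 1], repeat=3)
)
```

Each locus costs $K^2$ multiplications (every founder pair $j \to k$), which gives the $O(nK^2)$ bound quoted on the slide.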
![Page 9: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/9.jpg)
![Page 10: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/10.jpg)
Genotype Phasing
g: 0010212 ?
h1:0010111
h2:0010010
h3:0010011
h4:0010110
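The ambiguity in this example can be made concrete by enumerating every haplotype pair compatible with a genotype. The sketch below is illustrative only; `enumerate_phasings` is a hypothetical helper. A genotype with $k$ heterozygous sites has $2^{k-1}$ unordered phasings.

```python
from itertools import product


def enumerate_phasings(g: str):
    """Yield unordered haplotype pairs (h1, h2) that explain genotype g (0/1/2 coding)."""
    het = [i for i, x in enumerate(g) if x == "2"]
    if not het:
        yield g, g  # fully homozygous: a single phasing
        return
    # Fix h1 = 0 at the first heterozygous site so (h1, h2) and (h2, h1) are not both listed.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for pos, b in zip(het, ("0",) + bits):
            h1[pos] = b
            h2[pos] = "1" if b == "0" else "0"
        yield "".join(h1), "".join(h2)


# The slide's genotype has two heterozygous sites, hence two candidate phasings:
# (h2, h1) and (h3, h4) in the slide's labeling.
print(list(enumerate_phasings("0010212")))
# prints [('0010010', '0010111'), ('0010011', '0010110')]
```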
![Page 11: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/11.jpg)
Maximum Likelihood Genotype Phasing
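Maximum likelihood genotype phasing: given $g$, find $(h_1, h_2) = \arg\max_{h_1 + h_2 = g} P(h_1 \mid M)\, P(h_2 \mid M)$

Model structure: two founder chains, $F_1 \to F_2 \to \dots \to F_n$ and $F'_1 \to F'_2 \to \dots \to F'_n$. At each locus $i$, $F_i$ emits $H_i$, $F'_i$ emits $H'_i$, and $H_i$ and $H'_i$ together give the genotype $G_i$.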
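For small numbers of heterozygous sites the argmax can be found by brute force. The sketch below is only an illustration: `ml_phasing` is a hypothetical name, `hap_prob` stands in for $P(h \mid M)$, and the toy frequency table replaces a trained HMM. The search is exponential in the number of heterozygous sites, so it is not a practical phasing method.

```python
from itertools import product


def ml_phasing(g, hap_prob):
    """Brute-force argmax over phasings (h1, h2) of g of hap_prob(h1) * hap_prob(h2)."""
    het = [i for i, x in enumerate(g) if x == "2"]
    best, best_score = None, -1.0
    # Enumerate 2^(k-1) unordered phasings for k heterozygous sites.
    for bits in product("01", repeat=max(len(het) - 1, 0)):
        h1, h2 = list(g), list(g)
        for pos, b in zip(het, ("0",) + bits):
            h1[pos], h2[pos] = b, ("1" if b == "0" else "0")
        h1, h2 = "".join(h1), "".join(h2)
        score = hap_prob(h1) * hap_prob(h2)
        if score > best_score:
            best, best_score = (h1, h2), score
    return best, best_score


# Made-up haplotype frequencies standing in for P(h | M).
freq = {"0010111": 0.4, "0010010": 0.3, "0010011": 0.05, "0010110": 0.05}
best, score = ml_phasing("0010212", lambda h: freq.get(h, 0.0))
print(best)  # prints ('0010010', '0010111')
```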
![Page 12: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/12.jpg)
Computational Complexity
• [KMP08] Cannot approximate $\max_{h_1+h_2=g} P(h_1 \mid M)\,P(h_2 \mid M)$ within a factor of $O(n^{1/2-\varepsilon})$, unless ZPP=NP
• [Rastas et al.] give Viterbi and random sampling based heuristics that yield phasing accuracy comparable to best existing methods (PHASE)
![Page 13: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/13.jpg)
![Page 14: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/14.jpg)
Genotyping Errors
- A real problem despite advances in technology & typing algorithms
  - 1.1% of 20 million dbSNP genotypes typed multiple times are inconsistent [Zaitlen et al. 2005]
- Systematic errors (e.g., assay failure) typically detected by departure from Hardy-Weinberg equilibrium (HWE) [Hosking et al. 2004]
- In pedigrees, some errors detected as Mendelian Inconsistencies (MIs)
- Many errors remain undetected
  - As much as 70% of errors are Mendelian consistent for mother/father/child trios [Gordon et al. 1999]
![Page 15: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/15.jpg)
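Likelihood Sensitivity Approach to Error Detection in Trios

Trio genotypes and best phasing:
- Mother: 0 1 2 1 0 2, phased as h1 = 0 1 1 1 0 0 and h2 = 0 1 0 1 0 1
- Father: 0 2 2 1 0 2, phased as h3 = 0 0 0 1 0 1 and h4 = 0 1 1 1 0 0
- Child: 0 2 2 1 0 2, carrying h1 = 0 1 1 1 0 0 and h3 = 0 0 0 1 0 1

Likelihood of best phasing for original trio T:

$L(T) = \max\; p(h_1)\,p(h_2)\,p(h_3)\,p(h_4)$

where the maximum is taken over phasings consistent with the trio genotypes.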
![Page 16: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/16.jpg)
Likelihood Sensitivity Approach to Error Detection in Trios

Likelihood of best phasing for original trio T (mother 0 1 2 1 0 2, father 0 2 2 1 0 2, child 0 2 2 1 0 2):

$L(T) = \max\; p(h_1)\,p(h_2)\,p(h_3)\,p(h_4)$

Best phasing for modified trio T' (the tested genotype is marked "?"):
- Mother: h'1 = 0 1 0 1 0 1 and h'2 = 0 1 1 1 0 0
- Father: h'3 = 0 0 0 1 0 0 and h'4 = 0 1 1 1 0 1
- Child (?): h'1 = 0 1 0 1 0 1 and h'3 = 0 0 0 1 0 0

Likelihood of best phasing for modified trio T':

$L(T') = \max\; p(h'_1)\,p(h'_2)\,p(h'_3)\,p(h'_4)$
![Page 17: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/17.jpg)
Likelihood Sensitivity Approach to Error Detection in Trios

Trio: mother 0 1 2 1 0 2, father 0 2 2 1 0 2, child 0 2 2 1 0 2 (tested genotype marked "?")
- Large change in likelihood suggests likely error
- Flag genotype as an error if $L(T')/L(T) > R$, where $R$ is the detection threshold (e.g., $R = 10^4$)
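The thresholding rule can be sketched as follows. This is a simplified illustration under stated assumptions: it works on a single genotype string rather than a full trio, `likelihood` is a hypothetical callable standing in for $L(\cdot)$, and $L(T')$ is taken as the best likelihood over alternative values of the tested genotype, which is one plausible reading of the "?" on the slides.

```python
def flag_errors(genotype, likelihood, R=1e4):
    """Return loci where changing the genotype raises the likelihood by a factor > R."""
    flagged = []
    base = likelihood(genotype)  # plays the role of L(T)
    for i, current in enumerate(genotype):
        # Best likelihood when the genotype at locus i takes another value: L(T').
        best_alt = max(
            likelihood(genotype[:i] + alt + genotype[i + 1:])
            for alt in "012"
            if alt != current
        )
        if base == 0.0:
            if best_alt > 0.0:
                flagged.append(i)
        elif best_alt / base > R:
            flagged.append(i)
    return flagged


# Made-up likelihood: the all-major genotype is common, everything else is rare.
def toy_likelihood(g):
    return 1e-2 if g == "000" else 1e-8


print(flag_errors("010", toy_likelihood))  # prints [1]
```

Only locus 1 is flagged: changing it to 0 raises the toy likelihood by a factor of $10^6$, while changes at the other loci leave it unchanged.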
![Page 18: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/18.jpg)
Alternate Likelihood Functions
• Efficiently Computable Likelihood Functions
  - Viterbi probability
  - Probability of Viterbi Haplotypes
  - Total Trio Probability
• [KMP08] Cannot approximate $L(T)$ within $O(n^{1/4-\varepsilon})$, unless ZPP=NP
![Page 19: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/19.jpg)
Comparison with FAMHAP (Children)

[Figure: sensitivity (0 to 1) versus FP rate (0 to 0.015) for TotalProb-UNO, TotalProb-DUO, TotalProb-TRIO, TotalProb-COMBINED, FAMHAP-1 and FAMHAP-3]
![Page 20: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/20.jpg)
Comparison with FAMHAP (Parents)

[Figure: sensitivity (0 to 1) versus FP rate (0 to 0.015) for TotalProb-UNO, TotalProb-DUO, TotalProb-TRIO, TotalProb-COMBINED, FAMHAP-1 and FAMHAP-3]
![Page 21: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/21.jpg)
![Page 22: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/22.jpg)
Genome-Wide Association Studies
- Powerful method for finding genes associated with complex human diseases
- Large number of markers (SNPs) typed in cases and controls
- Disease causal SNPs unlikely to be typed directly
- Significant statistical power gained by performing imputation of untyped HapMap genotypes [WTCCC’07]
![Page 23: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/23.jpg)
HMM Based Genotype Imputation
- Train HMM using the haplotypes from a related HapMap population or a small cohort typed at high density
- Probability of missing genotypes given the typed genotype data:

$$P(g_i = x \mid g, M) = \frac{P(g, g_i = x \mid M)}{P(g \mid M)}$$

- $g_i$ is imputed as $\arg\max_{x \in \{0,1,2\}} P(g, g_i = x \mid M)$
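Given the three joint probabilities $P(g, g_i = x \mid M)$, the imputation rule reduces to a normalization and an argmax. The sketch below assumes those joint values have already been computed from the HMM. The `threshold` parameter, which leaves low-confidence genotypes uncalled, is an assumption about how a confidence cutoff might be applied and is not taken from the slides.

```python
def impute_genotype(joint, threshold=0.0):
    """Impute one genotype from joint[x] = P(g, g_i = x | M), x in {0, 1, 2}.

    Returns (call, posterior); call is None when the best posterior
    probability falls below the threshold.
    """
    total = sum(joint)                      # equals P(g | M)
    posterior = [p / total for p in joint]  # P(g_i = x | g, M)
    best = max(range(3), key=lambda x: posterior[x])
    call = best if posterior[best] >= threshold else None
    return call, posterior


# Made-up joint probabilities for one untyped SNP.
call, posterior = impute_genotype([2e-6, 6e-6, 2e-6], threshold=0.5)
print(call)  # prints 1
```

Because $P(g \mid M)$ does not depend on $x$, the argmax of the joint and of the posterior coincide. The normalization is only needed when a confidence value is wanted.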
![Page 24: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/24.jpg)
Experimental Results
- Estimates of the allele 0 frequency based on imputation vs. Illumina 15k
![Page 25: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/25.jpg)
Experimental Results
- Accuracy and missing data rate for imputed genotypes at different thresholds
![Page 26: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/26.jpg)
![Page 27: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/27.jpg)
Ultra-High Throughput Sequencing
- New massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to Sanger sequencing
- Illumina / Solexa Genetic Analyzer 1G: 1000 Mb/run, 35bp reads
- Roche / 454 Genome Sequencer FLX: 100 Mb/run, 400bp reads
- Applied Biosystems SOLiD: 3000 Mb/run, 25-35bp reads
![Page 28: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/28.jpg)
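Probabilistic Model
- Two founder chains: $F_1 \to F_2 \to \dots \to F_n$ and $F'_1 \to F'_2 \to \dots \to F'_n$
- At each locus $i$, $F_i$ emits allele $H_i$ and $F'_i$ emits allele $H'_i$
- $H_i$ and $H'_i$ determine the genotype $G_i$
- $G_i$ emits the reads $R_{i,1}, \dots, R_{i,c_i}$ covering locus $i$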
![Page 29: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/29.jpg)
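Model Training
- Initial founder probabilities $P(f_1)$, $P(f'_1)$, transition probabilities $P(f_{i+1} \mid f_i)$, $P(f'_{i+1} \mid f'_i)$, and emission probabilities $P(h_i \mid f_i)$, $P(h'_i \mid f'_i)$ trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father
- $P(g_i \mid h_i, h'_i)$ set to 1 if $h_i + h'_i = g_i$ and to 0 otherwise. In this model $g_i$ counts minor alleles, so $g_i = 1$ is the heterozygous genotype.
- Read emission probabilities:

$$P(R_{i,j} = r \mid G_i = g_i) = \left(1 - \frac{g_i}{2}\right) \varepsilon_{r(i)}^{\,r(i)} \left(1-\varepsilon_{r(i)}\right)^{1-r(i)} + \frac{g_i}{2}\, \varepsilon_{r(i)}^{\,1-r(i)} \left(1-\varepsilon_{r(i)}\right)^{r(i)}$$

where $r(i)$ is the allele observed in read $r$ at locus $i$ and $\varepsilon_{r(i)}$ is the probability that read $r$ has an error at locus $i$
- Conditional probabilities for sets of reads are given by:

$$P(\mathbf{r}_i \mid G_i = 0) = \prod_{r \in \mathbf{r}_i,\; r(i)=0} \left(1-\varepsilon_{r(i)}\right) \prod_{r \in \mathbf{r}_i,\; r(i)=1} \varepsilon_{r(i)}$$

$$P(\mathbf{r}_i \mid G_i = 1) = \left(\frac{1}{2}\right)^{c_i}$$

$$P(\mathbf{r}_i \mid G_i = 2) = \prod_{r \in \mathbf{r}_i,\; r(i)=1} \left(1-\varepsilon_{r(i)}\right) \prod_{r \in \mathbf{r}_i,\; r(i)=0} \varepsilon_{r(i)}$$

where $c_i$ is the number of reads covering locus $i$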
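The read-set probabilities can be computed per locus as in the following sketch. It assumes that genotypes count minor alleles (so 1 is heterozygous), that reads are independent given the genotype, and that each read is equally likely to come from either chromosome. The helper `phred_to_error` uses the standard Phred definition $\varepsilon = 10^{-Q/10}$; whether the original software converts quality scores in exactly this way is an assumption.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability (standard definition)."""
    return 10 ** (-q / 10)


def read_set_likelihoods(alleles, error_probs):
    """Return [P(r_i | G_i = g) for g in 0, 1, 2] for the reads covering one locus.

    alleles     : observed allele (0 or 1) in each read at this locus
    error_probs : error probability of each read at this locus
    """
    likelihoods = []
    for g in (0, 1, 2):
        p = 1.0
        for a, eps in zip(alleles, error_probs):
            from_major = (1 - eps) if a == 0 else eps  # read sampled from a 0 allele
            from_minor = (1 - eps) if a == 1 else eps  # read sampled from a 1 allele
            # A read comes from a minor-allele chromosome with probability g / 2.
            p *= (1 - g / 2) * from_major + (g / 2) * from_minor
        likelihoods.append(p)
    return likelihoods


# Three reads at one locus: two show allele 0, one shows allele 1.
eps = [phred_to_error(q) for q in (30, 30, 20)]
lik = read_set_likelihoods([0, 0, 1], eps)
print(lik)
```

For the heterozygous genotype each factor is $\tfrac12(1-\varepsilon) + \tfrac12\varepsilon = \tfrac12$, so the product is $(1/2)^{c_i}$ regardless of the error rates.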
![Page 30: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/30.jpg)
Multilocus Genotyping Problem

GIVEN:
• Shotgun read sets $\mathbf{r} = (\mathbf{r}_1, \mathbf{r}_2, \dots, \mathbf{r}_n)$
• Base quality scores
• HMMs for populations of origin for mother/father

FIND:
• Multilocus genotype $g^* = (g^*_1, g^*_2, \dots, g^*_n)$ with maximum posterior probability, i.e., $g^* = \arg\max_g P(g \mid \mathbf{r})$
![Page 31: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/31.jpg)
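Posterior Decoding Algorithm
1. For each $i = 1..n$, compute $g^*_i = \arg\max_{g_i} P(g_i \mid \mathbf{r}) = \arg\max_{g_i} P(g_i, \mathbf{r})$
2. Return $g^* = (g^*_1, \dots, g^*_n)$

- Joint probabilities can be computed using a forward-backward algorithm:

$$P(g_i, \mathbf{r}) = \sum_{f_i=1}^{K} \sum_{f'_i=1}^{K} \alpha^{i}_{f_i,f'_i}\; \beta^{i}_{f_i,f'_i}\; P(\mathbf{r}_i \mid g_i)\; P(g_i \mid f_i, f'_i)$$

where $\alpha^{i}_{f_i,f'_i}$ and $\beta^{i}_{f_i,f'_i}$ are the forward and backward probabilities of the founder pair $(f_i, f'_i)$ at locus $i$
- Direct implementation gives $O(m + nK^4)$ time, where
  - $m$ = number of reads
  - $n$ = number of SNPs
  - $K$ = number of founder haplotypes in HMMs
- Runtime reduced to $O(m + nK^3)$ using speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]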
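The direct $O(nK^4)$ version of posterior decoding can be sketched in pure Python as below. This is an illustration under simplifying assumptions, not the published implementation: both parental chains share one HMM, the transition matrix is the same at every locus, the read-set likelihoods $P(\mathbf{r}_i \mid g)$ are supplied precomputed, and all numbers in the usage example are made up. The $O(nK^3)$ speed-up is not attempted.

```python
from itertools import product


def posterior_decode(init, trans, emit, read_lik):
    """Posterior decoding of genotypes over founder pairs, O(n K^4) time.

    init[k]        : P(F_1 = k), shared by both parental chains (assumption)
    trans[j][k]    : P(F_{i+1} = k | F_i = j), same at every locus (assumption)
    emit[i][k]     : P(H_i = 1 | F_i = k)
    read_lik[i][g] : P(r_i | G_i = g), with g = number of minor alleles
    """
    n, K = len(emit), len(init)
    states = list(product(range(K), repeat=2))  # founder pairs (f, f')

    def geno_given_founders(i, s):
        # P(G_i = g | f, f') for g = 0, 1, 2, using g_i = h_i + h'_i.
        p, q = emit[i][s[0]], emit[i][s[1]]
        return [(1 - p) * (1 - q), p * (1 - q) + (1 - p) * q, p * q]

    # e[i][s] = P(r_i | founder pair s) = sum_g P(g | s) P(r_i | g)
    e = [{s: sum(pg * read_lik[i][g]
                 for g, pg in enumerate(geno_given_founders(i, s)))
          for s in states} for i in range(n)]
    # Pair transition: the two chains move independently.
    t = {(s, u): trans[s[0]][u[0]] * trans[s[1]][u[1]]
         for s in states for u in states}

    # fwd[i][s] = P(r_1..r_{i-1}, pair at locus i = s)
    fwd = [{s: init[s[0]] * init[s[1]] for s in states}]
    for i in range(1, n):
        fwd.append({u: sum(fwd[i - 1][s] * e[i - 1][s] * t[s, u] for s in states)
                    for u in states})
    # bwd[i][s] = P(r_{i+1}..r_n | pair at locus i = s)
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in states}
    for i in range(n - 2, -1, -1):
        bwd[i] = {s: sum(t[s, u] * e[i + 1][u] * bwd[i + 1][u] for u in states)
                  for s in states}

    calls, joints = [], []
    for i in range(n):
        joint = [0.0, 0.0, 0.0]  # joint[g] = P(g_i = g, r)
        for s in states:
            for g, pg in enumerate(geno_given_founders(i, s)):
                joint[g] += fwd[i][s] * bwd[i][s] * pg * read_lik[i][g]
        joints.append(joint)
        calls.append(max(range(3), key=lambda g: joint[g]))
    return calls, joints


# Toy example: K = 2 founders, n = 3 loci, and no reads at the middle locus.
init = [0.5, 0.5]
trans = [[0.95, 0.05], [0.05, 0.95]]
emit = [[0.01, 0.99]] * 3
read_lik = [[0.9, 0.1, 0.001], [1.0, 1.0, 1.0], [0.9, 0.1, 0.001]]
calls, joints = posterior_decode(init, trans, emit, read_lik)
print(calls)
```

In this toy run the middle locus has no read evidence, so its call is driven entirely by the flanking loci through the founder chains, which is the effect that linkage disequilibrium provides at low coverage. A useful sanity check is that $\sum_g P(g_i, \mathbf{r})$ equals $P(\mathbf{r})$ at every locus.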
![Page 32: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/32.jpg)
Genotyping Accuracy on Watson Reads

Accuracy (%) on Watson SNPs (Affy 500k), by sequencing coverage and calling method:

| Coverage | Homozygous SNPs, Binomial | Homozygous SNPs, Posterior | Heterozygous SNPs, Binomial | Heterozygous SNPs, Posterior |
|---|---|---|---|---|
| 0.35x | 30.1 | 93.5 | 2.8 | 67.8 |
| 0.70x | 50.5 | 95.6 | 9.2 | 80.0 |
| 1.41x | 73.7 | 97.4 | 25.7 | 88.5 |
| 2.82x | 90.5 | 98.5 | 54.6 | 93.7 |
| 5.64x | 96.3 | 99.1 | 82.5 | 96.8 |
![Page 33: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/33.jpg)
![Page 34: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/34.jpg)
Conclusions
- HMM model of haplotype diversity provides a powerful framework for addressing central problems in population genetics & genetic epidemiology
  - Enables significant improvements in accuracy by exploiting the high amount of linkage disequilibrium in human populations
- Despite hardness results, heuristics such as posterior or Viterbi decoding perform well in practice
  - Highly scalable runtime (linear in #SNPs and #individuals/reads)
- Software available at http://www.engr.uconn.edu/~ion/SOFT/
![Page 35: Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology](https://reader036.vdocuments.us/reader036/viewer/2022062323/56815d4c550346895dcb564d/html5/thumbnails/35.jpg)
Acknowledgements
Sanjiv Dinakar, Jorge Duitama, Yözen Hernández, Justin Kennedy, Bogdan Pasaniuc
NSF funding (awards IIS-0546457 and DBI-0543365)