
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity

Ion Mandoiu

CSE Department, University of Connecticut

Joint work with Justin Kennedy and Bogdan Pasaniuc

Outline

Introduction

Likelihood Sensitivity Approach to Error Detection

HMM-Based Algorithms

Experimental Results

Conclusion


Single Nucleotide Polymorphisms

Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)

High density in the human genome: $1 \times 10^7$ SNPs out of a total of $3 \times 10^9$ base pairs

… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcCgtatacacgggTctata …

Haplotypes and Genotypes

Diploids: two homologous copies of each chromosome, one inherited from the mother and one from the father

Haplotype: description of SNP alleles on a chromosome. 0/1 vector: 0 for major allele, 1 for minor

Genotype: description of alleles on both chromosomes. 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles

  011100110
+ 001000010   (two haplotypes per individual)
= 021200210   (genotype)
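The 0/1/2 coding above can be checked with a few lines of Python. This is a minimal sketch; the function name `genotype_from_haplotypes` is chosen here for illustration and does not come from the talk.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    # Equal alleles keep their value (0 = major, 1 = minor);
    # different alleles are coded as 2 (heterozygous).
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# The two haplotypes from the slide give the genotype shown there.
print(genotype_from_haplotypes("011100110", "001000010"))  # prints 021200210
```

The mapping is not invertible: a genotype with k heterozygous SNPs is explained by 2^(k-1) distinct haplotype pairs, which is why phasing is needed.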


Why SNP Genotypes?

Identification and fine mapping of disease-related genes

Methods: linkage analysis, allele-sharing, association studies

Genotype data: large pedigrees, sibling pairs, trios, unrelated individuals

Genotyping Errors

A real problem despite advances in genotyping technology: [Zaitlen et al. 2005] found 1.1% inconsistencies among the 20 million dbSNP genotypes typed multiple times

Error types

Systematic errors (e.g., assay failure): detected by departure from Hardy-Weinberg equilibrium (HWE) [Hosking et al. 2004]

For pedigree data, some errors are detected as Mendelian inconsistencies (MIs)

Undetected errors: e.g., if mother/father/child are all heterozygous, any error is Mendelian consistent

Only ~30% of errors are detectable as MIs for trios [Gordon et al. 1999]
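The all-heterozygous example can be verified by enumeration at a single SNP. The sketch below uses the 0/1/2 genotype coding from the earlier slide; the helper names are hypothetical, and it considers only single-genotype changes within one trio.

```python
from itertools import product

# Alleles carried by each genotype code (coding from the slides):
# 0 = major/major, 1 = minor/minor, 2 = heterozygous.
ALLELES = {0: (0, 0), 1: (1, 1), 2: (0, 1)}


def is_mendelian_consistent(mother: int, father: int, child: int) -> bool:
    """True if the child can receive one allele from each parent."""
    child_pair = sorted(ALLELES[child])
    return any(sorted((m, f)) == child_pair
               for m, f in product(ALLELES[mother], ALLELES[father]))


def detectable_single_errors(mother: int, father: int, child: int) -> int:
    """Count single-genotype changes that create a Mendelian inconsistency."""
    trio = [mother, father, child]
    count = 0
    for i in range(3):
        for new in (0, 1, 2):
            if new != trio[i]:
                changed = trio.copy()
                changed[i] = new
                if not is_mendelian_consistent(*changed):
                    count += 1
    return count


# All three heterozygous: none of the 6 possible changes is detectable.
print(detectable_single_errors(2, 2, 2))  # prints 0
# All three homozygous for the major allele: 4 of the 6 changes are detectable.
print(detectable_single_errors(0, 0, 0))  # prints 4
```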

Effects of Undetected Genotyping Errors

Even low error levels can have large effects for some study designs (e.g., rare alleles, haplotype-based analyses)

Error rates as low as 0.1% can increase Type I error rates in the haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp & Becker 04]

1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]

Related Work

Improved genotype calling algorithms [Di et al. 05, Rabbee&Speed 06, Nicolae et al. 06]

Explicit modeling in analysis methods [Sieberts et al. 01, Sobel et al. 02, Abecasis et al. 02, Cheng 06]: computationally complex

Separate error detection step [Douglas et al. 00, Abecasis et al. 02, Becker et al. 06]: detected errors can be retyped, imputed, or ignored in downstream analyses


Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]

Mother: 0 1 2 1 0 2
Father: 0 2 2 1 0 2
Child: 0 2 2 1 0 2

Likelihood of best phasing for original trio T:

$L(T) = \max\, p(h_1)\, p(h_2)\, p(h_3)\, p(h_4)$

Mother: h1 = 0 1 1 1 0 0, h2 = 0 1 0 1 0 1
Father: h3 = 0 0 0 1 0 1, h4 = 0 1 1 1 0 0
Child: h1 = 0 1 1 1 0 0, h3 = 0 0 0 1 0 1

Likelihood of best phasing for modified trio T' (genotype under test marked "?"):

$L(T') = \max\, p(h'_1)\, p(h'_2)\, p(h'_3)\, p(h'_4)$

Mother: h'1 = 0 1 0 1 0 1, h'2 = 0 1 1 1 0 0
Father: h'3 = 0 0 0 1 0 0, h'4 = 0 1 1 1 0 1
Child: h'1 = 0 1 0 1 0 1, h'3 = 0 0 0 1 0 0

Large change in likelihood suggests likely error

Flag genotype as an error if $L(T')/L(T) > R$, where R is the detection threshold (e.g., $R = 10^4$)
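The likelihood ratio test can be illustrated by brute force on the six-SNP example. This is a sketch under stated assumptions, not the method used in FAMHAP or in the HMM algorithms: the modified trio T' is taken to be T with the tested genotype treated as missing ("?"), the child is assumed to inherit h1 and h3 without recombination, and the haplotype frequencies in `HAP_FREQ` are invented for illustration.

```python
from itertools import product


def compatible(h_a: str, h_b: str, genotype: str) -> bool:
    """Check that haplotypes h_a and h_b explain a 0/1/2 genotype ('?' = missing)."""
    for a, b, g in zip(h_a, h_b, genotype):
        if g != "?" and (a if a == b else "2") != g:
            return False
    return True


def best_phasing_likelihood(mother, father, child, hap_freq):
    """L(T) = max p(h1)p(h2)p(h3)p(h4), where (h1, h2) explain the mother,
    (h3, h4) the father, and (h1, h3) the child (no recombination assumed)."""
    best = 0.0
    for h1, h2, h3, h4 in product(hap_freq, repeat=4):
        if (compatible(h1, h2, mother) and compatible(h3, h4, father)
                and compatible(h1, h3, child)):
            best = max(best, hap_freq[h1] * hap_freq[h2]
                       * hap_freq[h3] * hap_freq[h4])
    return best


def likelihood_ratio(mother, father, child, member, snp, hap_freq):
    """L(T')/L(T), with T' obtained by masking one genotype of one member."""
    trio = {"mother": mother, "father": father, "child": child}
    l_t = best_phasing_likelihood(mother, father, child, hap_freq)
    g = trio[member]
    trio[member] = g[:snp] + "?" + g[snp + 1:]
    l_t_mod = best_phasing_likelihood(trio["mother"], trio["father"],
                                      trio["child"], hap_freq)
    return l_t_mod / l_t if l_t > 0 else float("inf")


# Hypothetical haplotype frequencies (assumption, not data from the talk).
HAP_FREQ = {"011100": 0.4, "010101": 0.3, "000101": 0.001,
            "000100": 0.2, "011101": 0.099}

ratio = likelihood_ratio("012102", "022102", "022102", "child", 2, HAP_FREQ)
print(ratio)  # compare against a chosen threshold R to flag the genotype
```

With these frequencies the original trio has a single compatible phasing that uses the rare haplotype 000101, while masking the child's third genotype admits the phasing h'1..h'4 listed above, so the ratio is well above 1. The enumeration over all quadruples is exponential in practice, which is what motivates the pruned search and the HMM-based likelihoods.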

Implementation in FAMHAP [Becker et al. 06]

Window-based algorithm

For each window including the SNP under test, generate list of H most frequent haplotypes (default H=50)

Find most likely trio phasings by pruned search over the $H^4$ quadruples of frequent haplotypes

Flag genotype as an error if $L(T')/L(T) > R$ for at least one window

Mother …201012 1 02210...
Father …201202 2 10211...
Child …000120 2 21021...

Limitations of FAMHAP Implementation

Truncating the list of haplotypes to size H may lead to sub-optimal phasings and inaccurate L(T) values

False positives caused by nearby errors (due to the use of multiple short windows)

Our approach:

HMM model of haplotype diversity: all haplotypes are represented, and there is no need for short windows

Alternate likelihood functions: scalable runtime


HMM Model

Similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05]

Unlike [Scheet & Stephens 06], recombination rates are not modeled explicitly

Block-free model; paths with high transition probability correspond to “founder” haplotypes

(Figure from Rastas et al. 07)

HMM Training

Previous works use EM training of HMM based on unrelated genotype data

Our 2-step algorithm exploits pedigree info

Step 1: Infer haplotypes using pedigree-aware algorithm based on entropy-minimization

Step 2: Train HMM based on inferred haplotypes, using Baum-Welch

Complexity of Computing Maximum Phasing Probability

• For unrelated genotypes, computing maximum phasing probability is hard to approximate within a factor of $O(f^{1/2-\varepsilon})$ for any $\varepsilon > 0$ unless ZPP=NP, where f is the number of founders

• For trios, hard to approximate within $O(f^{1/4-\varepsilon})$

• Reductions from the clique problem

Alternate Likelihood Functions

• Viterbi probability (ViterbiProb): the maximum probability of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio

• Probability of Viterbi Haplotypes (ViterbiHaps): product of total probabilities of the 4 Viterbi haplotypes

• Total Trio Probability (TotalProb): total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths
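The difference between a Viterbi (maximum over paths) probability and a total (sum over paths) probability is easiest to see for a single haplotype rather than a trio. The sketch below is a simplified single-path analogue, not the 4-path trio computation; the HMM layout (one set of K states per SNP, with parameters `init`, `trans`, `emit`) and the toy numbers are assumptions made for illustration.

```python
def haplotype_probabilities(hap, init, trans, emit):
    """Return (viterbi_prob, total_prob) of a 0/1 haplotype under a
    left-to-right HMM.

    init[q]        : probability of starting in state q at SNP 0
    trans[j][p][q] : probability of moving from state p at SNP j
                     to state q at SNP j+1
    emit[j][q][a]  : probability that state q at SNP j emits allele a
    """
    k = len(init)
    vit = [init[q] * emit[0][q][hap[0]] for q in range(k)]
    fwd = list(vit)
    for j in range(1, len(hap)):
        # Viterbi recurrence: keep only the best incoming path.
        vit = [max(vit[p] * trans[j - 1][p][q] for p in range(k))
               * emit[j][q][hap[j]] for q in range(k)]
        # Forward recurrence: sum over all incoming paths.
        fwd = [sum(fwd[p] * trans[j - 1][p][q] for p in range(k))
               * emit[j][q][hap[j]] for q in range(k)]
    return max(vit), sum(fwd)


# Toy model with 2 SNPs and K = 2 states per SNP (hypothetical parameters).
INIT = [0.5, 0.5]
TRANS = [[[0.9, 0.1], [0.2, 0.8]]]
EMIT = [[[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.1, 0.9]]]

viterbi_p, total_p = haplotype_probabilities([0, 1], INIT, TRANS, EMIT)
print(viterbi_p, total_p)
```

The Viterbi value never exceeds the total, since the best path is one of the paths being summed. ViterbiHaps, as described above, mixes the two: haplotypes come from the best paths, but each haplotype is then scored by its total probability.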

Efficient Computation of Viterbi Probability for Trios

For a fixed trio, Viterbi paths can be found using a 4-path version of Viterbi’s algorithm in time $O(NK^8)$

$K^3$ speed-up by factoring common terms:

$V(j+1; q_1, q_2, q_3, q_4) = E(j+1; q_1, q_2, q_3, q_4)\, \max_{q'_4 \in Q_j} \{ Pre(j; q_1, q_2, q_3, q'_4)\, \gamma(q'_4, q_4) \}$

Where:

• $E(j+1; q_1, q_2, q_3, q_4)$ = maximum probability of emitting SNP genotypes at locus j+1 from states $(q_1, q_2, q_3, q_4)$

• $\gamma(q'_4, q_4)$ = transition probability
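The factoring idea can be checked numerically for one locus. In this sketch, K is assumed to be the number of HMM states per locus, the emission factor E is left out (it multiplies the result afterwards), and the maximization over the four previous states is done one coordinate at a time. The function names and the random test values are hypothetical; the table held after the first three passes plays the role that Pre appears to play in the recurrence above.

```python
import random
from itertools import product


def naive_step(V, gamma, K):
    """For every current 4-tuple q, maximize over all previous 4-tuples p:
    K^4 * K^4 = K^8 products per locus."""
    states = list(product(range(K), repeat=4))
    return {q: max(V[p] * gamma[p[0]][q[0]] * gamma[p[1]][q[1]]
                   * gamma[p[2]][q[2]] * gamma[p[3]][q[3]] for p in states)
            for q in states}


def factored_step(V, gamma, K):
    """Same maxima, eliminating one previous coordinate per pass:
    4 passes of K^4 entries times K candidates, i.e. O(K^5) per locus."""
    cur = dict(V)
    for i in range(4):
        nxt = {}
        for key in product(range(K), repeat=4):
            # key holds new states in positions <= i, old states after i.
            nxt[key] = max(cur[key[:i] + (p,) + key[i + 1:]] * gamma[p][key[i]]
                           for p in range(K))
        cur = nxt
    return cur


# Small random instance (K = 3) to compare the two computations.
rng = random.Random(0)
K = 3
states = list(product(range(K), repeat=4))
V = {s: rng.random() for s in states}
gamma = [[rng.random() for _ in range(K)] for _ in range(K)]

naive = naive_step(V, gamma, K)
fast = factored_step(V, gamma, K)
print(max(abs(naive[s] - fast[s]) for s in states))  # largest discrepancy
```

The factoring is valid because all terms are non-negative, so a maximum of products can be taken one factor at a time.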

Overall Runtimes

Viterbi probability

Likelihoods of all 3N modified trios can be computed within $O(NK^5)$ time using forward-backward algorithm

Overall runtime for M trios: $O(MNK^5)$

Probability of Viterbi haplotypes

Obtain haplotypes from standard traceback, then compute haplotype probabilities using forward algorithm

Overall runtime: $O(M(NK^5 + N^2K))$

Total trio probability

Similar pre-computation speed-up & forward-backward algorithm

Overall runtime: $O(MNK^5)$


Datasets

Real dataset [Becker et al. 2006]: 35 SNP loci on chromosome 16 covering a region of 91kb; 551 trios

Synthetic datasets: 35 SNPs, 30-551 trios

Preserved missing data pattern of real dataset

Haplotypes assigned to trios based on frequencies inferred from real dataset

1% error rate, four error insertion models: random allele, random genotype, heterozygous-to-homozygous, homozygous-to-heterozygous

Experimental Setup

Two strategies for handling MIs: set all three individuals to unknown prior to error detection, or set child only to unknown (preserving parents’ original data)

Two testing strategies:

Test one SNP genotype: ViterbiProb-1, ViterbiHaps-1, TotalProb-1

Simultaneously test three SNP genotypes at the same locus: ViterbiProb-3, ViterbiHaps-3, TotalProb-3

Comparison with FAMHAP (Random Allele Errors)

[Figure: sensitivity vs. FP rate (log scale, 0.0001 to 1). Series: TrioProb-1, ViterbiHaps-1, ViterbiProb-1, FAMHAP-1, TrioProb-3, ViterbiHaps-3, ViterbiProb-3, FAMHAP-3.]

Children vs. Parents (Random Allele Errors)

[Figure: sensitivity vs. FP rate (0 to 0.025). Series: TrioProb-1-P, TrioProb-1-C.]

Error Model Comparison(TrioProb-1 Parents)

[Figure: sensitivity vs. FP rate (0 to 0.025). Series: Random Allele 1%, Random Genotype 1%, Heterozygous-to-Homozygous 1%, Homozygous-to-Heterozygous 1%.]

TrioProb-1 Results on Real Dataset

[Becker et al. 06] resequenced all trio members at 41 loci flagged by FAMHAP-3

23 SNP genotypes were identified as true errors

41*3-23=100 resequenced SNP genotypes agree with original calls

Predictive value for $R = 10^4$ is between 18/26=69% and 24/26=92%, compared to 23/41=56% for FAMHAP-3

                     Total Signals    True Positives   False Positives   Unknown
Threshold (log10 R)    2    3    4      2    3    4       2    3    4      2    3    4
Parents               80   15    9      9    9    8       2    1    1     69    5    0
Children              27   21   17     11   10   10       3    3    1     13    8    6
Total                107   36   26     20   19   18       5    4    2     82   13    6
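The predictive value range quoted above follows from the "Total" row at threshold 4, where 6 of the 26 signals could not be classified by resequencing. A short sketch of the arithmetic (the function name is chosen here for illustration):

```python
def predictive_value_bounds(true_pos: int, false_pos: int, unknown: int):
    """Bounds on predictive value when some signals are unverified."""
    total = true_pos + false_pos + unknown
    lower = true_pos / total               # every unknown signal is a false positive
    upper = (true_pos + unknown) / total   # every unknown signal is a true error
    return lower, upper


# Threshold 4, all trio members: 18 true positives, 2 false positives, 6 unknown.
low, high = predictive_value_bounds(18, 2, 6)
print(f"{low:.0%} to {high:.0%}")   # prints 69% to 92%

# FAMHAP-3: 23 true errors among 41 flagged loci.
print(f"{23 / 41:.0%}")             # prints 56%
```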

Pedigree Info vs. Sample Size Effect

[Figure: sensitivity vs. FP rate (0 to 0.03). Series: 551-TrioProb-1-T, 129-TrioProb-1-T, 30-TrioProb-1-T, 551-Unrelated-ViterbiProb-1.]

Unrelated vs. Trio Likelihood Sensitivity

[Figure: two histograms with a log-scale vertical axis (1 to 100000) and horizontal bins from 0.1 to >5, each split into "no error" and "error" genotypes. Panels: Unrelated ViterbiProb-1 likelihood ratios (children); Trio ViterbiProb-1 likelihood ratios (children).]

Combining Likelihood Functions (Children, Random Allele Model)

[Figure: curves with vertical axis from 0.5 to 1 and horizontal axis from 0 to 0.01. Series: FAMHAP, Unrelated, Duo, Trio, MinUT, MinDT, MinUDT, Majority, MinUD.]

Combining Likelihood Functions (Parents, Random Allele Model)

[Figure: curves with vertical axis from 0.5 to 1 and horizontal axis from 0 to 0.01. Series: FAMHAP, Unrelated, Duo, Trio, MinUT, MinDT, MinUDT, Majority, MinUD.]


Conclusion

Proposed efficient methods for error detection in trio genotype data based on an HMM model of haplotype diversity

Significantly improved detection accuracy compared to FAMHAP

High sensitivity even for very low FP rates

Runtime linear in #SNPs and #trios

Ongoing work

Iterative error detection: fix MIs using likelihood before error detection; correct errors with high likelihood ratio, then recompute likelihood ratios (possibly after re-phasing and HMM re-training)

Integration with genotype calling algorithms: combine low level intensity data with haplotype-based likelihoods; most useful when less pedigree info is available (unrelated, sibling pairs w/o parent genotypes, parents in trios)

Locus specific thresholds, p-values: via simulations similar to [Douglas et al. 00]

Questions?
