Heuristics for Phasing and Imputation
J. M. Hickey
www.alphagenes.roslin.ed.ac.uk
@hickeyjohn
Genotype imputation
• Why we do it
• How we do it
• How we measure it
• How it performs
Genomic selection
Why is genomic selection attractive?
• Directly addresses 3 of the 4 components of genetic gain
– Time - via generation interval
– Selection intensity - via cost
– Accuracy - via better analysis/use of data
– It does not directly address diversity
• Can do 2 things
– Mate selection
– Pre-breeding (could be powerful for sheep - at least in Ireland!!!)
Costs
• Training set phenotypes
– Different designs = different cost/benefits
• Genotyping
• Opportunity cost
– Competition will blow you out of the market
– Potentially includes competition from other species
Genotyping costs
• Two components
– Large training population
• Accuracy of breeding values
– Large numbers of selection candidates
• Accuracy/response to selection trade-off
• Genotype imputation
Genotype imputation
• General idea
– Genotype small number of individuals at high density
– Genotype lots of individuals at low density
– Recover information by imputation
– High density individuals
• Haplotypes resolved
– Low density individuals
• Combination of high density haplotypes determined
• Costs
– $120 for 50k
– DNA extraction $3
– 50 SNP $2
Haplotype flow
• 6 generations random mating followed by 3 generations selection
• 200 individuals per generation
• As we get further from the base
– The haplotypes get smaller
– More markers are needed
What an animal pedigree looks like
• 6 generations random mating followed by 3 generations selection
• 200 individuals per generation
[Figure: pedigree plotted by generation]
Dosage
• Diploid genomes
– Markers are AA, Aa, aA, or aa
– Label a=0 and A=1
– Thus the dosage is:
• AA=2
• Aa=1
• aA=1
• aa=0
Proband
MotherFather
10100111011100111001110011
00010011110010101100110011
01010111100011000110011010 10101110101111111111111110
10100111011100111001110011 00010011110010101100110011
What underlies a genotype?
10110122121110212101220022
11110222111111111111121021 10111121211121212211221121
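The relationship on this slide, that a genotype is the marker-by-marker sum of the two gametes an individual carries, can be sketched in a few lines of Python. The function name `genotype_from_gametes` is illustrative only and is not taken from AlphaPhase or AlphaImpute; the two strings are the proband's gametes as shown on the slide.

```python
def genotype_from_gametes(paternal: str, maternal: str) -> str:
    """Sum two 0/1 gamete strings into a 0/1/2 dosage string."""
    if len(paternal) != len(maternal):
        raise ValueError("both gametes must cover the same markers")
    return "".join(str(int(p) + int(m)) for p, m in zip(paternal, maternal))


# Proband gametes copied from the slide.
paternal = "10100111011100111001110011"
maternal = "00010011110010101100110011"

genotype = genotype_from_gametes(paternal, maternal)
print(genotype)
```

Phasing is the reverse problem: given only the dosage string, recover the two gamete strings.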
How we do it
• Two steps
• Phase the genotypes of the high density individuals and identify the haplotypes
• Choose which combination of these haplotypes is carried by the individuals genotyped at low density
– Underlying assumption is that haplotypes are preserved and recombinations can be modelled
• In other words
• My father is heterozygous: is the 0 on his paternal gamete or his maternal gamete?
• Did I receive my father's paternal gamete or maternal gamete at this location and at the next location?
– A 0 or a 1 from my father
The basic idea of imputation
General pedigree with its haplotypes represented
Segregation analysis and haplotype library imputation
• Individuals are densely, sparsely, or not genotyped
• Pedigree information available
• Single locus segregation analysis for each SNP
• Match each pair of haplotypes with low density genotypes and genotype probabilities
[Figure: pedigree diagram with numbered individuals]
Haplotype library for population
Genotyping strategy in terms of high density, low density and not genotyped
Phasing/Imputation algorithms
• Two groups of imputation algorithms
– Hidden Markov based models
– Heuristic methods
• Hidden Markov based models
– Probabilistic / pedigree free
– Model linkage disequilibrium/short haplotypes
– e.g. fastPHASE, Beagle, Impute2, Shape-IT, MaCH, minimac
• Heuristic methods
– Tend not to be probabilistic / use pedigree information
– Model linkage/long haplotypes
– e.g. AlphaPhase/AlphaImpute (long-range phasing and haplotype libraries), fimpute, findhap
• Combined methods
– Phasebook
– AlphaPhase
Proband
MotherFather
10100111011100111001110011
00010011110010101100110011
01010111100011000110011010 10101110101111111111111110
10100111011100111001110011 00010011110010101100110011
Phasing a Trio
Proband
MotherFather
10100111011100111001110011
00010011110010101100110011
01010011100011000110011010 10101110101111111111111110
10100111011100111001110011 00010011110010101100110011
Phasing a Trio
Cannot phase this locus!!
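The trio logic on these slides can be written down as a small rule set: a homozygous proband is phased trivially, a heterozygous proband is phased whenever at least one parent is homozygous, and a locus where all three are heterozygous cannot be phased from the trio alone. The sketch below is a minimal illustration using the genotype strings from the slides. The function name `phase_trio` is hypothetical, and the sketch assumes error-free, Mendelian-consistent genotypes.

```python
def phase_trio(proband: str, father: str, mother: str):
    """Phase a proband from parental genotypes; '.' marks an unphased locus."""
    pat, mat = [], []
    for g, f, m in zip(proband, father, mother):
        if g == "0":                       # homozygous 0: both gametes carry 0
            p, q = "0", "0"
        elif g == "2":                     # homozygous 1: both gametes carry 1
            p, q = "1", "1"
        elif f in ("0", "2"):              # father homozygous: his allele is known
            p = "0" if f == "0" else "1"
            q = "1" if p == "0" else "0"
        elif m in ("0", "2"):              # mother homozygous: her allele is known
            q = "0" if m == "0" else "1"
            p = "1" if q == "0" else "0"
        else:                              # all three heterozygous
            p, q = ".", "."
        pat.append(p)
        mat.append(q)
    return "".join(pat), "".join(mat)


proband = "10110122121110212101220022"
father = "11110222111111111111121021"
mother = "10111121211121212211221121"

pat_hap, mat_hap = phase_trio(proband, father, mother)
print(pat_hap)
print(mat_hap)
```

The loci left as '.' are exactly the ones that long-range phasing tries to resolve with surrogate parents.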
Proband
Mother:
Father:
Other:
11110222111111111111121021
**************************
10110122121110212101220022
10110122121110212101220022
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
10111121211121212211221121
**************************
10110122121110212101220022
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
20212220111121221100222220
****X**X**************XX*X
10110122121110212101220022
Genotype
Not a surrogate parent!
Surrogate parents are the driver of long range phasing
Proband
Mother:
Father:1010011101110011100111001100010011110010101100110011
01010111100011000110011010
10101110101111111111111110
10100111011100111001110011
00010011110010101100110011
1011111010101111110011111010101110010110110000111110
Other:
11110222111111111111121021
**************************
10110122121110212101220022
10110122121110212101220022
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
10111121211121212211221121
**************************
10110122121110212101220022
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
20212220111121221100222220
****X**X**************XX*X
10110122121110212101220022
Genotype
Not a surrogate parent!
Surrogate parents are the driver of long range phasing
Proband
Mother:
Father:1010011101110011100111001100010011110010101100110011
01010111100011000110011010
10101110101111111111111110
10100111011100111001110011
00010011110010101100110011
Surrogate parents are the driver of long range phasing
Other:
11110222111111111111121021
**************************
10110122121110212101220022
10110122121110212101220022
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
10111121211121212211221121
**************************
10110122121110212101220022
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
20201221112110212012210121
**************************
10110122121110212101220022
Genotype
10100111011100111001110011
10101110101010101011100110
A surrogate parent! (Even without pedigree information)
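The "Opp Homo" rows on these slides mark loci where the proband and a candidate are opposing homozygotes (one is 0, the other is 2). An individual with no opposing homozygotes over the region shares at least one haplotype with the proband and can act as a surrogate parent. A minimal sketch of that test, using the genotype strings from the slides, is given below. The function names and the `max_opposing` tolerance are illustrative assumptions; in practice the test is applied over chosen stretches of markers and must allow for genotyping error.

```python
def opposing_homozygotes(proband: str, candidate: str) -> str:
    """Return a mask with 'X' at opposing homozygous loci and '*' elsewhere."""
    return "".join(
        "X" if {a, b} == {"0", "2"} else "*"
        for a, b in zip(proband, candidate)
    )


def is_surrogate_parent(proband: str, candidate: str, max_opposing: int = 0) -> bool:
    """A candidate qualifies if it has at most max_opposing opposing homozygotes."""
    return opposing_homozygotes(proband, candidate).count("X") <= max_opposing


proband = "10110122121110212101220022"
not_surrogate = "20212220111121221100222220"
surrogate = "20201221112110212012210121"

print(opposing_homozygotes(proband, not_surrogate))
print(is_surrogate_parent(proband, not_surrogate))
print(is_surrogate_parent(proband, surrogate))
```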
Proband
MotherSurrogate Father
10100111011100111001110011
00010011110010101100110011
10101110101111111111111110
00010011110010101100110011
Phasing a Trio
Can now phase this locus!!
10100111011100111001110011
10101110101010101011100110
Could be a female
Could be a descendant
Could be many generations distant
Can be ‘unrelated’
Proband
MotherFather
00010011110010
01010011100011 10101110101111
Erdös 1 Surrogate Fathers
10100111011100
10101110101011
10101010101000
10101010101111
10101110010110
10111110101011
11000110111110
Erdös 1 Surrogate Mothers
10101110010110
10101010101111
Erdös 2 Surrogate Mothers
Potentially many meioses separating
Potentially many meioses separating
• Surrogate parents share a long haplotype with the proband.
• Erdös 1 surrogates are surrogates of the proband.
• Erdös n+1 surrogates are surrogates of Erdös n surrogates of the proband.
• Haplotype libraries for phasing and imputation
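The recursive definition of Erdös numbers above amounts to a breadth-first search over the "is a surrogate of" relation. The sketch below illustrates this with a made-up surrogate graph. The individual names and the `surrogates_of` dictionary are hypothetical, and the sketch ignores the requirement that surrogacy is defined for a particular stretch of markers.

```python
from collections import deque


def erdos_numbers(proband, surrogates_of):
    """Breadth-first search: Erdös 1 = surrogates of the proband, and so on."""
    numbers = {proband: 0}
    queue = deque([proband])
    while queue:
        current = queue.popleft()
        for surrogate in surrogates_of.get(current, ()):
            if surrogate not in numbers:          # first visit gives the smallest number
                numbers[surrogate] = numbers[current] + 1
                queue.append(surrogate)
    del numbers[proband]                          # the proband itself is not a surrogate
    return numbers


# Hypothetical surrogate relationships for one stretch of markers.
surrogates_of = {
    "proband": ["A", "B"],
    "A": ["proband", "C"],
    "B": ["A"],
    "C": ["D"],
}

print(erdos_numbers("proband", surrogates_of))
```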
Surrogate giving phase information
Surrogate giving phase information
LRP-HLI of AlphaPhase in a nutshell
2020122111211
1011012212111
1111022211111
10100111011100
10100111011100
10100111011100
10100111011100 00010011110010
00010011110010
00010011110010
00010011110010
11000110111110
10111110101011
Kong et al., 2008; Hickey et al., 2011
Haplotype library imputation
• Build library of all completely phased haplotypes
• Find haplotypes in the library which can explain an individual's genotype
• Low error rates
• Computationally fast
• Useful for extremely large data sets
– Strategic use
101001110111001110011100111010000001000010001111001
101100110011001110011100111010011101100100100111001
1010010101110011100111001
1010011100110011100111000
111110111011100111001110011010010000000011100111001
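Haplotype library imputation, as described on this slide, searches the library for pairs of haplotypes whose sum matches the individual's observed genotype. A minimal sketch follows, assuming an exhaustive search over all pairs and a '.' for ungenotyped markers. The function name `explaining_pairs` and the four-haplotype library (built from haplotypes that appear elsewhere in these slides) are illustrative, not the AlphaImpute implementation.

```python
from itertools import combinations_with_replacement


def explaining_pairs(genotype: str, library: list) -> list:
    """Return index pairs (i, j) whose haplotype sum matches every observed marker."""
    pairs = []
    for i, j in combinations_with_replacement(range(len(library)), 2):
        h1, h2 = library[i], library[j]
        if all(g == "." or int(a) + int(b) == int(g)
               for g, a, b in zip(genotype, h1, h2)):
            pairs.append((i, j))
    return pairs


library = [
    "10100111011100111001110011",
    "00010011110010101100110011",
    "10101110101111111111111110",
    "01010111100011000110011010",
]

low_density = "....0.2.....1......1.2..2."
high_density = "10110122121110212101220022"

print(explaining_pairs(low_density, library))
print(explaining_pairs(high_density, library))
```

With only six observed markers two pairs remain compatible, whereas the full genotype leaves a single pair; this is why pedigree and surrogate information is used to narrow the candidates at low density.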
Phasing results simulated data
Phasing results real data
The imputation problem for a 2.5k low density chip in pigs
What information do we have to solve this problem?
• Knowledge
• Low density genotypes
• Pedigree information
• Linkage
• Linkage disequilibrium
What knowledge do we have?
• Knowledge
– Inheritance is “chunkular”
– Chunks of DNA are inherited together
– Recombination events break these chunks up
– Approximately 70 recombination events during meiosis
– Thus approximately 150 chunks per animal
– WE CALL THESE CHUNKS HAPLOTYPES
• Pedigree information
• Linkage
– Family statistic
– Correlation between adjacent markers within a family
– Long haplotype information
• Linkage disequilibrium
– Population statistic
– Correlation between adjacent markers within a population
– Short haplotype information
How can we use low density information?
• We are trying to impute this individual
10110122121110212101220022
True genotype
..........................
....0..............1......
Low density genotype
How can we use pedigree information?
10110122121110212101220022
11110222111111111111121021 10111121211121212211221121
....0..............1......
True genotype
Low density genotype
MotherFather
....0.2............1.2..2.
How can we use pedigree information?
10110122121110212101220022
True genotype
Low density genotype
Three offspring
....0.2............1.2..2.
10220122121120212101120122
21110122121110212101220022
10120122121120212101121022
....0.2.....1......1.2..2.
How can we use linkage information?
• Two step procedure
– Phase
– Choose which parental haplotype was passed to offspring
• What is phasing?
– Phasing splits a string of SNP genotypes into its two gametes/haplotypes
– Genotypes represented as 0, 1, or 2
– Each parent contributes a 0 or a 1
– We use long range phasing to do this
1010011101110011100111001100010011110010101100110011
10110122121110212101220022
Paternal gamete
Maternal gamete
Genotype is the sum of the gametes
True phase
True genotype
How can we use linkage information?
10110122121110212101220022
True phase
Low density genotype
Mother Phased
Father Phased
.1..0.2.....1......1.2..2.
1010011101110011100111001100010011110010101100110011
01010111100011000110011010
10100111011100111001110011 0001001111001010110011001110101110101111111111111110
....0111....0........1.01.
.0..0.1.1...1.1.11..11..1.Individual Phased
True genotype
How can we use pedigree information?
10110122121110212101220022
True phase
Low density genotype
Mother Phased
Father Phased
.1..0.2.....1......1.2..2.
1010011101110011100111001100010011110010101100110011
01010111100011000110011010
10100111011100111001110011 0001001111001010110011001110101110101111111111111110
....0111....0........1.01.
.0..0.1.1...1.1.11..11..1.Individual Phased
True genotype
10100111011100111001110011
How can we use pedigree information?
10110122121110212101220022
Low density genotype
Mother Phased
Father Phased
.1..0.2.....1......1.2..2.
1010011101110011100111001100010011110010101100110011
01010111100011000110011010
10100111011100111001110011 0001001111001010110011001110101110101111111111111110
....0111....0........1.01.
.0..0.1.1...1.1.11.011..1.
Individual Phased
10100111011100111001110011
True phase
How can we use pedigree information?
10110122121110212101220022
Low density genotype
Mother Phased
Father Phased
.1..0.2.....1......1.2..2.
1010011101110011100111001100010011110010101100110011
01010111100011000110011010
10100111011100111001110011 0001001111001010110011001110101110101111111111111110
....0111....0........1.01.
.0..0.1.1...1.1.11.011..1.
Individual Phased
1010011101110011100111001100010011110010101100110011
AlphaImpute compared to Impute2.0
• Data set for comparison
– PIC single line test set
– 6473 animals in pedigree file
– 3200 genotyped at high density
– 509 genotyped at low density
• These are animals in the current generation
• We know their high density genotypes
• Different categories of animals (which ancestors genotyped)
– High density chip contains 4221 SNP on chromosome 1
– Low density
• 15% of SNP (approx. 7500 SNP density)
• 10% of SNP (approx. 5000 SNP density)
• 5% of SNP (approx. 2500 SNP density)
• 1% of SNP (approx. 500 SNP density)
Results PIC pig data set
                    0.5k LD               2.5k LD               5k LD                 7.5k LD
Category     Count  AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2
BothParents  51     0.98         0.77     0.99         0.92     1.00         0.96     1.00         0.96
SireMGS      62     0.93         0.80     0.98         0.92     0.99         0.94     0.99         0.96
DamPGS       47     0.96         0.79     0.98         0.92     0.99         0.95     0.99         0.96
Sire         45     0.89         0.78     0.97         0.92     0.99         0.95     0.99         0.97
Dam          13     0.90         0.76     0.96         0.89     0.98         0.93     0.98         0.95
Other        291    0.86         0.79     0.94         0.91     0.97         0.95     0.97         0.96
Correlation is the statistic that matters
The cost and accuracy of sensible strategies
nSires = 480
nDams = 11884
nCandidates = 100000
60k chip = $120
6k chip = $48
3k chip = $35
384 chip = $20
Scenario  Other  MGS+PGS  MGD+PGD  Sire  Dam  Candidates  Individual cost  Accuracy of imputation (R2)
SC1       60k    60k      0        60k   0    384         !                0.878
SC2       60k    60k      384      60k   384  384         $20.58           0.929
SC3       60k    60k      3k       60k   3k   384         $24.74           0.950
SC4       60k    60k      6k       60k   6k   384         $26.28           0.944
SC5       60k    60k      60k      60k   60k  384         $34.84           0.964
SC6       60k    60k      0        60k   0    3k          !                0.968
SC7       60k    60k      384      60k   384  3k          !                0.972
SC8       60k    60k      3k       60k   3k   3k          $35.58           0.984
SC9       60k    60k      6k       60k   6k   3k          $41.28           0.983
SC10      60k    60k      60k      60k   60k  3k          $49.84           0.993
SC11      60k    60k      0        60k   0    6k          !                0.982
SC12      60k    60k      384      60k   384  6k          !                0.983
SC13      60k    60k      3k       60k   3k   6k          !                0.986
SC14      60k    60k      6k       60k   6k   6k          $48.58           0.991
SC15      60k    60k      60k      60k   60k  6k          $62.84           0.996
SC16      60k    60k      60k      60k   60k  60k         $120.00          1.000
Imputation and GEBV accuracy
    Genotyping scenario                           Imputation accuracy
    Other  PGS+MGS  PGD+MGD  Sire  Dam  Progeny   450   3k    6k
    4220   74       108      70    107  184
S1  H      H        H        H     H    L         0.97  0.99  1.00
S2  H      0        0        H     H    L         0.95  0.98  0.99
S3  H      H        0        H     0    L         0.91  0.97  0.98
S4  H      H        L        H     L    L         0.94  0.99  0.99
                Genotyping scenario                           Imputed gEBV accuracy
    N HD Geno.  Other  PGS+MGS  PGD+MGD  Sire  Dam  Progeny   450   3k    6k
S1  2519        H      H        H        H     H    L         0.94  0.97  0.97
S2  2344        H      0        0        H     H    L         0.89  0.95  0.96
S3  2318        H      H        0        H     0    L         0.87  0.92  0.93
S4  2318        H      H        L        H     L    L         0.90  0.96  0.97
HMM - Weather example
• I am locked in an office without any windows
• I want to predict what the weather is each day
• Each day my office mate “Andreas” comes
– But we don’t talk
• Can I extract information from the behavior of Andreas?
– Andreas likes ice-cream
– He eats a different number of ice-creams each day
– Could I use that to predict the weather outside?
Reality underlying the data
• Data
– For 30 days I record the number of ice-creams Andreas eats
• “Biological” knowledge
– Just two weather states (Sunny or Cloudy)
– Weather today is correlated with weather tomorrow
– Correlation dissipates over time
Weather example
• Hidden Markov process
• Markov
• Hidden
– See ice-cream, but really modeling the weather
In the HMM
• Parameters
– Transition matrix
– Emission probabilities
– Output probabilities
• Algorithms
– Expectation Maximisation
– Forward/backward
– Viterbi
Weather example
• States
– k1 = Cloudy
– k2 = Sunny
• Time = days
• Transition matrix = A
        Cloudy  Sunny
Cloudy  0.9     0.1
Sunny   0.7     0.3
Starting values
n Ice-creams     1    2    3
Cloudy emission  0.7  0.2  0.1
Sunny emission   0.1  0.2  0.7

        Cloudy  Sunny
Cloudy  0.5     0.5
Sunny   0.5     0.5
Back to the ice-cream
• Data
Day  n Ice-creams
1    2
2    3
3    3
4    2
5    3
6    2
7    3
8    2
9    2
10   3
11   1
12   3
13   3
14   1
15   1
16   1
17   2
18   1
19   1
20   1
21   2
22   1
23   1
24   1
25   2
26   3
27   3
28   2
29   3
30   2
Results
Day  n Ice-creams  ProbC  ProbS
1    2             0.06   0.94
2    3             0.00   1.00
3    3             0.00   1.00
4    2             0.00   1.00
5    3             0.00   1.00
6    2             0.00   1.00
7    3             0.00   1.00
8    2             0.01   0.99
9    2             0.01   0.99
10   3             0.00   1.00
11   1             0.10   0.90
12   3             0.00   1.00
13   3             0.00   1.00
14   1             0.92   0.08
15   1             0.99   0.01
16   1             1.00   0.00
17   2             0.98   0.02
18   1             1.00   0.00
19   1             1.00   0.00
20   1             1.00   0.00
21   2             0.98   0.02
22   1             1.00   0.00
23   1             0.99   0.01
24   1             0.95   0.05
25   2             0.33   0.67
26   3             0.00   1.00
27   3             0.00   1.00
28   2             0.00   1.00
29   3             0.00   1.00
30   2             0.04   0.96
Parameters
n Ice-creams     1     2     3
Cloudy emission  0.79  0.21  0.00
Sunny emission   0.06  0.37  0.57

        Cloudy  Sunny
Cloudy  0.89    0.11
Sunny   0.07    0.93
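The ProbC/ProbS columns are posterior state probabilities of the kind produced by the forward/backward algorithm. The sketch below runs a normalised forward/backward pass over the 30 days of ice-cream counts using the fitted transition and emission values from this slide. It assumes a uniform starting distribution over Cloudy and Sunny (the slides do not state one) and is not claimed to reproduce the tabulated values digit for digit; the function name `forward_backward` is illustrative.

```python
def forward_backward(obs, start, trans, emit):
    """Posterior state probabilities for each time point (normalised passes)."""
    n = len(start)
    fwd, prev = [], None
    for t, o in enumerate(obs):
        if t == 0:
            cur = [start[k] * emit[k][o] for k in range(n)]
        else:
            cur = [emit[k][o] * sum(prev[j] * trans[j][k] for j in range(n))
                   for k in range(n)]
        total = sum(cur)
        prev = [c / total for c in cur]           # rescale to avoid underflow
        fwd.append(prev)
    bwd = [[1.0] * n for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        o_next = obs[t + 1]
        cur = [sum(trans[k][j] * emit[j][o_next] * bwd[t + 1][j] for j in range(n))
               for k in range(n)]
        total = sum(cur)
        bwd[t] = [c / total for c in cur]
    post = []
    for f, b in zip(fwd, bwd):
        w = [fi * bi for fi, bi in zip(f, b)]
        total = sum(w)
        post.append([x / total for x in w])
    return post


# State 0 = Cloudy, state 1 = Sunny; values taken from the Parameters slide.
trans = [[0.89, 0.11], [0.07, 0.93]]
emit = [{1: 0.79, 2: 0.21, 3: 0.00}, {1: 0.06, 2: 0.37, 3: 0.57}]
start = [0.5, 0.5]                                # assumption: uniform start
ice_creams = [2, 3, 3, 2, 3, 2, 3, 2, 2, 3, 1, 3, 3, 1, 1,
              1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 3, 3, 2, 3, 2]

posterior = forward_backward(ice_creams, start, trans, emit)
for day, (prob_c, prob_s) in enumerate(posterior, start=1):
    print(day, round(prob_c, 2), round(prob_s, 2))
```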
How is this imputation?
Day  n Ice-creams (day 15 masked)  ProbC  ProbS  n Ice-creams (full data)
1    2                             0.06   0.94   2
2    3                             0.00   1.00   3
3    3                             0.00   1.00   3
4    2                             0.00   1.00   2
5    3                             0.00   1.00   3
6    2                             0.00   1.00   2
7    3                             0.00   1.00   3
8    2                             0.01   0.99   2
9    2                             0.01   0.99   2
10   3                             0.00   1.00   3
11   1                             0.10   0.90   1
12   3                             0.00   1.00   3
13   3                             0.00   1.00   3
14   1                             0.92   0.08   1
15   #                             0.99   0.01   1
16   1                             1.00   0.00   1
17   2                             0.98   0.02   2
18   1                             1.00   0.00   1
19   1                             1.00   0.00   1
20   1                             1.00   0.00   1
21   2                             0.98   0.02   2
22   1                             1.00   0.00   1
23   1                             0.99   0.01   1
24   1                             0.95   0.05   1
25   2                             0.33   0.67   2
26   3                             0.00   1.00   3
27   3                             0.00   1.00   3
28   2                             0.00   1.00   2
29   3                             0.00   1.00   3
30   2                             0.04   0.96   2
Imputation
n Ice-creams     1     2     3
Cloudy emission  0.79  0.21  0.00
Sunny emission   0.06  0.37  0.57

Day  ProbC  ProbS
15   0.99   0.01
Expected number of ice-creams Andreas eats on day 15 =
0.99 × [(0.79×1) + (0.21×2) + (0.00×3)] + 0.01 × [(0.06×1) + (0.37×2) + (0.57×3)]
= 1.22
True value = 1 ice-cream on day 15
Predict on the basis of the hidden states
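The day 15 calculation is a weighted average: each state's expected ice-cream count is weighted by the posterior probability of that state. A minimal sketch using the numbers from the slide is given below; the function name `expected_emission` is illustrative.

```python
def expected_emission(state_probs, emissions):
    """Average the per-state expected value, weighted by state probability."""
    return sum(
        prob * sum(value * p for value, p in emissions[state].items())
        for state, prob in state_probs.items()
    )


emissions = {
    "Cloudy": {1: 0.79, 2: 0.21, 3: 0.00},
    "Sunny": {1: 0.06, 2: 0.37, 3: 0.57},
}
day_15 = {"Cloudy": 0.99, "Sunny": 0.01}

imputed = expected_emission(day_15, emissions)
print(round(imputed, 2))
```

The same weighting, with founder haplotypes in place of weather states and allele frequencies in place of ice-cream counts, gives an imputed allele dosage.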
Now put in the genetics
– Alleles (A=0, a=1)
– Genotype data (0, 1, or 2)
10100111011100111001110011
11110222111111111111121021
01010111100011000110011010
The phasing problem – split diplotype into a pair of haplotypes
The founder haplotype mosaic problem
HMM for Genetics
• fastPHASE
• Beagle
• MaCH
• Impute2
• AlphaPhase
• Haplotyping and imputation
fastPHASE
• The model
– Haploid gametes underlie diploid genotypes
• Hidden Markov process
– Haploid gametes in present population derived from ancient founder haplotypes
• Hidden process
– Founder haplotypes can be considered to be cluster medoids
– fastPHASE can be considered to be a mixture model
The genetics
• Alleles are correlated along the haploid gametes
– Closer alleles are more correlated
• fastPHASE is an IBD probability model
– Each allele of each gamete has a probability of deriving from each founder haplotype
• With this information we can do lots of things
– Phase
– Impute
– Build genomic relationship matrices
30 animal example
          M1  M2  M3  M4  M5  M6  M7  M8  M9  M10
Founder1  0   0   0   0   0   0   0   0   0   0
Founder2  1   1   1   1   1   1   1   1   1   1
Founder3  1   0   1   0   1   0   1   0   1   0
Founder4  0   1   0   1   0   1   0   1   0   1

    M1  M2  M3  M4  M5  M6  M7  M8  M9  M10
1   1   1   2   2   2   2   2   2   2   1
2   0   0   1   1   1   1   1   1   1   0
3   0   0   0   0   0   0   0   0   0   0
4   0   1   0   1   0   1   0   0   0   0
5   0   2   0   2   0   2   0   1   0   1
6   2   2   2   2   2   2   2   2   1   1
7   0   1   0   1   0   1   0   1   0   1
8   0   2   0   2   0   2   0   1   0   0
9   0   2   0   1   0   1   0   0   0   0
10  0   1   1   1   1   1   1   1   1   0
11  0   1   0   1   0   1   0   1   0   1
12  0   1   0   1   0   1   0   0   0   0
13  1   1   1   1   1   1   1   1   1   1
14  0   1   0   2   0   1   0   0   0   0
15  0   1   0   1   0   1   0   0   0   0
16  0   0   0   0   0   0   0   0   0   0
17  1   1   1   1   1   1   1   1   0   0
18  0   0   0   0   0   0   0   0   0   0
19  1   2   1   2   1   2   1   2   1   1
20  0   1   0   1   0   1   0   1   0   1
21  1   2   1   2   1   2   1   2   0   1
22  0   0   0   1   0   1   0   1   0   1
23  1   2   1   2   1   2   0   1   0   0
24  0   1   0   0   0   2   1   1   0   0
25  2   2   2   2   2   2   1   2   1   2
26  1   1   1   1   1   1   1   1   0   0
27  0   1   0   1   0   1   1   2   0   1
28  0   0   0   0   0   0   0   0   0   0
29  0   0   0   0   0   0   0   1   0   1
30  0   2   0   2   0   2   0   1   0   0

   M1  M2  M3  M4  M5  M6  M7  M8  M9  M10
1  0   0   1   1   1   1   1   1   1   1
1  1   1   1   1   1   1   1   1   1   0
2  0   0   0   0   0   0   0   0   0   0
2  0   0   1   1   1   1   1   1   1   0
3  0   0   0   0   0   0   0   0   0   0
3  0   0   0   0   0   0   0   0   0   0
4  0   1   0   1   0   1   0   0   0   0
4  0   0   0   0   0   0   0   0   0   0
The parameters
• Slightly different to standard HMM
– α, r, Θ
• α and r are partitions of the transition matrix
– r = recombination rate between markers
• i.e. probability of a transition happening
– α is the frequency of each founder haplotype at each marker
• i.e. given there is a transition, where do I transition to
• Θ = the emission probability
• The frequency of allele 1 at position k in founder haplotype j
– (Emit a 1 as opposed to a 0)
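One way to read the statement that α and r partition the transition matrix: with probability 1 - r a gamete stays on its current founder haplotype, and with probability r it jumps, landing on founder j with probability α_j. The sketch below builds the transition matrix between two adjacent markers under that reading. It follows the verbal description on this slide and is not claimed to match the exact parameterisation of fastPHASE or any other program; the function name is illustrative, and the numbers are the first R value and the second Alpha column from the HMM parameters slide.

```python
def transition_matrix(r, alpha_next):
    """Stay with probability 1 - r; otherwise jump to founder j with probability alpha_next[j]."""
    k = len(alpha_next)
    return [
        [(1.0 - r) * (1.0 if i == j else 0.0) + r * alpha_next[j] for j in range(k)]
        for i in range(k)
    ]


r_m1_m2 = 0.0004                      # first entry of R
alpha_m2 = [0.01, 0.00, 0.97, 0.02]   # Alpha column for marker M2

T = transition_matrix(r_m1_m2, alpha_m2)
for row in T:
    print([round(x, 6) for x in row])
```

Because a jump may land back on the same founder, the diagonal entries are slightly larger than 1 - r.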
HMM parameters
Theta
0.00  0.80  0.00  0.83  0.00  0.80  0.17  0.94  0.00  0.70
0.79  0.86  1.00  1.00  1.00  1.00  0.86  1.00  0.57  0.50
0.00  0.04  0.00  0.04  0.00  0.07  0.00  0.04  0.00  0.00
0.00  0.99  0.00  0.96  0.00  1.00  0.03  0.24  0.00  0.01

Alpha
0.24  0.01  0.00  0.00  0.14  0.18  0.88  0.90  0.07  0.06
0.25  0.00  0.09  0.00  0.00  0.00  0.00  0.04  0.00  0.11
0.28  0.97  0.90  0.99  0.61  0.10  0.02  0.01  0.27  0.26
0.23  0.02  0.01  0.01  0.25  0.72  0.10  0.05  0.66  0.58

R
0.0004  0.0004  0.0005  0.0001  0.0004  0.0017  0.0035  0.0004  0.0002
Ancestral haplotypes
Theta
F4  0.00  0.80  0.00  0.83  0.00  0.80  0.17  0.94  0.00  0.70
F2  0.79  0.86  1.00  1.00  1.00  1.00  0.86  1.00  0.57  0.50
F1  0.00  0.04  0.00  0.04  0.00  0.07  0.00  0.04  0.00  0.00
F4  0.00  0.99  0.00  0.96  0.00  1.00  0.03  0.24  0.00  0.01
M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
Founder1 0 0 0 0 0 0 0 0 0 0
Founder2 1 1 1 1 1 1 1 1 1 1
Founder3 1 0 1 0 1 0 1 0 1 0
Founder4 0 1 0 1 0 1 0 1 0 1
Impute missing marker
• Combine output probabilities with the parameters of the model– With Theta – the emission probabilities
29  1  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
29  2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
29  3  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
29  4  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

29  1  0.79  0.79  0.79  0.79  0.79  0.79  0.85  0.98  0.98  0.98
29  2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.01  0.01  0.01
29  3  0.21  0.21  0.21  0.21  0.21  0.21  0.15  0.01  0.01  0.01
29  4  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

F4  0.00  0.80  0.00  0.83  0.00  0.80  0.17  0.94  0.00  0.70
F2  0.79  0.86  1.00  1.00  1.00  1.00  0.86  1.00  0.57  0.50
F1  0.00  0.04  0.00  0.04  0.00  0.07  0.00  0.04  0.00  0.00
F4  0.00  0.99  0.00  0.96  0.00  1.00  0.03  0.24  0.00  0.01

29  0  0  0  0  0  0  0  1  0  1
• Genotype of individual 29 at marker 1
• Gamete 1 comes from founder 3
• Gamete 2 is a combination of founders 1 and 3
• Founders 1 and 3 emit a 0
• True genotype is a 0
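Combining the output probabilities with Theta works the same way as the ice-cream calculation: for each gamete, weight each founder's probability of emitting a 1 by the probability that the gamete derives from that founder, then add the two gametes. The sketch below does this for individual 29 using the tables on this slide, taking the first probability block as gamete 1 and the second as gamete 2, with rows in the same order as the Theta rows. The function name `expected_dosage` is illustrative.

```python
# Rows follow the order of the Theta table on the slide; columns are M1..M10.
theta = [
    [0.00, 0.80, 0.00, 0.83, 0.00, 0.80, 0.17, 0.94, 0.00, 0.70],
    [0.79, 0.86, 1.00, 1.00, 1.00, 1.00, 0.86, 1.00, 0.57, 0.50],
    [0.00, 0.04, 0.00, 0.04, 0.00, 0.07, 0.00, 0.04, 0.00, 0.00],
    [0.00, 0.99, 0.00, 0.96, 0.00, 1.00, 0.03, 0.24, 0.00, 0.01],
]

# Probability that each gamete of individual 29 derives from each founder.
gamete1 = [[0.0] * 10, [0.0] * 10, [1.0] * 10, [0.0] * 10]
gamete2 = [
    [0.79] * 6 + [0.85, 0.98, 0.98, 0.98],
    [0.00] * 7 + [0.01, 0.01, 0.01],
    [0.21] * 6 + [0.15, 0.01, 0.01, 0.01],
    [0.00] * 10,
]


def expected_dosage(marker: int) -> float:
    """Expected count of allele 1 at a marker (0-based index), summed over both gametes."""
    return sum(
        gamete[k][marker] * theta[k][marker]
        for gamete in (gamete1, gamete2)
        for k in range(len(theta))
    )


print(expected_dosage(0))            # marker M1
print(round(expected_dosage(7), 4))  # marker M8
```

For M1 the expected dosage is 0, matching the true genotype of 0; for M8 it is close to 1, matching the observed genotype of 1.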
How we measure it
• Correlation between imputed dosage and true genotypes
– R-squared
• Correlation within a family
– Relates to the accuracy of imputing the Mendelian sampling term
• Don’t use the percentage of genotypes correctly imputed
How we measure it
• Percentage correct is bad for SNP
– And will be worse for sequence data
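Why percentage correct misleads can be seen with a rare allele. In the made-up example below, 5 of 100 individuals carry one copy of the minor allele. Naive imputation that always assigns the common homozygote scores 95% correct but carries no information, while an imputation that finds 4 of the 5 carriers at the cost of 6 false positives scores lower on percentage correct yet has a clearly positive correlation with the truth. All data and function names are illustrative; the correlation is returned as 0 when either vector has no variance, which is a convention chosen for this sketch.

```python
from math import sqrt


def percent_correct(true, imputed):
    """Fraction of genotypes whose rounded imputed value equals the true value."""
    return sum(t == round(i) for t, i in zip(true, imputed)) / len(true)


def correlation(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


true = [1] * 5 + [0] * 95                     # 5 carriers of a rare allele
naive = [0] * 100                             # always impute the common genotype
informed = [1] * 4 + [0] + [1] * 6 + [0] * 89   # 4 carriers found, 6 false positives

print(percent_correct(true, naive), correlation(true, naive))
print(percent_correct(true, informed), round(correlation(true, informed), 2))
```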
658 WWW.CROPS.ORG CROP SCIENCE, VOL. 52, MARCH–APRIL 2012
0.32 for LDP-98.5% while it only marginally increased from 0.87 for LDP-75% to 0.90 for LDP-50%.
Effect of Minor Allele Frequency on Accuracy of Imputation
We classify genotypes into groups of MAF according to the following intervals: [0,0.025], [0.025,0.05], [0.05,0.075], [0.075,0.1], [0.1,0.2], [0.2,0.3], [0.3,0.4], and [0.4,0.5]. For each of these groups, we computed the average MAF, %Correct, and rT,I in each of the imputation scenarios. Results are graphically displayed in Fig. 2 with %Correct and rT,I showing opposite patterns. When the MAF is very small, missing genotypes are almost certain to be homozygous for the common allele and therefore %Correct is almost 100; this will happen even if one uses a naive imputation method (e.g., replacing missing genotypes with the most likely genotype). On the other hand, the prior uncertainty about missing genotypes increases with MAF and therefore, %Correct decreases with increasing MAF. However, as stated, %Correct does not measure how much is gained with the imputation algorithm over naive imputation procedures.
On the other hand, the rT,I measures how much is gained relative to naive imputation procedures and our results indicate that, with very low-density platforms, rT,I is lower for genotypes with extreme allele frequency. This occurs because lines carrying alleles that are present with very low frequency in the population share haplotypes for those genotypes with only a very small number of lines in the population. Therefore, imputation accuracy of those genotypes is poorer, especially when large segments of the chromosome are not genotyped, as it happens in LH scenarios where the LDP contain only 1% of the genotypes of the HDP. This effect of allele frequency on rT,I was apparent only for genotypes with MAF < 0.1 and in LDP with more than 84% of masked genotypes.
Patterns of Local Linkage Disequilibrium and Its Effect on Imputation Accuracy
For every genotype in our HDP we quantified local LD by means of R^2_EM. The distribution of this statistic was bimodal with most genotypes showing extremely low levels of local LD (53.8% of the genotypes had an estimated R^2_EM with adjacent genotypes smaller than 0.1) and many genotypes showing extremely high levels of LD with adjacent genotypes (10.6% of genotypes had R^2_EM with adjacent genotypes greater than 0.9). In the range defined by R^2_EM in [0.1, 0.9] the distribution of genotypes was rather uniform.
The relatively large proportion of genotypes having extremely low LD with adjacent genotypes is likely caused by a combination of several factors, including the varying genotype density across regions of the genome, the varying levels of LD across the maize genome, the complexity of our dataset, which includes lines from different populations and therefore may give rise to complex LD patterns, and finally, some genotypes having extremely low local LD simply because of mapping errors. Regardless of what was the relative importance of each of these factors, the extremely low levels of LD observed for a large proportion of genotypes in our dataset is certainly limiting the level of imputation accuracy that one can achieve with this dataset.
Using R^2_EM we classify genotypes into groups of R^2_EM and computed the average rT,I by level of R^2_EM for each
Figure 2. Average proportion of genotypes correctly imputed (left panel) and correlation between imputed and true genotype (right panel) versus range of minor allele frequency of the marker being imputed. The vertical dashed lines give the boundaries of minor allele frequency used to group markers. RAND-5%, a scenario where 5% of the markers were randomly chosen, masked, and subsequently imputed; LDP-x%, a LDP with x% of the genotypes masked.