
Heuristics for Phasing and Imputation

J. M. Hickey

www.alphagenes.roslin.ed.ac.uk

@hickeyjohn

Genotype imputation

• Why we do it

• How we do it

• How we measure it

• How it performs

Genomic selection

Why is genomic selection attractive?

• Directly addresses 3 of the 4 components of genetic gain
  – Time: via generation interval
  – Selection intensity: via cost
  – Accuracy: via better analysis/use of data
  – It does not directly address diversity
• Can do 2 things

– Mate selection

– Pre-breeding (could be powerful for sheep - at least in Ireland!!!)

Costs

• Training set phenotypes
  – Different designs = different cost/benefits
• Genotyping
• Opportunity cost
  – Competition will blow you out of the market
  – Potentially includes competition from other species

Genotyping costs

• Two components

  – Large training population
    • Accuracy of breeding values
  – Large numbers of selection candidates
    • Accuracy/response to selection trade-off

• Genotype imputation

Genotype imputation

• General idea
  – Genotype small number of individuals at high density
  – Genotype lots of individuals at low density
  – Recover information by imputation
  – High density individuals
    • Haplotypes resolved
  – Low density individuals
    • Combination of high density haplotypes determined
• Costs
  – $120 for 50k
  – DNA extraction $3
  – 50 SNP $2

Haplotype flow

• 6 generations random mating followed by 3 generations selection
• 200 individuals per generation
• As we get further from the base
  • The haplotypes get smaller
  • The more markers needed

What an animal pedigree looks like

• 6 generations random mating followed by 3 generations selection

• 200 individuals per generation

[Figure: pedigree drawn by generation; axis label: Generations]

Dosage

• Diploid genomes

– Markers are AA, Aa, aA, or aa

– Label a=0 and A=1

– Thus the dosage is:
  • AA=2
  • Aa=1
  • aA=1
  • aa=0
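The dosage coding above can be sketched in a few lines of Python. This is only an illustration: the function name `dosage` and the lookup table are assumptions made for the example, not part of any package named in these slides.

```python
# Dosage coding as on the slide: label a=0 and A=1, then count A alleles.
ALLELE_VALUE = {"a": 0, "A": 1}


def dosage(genotype: str) -> int:
    """Return the number of A alleles in a two-letter genotype such as 'Aa'."""
    if len(genotype) != 2:
        raise ValueError("a diploid genotype has exactly two alleles")
    return sum(ALLELE_VALUE[allele] for allele in genotype)


for label in ("AA", "Aa", "aA", "aa"):
    print(label, dosage(label))
```

Aa and aA give the same dosage, which is why the genotype alone does not tell us which parent contributed which allele.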

Proband

Mother   Father

10100111011100111001110011

00010011110010101100110011

01010111100011000110011010 10101110101111111111111110

10100111011100111001110011 00010011110010101100110011

What underlies a genotype?

10110122121110212101220022

11110222111111111111121021 10111121211121212211221121

How we do it

• Two steps
  • Phase the genotypes of the high density individuals and identify the haplotypes
  • Choose which combination of these haplotypes is carried by the individuals genotyped at low density
    – Underlying assumption is that haplotypes are preserved and recombinations can be modelled
• In other words
  • My father is heterozygous: is the 0 on his paternal gamete or his maternal gamete?
  • Did I receive my father's paternal gamete or maternal gamete at this location and at the next location?
    – That is, did I get a 0 or a 1 from my father?

The basic idea of imputation

General pedigree with its haplotypes represented

Segregation analysis and haplotype library imputation

• Individuals are densely, sparsely, or not genotyped

• Pedigree information available

• Single locus segregation analysis for each SNP

• Match each pair of haplotypes with low density genotypes and genotype probabilities

[Figure: example pedigree with numbered individuals]

Haplotype library for population

Genotyping strategy in terms of high density, low density and not genotyped

Phasing/Imputation algorithms

• Two groups of imputation algorithms
  • Hidden Markov based models
  • Heuristic methods
• Hidden Markov based models
  • Probabilistic / pedigree free
  • Model linkage disequilibrium/short haplotypes
  • e.g. fastPHASE, Beagle, Impute2, SHAPEIT, MaCH, minimac
• Heuristic methods
  • Tend not to be probabilistic / use pedigree information
  • Model linkage/long haplotypes
  • e.g. AlphaPhase/AlphaImpute (long-range phasing and haplotype libraries), FImpute, findhap
• Combined methods
  • Phasebook
  • AlphaPhase

Proband

Mother   Father

10100111011100111001110011

00010011110010101100110011

01010111100011000110011010 10101110101111111111111110

10100111011100111001110011 00010011110010101100110011

Phasing a Trio

Proband

Mother   Father

10100111011100111001110011

00010011110010101100110011

01010011100011000110011010 10101110101111111111111110

10100111011100111001110011 00010011110010101100110011

Phasing a Trio

Cannot phase this locus!!
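A minimal sketch of trio phasing, assuming only the three genotypes are known and ignoring genotyping errors. The genotype strings are the sums of the haplotypes shown on the trio slides; `phase_trio` and the variable names are hypothetical, chosen for this illustration. A heterozygous locus in the proband can be phased whenever at least one parent is homozygous; when all three are heterozygous the locus stays unresolved.

```python
def to_ints(genotype: str) -> list:
    """Turn a string of 0/1/2 dosages into a list of integers."""
    return [int(c) for c in genotype]


def phase_trio(child, mother, father):
    """Return (maternal, paternal) alleles per locus; None where phase is unresolved."""
    maternal, paternal = [], []
    for c, m, f in zip(child, mother, father):
        if c == 0:                 # homozygous 0: both gametes carry 0
            mat, pat = 0, 0
        elif c == 2:               # homozygous 1: both gametes carry 1
            mat, pat = 1, 1
        elif m in (0, 2):          # homozygous mother fixes the maternal allele
            mat = m // 2
            pat = 1 - mat
        elif f in (0, 2):          # homozygous father fixes the paternal allele
            pat = f // 2
            mat = 1 - pat
        else:                      # all three heterozygous: cannot phase
            mat = pat = None
        maternal.append(mat)
        paternal.append(pat)
    return maternal, paternal


proband = to_ints("10110122121110212101220022")
mother = to_ints("11110222111111111111121021")
father = to_ints("10111121211121212211221121")
# Second slide: the mother carries a 0 on her other haplotype at marker 6,
# so she is heterozygous there.
mother_het = to_ints("11110122111111111111121021")

mat, pat = phase_trio(proband, mother, father)
mat2, pat2 = phase_trio(proband, mother_het, father)
print("marker 6, first slide :", mat[5], pat[5])
print("marker 6, second slide:", mat2[5], pat2[5])
```

Several other markers are also unresolved from the trio alone in this data, which is the gap that surrogate parents and long-range phasing are meant to fill.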

Proband
Genotype:  10110122121110212101220022

Mother:
Genotype:  11110222111111111111121021
Proband G: 10110122121110212101220022
Opp Homo:  **************************

Father:
Genotype:  10111121211121212211221121
Proband G: 10110122121110212101220022
Opp Homo:  **************************

Other:
Genotype:  20212220111121221100222220
Proband G: 10110122121110212101220022
Opp Homo:  ****X**X**************XX*X

Not a surrogate parent!

Surrogate parents are the driver of long range phasing

Proband
Haplotypes: 10100111011100111001110011
            00010011110010101100110011
Genotype:   10110122121110212101220022

Mother:
Haplotypes: 01010111100011000110011010
            10100111011100111001110011
Genotype:   11110222111111111111121021
Proband G:  10110122121110212101220022
Opp Homo:   **************************

Father:
Haplotypes: 10101110101111111111111110
            00010011110010101100110011
Genotype:   10111121211121212211221121
Proband G:  10110122121110212101220022
Opp Homo:   **************************

Other:
Haplotypes: 10111110101011111100111110
            10101110010110110000111110
Genotype:   20212220111121221100222220
Proband G:  10110122121110212101220022
Opp Homo:   ****X**X**************XX*X

Not a surrogate parent!


Proband
Haplotypes: 10100111011100111001110011
            00010011110010101100110011
Genotype:   10110122121110212101220022

Mother:
Haplotypes: 01010111100011000110011010
            10100111011100111001110011
Genotype:   11110222111111111111121021
Proband G:  10110122121110212101220022
Opp Homo:   **************************

Father:
Haplotypes: 10101110101111111111111110
            00010011110010101100110011
Genotype:   10111121211121212211221121
Proband G:  10110122121110212101220022
Opp Homo:   **************************

Other:
Haplotypes: 10100111011100111001110011
            10101110101010101011100110
Genotype:   20201221112110212012210121
Proband G:  10110122121110212101220022
Opp Homo:   **************************

A surrogate parent! (Even without pedigree information)
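The "Opp Homo" rows can be reproduced with a short sketch. It assumes, for illustration only, that a surrogate parent is any individual with no opposing homozygotes against the proband across the stretch of markers considered; the function names and the `max_opposing` threshold are hypothetical, and real software may tolerate a small number of mismatches to allow for genotyping error.

```python
def opposing_homozygotes(genotype_a: str, genotype_b: str) -> list:
    """Return 0-based marker indices where one individual is 0 and the other is 2."""
    return [
        i
        for i, (a, b) in enumerate(zip(genotype_a, genotype_b))
        if {a, b} == {"0", "2"}
    ]


def is_surrogate(genotype_a: str, genotype_b: str, max_opposing: int = 0) -> bool:
    """Hypothetical rule: surrogate if opposing homozygotes do not exceed the threshold."""
    return len(opposing_homozygotes(genotype_a, genotype_b)) <= max_opposing


proband = "10110122121110212101220022"
other_1 = "20212220111121221100222220"   # the "Not a surrogate parent!" individual
other_2 = "20201221112110212012210121"   # the "A surrogate parent!" individual

print(opposing_homozygotes(proband, other_1), is_surrogate(proband, other_1))
print(opposing_homozygotes(proband, other_2), is_surrogate(proband, other_2))
```

Two individuals with no opposing homozygotes over a long stretch can share a haplotype there, and no pedigree is needed to check this.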

Proband

Mother   Surrogate Father

10100111011100111001110011

00010011110010101100110011

10101110101111111111111110

00010011110010101100110011

Phasing a Trio

Can now phase this locus!!

10100111011100111001110011

10101110101010101011100110

• Could be a female
• Could be a descendant
• Could be many generations distant
• Can be ‘unrelated’

Proband

Mother   Father

00010011110010

01010011100011 10101110101111

Erdös 1 Surrogate Fathers

10100111011100

10101110101011

10101010101000

10101010101111

10101110010110

10111110101011

11000110111110

Erdös 1 Surrogate Mothers

10101110010110

10101010101111

Erdös 2 Surrogate Mothers

Potentially many meioses separating

• Surrogate parents share a long haplotype with the proband.
• Erdös 1 surrogates are surrogates of the proband.
• Erdös n+1 surrogates are surrogates of Erdös n surrogates of the proband.

• Haplotype libraries for phasing and imputation
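The recursive Erdös definition above amounts to a breadth-first search over "is a surrogate of" links. The sketch below uses a made-up surrogate graph; the individual names and the `surrogates_of` dictionary are purely hypothetical and only illustrate how the numbering propagates.

```python
from collections import deque


def erdos_numbers(proband: str, surrogates_of: dict) -> dict:
    """Breadth-first search: Erdös n+1 surrogates are surrogates of Erdös n surrogates."""
    numbers = {proband: 0}
    queue = deque([proband])
    while queue:
        current = queue.popleft()
        for surrogate in surrogates_of.get(current, ()):
            if surrogate not in numbers:          # keep the smallest number found
                numbers[surrogate] = numbers[current] + 1
                queue.append(surrogate)
    return numbers


# Hypothetical surrogate relationships, for illustration only.
surrogates_of = {
    "proband": ["s1", "s2"],
    "s1": ["proband", "s3"],
    "s2": ["s3"],
    "s3": ["s4"],
}
print(erdos_numbers("proband", surrogates_of))
```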

Surrogate giving phase information

Surrogate giving phase information

LRP-HLI of AlphaPhase in a nutshell

2020122111211

1011012212111

1111022211111

10100111011100

10100111011100

10100111011100

10100111011100 00010011110010

00010011110010

00010011110010

00010011110010

11000110111110

10111110101011

Kong et al., 2008; Hickey et al., 2011

Haplotype library imputation

• Build library of all completely phased haplotypes
• Find haplotypes in the library which can explain an individual's genotype
• Low error rates
• Computationally fast
• Useful for extremely large data sets
  – Strategic use
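A minimal sketch of the library search, assuming a tiny library made of the four parental haplotypes from the trio slides and a sparse genotype in which "." marks an ungenotyped marker. The function names are hypothetical and the exhaustive pair search is only for illustration; it is not a description of how AlphaImpute implements the search.

```python
from itertools import combinations_with_replacement

# Assumed toy library: the four parental haplotypes from the trio slides.
LIBRARY = [
    "10100111011100111001110011",
    "00010011110010101100110011",
    "01010111100011000110011010",
    "10101110101111111111111110",
]


def explaining_pairs(sparse: str, library: list) -> list:
    """Return index pairs of library haplotypes whose sum matches every observed marker."""
    pairs = []
    for i, j in combinations_with_replacement(range(len(library)), 2):
        if all(
            obs == "." or int(obs) == int(a) + int(b)
            for obs, a, b in zip(sparse, library[i], library[j])
        ):
            pairs.append((i, j))
    return pairs


def impute(sparse: str, library: list) -> str:
    """Fill in a marker only where all explaining pairs agree on the dosage."""
    pairs = explaining_pairs(sparse, library)
    if not pairs:
        return sparse
    imputed = []
    for k in range(len(sparse)):
        dosages = {int(library[i][k]) + int(library[j][k]) for i, j in pairs}
        imputed.append(str(dosages.pop()) if len(dosages) == 1 else ".")
    return "".join(imputed)


low_density = "....0.2.....1......1.2..2."
print(explaining_pairs(low_density, LIBRARY))
print(impute(low_density, LIBRARY))
```

With only six observed markers two haplotype pairs remain possible, so the sketch leaves markers blank where they disagree; more low density markers or pedigree information would narrow the choice.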

101001110111001110011100111010000001000010001111001

101100110011001110011100111010011101100100100111001

1010010101110011100111001

1010011100110011100111000

111110111011100111001110011010010000000011100111001

Phasing results simulated data

Phasing results real data

The imputation problem for a 2.5k low density chip in pigs

What information do we have to solve this problem?

• Knowledge

• Low density genotypes

• Pedigree information

• Linkage

• Linkage disequilibrium

What knowledge do we have?

• Knowledge
  – Inheritance is “chunkular”
  – Chunks of DNA are inherited together
  – Recombination events break these chunks up
  – Approximately 70 recombination events during meiosis
  – Thus approximately 150 chunks per animal
  – WE CALL THESE CHUNKS HAPLOTYPES
• Pedigree information
• Linkage
  – Family statistic
  – Correlation between adjacent markers within a family
  – Long haplotype information
• Linkage disequilibrium
  – Population statistic
  – Correlation between adjacent markers within a population
  – Short haplotype information

How can we use low density information?

• We are trying to impute this individual

10110122121110212101220022

True genotype

..........................

....0..............1......

Low density genotype

How can we use pedigree information?

10110122121110212101220022

11110222111111111111121021 10111121211121212211221121

....0..............1......

True genotype


Mother   Father

....0.2............1.2..2.


10110122121110212101220022

True genotype


Three offspring

....0.2............1.2..2.

10220122121120212101120122

21110122121110212101220022

10120122121120212101121022

....0.2.....1......1.2..2.

How can we use linkage information?

• Two step procedure
  – Phase
  – Choose which parental haplotype was passed to offspring
• What is phasing?
  – Phasing splits a string of SNP genotypes into its two gametes/haplotypes
  – Genotypes represented as 0, 1, or 2
  – Each parent contributes a 0 or a 1
  – We use long range phasing to do this

10100111011100111001110011
00010011110010101100110011

10110122121110212101220022

Paternal gamete

Maternal gamete

Genotype is the sum of the gametes

True phase

True genotype
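The statement that the genotype is the sum of the gametes can be checked directly with the two haplotypes and the genotype shown on this slide. The function name is an assumption made for the example.

```python
def genotype_from_gametes(gamete_1: str, gamete_2: str) -> str:
    """Add two haplotypes (strings of 0/1) marker by marker to get 0/1/2 dosages."""
    if len(gamete_1) != len(gamete_2):
        raise ValueError("gametes must cover the same markers")
    return "".join(str(int(a) + int(b)) for a, b in zip(gamete_1, gamete_2))


gamete_1 = "10100111011100111001110011"
gamete_2 = "00010011110010101100110011"
true_genotype = "10110122121110212101220022"

print(genotype_from_gametes(gamete_1, gamete_2) == true_genotype)
```

Phasing is the reverse problem: at every marker with dosage 1 there are two ways to split the genotype, which is why extra information from relatives or surrogates is needed.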

How can we use linkage information?

10110122121110212101220022

True phase


Mother Phased

Father Phased

.1..0.2.....1......1.2..2.

10100111011100111001110011
00010011110010101100110011

01010111100011000110011010

10100111011100111001110011
00010011110010101100110011
10101110101111111111111110

....0111....0........1.01.

.0..0.1.1...1.1.11..11..1.

Individual Phased

True genotype


10110122121110212101220022

True phase


Mother Phased

Father Phased

.1..0.2.....1......1.2..2.

10100111011100111001110011
00010011110010101100110011

01010111100011000110011010

10100111011100111001110011
00010011110010101100110011
10101110101111111111111110

....0111....0........1.01.

.0..0.1.1...1.1.11..11..1.

Individual Phased

True genotype

10100111011100111001110011


10110122121110212101220022


Mother Phased

Father Phased

.1..0.2.....1......1.2..2.

10100111011100111001110011
00010011110010101100110011

01010111100011000110011010

10100111011100111001110011
00010011110010101100110011
10101110101111111111111110

....0111....0........1.01.

.0..0.1.1...1.1.11.011..1.

Individual Phased

10100111011100111001110011

True phase


10110122121110212101220022


Mother Phased

Father Phased

.1..0.2.....1......1.2..2.

10100111011100111001110011
00010011110010101100110011

01010111100011000110011010

10100111011100111001110011
00010011110010101100110011
10101110101111111111111110

....0111....0........1.01.

.0..0.1.1...1.1.11.011..1.

Individual Phased

10100111011100111001110011
00010011110010101100110011

AlphaImpute compared to Impute2.0

• Data set for comparison
  – PIC single line test set
  – 6473 animals in pedigree file
  – 3200 genotyped at high density
  – 509 genotyped at low density
    • These are animals in the current generation
    • We know their high density genotypes
    • Different categories of animals (which ancestors genotyped)
  – High density chip contains 4221 SNP on chromosome 1
  – Low density
    • 15% of SNP (approx. 7500 SNP density)
    • 10% of SNP (approx. 5000 SNP density)
    • 5% of SNP (approx. 2500 SNP density)
    • 1% of SNP (approx. 500 SNP density)

Results PIC pig data set

                     0.5k LD               2.5k LD               5k LD                 7.5k LD
Category     Count   AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2
BothParents  51      0.98         0.77     0.99         0.92     1.00         0.96     1.00         0.96
SireMGS      62      0.93         0.80     0.98         0.92     0.99         0.94     0.99         0.96
DamPGS       47      0.96         0.79     0.98         0.92     0.99         0.95     0.99         0.96
Sire         45      0.89         0.78     0.97         0.92     0.99         0.95     0.99         0.97
Dam          13      0.90         0.76     0.96         0.89     0.98         0.93     0.98         0.95
Other        291     0.86         0.79     0.94         0.91     0.97         0.95     0.97         0.96

Correlation is the statistic that matters

The cost and accuracy of sensible strategies

nSires = 480
nDams = 11884
nCandidates = 100000

60k chip = $120
6k chip = $48
3k chip = $35
384 chip = $20

Scenario  Other  MGS + PGS  MGD + PGD  Sire  Dam  Candidates  Individual cost  Accuracy of imputation (R2)
SC1       60k    60k        0          60k   0    384         -                0.878
SC2       60k    60k        384        60k   384  384         $20.58           0.929
SC3       60k    60k        3k         60k   3k   384         $24.74           0.950
SC4       60k    60k        6k         60k   6k   384         $26.28           0.944
SC5       60k    60k        60k        60k   60k  384         $34.84           0.964
SC6       60k    60k        0          60k   0    3k          -                0.968
SC7       60k    60k        384        60k   384  3k          -                0.972
SC8       60k    60k        3k         60k   3k   3k          $35.58           0.984
SC9       60k    60k        6k         60k   6k   3k          $41.28           0.983
SC10      60k    60k        60k        60k   60k  3k          $49.84           0.993
SC11      60k    60k        0          60k   0    6k          -                0.982
SC12      60k    60k        384        60k   384  6k          -                0.983
SC13      60k    60k        3k         60k   3k   6k          -                0.986
SC14      60k    60k        6k         60k   6k   6k          $48.58           0.991
SC15      60k    60k        60k        60k   60k  6k          $62.84           0.996
SC16      60k    60k        60k        60k   60k  60k         $120.00          1.000

Imputation and GEBV accuracy

Genotyping scenario and imputation accuracy (low density of 450, 3k, or 6k)

Scenario  Other  PGS+MGS  PGD+MGD  Sire  Dam  Progeny  450   3k    6k
          4220   74       108      70    107  184
S1        H      H        H        H     H    L        0.97  0.99  1.00
S2        H      0        0        H     H    L        0.95  0.98  0.99
S3        H      H        0        H     0    L        0.91  0.97  0.98
S4        H      H        L        H     L    L        0.94  0.99  0.99

Genotyping scenario and imputed gEBV accuracy (low density of 450, 3k, or 6k)

Scenario  N HD Geno.  Other  PGS+MGS  PGD+MGD  Sire  Dam  Progeny  450   3k    6k
S1        2519        H      H        H        H     H    L        0.94  0.97  0.97
S2        2344        H      0        0        H     H    L        0.89  0.95  0.96
S3        2318        H      H        0        H     0    L        0.87  0.92  0.93
S4        2318        H      H        L        H     L    L        0.90  0.96  0.97

HMM - Weather example

• I am locked in an office without any windows
• I want to predict what the weather is each day
• Each day my office mate “Andreas” comes
  – But we don’t talk
• Can I extract information from the behavior of Andreas?
  – Andreas likes ice-cream
  – He eats a different number of ice-creams each day
  – Could I use that to predict the weather outside?

Reality underlying the data

• Data
  – For 30 days I record the number of ice-creams Andreas eats
• “Biological” knowledge
  – Just two weather states (Sunny or Cloudy)
  – Weather today is correlated with weather tomorrow
  – Correlation dissipates over time

Weather example

• Hidden Markov process
• Markov
• Hidden
  – See ice-cream, but really modeling the weather

In the HMM

• Parameters
  – Transition matrix
  – Emission probabilities
  – Output probabilities
• Algorithms
  – Expectation Maximisation
  – Forward/backward
  – Viterbi

Weather example

• States
  – k1 = Cloudy
  – k2 = Sunny
• Time = days
• Transition matrix = A

         Cloudy  Sunny
Cloudy   0.9     0.1
Sunny    0.7     0.3

Starting values

n Ice-creams      1    2    3
Cloudy emission   0.7  0.2  0.1
Sunny emission    0.1  0.2  0.7


Back to the ice-cream

• Data

Day  n Ice-creams
1    2
2    3
3    3
4    2
5    3
6    2
7    3
8    2
9    2
10   3
11   1
12   3
13   3
14   1
15   1
16   1
17   2
18   1
19   1
20   1
21   2
22   1
23   1
24   1
25   2
26   3
27   3
28   2
29   3
30   2
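A minimal forward/backward sketch for the weather example, using the starting values for the transition matrix and emissions and the 30 days of ice-cream counts. Two things are assumptions of this sketch: the initial state distribution (taken as 0.5/0.5, which the slides do not give) and the fact that no Expectation Maximisation step is run, so the posterior probabilities it prints will not match the slide results, which use re-estimated parameters.

```python
STATES = ("Cloudy", "Sunny")
TRANSITION = {
    "Cloudy": {"Cloudy": 0.9, "Sunny": 0.1},
    "Sunny": {"Cloudy": 0.7, "Sunny": 0.3},
}
EMISSION = {
    "Cloudy": {1: 0.7, 2: 0.2, 3: 0.1},
    "Sunny": {1: 0.1, 2: 0.2, 3: 0.7},
}
INITIAL = {"Cloudy": 0.5, "Sunny": 0.5}  # assumption: not given on the slides
ICE_CREAMS = [2, 3, 3, 2, 3, 2, 3, 2, 2, 3, 1, 3, 3, 1, 1,
              1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 3, 3, 2, 3, 2]


def normalise(values: dict) -> dict:
    total = sum(values.values())
    return {state: v / total for state, v in values.items()}


def forward_backward(observations: list) -> list:
    """Return, for each day, the posterior probability of each hidden state."""
    forward, previous = [], None
    for t, obs in enumerate(observations):
        current = {}
        for s in STATES:
            prior = INITIAL[s] if t == 0 else sum(
                previous[p] * TRANSITION[p][s] for p in STATES)
            current[s] = prior * EMISSION[s][obs]
        previous = normalise(current)      # rescale to avoid underflow
        forward.append(previous)

    backward = [None] * len(observations)
    following = {s: 1.0 for s in STATES}
    backward[-1] = following
    for t in range(len(observations) - 2, -1, -1):
        current = {
            s: sum(TRANSITION[s][q] * EMISSION[q][observations[t + 1]] * following[q]
                   for q in STATES)
            for s in STATES
        }
        following = normalise(current)
        backward[t] = following

    return [normalise({s: f[s] * b[s] for s in STATES})
            for f, b in zip(forward, backward)]


for day, post in enumerate(forward_backward(ICE_CREAMS), start=1):
    print(day, round(post["Cloudy"], 2), round(post["Sunny"], 2))
```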

Results

Day  n Ice-creams  ProbC  ProbS
1    2             0.06   0.94
2    3             0.00   1.00
3    3             0.00   1.00
4    2             0.00   1.00
5    3             0.00   1.00
6    2             0.00   1.00
7    3             0.00   1.00
8    2             0.01   0.99
9    2             0.01   0.99
10   3             0.00   1.00
11   1             0.10   0.90
12   3             0.00   1.00
13   3             0.00   1.00
14   1             0.92   0.08
15   1             0.99   0.01
16   1             1.00   0.00
17   2             0.98   0.02
18   1             1.00   0.00
19   1             1.00   0.00
20   1             1.00   0.00
21   2             0.98   0.02
22   1             1.00   0.00
23   1             0.99   0.01
24   1             0.95   0.05
25   2             0.33   0.67
26   3             0.00   1.00
27   3             0.00   1.00
28   2             0.00   1.00
29   3             0.00   1.00
30   2             0.04   0.96

Parameters

n Ice-creams      1     2     3
Cloudy emission   0.79  0.21  0.00
Sunny emission    0.06  0.37  0.57


How is this imputation?

Day  n Ice-creams  ProbC  ProbS
1    2             0.06   0.94
2    3             0.00   1.00
3    3             0.00   1.00
4    2             0.00   1.00
5    3             0.00   1.00
6    2             0.00   1.00
7    3             0.00   1.00
8    2             0.01   0.99
9    2             0.01   0.99
10   3             0.00   1.00
11   1             0.10   0.90
12   3             0.00   1.00
13   3             0.00   1.00
14   1             0.92   0.08
15   #             0.99   0.01
16   1             1.00   0.00
17   2             0.98   0.02
18   1             1.00   0.00
19   1             1.00   0.00
20   1             1.00   0.00
21   2             0.98   0.02
22   1             1.00   0.00
23   1             0.99   0.01
24   1             0.95   0.05
25   2             0.33   0.67
26   3             0.00   1.00
27   3             0.00   1.00
28   2             0.00   1.00
29   3             0.00   1.00
30   2             0.04   0.96


Imputation

n Ice-creams      1     2     3
Cloudy emission   0.79  0.21  0.00
Sunny emission    0.06  0.37  0.57

Day  ProbC  ProbS
15   0.99   0.01

Expected number of ice-creams Andreas eats on day 15 =

0.99 × [(0.79 × 1) + (0.21 × 2) + (0.00 × 3)] + 0.01 × [(0.06 × 1) + (0.37 × 2) + (0.57 × 3)]

= 1.22

True value = 1 ice-cream on day 15

Predict on the basis of the hidden states
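The day 15 calculation can be written out as a posterior-weighted average of the emission distributions. The numbers are the ones shown on this slide; the variable names are illustrative only.

```python
# Posterior state probabilities for day 15 and the estimated emission probabilities.
posterior_day_15 = {"Cloudy": 0.99, "Sunny": 0.01}
emission = {
    "Cloudy": {1: 0.79, 2: 0.21, 3: 0.00},
    "Sunny": {1: 0.06, 2: 0.37, 3: 0.57},
}

# Expected count under each state, weighted by how likely that state is.
expected = sum(
    posterior_day_15[state] * sum(n * p for n, p in emission[state].items())
    for state in posterior_day_15
)
print(round(expected, 2))  # 1.22
```

The same idea carries over to genotypes: the hidden states become founder haplotypes and the emitted value becomes an allele.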

Now put in the genetics

– Alleles (a=0, A=1)
– Genotype data (0, 1, or 2)

10100111011100111001110011

11110222111111111111121021

01010111100011000110011010

The phasing problem – split diplotype into a pair of haplotypes

The founder haplotype mosaic problem

HMM for Genetics

• fastPHASE
• Beagle
• MaCH
• Impute2
• AlphaPhase

• Haplotyping and imputation

fastPHASE

• The model
  – Haploid gametes underlie diploid genotypes
• Hidden Markov process
  – Haploid gametes in present population derived from ancient founder haplotypes
    • Hidden process
  – Founder haplotypes can be considered to be cluster medoids
  – fastPHASE can be considered to be a mixture model

The genetics

• Alleles are correlated along the haploid gametes
  – Closer alleles are more correlated
• fastPHASE is an IBD probability model
  – Each allele of each gamete has a probability of deriving from each founder haplotype
• With this information we can do lots of things
  – Phase
  – Impute
  – Build genomic relationship matrices

30 animal example

Founder haplotypes

          M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
Founder1  0  0  0  0  0  0  0  0  0  0
Founder2  1  1  1  1  1  1  1  1  1  1
Founder3  1  0  1  0  1  0  1  0  1  0
Founder4  0  1  0  1  0  1  0  1  0  1

Genotypes of the 30 animals

    M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
1   1  1  2  2  2  2  2  2  2  1
2   0  0  1  1  1  1  1  1  1  0
3   0  0  0  0  0  0  0  0  0  0
4   0  1  0  1  0  1  0  0  0  0
5   0  2  0  2  0  2  0  1  0  1
6   2  2  2  2  2  2  2  2  1  1
7   0  1  0  1  0  1  0  1  0  1
8   0  2  0  2  0  2  0  1  0  0
9   0  2  0  1  0  1  0  0  0  0
10  0  1  1  1  1  1  1  1  1  0
11  0  1  0  1  0  1  0  1  0  1
12  0  1  0  1  0  1  0  0  0  0
13  1  1  1  1  1  1  1  1  1  1
14  0  1  0  2  0  1  0  0  0  0
15  0  1  0  1  0  1  0  0  0  0
16  0  0  0  0  0  0  0  0  0  0
17  1  1  1  1  1  1  1  1  0  0
18  0  0  0  0  0  0  0  0  0  0
19  1  2  1  2  1  2  1  2  1  1
20  0  1  0  1  0  1  0  1  0  1
21  1  2  1  2  1  2  1  2  0  1
22  0  0  0  1  0  1  0  1  0  1
23  1  2  1  2  1  2  0  1  0  0
24  0  1  0  0  0  2  1  1  0  0
25  2  2  2  2  2  2  1  2  1  2
26  1  1  1  1  1  1  1  1  0  0
27  0  1  0  1  0  1  1  2  0  1
28  0  0  0  0  0  0  0  0  0  0
29  0  0  0  0  0  0  0  1  0  1
30  0  2  0  2  0  2  0  1  0  0

Haplotypes of animals 1 to 4 (two rows per animal)

    M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
1   0  0  1  1  1  1  1  1  1  1
1   1  1  1  1  1  1  1  1  1  0
2   0  0  0  0  0  0  0  0  0  0
2   0  0  1  1  1  1  1  1  1  0
3   0  0  0  0  0  0  0  0  0  0
3   0  0  0  0  0  0  0  0  0  0
4   0  1  0  1  0  1  0  0  0  0
4   0  0  0  0  0  0  0  0  0  0

The parameters

• Slightly different to standard HMM
  – α, r, Θ
• α and r are partitions of the transition matrix
  – r = recombination rate between markers
    • i.e. probability of a transition happening
  – α is the frequency of each founder haplotype at each marker
    • i.e. given there is a transition, where do I transition to
• Θ = the emission probability
  • The frequency of allele 1 at position k in founder haplotype j
    – (Emit a 1 as opposed to a 0)
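One way to read "α and r are partitions of the transition matrix" is sketched below: with probability 1 − r the gamete stays on the same founder haplotype, and with probability r it jumps to a founder chosen according to α. This is an assumption-laden illustration rather than the exact fastPHASE parameterisation; the values are the first recombination rate and the marker 2 column of the Alpha table in this deck, and `transition_matrix` is a hypothetical helper.

```python
def transition_matrix(r: float, alpha_next: list) -> list:
    """Build T[i][j] = (1 - r) * [i == j] + r * alpha_next[j]."""
    k = len(alpha_next)
    return [
        [(1.0 - r) * (1.0 if i == j else 0.0) + r * alpha_next[j] for j in range(k)]
        for i in range(k)
    ]


r_between_m1_m2 = 0.0004                 # first value of R
alpha_marker_2 = [0.01, 0.00, 0.97, 0.02]  # marker 2 column of Alpha

T = transition_matrix(r_between_m1_m2, alpha_marker_2)
for row in T:
    print([round(value, 6) for value in row])
```

Because a jump can land back on the current founder, the diagonal is slightly larger than 1 − r.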

HMM parameters

Theta
0.00 0.80 0.00 0.83 0.00 0.80 0.17 0.94 0.00 0.70
0.79 0.86 1.00 1.00 1.00 1.00 0.86 1.00 0.57 0.50
0.00 0.04 0.00 0.04 0.00 0.07 0.00 0.04 0.00 0.00
0.00 0.99 0.00 0.96 0.00 1.00 0.03 0.24 0.00 0.01

Alpha
0.24 0.01 0.00 0.00 0.14 0.18 0.88 0.90 0.07 0.06
0.25 0.00 0.09 0.00 0.00 0.00 0.00 0.04 0.00 0.11
0.28 0.97 0.90 0.99 0.61 0.10 0.02 0.01 0.27 0.26
0.23 0.02 0.01 0.01 0.25 0.72 0.10 0.05 0.66 0.58

R
0.0004 0.0004 0.0005 0.0001 0.0004 0.0017 0.0035 0.0004 0.0002

Ancestral haplotypes

Theta
F4 0.00 0.80 0.00 0.83 0.00 0.80 0.17 0.94 0.00 0.70
F2 0.79 0.86 1.00 1.00 1.00 1.00 0.86 1.00 0.57 0.50
F1 0.00 0.04 0.00 0.04 0.00 0.07 0.00 0.04 0.00 0.00
F4 0.00 0.99 0.00 0.96 0.00 1.00 0.03 0.24 0.00 0.01

M1 M2 M3 M4 M5 M6 M7 M8 M9 M10

Founder1 0 0 0 0 0 0 0 0 0 0

Founder2 1 1 1 1 1 1 1 1 1 1

Founder3 1 0 1 0 1 0 1 0 1 0

Founder4 0 1 0 1 0 1 0 1 0 1

Impute missing marker

• Combine output probabilities with the parameters of the model
  – With Theta, the emission probabilities

Output probabilities for individual 29, first gamete (columns: individual, founder, M1 to M10)
29 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
29 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
29 3 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
29 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Output probabilities for individual 29, second gamete
29 1 0.79 0.79 0.79 0.79 0.79 0.79 0.85 0.98 0.98 0.98
29 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01
29 3 0.21 0.21 0.21 0.21 0.21 0.21 0.15 0.01 0.01 0.01
29 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Theta
F4 0.00 0.80 0.00 0.83 0.00 0.80 0.17 0.94 0.00 0.70
F2 0.79 0.86 1.00 1.00 1.00 1.00 0.86 1.00 0.57 0.50
F1 0.00 0.04 0.00 0.04 0.00 0.07 0.00 0.04 0.00 0.00
F4 0.00 0.99 0.00 0.96 0.00 1.00 0.03 0.24 0.00 0.01

Genotype of individual 29
29 0 0 0 0 0 0 0 1 0 1

• Genotype of individual 29 at marker 1

• Gamete 1 comes from founder 3

• Gamete 2 is a combination of founders 1 and 3

• Founders 1 and 3 emit a 0

• True genotype is a 0
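The marker 1 imputation for individual 29 can be sketched as follows. It assumes the two output probability tables are the founder probabilities for the two gametes and uses only their marker 1 columns together with the marker 1 column of Theta; the variable names are illustrative.

```python
# Marker 1 column of Theta: probability that each founder haplotype emits a 1.
theta_m1 = [0.00, 0.79, 0.00, 0.00]

# Marker 1 columns of the output probabilities for individual 29
# (assumed here to be one table per gamete).
gamete_1_m1 = [0.00, 0.00, 1.00, 0.00]
gamete_2_m1 = [0.79, 0.00, 0.21, 0.00]


def expected_allele(founder_probs: list, theta: list) -> float:
    """Probability-weighted chance that this gamete carries allele 1."""
    return sum(p * t for p, t in zip(founder_probs, theta))


dosage_m1 = expected_allele(gamete_1_m1, theta_m1) + expected_allele(gamete_2_m1, theta_m1)
print(dosage_m1)  # 0.0
```

Both gametes sit on founders that emit a 0 at marker 1, so the imputed dosage is 0, in line with the true genotype.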


How we measure it

• Correlation between imputed dosage and true genotypes
  • R-squared
• Correlation within a family
  • Relates to the accuracy of imputing the Mendelian sampling term
• Don’t use the percentage of genotypes correctly imputed

How we measure it

• Percentage correct is bad for SNP
• And will be worse for sequence data
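A small made-up example shows why percentage correct can mislead at low minor allele frequency. The genotype vectors below are invented for illustration: a naive method that always imputes the common homozygote scores the same percentage correct as a method that actually finds most of the carriers, while the correlation separates the two. Returning 0 for a constant vector is a convention chosen for this sketch.

```python
from math import sqrt


def percent_correct(true: list, imputed: list) -> float:
    return 100.0 * sum(t == i for t, i in zip(true, imputed)) / len(true)


def pearson(x: list, y: list) -> float:
    """Pearson correlation; returns 0.0 when either vector has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


# Invented data: 100 animals, 5 carriers of the rare allele.
true = [0] * 95 + [1] * 5
naive = [0] * 100                                   # always impute the common genotype
informative = [1] * 4 + [0] * 91 + [1] * 4 + [0]    # 4 carriers found, 4 false calls

for name, imputed in (("naive", naive), ("informative", informative)):
    print(name, percent_correct(true, imputed), round(pearson(true, imputed), 2))
```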

658 WWW.CROPS.ORG CROP SCIENCE, VOL. 52, MARCH–APRIL 2012

0.32 for LDP-98.5% while it only marginally increased from 0.87 for LDP-75% to 0.90 for LDP-50%.

Effect of Minor Allele Frequency on Accuracy of Imputation

We classify genotypes into groups of MAF according to the following intervals: [0,0.025], [0.025,0.05], [0.05,0.075], [0.075,0.1], [0.1,0.2], [0.2,0.3], [0.3,0.4], and [0.4,0.5]. For each of these groups, we computed the average MAF, %Correct, and $r_{T,I}$ in each of the imputation scenarios. Results are graphically displayed in Fig. 2 with %Correct and $r_{T,I}$ showing opposite patterns. When the MAF is very small, missing genotypes are almost certain to be homozygous for the common allele and therefore %Correct is almost 100; this will happen even if one uses a naive imputation method (e.g., replacing missing genotypes with the most likely genotype). On the other hand, the prior uncertainty about missing genotypes increases with MAF and therefore, %Correct decreases with increasing MAF. However, as stated, %Correct does not measure how much is gained with the imputation algorithm over naive imputation procedures.

On the other hand, the $r_{T,I}$ measures how much is gained relative to naive imputation procedures and our results indicate that, with very low-density platforms, $r_{T,I}$ is lower for genotypes with extreme allele frequency. This occurs because lines carrying alleles that are present with very low frequency in the population share haplotypes for those genotypes with only a very small number of lines in the population. Therefore, imputation accuracy of those genotypes is poorer, especially when large segments of the chromosome are not genotyped, as it happens in LH scenarios where the LDP contain only 1% of the genotypes of the HDP. This effect of allele frequency on $r_{T,I}$ was apparent only for genotypes with MAF < 0.1 and in LDP with more than 84% of masked genotypes.

Patterns of Local Linkage Disequilibrium and Its Effect on Imputation Accuracy

For every genotype in our HDP we quantified local LD by means of $R^2_{EM}$. The distribution of this statistic was bimodal with most genotypes showing extremely low levels of local LD (53.8% of the genotypes had an estimated $R^2_{EM}$ with adjacent genotypes smaller than 0.1) and many genotypes showing extremely high levels of LD with adjacent genotypes (10.6% of genotypes had $R^2_{EM}$ with adjacent genotypes greater than 0.9). In the range defined by $R^2_{EM} \in [0.1, 0.9]$ the distribution of genotypes was rather uniform.

The relatively large proportion of genotypes having extremely low LD with adjacent genotypes is likely caused by a combination of several factors, including the varying genotype density across regions of the genome, the varying levels of LD across the maize genome, the complexity of our dataset, which includes lines from different populations and therefore may give rise to complex LD patterns, and finally, some genotypes having extremely low local LD simply because of mapping errors. Regardless of what was the relative importance of each of these factors, the extremely low levels of LD observed for a large proportion of genotypes in our dataset is certainly limiting the level of imputation accuracy that one can achieve with this dataset.

Using $R^2_{EM}$ we classify genotypes into groups of $R^2_{EM}$ and computed the average $r_{T,I}$ by level of $R^2_{EM}$ for each

Figure 2. Average proportion of genotypes correctly imputed (left panel) and correlation between imputed and true genotype (right panel) versus range of minor allele frequency of the marker being imputed. The vertical dashed lines give the boundaries of minor allele frequency used to group markers. RAND-5%, a scenario where 5% of the markers were randomly chosen, masked, and subsequently imputed; LDP-x%, a LDP with x% of the genotypes masked.
