Copyright © 2009 by the Genetics Society of America
DOI: 10.1534/genetics.109.100479

Estimation of Allele Frequencies From High-Coverage Genome-Sequencing Projects

Michael Lynch¹

Department of Biology, Indiana University, Bloomington, Indiana 47405

Manuscript received January 6, 2009
Accepted for publication March 6, 2009

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.109.100479/DC1.

¹Author e-mail: [email protected]

Genetics 182: 295–301 (May 2009)

ABSTRACT

A new generation of high-throughput sequencing strategies will soon lead to the acquisition of high-coverage genomic profiles of hundreds to thousands of individuals within species, generating unprecedented levels of information on the frequencies of nucleotides segregating at individual sites. However, because these new technologies are error prone and yield uneven coverage of alleles in diploid individuals, they also introduce the need for novel methods for analyzing the raw read data. A maximum-likelihood method for the estimation of allele frequencies is developed, eliminating both the need to arbitrarily discard individuals with low coverage and the requirement for an extrinsic measure of the sequence error rate. The resultant estimates are nearly unbiased with asymptotically minimal sampling variance, thereby defining the limits to our ability to estimate population-genetic parameters and providing a logical basis for the optimal design of population-genomic surveys.

Very soon, the field of population genomics will be overwhelmed with enormous data sets, including those generated by the "1000-genome" projects being planned for human, Drosophila, and Arabidopsis. With surveys of this magnitude, it should be possible to go well beyond summary measures of nucleotide diversity for large genomic regions to refined estimates of allele-frequency distributions at specific nucleotide sites. Such information will provide a deeper understanding of the mechanisms of evolution operating at the genomic level, as the properties of site-frequency distributions are functions of the relative power of mutation, drift, selection, and historical demographic events (which transiently modify the power of drift) (Kimura 1983; Ewens 2004; Keightley and Eyre-Walker 2007; McVean 2007). Accurate estimates of allele frequencies are also central to association studies for QTL mapping (Lynch and Walsh 1998), to the development of databases for forensic analysis (Weir 1998), and to the ascertainment of pairwise individual relationships (Weir et al. 2006).

If the promise of high-throughput population-genomic data is to be fulfilled, a number of analytical challenges will need to be met. High-throughput sequencing strategies lead to the unbalanced sampling of parental alleles (in nonselfing diploid species) and also introduce errors that can result in uncertainty about individual genotypic states (Johnson and Slatkin 2007; Hellmann et al. 2008; Lynch 2008; Jiang et al. 2009). Unless they are statistically accounted for, both types of problems will lead to elevated estimates of low-frequency alleles, especially singletons, potentially leading to exaggerated conclusions about purifying selection. Such aberrations are of concern because, for economic reasons, many plans for surveying large numbers of genomes envision the pursuit of relatively low (≈2×) depth-of-coverage sequence for each individual, a sampling design that will magnify the variance of parental-allele sampling, while also rendering the error-correction problem particularly difficult (Lynch 2008).

Simple ad hoc strategies for dealing with light-coverage data sets include the restriction of site-specific analyses to individuals that by chance have adequate coverage for a high probability of biallelic sampling (e.g., Jiang et al. 2009). However, such an approach can result in the discarding of large amounts of data, increasing the lower bound on the frequency of alleles that can be determined (as a result of the reduction in sample size). In addition, although methods exist for flagging and discarding low-quality reads (Ewing and Green 1998; Ewing et al. 1998; Huse et al. 2007), such treatment is unable to eliminate other sources of error. With low-coverage data, singleton reads cannot be discarded without inducing significant bias, as such treatment will also lead to the censoring of true heterozygotes with unbalanced parental-allele sampling.

Thus, there is a clear need for novel methods for efficiently estimating nucleotide frequencies at individual sites, not only to maximize the information that can be extracted from population-level samples, but also to guide in the design of such surveys. In the latter context, questions remain as to the relative merits of increasing




the number of individuals sampled vs. the depth of sequence coverage per individual, e.g., a 1000-genome project involving 2× coverage per individual, as opposed to a 500-genome project with 4× coverage, etc. Because many genetic analyses rely on low-frequency markers (SNPs), there is also a need to know the limits to our ability to estimate the frequencies of rare alleles, particularly in the face of error rates of comparable or even higher levels.

As a first step toward addressing these issues, a maximum-likelihood (ML) method is developed for the estimation of allele frequencies at individual nucleotide sites. This method is shown to behave optimally in that the estimates are nearly unbiased with sampling variances asymptotically approaching the expectation under pure genotypic sampling at a high depth of sequence coverage. Moreover, the proposed approach eliminates the need for an extrinsic measure of the read-error rate, using the data themselves to remove this nuisance parameter, thereby reducing potential biases resulting from site-specific variation in error-generating processes.

THEORY AND ANALYSIS

Throughout, we assume that there are no more than two actual nucleotides segregating per site, a situation that experience has shown to be almost universally true at the population level [average nucleotide heterozygosity in most diploid species is generally well below 0.1, even at neutral sites (Lynch 2007)]. It is also assumed that prior to analysis the investigator has properly assembled the sequence reads to ensure that paralogous regions (including mobile elements and copy-number variants) that might aggregate to the same sites have been removed from the analysis. This is, of course, a consideration for all methods of sequence analysis, but with high-coverage projects involving small sequence reads, it is more manageable because unusually high depths of coverage can flag problematical regions.

For the classical situation in which the genotypes of $N$ individuals have been determined without error, assuming a population in Hardy–Weinberg equilibrium, the likelihood that the frequency of the leading allele (denoted as 1) is equal to $p$ is simply

$$L(p) = K\,[p^2]^{N_{11}}\,[2p(1-p)]^{N_{12}}\,[(1-p)^2]^{N_{22}}, \qquad (1)$$

where $N_{11}$, $N_{12}$, and $N_{22}$ are the numbers of times the three genotypes appear in the sample, and $K$ is a constant that accounts for the multiplicity of orderings of individuals within the sample (which has no bearing on the likelihood analysis). The ML estimate of $p$, which maximizes $L(p)$, has the well-known analytical solution (which also applies in the face of Hardy–Weinberg deviations) $\hat{p} = (N_{11} + 0.5N_{12})/N$ (Weir 1996).
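As a small illustration of Equation 1, the following Python sketch computes the closed-form estimate and checks it against a direct scan of the log likelihood. The function names and the genotype counts are hypothetical, chosen only for this example, and the constant $K$ is dropped because it does not affect the location of the maximum.

```python
from math import log

def ml_allele_frequency(n11, n12, n22):
    """Closed-form ML estimate of p from error-free genotype counts."""
    n = n11 + n12 + n22
    if n == 0:
        raise ValueError("at least one genotyped individual is required")
    return (n11 + 0.5 * n12) / n

def log_likelihood(p, n11, n12, n22):
    """Log of Equation 1 without the constant K (requires 0 < p < 1)."""
    return (2 * n11 * log(p)
            + n12 * log(2 * p * (1 - p))
            + 2 * n22 * log(1 - p))

# Hypothetical sample: 30 homozygotes for allele 1, 50 heterozygotes,
# and 20 homozygotes for allele 2.
p_hat = ml_allele_frequency(30, 50, 20)
grid = [i / 1000 for i in range(1, 1000)]
p_scan = max(grid, key=lambda p: log_likelihood(p, 30, 50, 20))
```

With these counts the closed form gives (30 + 25)/100 = 0.55, and the grid scan should land on the same value.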

Unfortunately, uncertainty in the identity of individual genotypes, resulting from incomplete sampling and erroneous reads, substantially complicates the situation in a random-sequencing project, as there are no longer just three categories of individuals. Rather, each site in each individual will be characterized by a specific read-frequency array (the number of reads for A, C, G, and T). This necessitates that the full likelihood be broken up into two components, the first associated with the sampling of reads within individuals and the second associated with the sampling of individual genotypes (as above). Assuming no more than two alleles per site, the data can then be further condensed as the analysis proceeds.

For example, letting 1 and 2 denote the putative major and minor nucleotides at a site in the population (with respective frequencies above and below 0.5) and 3 denote the error bin containing incorrect reads to the remaining two nucleotides, for any candidate major/minor nucleotide combination, each individual surveyed at the site can be represented by an array $(n_1, n_2, n_3)$, where the three entries denote the numbers of reads at this site that coincide with the candidate major nucleotide, the minor nucleotide, and an erroneous read. Given the observed reads at the site for $N$ individuals, our goal is to identify the major/minor nucleotides at the site and to estimate their frequencies. There are 12 possible arrangements of major and minor nucleotides, A/C, A/G, . . . , G/T, and the likelihoods of each of these alternatives must be considered sequentially.

We again start with the assumption of a population in Hardy–Weinberg equilibrium and further assume a homogeneous error distribution, such that each nucleotide is equally likely to be recorded erroneously as any other nucleotide with probability $e/3$ (the total error rate per site being $e$). The composite-likelihood function for a particular read configuration is the sum of three terms, each the product of the probability of a particular genotype and the probability of the read configuration conditional on being that genotype,

$$
\begin{aligned}
P(n_1, n_2, n_3 \mid n, p, e) ={}& p^2 f_e\!\left(n_2, n_3;\, n, \tfrac{e}{3}, \tfrac{2e}{3}\right) + (1-p)^2 f_e\!\left(n_1, n_3;\, n, \tfrac{e}{3}, \tfrac{2e}{3}\right) \\
&+ 2p(1-p)\, f_e\!\left(n_3;\, n, \tfrac{2e}{3}\right) p(n_1;\, n_1 + n_2,\, 0.5). \qquad (2)
\end{aligned}
$$

The first two terms in this expression denote the contributions to the likelihood assuming the genotype is homozygous for the major or minor alleles, with the functions $f_e(x, y;\, n, e/3, 2e/3)$ denoting joint probabilities of $x$ errors to the alternative allele and $y$ errors to the remaining two nonexistent alleles, assumed here to follow a trinomial distribution with $n = n_1 + n_2 + n_3$. The third term accounts for heterozygotes, with $f_e(n_3;\, n, 2e/3)$ being the probability of $n_3$ errors to non-1/2 nucleotides, with the total error rate being reduced by one-third to discount internal (unobservable) 1 ↔ 2 errors, and $p(n_1;\, n_1 + n_2, 0.5)$ being the probability of sampling $n_1$ copies of the major allele among the remaining $(n - n_3)$ reads consistent with the designated genotype (Lynch 2008).
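Equation 2 can be written out directly. The sketch below is one possible reading rather than the author's code: it takes $f_e$ in the homozygote terms to be the trinomial stated in the text, and it assumes that $f_e(n_3; n, 2e/3)$ and $p(n_1; n_1 + n_2, 0.5)$ are ordinary binomial probabilities. The function names are hypothetical.

```python
from math import comb

def multinomial_pmf(counts, probs):
    """Multinomial probability of `counts` given category probabilities."""
    coef, remaining = 1, sum(counts)
    for c in counts:
        coef *= comb(remaining, c)
        remaining -= c
    prob = float(coef)
    for c, q in zip(counts, probs):
        prob *= q ** c
    return prob

def read_config_prob(n1, n2, n3, p, e):
    """Equation 2: probability of the read array (n1, n2, n3) given p and e."""
    correct, to_other_allele, to_error_bin = 1 - e, e / 3, 2 * e / 3
    # Homozygous major: n1 correct reads, n2 errors to the minor allele.
    hom_major = multinomial_pmf(
        (n1, n2, n3), (correct, to_other_allele, to_error_bin))
    # Homozygous minor: n2 correct reads, n1 errors to the major allele.
    hom_minor = multinomial_pmf(
        (n2, n1, n3), (correct, to_other_allele, to_error_bin))
    # Heterozygote: n3 observable errors (assumed binomial), then an even
    # split of the remaining n1 + n2 reads between the two alleles.
    observable_errors = multinomial_pmf(
        (n1 + n2, n3), (1 - to_error_bin, to_error_bin))
    allele_split = comb(n1 + n2, n1) * 0.5 ** (n1 + n2)
    return (p * p * hom_major
            + (1 - p) ** 2 * hom_minor
            + 2 * p * (1 - p) * observable_errors * allele_split)

# Sanity check: the probabilities of all arrays with n = 3 sum to one.
total = sum(read_config_prob(n1, n2, 3 - n1 - n2, 0.8, 0.01)
            for n1 in range(4) for n2 in range(4 - n1))
```

A useful special case is a single error-free read: with e = 0 and n = 1, the probability of observing the major nucleotide reduces to p² + p(1 − p) = p.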

It is informative to start with the simple situation in which each individual is sequenced to the same depth of coverage ($n$), in which case the log likelihood for the entire data set, conditional on a particular candidate major/minor nucleotide pair, major-allele frequency ($p$), and error rate ($e$), is

$$L = \sum N(n_1, n_2, n_3) \cdot \ln[P(n_1, n_2, n_3 \mid n, p, e)], \qquad (3)$$

where the unit of analysis is $N(n_1, n_2, n_3)$, the number of individuals sampled with read configuration $(n_1, n_2, n_3)$, and the summation is over all observed configurations. For sites with $n\times$ coverage, there are $1 + [n(n + 3)/2]$ possible read-array configurations for each major/minor nucleotide combination satisfying $n = n_1 + n_2 + n_3$. To obtain the ML solution, this expression needs to be evaluated over all possible combinations of major and minor nucleotides to determine the joint combination of major/minor allele identities, major-allele frequency ($p$), and error rate ($e$) that maximizes $L$.
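The search just described can be sketched as a brute-force scan. This is a minimal illustration and not the supporting-information code: the grid resolution, the error-rate grid, the function names, and the example read counts are all assumptions made here, and the heterozygote term uses the same binomial reading of Equation 2 as stated in the text for the trinomial case.

```python
from collections import Counter
from itertools import permutations
from math import comb, log

BASES = "ACGT"

def config_prob(n1, n2, n3, p, e):
    """Equation 2 for one read configuration (n1, n2, n3)."""
    n = n1 + n2 + n3
    tri = comb(n, n3) * comb(n1 + n2, n1)  # n! / (n1! n2! n3!)
    hom_major = tri * (1 - e) ** n1 * (e / 3) ** n2 * (2 * e / 3) ** n3
    hom_minor = tri * (1 - e) ** n2 * (e / 3) ** n1 * (2 * e / 3) ** n3
    het = (tri * (2 * e / 3) ** n3 * (1 - 2 * e / 3) ** (n1 + n2)
           * 0.5 ** (n1 + n2))
    return p * p * hom_major + (1 - p) ** 2 * hom_minor + 2 * p * (1 - p) * het

def log_likelihood(config_counts, p, e):
    """Equation 3: sum of N(n1, n2, n3) * ln P(n1, n2, n3 | n, p, e)."""
    return sum(k * log(config_prob(*cfg, p, e))
               for cfg, k in config_counts.items())

def grid_search_ml(reads, p_steps=50, e_grid=(0.001, 0.005, 0.01, 0.02, 0.05)):
    """reads: one (A, C, G, T) read-count tuple per individual at one site.

    Returns (log likelihood, major base, minor base, p, e) at the grid maximum.
    The e grid is kept strictly positive so that every configuration has a
    nonzero probability and the logarithm is always defined.
    """
    best = None
    for major, minor in permutations(range(4), 2):  # 12 ordered pairs
        rest = [b for b in range(4) if b not in (major, minor)]
        config_counts = Counter(
            (r[major], r[minor], r[rest[0]] + r[rest[1]]) for r in reads)
        for i in range(p_steps + 1):
            p = 0.5 + 0.5 * i / p_steps  # the major allele has p >= 0.5
            for e in e_grid:
                ll = log_likelihood(config_counts, p, e)
                if best is None or ll > best[0]:
                    best = (ll, BASES[major], BASES[minor], p, e)
    return best

# Hypothetical site with 4x coverage in 10 individuals.
reads = [(4, 0, 0, 0)] * 6 + [(2, 0, 2, 0)] * 3 + [(0, 0, 4, 0)]
ll, major, minor, p_hat, e_hat = grid_search_ml(reads)
```

For this toy data set the scan should select A/G as the major/minor pair with an estimate of p close to 0.75. A finer grid, or the iterative approach outlined in the appendix, would be preferable for real data.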

To evaluate the behavior of Equations 2 and 3, stochastic simulations were performed by drawing random samples of $N$ individuals from populations in Hardy–Weinberg equilibrium, with a constant sequence coverage per site $n$. Errors were randomly assigned to each read with probability $e$, such that each nucleotide was erroneously recorded as any particular other nucleotide with probability $e/3$. A relatively high error rate of $e = 0.01$ was assumed to evaluate the situation under the worst possible conditions. The ML estimates for each simulation were obtained by thorough grid searches of the full parameter space for $p$ and $e$ for each possible combination of major and minor alleles, for 250 to 500 replications.
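The sampling scheme described above might be simulated along the following lines. This is a sketch under the stated model only (Hardy–Weinberg genotypes, constant coverage, uniform errors). The function name, its parameters, and the fixed seed are hypothetical and are not taken from the supporting-information files.

```python
import random

BASES = "ACGT"

def simulate_site(n_individuals, coverage, p, e,
                  major="A", minor="C", seed=0):
    """Simulate per-individual read counts at one site.

    Each individual draws two alleles independently (Hardy-Weinberg), each
    read samples one of the two alleles at random, and each read is replaced
    by one of the other three nucleotides with total probability e.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_individuals):
        genotype = [major if rng.random() < p else minor for _ in range(2)]
        counts = dict.fromkeys(BASES, 0)
        for _ in range(coverage):
            base = rng.choice(genotype)
            if rng.random() < e:
                # Uniform error: each alternative has probability e / 3.
                base = rng.choice([b for b in BASES if b != base])
            counts[base] += 1
        samples.append(counts)
    return samples

# Hypothetical run: 100 individuals at 4x coverage, p = 0.9, e = 0.01.
site = simulate_site(n_individuals=100, coverage=4, p=0.9, e=0.01)
```

Each returned dictionary holds the A, C, G, and T read counts for one individual, which is the read-frequency array the likelihood analysis starts from.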

The results of these analyses indicate that, provided the coverage is >1×, the proposed method yields estimates of $p$ that are essentially unbiased, with the variance among replicate samples rapidly approaching the expectation based on individual sampling alone, $p(1 - p)/(2N)$, once the coverage exceeds 4× (Figure 1). Small downward biases in the frequency estimates for rare alleles can arise when the sample size is low enough to cause rare-allele sampling to be extremely sporadic and the coverage is also low enough to cause sporadic sampling of errors. The problem is most severe with sequences with 1× coverage, as there is then no information within individuals on heterozygosity. However, even with 2× coverage, this bias is minor enough to be of little concern in most applications, as when it is likely to be of quantitative importance, it is also overwhelmed by the sampling variance. Moreover, as is demonstrated below, under the usual situation of variable coverage, even 1× sites are less problematic, as they supplement the information from sites with higher coverage.

The remaining issue is the extension of the above approach to sites with variable coverage among individuals. In this case, the absolute probabilities of the various configuration arrays are now functions of the probabilities of the various coverage levels, but the latter influence only the arbitrary constant in the log likelihood, so the approaches outlined above can be used without further modification, other than the summation over all configurations at all coverage levels in Equation 3. Evaluation of this procedure by computer simulation, assuming a Poisson distribution of read numbers across individuals and including 1×-covered sites in the analysis, yielded no substantive differences from the patterns illustrated in Figure 1. The estimator is essentially unbiased over all estimable allele frequencies (minor-allele frequencies $> 1/(2N)$), and the sampling variance of the estimate $\hat{p}$ asymptotically approaches $p(1 - p)/(2N)$ at high coverages.

Figure 1. Biases (top plot) and coefficients of sampling variation reported as $\mathrm{SD}(\hat{p})/p$ (bottom plot) of ML allele-frequency estimates obtained from stochastic computer simulations. $N$ is the number of individuals sampled, and $n$ is the number of reads per site per individual (here assumed to be constant). The solid line in the top plot denotes the line of equivalence (i.e., an absence of bias), whereas those in the bottom plot assume the asymptotic sampling variance, $p(1 - p)/(2N)$. A high error rate of $e = 0.01$ is assumed.

We may now inquire as to the optimal sampling strategy for accurate allele-frequency estimates under a fixed sequencing budget, which is assumed here to be proportional to $T = Nn$, the expected number of bases sequenced per site in the sample (with $N$ individuals, each sequenced to average depth of coverage $n$). This effort function assumes that the bulk of the cost of data acquisition involves sequencing rather than the preparation/procurement of individual samples. Low coverage will magnify the problem of sampling variance within individuals, while also resulting in a fraction of individuals that are entirely lacking in observations at individual sites ($\approx \exp(-n)$ under a Poisson distribution of coverages). On the other hand, under the constraint of constant $Nn$, increased sequencing depth per individual will result in a reduction in the number of individuals that can be assayed, thereby reducing the ability to detect rare alleles.
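The trade-off under a fixed budget $T = Nn$ can be tabulated with a few lines of Python. The sketch below only evaluates the two quantities named in the text, the number of individuals and the Poisson fraction with no reads at a site. It does not reproduce the simulated sampling variances, and the function name and the list of coverages are illustrative assumptions.

```python
from math import exp

def design_summary(total_reads, mean_coverage):
    """Summarize one design under the fixed effort T = N * n.

    Returns (N, fraction of individuals with zero reads at a site,
    expected number of individuals with at least one read), assuming
    Poisson-distributed coverage with mean `mean_coverage`.
    """
    n_individuals = total_reads / mean_coverage
    frac_unobserved = exp(-mean_coverage)
    return (n_individuals, frac_unobserved,
            n_individuals * (1 - frac_unobserved))

# Hypothetical comparison at T = 5000 reads per site.
designs = {n: design_summary(5000, n) for n in (0.5, 1.0, 2.0, 4.0)}
```

At a mean coverage of 2, for example, T = 5000 corresponds to 2500 individuals, of which a fraction exp(−2) ≈ 0.135 are expected to have no reads at a given site.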

Assuming error rates in the range of $e = 0.001$–$0.01$, if allele-frequency estimation is the goal, it appears that there is often no advantage to sequencing above 1× coverage, except for alleles with frequencies on the order of $e$, where the sampling variance of $\hat{p}$ is minimized at 1–3× coverage (Figure 2). Indeed, under the assumption of constant $Nn$ as small as 500, for most allele frequencies, the sampling variance of $\hat{p}$ actually continues to decline at coverages well below 1× provided $e < 0.01$, and in no case is there a gain in power above 2× coverage.

DISCUSSION

Contrary to previous studies of nucleotide variation in which careful enough attention has been given to individual sequences that genotypes are unambiguous, with the new generation of high-throughput sequencing strategies, quality is sacrificed for enormous quantity. Genotypes are then incompletely penetrant, in that one or both alleles need not be revealed in a subset of individuals, and those that are sampled are sometimes recorded erroneously. The method proposed herein provides a logical and efficient means for estimating allele frequencies in the face of these problems. Although the applications of the model in this article to simulated data involved brute-force searches of the full range of potential parameter space (the code for which is available as supporting information, File S1 and File S2), iterative methods will likely be desirable for large-scale studies involving millions of sites, and some derivations essential to such approaches are provided in the appendix.

Two primary advantages of the proposed method are (1) its use of the full data set, which eliminates the need for arbitrary decisions on cutoffs for adequate coverage, and (2) its ability to internally separate the incidence of sequence errors from true variation, which eliminates the necessity of relying on external measures of the read-error rate that do not incorporate all sources of error. As the error rates assumed in a number of the simulations in this study are presumably near the upper end of what will occur in ultra-high-throughput methods now under development [although likely still too low in the case of ancient DNA samples (Briggs et al. 2007; Gilbert et al. 2008)], the fact that the estimates are nearly unbiased with asymptotic sampling variances defined by the number of individuals alone suggests that the ML method provides an optimal solution to the ascertainment of frequencies of both common and rare alleles, just as it does in the conventional situation where genotypes are assumed to be known without error (Weir 1996). Moreover, the fact that the proposed approach provides joint estimates of $p$ and $e$ on a site-by-site basis provides a potentially strong advantage in situations where the error rate might be heterogeneous, e.g., dependent on the total sequence context.

Figure 2. Average sampling standard deviations for estimates of nucleotide frequencies ($\hat{p}$) for rare alleles, obtained from computer simulations (250–1000 replicate samples for each set of conditions) for the situation in which individuals are subject to variable depth-of-coverage sequencing, given for two error rates (solid and open symbols denote $e = 0.001$ and $0.01$, respectively). The left three symbols for each plot denote average coverages of 0.1, 0.25, and 0.5 per site. For each allele frequency, results in the top plot are given for a fixed number of 5000 sequences per site over the entire sample, i.e., $T = 5000 = Nn$, such that a doubling of the number of individuals is balanced by a 50% reduction in the sequence coverage per site. For the bottom plot, $T = 500$.

With this methodology in hand, it should be possible to ascertain the form of the site-frequency spectrum for broad classes of genomic positions down to a very fine level. Indeed, provided the depth of coverage per site is sufficiently high, the proposed method has the potential to yield accurate allele-frequency estimates even when $p$ is smaller than the error rate, as is often the case with disease-associated alleles. Thus, with large enough sample sizes of individuals, there are no significant barriers to estimating allele frequencies $\ll 0.01$.

The ML allele-frequency estimator extends previous methods for estimating nucleotide heterozygosity ($\pi$) from high-throughput data from one or a small number of individuals (Johnson and Slatkin 2007; Hellmann et al. 2008; Lynch 2008; Jiang et al. 2009). For a given genomic region sequenced from multiple individuals without error, $\pi$ is generally estimated as $2\hat{p}(1 - \hat{p})(2N - 1)/(2N)$ averaged over all sites, the terms containing $2N$ correcting for the bias associated with individual sampling, which induces variance in the estimate of $p$ equal to $p(1 - p)/(2N)$ (Nei 1978). It can be shown that when sequence errors are present, the sampling variance of $\hat{p}$ is elevated to $\approx [1 + (4e/3)^2]\,p(1 - p)/(2N) + e(1 - 4p)^2/(3Nn)$, but with $e \ll 0.01$ and $N \gg 10$, heterozygosities per nucleotide site will still be adequately estimated as $2\hat{p}(1 - \hat{p})$.

A number of variants of the model assumed herein can be readily implemented. For example, it is straightforward to allow for non-Hardy–Weinberg conditions. For two alleles 1 and 2, this simply requires the incorporation of two unknown genotype frequencies ($P_{11}$ and $P_{12}$), with the third being constrained to be $(1 - P_{11} - P_{12})$ and hence the estimation of just one additional unknown (Weir 1996). A simple likelihood-ratio test, contrasting the fit from the full model with that under Hardy–Weinberg assumptions, can then be used to evaluate whether sites are in Hardy–Weinberg equilibrium. By a similar extension, it is possible to test for interpopulation divergence by evaluating whether the likelihood of the full data set for multiple population samples is significantly improved by allowing for population-specific estimates of allele frequencies.
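A likelihood-ratio test of this kind might be sketched as follows. Several simplifications here are assumptions of the example and not part of the text: the major/minor pair and the error rate are treated as known, both models are maximized on a coarse grid, and the statistic is referred to a chi-square distribution with one degree of freedom (for which the tail probability equals erfc(√(x/2))). All function names are hypothetical.

```python
from collections import Counter
from math import comb, erfc, log, sqrt

def genotype_terms(n1, n2, n3, e):
    """Read-array probabilities conditional on genotypes 11, 12, and 22."""
    n = n1 + n2 + n3
    tri = comb(n, n3) * comb(n1 + n2, n1)
    hom1 = tri * (1 - e) ** n1 * (e / 3) ** n2 * (2 * e / 3) ** n3
    hom2 = tri * (1 - e) ** n2 * (e / 3) ** n1 * (2 * e / 3) ** n3
    het = (tri * (2 * e / 3) ** n3 * (1 - 2 * e / 3) ** (n1 + n2)
           * 0.5 ** (n1 + n2))
    return hom1, het, hom2

def log_lik(terms, p11, p12, p22):
    """Log likelihood for arbitrary genotype frequencies (p11, p12, p22)."""
    return sum(k * log(p11 * h1 + p12 * het + p22 * h2)
               for (h1, het, h2), k in terms)

def hwe_lrt(configs, e, steps=100):
    """Grid-based likelihood-ratio test of Hardy-Weinberg proportions.

    configs: iterable of (n1, n2, n3) arrays, one per individual.
    e: read-error rate, assumed known and strictly positive.
    Returns (test statistic, approximate P value with 1 d.f.).
    """
    counts = Counter(configs)
    terms = [(genotype_terms(*cfg, e), k) for cfg, k in counts.items()]
    ll_hw = max(
        log_lik(terms, p * p, 2 * p * (1 - p), (1 - p) ** 2)
        for p in (i / steps for i in range(steps + 1)))
    ll_full = max(
        log_lik(terms, i / steps, j / steps, (steps - i - j) / steps)
        for i in range(steps + 1) for j in range(steps + 1 - i))
    # The two grids are not nested, so clamp small negative differences.
    stat = max(0.0, 2 * (ll_full - ll_hw))
    return stat, erfc(sqrt(stat / 2))

# Hypothetical data: a sample in Hardy-Weinberg proportions and an
# all-heterozygote sample, both at 20x coverage without observed errors.
balanced = [(20, 0, 0)] * 25 + [(10, 10, 0)] * 50 + [(0, 20, 0)] * 25
excess_het = [(10, 10, 0)] * 40
stat_ok, pval_ok = hwe_lrt(balanced, e=0.01)
stat_bad, pval_bad = hwe_lrt(excess_het, e=0.01)
```

The first sample should give a statistic near zero, whereas the all-heterozygote sample should give a statistic near 2 × 40 × ln 2 ≈ 55 and a very small P value. A production analysis would maximize jointly over $e$ and use a proper optimizer instead of a grid.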

The assumption of a homogeneous site-specific error process can also be relaxed by allowing for unique error rates for any major/minor allele combination. Although there are 12 possible types of single-base substitution errors, for any particular major/minor nucleotide combination (under the two-allele model), a maximum of 4 are relevant to the specific likelihood expression: the reciprocal errors between the major and minor alleles and the movement of each of the two true nucleotides into the error bin. Again, the necessity of these types of embellishments can be evaluated by likelihood-ratio tests.

The overall analysis indicates that unless there is a need for specific information on each individual, from the standpoint of procuring accurate population estimates of allelic frequencies (and other associated parameters), there is little advantage in pursuing high-coverage data per individual. Indeed, given a fixed sequencing budget, an overemphasis on depth of coverage will result in a reduction in the accuracy of population-level parameters, as the optimal design for estimating most allele frequencies employs an average coverage of <1×, even though this magnifies the fraction of individuals with no data. Of course, if the ultimate goal is the procurement of accurate genotypic profiles for individuals (as, for example, in association-mapping studies), there is no substitute for relatively high coverages.

Finally, although the focus of this article has been on the development of an optimally efficient method for extracting allele-frequency estimates from high-throughput sequencing projects, the utility of any specific application will ultimately depend on the quality of the sequence assembly prior to analysis. The complex genomes of multicellular organisms are typically laden with repetitive DNAs, duplicate genes, and mobile elements, which will misassemble to a degree declining with the evolutionary distances between paralogous sequences. Prescreening of the data for regions of exceptionally high depth of coverage, combined with the masking of duplicated DNA (when a reference genome is available), will aid in the elimination of such problematical sequences. An additional potential source of error may be a bias in assembling reads to reference genomes, an issue that is likely to become of diminishing importance with increasing read lengths.

Two anonymous reviewers provided helpful comments on this manuscript. This work was funded by National Science Foundation grant EF-0827411, National Institutes of Health grant GM36827, and MetaCyte funding from the Lilly Foundation.

LITERATURE CITED

Briggs, A. W., U. Stenzel, P. L. Johnson, R. E. Green, J. Kelso et al., 2007 Patterns of damage in genomic DNA sequences from a Neanderthal. Proc. Natl. Acad. Sci. USA 104: 14616–14621.

Edwards, A. W. F., 1972 Likelihood. Cambridge University Press, New York.

Ewens, W., 2004 Mathematical Population Genetics, Ed. 2. Springer-Verlag, New York.

Ewing, B., and P. Green, 1998 Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8: 186–194.

Ewing, B., L. Hillier, M. C. Wendl and P. Green, 1998 Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8: 175–185.

Gilbert, M. T., D. L. Jenkins, A. Gotherstrom, N. Naveran, J. J. Sanchez et al., 2008 DNA from pre-Clovis human coprolites in Oregon, North America. Science 320: 786–789.

Hellmann, I., Y. Mang, Z. Gu, P. Li, F. M. De La Vega et al., 2008 Population genetic analysis of shotgun assemblies of genomic sequence from multiple individuals. Genome Res. 18: 1020–1029.

Huse, S. M., J. A. Huber, H. G. Morrison, M. L. Sogin and D. M. Welch, 2007 Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 8: R143.

Jiang, R., S. Tavare and P. Marjoram, 2009 Population genetic inference from resequencing data. Genetics 181: 187–197.

Johnson, P. L., and M. Slatkin, 2007 Accounting for bias from sequencing error in population genetic estimates. Mol. Biol. Evol. 25: 199–206.

Keightley, P. D., and A. Eyre-Walker, 2007 Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies. Genetics 177: 2251–2261.

Kimura, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.

Lynch, M., 2007 The Origins of Genome Architecture. Sinauer Associates, Sunderland, MA.

Lynch, M., 2008 Estimation of nucleotide diversity, disequilibrium coefficients, and mutation rates from high-coverage genome-sequencing projects. Mol. Biol. Evol. 25: 2421–2431.

Lynch, M., and B. Walsh, 1998 Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA.

McVean, G., 2007 The structure of linkage disequilibrium around a selective sweep. Genetics 175: 1395–1406.

Nei, M., 1978 Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics 89: 583–590.

Weir, B. S., 1996 Genetic Data Analysis II. Sinauer Associates, Sunderland, MA.

Weir, B. S., 1998 Statistical methods employed in evaluation of single-locus probe results in criminal identity cases. Methods Mol. Biol. 98: 83–96.

Weir, B. S., A. D. Anderson and A. B. Hepler, 2006 Genetic relatedness analysis: modern data and new challenges. Nat. Rev. Genet. 7: 771–780.

Communicating editor: R. W. Doerge

APPENDIX

Although site-specific allele frequencies and error rates can be determined by solving the likelihood function (Equation 3) over the full range of parameter space and searching for the global maximum, numerical methods such as Newton–Raphson iteration (Edwards 1972) may provide a more time-economical solution when large numbers of sites are to be assayed. For any candidate pair of major/minor nucleotides, the ML solution satisfies the two equations obtained by setting the partial derivatives of the likelihood function equal to zero,

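$$\frac{\partial L}{\partial p} = \sum N_{123} \cdot \frac{\partial P_{123}/\partial p}{P_{123}},$$

$$\frac{\partial L}{\partial \epsilon} = \sum N_{123} \cdot \frac{\partial P_{123}/\partial \epsilon}{P_{123}},$$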

where $N(n_1, n_2, n_3)$ and $P(n_1, n_2, n_3)$ are abbreviated as $N_{123}$ and $P_{123}$, respectively, with $P_{123}$, the likelihood of the data, being defined as Equation 2. For the model presented in the text (two alleles, with homogeneous error rates across all nucleotides),

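$$\frac{\partial P_{123}}{\partial p} = 2\left[\, p\,\phi_{e1} + (p - 1)\,\phi_{e2} + (1 - 2p)\,\phi_{e3}\,p_{n1} \right],$$

$$\frac{\partial P_{123}}{\partial \epsilon} = p^2 \phi_{e1} \left[ \frac{n_2 + n_3}{\epsilon} - \frac{n_1}{1 - \epsilon} \right] + (1 - p)^2 \phi_{e2} \left[ \frac{n_1 + n_3}{\epsilon} - \frac{n_2}{1 - \epsilon} \right] + 2p(1 - p)\,\phi_{e3}\,p_{n1} \left[ \frac{n_3}{\epsilon} - \frac{(2/3)(n_1 + n_2)}{1 - 2\epsilon/3} \right],$$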

where the terms $\phi_{e1}$, $\phi_{e2}$, $\phi_{e3}$, and $p_{n1}$, which are functions of $n_1$, $n_2$, and $n_3$, are notationally abbreviated from their use in Equation 2 by dropping the terms in parentheses.
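The first-derivative expressions above can be checked numerically with a short sketch. Equation 2 is not reproduced in this appendix, so the forms of $\phi_{e1}$, $\phi_{e2}$, and $\phi_{e3} p_{n1}$ used below are assumptions, chosen only so that their logarithmic derivatives with respect to $\epsilon$ match the bracketed terms above; the main text should be consulted for the exact definitions. The log-likelihood is taken to be $L = \sum N_{123} \ln P_{123}$, as implied by the score equations. The function names and the read-count profiles are hypothetical.

```python
import math

def phi_terms(n1, n2, n3, eps):
    """Assumed (illustrative) forms of phi_e1, phi_e2 and phi_e3 * p_n1."""
    n = n1 + n2 + n3
    # Multinomial coefficient shared by all three genotype terms (assumption).
    coef = math.factorial(n) / (
        math.factorial(n1) * math.factorial(n2) * math.factorial(n3))
    f1 = coef * (1 - eps) ** n1 * (eps / 3) ** n2 * (2 * eps / 3) ** n3
    f2 = coef * (1 - eps) ** n2 * (eps / 3) ** n1 * (2 * eps / 3) ** n3
    f3 = (coef * 0.5 ** (n1 + n2) * (1 - 2 * eps / 3) ** (n1 + n2)
          * (2 * eps / 3) ** n3)  # phi_e3 already multiplied by p_n1
    return f1, f2, f3

def p123_and_gradient(n1, n2, n3, p, eps):
    """Return P123, dP123/dp and dP123/d(eps) using the appendix formulas."""
    f1, f2, f3 = phi_terms(n1, n2, n3, eps)
    b1 = (n2 + n3) / eps - n1 / (1 - eps)
    b2 = (n1 + n3) / eps - n2 / (1 - eps)
    b3 = n3 / eps - (2 / 3) * (n1 + n2) / (1 - 2 * eps / 3)
    q = 1 - p
    P = p * p * f1 + q * q * f2 + 2 * p * q * f3
    dP_dp = 2 * (p * f1 + (p - 1) * f2 + (1 - 2 * p) * f3)
    dP_de = p * p * f1 * b1 + q * q * f2 * b2 + 2 * p * q * f3 * b3
    return P, dP_dp, dP_de

def log_likelihood(counts, p, eps):
    """L = sum of N123 * ln(P123) over the observed read-count profiles."""
    return sum(N * math.log(p123_and_gradient(n1, n2, n3, p, eps)[0])
               for (n1, n2, n3), N in counts.items())

def score(counts, p, eps):
    """Partial derivatives of L with respect to p and eps."""
    dL_dp = dL_de = 0.0
    for (n1, n2, n3), N in counts.items():
        P, dP_dp, dP_de = p123_and_gradient(n1, n2, n3, p, eps)
        dL_dp += N * dP_dp / P
        dL_de += N * dP_de / P
    return dL_dp, dL_de

# Made-up profiles: {(n1, n2, n3): number of individuals with that profile}.
counts = {(10, 0, 0): 5, (5, 5, 0): 3, (0, 10, 0): 1, (9, 0, 1): 2}
p, eps, h = 0.7, 0.01, 1e-6
dL_dp, dL_de = score(counts, p, eps)
# Central finite differences of L as an independent check on the formulas.
fd_p = (log_likelihood(counts, p + h, eps)
        - log_likelihood(counts, p - h, eps)) / (2 * h)
fd_e = (log_likelihood(counts, p, eps + h)
        - log_likelihood(counts, p, eps - h)) / (2 * h)
print("dL/dp  :", dL_dp, "finite difference:", fd_p)
print("dL/deps:", dL_de, "finite difference:", fd_e)
```

Under the stated assumptions, each analytic derivative should agree closely with its finite-difference counterpart, which is a quick way to catch transcription errors before the expressions are used in an iterative solver.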

Estimation of the parameters by iterative methods (as well as approximations of their standard errors from the curvature of the likelihood surface) will generally require the second derivatives,

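$$\frac{\partial^2 L}{\partial p^2} = \sum N_{123} \frac{P_{123}\,\partial^2 P_{123}/\partial p^2 - \left[\partial P_{123}/\partial p\right]^2}{P_{123}^2},$$

$$\frac{\partial^2 L}{\partial \epsilon^2} = \sum N_{123} \frac{P_{123}\,\partial^2 P_{123}/\partial \epsilon^2 - \left[\partial P_{123}/\partial \epsilon\right]^2}{P_{123}^2},$$

$$\frac{\partial^2 L}{\partial p\,\partial \epsilon} = \sum N_{123} \frac{P_{123}\,\partial^2 P_{123}/(\partial p\,\partial \epsilon) - \left[\partial P_{123}/\partial p\right]\left[\partial P_{123}/\partial \epsilon\right]}{P_{123}^2},$$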

where

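$$\frac{\partial^2 P_{123}}{\partial p^2} = 2\left[\phi_{e1} + \phi_{e2} - 2\,\phi_{e3}\,p_{n1}\right],$$

$$\begin{aligned}
\frac{\partial^2 P_{123}}{\partial \epsilon^2} = {} & p^2 \phi_{e1} \left\{ \left[ \frac{n_2 + n_3}{\epsilon} - \frac{n_1}{1 - \epsilon} \right]^2 - \left[ \frac{n_2 + n_3}{\epsilon^2} + \frac{n_1}{(1 - \epsilon)^2} \right] \right\} \\
& + (1 - p)^2 \phi_{e2} \left\{ \left[ \frac{n_1 + n_3}{\epsilon} - \frac{n_2}{1 - \epsilon} \right]^2 - \left[ \frac{n_1 + n_3}{\epsilon^2} + \frac{n_2}{(1 - \epsilon)^2} \right] \right\} \\
& + 2p(1 - p)\,\phi_{e3}\,p_{n1} \left\{ \left[ \frac{n_3}{\epsilon} - \frac{(2/3)(n_1 + n_2)}{1 - 2\epsilon/3} \right]^2 - \left[ \frac{n_3}{\epsilon^2} + \frac{(2/3)^2 (n_1 + n_2)}{(1 - 2\epsilon/3)^2} \right] \right\},
\end{aligned}$$

$$\begin{aligned}
\frac{\partial^2 P_{123}}{\partial p\,\partial \epsilon} = 2 \bigg\{ & p\,\phi_{e1} \left[ \frac{n_2 + n_3}{\epsilon} - \frac{n_1}{1 - \epsilon} \right] + (p - 1)\,\phi_{e2} \left[ \frac{n_1 + n_3}{\epsilon} - \frac{n_2}{1 - \epsilon} \right] \\
& + (1 - 2p)\,\phi_{e3}\,p_{n1} \left[ \frac{n_3}{\epsilon} - \frac{(2/3)(n_1 + n_2)}{1 - 2\epsilon/3} \right] \bigg\}.
\end{aligned}$$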
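The first and second derivatives given in this appendix are the ingredients of a Newton–Raphson update, $\theta \leftarrow \theta - H^{-1} g$, where $\theta = (p, \epsilon)$, $g$ is the vector of first derivatives of $L$, and $H$ is the matrix of second derivatives. The following is a minimal sketch of such an iteration, not the program used in this study. As Equation 2 is not reproduced here, the forms of $\phi_{e1}$, $\phi_{e2}$, and $\phi_{e3} p_{n1}$ are assumptions chosen to be consistent with the bracketed derivative terms, and $L$ is taken to be $\sum N_{123} \ln P_{123}$. The step-halving safeguard, the function names, the starting values, and the read-count profiles are all hypothetical additions for illustration.

```python
import math

def p123_derivs(n1, n2, n3, p, eps):
    """P123 with its first and second partial derivatives (appendix formulas)."""
    n = n1 + n2 + n3
    coef = math.factorial(n) / (
        math.factorial(n1) * math.factorial(n2) * math.factorial(n3))
    # ASSUMED forms of phi_e1, phi_e2 and phi_e3 * p_n1 (not copied from Equation 2).
    f1 = coef * (1 - eps) ** n1 * (eps / 3) ** n2 * (2 * eps / 3) ** n3
    f2 = coef * (1 - eps) ** n2 * (eps / 3) ** n1 * (2 * eps / 3) ** n3
    f3 = (coef * 0.5 ** (n1 + n2) * (1 - 2 * eps / 3) ** (n1 + n2)
          * (2 * eps / 3) ** n3)
    # Bracketed first-order (b) and second-order (c) terms from the appendix.
    b1 = (n2 + n3) / eps - n1 / (1 - eps)
    b2 = (n1 + n3) / eps - n2 / (1 - eps)
    b3 = n3 / eps - (2 / 3) * (n1 + n2) / (1 - 2 * eps / 3)
    c1 = (n2 + n3) / eps ** 2 + n1 / (1 - eps) ** 2
    c2 = (n1 + n3) / eps ** 2 + n2 / (1 - eps) ** 2
    c3 = n3 / eps ** 2 + (2 / 3) ** 2 * (n1 + n2) / (1 - 2 * eps / 3) ** 2
    q = 1 - p
    P = p * p * f1 + q * q * f2 + 2 * p * q * f3
    Pp = 2 * (p * f1 + (p - 1) * f2 + (1 - 2 * p) * f3)
    Pe = p * p * f1 * b1 + q * q * f2 * b2 + 2 * p * q * f3 * b3
    Ppp = 2 * (f1 + f2 - 2 * f3)
    Pee = (p * p * f1 * (b1 ** 2 - c1) + q * q * f2 * (b2 ** 2 - c2)
           + 2 * p * q * f3 * (b3 ** 2 - c3))
    Ppe = 2 * (p * f1 * b1 + (p - 1) * f2 * b2 + (1 - 2 * p) * f3 * b3)
    return P, Pp, Pe, Ppp, Pee, Ppe

def log_likelihood(counts, p, eps):
    return sum(N * math.log(p123_derivs(*k, p, eps)[0]) for k, N in counts.items())

def score_and_hessian(counts, p, eps):
    """First and second partial derivatives of L summed over all profiles."""
    gp = ge = hpp = hee = hpe = 0.0
    for k, N in counts.items():
        P, Pp, Pe, Ppp, Pee, Ppe = p123_derivs(*k, p, eps)
        gp += N * Pp / P
        ge += N * Pe / P
        hpp += N * (P * Ppp - Pp * Pp) / P ** 2
        hee += N * (P * Pee - Pe * Pe) / P ** 2
        hpe += N * (P * Ppe - Pp * Pe) / P ** 2
    return gp, ge, hpp, hee, hpe

def newton_raphson(counts, p, eps, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        gp, ge, hpp, hee, hpe = score_and_hessian(counts, p, eps)
        det = hpp * hee - hpe * hpe
        dp = (hee * gp - hpe * ge) / det  # first component of H^{-1} g
        de = (hpp * ge - hpe * gp) / det  # second component of H^{-1} g
        ll_old, t, accepted = log_likelihood(counts, p, eps), 1.0, False
        for _ in range(30):  # step halving: a safeguard added here, not in the source
            p_new, e_new = p - t * dp, eps - t * de
            if (0 < p_new < 1 and 0 < e_new < 0.75
                    and log_likelihood(counts, p_new, e_new) >= ll_old):
                accepted = True
                break
            t /= 2
        if not accepted:
            break
        converged = max(abs(p_new - p), abs(e_new - eps)) < tol
        p, eps = p_new, e_new
        if converged:
            break
    gp, ge, hpp, hee, hpe = score_and_hessian(counts, p, eps)
    det = hpp * hee - hpe * hpe
    # Approximate standard errors from the inverse of the negative Hessian.
    return p, eps, math.sqrt(-hee / det), math.sqrt(-hpp / det)

# Made-up profiles at 20x coverage: {(n1, n2, n3): number of individuals}.
counts = {(20, 0, 0): 45, (19, 0, 1): 4, (10, 10, 0): 30, (12, 8, 0): 6,
          (8, 12, 0): 6, (0, 20, 0): 8, (0, 19, 1): 1}
p_hat, eps_hat, se_p, se_eps = newton_raphson(counts, p=0.5, eps=0.001)
print("p =", p_hat, "+/-", se_p)
print("eps =", eps_hat, "+/-", se_eps)
```

In this sketch the standard errors are read from the curvature of the likelihood surface at the final estimate. Newton–Raphson is sensitive to starting values, so in practice a coarse grid search over $(p, \epsilon)$ might be used to pick the initial point; that choice is a design suggestion rather than part of the method described above.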

Supporting Information

http://www.genetics.org/cgi/content/full/genetics.109.100479/DC1

Estimation of Allele Frequencies From High-Coverage Genome-Sequencing Projects

 

Michael Lynch

Copyright © 2009 by the Genetics Society of America

DOI: 10.1534/genetics.109.100479 

FILE S1

APPENDIX

Although site-specific allele frequencies and error rates can be determined by solving the likelihood function, Equation (3), over the full range of parameter space and searching for the global maximum, numerical methods such as Newton-Raphson iteration (Edwards 1972) may provide a more time-economical solution when large numbers of sites are to be assayed. For any candidate pair of major/minor nucleotides, the ML solution satisfies the two equations obtained by setting the partial derivatives of the likelihood function equal to zero,

$$\frac{\partial L}{\partial p} = \sum N_{123} \cdot \frac{\partial P_{123}/\partial p}{P_{123}},$$

$$\frac{\partial L}{\partial \epsilon} = \sum N_{123} \cdot \frac{\partial P_{123}/\partial \epsilon}{P_{123}},$$

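where $N(n_1, n_2, n_3)$ and $P(n_1, n_2, n_3)$ are abbreviated as $N_{123}$ and $P_{123}$, respectively, with $P_{123}$, the likelihood of the data, being defined as Equation (2). For the model presented in the text (two alleles, with homogeneous error rates across all nucleotides),

$$\frac{\partial P_{123}}{\partial p} = 2\left[\, p\,\phi_{e1} + (p - 1)\,\phi_{e2} + (1 - 2p)\,\phi_{e3}\,p_{n1} \right],$$

$$\frac{\partial P_{123}}{\partial \epsilon} = p^2 \phi_{e1} \left[ \frac{n_2 + n_3}{\epsilon} - \frac{n_1}{1 - \epsilon} \right] + (1 - p)^2 \phi_{e2} \left[ \frac{n_1 + n_3}{\epsilon} - \frac{n_2}{1 - \epsilon} \right] + 2p(1 - p)\,\phi_{e3}\,p_{n1} \left[ \frac{n_3}{\epsilon} - \frac{(2/3)(n_1 + n_2)}{1 - 2\epsilon/3} \right],$$

where the terms $\phi_{e1}$, $\phi_{e2}$, $\phi_{e3}$, and $p_{n1}$, which are functions of $n_1$, $n_2$, and $n_3$, are notationally abbreviated from their use in Equation (2) by dropping the terms in parentheses.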

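Estimation of the parameters by iterative methods (as well as approximations of their standard errors from the curvature of the likelihood surface) will generally require the second derivatives,

$$\frac{\partial^2 L}{\partial p^2} = \sum N_{123} \frac{P_{123}\,\partial^2 P_{123}/\partial p^2 - \left[\partial P_{123}/\partial p\right]^2}{P_{123}^2},$$

$$\frac{\partial^2 L}{\partial \epsilon^2} = \sum N_{123} \frac{P_{123}\,\partial^2 P_{123}/\partial \epsilon^2 - \left[\partial P_{123}/\partial \epsilon\right]^2}{P_{123}^2},$$

$$\frac{\partial^2 L}{\partial p\,\partial \epsilon} = \sum N_{123} \frac{P_{123}\,\partial^2 P_{123}/(\partial p\,\partial \epsilon) - \left[\partial P_{123}/\partial p\right]\left[\partial P_{123}/\partial \epsilon\right]}{P_{123}^2},$$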

where

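$$\frac{\partial^2 P_{123}}{\partial p^2} = 2\left[\phi_{e1} + \phi_{e2} - 2\,\phi_{e3}\,p_{n1}\right],$$

$$\begin{aligned}
\frac{\partial^2 P_{123}}{\partial \epsilon^2} = {} & p^2 \phi_{e1} \left\{ \left[ \frac{n_2 + n_3}{\epsilon} - \frac{n_1}{1 - \epsilon} \right]^2 - \left[ \frac{n_2 + n_3}{\epsilon^2} + \frac{n_1}{(1 - \epsilon)^2} \right] \right\} \\
& + (1 - p)^2 \phi_{e2} \left\{ \left[ \frac{n_1 + n_3}{\epsilon} - \frac{n_2}{1 - \epsilon} \right]^2 - \left[ \frac{n_1 + n_3}{\epsilon^2} + \frac{n_2}{(1 - \epsilon)^2} \right] \right\} \\
& + 2p(1 - p)\,\phi_{e3}\,p_{n1} \left\{ \left[ \frac{n_3}{\epsilon} - \frac{(2/3)(n_1 + n_2)}{1 - 2\epsilon/3} \right]^2 - \left[ \frac{n_3}{\epsilon^2} + \frac{(2/3)^2 (n_1 + n_2)}{(1 - 2\epsilon/3)^2} \right] \right\},
\end{aligned}$$

$$\begin{aligned}
\frac{\partial^2 P_{123}}{\partial p\,\partial \epsilon} = 2 \bigg\{ & p\,\phi_{e1} \left[ \frac{n_2 + n_3}{\epsilon} - \frac{n_1}{1 - \epsilon} \right] + (p - 1)\,\phi_{e2} \left[ \frac{n_1 + n_3}{\epsilon} - \frac{n_2}{1 - \epsilon} \right] \\
& + (1 - 2p)\,\phi_{e3}\,p_{n1} \left[ \frac{n_3}{\epsilon} - \frac{(2/3)(n_1 + n_2)}{1 - 2\epsilon/3} \right] \bigg\}.
\end{aligned}$$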

FILE S2

The MLAlleleFreqSimulatorVariableCoverage.cpp program is available at http://www.genetics.org/cgi/content/full/genetics.109.100479/DC1