article evaluating the use of abba–baba statistics to locate

14
Article Evaluating the Use of ABBA–BABA Statistics to Locate Introgressed Loci Simon H. Martin,* ,1 John W. Davey, 1 and Chris D. Jiggins 1 1 Department of Zoology, University of Cambridge, Cambridge, United Kingdom *Corresponding author: E-mail: [email protected]. Associate editor: Doris Bachtrog Abstract Several methods have been proposed to test for introgression across genomes. One method tests for a genome-wide excess of shared derived alleles between taxa using Patterson’s D statistic, but does not establish which loci show such an excess or whether the excess is due to introgression or ancestral population structure. Several recent studies have extended the use of D by applying the statistic to small genomic regions, rather than genome-wide. Here, we use simulations and whole-genome data from Heliconius butterflies to investigate the behavior of D in small genomic regions. We find that D is unreliable in this situation as it gives inflated values when effective population size is low, causing D outliers to cluster in genomic regions of reduced diversity. As an alternative, we propose a related statistic ^ f d , a modified version of a statistic originally developed to estimate the genome-wide fraction of admixture. ^ f d is not subject to the same biases as D, and is better at identifying introgressed loci. Finally, we show that both D and ^ f d outliers tend to cluster in regions of low absolute divergence (d XY ), which can confound a recently proposed test for differentiating introgression from shared ancestral variation at individual loci. Key words: ABBA–BABA, gene flow, introgression, population structure, Heliconius, simulation. Introduction Hybridization and gene flow between taxa play a major role in evolution, acting as a force against divergence, and as a po- tential source of adaptive novelty (Abbott et al. 2013). Although identifying gene flow between species has been a long-standing problem in population genetics, the issue has received considerable recent attention with the analysis of shared ancestry between humans and Neanderthals (e.g., Yang et al. 2012; Wall et al. 2013; Sankararaman et al. 2014). With genomic data sets becoming available in a wide variety of other taxonomic groups, there is a need for reliable, com- putationally tractable methods that identify, quantify, and date gene flow between species in large data sets. A sensitive and widely used approach to test for gene flow is to fit coalescent models using maximum-likelihood or Bayesian methods (Pinho and Hey 2010). However, simula- tion and model fitting are computationally intensive tasks and are not easily applied on a genomic scale. A simpler and more computationally efficient approach that is gaining in popularity is to test for an excess of shared derived variants using a four-taxon test (Kulathinal et al. 2009; Green et al. 2010; Durand et al. 2011). The test considers ancestral (“A”) and derived (“B”) alleles and is based on the prediction that two particular single nucleotide polymorphism (SNP) pat- terns, termed “ABBA” and “BABA” (see Materials and Methods), should be equally frequent under a scenario of incomplete lineage sorting without gene flow. An excess of ABBA or BABA patterns is indicative of gene flow between two of the taxa and can be detected using Patterson’s D statistic (Green et al. 2010; Durand et al. 2011; see Materials and Methods for details). However, an excess of shared de- rived variants can arise from factors other than recent intro- gression, in particular nonrandom mating in the ancestral population due to population structure (Eriksson and Manica 2012). It is therefore important to make use of addi- tional means to distinguish between these alternative hypoth- eses, for example, by examining the size of introgressed tracts (Wall et al. 2013), or the level of absolute divergence in intro- gressed regions (Smith and Kronforst 2013). The D statistic was originally designed to be applied on a genome-wide or chromosome-wide scale, with block-jack- knifing used to overcome the problem of nonindependence between loci (Green et al. 2010). However, many researchers are interested in identifying particular genomic regions sub- ject to gene flow, rather than simply estimating a genome- wide parameter. Theory predicts that the rate of gene flow should vary across the genome, both in the case of secondary contact after isolation (Barton and Gale 1993) as well as con- tinuous gene flow during speciation (Wu 2001). Indeed, a maximum-likelihood test for speciation with gene flow de- vised by Yang (2010) is based on detecting this underlying heterogeneity. Moreover, adaptive introgression might lead to highly localized signals of introgression, limited to the partic- ular loci under selection. Many methods for characterizing heterogeneity in pat- terns of introgression across the genome have been proposed. Several genomic studies have used F ST to characterize hetero- geneity in divergence across the genome, often interpreting the variation in F ST as indicative of variation in rates of gene flow (e.g., Ellegren et al. 2012). However, it is well established ß The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons. org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Open Access 244 Mol. Biol. Evol. 32(1):244–257 doi:10.1093/molbev/msu269 Advance Access publication September 22, 2014 Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550 by guest on 29 January 2018

Upload: vannga

Post on 10-Dec-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

  • Article

    Evaluating the Use of ABBABABA Statistics to LocateIntrogressed LociSimon H. Martin,*,1 John W. Davey,1 and Chris D. Jiggins11Department of Zoology, University of Cambridge, Cambridge, United Kingdom

    *Corresponding author: E-mail: [email protected].

    Associate editor: Doris Bachtrog

    Abstract

    Several methods have been proposed to test for introgression across genomes. One method tests for a genome-wideexcess of shared derived alleles between taxa using Pattersons D statistic, but does not establish which loci show such anexcess or whether the excess is due to introgression or ancestral population structure. Several recent studies haveextended the use of D by applying the statistic to small genomic regions, rather than genome-wide. Here, we usesimulations and whole-genome data from Heliconius butterflies to investigate the behavior of D in small genomic regions.We find that D is unreliable in this situation as it gives inflated values when effective population size is low, causing Doutliers to cluster in genomic regions of reduced diversity. As an alternative, we propose a related statistic f d, a modifiedversion of a statistic originally developed to estimate the genome-wide fraction of admixture. f d is not subject to the samebiases as D, and is better at identifying introgressed loci. Finally, we show that both D and f d outliers tend to cluster inregions of low absolute divergence (dXY), which can confound a recently proposed test for differentiating introgressionfrom shared ancestral variation at individual loci.

    Key words: ABBABABA, gene flow, introgression, population structure, Heliconius, simulation.

    IntroductionHybridization and gene flow between taxa play a major role inevolution, acting as a force against divergence, and as a po-tential source of adaptive novelty (Abbott et al. 2013).Although identifying gene flow between species has been along-standing problem in population genetics, the issue hasreceived considerable recent attention with the analysis ofshared ancestry between humans and Neanderthals (e.g.,Yang et al. 2012; Wall et al. 2013; Sankararaman et al. 2014).With genomic data sets becoming available in a wide varietyof other taxonomic groups, there is a need for reliable, com-putationally tractable methods that identify, quantify, anddate gene flow between species in large data sets.

    A sensitive and widely used approach to test for gene flowis to fit coalescent models using maximum-likelihood orBayesian methods (Pinho and Hey 2010). However, simula-tion and model fitting are computationally intensive tasksand are not easily applied on a genomic scale. A simplerand more computationally efficient approach that is gainingin popularity is to test for an excess of shared derived variantsusing a four-taxon test (Kulathinal et al. 2009; Green et al.2010; Durand et al. 2011). The test considers ancestral (A)and derived (B) alleles and is based on the prediction thattwo particular single nucleotide polymorphism (SNP) pat-terns, termed ABBA and BABA (see Materials andMethods), should be equally frequent under a scenario ofincomplete lineage sorting without gene flow. An excess ofABBA or BABA patterns is indicative of gene flow betweentwo of the taxa and can be detected using Pattersons Dstatistic (Green et al. 2010; Durand et al. 2011; see Materials

    and Methods for details). However, an excess of shared de-rived variants can arise from factors other than recent intro-gression, in particular nonrandom mating in the ancestralpopulation due to population structure (Eriksson andManica 2012). It is therefore important to make use of addi-tional means to distinguish between these alternative hypoth-eses, for example, by examining the size of introgressed tracts(Wall et al. 2013), or the level of absolute divergence in intro-gressed regions (Smith and Kronforst 2013).

    The D statistic was originally designed to be applied on agenome-wide or chromosome-wide scale, with block-jack-knifing used to overcome the problem of nonindependencebetween loci (Green et al. 2010). However, many researchersare interested in identifying particular genomic regions sub-ject to gene flow, rather than simply estimating a genome-wide parameter. Theory predicts that the rate of gene flowshould vary across the genome, both in the case of secondarycontact after isolation (Barton and Gale 1993) as well as con-tinuous gene flow during speciation (Wu 2001). Indeed, amaximum-likelihood test for speciation with gene flow de-vised by Yang (2010) is based on detecting this underlyingheterogeneity. Moreover, adaptive introgression might lead tohighly localized signals of introgression, limited to the partic-ular loci under selection.

    Many methods for characterizing heterogeneity in pat-terns of introgression across the genome have been proposed.Several genomic studies have used FST to characterize hetero-geneity in divergence across the genome, often interpretingthe variation in FST as indicative of variation in rates of geneflow (e.g., Ellegren et al. 2012). However, it is well established

    The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work isproperly cited. Open Access244 Mol. Biol. Evol. 32(1):244257 doi:10.1093/molbev/msu269 Advance Access publication September 22, 2014Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550

    by gueston 29 January 2018

    for example,('A''B', 'ABBA' 'BABA' ,Patterson's --maximum FSTFSTfor exampleXPath error Undefined namespace prefixXPath error Undefined namespace prefix

  • that, as a relative measure of divergence, FST is dependent onwithin-population genetic diversity (Charlesworth 1998), andis therefore an unreliable indicator of how migration ratesvary across the genome. In particular, heterogeneity in puri-fying selection and recombination rate could confound FST-based studies (Noor and Bennett 2009; Hahn et al. 2012;Roesti et al. 2012; Cruickshank and Hahn 2014). Various stud-ies of admixture among human populations, or betweenhumans and Neanderthals, have used probabilistic methodsto assign ancestry to haplotypes, and infer how this ancestrychanges across a chromosome (Sankararaman et al. 2008;Price et al. 2009; Henn et al. 2012; Lawson et al. 2012;Omberg et al. 2012; Churchhouse and Marchini 2013;Maples et al. 2013; Sankararaman et al. 2014). Other methodshave modeled speciation with the allowance for variable in-trogression rates among loci (Garrigan et al. 2012; Roux et al.2013), allowing the detection of more ancient gene flow.

    There have also been recent attempts to characterize het-erogeneity in patterns of introgression across the genomeusing the D statistic, calculated either in small windows(Kronforst et al. 2013; Smith and Kronforst 2013) or for indi-vidual SNPs (Rheindt et al. 2014). The robustness of the Dstatistic for detecting a genome-wide excess of shared derivedalleles has been thoroughly explored (Green et al. 2010;Durand et al. 2011; Yang et al. 2012; Eaton and Ree 2013;Martin et al. 2013). However, it has not been establishedwhether D provides a robust and unbiased means to identifyindividual loci with an excess of shared derived alleles, or todemonstrate that these loci have been subject to introgres-sion. Any inherent biases of the D statistic when applied tospecific loci have implications for methods that assume itsrobustness.

    For example, Smith and Kronforst (2013) made use of theD statistic in a proposed test to distinguish between the hy-potheses of introgression and shared ancestral variation atwing-patterning loci of Heliconius butterflies. Two wing-patterning loci are known to show an excess of shared derivedalleles between comimetic populations of Heliconius melpo-mene and H. timareta (Heliconius Genome Consortium 2012).At one of these loci, phylogenetic evidence and patterns oflinkage disequilibrium are consistent with recent gene flow(Pardo-Diaz et al. 2012). Nevertheless, Smith and Kronforst(2013) argue that this shared variation might represent anancestral polymorphism that was maintained through thespeciation event by balancing selection. Conceptually, this isnot unlike the population structure argument of Eriksson andManica (2012), except that here structure is limited to one ora few individual loci.

    Smith and Kronforst proposed that the alternative expla-nations of introgression or ancestral polymorphism could bedistinguished by considering absolute divergence within andoutside of the loci of interest. Both hypotheses predict anexcess of shared derived alleles at affected loci, but introgres-sion should lead to reduced absolute divergence due to morerecent coalescence at these loci, whereas the locus-specificpopulation structure hypothesis predicts no reduction in ab-solute divergence at these loci compared with other loci inthe genome. Loci with an excess of shared derived alleles, and

    therefore showing evidence of shared ancestry, were locatedby calculating the D statistic in nonoverlapping 5-kb windowsacross genomic regions of interest, and identifying outliersusing an arbitrary cutoff (the 10% of windows with the high-est D values). The mean absolute genetic divergence (dXY) wasthen compared between the outliers and nonoutliers, andfound to be significantly lower in outlier windows, consistentwith recent introgression (Smith and Kronforst 2013). Thismethod makes two assumptions. First, that the D statistic canaccurately identify regions that carry a significant excess ofshared variation, and second, that D outliers do not haveinherent biases leading to their cooccurrence with regionsof low absolute divergence. These assumptions, whichextend the use of D beyond its original definition, may bemade by other researchers for similar purposes, but theyremain to be tested.

    Here, we first assess the reliability of the D statistic as ameans to quantify introgression at individual loci. Using sim-ulations of small sequence windows, we compare D to a re-lated statistic that was developed by Green et al. (2010)specifically for estimating f, the proportion of the genomethat has been shared, and we propose improvements tothis statistic. We then use whole-genome data from severalHeliconius species to investigate how these statistics performon empirical data, and specifically how they are influenced byunderlying heterogeneity in diversity across the genome.Lastly, we use a large range of simulated data sets to testthe proposal that recent gene flow can be distinguishedfrom shared ancestral variation based on absolute divergencein D outlier regions.

    Results

    The D Statistic Is Not an Unbiased Estimator ofGene Flow

    Pattersons D statistic was developed to detect, but not toquantify introgression. We used the deterministic derivationof the expected value of D (E[D]) provided by Durand et al.(2011, eq. 5) to test how sensitive the value of D is to otherfactors apart from the proportion of introgression. We definethe proportion of introgression (f) as the proportion of hap-lotypes in the recipient population (P2) that trace their an-cestry through the donor population (P3) at the time of geneflow (fig. 1A). The expected D value increases with the pro-portion of introgression (f), but not linearly (fig. 1B) and ex-pected D increases as population size decreases (fig. 1B and C).The split times between populations also have a small effect,with a more recent split between P1 and P2 leading to higherexpected values of D (fig. 1C). This implies that empiricallycalculated values of the D statistic will depend on variousparameters other than the amount of gene flow, irrespectiveof the number of sites analyzed.

    Direct Estimators of f Outperform D on SimulatedData

    Analysis of simulated data confirmed that the D statistic isnot an appropriate measure for quantifying introgression oversmall genomic windows, but that direct estimation of the

    245

    Locating Introgression . doi:10.1093/molbev/msu269 MBE

    Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550by gueston 29 January 2018

    FSTFST, , , , , , Kronforst etal. 2013-Heliconius to -5 --Patterson's ,

  • proportion of introgression (f) provides a more robust alter-native. The D statistic (eq. 1, Materials and Methods) wascompared with the f estimator of Green et al. (2010) (eq. 4,Materials and Methods), which is referred to as f G below,along with two proposed modified versions of this statistic(eqs. 5 and 6, Materials and Methods). The first, f hom (eq. 5), issimilar to f G in that it explicitly assumes unidirectional geneflow from P3 to P2, but makes a further assumption thatmaximal introgression would lead to complete homogeniza-tion of allele frequencies in P2 and P3. This is a conservativeassumption, as an extremely high rate of migration would benecessary to attain a maximal value of f hom. The second, f d(eq. 6), is dynamic in that it allows for bidirectional introgres-sion on a site-by-site basis, setting the donor population ateach site as that which has the higher frequency of the derivedallele. These f estimators are distinct from the F2, F3, and F4statistics of Reich et al. (2009, 2012) and Patterson et al.(2012), which all test for correlated allele frequencies associ-ated with introgression (much like the D statistic). However,Pattersons (2012) F4-ratio is conceptually very similar to the festimators discussed here, in that it estimates the propor-tional contribution of a donor population.

    To compare the utility of these statistics for quantifyingintrogression in small genomic windows, we simulatedsequences from four populations: P1, P2, and P3 andoutgroup O, with the relationship (((P1,P2),P3),O), with asingle instantaneous gene flow event, either from P3 to P2or from P2 to P3. Simulations were performed over a range ofdifferent values of f (the probability that any particular hap-lotype is shared during the introgression event), and withvarious window sizes, recombination rates, and times ofgene flow.

    A subset of the results are shown in figure 2, and full resultsare provided in supplementary figure S1, SupplementaryMaterial online. In general, the D statistic proved sensitiveto the occurrence of introgression, with strongly positivevalues for any nonzero value of f. However, it was a poorestimator of the amount of introgression, as defined by thesimulated value of f (fig. 2). Moreover, D values showed dra-matic variance, particularly at low simulated values of f. Evenin the absence of any gene flow, a considerable proportion ofwindows had intermediate D values. This variance decreasedwith increasing window size and recombination rate (supple-mentary fig. S1, Supplementary Material online).

    In simulations of gene flow from P3 to P2, all three f esti-mators gave fairly accurate estimates of the simulated f value,provided gene flow was recent (fig. 2 and supplementary fig.S1A, Supplementary Material online). When gene flow oc-curred further back in time, f estimators tended to give

    FIG. 1. Expected value of the D statistic. (A) Durand et al. (2011)sderivation of the expected value of Pattersons D statistic E[D] dependson the two split times, t12 and t23, separating populations P1, P2, and P3.It assumes a single instantaneous admixture event from P3 to P2 at tGF,after which a proportion f of P2 individuals trace their ancestry throughP3. The effective population size, Ne, is constant through time and the

    FIG. 1. Continuedsame in all populations. (B) The expected value of D as a function of f,the proportion of introgression, at three different effective populationsizes: 0.5, 1, and 2 million. Split times are fixed at 1 million generationsfor t12 and 2 million generations for t23. (C) The expected value of D as afunction of Ne, showing the effect of varying t12. In all three cases, t23 isset at 2 million generations ago, and f is set to 0.1.

    246

    Martin et al. . doi:10.1093/molbev/msu269 MBE

    Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550by gueston 29 January 2018

    to equation http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1-http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1, http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1

  • underestimates, but were nevertheless well correlated withthe simulated f value (supplementary fig. S1A, SupplementaryMaterial online). This is unsurprising, as genetic drift and theaccumulation of mutations in these lineages after the intro-gression event should dilute the signal of introgression. Insimulations of gene flow in the opposite direction, from P2to P3, both f G and f hom showed considerable stochasticity,particularly when recombination rates were low and geneflow was recent (supplementary fig. S1, SupplementaryMaterial online). The size of the window had little effect onthis behavior (supplementary fig. S1, Supplementary Materialonline), implying that it was not an effect of the number ofsites analyzed, but rather the level of independence amongsites. This is also to be expected, as these statistics can givevalues greater than 1 where derived allele frequencies happen

    by chance to be higher in P3 than P2 (see Materials andMethods). Unlike these two statistics, f d behaved predictablyat all recombination rates and times of gene flow, giving es-timates that were fairly well correlated with the simulated f,but underestimating its absolute value (fig. 2 and supplemen-tary fig. S1, Supplementary Material online).

    Generally, the variance in f d was lower than in the othertwo f estimators (supplementary fig. S1AI, SupplementaryMaterial online). Unlike the D statistic, f d displayed minimalvariance at low simulated values of f (fig. 2). However, all fourstatistics showed greater variance and more extreme valueswhen recombination rates were lower (supplementary fig. S2,Supplementary Material online), as expected given that de-creased recombination reduces the number of independentsites analyzed.

    FIG. 2. Comparing statistics to detect and quantify introgression. Results from a subset of the simulations: window size 5 kb, time of gene flow (tGF)0.1 4N generations ago, and population recombination rate 0.01. See supplementary figure S1AI, Supplementary Material online, for full results. Plotsshow means and standard deviations for D and f d , calculated over 100 simulated sequences (see Materials and Methods for details). Simulationscovered 11 different values of f, the proportion of introgression. Gene flow was simulated either from P3 to P2 (left-hand column) or from P2 to P3 (right-hand column). Dashed diagonal lines show the expectation of a perfect estimator of f.

    247

    Locating Introgression . doi:10.1093/molbev/msu269 MBE

    Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550by gueston 29 January 2018

    http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1, http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1

  • Although none of the examined measures were able toaccurately quantify both forms of introgression in all cases, f dhas some appealing characteristics as a measure to identifyintrogressed loci in a genome scan approach. It has low var-iance and is not prone to false positives when gene flow isabsent and recombination rare. In all of our simulations, itprovided estimates that were proportional to the simulatedlevel of introgression. Although it tended toward underesti-mates, genome scans for introgressed loci would primarily beinterested in relative rates of introgression across the genome,rather than absolute rates.

    f Estimators Are Robust to Variation in NucleotideDiversity Across the Genome

    Analysis of published whole-genome data from Heliconiusspecies confirmed that Pattersons D statistic was prone toextreme values in regions of low diversity, whereas f estima-tors were not (fig. 3AC). We reanalyzed published whole-genome sequence data from two closely related Heliconiusbutterfly species, H. melpomene and H. timareta, and fouroutgroup species from the related silvaniform clade. Theraces H. melpomene amaryllis and H. timareta thelxinoe aresympatric in Peru, and show genome-wide evidence of geneflow (Martin et al. 2013), with particularly strong signals attwo wing-patterning loci: HmB, which controls red patternelements, and HmYb, which controls yellow and white pat-tern elements (Heliconius Genome Consortium 2012; Pardo-Diaz et al. 2012). To determine whether heterogeneity indiversity across the genome may influence D and the f esti-mators, we calculated D, f G, f hom, f d, and nucleotide diversity() in nonoverlapping 5-kb windows across the genome.Variance in the D statistic was highest among windowswith low nucleotide diversity and decreased rapidly with in-creasing diversity (fig. 3A and supplementary fig. S4,Supplementary Material online). Windows from the wing-patterning loci were among those with the highest Dvalues, but there were many additional windows with Dvalues approaching or equal to 1. By contrast, f d, calculatedfor all windows with positive D, was far less sensitive to thelevel of diversity, with most outlying windows showing inter-mediate levels of diversity (fig. 3B). Notable exceptions werewindows located within the wing-patterning regions, whichtended to have high f d values and below average diversity.This is consistent with the strong selection known to actupon the patterning loci. The lack of extreme f d values inwindows with low diversity suggests that most of the D out-liers are spurious, and that f d provides a better measure ofwhether a locus has shared ancestry between species. Finally,we also tested the other two f estimators described here: f Gand f hom (eqs. 4 and 5, Materials and Methods). Both per-formed similarly to f d except that both had higher variance(supplementary figs. S3 and S4, Supplementary Materialonline), consistent with the simulations reported above (sup-plementary fig. S1, Supplementary Material online), and bothgave a considerable number of values greater than 1, confirm-ing that f d was the most conservative and stable statistic.

    Taken together, these findings demonstrate that, whensmall genomic windows are analyzed, a high D value aloneis not sufficient evidence for introgression. Many of the Doutlier loci probably represent statistical noise, concentratedin regions of low diversity, whereas outliers for the f estimates,and particularly f d, tend to be less biased.

    This effect could also be observed on the scale of wholechromosomes. The variance in D among 5-kb windows foreach of the Heliconius chromosomes (n = 21) was stronglynegatively correlated with the average diversity per chromo-some (r[19] = 0.936, P< 0.001; fig. 3C). This relationshipwas most clearly illustrated by the Z chromosome: It hadthe lowest diversity by some margin, as expected given itsreduced effective population size, and the highest varianceamong D values for 5-kb windows, despite the fact that pre-vious chromosome-wide analyses suggest very limited geneflow affecting this chromosome (Martin et al. 2013). By con-trast, the variance in f d, estimated for all windows with pos-itive D, had a weak positive correlation with the meandiversity per chromosome (r[19] = 0.440, P< 0.05). This wasdriven by the fact that the Z chromosome had the lowestdiversity and also the lowest variance in f d, as expected giventhe reduced gene flow affecting this chromosome. When theZ chromosome was excluded, there was no significant rela-tionship between the variance in f d values and average diver-sity (r[18] = 0.092, P 4 0.05). We also considered the effect ofwindow size on the variance of D and the estimators of f. Aswindow size increases, the higher variance of D in regions oflower diversity persists, but becomes less extreme (supple-mentary fig. S4, Supplementary Material online). In summary,these data show that extreme D values, both positive andnegative, occur disproportionately in genomic regions withlower diversity, whereas f d values are less biased by underlyingheterogeneity in genetic variation.

    Inherent Biases in the D and f Statistics Confound aTest to Distinguish between Introgression and SharedAncestral Variation

    The biases associated with the D statistic described abovemay have important consequences for methods that use Dto identify candidate introgressed regions. For example, Smithand Kronforst (2013) proposed a method to discriminatebetween gene flow and shared ancestral variation that reliesupon D values calculated for small genomic regions (seeIntroduction). Briefly, the Smith and Kronforst test calculatedD for all nonoverlapping 5-kb windows. Absolute divergence(dXY) was then compared between the set of windows thatwere outliers for the D statistic (defined as the windows withthe top 10% of D values) and the remaining 90% of nonoutlierwindows. The method predicts that introgression betweenspecies at a specific genomic region should reduce the be-tween-species divergence in this region as compared with therest of the genome, whereas shared ancestry due to ancestralpopulation structure would not lead to lower divergence. Wefirst confirmed this prediction using simulations, and thenassessed whether biases in the D statistic might affect thepower of the method.

    248

    Martin et al. . doi:10.1093/molbev/msu269 MBE

    Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550by gueston 29 January 2018

    Patterson's --Heliconius Heliconius , -5 , http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1, http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC15 ()=-5 ()=()=http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1--to while

  • To test the prediction that introgression and ancestralpopulation structure leave distinct footprints in terms of ab-solute divergence, 10,000 sequence windows for three popu-lations (P1,P2,P3) and an outgroup (O) were simulated. Intotal, 9,000 windows were defined as Background, havingthe topology (((P1,P2),P3),O), without any gene flow or pop-ulation structure. The remaining 1,000 windows were definedas Alternate and were subject to either gene flow or struc-ture (see Materials and Methods for details). Ten percent ofwindows were defined as Alternate to match Smith andKronforsts design, wherein the top 10% of D values aretaken as outliers. Three different Alternate scenarios wereconsidered: Gene flow from P2 to P3, gene flow from P3 toP2 and ancestral structure leading to shared ancestry betweenP2 and P3. For all scenarios, Alternate windows were definedwith the topology ((P1,(P2,P3)),O). In the gene flow scenarios,the split time between P2 and P3 in the Alternate topologywas set to be more recent than the split between P1 and P2 inthe Background topology (fig. 4A and B). In the ancestralstructure scenario, the split time between P1 and P2 in theAlternate topology was set to be more ancient than the splitbetween P2 and P3 in the Background topology (fig. 4C). Thiswas designed to model a region of the genome undergoingbalancing selection or some other process that maintainspolymorphism at particular loci before the speciation event.Gene flow or structure in the Alternate windows can beconsidered to be complete (f = 1). For example, under geneflow from P2 to P3, all P3 alleles trace their ancestry through P2at the time of gene flow. This simplified design, where geneflow or structure is absent in 90% of the sequences and com-plete in 10%, allowed for the most straightforward and pre-dictable test of Smith and Kronforsts method; if the logic ofthe method does not follow in this extreme scenario, it isunlikely to do so in more complex situations.

    For each of the three evolutionary scenarios, 120 differentpermutations of split times and times of gene flow or struc-ture were simulated. The split times and times of gene flowfor all models are given in the first three columns of supple-mentary tables S1S3, Supplementary Material online. Tosimplify our comparisons between models, we focused speci-fically on dXY between P2 and P3, the most relevant parameterwhen testing for introgression between P2 and P3. We testedwhether P2P3 dXY was significantly lower in the Alternatewindows (those that had experienced gene flow or structure)compared with the Background windows, using a Wilcoxonrank-sum test, with Bonferroni correction over all 120 models

    FIG. 3. Effects of genetic diversity on D and f d in Heliconius whole-genome data. (A and B). Values of D and f d for nonoverlapping 5-kb

    FIG. 3. Continuedwindows across the genome, plotted against nucleotide diversity. f dvalues are only plotted for windows with D0. Data from Martinet al. (2013). Taxa used are as follows, P1: Heliconius melpomene aglaope,P2: H. m. amaryllis, P3: H. timareta thelxinoe, O: four Heliconius speciesfrom the silvaniform clade (see Materials and Methods for list). Coloredpoints show windows located within the wing-patterning loci HmB(red) and HmYb (yellow). (C) The variance among D and fd values foreach chromosome, plotted against the mean nucleotide diversity fromall windows for each chromosome.

    249

    Locating Introgression . doi:10.1093/molbev/msu269 MBE

    Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550by gueston 29 January 2018

    10 'Background''Alternate' ,prior toKronforst's http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1-to

  • of the same type, and a significance threshold of 99%. We firstperformed simulations with a recombination rate parameter(4Nr) of 0.01, and later repeated all simulations at 4Nr = 0.001.The results of all tests are given in full in supplementary tablesS1S3, Supplementary Material online. These results are sum-marized in table 1 and figure 5, and a single illustrative exam-ple for each model type is given in figure 4EL.

    As predicted, in all models simulating gene flow, averagedXY between P2 and P3 was significantly lower in Alternatewindows compared with Background windows. In contrast, inall models simulating ancestral population structure, therewas no significant difference in P2P3 dXY between theBackground and Alternate windows, again in agreementwith predictions. These findings therefore demonstrate thatthe intuitive premise of Smith and Kronforsts (2013) methodis justified.

    We then tested whether introgression could be distin-guished from shared ancestral variation where loci withshared ancestry are not known (as would be the situationwith empirical data), but are instead inferred by selecting thetop 10% of D values (outliers), following the Smith andKronforst method. We also tested this method using thetop 10% of f estimates among windows with positive D(using f d, f G, and f hom). Using the D statistic to identify out-liers, mean dXY between P2 and P3 was significantly reduced inoutlier windows as compared with nonoutlier windows in all120 models simulating gene flow from P3 to P2, and all butone of the models simulating gene flow from P2 to P3. Thesingle nonsignificant case had the most ancient possible t23and the most recent possible t12, and only 11.9% of D outlierwindows were genuine Alternate windows, the lowest recallof any model. Using any of the three f estimators, mean dXY

    FIG. 4. Simulations to evaluate a method to distinguish introgression from shared ancestral variation. (AC) Combined models were made up of 9,000sequence windows simulated under the Background topology (brown outline) and 1,000 windows simulated under an Alternate topology (coloredline). Three distinct evolutionary scenarios were simulated by varying the split times t12, t23, tGF, and tSTR; (A, E, I) Gene flow from P2 to P3, (B, F, J) geneflow from P3 to P2, (C, G, K) ancestral structure. (D, H, L) Null models were made up of 10,000 sequences simulated under the background topology only.(EL) Example data from a single simulated data set for each of the four types of models. Split times (in units of 4N generations) were as follows: t12 = 0.6in all four cases, t23 = 0.8 in all four cases, tGF = 0.4 in both gene flow models and tSTR = 1.0. Points show mean and standard deviation for P2P3 dXYcalculated over subsets of trees: Simulated Background and Alternate trees (brown and colored points) or nonoutliers and outliers (gray and blackpoints) identified using the D and f d statistics. A significant reduction in P2P3 dXY for the Alternate compared with Background windows, or for outlierscompared with nonoutliers, is indicated by astrices. (EH) Results of simulations with a population recombination rate (4Nr) of 0.01. (IL) Results for thesame models, but with a population recombination rate (4Nr) of 0.001.

    250

    Martin et al. . doi:10.1093/molbev/msu269 MBE

    Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550by gueston 29 January 2018

    http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1-to to --

  • between P2 and P3 was significantly reduced in outlier win-dows in all gene flow models.

    However, mean P2P3 dXY was also significantly reduced inD and f outlier windows in more than half of the 120 modelssimulating ancestral population structure (figs. 4C, 4G, and 5;table 1 and supplementary table S3, Supplementary Materialonline). This demonstrates that a simple test for reduceddivergence in P2P3 dXY among D or f outlier windowswould, under a range of ancestral structure scenarios, produceresults consistent with introgression. The fact that this biaswas similar whether D or f estimators were used to identifyoutliers indicates that there is an inherent tendency in all ofthese statistics toward regions with below-average divergencebetween P2 and P3. To confirm this finding, we analyzed a setof simulations using a null model, with no gene flow or struc-ture in any of the 10,000 windows, over 45 permutations ofsplit times (supplementary table S4, Supplementary Materialonline). Outlier windows showed significantly reduced dXYbetween P2 and P3 in most or all of the null models. Finally,we repeated all of these simulations with a lower within-window recombination rate parameter (4Nr) of 0.001. Thistended to increase the reduction in P2P3 dXY for outliers inancestral structure and null models (figs. 4IL and 5), with atmost three of the ancestral structure models showing nonsig-nificant drops in P2P3 dXY for outliers, and most or all nullmodels showing significantly lower P2P3 dXY for outliers, re-gardless of the statistic used (table 1).

    In summary, although shared ancestral variation and in-trogression can theoretically be distinguished based on thefact that only the latter should reduce dXY between P2 and P3,an inherent bias in both the D and f statistics makes a simpletest for a statistical difference in dXY between outliers andnonoutliers problematic. Both D and f outliers tendedtoward windows with lower P2-P3 dXY, regardless of the un-derlying evolutionary history, and particularly when recombi-nation rates were low. In the absence of any gene flow, theoutliers must therefore be identifying windows that coalescemore recently in the ancestral population. However, evenwhen the reduction in P2-P3 dXY was significant for ancestral

    structure or null models, it was typically smaller than thereductions in dXY seen in the gene flow models (fig. 5). Inthe presence of gene flow, some windows coalesce more re-cently than the species split, so the magnitude of the reduc-tion in P2-P3 dXY is greater. This difference could potentially beused to distinguish introgression from shared ancestral vari-ation, but can not be done with a simple significance test, andwill require a more sophisticated model-fitting approach.

    DiscussionWith the advent of population genomics, studies of speciesdivergence have moved from simply documenting in-terspecific gene flow, toward the identification of specificgenomic regions that show strong signals of either introgres-sion or divergence (Garrigan et al. 2012; Heliconius GenomeConsortium 2012; Staubach et al. 2012; Roux et al. 2013; Bosseet al. 2014; Huerta-Sanchez et al. 2014; Sankararaman et al.2014). This is a useful goal for many reasons. It can permit theidentification of large-scale trends, such as chromosomal dif-ferences, and the fine-scale localization of putative targets ofadaptive introgression for further characterization. Therefore,simple and easily computable statistics that can be used toidentify loci with a history of introgression have considerableappeal.

    Previous studies have explored the behavior of PattersonsD statistic, a test for gene flow based on detecting an inequal-ity in the numbers of ABBA and BABA patterns, using whole-genome analyses across large numbers of informative sites(Green et al. 2010; Yang et al. 2012; Eaton and Ree 2013;Martin et al. 2013; Wall et al. 2013). These studies haveshown that D is a robust method when applied as intended:To test for an excess of shared variation on a genome-widescale. Indeed, a major strength of the ABBABABA test is thatit combines data from across the genome, accounting forchance fluctuations among loci, and therefore is able todetect the net effect of gene flow. Moreover, the noninde-pendence among linked sites can be accounted for by block-jackknifing (Green et al. 2010).

    Table 1. Number of Simulated Models in Which P2P3 dXY Is Significantly Reduced in Alternate versus Background Windows, or in Outlier versusNonoutlier Windows for the Introgression Statistics.

    Model Type (No. of models) Number of Models with Significantly Reduced Mean P2P3 dXY

    Alternate versusBackground

    D Outliers versusNonoutliers

    fd Outliers versusNonoutliers

    fG Outliers versusNonoutliers

    fhom Outliers versusNonoutliers

    4Nr = 0.01

    Gene flow P3 ! P2 (120) 120 119 120 120 120Gene flow P2 ! P3 (120) 120 120 120 120 120Ancestral structure (120) 0 105 72 67 66

    Null (45) 0 39 44 45 44

    4Nr = 0.001

    Gene flow P3 ! P2 (120) 120 120 120 120 120Gene flow P2 ! P3 (120) 120 120 120 120 120Ancestral structure (120) 0 120 119 118 118

    Null (45) 0 32 45 45 45

    NOTE.Significantly lower dXY was evaluated using a Wilcoxon rank-sum test, with a 99% significance threshold after Bonferroni correction over the 120 models of each type(except for null models, of which there were 45).

    251

    Locating Introgression . doi:10.1093/molbev/msu269 MBE

    Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550by gueston 29 January 2018

    , 4, , http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC110 http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1-, 3 ---s, Patterson's whole --

  • However, it is not clear whether D can be extended beyondits original use to identify specific loci with introgressed var-iation. We have documented two main problems with thisapproach. First, D is not an unbiased estimator of the amountof introgression that has occurred. In particular, it is influ-enced by effective population size (Ne), leading to more ex-treme values when Ne is low. Second, when calculated oversmall windows, it is highly stochastic, particularly in genomicregions of low diversity and low recombination rate, such thatD outliers will tend to be clustered within these regions. Localreductions in genetic diversity along a chromosome can comeabout through neutral processes, such as population bottle-necks, but also through directional selection. Therefore, theseproblems may be exacerbated in studies specifically interestedin loci that experience strong selective pressures, as this wouldincrease the likelihood of detecting chance outliers at suchloci.

    Direct estimation of f, the proportion of introgression,holds more promise as a robust method for detecting intro-gressed loci. Green et al. (2010) proposed that f could beestimated by comparing the observed difference in thenumber of ABBA and BABA patterns to that which wouldbe expected in the event of complete introgression. As thisexpected value is calculated from the observed data, thismethod controls for differences in the level of standing var-iation, making it more suitable for application to small

    regions. In Green et al.s approach, complete introgressionfrom P3 to P2 was taken to mean that P2 would come toresemble a subpopulation of lineage P3. Here, we make theconservative assumption that complete introgression wouldlead to homogenization of allele frequencies, such that thefrequency of the derived allele in P2 would be identical to thatin P3. Green et al.s approach assumed unidirectional intro-gression from P3 to P2, but can lead to spurious values whenintrogression occurs in the opposite direction. We have there-fore proposed a new dynamic estimator of f, in which thedonor population can differ between sites, and is always thepopulation with the higher frequency of the derived allele.Although this conservative estimator leads to slight underes-timation of the amount of introgression that has occurred, itprovides an estimate that is roughly proportional to the levelof introgression, regardless of the direction. It is therefore amore suitable measure for identifying introgressed loci. This issupported by our analysis of whole-genome data fromHeliconius butterflies, where many 5-kb windows had maxi-mal D values (D = 1), but only a few had high f d values, thevast majority of which were located around the wing-pattern-ing loci previously identified as being shared between thesespecies through adaptive introgression (Heliconius GenomeConsortium 2012; Pardo-Diaz et al. 2012).

    The sensitivity of D to heterogeneous genomic diversity islikely to affect studies that have drawn conclusions from D

    FIG. 5. Mean dXY between P2 and P3 in outlier windows as a percentage of P2P3 dXY in nonoutlier windows. Outlier windows defined by Alternate orBackground topology (simulation) or by outlying D and f values, as per figure 4. Model types shown in color (gene flow from P2 to P3, green;gene flow from P3 to P2, blue; ancestral structure, red; null model, brown). Results for two different recombination rates are shown (4Nr = 0.01, left;4Nr = 0.001, right).

    252

    Martin et al. . doi:10.1093/molbev/msu269 MBE

    Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550by gueston 29 January 2018

    lyly'.'s

  • statistics calculated for particular genome regions. For exam-ple, Wall et al. (2013) identified regions carrying putative long(8100 kb) haplotypes segregating in European humans, andthen found that these regions showed evidence of aNeanderthal origin, as indicated by elevated D statistics.However, it may be that such haplotypes would be overrep-resented in low-recombination regions, which also tend tohave reduced diversity in humans and many other species(Cutter and Payseur 2013). In another recent Heliconius study,FST was calculated for 5-kb windows across the genome.Windows showing increased differentiation between H. mel-pomene and H. pachinus (according to FST) also showed sig-nificantly elevated D statistics in a test for introgressionbetween the same species pair (Kronforst et al. 2013). Thisillustrates how the sensitivity of both D and FST to within-species diversity can produce conflicting results. This sensitiv-ity is likely to be particularly problematic in studies using verysmall genomic regions. At the extreme, Rheindt et al. (2014)calculated D for single SNPs and predicted that genes linkedto SNPs with outlying D values are more likely to have beenintrogressed.

    In the present study, to investigate whether biases in Dwhen calculated over small regions could influence subse-quent analyses, we investigated a recently proposedmethod to distinguish between introgression and shared an-cestral variation (Smith and Kronforst 2013). The premise ofthis test is that introgression should result in an excess ofshared derived alleles and a reduction in absolute divergence(dXY), whereas shared ancestral variation will exhibit theformer but not the latter signature. Our simulations con-firmed that the intuitive predictions of this method arevalid, but also showed that this test can be misled by theuse of D to identify outliers. Windows that were outliers for Dexhibited below average dXY in simulations with gene flow,but also in most simulations with ancestral structure, orwhere both gene flow and ancestral population structurewere absent. All three f estimators also failed to distinguishbetween introgression and ancestral structure in manymodels. This implies that all of these statistics are systemat-ically biased toward regions that coalesce more recently, re-gardless of whether gene flow has occurred.

    We predict that D would have additional problems in realgenomes, where selective constraint leads to a correlationbetween within-species diversity and between-species diver-gence, causing D outliers to be even more strongly associatedwith reduced dXY. However, it is notable that the reduction indivergence among D and f outliers was almost always greaterin simulations with introgression than in simulations withancestral structure or with no Alternate topology, across alarge range of split times and dates of gene flow. There may,therefore, be considerable information about the evolutionaryhistory of DNA sequences present in the joint distribution ofdXY and f d. On the other hand, in real data, levels of diver-gence can vary dramatically due to heterogeneity in selectiveconstraint, mutation rate, and recombination rate, whichwould exaggerate the problems described here. Even an un-biased statistic, when applied to small genomic windows,would be confounded by heterogeneity in recombination

    rate across the genome. In regions of reduced recombination,fewer independent data points are sampled by each window,so extreme estimates become more likely. Heterogeneity inrecombination rate is therefore an essential consideration inany study that aims to scan the genome for regions ofinterest.

    ConclusionsIn an era of increasing availability of genomic data, there is ademand for simple summary statistics that can reliably iden-tify genomic regions that have been subject to selection, in-trogression and other evolutionary processes. It seemsunlikely, however, that any single summary statistic will beable to reliably distinguish these processes from noise intro-duced by demography, drift, and heterogeneity in recombi-nation rate. Here, we have shown that, while Pattersons Dstatistic provides a robust signal of shared ancestry across thegenome, it should not be used for nave scans to ascribeshared ancestry to small genomic regions, due to its tendencytoward extreme values in regions of reduced variation.Estimation of f, the proportion of introgression, particularlyusing our proposed statistic f d, provides a better means ofidentifying putatively introgressed regions. Nevertheless, bothD and f d tend to identify regions of reduced interspeciesdivergence, even in the absence of gene flow, which mayconfound tests to distinguish between recent introgressionand shared ancestral variation based on absolute divergence(dXY) in outlier regions. However, the joint distribution of dXYand f statistics may be a useful summary statistic for model-fitting approaches to distinguish between these evolutionaryhypotheses.

    Materials and Methods

    Statistics Used to Detect Shared Ancestry

    In this study, we focused on an approach to identify an excessof shared derived polymorphisms, indicated by the relativeabundance of two SNP patterns termed ABBAs and BABAs(Green et al. 2010). Given three populations and an outgroupwith the relationship (((P1, P2), P3), O) (fig. 1A), ABBAs aresites at which the derived allele B is shared between thenonsister taxa P2 and P3, whereas P1 carries the ancestralallele, as defined by the outgroup. Similarly, BABAs are sitesat which the derived allele is shared between P1 and P3,whereas P2 carries the ancestral allele. Under a neutral coa-lescent model, both patterns can only result from incompletelineage sorting or recurrent mutation, and should be equallyabundant in the genome (Durand et al. 2011). A significantexcess of ABBAs over BABAs is indicative either of gene flowbetween P2 and P3, or some form of nonrandom mating orstructure in the population ancestral to P1, P2, and P3. Thisexcess can be tested for, using Pattersons D statistic,

    D P1; P2; P3;O X

    CABBA i CBABA i XCABBA i CBABA i

    1

    where CABBA(i) and CBABA(i) are counts of either 1 or 0, de-pending on whether or not the specified pattern (ABBA or

    253

    Locating Introgression . doi:10.1093/molbev/msu269 MBE

    Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550by gueston 29 January 2018

    --FSTFSTFST-``''``''``'' -while while -'

  • BABA) is observed at site i in the genome. Under thenull hypothesis of no gene flow and random mating inthe ancestral population, D will approach zero, regardlessof differences in effective population sizes (Durand et al.2011). Hence, a D significantly greater than zero is indicativeof a significant excess of shared derived alleles between P2and P3.

    If population samples are used, then rather than binarycounts of fixed ABBA and BABA sites, the frequency of thederived allele at each site in each population can be used(Green et al. 2010; Durand et al. 2011), effectively weightingeach segregating site according to its fit to the ABBA or BABApattern, with

    CABBA i 1 pi1

    pi2pi3 1 pi4

    2

    CBABA i pi1 1 pi2

    pi3 1 pi4

    3

    where pij is the frequency of the derived allele at site i inpopulation j. These values are then used in equation 1 tocalculate D (Durand et al. 2011).

    Green et al. (2010) also proposed a related method toestimate f, the fraction of the genome shared through intro-gression (Green et al. 2010; Durand et al. 2011). This methodmakes use of the numerator of equation 1, the differencebetween sums of ABBAs and BABAs, which is called S. Inthe example described above, with ((P1,P2),P3),O), the propor-tion of the genome that has been shared between P2 and P3subsequent to the split between P1 and P2 can be estimatedby comparing the observed value of S to a value estimatedunder a scenario of complete introgression from P3 to P2. P2would then resemble a lineage of the P3 taxon, and so thedenominator of equation 1 can be estimated by replacing P2in equations 2 and 3 with a second lineage sampled from P3,or by splitting the P3 sample into two,

    f G S P1; P2; P3;O

    S P1; P3a; P3b;O 4

    where P3a and P3b are the two lineages sampled from P3.Splitting P3 arbitrarily in this way may lead to stochasticerrors at individual sites, particularly with small samplesizes. These should be negligible when whole-genome dataare analyzed but could easily lead to erroneous values of f(including f 4 1) when small genomic windows are ana-lyzed, as in the present study. We therefore used a moreconservative version, in which we assume that complete in-trogression from P3 to P2 would lead to complete homoge-nization of allele frequencies. Hence, in the denominator, P3aand P3b are both substituted by P3:

    f hom S P1; P2; P3;O S P1; P3; P3;O

    5

    Although this conservative assumption may lead to underes-timation of the proportion of sites shared, it also reduces therate of stochastic error. Moreover, in the present study, we areless concerned with the absolute value of f , and more withthe relative values of f between genomic regions.

    The f statistic assumes unidirectional gene flow from P3 toP2 (i.e., P3 is the donor and P2 is the recipient). Because thebranch leading to P3 is longer than that leading to P2 (fig. 1A),gene flow in the opposite direction (P2 to P3) is likely togenerate fewer ABBAs. Thus, in the presence of gene flowfrom P2 to P3, or in both directions, the f equation should leadto an underestimate. However, when small genomic windowsare analyzed, the assumption of unidirectional gene flowcould lead to overestimates, because any region in whichderived alleles are present in both P2 and P3, but happen tobe at higher frequency in P2, will yield f estimates that aregreater than 1. Thus, we propose a dynamic estimator inwhich the denominator is calculated by defining a donorpopulation (PD) for each site independently. For each site,PD is the population (either P2 or P3) that has the higherfrequency of the derived allele, thus maximizing the denom-inator and eliminating f estimates greater than 1:

    f d S P1; P2; P3;O S P1; PD; PD;O

    6

    Assessing the Ability of D and f Estimators toQuantify Introgression in Small Sequence Windows

    To assess how reliably Pattersons D statistic, and other esti-mators of f are able to quantify the actual rate of introgression,we simulated sequence data sets with differing rates of intro-gression using ms (Hudson 2002). For each data set, we sim-ulated 100 sequence windows for eight haplotypes each fromfour populations with the relationship (((P1,P2),P3),O). Thesplit times t12 and t23 (as on fig. 1A) were set to 1 4N gen-erations and 2 4N generations ago, respectively, and theroot was set to 3 4N generations ago. An instantaneous,unidirectional admixture event, either from P3 to P2 or fromP2 to P3, was simulated at a time tGF with a value f, whichdetermines the probability that each haplotype is shared. Wetested two different values for tGF: 0.1 and 0.5 4N genera-tions ago. For each direction of gene flow and each tGF, 11simulated data sets were produced, with f values ranging from0 (no gene flow) to 1 (all haplotypes are shared). Finally, theentire set of simulations was repeated with three differentwindow sizes: 1, 5, and 10 kb, and with three different recom-bination rates: 0.001, 0.01, and 0.1, in units of 4Nr, the pop-ulation recombination rate. DNA sequences were generatedfrom the simulated trees using Seq-Gen (Rambaut and Grass1997), with the Hasegawa-Kishino-Yano substitution modeland a branch scaling factor of 0.01. Simulations were run usingthe provided script compare_f_estimators.r, which generatesthe ms and Seq-Gen commands automatically. An exampleset of commands to simulate a single 5-kb sequence using thesplit times mentioned above, with gene flow from P3 to P2 attGF = 0.1 and f = 0.2, and with a recombination rate parameterof 0.01 would be:

    ms 32 1 -I 4 8 8 8 8 -ej 1 2 1 -ej 2 3 1 -ej 3 4 1 -es 0.1 2 0.8 -ej0.1 5 3 -r 50 5000 -T j tail -n + 4 j grep -v // 4 treefilepartitions=($(wc -l treefile))

    254

    Martin et al. . doi:10.1093/molbev/msu269 MBE

    Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550by gueston 29 January 2018

    , , While Since '8 HKY

  • seq-gen -mHKY -l 5000 -s 0.01 -p $partitions

  • (4Nr) values of 0.01 and 0.001. Summary statistics for allmodels were compiled using generate_summary_statistics.R.

    Software

    All scripts mentioned in the text, along with the generateddatasets, are available at http://dx.doi.org/10.5061/dryad.j1rm6. This work was made possible by the free, opensource software packages EggLib (De Mita and Siol 2012),phyclust (Chen 2011), R (R Core Team 2013), ggplot2(Wickham 2009), plyr (Wickham 2011), reshape (Wickham2007), and Inkscape (http://www.inkscape.org, last accessedAugust 20, 2014).

    Supplementary MaterialSupplementary figures S1S4 and tables S1S4 are available atMolecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

    Acknowledgments

    The authors are grateful to Milan Malinsky for helpful anddetailed discussions about this work. The authors also thankJim Mallet, Marcus Kronforst, and two anonymous reviewersfor their comments on the manuscript. Richard Merrill andRichard Wallbank contributed to initial discussions that in-spired the work. This work was supported by the LeverhulmeTrust (F/09364/E to C.D.J.), the Herchel Smith Fund(Postdoctoral Fellowship to J.W.D.) and BBSRC (BB/H01439X/1, to C.D.J.).

    ReferencesAbbott R, Albach D, Ansell S, Arntzen JW, Baird SJ, Bierne N, Boughman

    J, Brelsford A, Buerkle CA, Buggs R, et al. 2013. Hybridization andspeciation. J Evol Biol. 26:229246.

    Barton NH, Gale KS. 1993. Genetic analysis of hybrid zones. In: Price J,Harrison RG, editors. Hybrid zones and the evolutionary process.New York: Oxford University Press.

    Bosse M, Megens H-J, Frantz LAF, Madsen O, Larson G, Paudel Y,Duijvesteijn N, Harlizius B, Hagemeijer Y, Crooijmans RP, et al.2014. Genomic analysis reveals selection for Asian genes inEuropean pigs following human-mediated introgression. NatCommun. 5:4392.

    Charlesworth B. 1998. Measures of divergence between populations andthe effect of forces that reduce variability. Mol Biol Evol. 15:538543.

    Chen W-C. 2011. Overlapping codon model, phylogenetic clustering,and alternative partial expectation conditional maximization algo-rithm [Ph.D. Dissertation]. Ames (IA): Iowa Stat University.

    Churchhouse C, Marchini J. 2013. Multiway admixture deconvolutionusing phased or unphased ancestral panels. Genet Epidemiol. 37:112.

    Cruickshank TE, Hahn MW. 2014. Reanalysis suggests that genomicislands of speciation are due to reduced diversity, not reducedgene flow. Mol Ecol. 23:31333157.

    Cutter AD, Payseur BA. 2013. Genomic signatures of selection at linkedsites: unifying the disparity among species. Nat Rev Genet. 14:262274.

    De Mita S, Siol M. 2012. EggLib: processing, analysis and simulation toolsfor population genetics and genomics. BMC Genet. 13:27.

    Durand EY, Patterson N, Reich D, Slatkin M. 2011. Testing for ancientadmixture between closely related populations. Mol Biol Evol. 28:22392252.

    Eaton D, Ree R. 2013. Inferring Phylogeny and Introgression usingRADseq data: an example from flowering plants (Pedicularis:Orobanchaceae). Syst Biol. 682:689706.

    Ellegren H, Smeds L, Burri R. 2012. The genomic landscape of speciesdivergence in Ficedula flycatchers. Nature 491:756760.

    Eriksson A, Manica A. 2012. Effect of ancient population structure onthe degree of polymorphism shared between modern human pop-ulations and ancient hominins. Proc Natl Acad Sci U S A. 109:1395613960.

    Garrigan D, Kingan SB, Geneva AJ, Andolfatto P, Clark AG, Thornton K,Presgraves DC. 2012. Genome sequencing reveals complex specia-tion in the Drosophila simulans clade. Genome Res. 22:14991511.

    Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, Kircher M, PattersonN, Li H, Zhai W, Fritz MH, et al. 2010. A draft sequence of theNeandertal genome. Science 328:710722.

    Hahn MW, White BJ, Muir CD, Besansky NJ. 2012. No evidence forbiased co-transmission of speciation islands in Anopheles gambiae.Philos Trans R Soc Lond B Biol Sci. 367:374384.

    Heliconius Genome Consortium. 2012. Butterfly genome reveals pro-miscuous exchange of mimicry adaptations among species. Nature487:9498.

    Henn BM, Botigue LR, Gravel S, Wang W, Brisbin A, Byrnes JK,Fadhlaoui-Zid K, Zalloua PA, Moreno-Estrada A, Bertranpetit J,et al. 2012. Genomic ancestry of North Africans supports back-to-Africa migrations. PLoS Genet. 8:e1002397.

    Hudson RR. 2002. Generating samples under a Wright-Fisher neutralmodel of genetic variation. Bioinformatics 18:337338.

    Huerta-Sanchez E, Jin X, Asan., Bianba Z, Peter BM, Vinckenbosch N,Liang Y, Yi X, He M, Somel M, et al. 2014. Altitude adaptation inTibetans caused by introgression of Denisovan-like DNA. Nature512:194197.

    Kronforst MR, Hansen MEB, Crawford NG, Gallant JR, Zhang W,Kulathinal RJ, Kapan DD, Mullen SP. 2013. Hybridization revealsthe evolving genomic architecture of speciation. Cell Rep. 5:666677.

    Kulathinal RJ, Stevison LS, Noor MAF. 2009. The genomics of speciationin drosophila: diversity, divergence, and introgression estimated us-ing low- coverage genome sequencing. PLoS Genet. 5(7):e1000550.

    Lawson DJ, Hellenthal G, Myers S, Falush D. 2012. Inference of popula-tion structure using dense haplotype data. PLoS Genet. 8:e1002453.

    Maples BK, Gravel S, Kenny EE, Bustamante CD. 2013. RFMix: a discrim-inative modeling approach for rapid and robust local-ancestryinference. Am J Hum Genet. 93:278288.

    Martin SH, Dasmahapatra KK, Nadeau NJ, Salazar C, Walters JR, SimpsonF, Blaxter M, Manica A, Mallet J, Jiggins CD. 2013. Genome-wideevidence for speciation with gene flow in Heliconius butterflies.Genome Res. 23:18171828.

    Noor MAF, Bennett SM. 2009. Islands of speciation or mirages in thedesert? Examining the role of restricted recombination in maintain-ing species. Heredity 103:439444.

    Omberg L, Salit J, Hackett N, Fuller J, Matthew R, Chouchane L,Rodriguez-Flores JL, Bustamante C, Crystal RG, Mezey JG. 2012.Inferring genome-wide patterns of admixture in Qataris usingfifty-five ancestral populations. BMC Genet. 13:49.

    Pardo-Diaz C, Salazar C, Baxter SW, Merot C, Figueiredo-Ready W, JoronM, McMillan WO, Jiggins CD. 2012. Adaptive introgressionacross species boundaries in Heliconius butterflies. PLoS Genet. 8:e1002752.

    Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y,Genschoreck T, Webster T, Reich D. 2012. Ancient admixture inhuman history. Genetics 192:10651093.

    Pinho C, Hey J. 2010. Divergence with gene flow: models and data. AnnuRev Ecol Evol Syst. 41:215230.

    Price AL, Tandon A, Patterson N, Barnes KC, Rafaels N, Ruczinski I, BeatyTH, Mathias R, Reich D, Myers S. 2009. Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populations.PLoS Genet. 5:e1000519.

    R Core Team. 2013. R: a language and environment for statistical com-puting. Vienna, Austria: R Foundation for Statistical Computing.[cited 2014 Aug 20]. Available from: http://www.R-project.org/.

    Rambaut A, Grass NC. 1997. Seq-Gen: an application for the MonteCarlo simulation of DNA sequence evolution along phylogenetictrees. Bioinformatics 13:235238.

    256

    Martin et al. . doi:10.1093/molbev/msu269 MBE

    Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550by gueston 29 January 2018

    http://dx.doi.org/10.5061/dryad.j1rm6http://dx.doi.org/10.5061/dryad.j1rm6http://www.inkscape.orghttp://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msu269/-/DC1http://www.mbe.oxfordjournals.org/http://www.mbe.oxfordjournals.org/http://www.R-project.org/

  • Reich D, Patterson N, Campbell D, Tandon A, Mazieres S, Ray N, ParraMV, Rojas W, Duque C, Mesa N, et al. 2012. Reconstructing NativeAmerican population history. Nature 488:370374.

    Reich D, Thangaraj K, Patterson N, Price AL, Singh L. 2009.Reconstructing Indian population history. Nature 461:489494.

    Rheindt FE, Fujita MK, Wilton PR, Edwards SV. 2014. Introgression andphenotypic assimilation in Zimmerius flycatchers (Tyrannidae):population genetic and phylogenetic inferences from genome-wide SNPs. Syst Biol. 63:134152.

    Roesti M, Hendry AP, Salzburger W, Berner D. 2012. Genome divergenceduring evolutionary diversification as revealed in replicate lake-stream stickleback population pairs. Mol Ecol. 21:28522862.

    Roux C, Tsagkogeorga G, Bierne N, Galtier N. 2013. Crossing thespecies barrier: genomic hotspots of introgression between twohighly divergent Ciona intestinalis species. Mol Biol Evol. 30:15741587.

    Sankararaman S, Mallick S, Dannemann M, Prufer K, Kelso J, Paabo S,Patterson N, Reich D. 2014. The genomic landscapeof Neanderthal ancestry in present-day humans. Nature 507:354357.

    Sankararaman S, Sridhar S, Kimmel G, Halperin E. 2008. Estimatinglocal ancestry in admixed populations. Am J Hum Genet. 82:290303.

    Smith J, Kronforst MR. 2013. Do Heliconius butterfly species exchangemimicry alleles? Biol Lett. 9:20130503.

    Staubach F, Lorenc A, Messer PW, Tang K, Petrov DA, Tautz D. 2012.Genome patterns of selection and introgression of haplotypes innatural populations of the house mouse (Mus musculus). PLoSGenet. 8:e1002891.

    Wall JD, Yang MA, Jay F, Kim SK, Durand EY, Stevison LS, Gignoux C,Woerner A, Hammer MF, Slatkin M. 2013. Higher levels of nean-derthal ancestry in East Asians than in Europeans. Genetics 194:199209.

    Wickham H. 2007. Reshaping data with the reshape package. J StatSoftw. 21(12):1213.

    Wickham H. 2009. ggplot2: elegant graphics for data analysis. New York:Springer. p. 213.

    Wickham H. 2011. The split-apply-combine strategy for data analysis.J Stat Softw. 40(1):129.

    Wu C. 2001. The genic view of the process of speciation. J Evol Biol. 14:851865.

    Yang MA, Malaspinas A-S, Durand EY, Slatkin M. 2012. Ancient struc-ture in Africa unlikely to explain Neanderthal and non-African ge-netic similarity. Mol Biol Evol. 29:29872995.

    Yang Z. 2010. A likelihood ratio test of speciation with gene flow usinggenomic sequence data. Genome Biol Evol. 2:200211.

    257

    Locating Introgression . doi:10.1093/molbev/msu269 MBE

    Downloaded from https://academic.oup.com/mbe/article-abstract/32/1/244/2925550by gueston 29 January 2018