DESISLAVA PETKOVA
INFERRING EFFECTIVE MIGRATION FROM GEOGRAPHICALLY INDEXED GENETIC DATA
THE UNIVERSITY OF CHICAGO
Contents
1 Population Structure in Genetic Variation
2 Population Structure due to Migration
3 Genetic Dissimilarities and Distance Matrices
4 Estimating Effective Rates of Migration
5 Simulations of Structured Genetic Data
6 Empirical Results
7 Appendices
8 Bibliography
List of Figures
2.1 A genealogy describes the ancestral history of a genotyped sample
2.2 A random walk approximates the migration process in a population graph
2.3 Effective resistances approximate expected coalescence times: relative error
5.1 Population structure under uniform migration
5.2 Population structure due to a barrier to migration
5.3 Uncertainty in the inferred migration surface
5.4 Barrier to migration with ascertainment bias
5.5 Population structure due to differences in deme size
5.6 A past demographic event results in a barrier to effective migration
5.7 Barrier to migration with uneven sampling
6.1 Habitat of the red-backed fairywren with the Carpentarian barrier
6.2 PCA and STRUCTURE analysis of the red-backed fairywren data
6.3 Distance scatterplot for the red-backed fairywren data
6.4 Triangular population graph spans the habitat of the red-backed fairywren
6.5 Inferred effective migration surface for the red-backed fairywren
6.6 Uncertainty in the inferred effective migration of the red-backed fairywren
6.7 Triangular population graph spans the habitat of the African elephant
6.8 PCA analysis of the elephant data
6.9 Inferred effective migration surface for the African elephant
6.10 Effective migration rates at each of sixteen microsatellites
6.11 Inferred effective migration surface for the savanna and forest elephants
6.12 GENELAND analysis of the African elephant data
6.13 STRUCTURE analysis of the African elephant data
6.14 Distance scatterplots for the African elephant data
6.15 Sample configuration and PCA analysis of the European and African data
6.16 Distance scatterplots for the European and African data
6.17 Inferred effective migration for human populations in Europe and Africa
6.18 Sample configuration and PCA analysis of Arabidopsis thaliana data
6.19 Inferred effective migration surfaces for Arabidopsis thaliana
6.20 Distance scatterplots for the Arabidopsis thaliana data
7.1 ms command: uniform migration on a regular triangular grid
7.2 ms command: barrier to migration on a regular triangular grid
7.3 ms command: barrier to effective migration due to differences in population size
7.4 ms command: uniform effective migration despite differences in population size
7.5 ms command: barrier to effective migration due to a split in time
Genetic data often exhibit patterns that are broadly consistent with "isolation by distance", a phenomenon where genetic similarity tends to decay with geographic distance. In a heterogeneous habitat, decay may occur more quickly in some regions than others: for example, barriers to gene flow such as mountains or deserts could accelerate the genetic differentiation between neighboring groups. In this thesis we present a method to quantify and visualize variation in effective migration across the habitat, and, under further assumptions, to infer the presence or absence of barriers to migration, from geographically indexed large-scale genetic data.
First and foremost, I would like to express my deepest gratitude to my supervisor, Professor Matthew Stephens, for his guidance and encouragement throughout the development of this project. His tremendous support made this formidable journey less complicated, if not easier.
I am also grateful to my colleagues at the Departments of Statistics and Human Genetics for their inspiring companionship, for their collective critical eye, but above all, for their advice, assistance and guidance on many difficult problems.
I am indebted to so many but I especially wish to acknowledge the efforts, love and encouragement of my parents. They watched me from a distance while I worked towards my degree. The completion of this thesis would mean a lot to them, so I dedicate this project to my parents, Ema and Ivo.
And finally, I would like to thank my sister and my friends for allowing me to realize my own potential and without whose love, affection and encouragement this thesis and many other pursuits would not have been successful.
1
Population Structure in Genetic Variation
The term "population structure" is used to describe nonrandom patterns of genetic similarity (or alternatively, dissimilarity) between individuals from the same species. One task is to detect such patterns; this is often done in association studies because systematic ancestry differences between cases and controls that are not genetic risk factors for the disease can bias the results of the study [Price et al., 2010]. A more challenging task is to explain population structure as the outcome of events in the evolutionary history of the species such as splits or admixture (events in time) and/or migration (events in space) [Lawson and Falush, 2012].
[Admixture is partial ancestry from two or more distinct subpopulations as the result of interbreeding.]
Two widely used approaches for inferring genetic ancestry are principal components analysis and model-based clustering. In both cases, interpretation of results and inference of demography are founded on the assumption that sample structure is evidence for population structure, to the exclusion of other possible sources such as family structure, cryptic relatedness or sample processing artifacts.
Principal components analysis (PCA) was first used in population genetics to summarize human genetic variation across continents [Menozzi et al., 1978, Cavalli-Sforza et al., 1994]. Their synthetic maps of allele frequency variation show gradients that could support hypotheses for specific migration events such as the spread of Neolithic farming. This interpretation of PCA maps is not universally accepted because PCA can produce similar wave patterns in simulated spatial data, where gradients result from local dispersal and not directed migration [Novembre and Stephens, 2008]. However, even though PCA might not explain what processes generated the structured variation in genetic data, the method has been successfully applied to detect population stratification and infer genetic ancestry. For example, the top principal components of the sample covariance matrix across a large number of (randomly selected) SNPs align well with geographic distribution in some datasets [Novembre et al., 2008, Wang et al., 2012].
Alternatively, population structure can be analyzed with a model-based clustering approach. For example, STRUCTURE [Pritchard et al., 2000] assigns individuals into $K$ genetically homogeneous subpopulations [i.e., random mating and hence under Hardy-Weinberg equilibrium], with individual-specific ancestry proportions. As a clustering algorithm, STRUCTURE assumes the number of clusters is known. Even more importantly, it uses a discrete model of population structure that is most applicable where high levels of divergence have resulted in well-differentiated clusters.
Both PCA and STRUCTURE can produce results that are difficult to interpret. For example, STRUCTURE can fail if the population consists of distinct groups characterized by small differences in allele frequencies, or a single population where the distribution of allele frequencies varies continuously across space. In both cases, it is hard for a clustering algorithm to distinguish between clusters, or find the correct number of clusters. The study design can also influence the extent of observed "clusteredness" as many datasets consist of multiple observations from few locations [Serre and Pääbo, 2004]. If the sample configuration does not represent the geographic distribution of the species, much naturally occurring genetic variation remains unobserved. And indeed the genetic differentiation of a widely distributed species such as humans is likely to exhibit evidence for both clusters, which correspond to discontinuous jumps in allele frequencies across large barriers such as oceans or the Himalayas, and clines, which reflect smooth gradations in allele frequencies across unbroken geographic regions [Rosenberg et al., 2005].
In the case of low differentiation between clusters, STRUCTURE results can be improved with a stronger, more informative prior on cluster membership. For example, [Hubisz et al., 2009] introduces a prior that places more weight on cluster assignments that are correlated with sampling locations (because origin is often informative about ancestry). In another modification of STRUCTURE that incorporates geographic information, [Guillot et al., 2005] explicitly models the distribution of clusters across the habitat to encourage spatially continuous clusters (because subspecies often occupy locally connected areas).
In the case of smoothly varying population structure, it is not appropriate to assign individuals to a fixed number of distinct clusters, even if the clustering method allows fractional membership. PCA is effective in presenting continuous variation and PC projections are related to the underlying genealogical process [McVean, 2009]. However, the algorithm is not based on a population genetics model, so it does not estimate relevant demographic parameters, and its results are strongly affected by uneven sampling.
Genetic data often exhibit patterns that are broadly consistent with "isolation by distance" [Weiss and Kimura, 1965, Rousset, 1997] where genetic similarity tends to decay with geographic distance. That is, a population in which the exchange of migrants is constant in both space and time still has structure as individuals that are close together are, on average, more genetically similar than individuals that are far apart [if reproduction and dispersal tend to occur locally/over small distances in every generation].
In a heterogeneous habitat, genetic similarity may decrease faster in some regions than others because a barrier to migration could accelerate the genetic differentiation between neighboring groups, thus creating patterns of population structure that are not consistent with uniform migration. Here we develop a method aimed at investigating this kind of scenario. Specifically, we introduce a parametric model for genetic structure that attempts to explain the spatial structure observed in geographically indexed large-scale genetic data in terms of effective rates of migration. We say "effective" because the model's applicability to genetic data is motivated under a series of assumptions [most importantly, equilibrium in time] that mean estimated rates cannot be interpreted as actual rates of migration unless the assumptions are reasonably satisfied. However, even when estimated population parameters are not directly interpretable in terms of demographic history, our method provides an intuitive and informative way to quantify and visualize spatial patterns of population structure.
[Figure 2.1 (image not reproduced): a genealogical tree with six leaves carrying the alleles 1 ($i$), 1 ($j$), 1, 0, 0, 0; the mutation is marked by • and the branch lengths $t_{ij}$ and $t_{mrca} - t_{ij}$ are indicated.]
Figure 2.1: This genealogy specifies one possible history for a sample of size 6. Since exactly one mutation occurs, denoted by •, we observe a pattern of both 0s and 1s. Regardless of which branch carries the mutation, the events 'only 0s' and 'only 1s' are excluded.
2
Population Structure due to Migration
In this background chapter we explain how population structure is reflected in observed genetic data via the genealogy of the sample and review briefly a mathematical model for spatially structured populations.
In natural populations, mating is not random due to a complex mixture of evolutionary and ecological factors. Non-random mating creates structure in genetic variation as closely related individuals tend to be more similar genetically than distantly related individuals. Thus shared ancestry leads to genetic similarity [Section 2.1].
An important factor for non-random mating is geographic distance as two individuals located close in space are more likely to reproduce than two individuals far apart. Thus geographic proximity leads to genetic similarity, a phenomenon called isolation by distance [Section 2.2].
A population genetics model that exhibits isolation by distance is Kimura's stepping-stone model [Section 2.3]. It represents a spatially distributed population as a graph where vertices are groups of randomly mating individuals (called demes) and edges are direct routes of migration. Thus demes that are closer together in the graph tend to be more similar.
In fact, the stepping-stone model can capture the effect of both geographic distance and heterogeneous habitat on genetic similarity as edges can have different migration rates to reflect heterogeneity in gene flow. This weighted population graph describes precisely what it means for two demes to be "close together" [Section 2.4].
2.1 Pairwise expected coalescence times explain population structure
The more closely related two individuals are, the more genetically similar they are. Therefore, the genetic similarities observed in a sample contain information about the evolutionary processes undergone by the entire population. In this section we explain the connection between genealogical histories and genetic similarities; the review is largely based on [McVean, 2009].
Let $z_1, \dots, z_n$ be the genotypes of $n$ individuals at a single segregating locus. For simplicity, assume the genetic markers are biallelic (e.g., SNPs): each individual carries either the ancestral allele, labeled '0', or the derived allele, labeled '1'.
Although life occurs forward in time and in discrete generations, it is often more convenient to model the ancestry of a sample backwards in time using a continuous-time process called the coalescent [Kingman, 1982b,a] that traces the lineages until their convergence into a single common ancestor. Thus the coalescent constructs the history of the sample, at a single locus, in the form of a genealogical tree [Figure 2.1]. The most important demographic functions of the genealogy are
• the time to the most recent common ancestor $t_{mrca}$ [the height of the tree];
• the total size of the tree $t_{tot}$ [the sum of all branches];
• the pairwise time to coalescence $t_{ij}$ for every pair $(i, j)$ [the length of the path from $i$, or equivalently from $j$, to the most recent common ancestor of $i$ and $j$].
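The three quantities in the list above can be read off a simulated genealogy. The following is a minimal sketch, assuming a single randomly mating population (the standard Kingman coalescent with pairwise coalescence rate 1); the function name `simulate_genealogy` and the time scaling are illustrative choices, not part of the thesis.

```python
import numpy as np

def simulate_genealogy(n, rng):
    """Simulate one coalescent tree for n lineages (assumed pair rate 1).

    Returns (t_mrca, t_tot, t_pair), where t_pair[i, j] is the time at which
    lineages i and j first share a common ancestor.
    """
    clusters = [[i] for i in range(n)]  # leaves below each ancestral lineage
    t_pair = np.zeros((n, n))
    now, t_tot = 0.0, 0.0
    while len(clusters) > 1:
        k = len(clusters)
        # With k lineages there are k(k-1)/2 pairs, so the mean wait is 2/(k(k-1)).
        wait = rng.exponential(2.0 / (k * (k - 1)))
        now += wait
        t_tot += k * wait  # each of the k branches grows by `wait`
        a, b = rng.choice(k, size=2, replace=False)
        for i in clusters[a]:
            for j in clusters[b]:
                t_pair[i, j] = t_pair[j, i] = now
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return now, t_tot, t_pair

rng = np.random.default_rng(0)
t_mrca, t_tot, t_pair = simulate_genealogy(6, rng)
print(t_mrca, t_tot)
print(np.round(t_pair, 3))
```

The largest entry of `t_pair` equals `t_mrca`, because the last pair of lineages to merge defines the root of the tree.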
With a slight abuse of notation, let $K_t$ denote the number of mutations that occur on a path with length $t$. If mutations are generated by a Poisson process with intensity [mutation rate] $\theta$, the probability that the path accumulates a mutation depends only on its relative length, not on its position within the genealogy. In particular, $P\{K_t = 0\} = E\{e^{-\theta t}\}$. Similarly, let $K_{t_{tot}}$ denote the number of mutations that occur throughout the genealogy. Thus $\{K_{t_{tot}} > 0\}$ is the event that the site segregates in the sample. For a fixed mutation rate $\theta$, the probability that at least one mutation occurs on a path with length $t$ and none in the rest of the tree is
$$P\{K_t > 0, K_{t_{tot}-t} = 0\} = E\{(1 - e^{-\theta t})\, e^{-\theta (t_{tot} - t)}\}. \tag{2.1}$$
Similarly, the probability that the site segregates is
$$P\{K_{t_{tot}} > 0\} = E\{1 - e^{-\theta t_{tot}}\}, \tag{2.2}$$
where the expectation is with respect to all possible genealogies of the sample.
[Nielsen, 2000] argues that if we assume the mutation rate is low and condition on the site segregating in the sample, then the mutation rate $\theta$ is of little interest and so it can be treated as a nuisance parameter. Following [Nielsen, 2000] we can eliminate $\theta$ from the analysis by taking the limit $\theta \to 0$. Under the infinitely-many-sites model [Kimura, 1969], the event of at least one mutation is equivalent to the event of exactly one mutation. Therefore, $\{K_t = 0\}$ and $\{K_t = 1\}$ are complementary events. [Alternatively, without explicitly making the infinitely-many-sites assumption, we can ignore the probability of the event $\{K_t > 1\}$ if the mutation rate $\theta$ is very low.] Together with the low mutation limit $\theta \to 0$, this implies
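$$P\{K_t = 1 \mid K_{t_{tot}} = 1\} = \frac{P\{K_t > 0, K_{t_{tot}} > 0\}}{P\{K_{t_{tot}} > 0\}} = \frac{P\{K_t > 0, K_{t_{tot}-t} = 0\}}{P\{K_{t_{tot}} > 0\}} \tag{2.3a}$$
$$= \lim_{\theta \to 0} \frac{\theta^{-1} E\{e^{-\theta (t_{tot} - t)} - e^{-\theta t_{tot}}\}}{\theta^{-1} E\{1 - e^{-\theta t_{tot}}\}} \tag{2.3b}$$
[Interchange limit and expectation [valid if $E\{t_{tot}\} < \infty$] and use the Taylor approximation $e^{-x} \sim 1 - x$.]
$$= \frac{E\{t_{tot} - (t_{tot} - t)\}}{E\{t_{tot}\}} = \frac{E\{t\}}{E\{t_{tot}\}} \equiv \frac{T}{T_{tot}}, \tag{2.3c}$$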
where for convenience we denote the expectation of coalescence time $t$ by $T$.
Therefore, for biallelic markers and under the conditions specified above, there is a relationship between expected coalescence times and the probability that a particular branch in the genealogy carries the derived allele. We will use it to derive the first two moments of the genotype vector $Z = (Z_1, \dots, Z_n)$.
Proposition 2.1 Suppose that a sample of size $n$ is collected from a population that evolves according to the neutral infinitely-many-sites model, where mutations are generated by a Poisson process with low mutation rate. At segregating sites where exactly one mutation occurs in the sample, the allele carried by individual $i$, denoted by $Z_i$, is a binary random variable such that
$$E^*\{Z_i\} = \frac{T_{mrca}}{T_{tot}}. \tag{2.4}$$
Furthermore, for two distinct individuals $i$ and $j$,
$$E^*\{Z_i Z_j\} = \frac{T_{mrca} - T_{ij}}{T_{tot}}. \tag{2.5}$$
Here the symbol $*$ indicates that the expectation is with respect to all possible sample genealogies with exactly one mutation.
Proof. In a genealogical tree with exactly one mutation, the $i$th lineage carries the derived allele if the mutation occurs anywhere on the path from the $i$th external branch to the most recent common ancestor of the entire sample. This path has length $t_{mrca}$ for all $i$; its average length is $T_{mrca}$. Therefore, the conditional probability of observing the derived allele is the same for every individual and
$$E^*\{Z_i\} = P\{K_{t_{mrca}} = 1 \mid K_{t_{tot}} = 1\} = \frac{T_{mrca}}{T_{tot}}. \tag{2.6}$$
That is, the genotypes at a biallelic marker are Bernoulli random variables with frequency $T_{mrca}/T_{tot}$. Furthermore, since the genotypes are binary, the event $\{Z_i = 1, Z_j = 1\} \Leftrightarrow \{Z_i Z_j = 1\}$ implies that the mutation occurs on the branch from the pair's most recent common ancestor to the most recent common ancestor of the sample. This ancestral branch has length $t_{mrca} - t_{ij}$. Therefore, the conditional expectation that two individuals $i$ and $j$ carry a common mutation at a biallelic marker is given by
$$E\{Z_i Z_j \mid K_{t_{tot}} = 1\} = P\{K_{t_{mrca} - t_{ij}} = 1 \mid K_{t_{tot}} = 1\} = \frac{E\{t_{mrca}\} - E\{t_{ij}\}}{E\{t_{tot}\}}. \tag{2.7}$$
□
Thus the individual genotypes have the same marginal distribution: the $Z_i$s are identically distributed but not independent. Finally, in equations (2.4) and (2.5) the expected coalescence times $T_{mrca}$, $T_{tot}$, $T_{ij}$ are marginal expectations with respect to all possible histories [genealogies] of the sample, not only histories that can induce the observed pattern of 0s and 1s.
The principle behind equation (2.5) states that the more history two individuals share, the more genetically similar they are. Here we should interpret "shared history" precisely as "common ancestral branch" in the genealogy rather than broadly as a "demographic past" in the sense of evolutionary history. Different models can produce the same expected genealogy. For example, a long branch separating two samples could correspond to a split into distinct subpopulations some time in the past or constant migration between two locations at a low rate. Conversely, without further assumptions, patterns of similarities and differentiation observed in genetic data reveal information about the underlying genealogies, and hence, indirectly, about the demographic model that generated them. In this thesis we average observed genetic similarities across markers; thus we ignore information (e.g., the variance) that could in principle improve the ability to distinguish demographic models.
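As a worked instance of equations (2.4) and (2.5), the sketch below evaluates the two moments in closed form. It assumes a single randomly mating population under the standard coalescent with pairwise coalescence rate 1, which is a simplification not made by the proposition itself; the function names are hypothetical.

```python
from fractions import Fraction

def expected_times(n):
    """Expected T_mrca, T_tot and T_ij for n lineages (assumed standard coalescent)."""
    # While k lineages remain, the expected waiting time is 2 / (k (k - 1)).
    t_mrca = sum(Fraction(2, k * (k - 1)) for k in range(2, n + 1))
    # Each of the k lineages contributes that waiting time to the total tree size.
    t_tot = sum(Fraction(2, k - 1) for k in range(2, n + 1))
    t_pair = Fraction(1)  # a pair coalesces at rate 1
    return t_mrca, t_tot, t_pair

def genotype_moments(n):
    """Return (E*{Z_i}, E*{Z_i Z_j}) from equations (2.4) and (2.5)."""
    t_mrca, t_tot, t_pair = expected_times(n)
    return t_mrca / t_tot, (t_mrca - t_pair) / t_tot

for n in (2, 6, 20):
    first, second = genotype_moments(n)
    print(n, first, second, float(first), float(second))
```

For a sample of size 2 the second moment is zero: with exactly one mutation, the two individuals cannot both carry the derived allele.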
2.1.1 Bias due to SNP ascertainment
Ascertainment bias refers to systematic deviations in the SNP discovery process where a small number of individuals are used to find sites polymorphic in the entire population [Clark et al., 2005]. In particular, rare SNPs are harder to ascertain and more likely to be underrepresented. Furthermore, the genetic variation in a geographic region could be misrepresented in a panel with unbalanced sample configuration. [McVean, 2009] observes that two samples are effectively involved in ascertainment: first a panel to discover SNPs for genotyping on a microchip and then a sample to genotype. We condition on sites that segregate in both samples and this can distort (the average shape of) the observed genealogies and thus produce misleading results. In this thesis we ignore SNP ascertainment as a potential source of sample structure.
2.2 Isolation by distance in a spatially distributed population
Geographic separation can act as a genetic barrier because in a natural population migration tends to be local rather than long-range. If long-distance migration events are rare, a mutation that arises in one area might take a long time to spread throughout the habitat (if at all). Consequently, individuals that are closer together tend to be more similar genetically than those that are far apart. This phenomenon is known as isolation by distance. However, the relationship between geography and genetic similarity also depends on dispersal. If the habitat is homogeneous and migration is characterized by the same dispersal density everywhere, genetic similarity decreases as a function of relative distance.
The effect of subdivision on population structure is often quantified in terms of a statistic called $F_{ST}$ that measures the genetic variation among subpopulations relative to the total genetic variation. Several definitions of $F_{ST}$ have been proposed [Wright, 1943, Cockerham, 1969, Nei, 1973]. [[Wright, 1943] introduces $F_{ST}$ as the statistic $\mathrm{var}\{p\}/[\bar{p}(1 - \bar{p})]$ where $\mathrm{var}\{p\}$ is the variance in allele frequency among subpopulations and $\bar{p}$ is the overall mean allele frequency in the population. Intuitively, $F_{ST}$ is high when individuals are similar within subpopulations and different between subpopulations.] We use Nei's definition where $F_{ST}$ is a function of the probabilities of identity within and between subpopulations. [Two lineages are identical, at a given locus, if they carry the same allele.] The $F$-statistic is defined as
$$F_{ST} = \frac{f_0 - f}{1 - f}, \tag{2.8}$$
where $f$ is the probability of identity for two individuals chosen at random without reference to geography, and $f_0$ is the probability of identity for two individuals chosen at random from the same subpopulation.
As a measure of genetic differentiation, the $F$-statistic is related to coalescence times because identity means neither lineage accumulates a mutation in the time to the most recent common ancestor. If the mutation process is Poisson with low mutation rate $\theta$,
$$f(\theta) = E\{e^{-\theta t}\} \approx 1 - \theta E\{t\}. \tag{2.9}$$
In this case, by substituting $f_0 \approx 1 - \theta T_0$ and $f \approx 1 - \theta T$ in equation (2.8), so that the factor $\theta$ cancels between numerator and denominator, [Slatkin, 1991] derives the approximation
$$F_{ST} \approx \frac{T - T_0}{T}, \tag{2.10}$$
where $T_0$, $T$ are the expected coalescence times for a pair of distinct lineages sampled at random from the same subpopulation and from the entire population, respectively. The coalescent-based approximation to the $F$-statistic is very general: [Slatkin, 1991] derives it in the low mutation limit but otherwise makes no assumptions about the demographic model. Thus, the approximation holds under a subdivided population at equilibrium, a growing population, or a population that has undergone a split some time in the past.
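The low-mutation approximation (2.10) can be checked numerically. The sketch below assumes, purely for illustration, that coalescence times are exponentially distributed, so that $E\{e^{-\theta t}\} = 1/(1 + \theta E\{t\})$; this distributional choice and the function names are assumptions, not part of Slatkin's argument.

```python
def identity_prob(theta, mean_time):
    """f(theta) = E{exp(-theta t)} for t ~ Exponential(mean_time) (illustrative assumption)."""
    return 1.0 / (1.0 + theta * mean_time)

def fst_exact(theta, T0, T):
    """Equation (2.8) evaluated with the identity probabilities above."""
    f0, f = identity_prob(theta, T0), identity_prob(theta, T)
    return (f0 - f) / (1.0 - f)

def fst_low_mutation(T0, T):
    """Equation (2.10): the low-mutation limit in terms of expected coalescence times."""
    return (T - T0) / T

T0, T = 1.0, 3.0  # hypothetical within-deme and overall expected coalescence times
for theta in (1e-1, 1e-3, 1e-6):
    print(theta, fst_exact(theta, T0, T), fst_low_mutation(T0, T))
```

As $\theta$ decreases, the value from (2.8) approaches the limit in (2.10) from below under this assumed distribution.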
By analogy, [Rousset, 1997] considers the $F$-statistic for two demes separated by distance $x$,
$$F_{ST}(x) = \frac{f_0 - f_x}{1 - f_x} \approx \frac{T_x - T_0}{T_x}, \tag{2.11}$$
as well as the linearized $F$-statistic given by
$$\frac{F_{ST}(x)}{1 - F_{ST}(x)} \approx \frac{T_x - T_0}{T_0}. \tag{2.12}$$
[Rousset, 1997] analyzes the relationship between genetic differentiation, $F_{ST}$, and geographic distance, $x$, in a spatially-homogeneous stepping-stone model where demes are equally sized and regularly spaced (on a ring in one dimension and a torus in two dimensions), and migration is determined by a symmetric dispersal kernel. The important demographic parameters are the effective population density $D$ per length/area unit and the mean squared dispersal distance $\sigma^2$, which determines the speed at which two lineages with a common ancestor move away from each other in a generation. By symmetry, the probability of identity for two randomly sampled individuals is also a function of the relative distance $x$. [Rousset, 1997, 2004] derives the following large-distance approximations to the linearized $F_{ST}$,
$$\frac{F_{ST}(x)}{1 - F_{ST}(x)} \approx \frac{x}{4 D \sigma^2} + C_1; \quad \text{(in one dimension)} \tag{2.13a}$$
$$\frac{F_{ST}(x)}{1 - F_{ST}(x)} \approx \frac{\ln(x/\sigma)}{4 \pi D \sigma^2} + C_2; \quad \text{(in two dimensions)} \tag{2.13b}$$
where the constants $C_1$ and $C_2$ depend on the population density and the dispersal distribution but not on the population sizes or the mutation rate.
Therefore, if migration is uniform, the linearized $F_{ST}$ increases with geographic distance. This relationship is appropriate only for homogeneous habitats as it ignores the effect of barriers (or corridors) to migration: two demes separated by a barrier would appear to be more genetically dissimilar than relative distance would suggest. In other words, we need a measure of effective distance to describe the patterns of movement across the habitat.
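Equations (2.13a) and (2.13b) suggest a simple regression: the slope of the linearized $F_{ST}$ against distance (one dimension) or log distance (two dimensions) is the reciprocal of $4 D \sigma^2$ or $4 \pi D \sigma^2$. The sketch below illustrates this under the assumption of noise-free synthetic data; the function name `density_dispersal_from_slope` and the numbers are hypothetical, and this is not the estimation method developed in the thesis.

```python
import numpy as np

def density_dispersal_from_slope(x, lin_fst, two_dim=False):
    """Least-squares estimate of D * sigma^2 from linearized F_ST versus distance."""
    # In two dimensions ln(x / sigma) = ln(x) - ln(sigma), so ln(sigma) is absorbed
    # into the intercept and we can regress on ln(x) directly.
    predictor = np.log(x) if two_dim else x
    slope, _intercept = np.polyfit(predictor, lin_fst, deg=1)
    factor = 4.0 * np.pi if two_dim else 4.0
    return 1.0 / (factor * slope)

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])   # hypothetical pairwise distances
true_value = 2.5                            # hypothetical D * sigma^2

lin_1d = x / (4.0 * true_value) + 0.3
lin_2d = np.log(x) / (4.0 * np.pi * true_value) + 0.1

print(density_dispersal_from_slope(x, lin_1d))
print(density_dispersal_from_slope(x, lin_2d, two_dim=True))
```

With a barrier in the habitat the points would no longer fall on one line, which is the limitation the paragraph above points out.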
2.3 The stepping-stone model of population subdivision
Section 2.1 explains that coalescence times represent population structure because genetically similar individuals are likely to have a recent common ancestor and thus shorter coalescence time. The relationship between genetic correlation and coalescence times in equation (2.5) is very general. For example, [McVean, 2009] uses as an example a model of population split in which groups derived from a common ancestor do not exchange migrants and thus develop independently after the split. In this thesis we aim to analyze the spatial structure of genetic variation, and therefore, we need to model [and apply equation (2.5) to] a spatially distributed population.
Kimura's stepping-stone model [Kimura and Weiss, 1964] represents a population across the span of its habitat as a connected grid of panmictic (randomly mating) demes (colonies) which exchange migrants in a fixed pattern. For simplicity, in this chapter we consider a haploid population. [A haploid organism has a single copy of its genome; a diploid organism has two copies, one inherited from the father and the other from the mother.] To extend the framework, a diploid individual can be represented as the sum of two independent haplotypes, one from each parent.
The stepping-stone model makes the following assumptions:
• There are $d$ demes and deme $\alpha$ consists of $N_\alpha$ randomly mating individuals. The total population number is $N_T = \sum_\alpha N_\alpha$ and the average deme size is $N_0 = N_T/d$. The demes remain constant in size and $N_\alpha \sim \mathcal{O}(N_0)$ for all $\alpha$.
• The mutation rate per site per generation is $u$ and the scaled mutation rate for two distinct lineages in $N_0$ generations is $\theta = 2 N_0 u$.
• The coalescence rate for a pair of distinct lineages drawn at random from deme $\alpha$ is $q_\alpha = N_0/N_\alpha \sim \mathcal{O}(1)$. Two lineages coalesce when they merge into a common ancestor and in a single generation this event has probability $1/N_\alpha$. [The ancestral process develops backwards in time, from the present towards the past. A coalescence event means that two individuals have the same parent and a migration event means that an individual from $\alpha$ has a parent from $\beta$.]
• The migration rate for a lineage to move from deme $\alpha$ to deme $\beta \neq \alpha$ is $m_{\alpha\beta} \sim \mathcal{O}(1)$. The migration matrix $M = (m_{\alpha\beta})$, where $m_{\alpha\alpha} = -\sum_{\beta: \beta \neq \alpha} m_{\alpha\beta}$, describes the transition process of a single lineage backwards in time.
All rate parameters are constant in time and on the scale of $N_0$ generations. The assumptions $q_\alpha \sim \mathcal{O}(1)$ for every deme $\alpha$ and $m_{\alpha\beta} \sim \mathcal{O}(1)$ for every pair $(\alpha \neq \beta)$ imply that migration is weak. That is, the probability of multiple migration and/or coalescence events occurring in the same generation [before scaling by $N_0$] is $\mathcal{O}(N_0^{-2})$ and can be ignored.
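The migration matrix described in the assumptions above can be assembled from a list of edges. This is a minimal sketch, assuming symmetric rates on undirected edges (a simplification adopted here for brevity); the function name `migration_matrix` and the three-deme example are hypothetical.

```python
import numpy as np

def migration_matrix(d, edges):
    """Build M = (m_ab) for d demes from undirected edges {(a, b): rate}."""
    M = np.zeros((d, d))
    for (a, b), rate in edges.items():
        M[a, b] = M[b, a] = rate  # symmetric rates: an assumption of this sketch
    # Diagonal entries m_aa = -sum_{b != a} m_ab, so every row sums to zero.
    np.fill_diagonal(M, -M.sum(axis=1))
    return M

# Three demes on a line; the second edge has a lower rate (a partial barrier).
M = migration_matrix(3, {(0, 1): 1.0, (1, 2): 0.5})
print(M)
print(M.sum(axis=1))
```

Lowering the rate on one edge is how a heterogeneous habitat enters the model: the graph topology stays the same and only the edge weights change.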
The stepping-stone model describes how a spatially distributed population evolves under equilibrium in time, i.e., under the condition that both migration and coalescence rates are the same in every generation. Therefore, the model can characterize systematic differences between the groups due to gene flow but not due to splits or admixture events. In other words, the stepping-stone model can represent population structure in space but not in time. [As we show through simulations in Chapter 5, temporal structure can be explained as spatial structure, in terms of effective rates of migration.]
If demes of constant size exchange migrants at fixed rates as required under equilibrium, the number of individuals to emigrate is equal to the number of individuals to immigrate, i.e., migration is conservative [Nagylaki, 1980]. Mathematically,
$$\sum_{\beta: \beta \neq \alpha} m_{\alpha\beta}/q_\alpha = \sum_{\beta: \beta \neq \alpha} m_{\beta\alpha}/q_\beta \iff M' q^{-1} = 0, \tag{2.14}$$
where $q^{-1} = (q_\alpha^{-1}) = (N_\alpha/N_0)$ is the vector of inverse coalescence rates, i.e., relative deme sizes.
In a general stepping-stone model, migration is not necessarily symmetric. However, in this thesis we assume that $m_{\alpha\beta} = m_{\beta\alpha}$ for all edges $(\alpha, \beta)$. The condition that migration is both symmetric and conservative implies that all demes have the same size: on one hand, $M q^{-1} = M' q^{-1} = 0$, and on the other, $M 1 = 0$ as $M$ is a Laplacian matrix; since the population graph is connected, the null space of $M$ is spanned by $1$ and hence $q \propto 1$. Thus the average deme size $N_0 = N_T/d$ is a convenient choice for the coalescent timescale.
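The argument that symmetric, conservative migration forces equal deme sizes can be checked on a small example. The sketch below assumes a hypothetical three-deme path graph with symmetric rates; `is_conservative` is an illustrative helper, not a function from the thesis.

```python
import numpy as np

def is_conservative(M, q, tol=1e-10):
    """Check the matrix form of equation (2.14): M' q^{-1} = 0."""
    return bool(np.allclose(M.T @ (1.0 / q), 0.0, atol=tol))

# Symmetric migration matrix for three demes on a line (rows sum to zero).
M = np.array([[-1.0, 1.0, 0.0],
              [1.0, -1.5, 0.5],
              [0.0, 0.5, -0.5]])

print(is_conservative(M, np.ones(3)))                 # equal deme sizes
print(is_conservative(M, np.array([1.0, 2.0, 1.0])))  # unequal deme sizes

# The null space of M is one-dimensional for a connected graph; the right
# singular vector for the smallest singular value spans it.
_, singular_values, vt = np.linalg.svd(M)
null_vector = vt[-1] / vt[-1][0]
print(singular_values, null_vector)
```

The rescaled null vector has all entries equal, which is the numerical counterpart of the conclusion that $q$ is proportional to the vector of ones.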
The stepping-stone model characterizes dispersal not in terms of an explicit dispersal density but indirectly through the combined effect of the graph topology and the migration rates. It may not seem natural to represent the geographic distribution of organisms with a graph. However, discrete models for migration are common in population genetics. In fact, a continuous model of isolation by distance (with normal dispersal and continuous spatial distribution) can lead to inconsistencies [Felsenstein, 1975].
2.3.1 Expected coalescence times in a subdivided population
In Section 2.1 we described how the probability that two individuals both carry the derived allele is related to the expected coalescence time to their most recent common ancestor. We will use this connection between genetic similarity and shared ancestry to analyze the spatial structure in genetic variation, and in particular, to estimate migration rates. The inference procedure requires that we express pairwise coalescence times as functions of migration rates.
The coalescent process can be extended to represent the ancestry of a sample from the stepping-stone model [Notohara, 1990, 1993]. This version, called the structured coalescent, describes the movement of lineages between demes as well as their coalescence into common ancestors. We can use the properties of the structured coalescent [as we do in Appendix 7.1] to derive the following system of linear equations for the pairwise expected coalescence times $T = (T_{\alpha\beta})$ as a function of the coalescence rates $q = (q_\alpha)$ and the migration rates $M = (m_{\alpha\beta})$:
$$\mathrm{diag}\{q\}\, \mathrm{diag}\{T\} - M T - T M' = 1 1'. \tag{2.15}$$
Furthermore, if the migration rates are symmetric, as we assume throughout, then there is no variation in coalescence rates across demes, i.e., $q = 1$, $M = M'$ and
$$\mathrm{diag}\{T\} - M T - T M = 1 1'. \tag{2.16}$$
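Equation (2.15) is linear in the entries of $T$, so it can be solved directly for a small graph. The following sketch vectorizes the system with Kronecker products and calls a dense solver; this is one straightforward approach under the assumption of a connected graph, not necessarily the computation used in the thesis, and `expected_coalescence_times` is a hypothetical name.

```python
import numpy as np

def expected_coalescence_times(M, q):
    """Solve diag{q} diag{T} - M T - T M' = 1 1' for the d x d matrix T."""
    d = M.shape[0]
    I = np.eye(d)
    # With row-major vec: vec(M T) = kron(M, I) vec(T), vec(T M') = kron(I, M) vec(T).
    A = -np.kron(M, I) - np.kron(I, M)
    # diag{q} diag{T} only touches the diagonal entries T_aa, at positions a*d + a.
    diag_idx = np.arange(d) * d + np.arange(d)
    A[diag_idx, diag_idx] += q
    t = np.linalg.solve(A, np.ones(d * d))
    return t.reshape(d, d)

# Two demes of equal size exchanging migrants at rate m = 1.
M2 = np.array([[-1.0, 1.0],
               [1.0, -1.0]])
T2 = expected_coalescence_times(M2, np.ones(2))
print(T2)
```

For this two-deme example the system can also be solved by hand, giving a within-deme time of 2 and a between-deme time of $2 + 1/(2m)$, which the numerical solution reproduces. The dense solve scales poorly with the number of demes, since the linear system has $d^2$ unknowns.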
In equation (2.15) $T_{\alpha\beta}$ is the expected coalescence time between two randomly chosen lineages, one from $\alpha$ and the other from $\beta$. In equation (2.5) $T_{ij}$ is the expected coalescence time between two sampled individuals $i \in \alpha_i$ and $j \in \alpha_j$. Crucially, the pairwise coalescence times do not depend on the sample configuration, $\alpha = (\alpha_1, \dots, \alpha_n)$, because individuals are exchangeable within each deme. [Individuals are exchangeable within demes but not across demes because the sample location is informative about the alleles an individual carries.] Therefore, the expected coalescence time for an observed pair $(i \in \alpha_i, j \in \alpha_j)$ is the same as the expected coalescence time for any pair $(i' \in \alpha_i, j' \in \alpha_j)$ from the subdivided population:
$$T_{ij} = T_{\alpha_i \alpha_j}. \tag{2.17}$$
Notation: We use Greek letters [$\alpha$, $\beta$] to denote subpopulations and Latin letters [$i$, $j$] to denote sampled individuals. We distinguish between the population matrix $T = (T_{\alpha\beta} : \text{demes } \alpha, \beta)$ and the sample matrix $\Delta = (T_{ij} : \text{individuals } i, j)$, where $\Delta = T(\alpha) - \operatorname{diag}\{T(\alpha)\}$. The diagonal is subtracted because coalescence time with self is always 0.
In any population graph, $T_{\alpha\beta} > T_{\alpha\alpha}$ because coalescence is possible only for lineages in the same deme. However, if $\alpha$ and $\beta$ are separated by a barrier, fewer migrants move between $\alpha$ and $\beta$, and so the pairwise coalescence time $T_{\alpha\beta}$ would be larger than the time expected under isolation by distance, i.e., uniform migration. Thus, the matrix of pairwise coalescence times $T = (T_{\alpha\beta})$ would contain evidence for habitat heterogeneity.
Since longer coalescence times mean less genetic similarity, coalescence times are a natural measure of genetic dissimilarity and hence population structure. For the stepping-stone model we can compute the matrix of expected coalescence times, $T$, given the migration rates $M$ and the coalescence rates $q$ using equation (2.15). Alternatively, there exists a computationally efficient method to approximate $T$, which we discuss next.
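The linear system for symmetric migration, equation (2.16), can be solved directly on a small graph. The following is a minimal sketch rather than the implementation used in this thesis. It assumes that the matrix $M$ in (2.16) is the generator of the migration process (off-diagonal entries $m_{\alpha\beta}$, rows summing to zero); the function name `expected_coalescence_times` and the dense Kronecker formulation are illustrative choices.

```python
import numpy as np

def expected_coalescence_times(rates):
    """Solve diag{T} - M T - T M = 11' (eq. 2.16) for the d x d matrix T.

    `rates` is a symmetric d x d array of nonnegative migration rates with a
    zero diagonal. Assumption: M is the generator of the migration process,
    i.e. M = rates - diag(row sums), so that every row of M sums to zero.
    """
    rates = np.asarray(rates, dtype=float)
    d = rates.shape[0]
    M = rates - np.diag(rates.sum(axis=1))
    I = np.eye(d)
    # With row-major flattening, vec(M T) = kron(M, I) vec(T) and
    # vec(T M) = kron(I, M') vec(T).
    A = -(np.kron(M, I) + np.kron(I, M.T))
    # diag{T} only involves the entries T_aa, stored at flat index a*d + a.
    idx = np.arange(d) * (d + 1)
    A[idx, idx] += 1.0
    return np.linalg.solve(A, np.ones(d * d)).reshape(d, d)
```

For two demes with migration rate 0.5 this gives within-deme times of 2 and a between-deme time of 3. The sketch treats all $d^2$ entries as unknowns for simplicity; exploiting the symmetry of $T$ and the sparsity of the graph would reduce the cost.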
2.4 Isolation by resistance is a metric for gene flow
Isolation by resistance (IBR) [McRae, 2006, McRae et al., 2008] draws an analogy between a subdivided population in which neighboring demes exchange migrants and an electrical network in which current flows through conductors. [Or in other words, between Kimura's stepping-stone model and an undirected random walk.] To understand the analogy better, concepts in electrical networks can be given a population genetic interpretation [Table 2.1]. Using this correspondence between population genetics and circuit theory, McRae develops IBR to test whether putative barriers to gene flow affect genetic differentiation.
Isolation by resistance predicts effective distances from a raster grid of landscape resistance (or friction): each cell in the grid specifies how difficult it is for an animal to migrate locally, and these values are assigned based on expert knowledge about the species and the habitat. If the effective distances agree with the observed genetic dissimilarities, then the hypothesized grid explains the data well. Such a raster map could be hard to produce, especially at fine scales, and if the agreement is low, there is no method to facilitate improving the map of resistances. However, IBR does provide a useful and efficient approximation to expected coalescence times.

Table 2.1: Circuit theory concepts and their ecological interpretation, adapted from [McRae et al., 2008]. McRae specifies the edge conductances as $c_{\alpha\beta} = m_{\alpha\beta}/q_\alpha$. However, it is natural to define conductances only in terms of the migration process because lineages cannot coalesce until they meet.

Electrical term | Ecological interpretation
conductance, $c_{xy} : \forall (x, y) \in E$ | direct migration $m_{\alpha\beta}$: the number of migrants exchanged between two neighboring demes $\alpha$ and $\beta$ in a single generation. (On the coalescent timescale $m_{\alpha\beta} = N_0 \tilde{m}_{\alpha\beta}$ where $\tilde{m}_{\alpha\beta}$ is the probability that a lineage in $\alpha$ has a parent from $\beta$.)
resistance, $r_{xy} = 1/c_{xy}$ | cost $1/m_{\alpha\beta}$: measure of local landscape friction in the direction from $\alpha$ to $\beta$. If migration is symmetric, $m_{\alpha\beta} = m_{\beta\alpha}$. (Since $N_\alpha = N_0$, the $m_{\alpha\beta}$'s are comparable across the habitat.)
effective conductance, $C_{xy} : \forall (x, y) \in V \times V$ | effective migration $M_{\alpha\beta}$: the number of migrants that would produce the same level of genetic differentiation between $\alpha$ and $\beta$ if these two demes made up a two-deme system.
effective resistance, $R_{xy} = 1/C_{xy}$ | distance metric $R_{xy}$: quantifies the genetic differentiation between a pair of demes $(\alpha, \beta)$ by taking into account the existence of multiple pathways between them.
To begin with, consider a stepping-stone model that has only two demes, $\alpha$ and $\beta$, with equal size and a single edge with migration rate $m_{\alpha\beta}$. In this population, also known as a two-island model,
$$m_{\alpha\beta} = \frac{(T_{\alpha\alpha} + T_{\beta\beta})/8}{T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2}. \tag{2.18a}$$
[This follows from the system of linear equations (2.15).] The two-island model is a very special case and equation (2.18a) does not hold more generally. In fact, unless the population graph is fully connected, many pairs of demes might not exchange migrants directly and then $m_{\alpha\beta} = 0$. However, [McRae, 2006] extends the relevance of the relationship (2.18a) to the general stepping-stone model by introducing the concept of effective migration $M_{\alpha\beta}$ between $\alpha$ and $\beta$. It is given by
$$M_{\alpha\beta} \equiv \frac{(T_{\alpha\alpha} + T_{\beta\beta})/8}{T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2}. \tag{2.19}$$
That is, the effective migration $M_{\alpha\beta}$ is the number of migrants that would produce the actual genetic differentiation between $\alpha$ and $\beta$ in a hypothetical two-island system. Since two lineages take the same time to reach their common ancestor, $T_{\alpha\beta} = T_{\beta\alpha}$ and the definition (2.19) implies that effective migration is always symmetric even though the underlying true migration patterns might not be symmetric.
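As a check on equation (2.18a), the two-island case can be worked out by hand from equation (2.16). The sketch below assumes that $M$ denotes the generator of the migration process, with off-diagonal entries $m$ and diagonal entries $-m$ (an assumption about notation defined outside this section), and uses the symmetry $T_{\alpha\alpha} = T_{\beta\beta}$.

```latex
\begin{align*}
\text{entry } (\alpha,\alpha):\quad
  & (1 + 2m)\,T_{\alpha\alpha} - 2m\,T_{\alpha\beta} = 1, \\
\text{entry } (\alpha,\beta):\quad
  & 2m\,T_{\alpha\beta} - m\,(T_{\alpha\alpha} + T_{\beta\beta}) = 1
    \;\Longrightarrow\; T_{\alpha\beta} = T_{\alpha\alpha} + \frac{1}{2m}, \\
\text{hence}\quad
  & T_{\alpha\alpha} = T_{\beta\beta} = 2, \qquad
    T_{\alpha\beta} = 2 + \frac{1}{2m}, \\
  & \frac{(T_{\alpha\alpha} + T_{\beta\beta})/8}
         {T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2}
    = \frac{1/2}{1/(2m)} = m .
\end{align*}
```

Under these assumptions the within-deme time equals the number of demes and does not depend on $m$, while the excess between-deme time $1/(2m)$ carries all the information about migration.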
It is natural to relate the concept of effective migration in a subdivided population, $M_{\alpha\beta}$, and the concept of effective conductance in an electrical circuit, $C_{\alpha\beta}$. In circuit theory, $C_{\alpha\beta}$ is the conductance in a two-node, single-conductor network required to produce the same amount of current between $\alpha$ and $\beta$ as in the original network.
Proposition 2.2 Consider a population graph $(V, E)$ with symmetric migration rates $\{m_{\alpha\beta} : \forall (\alpha, \beta) \in E\}$. This corresponds to a circuit network $(V, E)$ with conductances $\{c_{\alpha\beta} = m_{\alpha\beta}\}$. For every pair $(\alpha, \beta) \in V \times V$, the effective conductance $C_{\alpha\beta}$ in the circuit is a measure of the effective migration $M_{\alpha\beta}$ in the population:
$$M_{\alpha\beta} \approx C_{\alpha\beta}. \tag{2.20}$$
And thus the resistance distance $R_{\alpha\beta} = 1/C_{\alpha\beta}$ is a measure of genetic differentiation.
Proof. The relationship between effective migration and effective conductance is exact only if migration is isotropic, i.e., demes are equivalent with respect to the size and pattern of movement. Here we assume only that migration is symmetric and conservative.
The migration process can be represented as a continuous-time discrete-space random walk on an undirected graph [Levin et al., 2008]. Then $P = (p_{\alpha\beta})$ is the transition kernel of the embedded jump chain, which determines the sequence of locations occupied by the lineage, and $m_\alpha = \sum_{\beta : \beta \neq \alpha} m_{\alpha\beta}$ are the rates of the holding distributions, which determine the waiting times before jumps. Let $m = (1/d) \sum_\alpha m_\alpha$ be the average holding rate.
Since migration is symmetric and conservative, the demes have the same size $N_0$, which is also a convenient choice for the coalescent timescale. Let $T_0$ be the average within-deme expected coalescence time. Then by Strobeck's theorem [see equation (7.8) in Appendix 7.1],
$$T_0 \equiv \sum_\alpha T_{\alpha\alpha}/d = d. \tag{2.21}$$
Thus $T_0$ does not depend on the migration process.
Furthermore, let $S_{\alpha\beta}$ be the expected time for two lineages, one from $\alpha$ and the other from $\beta$, to occupy the same deme. Then
$$(T_{\alpha\alpha} + T_{\beta\beta})/2 \approx T_0, \tag{2.22a}$$
$$T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2 \approx S_{\alpha\beta}. \tag{2.22b}$$
These two approximations are exact if migration is isotropic: since the demes are equivalent with respect to the migration process, the within-deme coalescence times $T_{\alpha\alpha}$ must be equal by symmetry. Hence, $T_{\alpha\alpha} = T_0$, $T_{\alpha\beta} = S_{\alpha\beta} + T_0$ and once the lineages meet for the first time, we can restart the random walk with two lineages in the same deme chosen at random.
Under the coalescent process, two lineages, one from $\alpha$ and another from $\beta$, move simultaneously until they coalesce into a common ancestor. Suppose that they meet for the first time in deme $\gamma$. Together the paths $\alpha \to \gamma$ and $\beta \to \gamma$ have half the length of a commute between $\alpha$ and $\beta$ that passes through $\gamma$. Therefore, the expected time to first meet, $S_{\alpha\beta}$, can be related to the expected commute length, $K_{\alpha\beta}$, in the corresponding circuit network:
$$S_{\alpha\beta} \approx K_{\alpha\beta}/(4m), \tag{2.23}$$
where $K_{\alpha\beta}$ is the expected number of jumps in a random walk that starts at $\alpha$, visits $\beta$ and returns to $\alpha$, and $1/(2m)$ is the average waiting time before either lineage jumps. The relationship is approximate because the waiting time varies across vertices.
Finally, by [Chandra et al., 1996] for an undirected graph [whether isotropic or not],
$$K_{\alpha\beta} = c_G R_{\alpha\beta} = c_G/C_{\alpha\beta}, \tag{2.24}$$
where $R_{\alpha\beta}$ is the effective resistance between nodes $\alpha$ and $\beta$, $C_{\alpha\beta}$ is the effective conductance, and $c_G$ is the total conductance of the network given by
$$c_G = \sum_\alpha \sum_{\beta : \beta \neq \alpha} c_{\alpha\beta} = \sum_\alpha m_\alpha = dm. \tag{2.25}$$
Therefore,
$$M_{\alpha\beta} = \frac{(T_{\alpha\alpha} + T_{\beta\beta})/8}{T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2} \approx \frac{T_0/4}{S_{\alpha\beta}} \approx \frac{d/4}{R_{\alpha\beta}\,(dm)/(4m)} = C_{\alpha\beta}. \tag{2.26}$$
∎
Essentially, McRae's approximation splits the between-deme coalescence time, $T_{\alpha\beta}$, into the time to first meet, $S_{\alpha\beta}$, and the average within-deme coalescence time, $T_0$. However, since the population graph is not necessarily symmetric, not every deme $\gamma$ is equally likely to be the deme where two lineages, starting from $\alpha$ and $\beta$, meet for the first time. Furthermore, the within-deme coalescence times are not necessarily equal. Therefore, the effective resistance metric reflects the migration process accurately but ignores the fact that the lineages do not necessarily coalesce on their first opportunity. On the other hand, the coalescence time metric correctly captures the effect of both processes because Kingman's coalescent models migration and coalescence by explicitly tracking both lineages until their common ancestor. Since higher rates imply faster mixing, we can conclude that the higher the migration rates, the better McRae's approximation. See Figures 2.2 and 2.3.
2.4.1 Effective resistance approximates expected coalescence time
McRae's method approximates the ancestral process of two lineages evolving simultaneously in terms of one lineage evolving at twice the rate. However, one random walk cannot represent a coalescence event where two lineages merge into their most recent common ancestor. Thus, while effective resistance, $R_{\alpha\beta}$, provides a measure for the genetic differentiation between demes, it does not capture the genetic differentiation between individuals from the same deme [$R_{\alpha\alpha} = 0$ for every deme $\alpha$]. However, it follows directly from McRae's approximation that
$$T_{\alpha\beta} \approx S_{\alpha\beta} + T_0 \approx T_0 (R_{\alpha\beta}/4 + 1), \tag{2.27}$$
or equivalently in matrix notation,
$$T \approx T_0 (R/4 + 11'). \tag{2.28}$$
The main advantage of approximating coalescence times in terms of effective resistances is computational efficiency. To compute $T$, we solve a linear system with coefficient matrix $A$ and $d(d + 1)/2$ unknowns that corresponds to eq. (2.16). In this problem $A$ is sparse (because the population graph $G$ is sparse) and positive definite, and so we can use an iterative preconditioned gradient method. There are several methods to compute $R$; we use a method that inverts the $d \times d$ matrix $M + 11'$ [Babić et al., 2002]. Since $A$ is of higher order than $M$, it is more efficient to compute $R$. Furthermore, $R$ gives a very good approximation to $T$ when migration rates are high and it is more appropriate than other distance metrics such as Euclidean distance and least-cost path. Therefore, effective resistance offers a compromise between accuracy of representation and efficiency of computation.
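The resistance-based approximation (2.28) is short to express in code. The following is a hedged sketch, not the thesis implementation: it assumes conductances equal to the migration rates (as in Proposition 2.2), uses the graph Laplacian $L$ in place of the matrix inverted in the text, and takes $T_0 = d$ from equation (2.21). The function names are hypothetical.

```python
import numpy as np

def effective_resistance(rates):
    """Effective resistances R_ab for a connected graph with conductances `rates`.

    Uses R_ab = Q_aa + Q_bb - 2 Q_ab with Q = (L + 11')^{-1}, where
    L = diag(row sums) - rates is the graph Laplacian. The rank-one term 11'
    makes the matrix invertible and cancels out of the final expression.
    """
    rates = np.asarray(rates, dtype=float)
    d = rates.shape[0]
    L = np.diag(rates.sum(axis=1)) - rates
    Q = np.linalg.inv(L + np.ones((d, d)))
    q = np.diag(Q)
    return q[:, None] + q[None, :] - 2.0 * Q

def ibr_coalescence_times(rates):
    """IBR approximation T ~ T0 (R/4 + 11') of eq. (2.28), assuming T0 = d."""
    d = np.asarray(rates).shape[0]
    return d * (effective_resistance(rates) / 4.0 + 1.0)
```

For two demes with migration rate 0.5 the effective resistance is 2 and the approximation returns 3 for the between-deme time, which coincides with the exact two-island value. On larger graphs the two generally differ, most visibly when migration rates are low.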
In this chapter we introduced two important components of our method for analyzing spatial population structure: the stepping-stone model and the effective resistance metric. In the next chapters we describe how we can estimate and visualize effective rates of migration from geographically referenced genetic data.
[Figure 2.2: nine scatterplot panels (isotropic on a circle, uniform on a grid, barrier on a grid; each at m = 0.01, 1 and 100). x-axis: Tab - (Taa+Tbb)/2; y-axis: Rab(d/4).]
Figure 2.2: On the x-axis, $T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2$ is the expected time to reach the same deme; on the y-axis, $R_{\alpha\beta}(d/4)$ is the (appropriately scaled) effective resistance. As the migration rate increases, $R_{\alpha\beta}$ becomes a better approximation of the expected time to first meet, $S_{\alpha\beta}$, even if migration is not isotropic. [Results for a 5 × 4 regular triangular grid with uniform migration rate $m$ = 0.01, 1 or 100.]
[Figure 2.3: nine scatterplot panels (isotropic on a circle, uniform on a grid, barrier on a grid; each at m = 0.01, 1 and 100). x-axis: Tab; y-axis: d(Rab/4 + 1).]
Figure 2.3: On the x-axis, $T_{\alpha\beta}$ is the expected time to coalescence; on the y-axis, $d(R_{\alpha\beta}/4 + 1)$ is the IBR approximation. The approximation to the within-deme coalescence times, $T_{\alpha\alpha}$, is always $T_0 = d$; these are the points closest to the origin at $T_0 = 20$ in a 5 × 4 grid. Although the pattern does not change as the migration rate increases, the relative error $\Delta T_{\alpha\beta}/T_{\alpha\beta}$ decreases.
3 Genetic Dissimilarities and Distance Matrices
Habitat heterogeneity can shape genetic variation by reducing or increasing gene flow. The stepping-stone model is a natural representation of a spatially distributed population and the effects of gene flow on its genetic structure. In this thesis a population is a graph $G = (V, E, m)$ consisting of vertices $V$ [randomly mating demes of equal size], edges $E$ [symmetric routes of migration between neighboring demes] and a weight function $m : E \to \mathbb{R}_+$ that specifies the rates at which migrants are exchanged.
Throughout, we will assume that the population graph $G$ is embedded in a two-dimensional habitat, with the vertex set $V$ and the edge set $E$ both fixed. In practice, this graph is not known and does not necessarily exist. For example, it might not be possible to split the population into distinct groups that satisfy the random mating assumption. Instead, we cover the habitat with a regular triangular grid in which vertices do not represent actual colonies. This simplification indicates that we should interpret the migration parameters carefully, as effective rather than actual rates of migration.
Thus the topology of the graph is determined by the shape of the habitat [and the somewhat arbitrary choice that the graph is triangular and regularly spaced] and not the sample configuration or the sample "clusteredness". And so we construct the graph differently from methods that aim to subdivide the population into clusters that are similar within and dissimilar between. However, if we make the grid $(V, E)$ sufficiently fine, we can reasonably assume that each vertex represents a randomly mating group without further structure. In this case, individuals would be similar within demes but not necessarily dissimilar between demes.
In a habitat with uniform migration, the genetic differentiation between individuals from the same species is positively correlated with the distance between their origins; in a heterogeneous habitat, landscape features such as barriers or corridors create spatial structure in genetic variation. For example, individuals separated by a barrier are less closely related, and therefore less genetically similar, than if the barrier were absent. The stepping-stone model can represent such effects because some edges in the population graph can have high migration rates and others low. In this thesis we develop a Bayesian procedure to estimate the effective migration rates in a fixed grid $(V, E)$ of equally sized demes, from geographically indexed genetic data. The function $m$ measures the relative rate at which two connected demes exchange migrants; we call $m$ a migration surface.
To analyze population structure, we will assume that all genotyped sites develop under the same evolutionary process, which determines the expected structure in genetic correlations (or equivalently, genetic distances). In contrast, many methods for association testing assume that individuals are independent while sites are correlated. (In population genetics, the systematic association between loci is called linkage disequilibrium.) The problem at hand determines which assumption is appropriate to make. [The PCA decomposition of the observed covariance matrix $ZZ'$ can be used to correct for population stratification [Price et al., 2006]: incorporate the leading eigenvectors in a regression analysis that tests for association between sites and disease.] To find associations between disease status and genetic makeup, it is reasonable to assume that the disease develops under the same mechanism in all sampled cases but not all sites contribute to the disease and not with equal effect. To analyze population structure, it is reasonable to assume that the same evolution process underlies all genotyped sites but not all sampled individuals are genetically similar to equal degree.
In this chapter, let $Z = (z_i : i = 1, \dots, n)$ be a vector of $n$ genotypes at a single polymorphic site. We will consider multiple sites in the next chapter. Also let $\alpha = (\alpha_1, \dots, \alpha_n)$ denote the sample configuration, in which $\alpha_i$ is the sampling location of the $i$th haplotype.
3.1 Mean and covariance of genotype vectors: SNPs
First we consider the simplest case: a haploid population from which we have a sample of $n$ individuals genotyped at a single nucleotide polymorphism (SNP). Following [McVean, 2009], we make the following assumptions:
A1. SNPs are identically distributed: Since all sites evolve under the same demographic model, the observed genotype $z_i$ at any SNP is a realization of the same random variable $Z_i$.
A2. SNPs segregate in the sample: Since exactly one mutation occurs in every sampled genealogy, we observe both the ancestral allele '0' and the derived allele '1' at every site.
A3. The scaled mutation rate $\theta$ is low: Since A2 and A3 together imply $\theta$ is a nuisance parameter [Nielsen, 2000], we can take the limit $\theta = 2N_0 u \to 0$ and thus ignore small differences in mutation rate across SNPs.
Under these assumptions, the probability that individuals $i$ and $j$ share the derived mutation at a randomly chosen segregating site is given by
$$\mathrm{E}_*\{Z_i Z_j\} = \frac{T_{mrca} - T_{ij}}{T_{tot}}, \tag{3.1}$$
where $T_{mrca}$ and $T_{tot}$ are the height and the size of the expected genealogy of the sample, and $T_{ij}$ is the expected time for $i$ and $j$ to coalesce in a sample of size 2 [McVean, 2009]. The symbol $*$ indicates the condition that both 0s and 1s are observed, i.e., the expectation on the left in equation (3.1) is with respect to all possible genealogies (observed or not) with exactly one mutation. The expectations on the right in equation (3.1) are unconditional. The relevance is that for Kimura's stepping-stone model there is an explicit formula for pairwise coalescence times, $T_{ij}$, and a good approximation in terms of effective resistances, $R_{ij}$.
Furthermore, since the $Z_i$'s are binary random variables and the time to coalescence with self is always 0,
$$\mathrm{E}_*\{Z_i\} = \mathrm{E}_*\{Z_i^2\} = \frac{T_{mrca}}{T_{tot}}. \tag{3.2}$$
Therefore, the expected genealogy fully specifies the first two moments of the allele count vector $Z = (Z_i)$ at a particular segregating SNP. In matrix notation,
$$\mathrm{E}_*\{Z\} = \frac{T_{mrca}}{T_{tot}}\,1 \equiv \mu 1, \tag{3.3a}$$
$$\mathrm{var}_*\{Z\} = \frac{T_{mrca}}{T_{tot}}\Big(1 - \frac{T_{mrca}}{T_{tot}}\Big) 11' - \frac{1}{T_{tot}}\,\Delta \equiv \sigma^2 (11' - \lambda\Delta). \tag{3.3b}$$
For a sample with configuration $\alpha$ from a population with model $G$, the parameters are given by
$$\mu = \frac{T_{mrca}}{T_{tot}}, \qquad \sigma^2 = \frac{T_{mrca}}{T_{tot}}\Big(1 - \frac{T_{mrca}}{T_{tot}}\Big), \qquad \lambda\sigma^2 = \frac{1}{T_{tot}}, \tag{3.4}$$
where $\Delta = (T_{ij})$ is the matrix of expected pairwise coalescence times between sampled individuals. That is, $T_{ij}$ is the expected time to coalescence between $i \in \alpha_i$ and $j \in \alpha_j$ in a sample of size 2, regardless of the composition of the entire sample $\alpha$. Since $T_{ij}$ does not depend on the sample configuration or even the sample size $n$, it is completely determined by the population model $G$. However,
• The expected height and size of the sample genealogy, $T_{mrca}$ and $T_{tot}$, depend on both the population model $G$ and the sample configuration $\alpha$. In particular, they are strongly influenced by uneven sampling. Therefore, $T_{mrca}/T_{tot}$ and $1/T_{tot}$ are nuisance parameters because it would be very hard to decouple the effects of population structure from the effects of uneven sampling. The confounding of population and sample-specific information also makes it difficult to interpret PCA projections in terms of a (historic) demographic process [Novembre et al., 2008, McVean, 2009].
• The matrix $\Delta = (T_{i,j} : \text{individuals } i, j)$ describes the expected genetic differentiation in the sample and has a block structure which depends on how many individuals, if any, we observe from each deme. On the other hand, $T = (T_{\alpha\beta} : \text{demes } \alpha, \beta)$ specifies how genetic variation increases with geographic distance for all pairs of demes, whether they are sampled from or not. Thus $T$ is a dissimilarity matrix that characterizes the entire population. Although $\Delta$ is a function of the sample configuration, it depends on $\alpha$ in a straightforward way:
$$\Delta = JTJ' - \operatorname{diag}\{JTJ'\}, \tag{3.5}$$
where $J \equiv J(\alpha) = (J_{i\alpha}) \in \mathbb{Z}^{n \times d}$ is an indicator matrix such that $J_{i\alpha} = 1$ if $i \in \alpha$ and 0 otherwise. And we remove the diagonal because the coalescence time with self is always 0.
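The mapping from the population matrix to the sample matrix in equation (3.5) is a small amount of indexing. The sketch below is illustrative only; the function name and the encoding of the sample configuration as an array of zero-based deme indices are assumptions made here.

```python
import numpy as np

def sample_distance_matrix(T, config):
    """Return J T J' - diag{J T J'} as in eq. (3.5).

    T      : d x d matrix of expected coalescence times between demes.
    config : length-n sequence; config[i] is the (zero-based) deme of individual i.
    """
    T = np.asarray(T, dtype=float)
    config = np.asarray(config)
    n, d = config.size, T.shape[0]
    # Indicator matrix J: row i has a single 1 in the column of i's deme.
    J = np.zeros((n, d))
    J[np.arange(n), config] = 1.0
    JTJ = J @ T @ J.T
    # Coalescence time with self is 0, so the diagonal is removed.
    return JTJ - np.diag(np.diag(JTJ))
```

Two individuals from the same deme keep the within-deme time as their off-diagonal entry, which produces the block structure described above.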
The demographic model $G$, which describes the population, determines the coalescent process and hence the expected pairwise coalescence times $T_{\alpha\beta}$ for all deme pairs $(\alpha, \beta)$. On the other hand, both the model $G$ and the configuration $\alpha$ determine the genealogical statistics $T_{mrca}$ and $T_{tot}$, which are generally not of interest, as the goal is to estimate population-level features of $G$, such as the migration rates between pairs of connected demes, while accounting for the sample-specific features of $\alpha$. In this thesis $G = (V, E, m)$ is always a population graph with equally sized demes $V$, undirected edges $E$ and effective migration rates $m : E \to \mathbb{R}_+$.
We have shown that the expected mean and variance of a genotype vector are computable functions of the effective migration rates $m$. Next we derive similar expressions for the mean and the variance as functions of expected coalescence times in the case of diploid SNPs and microsatellites.
3.1.1 The case of diploid data
Since a diploid individual is the offspring of a pair of diploid parents, we can represent the genotype of a diploid as the sum of two haploids, each drawn randomly from the same location, i.e., $Z_i = Z_i^{(1)} + Z_i^{(2)} \in \{0, 1, 2\}$ where the superscript indicates one of two haplotypes. However, since we do not distinguish between the haplotype inherited from the mother and the haplotype inherited from the father, this assumption is reasonable only for autosomal SNPs (and not for sex-linked ones) in outbred individuals.
A sample $Z_1, \dots, Z_n$ of $n$ diploid individuals is polymorphic if
$$\{Z_1, \dots, Z_n : \text{at least one } Z_i \geq 1\} \iff \{Z_1^{(1)}, Z_1^{(2)}, \dots, Z_n^{(1)}, Z_n^{(2)} : Z_i^{(1)} = 1 \text{ or } Z_i^{(2)} = 1\}. \tag{3.6}$$
That is, a segregating SNP in a diploid sample of size $n$ is equivalent to exactly one mutation in a haploid sample of size $2n$. [This excludes the possibility that all individuals carry the same allele, either ancestral or derived.]
Furthermore, at a segregating site in a diploid sample, the copies $Z_i^{(1)}$ and $Z_i^{(2)}$, which constitute $Z_i$, are not independent: the event that one carries the mutation but not the other is informative for the time to their most recent common ancestor. Therefore,
$$\mathrm{E}_*\{Z_i\} = \mathrm{E}_*\{Z_i^{(1)}\} + \mathrm{E}_*\{Z_i^{(2)}\} = 2\,\mathrm{E}_*\{Z_i^{(1)}\} = 2\mu, \tag{3.7a}$$
$$\mathrm{var}_*\{Z_i\} = 2\,\mathrm{var}_*\{Z_i^{(1)}\} + 2\,\mathrm{cov}_*\{Z_i^{(1)}, Z_i^{(2)}\} = 4\sigma^2 - 2\lambda\sigma^2 T_{ii}, \tag{3.7b}$$
$$\mathrm{cov}_*\{Z_i, Z_j\} = 4\,\mathrm{cov}_*\{Z_i^{(1)}, Z_j^{(1)}\} = 4\sigma^2 - 4\lambda\sigma^2 T_{ij}, \tag{3.7c}$$
where the symbol $*$ indicates the condition that there is exactly one mutation in a sample of $2n$ haplotypes [and $T_{ii}$ is the expected coalescence time for two distinct lineages with the same origin as individual $i$]. In matrix notation,
$$\mathrm{E}_*\{Z\} = 2\mu 1, \qquad \mathrm{var}_*\{Z\} = 4\sigma^2 (11' - \lambda\Delta_2), \tag{3.8}$$
where
$$\Delta_2 = JTJ' - \tfrac{1}{2}\operatorname{diag}\{JTJ'\}. \tag{3.9}$$
The subscript 2 indicates that the matrix of pairwise coalescence times corresponds to a diploid population. Here the mean does not depend on the location. (This is the case for haploid data as well.) However, the variance $\mathrm{var}_*\{Z_i\}$ can vary with location unless the demographic model implies $T_{\alpha\alpha} = T_0$ for all demes $\alpha$, i.e., isotropic migration.
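The diploid moments in equations (3.8) and (3.9) can be assembled in a few lines. This is a sketch under stated assumptions: the scalars `mu`, `sigma2` and `lam` stand for the mean, variance and scale parameters and are supplied by the caller, `config` holds zero-based deme indices, and the function names are hypothetical.

```python
import numpy as np

def diploid_distance_matrix(T, config):
    """Return J T J' - (1/2) diag{J T J'} as in eq. (3.9)."""
    T = np.asarray(T, dtype=float)
    config = np.asarray(config)
    # Indexing rows and columns by deme is equivalent to forming J T J'.
    JTJ = T[np.ix_(config, config)]
    return JTJ - 0.5 * np.diag(np.diag(JTJ))

def diploid_moments(T, config, mu, sigma2, lam):
    """Mean vector 2*mu*1 and covariance 4*sigma2*(11' - lam*Delta_2), eq. (3.8)."""
    n = len(config)
    mean = 2.0 * mu * np.ones(n)
    cov = 4.0 * sigma2 * (np.ones((n, n)) - lam * diploid_distance_matrix(T, config))
    return mean, cov
```

Only half of the diagonal is removed because the two haplotypes within a diploid individual are distinct lineages from the same deme, so their within-deme coalescence time still contributes to the variance.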
3.2 Mean and covariance of genotype vectors: microsatellites
Microsatellites (also called short tandem repeats) are repeating sequences of a particular short DNA segment. Mutation can increase or decrease the number of repeats $r$, and each $r$ corresponds to an allele.
To model microsatellites, we assume that a locus $l$ evolves from its ancestral allele $A_l$ according to a symmetric stepwise mechanism where mutations occur with rate $\nu_l$ and each mutation increases or decreases the number of repeats by exactly one, with equal probability. Here we consider the evolution at a particular site, and for simplicity of notation, we omit the subscript $l$ in the rest of this section.
The ancestral allele $A$ and the mutation rate $\nu$ are unknown site-specific parameters while the genealogy $\mathcal{T}$ has a distribution determined by Kingman's coalescent. As we did for SNPs, we assume that the microsatellites are neutral and hence their genealogies are identically distributed. On the other hand, microsatellites are usually highly variable markers (i.e., with high mutation rates), so we cannot take the low-mutation limit.
Conditional on the mutation rate $\nu$ and the genealogical tree $\mathcal{T}$ of the sample, mutations occur independently and the number of mutations on a branch with length $t$ is a Poisson random variable with mean $\nu t$. This follows from the assumption that mutations are generated by a Poisson process with intensity [mutation rate] $\nu$. For example, the total number of mutations is
$$K_{tot} \mid \nu, \mathcal{T} \sim \mathrm{Po}(\nu t_{tot}), \tag{3.10}$$
while the number of mutations carried by individual $i$ is
$$K_i \mid \nu, \mathcal{T} \sim \mathrm{Po}(\nu t_{mrca}). \tag{3.11}$$
All lineages share the same Poisson mean parameter because every branch from a lineage to the most recent common ancestor of the entire sample has length $t_{mrca}$.
Let $\mathcal{K}$ denote the set of all mutations that occur in the genealogy, with $|\mathcal{K}| = K_{tot}$. Also, let $\mathcal{K}_i \subseteq \mathcal{K}$ denote the set of mutations carried by individual $i$, with $|\mathcal{K}_i| = K_i$. Since each mutation is equally likely to decrease or increase the allele length by 1, the $i$th allele is
$$Z_i = A + \sum_{k \in \mathcal{K}_i} X_k, \tag{3.12}$$
where $X_k = \pm 1$ with probability 1/2 and thus $\mathrm{E}\{X_k\} = 0$ and $\mathrm{var}\{X_k\} = \mathrm{E}\{X_k^2\} = 1$.
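First we derive the mean and variance of allele $Z_i$ given the mutation rate, the ancestral allele and the genealogy. The binary variables, $X_k$, are independent of the sample history, so $\mathrm{E}\{X_k \mid \nu, A, \mathcal{T}\} = \mathrm{E}\{X_k\}$ and $\mathrm{var}\{X_k \mid \nu, A, \mathcal{T}\} = \mathrm{var}\{X_k\}$. And furthermore, conditional on the number of mutations, the $X_k$'s are mutually independent. Therefore,
$$\mathrm{E}\{Z_i \mid \nu, A, \mathcal{T}\} = A + \mathrm{E}\Big\{\mathrm{E}\Big\{\sum_{k \in \mathcal{K}_i} X_k \,\Big|\, K_i\Big\}\Big\} = A + \mathrm{E}\Big\{\sum_{k=1}^{K_i} \mathrm{E}\{X_k\}\Big\} = A, \tag{3.13a}$$
$$\mathrm{var}\{Z_i \mid \nu, A, \mathcal{T}\} = \mathrm{E}\Big\{\sum_{k=1}^{K_i} \mathrm{E}\{X_k^2\}\Big\} + \mathrm{E}\Big\{\sum_{k \neq k'} \mathrm{E}\{X_k X_{k'}\}\Big\} = \mathrm{E}\{K_i\} = \nu t_{mrca}, \tag{3.13b}$$
because $K_i$ is a Poisson random variable with mean $\nu t_{mrca}$ by equation (3.11). [Since the mutations are independent, $\mathrm{E}\{X_k X_{k'}\} = \mathrm{E}\{X_k\}\mathrm{E}\{X_{k'}\} = 0$ for $k \neq k'$.]
Let $\mathcal{K}_{i \triangle j}$ be the set of mutations that occur in one lineage but not the other, with $|\mathcal{K}_{i \triangle j}| = K_{i \triangle j}$. Such mutations occur on the branch from $i$ to $mrca(i, j)$ or on the branch from $j$ to $mrca(i, j)$. Therefore, $K_{i \triangle j}$ has mean $2\nu t_{ij}$. Similarly, let $\mathcal{K}_{i \setminus j}$ be the set of mutations carried by $i$ but not $j$. Then
$$\mathrm{E}\{(Z_i - Z_j)^2 \mid \nu, A, \mathcal{T}\} = \mathrm{E}\Big\{\Big(\sum_{k \in \mathcal{K}_{i \setminus j}} X_k - \sum_{k \in \mathcal{K}_{j \setminus i}} X_k\Big)^2\Big\} = \mathrm{E}\Big\{\sum_{k=1}^{K_{i \triangle j}} \mathrm{E}\{X_k^2\}\Big\} = \mathrm{E}\{K_{i \triangle j}\} = 2\nu t_{ij}, \tag{3.14a}$$
$$\mathrm{cov}\{Z_i, Z_j \mid \nu, A, \mathcal{T}\} = \mathrm{var}\{Z_i \mid \nu, A, \mathcal{T}\} - \tfrac{1}{2}\mathrm{E}\{(Z_i - Z_j)^2 \mid \nu, A, \mathcal{T}\} = \nu t_{mrca} - \nu t_{ij}. \tag{3.14b}$$
[Again, the cross terms are 0 by mutual independence.]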
Now we have expressions for the mean, variance and covariance of the genotypes at a particular microsatellite, given the site-specific mutation rate $\nu$, ancestral allele $A$ and genealogy $\mathcal{T}$. We treat $\nu$ and $A$ as nuisance parameters to be estimated and we marginalize the genealogy out. The goal is to express the model in terms of the expected coalescence times rather than the coalescence times at a particular site. We took the same approach for SNP data, but in that case $A = 0$ for every segregating site and $\theta$ is eliminated in the small mutation limit $\theta \to 0$. Finally,
$$\mathrm{E}\{Z_i \mid \nu, A\} = \mathrm{E}\{A \mid \nu, A\} = A, \tag{3.15a}$$
$$\mathrm{var}\{Z_i \mid \nu, A\} = \mathrm{E}\{\nu t_{mrca} \mid \nu, A\} + \mathrm{var}\{A \mid \nu, A\} = \nu T_{mrca}, \tag{3.15b}$$
$$\mathrm{cov}\{Z_i, Z_j \mid \nu, A\} = \mathrm{E}\{\nu t_{mrca} - \nu t_{ij} \mid \nu, A\} + \mathrm{var}\{A \mid \nu, A\} = \nu (T_{mrca} - T_{ij}). \tag{3.15c}$$
[These steps use $\mathrm{E}\{X\} = \mathrm{E}\{\mathrm{E}\{X \mid W\}\}$, $\mathrm{var}\{X\} = \mathrm{E}\{\mathrm{var}\{X \mid W\}\} + \mathrm{var}\{\mathrm{E}\{X \mid W\}\}$ and $\mathrm{cov}\{X, Y\} = \mathrm{E}\{\mathrm{cov}\{X, Y \mid W\}\} + \mathrm{cov}\{\mathrm{E}\{X \mid W\}, \mathrm{E}\{Y \mid W\}\}$.]
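The conditional moments in equations (3.13b), (3.14a) and (3.14b) can be checked by simulation for a fixed genealogy. The sketch below is illustrative and makes several assumptions: the function name, the default ancestral allele of 10, and the reduction of the genealogy to two lineages that share a branch of length `t_mrca - t_ij` above their common ancestor and then evolve independently for time `t_ij` each. Here `nu` stands for the mutation rate.

```python
import numpy as np

def simulate_pair(nu, t_mrca, t_ij, ancestral=10, size=200_000, seed=0):
    """Simulate alleles (Z_i, Z_j) under the symmetric stepwise mutation model."""
    rng = np.random.default_rng(seed)

    def net_steps(length):
        # Poisson number of mutations on a branch of the given length;
        # each mutation is +1 or -1 with probability 1/2.
        k = rng.poisson(nu * length, size=size)
        ups = rng.binomial(k, 0.5)
        return 2 * ups - k

    shared = net_steps(t_mrca - t_ij)           # branch above mrca(i, j)
    z_i = ancestral + shared + net_steps(t_ij)  # private branch of i
    z_j = ancestral + shared + net_steps(t_ij)  # private branch of j
    return z_i, z_j
```

With `nu = 0.5`, `t_mrca = 4` and `t_ij = 1`, the sample variance of each allele should be close to `nu * t_mrca = 2`, the mean squared difference close to `2 * nu * t_ij = 1`, and the sample covariance close to `nu * (t_mrca - t_ij) = 1.5`, up to Monte Carlo error.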
In the case of microsatellites, we do not condition on observing variability in the sample, i.e., on the event $\{K_{tot} > 0\}$, as microsatellites have higher mutation rates and we can estimate the parameter rather than take its limit to 0. For SNPs such that we observe exactly one mutation at every site, the "variability" condition is explicitly modeled because it modifies the genealogy distribution. Intuitively, it "stretches" the tree and thus changes (proportionally) all branches $t \in \mathcal{T}$.
Therefore, the genotype vector of $n$ sampled individuals at a particular microsatellite has mean and variance
$$\mathrm{E}_*\{Z\} = \mu 1, \qquad \mathrm{var}_*\{Z\} = \sigma^2 (11' - \lambda\Delta), \tag{3.16}$$
where the symbol $*$ indicates conditioning on the ancestral allele $A$ and the mutation rate $\nu$, and the parameters are given by
$$\mu = A, \qquad \sigma^2 = \nu T_{mrca}, \qquad \lambda = \frac{1}{T_{mrca}}. \tag{3.17}$$
As for SNP data, the mean and the variance of genotypes at a particular locus do not depend on the origin of an individual. However, for microsatellite data, the mean and the variance vary across sites because the ancestral allele $A$ and the mutation rate $\nu$ are both site-specific parameters. On the other hand, the scale $\lambda$ is shared across sites and therefore every site has the same correlation matrix $\Sigma \equiv 11' - \lambda\Delta$.
With this parametrization, the demographic parameters are estimable up to a proportionality constant. If we multiply the migration and coalescence rates by 2, we speed up the structured coalescent process by a factor of 2, and hence, we decrease the expected coalescence times by a factor of 2. However, the covariance matrix $\Sigma$ remains unchanged because the dissimilarity matrix $\Delta$ is appropriately scaled.
3.3 Effective migration can explain spatial structure in genetic variation
In the previous section, we discussed how to specify the mapping from the stepping-stone model $G = (V, E, m)$ to the genetic covariance matrix $\mathrm{cor}\{Z\} = \Sigma$, for both SNP and microsatellite data. Briefly, we followed three steps. First, $G = (V, E, m)$ determines $T = (T_{\alpha\beta})$ through the system of linear equations (2.15). Then, in turn, the expected coalescence times between demes, $T$, determine the expected coalescence times between sampled individuals, $\Delta$, through equation (3.5). Finally, the distance matrix determines the correlation matrix $\Sigma = 11' - \lambda\Delta$ by equation (3.3b), where $\lambda$ is an appropriately chosen scalar parameter that guarantees $\Sigma$ is positive definite.
Our goal is to estimate the effective migration rates $m$ across the habitat; these are sample-independent (population-level) features of the population graph $G$. The mean $\mu$ and the variance $\sigma^2$ of derived alleles as well as the scale factor $\lambda$ of expected coalescence times can be treated as nuisance parameters because they are sample-dependent and shared by all individuals in the sample. For example, for haploid SNPs the overall mean is $\mu = T_{mrca}/T_{tot}$ [with $\sigma^2 = \mu(1 - \mu)$] and the scale factor satisfies $\lambda\sigma^2 = 1/T_{tot}$ by equation (3.4), so $(\mu, \sigma^2, \lambda)$ contain some information about $G$. Although the scalars $T_{tot}$ and $T_{mrca}$ are, formally, functions of the effective migration rates $m$, they are very difficult to compute.
On the other hand, the matrix $T = (T_{ij})$ of pairwise coalescence times is a computable function of $M$. This matrix is also a pairwise dissimilarity (distance) matrix [and formally, a semivariogram]: the more genetically dissimilar two individuals are, the longer the time to their most recent common ancestor because the probability that the branch $T_{ij}$ accumulates a mutation is proportional to its relative length in the average genealogy tree. The property that $T$ is a distance matrix is important because it can explain genetic dissimilarities (correlations) as a linear function of distances between locations. Expected coalescence time is a particular choice of distance metric motivated by coalescent theory [McVean, 2009]. We can consider other metrics such as effective resistance [McRae, 2006].

Edge weights: migration rates $M = \{m(e) : e \in E\}$ or conductances $C = \{c(e) : e \in E\}$, stored as a symmetric matrix of weights.
$\xrightarrow{(1)}$ Population distance matrix $\Delta$: coalescence times $T = \{T_{\alpha\beta}\}$ or effective resistances $R = \{R_{\alpha\beta}\}$, for all $(\alpha, \beta) \in V \times V$.
$\xrightarrow{(2)}$ Sample distance matrix $\Delta$: coalescence times $T = \{T_{ij}\}$ or effective resistances $R = \{R_{ij}\}$, for all pairs $(i, j)$ of sampled individuals.
$\xrightarrow{(3)}$ Sample covariance matrix $\Sigma \equiv \mathbf{1}\mathbf{1}' - \lambda\Delta$.
The first step, denoted by $\xrightarrow{(1)}$, is to compute all $d(d + 1)/2$ pairwise distances between $d$ demes. This operation is expensive even for medium-size grids. However, the covariance matrix $\Sigma$ is a function of the sample distance matrix, not the population distance matrix. That is, in principle, we could avoid computing the full $d \times d$ dissimilarity matrix, especially for sparsely sampled habitats. [This is the advantage of $R$ over $T$.]
In a certain sense, $T$ is an "appropriate" dissimilarity measure for population structure as genetically similar individuals are likely to have a recent common ancestor and thus shorter coalescence time. For the stepping-stone model we can obtain the matrix of pairwise coalescence times $T$ exactly or approximate it with the matrix of effective resistances, $R$. However, the stepping-stone model itself does not represent the true history of the population: the grid is placed arbitrarily and there are underlying assumptions, including equilibrium in time, low mutation rate and no selection. Therefore, in a manner similar to McRae's definition of the effective migration rate, $m_{\alpha\beta}$, for a pair of demes, we should interpret the migration rate function $M = \{m_{\alpha\beta} : (\alpha, \beta) \in V \times V\}$ as an effective migration surface because it would produce the observed patterns of genetic differentiation if the population were evolving under the stepping-stone model.
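As a rough illustration of steps (1) and (2) with effective resistance as the distance, the sketch below uses the standard Laplacian pseudoinverse formula $R_{\alpha\beta} = L^{+}_{\alpha\alpha} + L^{+}_{\beta\beta} - 2L^{+}_{\alpha\beta}$. The function names, the toy path graph and the sample configuration are assumptions made for illustration; this is not the implementation used in the thesis. For simplicity it forms the full population matrix first and gives two individuals from the same deme distance 0.

```python
import numpy as np

def effective_resistance(weights):
    """Effective resistance between all pairs of nodes of a connected graph.

    `weights` is a symmetric (d x d) array of edge conductances,
    with zeros where two demes are not connected.
    """
    W = np.asarray(weights, dtype=float)
    lap = np.diag(W.sum(axis=1)) - W       # graph Laplacian
    lap_pinv = np.linalg.pinv(lap)         # Moore-Penrose pseudoinverse
    diag = np.diag(lap_pinv)
    # R_ab = L+_aa + L+_bb - 2 L+_ab
    return diag[:, None] + diag[None, :] - 2.0 * lap_pinv

def sample_distances(pop_dist, deme_of_sample):
    """Step (2): pick out the rows/columns of the demes that were sampled."""
    idx = np.asarray(deme_of_sample)
    return pop_dist[np.ix_(idx, idx)]

# Hypothetical path graph 0 - 1 - 2 with unit conductances.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
R = effective_resistance(W)
# Three individuals: two from deme 0 and one from deme 2.
R_sample = sample_distances(R, [0, 0, 2])
```

On the path graph the resistances add in series, so demes 0 and 2 are twice as far apart as demes 0 and 1.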
3.4 Related methods for analyzing population structure
We have shown that genetic correlations can be modeled in terms of a distance matrix. This representation is motivated by the relationship between genetic similarities and expected coalescence times. However, we can consider other distance metrics (on the population graph) as long as they capture relevant features of a spatially heterogeneous habitat, and effective resistance is particularly useful because it approximates the coalescent-based metric and is efficient to compute.
Here we discuss briefly two related methods for analyzing spatially distributed populations.
3.4.1 MIGRATE
[Beerli and Felsenstein, 2001] develop an approach to estimate migration rates among demes, and more generally, to compare and rank structured population models. Their method MIGRATE is also based on the structured coalescent but it makes different assumptions about the spatial distribution and the migration pattern.
In MIGRATE the demes are sampling locations and all demes potentially exchange migrants, so the population graph is constructed without explicit geographic information. [Some edges can be excluded to test and compare various migration patterns.] Every deme in the resulting graph has a size parameter and every edge has two migration parameters. [MIGRATE allows asymmetric gene flow.] Thus for a graph with $d$ demes, the most complex model to test has $d(d - 1)$ migration rates and $d$ deme sizes.
In contrast, our method uses a regular triangular grid constructed independently of the sampling configuration [or an a priori grouping of individuals into subpopulations]. Migration is symmetric and constrained to occur only between neighboring demes but not all demes need to be sampled. Edges are grouped via a Voronoi tessellation of the habitat to encourage parameter sharing and locally constant migration. This representation is flexible and the number of (unique) migration rates varies with the number of tiles.
Note: A Voronoi tessellation of a Euclidean space is a partition into $T$ convex polygons (tiles) generated by $T$ distinct points (centers). The region associated with the $t$th center $u$ is the set of points closer to $u$ than any other center. Boundary points are equidistant to two centers [Okabe et al., 2000].
3.4.2 GENELAND
[Guillot et al., 2005] also use Voronoi tiling to model the spatial structure in genetic variation but their method GENELAND is cluster-based and thus best suited to analyze discrete structure. Since individuals sampled from geographically close locations are more likely to come from the same subpopulation, GENELAND attempts to find clusters that are both genetically and geographically coherent. Compared with a spatial representation in terms of a population graph, such clusters can correspond to single demes in the graph (e.g., if migration is low and even demes close in space are clearly differentiated); or they can correspond to groups of demes where allele frequency distributions are indistinguishable (e.g., if gene flow is high so that a mutation that arises in one deme can quickly "spread" to nearby locations).
4
Estimating Effective Rates of Migration
In this chapter we introduce a likelihood function and prior distributions to perform Bayesian inference for the effective migration surface $M$ based on the similarities observed in georeferenced genetic data. The posterior estimate of $M$ can represent graphically population-level features such as barriers to migration, or more generally, the combined effect of evolutionary processes on genetic differentiation.
Our method assumes that we have data for $n$ individuals sampled from a spatially distributed population at locations $(x_1, y_1), \dots, (x_n, y_n)$ and genotyped at $p$ loci, either SNPs or microsatellites. The geographic information is used to assign individuals to the closest deme in the population graph $(V, E)$; this defines the sample configuration $\alpha = (\alpha_1, \dots, \alpha_n)$. Given $G = (V, E, M)$ with symmetric migration rates $M = (m_{\alpha\beta})$ we can compute the pairwise distance matrix for the entire population, $\Delta = (\Delta_{\alpha\beta})$; given this matrix and the deme indicators $\alpha$ we can obtain the expected pairwise distances for the observed sample, $\Delta = (\Delta_{ij})$. Notation: Here we discuss the likelihood of the sample, so we will write simply $\Delta$ throughout as there is no need to distinguish between the population and the sample distance matrices.
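The assignment of individuals to demes can be sketched as a nearest-neighbor lookup. The example below assumes planar coordinates and straight-line (Euclidean) distance; the function name and the toy coordinates are hypothetical and only illustrate how a sample configuration could be derived.

```python
import math

def assign_to_demes(sample_xy, deme_xy):
    """Return, for each sampled individual, the index of the closest deme."""
    assignment = []
    for (x, y) in sample_xy:
        dists = [math.hypot(x - dx, y - dy) for (dx, dy) in deme_xy]
        assignment.append(dists.index(min(dists)))
    return assignment

# Hypothetical deme centers (one triangle of a triangular grid)
# and hypothetical sampling locations.
deme_xy = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.87)]
sample_xy = [(0.1, 0.1), (0.9, -0.2), (0.5, 0.8), (0.05, -0.05)]

alpha = assign_to_demes(sample_xy, deme_xy)  # sample configuration
```

Several individuals may map to the same deme, and some demes may receive no individuals at all.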
In the previous chapter we derived expressions for the mean and variance of the allele count vector $Z = (Z_i)$ at a segregating site [eq. (3.3) for single nucleotide polymorphisms; eq. (3.16) for microsatellites]. Recall that

$$E\{Z\} = \mu \mathbf{1}, \qquad \operatorname{var}\{Z\} = \sigma^2 (\mathbf{1}\mathbf{1}' - \lambda\Delta), \tag{4.1}$$

where $\mu$ is the allele frequency and $\sigma^2$ is the variance in allele frequency [in the sample, not the population]. It is convenient to normalize $\Delta$ so that $\mathbf{1}'\Delta^{-1}\mathbf{1} = 1$; then the correlation matrix $\Sigma = \mathbf{1}\mathbf{1}' - \lambda\Delta$ is positive definite for $\lambda \in (0, 1)$ [Appendix 7.2].
Recall further that neutral sites (not under selection) develop under the same coalescent process, and therefore, the genotype vectors $Z = (Z_1, \dots, Z_p) \in \mathbb{Z}^{n \times p}$ at $p$ segregating sites have the same correlation matrix $\Sigma$. The scalar parameters $\mu, \sigma^2$ can vary across sites. For microsatellites $\mu$ is the ancestral allele and $\sigma^2$ depends on the mutation rate $\theta$, and both are site specific. For SNPs $\mu$ is the expected allele frequency if the derived allele is coded as 1; but the labels might not be consistent as usually the minor allele is coded as 1.
Our aim here is to incorporate these expressions for the mean and variance into a likelihood function in order to infer effective migration rates from observed data. Note that every individual has mean $\mu$ regardless of location; intuitively, the shared parameter $\mu$ contains little information about patterns of genetic differentiation between individuals, as we discuss in Section 4.4. So, to simplify, assume that we observe the pairwise differences, $Z_i - Z_j$, rather than the allele counts $Z_i$. Equivalently, assume that we observe $LZ$ where $L \in \mathbb{R}^{(n-1) \times n}$ is a basis for contrasts, e.g., $L = (e_2 - e_1, e_3 - e_1, \dots, e_n - e_1)'$ where $e_i$ is the standard basis vector with 1 in the $i$th coordinate and 0 otherwise. Note that

$$E\{LZ\} = 0, \qquad \operatorname{var}\{LZ\} = -\sigma^* L\Delta L', \tag{4.2}$$

where we define $\sigma^* = \lambda\sigma^2$ because the variance and the scale are no longer separately identifiable. The matrix $-L\Delta L'$ is positive definite, and thus a valid covariance matrix, because the distance matrix $\Delta$ is negative definite on contrasts and $L'v$ is a contrast for every nonzero $v \in \mathbb{R}^{n-1}$.
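The two positive-definiteness claims above can be checked numerically. The sketch below is an illustration under stated assumptions: the model distances are replaced by plain Euclidean distances between a few random points (a convenient matrix that is negative definite on contrasts), and the variable names and the value $\lambda = 0.5$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
xy = rng.uniform(size=(n, 2))  # hypothetical sample locations

# Stand-in for the model distances: Euclidean distances between distinct points.
Delta = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)

ones = np.ones(n)
# Rescale so that 1' Delta^{-1} 1 = 1.
Delta = Delta * (ones @ np.linalg.solve(Delta, ones))
norm_check = ones @ np.linalg.solve(Delta, ones)

# Correlation matrix Sigma = 11' - lambda * Delta for a scale in (0, 1).
lam = 0.5
Sigma = np.outer(ones, ones) - lam * Delta
sigma_eigs = np.linalg.eigvalsh(Sigma)

# Contrast basis L = (e_2 - e_1, ..., e_n - e_1)'.
L = np.hstack([-np.ones((n - 1, 1)), np.eye(n - 1)])
contrast_cov = -L @ Delta @ L.T
contrast_eigs = np.linalg.eigvalsh(contrast_cov)
```

Every row of `L` sums to zero, so `L @ Z` depends on the genotypes only through pairwise differences.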
Therefore, it might be natural to assume a Normal likelihood for the pairwise differences,

$$LZ \mid \sigma^*, \Delta \sim N_{n-1}(0, -\sigma^* L\Delta L'). \tag{4.3}$$

Suppose further that the genotyped markers are independent; then it is straightforward to extend the Normal likelihood (4.3) for one locus to multiple loci. In particular, for SNP data where usually there are many more SNPs than individuals and mutation rates are low, let $S = ZZ'/p$ be the observed similarity matrix averaged across $p$ SNPs. Then $LSL'$ is a scatter matrix of pairwise differences and

$$LSL' \mid \sigma^*, \Delta \sim W_{n-1}\left(p, -\frac{\sigma^*}{p}(L\Delta L')\right), \tag{4.4}$$

where the degrees of freedom are the number of independent SNPs and the scale parameter $\sigma^*$ is shared. Therefore, by considering the pairwise differences, we avoid estimating a nuisance parameter $\mu$ with dimensionality that grows with the number of markers $p$. In practice we also gain efficiency with faster MCMC convergence.
4.1 Effective degrees of freedom for SNP data
So far we have considered the case where the $p$ genotyped markers are independent (unlinked). The assumption of independence between loci is very strong and likely to be violated. In particular, SNPs in close proximity are often associated (in linkage disequilibrium) because individuals inherit long segments of unbroken DNA from their parents. For this reason, SNP data are often "thinned" by removing SNPs in high LD. We propose an alternative method to correct for model mis-specification due to both dependence between SNPs and non-normality of genotypes.
In the Wishart likelihood (4.4) the scatter matrix of contrasts, $LSL'$, has known degrees of freedom $p$. However, instead of fixing the degrees of freedom to the number of genotyped SNPs, we can estimate this parameter. The likelihood for the scatter matrix becomes

$$LSL' \mid k, \sigma^*, \Delta \sim W_{n-1}\left(k, -\frac{\sigma^*}{k}(L\Delta L')\right), \tag{4.5}$$

with degrees of freedom $k \in (n, p)$. Both Wishart likelihoods (4.4) and (4.5) imply $E\{LSL'\} = -\sigma^* L\Delta L'$. Therefore, estimating the degrees of freedom does not affect the expected pairwise differences as a function of effective migration. However, the Wishart variance is proportional to $(\sigma^*)^2/k$, so if we infer $k \in (n, p)$ rather than set $k = p$, the model variance increases as we would expect if the data contain less information than the sample size suggests, or more generally, if the model is mis-specified. Under normality, $k = p$ implies that all sites are independent; otherwise, the variance increases by a factor of $p/k$.
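The effect of the degrees of freedom can be seen from the standard Wishart moments: if $X \sim W(k, V/k)$ then $E\{X\} = V$ and $\operatorname{var}\{X_{ij}\} = (V_{ij}^2 + V_{ii}V_{jj})/k$. The sketch below evaluates these formulas for a small, arbitrary matrix $V$ standing in for $-\sigma^* L\Delta L'$; the values of $p$ and $k$ are hypothetical.

```python
import numpy as np

def wishart_mean_and_var(V, k):
    """Element-wise mean and variance of X ~ Wishart(k, V / k)."""
    V = np.asarray(V, dtype=float)
    d = np.diag(V)
    mean = V.copy()                       # k * (V / k)
    var = (V ** 2 + np.outer(d, d)) / k   # k * ((V_ij / k)^2 + V_ii V_jj / k^2)
    return mean, var

V = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # stands in for -sigma* L Delta L'
p, k = 3000, 600             # number of SNPs, effective degrees of freedom

mean_p, var_p = wishart_mean_and_var(V, p)
mean_k, var_k = wishart_mean_and_var(V, k)
inflation = var_k / var_p    # element-wise variance ratio
```

The mean is the same for both choices, while every entry of the variance grows by the factor $p/k$.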
4.2 Prior on migration surface represented as a Voronoi tessellation
We have proposed a model for population structure in terms of expected pairwise distances on a population graph $G = (V, E, M)$ where $(V, E)$ is a rectangular grid and $M$ assigns effective migration rates to edges in the graph. The goal is to estimate the effective migration surface $M$ so that the demographic model $G$ explains the observed genetic dissimilarities. The grid is fixed; the likelihood is defined in the previous section. Here we consider prior specification for $M$.
The regular grid $(V, E)$ is not determined by the sampling locations and it yields a high-dimensional, flexible representation so that fine features in the effective migration surface can emerge if supported by the data. To take advantage of this flexibility, we organize the edges in terms of a Voronoi tessellation of the habitat. Statistically, the Voronoi decomposition offers the advantages of parameter sharing and a locally smooth migration surface. Previous applications of Voronoi tiling in population genetics include [Guillot et al., 2005] and [Wasser et al., 2004].
A Voronoi tessellation of the migration surface $M$ is fully specified by the number of tiles $T$, their locations $u$ and migration rates $m$. Thus $m = \{m_t : t = 1, \dots, T\}$ is the set of effective migration rates for the $T$ tiles in the partition. Furthermore, let edge $(\alpha, \beta) \in E$ have migration rate

$$m_{\alpha\beta} = \frac{1}{2} m_{t_\alpha} + \frac{1}{2} m_{t_\beta}, \tag{4.6}$$

where $t_\alpha$ denotes the tile deme $\alpha$ falls into. That is, the rate of an edge is the average rate of the two tiles it connects.
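Equation (4.6) can be sketched as follows. The example assumes planar coordinates, assigns each deme to the tile with the nearest center, and averages the two tile rates on each edge; the function names and the toy grid are hypothetical, and tile rates are given on the original (not log) scale.

```python
import math

def nearest_center(point, centers):
    """Index of the Voronoi tile (nearest center) that contains `point`."""
    dists = [math.dist(point, c) for c in centers]
    return dists.index(min(dists))

def edge_rates(deme_xy, edges, centers, tile_rates):
    """Edge rate = average of the rates of the tiles its two demes fall into."""
    tile_of = [nearest_center(xy, centers) for xy in deme_xy]
    rates = {}
    for (a, b) in edges:
        rates[(a, b)] = 0.5 * tile_rates[tile_of[a]] + 0.5 * tile_rates[tile_of[b]]
    return tile_of, rates

# Hypothetical chain of four demes covered by two tiles.
deme_xy = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
edges = [(0, 1), (1, 2), (2, 3)]
centers = [(0.5, 0.0), (2.5, 0.0)]
tile_rates = [1.0, 10.0]

tile_of, rates = edge_rates(deme_xy, edges, centers, tile_rates)
```

Edges inside a tile inherit the tile rate, and only edges that cross a tile boundary take an intermediate value.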
Migration rates are naturally positive and therefore we parametrize them on the log scale as differences from the overall mean rate $\bar{m}$,

$$\log_{10}(m_t) = \bar{m} + e_t. \tag{4.7}$$

If the effect of distance on differentiation is space-homogeneous and the tile-specific effects $e_t$ are (close to) 0, the migration pattern thus produced would correspond to isolation by distance.
Therefore, our model has the following parameters:
1. parameters of interest $\Theta_1$ that determine the effective migration rates $M$ and thus the effective pairwise distances $\Delta$. These are
• $(T, \bar{m}, \sigma_m^2)$: number of tiles, mean and variance of tile migration rates on the log (base 10) scale.
• $\{(e_t, u_t) : t = 1, \dots, T\}$: relative effect and center location for each Voronoi tile $t$. The dimensionality of this group of parameters changes with the number of Voronoi tiles $T$.
• $k$: effective degrees of freedom for SNP data where we observe more sites than individuals, i.e., $p > n$.
2. nuisance parameters $\Theta_0$ that do not depend on the demographic model. For SNP data this is the scale parameter $\sigma^*$; for microsatellite data each site has its own scale parameter $\sigma^*_s$ because mutation rates vary across sites and under the stepwise mutation model the scale $\sigma^*_s$ is the mutation rate $\theta_s$.
Using the Voronoi tessellation $\mathcal{V}(T, u, m)$ to represent $M$, we can have fewer than $|E|$ rate parameters to estimate but we do not know how many tiles we need and where their centers are. This depends on the patterns of genetic differentiation across the habitat.
To complete the Bayesian specification we place priors on the model parameters:
(number of Voronoi tiles) π | π βΌ Po(π), (4.8a)
(tile locations) $u \mid T \overset{iid}{\sim} U(\mathcal{H})$, (4.8b)
(tile effects) $e \mid \sigma_m^2, T \overset{iid}{\sim} N(0, \sigma_m^2)$. (4.8c)
The hyperparameter of the Poisson prior (4.8a) controls how much spatial heterogeneity the effective migration surface exhibits. The rate hyperparameters are
(overall migration rate) $\bar{m} \sim U(\mathrm{lob}, \mathrm{upb})$, (4.9a)
(tile variance) π2π βΌ Inv-G(π/2, π/2). (4.9b)
[For all results we report here the two hyperparameters in (4.9b) are 6 and 3, respectively.] The lower and upper bounds on the mean log rate are chosen so that the mean migration rate varies in the range [1/300, 300] on the original scale. The bounds are somewhat arbitrary but based on simulations of genetic data with ms [Hudson, 2002]. Restricting the support is necessary because the model is not numerically stable at the two extremes:
• When migration rates are very small (relative to coalescence rates), it takes a very long time on average for two lineages from different demes to coalesce. In the limit, the population is a collection of unrelated subpopulations that evolve independently.
• When migration rates are very large (relative to coalescence rates), the time it takes to move from one deme to another is negligible compared to the coalescence times. In the limit, the population behaves like a panmictic population without any structure.
The prior on the effective degrees of freedom is uniform on the log scale:
(degrees of freedom) $\pi(k) \propto 1/k$. (4.10)
The prior is proper because $k$ is bounded: $k > n$ because $k$ is the degrees of freedom parameter in a Wishart distribution, and $k < p$ because $k$ should not exceed the number of observed sites (features). [The normalizing constant is therefore $\log(p) - \log(n)$.]
(scale parameter) πβ βΌ Inv-G(π/2, π/2). (4.11)
[For all results we report here both hyperparameters in (4.11) are equal to 1.]
We use reversible-jump MCMC to estimate $M$ as the dimension of both $u$ and $e$ changes as the number of tiles $T$ increases or decreases. Full details about the MCMC implementation are given in Appendix 7.6.
4.3 Likelihood for distance matrices
The Wishart likelihood (4.5) is given in terms of the contrast basis $L$ but it does not depend on the choice of $L$. Instead, it can be written in terms of the observed similarities $S = ZZ'/p$, the model distances $\Delta$ and its orthogonal projection $Q$ given by

$$Q = I - \frac{\mathbf{1}\mathbf{1}'\Delta^{-1}}{\mathbf{1}'\Delta^{-1}\mathbf{1}}. \tag{4.12}$$
In Appendix 7.4 we show that the Wishart log likelihood that corresponds to the model (4.5) can be written as

$$
\begin{aligned}
\ell(k, \sigma^*, \Delta)
&= \frac{k}{2} \log\det\{-(L\Delta L')^{-1}/\sigma^*\}
 - \frac{k}{2} \operatorname{tr}\{-(L\Delta L')^{-1} LSL'/\sigma^*\} \\
&\quad + \frac{k - n}{2} \log\det\{LSL'\}
 + \frac{k(n - 1)}{2} \log(k/2)
 - \log\Gamma_{n-1}(k/2) \\
&= \frac{k}{2} \log\operatorname{Det}\{-\Delta^{-1}Q/\sigma^*\}
 - \frac{k}{2} \operatorname{tr}\{-(\Delta^{-1}Q)S/\sigma^*\} \\
&\quad + \frac{k - n}{2} \log\det\{S\}
 + \frac{k(n - 1)}{2} \log(k/2)
 - \log\Gamma_{n-1}(k/2) \\
&\quad + \frac{k - n}{2} \log\left(\frac{\mathbf{1}'S^{-1}\mathbf{1}}{\mathbf{1}'\mathbf{1}}\right)
 - \frac{n}{2} \log\det\{LL'\}.
\end{aligned}
\tag{4.13}
$$

The only term that involves the residual basis $L$ is $(n/2) \log\det\{LL'\}$. Regardless of the choice for $L$, this term does not depend on the parameters $(k, \sigma^*, \Delta)$. Full details about the likelihood computation are given in Appendix 7.5.
4.3.1 Related model
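This is the marginal likelihood for distance matrices introduced in [McCullagh, 2009]. Let $D$ be the $n \times n$ pairwise dissimilarity matrix given by

$$D = \mathbf{1}\operatorname{diag}(S)' + \operatorname{diag}(S)\mathbf{1}' - 2S. \tag{4.14}$$

The orthogonal projection $Q = I - \mathbf{1}\mathbf{1}'\Sigma^{-1}/(\mathbf{1}'\Sigma^{-1}\mathbf{1})$ satisfies

$$Q'\Sigma^{-1} = Q'\Sigma^{-1}Q = -\lambda^{-1} Q'\Delta^{-1}Q = -\lambda^{-1} Q'\Delta^{-1} \tag{4.15}$$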
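since $\ker\{Q\} = \operatorname{span}\{\mathbf{1}\}$ and thus $Q\mathbf{1} = 0$. Similarly, $QDQ' = -2QSQ'$. Therefore, for fixed $k = p$ and after we ignore all terms that do not involve $\Delta$ or $\sigma^*$,

$$\ell(\sigma^*, \Delta; S) \equiv \ell(\sigma^2, \Sigma; D) \propto \frac{p}{2} \log\operatorname{Det}\{-\Delta^{-1}Q/\sigma^*\} - \frac{p}{2} \operatorname{tr}\{-\Delta^{-1}QS/\sigma^*\} \tag{4.16a}$$

$$= \frac{p}{2} \log\operatorname{Det}\{\Sigma^{-1}Q/\sigma^2\} + \frac{p}{4} \operatorname{tr}\{\Sigma^{-1}QD/\sigma^2\} \tag{4.16b}$$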
where $\sigma^* = \lambda\sigma^2$.
Recently, [Hanks and Hooten, 2013] build this likelihood into a parametric model for isolation by resistance [McRae, 2006]. Briefly, the genetic data is generated by a Gaussian Markov random field on an undirected graph (circuit). The covariance structure is given by an intrinsic conditional autoregressive model, i.e., the conditional distribution of each node given the rest of the graph is normal with mean and variance that depend on its first-degree neighbors only. [Hanks and Hooten, 2013] specify the model so that the expected square differences between nodes are exactly effective resistance distances on the population graph. In our notation, let $\Delta = R$ be the matrix of effective resistances. [Note that this is slightly different from the IBR-based approximation to expected coalescence times $\Delta = T$.] [Bapat, 2004] shows that

$$R^{-1} = -\frac{1}{2}L + \tau\tau' \tag{4.17}$$

where $L$ is the Laplacian of the graph $G = (V, E, M)$ and $\tau\tau'$ is a rank-one update. Then

$$R^{-1}Q = R^{-1} - \frac{R^{-1}\mathbf{1}\mathbf{1}'R^{-1}}{\mathbf{1}'R^{-1}\mathbf{1}} = \left(-\frac{1}{2}L + \tau\tau'\right) - \frac{(\tau(\mathbf{1}'\tau))(\tau(\mathbf{1}'\tau))'}{(\mathbf{1}'\tau)^2} = -\frac{1}{2}L \tag{4.18}$$

$$
\begin{aligned}
(R + \mathbf{1}\mathbf{1}')^{-1}
&= R^{-1}Q + \frac{R^{-1}\mathbf{1}\mathbf{1}'R^{-1}}{\mathbf{1}'R^{-1}\mathbf{1}} - \frac{R^{-1}\mathbf{1}\mathbf{1}'R^{-1}}{1 + \mathbf{1}'R^{-1}\mathbf{1}} \\
&= R^{-1}Q + \frac{R^{-1}\mathbf{1}\mathbf{1}'R^{-1}}{(\mathbf{1}'R^{-1}\mathbf{1})(1 + \mathbf{1}'R^{-1}\mathbf{1})} \\
&= -\frac{1}{2}L + \frac{\tau\tau'}{1 + (\mathbf{1}'\tau)^2}
\end{aligned}
\tag{4.19}
$$

$$Q[R + \mathbf{1}\mathbf{1}'] = I - \frac{\mathbf{1}\tau'(\mathbf{1}'\tau)}{1 + (\mathbf{1}'\tau)^2} \cdot \frac{1 + (\mathbf{1}'\tau)^2}{(\mathbf{1}'\tau)^2} = I - \frac{\mathbf{1}\tau'}{\mathbf{1}'\tau} \tag{4.20}$$

$$(R + \mathbf{1}\mathbf{1}')^{-1}Q = -\frac{1}{2}L + \frac{\tau\tau'}{1 + (\mathbf{1}'\tau)^2} - \frac{\tau\tau'}{1 + (\mathbf{1}'\tau)^2} = -\frac{1}{2}L \tag{4.21}$$

That is, $B^{-1}Q[B]/4 = R^{-1}Q[R]$ where $B = R/4 + \mathbf{1}\mathbf{1}'$. [Hanks and Hooten, 2013] represent conductances between connected nodes as a function of landscape features, e.g., elevation. Instead we represent conductances [i.e., migration rates] through a colored Voronoi tessellation and estimate edge weights without reference to available ecological variables.
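The identities $R^{-1}Q[R] = -\frac{1}{2}L$ and $B^{-1}Q[B]/4 = R^{-1}Q[R]$ can be checked numerically on a small graph. The sketch below is only a sanity check under assumptions: the weighted graph is made up, effective resistances are computed from the Laplacian pseudoinverse, and the helper names are hypothetical.

```python
import numpy as np

def resistance_matrix(W):
    """Graph Laplacian and effective resistance matrix of a connected graph."""
    lap = np.diag(W.sum(axis=1)) - W
    lap_pinv = np.linalg.pinv(lap)
    d = np.diag(lap_pinv)
    return lap, d[:, None] + d[None, :] - 2.0 * lap_pinv

def projection(A):
    """Q[A] = I - 1 1' A^{-1} / (1' A^{-1} 1)."""
    n = A.shape[0]
    ones = np.ones(n)
    w = np.linalg.solve(A.T, ones)   # w' = 1' A^{-1}
    return np.eye(n) - np.outer(ones, w) / (w @ ones)

# Hypothetical connected graph on four nodes with symmetric conductances.
W = np.array([[0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 3.0, 0.5],
              [0.0, 3.0, 0.0, 1.0],
              [2.0, 0.5, 1.0, 0.0]])
lap, R = resistance_matrix(W)

lhs_R = np.linalg.solve(R, projection(R))          # R^{-1} Q[R]
B = R / 4.0 + np.ones((4, 4))
lhs_B = np.linalg.solve(B, projection(B)) / 4.0    # B^{-1} Q[B] / 4
```

Both quantities should agree with $-\frac{1}{2}L$ up to floating-point error, which is why the likelihood depends on the resistances only through the graph Laplacian.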
Modeling the dissimilarity matrix $D$ instead of the raw allele counts $Z$ is convenient because
• Suppose that $O$ is an orthogonal transformation (rotation or reflection). Then
$$S^{(O)} = (ZO)(ZO)' = ZOO'Z' = ZZ' = S$$
• Suppose we apply a translation by $a = (a_1, \dots, a_p)'$. Then
$$D^{(a)}_{ij} = \lVert (z_i - a) - (z_j - a) \rVert^2 = \lVert z_i - z_j \rVert^2 = D_{ij}$$
Although the transformation from the entire data $Z$ to the summary $D$ is not a one-to-one transformation, we do not lose information about relative distances, i.e., population structure. Instead we lose information about some nuisance parameters. For example, $Z \to D$ makes the labeling of the alleles irrelevant.
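These invariances are easy to confirm numerically. The sketch below builds $D$ from equation (4.14) for simulated 0/1 genotypes and compares it with the matrices obtained after a random rotation, a translation, and a flip of the allele labels; the data and variable names are hypothetical.

```python
import numpy as np

def dissimilarity(Z):
    """D = 1 diag(S)' + diag(S) 1' - 2 S with S = Z Z' / p."""
    S = Z @ Z.T / Z.shape[1]
    s = np.diag(S)
    return s[None, :] + s[:, None] - 2.0 * S

rng = np.random.default_rng(1)
n, p = 5, 40
Z = rng.integers(0, 2, size=(n, p)).astype(float)  # hypothetical genotypes

O, _ = np.linalg.qr(rng.normal(size=(p, p)))  # random orthogonal matrix
a = rng.normal(size=p)                        # translation vector

D = dissimilarity(Z)
D_rot = dissimilarity(Z @ O)       # rotation or reflection of the sites
D_shift = dissimilarity(Z - a)     # same shift applied to every individual
D_flip = dissimilarity(1.0 - Z)    # swap the 0/1 allele labels
```

All four matrices coincide, while the similarity matrix $S$ itself changes under the translation and the label flip.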
4.4 What do we lose from ignoring the means?
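We can use the marginal likelihood (4.3) because sampled individuals are equally distant from the most recent common ancestor of the sample [the root of the genealogy tree], and therefore, share a common mean. Thus $K = \mathbf{1}$ is a basis for the mean space. [Recall that $L$ is a basis for the residual space of pairwise differences.] Therefore,

$$
\begin{pmatrix} \mathbf{1}' \\ L \end{pmatrix} Z \sim N_n\left(
\begin{pmatrix} n\mu \\ 0 \end{pmatrix},\;
\sigma^2 \begin{pmatrix} \mathbf{1}'\Sigma\mathbf{1} & \mathbf{1}'\Sigma L' \\ L\Sigma\mathbf{1} & L\Sigma L' \end{pmatrix}
\right)
\tag{4.22}
$$

Let $U = \mathbf{1}'Z = \sum_{i=1}^n Z_i$ and $W = LZ$. Then

$$W = LZ \mid \sigma^2, \Sigma \sim N_{n-1}(0, \sigma^2 L\Sigma L') \tag{4.23a}$$

$$
\begin{aligned}
U \mid W, \Sigma
&\sim N\left(n\mu + \mathbf{1}'\Sigma L'(L\Sigma L')^{-1}W,\; \sigma^2[\mathbf{1}'\Sigma\mathbf{1} - \mathbf{1}'\Sigma L'(L\Sigma L')^{-1}L\Sigma\mathbf{1}]\right) \\
&= N\left(n\mu + \mathbf{1}'QZ,\; \sigma^2\, \mathbf{1}'\mathbf{1}(\mathbf{1}'\Sigma^{-1}\mathbf{1})^{-1}\mathbf{1}'\mathbf{1}\right) \\
&= N\left(n\mu + \mathbf{1}'QZ,\; (1 - \lambda) n^2 \sigma^2\right)
\end{aligned}
\tag{4.23b}
$$

where $Q = \Sigma L'(L\Sigma L')^{-1}L = I - \mathbf{1}(\mathbf{1}'\Sigma^{-1}\mathbf{1})^{-1}\mathbf{1}'\Sigma^{-1}$ and $\mathbf{1}'\Sigma^{-1}\mathbf{1} = \mathbf{1}'\Delta^{-1}\mathbf{1}/(1 - \lambda)$.
The conditional distribution of $U$ depends on $\Delta$ only through the bias term $\mathbf{1}'QZ$. Therefore we choose to ignore it and use only the marginal distribution of $W$ to infer $\Delta$.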
4.5 Standardizing genotype data
Before performing PCA analysis for population structure it is common to standardize SNPs and to set the missing alleles to the observed average at the corresponding marker [McVean, 2009, Price et al., 2006]. The motivation is that without normalization SNPs with higher variance contribute more to the scatter matrix $ZZ'$. Therefore, the procedure tends to up-weight the influence of rare variants. Here we discuss why neither centering the genotypes to have mean 0 nor standardizing the variance is appropriate when analyzing population structure.
In matrix notation, let $C = I - \mathbf{1}\mathbf{1}'/n$ be the centering matrix for $n$ observations. Then multiplying by $C$ removes the mean:

$$X = CZ \overset{iid}{\sim} N_n(\mu C\mathbf{1}, \sigma^2 C\Sigma C) = N_n(0, -\sigma^* C\Delta C). \tag{4.24}$$

This operation is convenient because $XX'$ has a central Wishart distribution. It also makes the labelling of alleles as ancestral or derived ['0' or '1'] irrelevant, up to a change in sign. Suppose that we "flip" the 0/1 labels at a particular site, i.e., $z^* = \mathbf{1} - z$. Then $x^* = C(\mathbf{1} - z) = -Cz = -x$ because $C\mathbf{1} = (I - \mathbf{1}\mathbf{1}'/n)\mathbf{1} = 0$.
However, centering with $C$ assumes the individuals are independent and identically distributed, i.e., no population structure: If $\Sigma = I$, then the projection $Q$ onto the space of contrasts is $Q = I - \mathbf{1}\mathbf{1}'\Sigma^{-1}/(\mathbf{1}'\Sigma^{-1}\mathbf{1}) = C$. Since our model assumes the individuals are coupled with correlation given by $\mathbf{1}\mathbf{1}' - \lambda\Delta$, it is not appropriate to naively center the genotypes to have mean 0 or to substitute the average allele frequency for missing values. For SNP datasets, it is better to impute missing SNP values before analyzing population structure. There are various imputation algorithms but they all would take into account similarities across observed alleles to impute missing ones. For microsatellite datasets, which are usually much smaller and harder to impute, we use the likelihood for the observed pairwise distances only. [So that the sample configuration $\alpha$ is really site-specific.]
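A small numerical sketch of the two points above: flipping the allele labels only changes the sign of the centered genotypes, and the projection $Q$ reduces to the centering matrix $C$ only when $\Sigma = I$. The genotype vector and the structured $\Sigma$ below are arbitrary, hypothetical choices.

```python
import numpy as np

n = 5
ones = np.ones(n)
C = np.eye(n) - np.outer(ones, ones) / n   # centering matrix

z = np.array([0.0, 1.0, 1.0, 0.0, 1.0])    # hypothetical genotypes at one site
x = C @ z
x_flipped = C @ (1.0 - z)                  # same site with 0/1 labels swapped

def projection(Sigma):
    """Q = I - 1 1' Sigma^{-1} / (1' Sigma^{-1} 1) for symmetric Sigma."""
    w = np.linalg.solve(Sigma, ones)       # Sigma^{-1} 1
    return np.eye(n) - np.outer(ones, w) / (ones @ w)

Q_iid = projection(np.eye(n))              # no population structure
Sigma_structured = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
Q_structured = projection(Sigma_structured)
```

With a non-trivial $\Sigma$ the projection weights individuals unequally, so subtracting the plain sample mean is no longer the matching operation.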
Furthermore, it might not be appropriate to standardize SNPs to have the same variance precisely because this up-weights the contribution of rare alleles [McVean, 2009]. A mutation in effect splits sampled individuals into two groups that are slightly different genetically: those that carry the mutation versus those that do not. Intuitively, newer and especially private mutations, which are carried by a single individual, are informative for structure that is too fine to represent with a model at the level of demes.
5
Simulations of Structured Genetic Data
In this chapter we describe several simulated scenarios that we use to evaluate the performance of our method for estimating effective migration as well as to illustrate some of its properties. We use the program ms [Hudson, 2002] to simulate genetic data under the structured coalescent. Given the model parameters (deme sizes and migration rates) and the sample configuration, ms first generates a random genealogy, which describes the history of the sample from the present to its most recent common ancestor, and then places a Poisson number of mutations uniformly (and independently of each other) on the tree.
We use ms to simulate independent and identically distributed genealogies under the stepping-stone model $G = (V, E, M)$ with conservative migration $M = (m_{\alpha\beta})$. Therefore, the iid assumption across sites holds but the normality assumption is violated. In all examples, we generate $p = 3000$ single nucleotide polymorphisms for $n = 300$ haploid individuals on a $12 \times 8$ regular triangular grid. [The corresponding ms commands, with detailed explanations, are given in Appendix 7.8.]
Note: To generate histories with exactly one mutation, we choose a small mutation rate $\theta$ and discard genealogies that carry zero or multiple mutations.
5.1 Spatial structure due to constant migration
First we generate data under different patterns of migration, either uniform or with a barrier, to confirm that the method performs accurately when the underlying demographic model is correct. That is, the population does evolve on a known grid $(V, E)$ of equally sized demes and unknown migration rates. In these simulations, therefore, effective migration rates are true migration rates [up to a constant of proportionality that depends on the coalescent timescale $N_0$. We set up the simulations so that this constant is 1.]. We report migration rates, as they are parametrized, on the log (base 10) scale, and the blue/brown color scheme is based on [Brewer et al., 2003].

[Figure 5.1 panels: truth (uniform migration); inferred migration surface. Color scale: log10(m) from -1.8 to 1.8.]
Figure 5.1: Uniform migration rates and equal deme sizes: $N_\alpha = 1$ for all $\alpha \in V$ and $m_{\alpha\beta} = 1$ for all $(\alpha, \beta) \in E$. The size of the gray circles indicates the number of individuals sampled from the corresponding deme.
In Figures 5.1 and 5.2 we directly compare the truth (left) with the posterior mean (right). Not every deme in the population graph is necessarily observed but sampling is balanced because areas with higher migration are sampled with higher probability. In all three cases our method correctly captures the qualitative pattern of migration. And in Figure 5.3 we show samples from the posterior distribution on the colored Voronoi tessellation, to illustrate the uncertainty in the estimated effective migration surface.

[Figure 5.2 panels: truth (sharp barrier to migration) and inferred migration surface; truth (smooth barrier to migration) and inferred migration surface. Color scale: log10(m) from -1.8 to 1.8.]
Figure 5.2: Barrier to migration and equal deme sizes: migration rates vary between high, $m_{\alpha\beta} = 3$, and low, $m_{\alpha\beta} = 1/3$, in either a sharp or smooth pattern.

Figure 5.3: Draws from the posterior distribution of effective migration. a) sharp barrier to migration; b) smooth barrier to migration.
5.2 Spatial structure due to variation in diversity
The next set of simulations demonstrates that effective migration reflects the combined effect of demographic processes on genetic differentiation. In particular, we use two examples to show how differences in effective population size can influence effective migration rates. In the first example, lower migration rates cancel the effect of bigger deme sizes, to produce uniform effective migration. In the second example, only deme sizes vary to produce the effect of a barrier to migration.
To describe the simulations, consider the example graph with two groups of demes, $A$ (circles, smaller in size) and $B$ (squares, bigger in size), with deme sizes $N_A$ and $N_B$, respectively. Let $m_{AA}$ be the migration rate of all $A - A$ edges and $m_{BB}$ be the migration rate of all $B - B$ edges. We assign migration rates to the "across" edges $A - B$ and $B - A$ so that migration is conservative and deme sizes are constant over time, as required by the stepping-stone model. Formally,

$$\sum_{\gamma : \gamma \sim \alpha,\, \gamma \in A} m_{AA} N_A + \sum_{\gamma : \gamma \sim \alpha,\, \gamma \in B} m_{AB} N_A = \sum_{\gamma : \gamma \sim \beta,\, \gamma \in A} m_{BA} N_B + \sum_{\gamma : \gamma \sim \beta,\, \gamma \in B} m_{BB} N_B \tag{5.1}$$

by definition (2.14) where the coalescence rate is $q_A = 1/N_A$. A sufficient condition for conservative migration is that

$$m_{AB} N_A = m_{BA} N_B. \tag{5.2}$$

This condition preserves the symmetry as much as possible because the number of migrants from $\alpha \in A$ to $\beta \in B$ is equal to the number of migrants from $\beta$ to $\alpha$, i.e., migration is balanced across every edge. Therefore, given the deme sizes $N_A$ and $N_B$, we let $m_{AB} = 1/N_A$, $m_{BA} = 1/N_B$. [Or more generally, we can let $m_{AB} = m_C/N_A$, $m_{BA} = m_C/N_B$ for a given between-group rate $m_C$.]
In the first example, bigger demes exchange the same number of migrants as smaller demes. To achieve this, we set $N_B = 5N_A$, $m_{BB} = m_{AA}/5$ and thus $m_{AA} N_A = m_{BB} N_B$. All demographic parameters are scaled by the coalescent timescale $N_0$, so the effective migration rate of both $A - A$ and $B - B$ edges is approximately $m_{AA} N_A/N_0 = 1/N_0$. That is, differences in population size are canceled by differences in migration rate. Consequently, we expect the migration surface to be uniform, and indeed, this is what we observe in Figure 5.5 a).
In the second example, bigger demes exchange more migrants. To achieve this, we set $N_B = 5N_A$, $m_{BB} = m_{AA}$ and thus $m_{BB} N_B > m_{AA} N_A$. Since migration rates are relative to deme sizes, at the same migration rate bigger demes exchange a higher number of migrants, which results in higher effective migration. Therefore, coalescence times between $B$ demes [on the same side of the barrier but not across it] are shorter than coalescence times between $A$ demes. Genealogies with such topology are consistent with higher migration at equal coalescence rates because lineages that transition more often between demes have fewer chances to coalesce. Consequently, we expect a barrier to effective migration, and indeed, this is what we observe in Figure 5.5 b).
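The bookkeeping behind the two examples can be written out as a short sketch. The numerical values (unit size for the small demes, unit within-group rate, unit between-group rate) are assumptions chosen for illustration, not the values passed to ms.

```python
def across_rates(N_A, N_B, m_C=1.0):
    """Between-group rates chosen so that m_AB * N_A equals m_BA * N_B."""
    return m_C / N_A, m_C / N_B

N_A = 1.0                 # hypothetical size of the smaller demes
N_B = 5.0 * N_A           # bigger demes are five times larger
m_AA = 1.0                # hypothetical rate on A - A edges

m_AB, m_BA = across_rates(N_A, N_B)

# First example: the lower rate on B - B edges cancels the larger deme size.
m_BB_first = m_AA / 5.0
migrants_first = (m_AA * N_A, m_BB_first * N_B)

# Second example: equal rates, so bigger demes exchange more migrants.
m_BB_second = m_AA
migrants_second = (m_AA * N_A, m_BB_second * N_B)
```

In the first pair the two migrant counts agree (uniform effective migration); in the second the $B - B$ edges carry five times as many migrants, which appears as a relative barrier among the $A$ demes.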
5.3 Spatial structure due to a split event
e final sequence of examples produces a barrier effect from a past demographic eventthat removes edges in the graph and thus splits the habitat into two regions that nolonger exchange migrants.
To describe the simulations, consider the example graph with two groups of demes,π΄ (circles) and π΅ (squares), with the same deme size. [e demes in the middle, πΆ, arepart of the population but we collect no samples from that area which remains ''unob-served''.] ere are also two types of edges: the solid ones have constant migration rateπ; the dashed ones have migration rate 0 for π₯ units of time (measured in π0 genera-tions) and migration rate π from then on. Since Kingman's coalescent develops back-wards in time, this setup simulates a recent barrier to migration from the present topoint π₯ in the past. Beyond time π₯ the population graph is connected and has uniformmigration rate π.
In Figure 5.6 we increase the time of the split event from $x = 0.3$ to $x = 9$ units of time. If the split is too recent on the relative scale of the other parameters, its effect is hard to detect and the effective migration surface is uniform. Otherwise, the split is detected as a barrier to effective migration. [The truth is a temporary barrier; the method infers a constant barrier.] In these simulations, an equilibrium phase of high migration followed by a recent interval of no migration produces genealogies that are dominated by a long branch between the common ancestor of $A$ lineages and that of $B$ lineages. Such topology is consistent with constant migration at low rates across the area separating $A$ and $B$.
5.4 The effect of SNP ascertainment
In this example we simulate the effect of ascertainment bias due to a very small discovery panel (Figure 5.4). In this case there is a true barrier to migration but the discovery panel comes from a very limited area on one side of the barrier. This skews the observed genealogies as we observe only sites that are polymorphic in both the ascertainment sample (in red) and the analysis sample (in black). This example shows that ascertainment, which is not an evolutionary process, can have an effect on the inferred effective migration, especially if the discovery panel is not representative of the population.
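The ascertainment step described here amounts to a filter on sites. The following is a minimal sketch, assuming genotypes are coded as counts of one allele (0, 1 or 2) with `None` for missing data; the data layout and function names are hypothetical and the toy genotypes are made up for illustration.

```python
def is_polymorphic(genotypes):
    """True if both alleles are present among the non-missing genotypes.

    Genotypes are counts of the alternative allele (0, 1 or 2); None is missing.
    """
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return False
    alt = sum(observed)
    return 0 < alt < 2 * len(observed)

def ascertained_sites(panel, sample):
    """Indices of sites polymorphic in both the discovery panel and the sample.

    panel and sample are lists of individuals; each individual is a list of
    genotypes, one per site.
    """
    n_sites = len(sample[0])
    keep = []
    for j in range(n_sites):
        panel_col = [ind[j] for ind in panel]
        sample_col = [ind[j] for ind in sample]
        if is_polymorphic(panel_col) and is_polymorphic(sample_col):
            keep.append(j)
    return keep

# Toy data: 2 panel individuals, 3 sampled individuals, 4 sites.
panel = [[0, 1, 0, 2],
         [0, 0, 0, 2]]
sample = [[0, 1, 1, 2],
          [1, 0, 2, 2],
          [2, 2, 0, 1]]
print(ascertained_sites(panel, sample))
```

With a panel this small, most sites that vary in the analysis sample are fixed in the panel and are dropped, which is the mechanism by which a narrow discovery panel distorts the observed genealogies.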
Figure 5.4: Barrier to migration with ascertainment bias. a) True migration pattern and the discovery panel in red; b) Estimated effective migration and the sample in black. [Color scale: log10(m), from -1.8 to 1.8.]
Figure 5.5: Population structure due to differences in deme size. In a) bigger demes exchange migrants at a lower rate and hence there is no variation in effective migration. In b) smaller demes exchange fewer migrants and hence there is an effective barrier to migration. [Color scale in both panels: log10(m), from -1.8 to 1.8.]
Figure 5.6: A past demographic event results in a barrier to effective migration. Here an ancestral population splits into subpopulations $A$ (east) and $B$ (west) at point $x$ in the past. The further back in time this event occurs, the more differentiated $A$ and $B$ are. a) $x = 0.3$; b) $x = 3$; c) $x = 9$ units of time, measured in $N_0$ generations. [Color scale in all panels: log10(m), from -1.8 to 1.8.]
5.5 The effect of uneven sampling on PCA projection and effective migration
It is well known that PCA projections are heavily influenced by irregular sampling [McVean, 2009]. To examine the impact of sample composition on effective migration, we simulate genetic data under the same barrier pattern as in Figure 5.2 but with various sampling schemes. We compare our method of estimating effective migration and PCA analysis of the observed covariance matrix in Figure 5.7. Even if sampling is biased towards one area of the habitat, the presence and location of the barrier are correctly detected as long as there are observations on both sides. On the other hand, the overall pattern of the PCA projections changes considerably. [Our method can be sensitive to the placement and coarseness of the population grid.]
Figure 5.7: Barrier to migration with uneven sampling; colors indicate sampling location. The final example illustrates that, naturally, the method cannot detect variation from uniform migration in areas where no genetic data is observed.
5.6 Summary
The simulations in this chapter illustrate that effective migration can represent the combined effect of various demographic processes and events on genetic similarity and that our method is robust to uneven sampling but not to ascertainment bias. However, effective migration does not help us to distinguish among possible histories, as in this framework population structure is always explained with a stepping-stone model on a fixed grid of equally sized demes.
The examples also underline why it is difficult to interpret effective migration in terms of actual evolutionary history. As [McVean, 2009] emphasizes, very different demographic processes can produce very similar average genealogies. Our method, just like PCA, uses the information contained in pairwise comparisons averaged across sites and discards the sequential information contained in the ordering of sites along chromosomes, which can be helpful in selecting between possible histories.
6 Empirical Results
In this chapter we apply our method to four empirical datasets [three consist of SNPs and one of microsatellites] and we further demonstrate that effective migration rates can explain the spatial structure in genetic variation.
6.1 Red-backed fairywrens in Australia
First we present results for a sample of red-backed fairywrens (Malurus melanocephalus), a small passerine bird endemic to Australia [Figure 6.1]. The RBFW dataset was collected to study its population structure and demographic history across the Carpentarian barrier. Sampling and genotyping procedures as well as cluster-based analysis of population structure are described in [Lee and Edwards, 2008].
Figure 6.1: Habitat of the red-backed fairywren (Malurus melanocephalus), with the Carpentarian barrier (the black bar), in northern and eastern Australia. The map shows the ranges of two subspecies: M. m. cruentatus in yellow, M. m. melanocephalus in pink, and a broad hybrid zone in orange. The map also shows three major biogeographic regions: Top End (TE), Cape York (CY) and Eastern Forest (EF). The map is modified from [Lee and Edwards, 2008].
The Carpentarian barrier in northern Australia is a semi-arid region, roughly 150 km wide and extremely poor in vegetation [Lee and Edwards, 2008]. It has been argued that this region has had a primary effect on species distribution in northern and eastern Australia by acting as a barrier to migration, with secondary barriers along the coast. [Lee and Edwards, 2008] choose to study the demographic structure of the red-backed fairywren because its taxonomy, which is based mainly on plumage color, is not consistent with the Carpentarian hypothesis. The species has been traditionally categorized into two subspecies but their ranges do not lie on either side of the Carpentarian barrier, as we would expect if it has been the major barrier contributing to their divergence.
The dataset was made available to us by S. Edwards. After initial data processing, the RBFW dataset consists of $n = 27$ diploid individuals genotyped at $p = 1190$ bi-allelic, polymorphic SNPs. [Relative to the original data, we remove 3 out of 30 individuals because most of their genotypes are missing, and we also exclude monomorphic and tri-allelic SNPs.]
Throughout we will refer to three subpopulations of red-backed fairywrens as identified according to location in [Lee and Edwards, 2008]: Top End (TE) in northern Australia to the west of the Carpentarian barrier, Cape York (CY) in northeastern Australia to the east of the Carpentarian barrier and including the hybrid zone, and Eastern Forest (EF) in eastern Australia to the south of the hybrid zone [Figure 6.1].
Figure 6.2: PCA and STRUCTURE analysis of the red-backed fairywren (RBFW) data. [Groups shown: Top End (TE), Cape York (CY), Eastern Forest (EF).]
First we perform principal components analysis (PCA) and cluster-based analysis (STRUCTURE). In Figure 6.2 (left) we plot the leading two principal components of the genetic covariance matrix, which explain 55% and 10% of the variance, respectively. PCA detects population structure but the results are difficult to interpret in terms of the evolutionary history of the species: there is some differentiation between the three subpopulations [in particular, Top End (TE) is better separated from Cape York (CY) than Eastern Forest (EF)] but there are no clearly delineated clusters. Although the three biogeographic groups are about equally represented, the sample is very small and much of the observed variation is between individuals within groups.
In Figure 6.2 (right) we report STRUCTURE results with two and three clusters, using the sampling locations as prior information [Pritchard et al., 2000, Hubisz et al., 2009]. As observed in [Lee and Edwards, 2008], if we use STRUCTURE to assign the samples into two distinct clusters, Cape York (CY) and Eastern Forest (EF) are grouped together, which possibly indicates that the Carpentarian barrier has played a role in shaping the genetic differentiation of the red-backed fairywren. When we use STRUCTURE to assign the samples into three distinct clusters, CY and EF individuals are estimated to be admixtures (with different proportions) of two "ancestral" populations. This suggests migration across the hybrid zone.
In our analysis the data has a slightly higher likelihood with three clusters rather than two as in [Lee and Edwards, 2008].
While both STRUCTURE plots might be interpreted to provide support for the Carpentarian hypothesis, STRUCTURE does not model the geographic distribution of samples across the habitat and thus cannot account for isolation by distance. In a homogeneous habitat, where the population is spatially distributed with uniform migration, genetic differentiation tends to increase with geographic distance. The RBFW data exhibits the isolation by distance property, at least at short to medium distances. The relationship between geographic and genetic distances appears to plateau as the Euclidean distance increases [Figure 6.3].
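A distance scatterplot of this kind is built from all pairwise comparisons among the sampled individuals. The sketch below shows the bookkeeping only: the genetic dissimilarity used here (mean squared difference in allele counts) is an assumed stand-in, not necessarily the measure used in the thesis, and the coordinates and genotypes are toy values.

```python
import itertools
import math

def euclidean(p, q):
    """Straight-line distance between two (x, y) locations."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def genetic_dissimilarity(g1, g2):
    """Mean squared difference in allele counts; an assumed stand-in metric."""
    return sum((a - b) ** 2 for a, b in zip(g1, g2)) / len(g1)

def distance_pairs(coords, genotypes):
    """Return (geographic distance, genetic dissimilarity) for every pair."""
    pairs = []
    for i, j in itertools.combinations(range(len(coords)), 2):
        pairs.append((euclidean(coords[i], coords[j]),
                      genetic_dissimilarity(genotypes[i], genotypes[j])))
    return pairs

# 27 individuals give 27-choose-2 pairwise comparisons.
print(math.comb(27, 2))

# Toy example with three individuals on a line and four SNPs.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
genotypes = [[0, 0, 1, 2], [0, 1, 1, 2], [2, 1, 0, 2]]
for geo, gen in distance_pairs(coords, genotypes):
    print(geo, gen)
```

Under isolation by distance the second coordinate tends to grow with the first; a plateau at large distances shows up as a flattening of the point cloud.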
Cluster-based methods, such as STRUCTURE [Pritchard et al., 2000] and GENELAND [Guillot et al., 2005], attempt to find sharp boundaries between clusters, to maximize similarity within versus between clusters, in terms of allele frequency distributions. [These methods can assign individuals to multiple clusters according to individual-specific fractional membership, but again the differences between clusters must be sharp in order to estimate these proportions with certainty.] Therefore, cluster-based methods are better suited to analyzing discrete population structure. However, genetic variation can exhibit continuous structure as genetic similarities tend to decay with distance and the decay can be gradual rather than sharp, as in Figure 6.3. In this case STRUCTURE effectively separates those individuals that are farthest apart in space, as Top End (TE) and Eastern Forest (EF) are assigned to different clusters.
Figure 6.3: Observed genetic distance vs. Euclidean distance for the $\binom{n}{2} = 351$ pairs in the RBFW dataset. Each point is colored according to group membership (within-group pairs: CY,CY; EF,EF; TE,TE; between-group pairs: CY,EF; TE,CY; TE,EF) to emphasize that on average Cape York (CY) is closer to Eastern Forest (EF) than to Top End (TE).
The spatial structure of genetic variation in the RBFW data is continuous and therefore it can be partially explained with isolation by distance. However, since the Carpentarian barrier may reduce gene flow between the TE and CY groups, we estimate the patterns of migration rather than assume genetic differentiation increases as a function of the Euclidean distance between sampling locations [or equivalently, migration is uniform].
Figure 6.4: Irregular triangular grid $(V, E)$ spanning the habitat of the red-backed fairywren. Samples are assigned to the closest deme. The method allows sampling to be both sparse and uneven. If the geographic information is coarse, it is appropriate to choose a coarse grid. [Sample labels shown on the grid: 1-3; 4-6; 7; 8,9; 10,11; 12-14; 15,16; 17,18; 19,20; 21,22; 26; 27; 28,29; 30.]
To apply our method for estimating effective migration rates, we first construct an irregular triangular grid $(V, E)$ to cover the known range of the red-backed fairywren [Figure 6.4]. After running the MCMC chain from multiple random starting points to monitor convergence and averaging the runs, we report the posterior mean of the effective migration rates $m = (m_{\alpha\beta})$ in Figure 6.5, on the log base 10 scale, with low migration in blue and high migration in brown. For this small dataset, it is computationally feasible to use the coalescent-based distance matrix (i.e., the expected coalescence times $T_{\alpha\beta}$) as well as its approximation in terms of effective resistances $R_{\alpha\beta}$. The two metrics produce very similar posterior estimates of the effective migration surface, shown in Figure 6.5 a) and b), respectively.
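Assigning each sample to the closest deme of the grid is a nearest-neighbor lookup. A minimal sketch follows, assuming planar coordinates and plain Euclidean distance (for real longitude/latitude data a great-circle distance may be more appropriate); the function name and the toy coordinates are hypothetical.

```python
import math

def assign_to_nearest_deme(sample_coords, deme_coords):
    """Map each sample to the index of the closest deme (Euclidean distance)."""
    assignment = []
    for sx, sy in sample_coords:
        nearest = min(range(len(deme_coords)),
                      key=lambda k: math.hypot(sx - deme_coords[k][0],
                                               sy - deme_coords[k][1]))
        assignment.append(nearest)
    return assignment

# Toy grid with three demes and four sampled individuals.
demes = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
samples = [(0.1, 0.1), (0.9, -0.2), (0.5, 0.8), (0.2, 0.0)]
print(assign_to_nearest_deme(samples, demes))
```

Several samples can share a deme and some demes can remain empty, which is how the grid accommodates sparse and uneven sampling.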
Figure 6.5: Estimated relative rates of effective migration for the RBFW dataset using two distance metrics on the graph: a) expected coalescence time $T = (T_{\alpha\beta})$, panel labeled $\Delta = T$; b) effective resistance $R = (R_{\alpha\beta})$, panel labeled $\Delta = R$. [Map labels: TE, CY, EF. Color scale: relative effective migration rate on the log10 scale, from -1.8 to 1.8.]
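To give a sense of the second metric, effective resistances on a small graph can be computed from the weighted graph Laplacian. This is a minimal sketch, assuming each edge's migration rate plays the role of a conductance in a resistor network; the exact scaling that relates resistances to coalescence times is defined earlier in the thesis and is not reproduced here. The function name and the toy graphs are illustrative.

```python
def effective_resistance(n_nodes, edges, a, b):
    """Effective resistance between nodes a and b (a != b, connected graph).

    edges maps (i, j) pairs to conductances (here: assumed migration rates).
    """
    # Weighted graph Laplacian.
    L = [[0.0] * n_nodes for _ in range(n_nodes)]
    for (i, j), w in edges.items():
        L[i][i] += w
        L[j][j] += w
        L[i][j] -= w
        L[j][i] -= w
    # Ground node b (drop its row and column) and inject unit current at a.
    keep = [k for k in range(n_nodes) if k != b]
    A = [[L[r][c] for c in keep] + [1.0 if r == a else 0.0] for r in keep]
    size = len(keep)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(size):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    ia = keep.index(a)
    # The voltage at a equals the effective resistance for a unit current.
    return A[ia][size] / A[ia][ia]

# Three demes in a row: uniform rates versus a low-rate edge between demes 1 and 2.
uniform = {(0, 1): 1.0, (1, 2): 1.0}
barrier = {(0, 1): 1.0, (1, 2): 0.1}
print(effective_resistance(3, uniform, 0, 2))
print(effective_resistance(3, barrier, 0, 2))
```

Lowering the rate on a single edge increases the resistance between demes on opposite sides of it, which is the sense in which a barrier lengthens effective distance.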
6.1.1 What is the effect of the Carpentarian barrier on genetic differentiation?
Since we plot relative migration rates, a completely white migration surface would correspond to uniform migration; the colors indicate deviations from the expectation under uniform migration.
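One way to read the color scale is as log10 rates centered on their overall mean. The sketch below illustrates that centering; exactly which rates are averaged (edges, tiles, or posterior draws) is an assumption here, as are the function name and the toy values.

```python
import math

def relative_log10_rates(rates):
    """log10 rates centered on their mean log rate; 0 means 'at the overall mean'."""
    logs = [math.log10(r) for r in rates]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

print(relative_log10_rates([1.0, 1.0, 1.0]))   # uniform migration
print(relative_log10_rates([0.1, 1.0, 10.0]))  # lower, average and higher rates
```

Under uniform migration every centered value is zero (a white surface); negative values mark lower-than-average effective migration and positive values higher-than-average effective migration.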
For the RBFW dataset, the most interesting feature is the area of lower effective migration that roughly covers the Cape York (CY) biogeographic region and the Carpentarian barrier. This result is consistent with the hypothesis that the Carpentarian barrier affects genetic differentiation. It is also consistent with the hypothesis that the CY group has a slightly lower effective population size (similar to the simulations in Section 5.2). Furthermore, CY is relatively less similar to TE than it is to EF, as CY and TE are separated by a longer effective distance [i.e., darker blue color]. Although this can also be inferred from the PCA and STRUCTURE analysis, the effective migration plot combines information about genetic dissimilarities and geographic distances and thus is an intuitive representation of spatial patterns in genetic variation.
Finally, we show draws from the posterior distribution of effective migration [Figure 6.6]. Although in most instances the region of the Carpentarian barrier falls inside a tile of lower effective migration [relative to the rest of the habitat], there is a lot of variability in the location, shape and rate of the "barrier". One possible explanation is that the Carpentarian barrier does not have a strong effect on the genetic structure of this species. However, the RBFW dataset is small and it is also possible that our method cannot detect a strong barrier effect with certainty.
Figure 6.6: Draws from the posterior distribution of effective migration in red-backed fairywrens.
6.2 Forest and savanna elephants in Africa
Here we present results for a dataset of African elephants sampled throughout the range of the species in Sub-Saharan Africa. The sample includes both forest elephants (Loxodonta africana cyclotis) and savanna elephants (Loxodonta africana africana). Both subspecies are under threat, partly from poaching, and the data were collected to help assign contraband tusks to their location of origin [Wasser et al., 2004, 2007].
There is observational and genetic evidence that forest and savanna elephants hybridize in the areas where their ranges meet [Wasser et al., 2004]. Therefore, we remove putative hybrids so that the dataset we analyze consists of 223 forest elephants and 896 savanna elephants genotyped at 16 microsatellites. These genetic markers were chosen in part because they can be isolated and amplified in samples of low quality and thus microsatellite DNA can be extracted from a small piece of tusk [Wasser et al., 2004].
Figure 6.7: Irregular triangular grid $(V, E)$ spanning the habitat of the African elephant. The map shows five regions as identified in [Wasser et al., 2004]. The west and central regions comprise the range of the forest elephant. The north, east and south regions comprise the range of the savanna elephant. Samples are assigned to the closest deme in the grid.
[Wasser et al., 2004] show that forest and savanna elephants can be accurately discriminated. This is also evident in the PC scatterplot of the sample covariance matrix [Figure 6.8], where the leading principal component separates forest (West, Central) and savanna (North, East, South) and explains 29% of the observed genetic variation. PCA analysis also indicates that there is more genetic diversity in forest than in savanna elephants and suggests no further population structure within the two subspecies. However, the sample configuration is very uneven, with about 4 times as many savanna as forest elephants, so the PCA results might be biased [Section 5.5].
We applied our method to the data provided by [Wasser et al., 2004]. The results confirm that forest and savanna elephants are genetically differentiated enough to distinguish between the two subspecies. In Figure 6.9 we observe a prominent barrier in effective migration that curves through the habitat to separate the west and central regions (the range of forest elephants) from the north, east and south regions (the range of savanna elephants).
Our model estimates migration rates to explain the overall sample structure. However, each genotyped site has its own genealogy and thus observed genetic distances vary across sites. With microsatellites, mutation rates are higher, and since more mutations mean more information about relative branch lengths, we can also fit the model at each microsatellite separately [Figure 6.10]. There is great variability in effective migration across microsatellites, and the pattern of effective migration at the sixth locus produces most of the overall pattern, except for the relationship between the west and central regions (essentially, the relationship among forest elephants). This suggests that elephants can be categorized with high accuracy as forest or savanna based on just this one microsatellite.
Figure 6.8: PCA analysis of the forest and savanna elephants (FS) data.
Figure 6.9: Effective migration rates for forest and savanna elephants (a) using all 16 microsatellites and (b) excluding the most variable locus. [Color scale: effective migration rate on the log10 scale, from -1.8 to 1.8.]
We also split the sample into only forest and only savanna elephants to explore subtler population structure in each subspecies. The genetic variation in forest elephants is consistent with isolation by resistance with a very small bridge of higher effective migration between the west and central regions. The genetic variation in savanna elephants deviates from isolation by distance considerably: the central region is separated from the rest with a barrier of low effective migration while the south and east regions are more genetically similar than the large area they span would suggest.
6.2.1 STRUCTURE and GENELAND results
As a clustering method, GENELAND [Guillot et al., 2005] looks for distinct clusters and therefore sharp boundaries between them. Removing putative hybrids makes differences in allele frequencies between biogeographic regions easy to detect [Figure 6.12]. However, GENELAND does not explain the relationship between the regions: they are all distinct from each other.
Figure 6.10: Effective migration rates at each of sixteen microsatellites. The 6th locus is most variable, presumably due to the highest mutation rate.
Figure 6.11: Effective migration rates for (a) only forest elephants and (b) only savanna elephants, using the same triangular grid as in Figure 6.7 and all 16 microsatellites. [Color scale: effective migration rate on the log10 scale, from -1.8 to 1.8.]
On the other hand, STRUCTURE [Pritchard et al., 2000] with sampling location prior [Hubisz et al., 2009] provides intuition for the relationship between the five biogeographic regions [Figure 6.13]. It clearly detects the difference between forest elephants (west, central) and savanna elephants (north, east, south) as they fall into different clusters. Furthermore, STRUCTURE shows some evidence for isolation by distance, particularly in savanna elephants. Most of these individuals are represented as weighted mixtures of four clusters that do not correspond to distinct geographic areas.
Figure 6.12: GENELAND posterior probabilities for belonging in each of five clusters, which correspond directly to the five biogeographic regions.
Figure 6.13: STRUCTURE membership proportions in six clusters when using sampling locations (not regions) to provide prior information for cluster assignments.
Figure 6.14: Scatterplots of genetic dissimilarities versus distances on the population graph with a) uniform migration and b) estimated effective migration. [Rows: both savanna and forest elephants (without hybrids); only forest elephants; only savanna elephants.]
6.3 Human populations from Europe and Africa
The genetic structure of human populations has been extensively studied since [Menozzi et al., 1978] first used PCA to summarize human genetic variation across continents. Here we analyze two large-scale genome-wide datasets. The European dataset is part of the POPRES collection [Nelson et al., 2008] and consists of 1387 individuals genotyped at 197,146 autosomal SNPs. Most samples were collected in Western Europe, so we analyze a subset of 1208 individuals from 15 countries. The Sub-Saharan African dataset consists of 314 individuals from 21 ethnic groups genotyped at 27,922 autosomal SNPs.
[Novembre et al., 2008, Lao et al., 2008] use PCA analysis to characterize the spatial structure of genetic variation within Europe and find a close correspondence between genetic and geographic distances, and hence, evidence for isolation by distance. In fact, [Novembre et al., 2008] show that the two leading principal components are strongly correlated with latitude and longitude, respectively. [Wang et al., 2012] use their Procrustes method to analyze the population structure within Africa [as well as within Europe, Asia and world-wide] and also observe similarity between genetic and geographic maps, after excluding hunter-gatherer populations.
The PCA plots reveal that the human population structure in Europe and Africa is continuous: while individuals from the same group tend to cluster together, the overall arrangement qualitatively resembles the configuration of sampling locations [Figure 6.15]. The correspondence between the PCA projections and the geographic map can be improved with a rotation transformation such as Procrustes [Wang et al., 2010] but this cannot improve the PCA analysis itself (for example, by correcting for biased sampling).
Figure 6.15: Sample configuration and PCA analysis of the European and African datasets. In the bottom row font size indicates relative sample size.
Figure 6.15 also illustrates the limitations of the sampling scheme. In both cases it is biased but, more importantly, geographic locations are implied, as individuals from the same populations are automatically assigned to the same coordinates. [For the European dataset, population membership is determined based on grandparents' country of origin or self-reported country of birth [Novembre et al., 2008].] This is clearly not representative of human spatial distribution in either Europe or Africa and, furthermore, the geographic information might be too coarse to detect substructure within populations. [On the other hand, the stepping-stone model is discrete. Since observations are assigned to the nearest deme, the grid itself implies a limit on how much geographic resolution our method can represent.]
As [Novembre et al., 2008, Wang et al., 2012] have shown, the spatial structure of human genetic variation in both Europe and Africa exhibits broad isolation by distance as genetic differentiation tends to increase gradually with geographic distance [Figure 6.16; top row]. However, while geography explains some patterns of genetic differentiation, a homogeneous habitat (i.e., uniform migration) might not provide the best explanation for the observed data.
We applied our method to estimate the effective human migration in both Europe and Africa and plotted genetic differentiation against the inferred effective distances [Figure 6.16; bottom row]. The linear relationship between genetic dissimilarity and effective distance is stronger [$r^2$ increases from 33% to 85% for the European data, and from 24% to 91% for the African data]. However, since there are so many pairwise comparisons in the scatterplots, it is more instructive to analyze the inferred effective migration surface [Figure 6.17].
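The $r^2$ values quoted here measure how well a straight line relates genetic dissimilarity to distance on the graph. A minimal sketch of the computation as a squared Pearson correlation follows; the numbers are toy values chosen for illustration and are not taken from either dataset.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Toy values, not the thesis data.
dissimilarity = [0.10, 0.20, 0.30, 0.40]
uniform_distance = [1.0, 3.0, 2.0, 4.0]
effective_distance = [1.0, 2.0, 3.0, 4.0]
print(r_squared(uniform_distance, dissimilarity))
print(r_squared(effective_distance, dissimilarity))
```

A higher value for the effective distances than for the uniform-migration distances indicates that the estimated rates account for more of the pairwise variation.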
Figure 6.16: Genetic differentiation (linearized $F_{ST}$) versus resistance distance ($R_{\alpha\beta}$) with either uniform migration rates or estimated effective migration rates on the population graph $(V, E)$. The colors, which match those in Figure 6.17, are chosen to emphasize the difference between the populations in red and those in blue.
Figure 6.17: Inferred effective migration surfaces for a) the West European and b) the Sub-Saharan African datasets. [Color scale: effective migration rate on the log10 scale, from -1.8 to 1.8.]
From the inferred effective migration surface in Europe [Figure 6.17 a)] we can make several observations that are difficult to make from the PCA plot or the distance scatterplot. For example, the northern countries (Ireland, the UK, Scotland, Denmark, the Netherlands, Germany; in blue) are more genetically similar than we would expect based on the geographic distance alone. The same is true for the three southern countries (Portugal, Spain and Italy; in red). On the other hand, a barrier to effective migration separates the British Isles and the Iberian peninsula, and another barrier separates Italy and France [roughly where the Alps are]. However, we cannot conclude that the effect is due only to lower migration rates across bodies of water or mountain ranges. The observed patterns can also be influenced by differences in effective population size and other evolutionary processes. Finally, the inferred migration surface also suggests that there is more differentiation in the north/south direction than in the east/west direction, as the north and the south are separated by two areas of lower effective migration. This result is consistent with the hypothesis that a north/south cline is a distinguishing feature of population structure within Europe [Tian et al., 2008].
We can also make interesting observations from the inferred effective migration surface in Africa [Figure 6.17 b)]. There is higher effective migration along the Atlantic coast than in the interior of the continent, and therefore, inland populations (in blue) are more genetically dissimilar than coastal populations (in red). Consequently, there is more differentiation in the east/west direction than in the north/south direction. This pattern can be observed in the PCA plot, as noted by [Wang et al., 2012], where populations along the coast cluster closer together, inland populations form more isolated clusters and the E/W-associated principal component explains twice as much variation as the N/S-associated one. On the other hand, the four Bantu speaking groups at the southern tip cluster together in the PCA plot but not in the effective migration surface. However, this might be the result of lower geographic resolution in that region: Pedi and Sotho/Tswana are assigned to the same deme, and similarly, Nguni and Xhosa are assigned to another deme.
We use the program GENEPOP [Rousset, 2008] to compute pairwise $F_{ST}$s. The first pair has lower genetic differentiation than the second [$F_{ST}$(Pedi, Sotho/Tswana) = 0.0012, $F_{ST}$(Nguni, Xhosa) = 0.0019].
These patterns are present, to some extent, in the PCA plot and the distance scatterplot. But they are easy to observe only if we categorize the locations and color the points appropriately. In contrast, the pattern is clear after the analysis of effective migration.
6.4 Arabidopsis thaliana in Europe and North America
Arabidopsis thaliana is a small flowering plant and a commonly studied model organism in population genetics. It has a broad natural range (Europe, Asia and North Africa) and now grows in North America as well. Although A. thaliana is a selfing plant with low gene flow, its genetic variation has significant spatial structure [Nordborg et al., 2005, Platt et al., 2010]. On the continental scale, in Europe there is broad isolation by distance with an east-west gradient that has been interpreted as evidence for post-glaciation colonization [Nordborg et al., 2005]. On the other hand, in North America there is genome-wide linkage disequilibrium and haplotype sharing that have been interpreted as evidence for recent human introduction from Europe [Nordborg et al., 2005].
A large geographically referenced dataset is available from the Regional Mapping (RegMap) project [Horton et al., 2012]. We split the full dataset (1193 accessions, ~220,000 SNPs) into two subsets, North America (180 plants) and Europe (823 plants), which we analyze both separately and together. [We exclude plants from Asia because the continent is very sparsely sampled.]
Figure 6.18: Sample configuration and PCA analysis of Arabidopsis thaliana data from the RegMap project; a) North America, b) Europe.
First we perform principal components analysis [Figure 6.18]. As we would expect if A. thaliana has a different history in Europe and North America, there are differences in genetic variation on the two continents [Nordborg et al., 2005]. There is little population structure in the North American subset: the samples are separated to some extent in a north/south direction, with no obvious separation between samples from around Lake Michigan and those from the Atlantic coast despite the geographic distance. On the other hand, the population structure in the European subset is continuous, with some correspondence between genetic variation and geographic distribution, as we would expect under isolation by distance [Platt et al., 2010].
Next we apply our method to estimate the effective migration for A. thaliana in North America and Europe [Figure 6.19]. In North America, the two sampled regions (Lake Michigan and the Atlantic coast) are connected by a strip of high effective migration. This indicates that the regions are similar genetically even though they are far apart in space. [This is the opposite of what we expect under isolation by distance.] There is an area of higher effective migration at the south tip of Lake Michigan where most of the North American samples are collected. Therefore, our results are consistent with the observation that there is extensive haplotype sharing (which indicates identity by descent) not only within but also between sampling locations [Nordborg et al., 2005].
Figure 6.19: Inferred effective migration surfaces for Arabidopsis thaliana from two RegMap subsets; a) North America; b) Europe; c) North America and Europe combined.
In continental Europe, the overall pattern is broad isolation by distance, with small variability in effective migration rates. On the other hand, the north of the British Isles is separated from the rest of Britain, which in turn is connected to northern France by an area of high migration. Our results are consistent with previous studies of the population structure of A. thaliana. [Platt et al., 2010] find that in Eurasia there is a strong trend of isolation by distance (at three distance scales) while in North America there is no relationship between geographic distance and allelic similarity (except at fine distance scale). And [Horton et al., 2012] observe that in the PCA plot most accessions from the British Isles are projected closest to France but some plants from Britain cluster with lines from Sweden. Our method summarizes and visualizes these patterns.
Figure 6.20: Genetic differentiation (linearized $F_{ST}$) versus resistance distance ($R_{\alpha\beta}$) with either uniform migration rates or estimated effective migration rates on the population graph $(V, E)$. The samples are assigned to a regular triangular grid [because the designation of nations as populations is not relevant] and the colors are chosen to emphasize interesting patterns [that correspond to regions with strong deviation from isolation by distance].
In the combined analysis, the overall pattern is a prominent corridor of high effective migration across the Atlantic ocean. While effective migration rates are symmetric, this supports the hypothesis that recent directed migration introduced A. thaliana from Europe to the New World. On this larger scale, effective migration rates within North America and Europe are low [as we would expect in a plant species with low gene flow].
[Platt et al., 2010] consider two models of the continuous population structure of A. thaliana (a model with constant uniform migration and another with uniform migration and a single shift in dispersal rate) and conclude that neither version is a good fit for the observed patterns of genetic dissimilarities. That is, even though the overall pattern is consistent with isolation by distance, there might be deviations from uniform migration (or too much noise). We can observe such complex details in the effective migration surface, e.g., the British Isles in Figure 6.19. When we combine the two samples the strongest signal in the data is the genetic similarity between the two continents even though they are separated by the expanse of the Atlantic ocean. This illustrates how we can observe finer details at smaller scales because effective migration rates are parametrized relative to the overall mean log rate.
6.5 Conclusion
Genetic variation in natural populations often exhibits spatial structure as genetic similarity tends to decay with geographic distance. However, this relationship is often not homogeneous and the distribution of similarities across the habitat contains information about the evolutionary and ecological history of the population.
Visualization is an important tool for detecting and understanding patterns of population structure. We have developed a model-based method for estimating and visualizing effective migration to explain observed deviations from homogeneous dispersal and isolation by distance. It represents the population as a triangular grid of discrete components and its effective migration as a colored partition of a two-dimensional map.
Our method is particularly useful for characterizing continuous population structure (even though the underlying stepping-stone model is discrete) because it models a spatially distributed population instead of a collection of distinct and isolated subpopulations. Neither is it necessary to categorize samples into regional groups to compute measures of differentiation such as $F_{ST}$. [Nevertheless, assigning colors and names to groups of genetically similar individuals, in areas of high effective migration, can be helpful for subsequent analysis.] If spatial structure is continuous, clustering samples into biogeographic regions might not be a well-defined problem. In this scenario cluster-based methods such as STRUCTURE and GENELAND may be inappropriate.
Our method also offers some advantages over PCA analysis. PCA produces two-dimensional visual summaries of observed genetic variation and can capture both continuous and discrete population structure at the sample level. In contrast, our method produces a visual representation of geographic and genetic information at the population level. Consequently, it is easier to make qualitative comparisons between populations or between geographic regions, in terms of both geographic and genetic distances. And our method can detect deviations from uniform migration (and hence, isolation by distance) that are not clearly evident in PC projections because PCA is strongly affected by sampling bias and does not estimate relevant demographic parameters.
7
Appendices
7.1 Properties of the stepping-stone model of population subdivision
The goal of this section is to derive the system of linear equations (2.15) for the expected pairwise coalescence times $T = (T_{\alpha\beta})$ as a function of the migration rates $M = (m_{\alpha\beta})$ and the coalescence rates $q = (q_\alpha)$ in the stepping-stone model.
7.1.1 Probabilities of identity by descent
In population genetics, the probability of identity is a measure of relatedness due to shared ancestry. The concept of identity can be defined as either the event that the lineages have the same ancestor in a reference population at a specified time in the past or the event that no mutations have occurred since the lineages diverged from their most recent common ancestor. We use the second definition of identity known as identity by state [versus identity by descent].
Let $\phi_{\alpha\beta}(\theta)$ be the probability of identity by descent for two distinct lineages drawn at random from demes $\alpha$ and $\beta$. The parameter $\theta = 2N_0u$ is the mutation rate per $2N_0$ generations for a single lineage, or equivalently, the total mutation rate per $N_0$ generations for a pair of lineages.
To derive expressions for $\phi_{\alpha\beta}(\theta)$ for every pair $(\alpha, \beta)$, consider the history of a sample of size 2 backwards in time. Let $x(t) = \{x_\alpha(t)\}$ be the state of the ancestral process $t$ generations ago when the sample has $x_\alpha(t)$ ancestors in deme $\alpha$. It is convenient to consider time in units of $N_0$ generations. On this timescale and under certain assumptions about reproduction and migration, the discrete-time ancestral process $\{x(t) : t = 0, 1, \ldots\}$ converges to a continuous-time ancestral process $\{x(t) : t \ge 0\}$, called the structured coalescent [Notohara, 1990, 1993]. Mutations are generated by a Poisson process with intensity $\theta$ such that in $t$ units of time a lineage accumulates $K \sim \mathrm{Po}(\theta t)$ mutations.
To derive the probability of identity for the pair $(\alpha, \beta)$, consider the first event that results in a change of state. The initial state is $x(0) = \{x_\alpha(0) = 1, x_\beta(0) = 1, x_\gamma(0) = 0 : \gamma \ne \alpha, \gamma \ne \beta\}$. If the two lineages are drawn from the same deme, i.e., $\alpha = \beta$, the first event can be a coalescence with rate $q_\alpha$, a mutation with rate $\theta$, or a migration to deme $\gamma$ with rate $2m_{\alpha\gamma}$. [Since the process starts with two lineages in $\alpha$ and the migration rate from $\alpha$ to $\gamma$ is $m_{\alpha\gamma}$ for a single lineage, the total rate of movement is $2m_{\alpha\gamma}$. Similarly, the combined mutation rate is $\theta$.] If a mutation occurs, the lineages are no longer identical by descent. Therefore, under equilibrium,
$$\phi_{\alpha\alpha}(\theta) = \frac{q_\alpha}{\theta + q_\alpha + 2m_\alpha} + \sum_{\gamma : \gamma \ne \alpha} \frac{2m_{\alpha\gamma}}{\theta + q_\alpha + 2m_\alpha}\,\phi_{\alpha\gamma}(\theta), \tag{7.1}$$
where $m_\alpha = \sum_{\gamma : \gamma \ne \alpha} m_{\alpha\gamma}$ is the total rate of migration out of $\alpha$. More precisely, since the coalescent proceeds backwards in time, $m_{\alpha\gamma}$ is the rate at which offspring in $\alpha$ have parents from $\gamma$ and $m_\alpha$ is the total rate of ''outside-deme'' parentage.
When the two lineages are drawn from two different demes, i.e., $\alpha \ne \beta$, they cannot coalesce in a single step. In this case, the first event can be a mutation with rate $\theta$, a migration from deme $\alpha$ to deme $\gamma$ with rate $m_{\alpha\gamma}$, or a migration from deme $\beta$ to deme $\gamma$ with rate $m_{\beta\gamma}$. Under equilibrium,
$$\phi_{\alpha\beta}(\theta) = \sum_{\gamma : \gamma \ne \alpha} \frac{m_{\alpha\gamma}}{\theta + m_\alpha + m_\beta}\,\phi_{\gamma\beta}(\theta) + \sum_{\gamma : \gamma \ne \beta} \frac{m_{\beta\gamma}}{\theta + m_\alpha + m_\beta}\,\phi_{\alpha\gamma}(\theta). \tag{7.2}$$
Equations (7.1) and (7.2) represent a system of linear equations for the probabilities of identity by descent in terms of the mutation rate $\theta$, the coalescence rates $q_\alpha$ and the migration rates $m_{\alpha\beta}$. In matrix notation,
$$\mathrm{diag}\{q\}\,[\mathrm{diag}\{\Phi\} - I] = [M - (\theta/2)I]\,\Phi + \Phi\,[M - (\theta/2)I]'. \tag{7.3}$$
Here $M = (m_{\alpha\beta})$ is the infinitesimal generator of the migration process with diagonal entries $-m_\alpha = -\sum_{\gamma : \gamma \ne \alpha} m_{\alpha\gamma}$, $\Phi \equiv \Phi(\theta) = (\phi_{\alpha\beta}(\theta))$ is the matrix of probabilities of identity at fixed mutation rate $\theta$, and $q = (q_\alpha)$ is the vector of coalescence rates.
7.1.2 Expected pairwise coalescence times
A linear system for the expected pairwise coalescence times can be derived correspondingly. Since by definition $\phi$ is the probability that no mutation occurs in either lineage before coalescence at time $t$,
$$\phi(\theta) = P\{K = 0\} = E\{e^{-\theta t}\}. \tag{7.4}$$
That is, the probability of identity $\phi$ is the Laplace transform of the coalescence time $t$ [Hudson, 1990]. Therefore,
$$E\{t\} = -\phi'(0) \quad \text{where} \quad \phi' = \frac{\partial}{\partial\theta}\phi. \tag{7.5}$$
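As a quick sanity check of equation (7.5), consider a single isolated deme, where equation (7.1) without migration reduces to $\phi(\theta) = q/(\theta + q)$, the Laplace transform of an exponential coalescence time with rate $q$. The following Python sketch differentiates this function numerically at $\theta = 0$; the function names and the step size are illustrative assumptions rather than part of the derivation.

```python
def identity_probability(theta, q):
    """phi(theta) = q / (theta + q): probability of identity for two lineages
    in one isolated deme with coalescence rate q (equation (7.1), no migration)."""
    return q / (theta + q)


def expected_coalescence_time(q, step=1e-5):
    """Approximate E{t} = -phi'(0) with a central finite difference."""
    slope = (identity_probability(step, q) - identity_probability(-step, q)) / (2 * step)
    return -slope


# For an exponential waiting time with rate q the mean is 1/q.
print(expected_coalescence_time(2.0))
```

The printed value should be close to $1/q = 0.5$, which is the mean of an exponential coalescence time with rate $q = 2$.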
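To obtain a system for the expected coalescence times, differentiate equations (7.1) and (7.2) with respect to the mutation rate $\theta$ and evaluate at $\theta = 0$. The result is
$$1 = (q_\alpha + 2m_\alpha)\,T_{\alpha\alpha} - \sum_{\gamma : \gamma \ne \alpha} 2m_{\alpha\gamma}T_{\alpha\gamma} \quad \text{and} \tag{7.6a}$$
$$1 = (m_\alpha + m_\beta)\,T_{\alpha\beta} - \sum_{\gamma : \gamma \ne \alpha} m_{\alpha\gamma}T_{\gamma\beta} - \sum_{\gamma : \gamma \ne \beta} m_{\beta\gamma}T_{\alpha\gamma}. \tag{7.6b}$$
Equivalently, in matrix notation,
$$\mathrm{diag}\{q\}\,\mathrm{diag}\{T\} - MT - TM' = \mathbf{1}\mathbf{1}'. \tag{7.7}$$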
[Here $\mathbf{1}\mathbf{1}'$ is a $d \times d$ matrix of 1s.] This method for deriving equations (7.6a) and (7.6b) is developed in [Bahlo and Griffiths, 2001]. Alternatively, [Hey, 1991] constructs a Markov chain with $d(d + 1)/2$ non-absorbing states for each unique pair $(\alpha, \beta)$. The set of states includes $d$ homoallelic states, when the two lineages are in the same deme, and $d(d - 1)/2$ heteroallelic states, when two lineages are in different demes. There is also an absorbing state, which corresponds to coalescence. Transition probabilities between all these states reflect the migration rates $m_{\alpha\beta}$ and coalescence rates $q_\alpha$.
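Equations (7.6a) and (7.6b) can be solved directly for a small population graph. The sketch below is a minimal pure-Python illustration, not the implementation used in the thesis: the helper names `solve` and `coalescence_times` are hypothetical, and the linear system is solved by plain Gauss-Jordan elimination, which is only practical for a handful of demes.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= f * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def coalescence_times(m, q):
    """Expected pairwise coalescence times from equations (7.6a) and (7.6b).

    m[a][b] is the migration rate from deme a to deme b (the diagonal is
    ignored) and q[a] is the coalescence rate in deme a.
    """
    d = len(q)
    total = [sum(m[a][g] for g in range(d) if g != a) for a in range(d)]

    def idx(a, b):
        return a * d + b  # position of the unknown T[a][b]

    system = [[0.0] * (d * d) for _ in range(d * d)]
    for a in range(d):
        for b in range(d):
            row = system[idx(a, b)]
            if a == b:  # equation (7.6a)
                row[idx(a, a)] += q[a] + 2 * total[a]
                for g in range(d):
                    if g != a:
                        row[idx(a, g)] -= 2 * m[a][g]
            else:  # equation (7.6b)
                row[idx(a, b)] += total[a] + total[b]
                for g in range(d):
                    if g != a:
                        row[idx(g, b)] -= m[a][g]
                    if g != b:
                        row[idx(a, g)] -= m[b][g]
    t = solve(system, [1.0] * (d * d))
    return [[t[idx(a, b)] for b in range(d)] for a in range(d)]


# Two demes of equal size exchanging migrants at rate 0.5.
times = coalescence_times([[0.0, 0.5], [0.5, 0.0]], [1.0, 1.0])
```

For this two-deme example the within-deme time is 2 and the between-deme time is 3, so the within-deme value equals the number of demes regardless of the migration rate.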
Furthermore, since the population evolves under equilibrium, migration is conservative and $M'q^{-1} = 0$ by definition, where $q^{-1}$ denotes the vector of reciprocals $1/q_\alpha$. If we multiply equation (7.7) by $(q^{-1})'$ on the left and by $q^{-1}$ on the right, we obtain
$$\mathbf{1}'\,\mathrm{diag}\{T\}\,q^{-1} = (\mathbf{1}'q^{-1})^2 \iff \sum_\alpha T_{\alpha\alpha}(N_\alpha/N_0) = \Big(\sum_\alpha N_\alpha/N_0\Big)^2,$$
$$T_0 \equiv \sum_\alpha T_{\alpha\alpha}(N_\alpha/N_T) = N_T/N_0 = d \tag{7.8}$$
where $T_0$ is the [weighted] average within-deme coalescence time, $N_T = \sum_\alpha N_\alpha$ is the total population size and $N_0 = N_T/d$ is the coalescent timescale. Therefore, under conservative migration, the average within-deme coalescence time does not depend on the exact pattern and rates of migration [Strobeck, 1987]. If migration is isotropic, which is a much stronger assumption, the within-deme coalescence times $T_{\alpha\alpha}$ for all demes $\alpha$ do not depend on the migration process.
7.2 Distance matrices
Here we discuss distance matrices, also called dissimilarity matrices, and review some relevant properties. Two examples of a distance matrix are the matrix of expected pairwise coalescence times, $T$, and the matrix of effective resistance distances, $R$.
First we state two equivalent definitions of a distance matrix.
Definition 7.1 The matrix $D = (d^2_{ij})$ is a distance matrix if there exist squared lengths $\ell = (\ell^2_i) \in \mathbb{R}^n_+$ such that
$$\ell\mathbf{1}' + \mathbf{1}\ell' - D \succeq 0. \tag{7.9}$$
Definition 7.2 The matrix $D = (d^2_{ij})$ is a distance matrix if there exist pairwise similarities $S = (s_{ij}) \in \mathcal{S}^n_+$ such that
$$d^2_{ij} = s_{ii} + s_{jj} - 2s_{ij}. \tag{7.10}$$
[$\mathcal{S}^n$ is the set of symmetric $n \times n$ matrices; $\mathcal{S}^n_+$ is the set of symmetric $n \times n$ matrices with nonnegative elements.]
Let $X = (x_1, \ldots, x_n)' \in \mathbb{R}^{n \times p}$ represent $n$ points in $p$-dimensional inner product space. For example, in the setting of analyzing population structure, $x_i$ is a genotype vector of $p$ polymorphic sites. Then the squared distance between points $i$ and $j$ is given by
$$d^2_{ij} = \langle x_i - x_j, x_i - x_j \rangle = \langle x_i, x_i \rangle + \langle x_j, x_j \rangle - 2\langle x_i, x_j \rangle \equiv \ell^2_i + \ell^2_j - 2s_{ij}, \tag{7.11}$$
where $s_{ij} = \langle x_i, x_j \rangle$ is the inner product between two vectors in $\mathbb{R}^p$, and $S = XX'$ is positive semidefinite as a Gram matrix. In matrix notation,
$$D = \mathrm{diag}\{S\}\mathbf{1}' + \mathbf{1}\,\mathrm{diag}\{S\}' - 2S. \tag{7.12}$$
Clearly the similarity matrix $S$ contains more information about $X$ than the dissimilarity matrix $D$: $\mathrm{diag}\{S\} = \ell$ while $\mathrm{diag}\{D\} = 0$. [$D$ is nonnegative with 0s on the main diagonal because the dissimilarity of a point with itself is trivially 0.] That is, $S$ captures the absolute position of each point in the space (the length $\ell_i$ is the distance to the center $O$) while $D$ reflects only the relative difference for each pair of points.
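The difference between similarities and distances can be seen on a toy example. The Python sketch below (function and variable names are illustrative assumptions) builds the Gram matrix of three points in the plane, converts it to squared distances with equation (7.10), and repeats the computation after translating all points by the same vector.

```python
def gram(points):
    """Similarity matrix S = X X' of inner products between the points."""
    return [[sum(a * b for a, b in zip(x, y)) for y in points] for x in points]


def distances(similarity):
    """Squared distances d_ij^2 = s_ii + s_jj - 2 s_ij, as in equation (7.10)."""
    n = len(similarity)
    return [[similarity[i][i] + similarity[j][j] - 2 * similarity[i][j]
             for j in range(n)] for i in range(n)]


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)]
shifted = [(x + 1.0, y - 2.0) for x, y in points]  # same shape, different origin

d_original = distances(gram(points))
d_shifted = distances(gram(shifted))
```

Here `gram(points)` and `gram(shifted)` differ because the lengths are measured from a different origin, whereas `d_original` and `d_shifted` are identical: the distance matrix records only relative differences.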
Theorem 7.1 A symmetric matrix $D$ with zero diagonal is a distance matrix if and only if it is conditionally negative definite.
Sketch of proof.
• Suppose that $D$ is a distance matrix. For every vector $\alpha \in \mathbb{R}^n$ such that $\mathbf{1}'\alpha = 0$ (that is, $\alpha$ is a contrast)
$$0 \le \alpha'(\ell\mathbf{1}' + \mathbf{1}\ell' - D)\alpha \tag{7.13a}$$
$$= \alpha'\ell(\mathbf{1}'\alpha) + (\alpha'\mathbf{1})\ell'\alpha - \alpha'D\alpha = -\alpha'D\alpha. \tag{7.13b}$$
Therefore, $D$ is conditionally negative definite.
• Suppose that $D$ is conditionally negative definite. Choose a vector $w \in \mathbb{R}^n$ such that $\mathbf{1}'w = 1$ and define
$$P = I - \mathbf{1}w', \tag{7.14a}$$
$$S = -\tfrac{1}{2}PDP' = -\tfrac{1}{2}(I - \mathbf{1}w')D(I - w\mathbf{1}'). \tag{7.14b}$$
Then $P$ is a centering matrix such that
$$PP = I - \mathbf{1}w' - \mathbf{1}w' + \mathbf{1}(w'\mathbf{1})w' = I - \mathbf{1}w' = P, \tag{7.15a}$$
$$w'Px = w'x - (w'\mathbf{1})w'x = w'x - w'x = 0 \tag{7.15b}$$
for every $x \in \mathbb{R}^n$. That is, $P$ is an orthogonal projection onto the hyperplane $\{w\}^\perp$. Furthermore, $(I - w\mathbf{1}')x$ is a contrast [that is, $\mathbf{1}'(I - w\mathbf{1}')x = 0$] and
$$x'(I - \mathbf{1}w')D(I - w\mathbf{1}')x = -2x'Sx \le 0 \tag{7.16}$$
since $D$ is conditionally negative definite. Therefore, $S$ is a positive semidefinite matrix and it has a decomposition $S = YY'$. It is straightforward to show that
$$D = \mathrm{diag}\{-\tfrac{1}{2}PDP'\}\mathbf{1}' + \mathbf{1}\,\mathrm{diag}\{-\tfrac{1}{2}PDP'\}' + PDP'. \tag{7.17}$$
[Using equation (7.12) the $(i, j)$-th element is
$$-\tfrac{1}{2}e_i'(PDP')e_i - \tfrac{1}{2}e_j'(PDP')e_j + e_i'(PDP')e_j = -\tfrac{1}{2}[w'Dw - w'De_i - e_i'Dw] - \tfrac{1}{2}[w'Dw - w'De_j - e_j'Dw] + w'Dw - w'De_j - e_i'Dw + D_{ij} = D_{ij}$$
where $D_{ii} = 0$ and $e_i$ is the $i$-th standard basis vector.]
That is, the vectors $Y = (y_1, \ldots, y_n)'$ generate the distance matrix $D$. However, note that the similarity matrix $S$ depends on the choice of $w$. It is not surprising that $D$ does not determine $Y$ (nor $S$) uniquely since it contains information only about relative differences. [The vector $w$ determines the position of the origin $O$. The condition $\mathbf{1}'w = 1$ implies that $w$ is a vector of weights.] For example, $w = \mathbf{1}/n$ corresponds to placing the origin at the centroid (the center of mass) $\mathbf{1}'Y/n = \bar{y}$ and $w = e_i$ corresponds to placing it at the $i$th point $e_i'Y = y_i$. Different decompositions $S = YY'$ give different orientations about the origin $O$.
□
Now we consider the special case when the lengths $\ell_i$ are all equal to $r$ for some $r > 0$ and thus the points $x_i$ are the same distance from the center $O$. [A covariance matrix is a circumhypersphere with radius $\sigma^2$ and a correlation matrix, with radius 1.] Geometrically, the points lie on the circumference of a sphere with radius $r$ in $\mathbb{R}^p$ [Gower, 1985] and so $D$ is called a spherical distance metric. This puts a constraint on the choice of $w$. In general,
$$\mathrm{diag}\{S\} = \tfrac{1}{2}\,\mathrm{diag}\{\mathbf{1}w'D + Dw\mathbf{1}' - \mathbf{1}w'Dw\mathbf{1}' - D\} \tag{7.18a}$$
$$= Dw - \tfrac{1}{2}(w'Dw)\mathbf{1}. \tag{7.18b}$$
[Here we use $\mathrm{diag}\{D\} = 0$ and $\mathrm{diag}\{w\mathbf{1}'\} = w$.]
If $\ell = r^2\mathbf{1}$, then $\mathrm{diag}\{S\} = \ell = r^2\mathbf{1}$. Therefore,
$$Dw - \tfrac{1}{2}(w'Dw)\mathbf{1} = r^2\mathbf{1}. \tag{7.19}$$
If $D$ is nonsingular, $D^{-1}$ exists and we can left-multiply by $D^{-1}$. Then
$$w = (\tfrac{1}{2}w'Dw + r^2)\,D^{-1}\mathbf{1}. \tag{7.20}$$
Recall that $w$ satisfies $w'\mathbf{1} = 1$, so that
$$\mathbf{1}'w = (\tfrac{1}{2}w'Dw + r^2)\,\mathbf{1}'D^{-1}\mathbf{1} = 1. \tag{7.21}$$
This implies $\mathbf{1}'D^{-1}\mathbf{1} \ne 0$ and
$$w = \frac{D^{-1}\mathbf{1}}{\mathbf{1}'D^{-1}\mathbf{1}}, \qquad r^2 = \frac{1/2}{\mathbf{1}'D^{-1}\mathbf{1}}. \tag{7.22}$$
[In this case $P = I - \mathbf{1}\mathbf{1}'D^{-1}/\mathbf{1}'D^{-1}\mathbf{1}$ is the orthogonal projection onto the hyperplane $\{D^{-1}\mathbf{1}\}^\perp$.]
Thus we have proved the following
Corollary 7.1 Suppose that $D$ is a distance matrix such that $\det\{D\} \ne 0$. Then $\mathbf{1}'D^{-1}\mathbf{1} > 0$.
7.3 Conditional definite matrices
Here we derive a sufficient condition for positive definiteness of the covariance matrix $\Sigma = \mathbf{1}\mathbf{1}' - k\Delta$ where $\Delta$ is a distance matrix [or more generally, a conditionally negative definite matrix] and $k$ is a positive constant. The derivation is based on [Bapat and Raghavan, 1997] and uses the Spectral Theorem which states that $\Sigma \succ 0$ if and only if all its eigenvalues are positive.
Definition 7.3 A matrix $\Delta \in \mathcal{S}^n$ is conditionally negative definite if
$$\alpha'\Delta\alpha \le 0 \tag{7.23}$$
for all $\alpha \in \mathbb{R}^n$ such that $\sum_i \alpha_i = \alpha'\mathbf{1} = 0$. Thus, conditional negative definiteness is equivalent to negative definiteness on the subspace $\{\mathbf{1}\}^\perp$.
Theorem 7.2 A conditionally negative definite (c.n.d.) matrix $\Delta \in \mathcal{S}^n$ has at most one positive eigenvalue.
Sketch of proof. We consider the c.n.d. case. Suppose to the contrary that $\Delta$ has two positive eigenvalues $u_1 > 0$ and $u_2 > 0$ with corresponding eigenvectors $x$ and $y$. Without loss of generality, we can assume that the eigenvectors are normalized so that $\sum_i x_i = \sum_i y_i \iff \sum_i (x_i - y_i) = 0$. That is, $(x - y)'\mathbf{1} = 0$ and $x - y$ is a contrast. Furthermore,
$$(x - y)'\Delta(x - y) = x'\Delta x + y'\Delta y - y'\Delta x - x'\Delta y \tag{7.24a}$$
$$= u_1x'x + u_2y'y - u_1y'x - u_2x'y \tag{7.24b}$$
$$= u_1x'x + u_2y'y > 0 \tag{7.24c}$$
since $u_1, u_2 > 0$ and the eigenvectors are orthogonal [$x \perp y \iff x'y = 0$]. This contradicts the definition of conditionally negative definite matrices. □
A distance matrix $\Delta$ is nonnegative, with main diagonal of 0s, and is also conditionally negative definite by Theorem 7.1.
Corollary 7.2 Suppose that $\Delta$ is a nonnegative, nonzero symmetric matrix. Then $\Delta$ has at least one positive eigenvalue.
Sketch of proof. Since $\Delta$ is symmetric, by the Spectral Theorem it has real eigenvalues $u = \{u_i\}$. Furthermore, $\mathrm{tr}\{\Delta\} = \sum_{i=1}^n u_i \ge 0$. The trace of $\Delta$ is nonnegative because $\Delta$ is nonnegative; its eigenvalues are not all zero because $\Delta$ is nonzero. Since $\sum_i u_i \ge 0$, at least one of the eigenvalues is positive. □
Corollary 7.3 Suppose that $\Delta$ is a nonnegative, nonzero, conditionally negative definite matrix. Then $\Delta$ has exactly one positive eigenvalue.
Sketch of proof. By Theorem 7.2 $\Delta$ has at most one positive eigenvalue while by Corollary 7.2 it has at least one positive eigenvalue. Therefore, it has exactly one positive eigenvalue. If $\Delta$ is strictly c.n.d., its other $n - 1$ eigenvalues are negative. □
So far we know that $\Delta$ is both conditionally negative definite and nonnegative, and therefore, it has exactly one positive eigenvalue. On the other hand, $\Sigma = \mathbf{1}\mathbf{1}' - k\Delta$ is conditionally positive definite: for all $\alpha$ such that $\alpha'\mathbf{1} = 0$, we have
$$\alpha'\Sigma\alpha = \alpha'(\mathbf{1}\mathbf{1}' - k\Delta)\alpha = (\alpha'\mathbf{1})^2 - k(\alpha'\Delta\alpha) \ge 0. \tag{7.25}$$
Therefore, $\Sigma$ has at most one negative eigenvalue. Finally, by the matrix-determinant lemma for a rank-one update,
$$\prod_{i=1}^n u^*_i = \det\{\Sigma\} = \Big(1 - \frac{\mathbf{1}'\Delta^{-1}\mathbf{1}}{k}\Big)\det\{-k\Delta\} = \Big(1 - \frac{\mathbf{1}'\Delta^{-1}\mathbf{1}}{k}\Big)\prod_{i=1}^n (-k)u_i, \tag{7.26}$$
where $u = \{u_i\}$ are the eigenvalues of $\Delta$ and $u^* = \{u^*_i\}$ are the eigenvalues of $\Sigma$.
To ensure that the $u^*_i$s are positive, we use the fact that the product on the left in equation (7.26) has at most one negative term while the product on the right has exactly one negative term. Therefore, a necessary and sufficient condition for $\Sigma \succeq 0$ is that $k$ satisfies
$$1 - \frac{\mathbf{1}'\Delta^{-1}\mathbf{1}}{k} \le 0. \tag{7.27}$$
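The condition (7.27) can be checked numerically on a small example. In the Python sketch below, `k` stands for the positive constant that multiplies the distance matrix, the distance matrix holds the squared Euclidean distances between the points (0, 0), (1, 0) and (0, 1), and positive definiteness is tested by attempting a Cholesky factorization. The helper names are illustrative assumptions and the code is a sketch, not the thesis implementation.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= f * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ones_quadratic_form(delta):
    """Return 1' Delta^{-1} 1, the upper bound on k implied by (7.27)."""
    return sum(solve(delta, [1.0] * len(delta)))


def covariance(delta, k):
    """Build Sigma = 11' - k Delta."""
    n = len(delta)
    return [[1.0 - k * delta[i][j] for j in range(n)] for i in range(n)]


def is_positive_definite(a, tol=1e-12):
    """Attempt a Cholesky factorization; it succeeds only for Sigma > 0."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][c] * low[j][c] for c in range(j))
            if i == j:
                if s <= tol:
                    return False
                low[i][i] = s ** 0.5
            else:
                low[i][j] = s / low[j][j]
    return True


delta = [[0.0, 1.0, 1.0], [1.0, 0.0, 2.0], [1.0, 2.0, 0.0]]
bound = ones_quadratic_form(delta)
below = is_positive_definite(covariance(delta, 0.5 * bound))
above = is_positive_definite(covariance(delta, 1.5 * bound))
```

For this distance matrix the bound equals 1, so the covariance is positive definite for `k = 0.5` and fails the test for `k = 1.5`.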
7.4 Restricted maximum likelihood (REML) in the general case
Consider the model $Y \sim N(X\beta, \Sigma)$ with design matrix $X$ and covariance matrix $\Sigma$. Let $K \in \mathbb{R}^{n \times p}$ be a basis for the mean space and $L \in \mathbb{R}^{(n-p) \times n}$ be a basis for the residual space. For example, $K = X$ if the design matrix has full rank, or otherwise, $K$ is $p$ linearly independent columns of $X$. By construction $LK = 0$ and $\ker\{L\} = \mathrm{span}\{K\}$. Also let $Q[\Sigma]$ be the unique orthogonal projection with kernel $K$ given by
$$Q = I - K(K'\Sigma^{-1}K)^{-1}K'\Sigma^{-1}. \tag{7.28}$$
[McCullagh, 2009] shows that regardless of the choice for $L$, $Q$ has an equivalent characterization given by
$$Q = \Sigma L'(L\Sigma L')^{-1}L. \tag{7.29}$$
To prove this, it is sufficient to show that
• $\Sigma L'(L\Sigma L')^{-1}L$ is a projection: $QQ = (\Sigma L'(L\Sigma L')^{-1}L)(\Sigma L'(L\Sigma L')^{-1}L) = \Sigma L'(L\Sigma L')^{-1}L = Q$
• $\Sigma L'(L\Sigma L')^{-1}L$ is self-adjoint with respect to the inner product $\langle u, v \rangle = u'\Sigma^{-1}v$: $Q'\Sigma^{-1} = (\Sigma L'(L\Sigma L')^{-1}L)'\Sigma^{-1} = L'(L\Sigma L')^{-1}L = \Sigma^{-1}(\Sigma L'(L\Sigma L')^{-1}L) = \Sigma^{-1}Q$
• $\ker\{\Sigma L'(L\Sigma L')^{-1}L\} = \mathrm{span}\{K\}$: $QK = (\Sigma L'(L\Sigma L')^{-1}L)K = \Sigma L'(L\Sigma L')^{-1}(LK) = 0$
The orthogonal projection with kernel $K$ (i.e., the orthogonal projection onto the residual space) is unique, so (7.28) = (7.29).
To rewrite the Wishart log-likelihood in equation (4.13), we derive the following expressions involving $\Sigma$, $Q$, $L$ and $K$.
$$\det\{L\Sigma L'\}^{-1}\det\{LL'\} = \det\{(L\Sigma L')^{-1}(LL')\} = \mathrm{Det}\{L'(L\Sigma L')^{-1}L\} = \mathrm{Det}\{\Sigma^{-1}\Sigma L'(L\Sigma L')^{-1}L\} = \mathrm{Det}\{\Sigma^{-1}Q\} \tag{7.30}$$
where the standard determinant, denoted by det, is the product of all eigenvalues and the generalized determinant, denoted by Det, is the product of the nonzero eigenvalues. The first equality holds because $(L\Sigma L')^{-1}LL'$ is $(n-p) \times (n-p)$ with $n - p$ nonzero eigenvalues and $L'(L\Sigma L')^{-1}L$ is $n \times n$ with $n - p$ nonzero eigenvalues and the two matrices have the same nonzero eigenvalues:
If $(L\Sigma L')^{-1}LL'u = \lambda u$, then $L'(L\Sigma L')^{-1}L(L'u) = \lambda(L'u)$.
If $L'(L\Sigma L')^{-1}Lv = \lambda v$, then $LL'(L\Sigma L')^{-1}Lv = \lambda Lv$, so $(L\Sigma L')^{-1}Lv = \lambda(LL')^{-1}Lv$, and therefore $(L\Sigma L')^{-1}(LL')(LL')^{-1}Lv = \lambda(LL')^{-1}Lv$.
Similarly,
$$\det\{K'\Sigma^{-1}K\}^{-1}\det\{K'K\} = \mathrm{Det}\{\Sigma(I - Q)'\}. \tag{7.31}$$
Following [Verbyla, 1990], let $A = [K, L']$. Using both characterizations of the projection $Q$ and the formula for the determinant of a block matrix,
$$\det\{A'\Sigma A\} = \det\begin{pmatrix} K'\Sigma K & K'\Sigma L' \\ L\Sigma K & L\Sigma L' \end{pmatrix}$$
$$= \det\{L\Sigma L'\}\det\{K'\Sigma K - K'\Sigma L'(L\Sigma L')^{-1}L\Sigma K\}$$
$$= \det\{L\Sigma L'\}\det\{K'[I - \Sigma L'(L\Sigma L')^{-1}L]\Sigma K\}$$
$$= \det\{L\Sigma L'\}\det\{K'[I - Q]\Sigma K\}$$
$$= \det\{L\Sigma L'\}\det\{K'[K(K'\Sigma^{-1}K)^{-1}K'\Sigma^{-1}]\Sigma K\}$$
$$= \det\{L\Sigma L'\}\det\{K'K(K'\Sigma^{-1}K)^{-1}K'K\}$$
$$= \det\{L\Sigma L'\}\det\{K'K\}\det\{K'\Sigma^{-1}K\}^{-1}\det\{K'K\};$$
$$\det\{A'A\} = \det\begin{pmatrix} K'K & K'L' \\ LK & LL' \end{pmatrix} = \det\begin{pmatrix} K'K & 0 \\ 0 & LL' \end{pmatrix} = \det\{K'K\}\det\{LL'\}.$$
Since $A$ is full-rank by construction,
$$\det\{\Sigma\} = \frac{\det\{A'\Sigma A\}}{\det\{A'A\}} = \frac{\det\{K'K\}\det\{L\Sigma L'\}}{\det\{LL'\}\det\{K'\Sigma^{-1}K\}}. \tag{7.32}$$
Finally, by applying first (7.30) and then (7.31),
$$\det\{\Sigma\} = \det\{K'K\}\,[\det\{K'\Sigma^{-1}K\}\,\mathrm{Det}\{\Sigma^{-1}Q\}]^{-1} \tag{7.33a}$$
$$= \det\{LL'\}^{-1}\det\{L\Sigma L'\}\,\mathrm{Det}\{\Sigma(I - Q)'\}. \tag{7.33b}$$
7.4.1 Restricted maximum likelihood (REML) in a special case
Rather than a general covariance matrix $\Sigma$, our model for population structure in terms of distances on a population graph specifies
$$\Sigma = \mathbf{1}\mathbf{1}' - k\Delta, \tag{7.34}$$
where $\Delta$ is a conditionally negative definite matrix such that $\mathbf{1}'\Delta^{-1}\mathbf{1} = 1$. The normalization simplifies notation and defines equivalence classes $\{\Delta_* : (\mathbf{1}'\Delta_*^{-1}\mathbf{1})\Delta_* = \Delta\}$. It is also convenient because under this parametrization $k \in (0, 1)$ is a sufficient condition for $\Sigma \succ 0$, as we show in Appendix 7.3.
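Because the covariance matrix has the form (7.34) we can avoid explicitly constructing $\Sigma$ and instead work with $\Delta$. Using the Sherman-Morrison formula for the inverse of a rank-one update,
$$\Sigma^{-1} = -\frac{1}{k}\Big(\Delta^{-1} - \frac{\Delta^{-1}\mathbf{1}\mathbf{1}'\Delta^{-1}}{1 - k}\Big) \tag{7.35a}$$
$$\Sigma^{-1}\mathbf{1} = -\frac{1}{k}\Big(\Delta^{-1}\mathbf{1} - \frac{\Delta^{-1}\mathbf{1}}{1 - k}\Big) = \frac{1}{1 - k}\Delta^{-1}\mathbf{1} \tag{7.35b}$$
$$\mathbf{1}'\Sigma^{-1}\mathbf{1} = \frac{1}{1 - k} \tag{7.35c}$$
The orthogonal projection $Q = Q[\Sigma]$ with kernel $K = \mathbf{1}$ is given by
$$Q = I - \frac{\mathbf{1}\mathbf{1}'\Sigma^{-1}}{\mathbf{1}'\Sigma^{-1}\mathbf{1}} = I - \mathbf{1}\mathbf{1}'\Delta^{-1} \tag{7.36a}$$
$$\Sigma^{-1}Q = -\frac{1}{k}\Delta^{-1}\Big(I - \frac{1}{1 - k}\mathbf{1}\mathbf{1}'\Delta^{-1}\Big)Q = -\frac{1}{k}\Delta^{-1}Q \tag{7.36b}$$
The projection matrix $Q$ is not symmetric in general but for every $k$,
$$Q'\Sigma^{-1} = Q'\Sigma^{-1}Q = \Sigma^{-1}Q \quad \text{and} \quad Q'\Delta^{-1} = Q'\Delta^{-1}Q = \Delta^{-1}Q. \tag{7.37}$$
That is, $\Sigma^{-1}Q$ and $\Delta^{-1}Q$ are symmetric.
Now let's express $\mathrm{Det}\{\Sigma^{-1}Q\}$ as a function of $\mathrm{Det}\{\Delta^{-1}Q\}$ where the generalized determinant Det is the product of the nonzero eigenvalues. Since
$$\Delta^{-1}Qv = \alpha v \iff \Sigma^{-1}Qv = -\frac{1}{k}(\alpha v), \tag{7.38}$$
the two generalized eigenvalue problems are equivalent up to a proportionality constant and
$$\mathrm{Det}\{\Sigma^{-1}Q\} = \Big(\frac{1}{k}\Big)^{n-1}\mathrm{Det}\{-\Delta^{-1}Q\} = \mathrm{Det}\{-\Delta^{-1}Q/k\}, \tag{7.39}$$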
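where $\mathrm{rank}\{\Sigma^{-1}Q\} = \mathrm{rank}\{-\Delta^{-1}Q\} = n - 1$. Finally, we apply equation (7.32) with $\Sigma = S$ and $K = \mathbf{1}$ to obtain
$$\det\{LSL'\} = \det\{S\}\det\{\mathbf{1}'S^{-1}\mathbf{1}\}\,\frac{\det\{LL'\}}{\det\{\mathbf{1}'\mathbf{1}\}}, \tag{7.40}$$
and equation (7.30) with $\Sigma = \mathbf{1}\mathbf{1}' - k\Delta$ and $L\Sigma L' = -k(L\Delta L')$ to obtain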
det { β (πΏΞπΏβ²)β1/πβ} = Det {Ξ£β1π/π2}/ det {πΏπΏβ²}= Det { β Ξβ1π/πβ}/ det {πΏπΏβ²} (7.41a)
tr { β (πΏΞπΏβ²)β1πΏππΏβ²/πβ} = tr { β (πΏβ²(πΏΞπΏβ²)β1πΏ)π/πβ}= tr { β Ξβ1Ξ(πΏβ²(πΏΞπΏβ²)β1πΏ)π/πβ}= tr { β (Ξβ1π)π/πβ}. (7.41b)
7.5 Efficient computation
Let $R = (R_{\alpha\beta})$ be the matrix of effective resistances between pairs of observed demes $(\alpha, \beta)$ in the population graph $G = (V, E, M)$. From [McRae, 2006] we know that $T_{\alpha\alpha} \approx d$ and $T_{\alpha\beta} \approx d(1 + R_{\alpha\beta}/4)$ where $d$ is the number of demes in the population graph and $o$ is the number of observed demes. With this motivation, let
$$(\Delta_{\alpha\beta}) = d(\mathbf{1}_o\mathbf{1}_o' + (R_{\alpha\beta})/4 - I_o) \tag{7.42}$$
be the matrix of (expected) pairwise distances between observed demes.
If individuals are exchangeable within demes, we can model distances between individuals in terms of distances between demes. For a pair $(i \in \alpha, j \in \beta)$,
$$\Delta_{ij} = d(1 + R_{\alpha\beta}/4 - 1\{i = j\}). \tag{7.43}$$
Equivalently, in matrix notation,
$$(\Delta_{ij}) = d(\mathbf{1}_n\mathbf{1}_n' + JRJ'/4 - I_n) \tag{7.44}$$
where $J = (J_{i\alpha}) \in \mathbb{Z}^{n \times o}$ is an indicator matrix such that
$$J_{i\alpha} = \begin{cases} 1 & \text{if } i \in \alpha \\ 0 & \text{if } i \notin \alpha \end{cases}. \tag{7.45}$$
To simplify the notation, we will drop the subscripts and write plainly $\mathbf{1}$ for the vector of ones and $I$ for the identity matrix. The dimension will be clear from the context if we keep in mind that $R = (R_{\alpha\beta})$ is an $o \times o$ matrix and $\Delta = (\Delta_{ij})$ is an $n \times n$ matrix.
To evaluate the Wishart log-likelihood in equation (4.13) we need to compute the terms $\mathrm{tr}\{\Delta^{-1}QS\}$ and $\mathrm{Det}\{-\Delta^{-1}Q\}$ where
$$Q = I - \frac{\mathbf{1}\mathbf{1}'\Delta^{-1}}{\mathbf{1}'\Delta^{-1}\mathbf{1}} \tag{7.46}$$
is a projection matrix, which removes the common mean, and $S$ is the observed similarity matrix. We also standardize the distance matrix $\Delta$ so that $\mathbf{1}'\Delta^{-1}\mathbf{1} = 1$. With this normalization, multiplying $\Delta$ by a (positive) constant has no effect on the product $\Delta^{-1}Q$, so we can ignore the scale $d$ in equation (7.44).
The distance matrix $\Delta = \mathbf{1}\mathbf{1}' + JRJ'/4 - I$ has an ''almost-block'' structure, except for the diagonal of zeros: specifically, $\Delta = JBJ' - I$ where $B = R/4 + \mathbf{1}\mathbf{1}'$ is a known $o \times o$ matrix. [$B$ is a function of the migration rates.] The inverse $\Delta^{-1}$ is also an almost-block matrix:
$$\Delta^{-1} = JXJ' - I, \tag{7.47}$$
where $X$ is an unknown $o \times o$ matrix. Since $\Delta\Delta^{-1} = I$, the solution $X$ must satisfy
$$JBCXJ' - JBJ' - JXJ' + I = I, \tag{7.48a}$$
$$J(BC - I)XJ' = JBJ', \tag{7.48b}$$
where $C = J'J = \mathrm{diag}\{n_\alpha\}$ is the diagonal matrix of sample counts.
Since every term in equation (7.48b) has an exact block structure which depends on the sample configuration through $J$, it is sufficient to solve the lower-dimensional problem
$$(BC - I)X = B \iff (C - B^{-1})X = I. \tag{7.49}$$
This is a system of linear equations for the unknown $X$ in terms of the effective resistances $R$ and the counts $C$, and therefore, it can be solved efficiently without matrix inversions. The diagonal matrix $C$ is invertible because here we consider only observed demes, i.e., locations with at least one observation; the auxiliary matrix $B = R/4 + \mathbf{1}\mathbf{1}'$ is invertible because the matrix of effective resistances $R$ is invertible.
Once we solve for $X$, we could explicitly construct $\Delta^{-1}$ from $X$ according to equation (7.47). However, this is not necessary because we only need to compute $\mathrm{Det}\{-\Delta^{-1}Q\}$ and $\mathrm{tr}\{\Delta^{-1}QS\}$ where $S$ is the (average) observed similarity. Using the definition of the orthogonal projection $Q$ (equation 7.46) and the properties of the trace,
$$\mathrm{tr}\{\Delta^{-1}QS\} = \mathrm{tr}\{\Delta^{-1}S\} - \frac{1}{\mathbf{1}'\Delta^{-1}\mathbf{1}}\,\mathrm{tr}\{\mathbf{1}\mathbf{1}'\Delta^{-1}S\Delta^{-1}\}. \tag{7.50}$$
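We consider each of these terms in turn:
$$\mathbf{1}'\Delta^{-1}\mathbf{1} = \mathbf{1}'(JXJ' - I)\mathbf{1} = \mathrm{tr}\{X(J'\mathbf{1}\mathbf{1}'J)\} - n, \tag{7.51a}$$
$$\mathrm{tr}\{\Delta^{-1}S\} = \mathrm{tr}\{(JXJ' - I)S\} = \mathrm{tr}\{X(J'SJ)\} - \mathrm{tr}\{S\}, \tag{7.51b}$$
$$\mathrm{tr}\{\mathbf{1}\mathbf{1}'\Delta^{-1}S\Delta^{-1}\} = \mathbf{1}'CX(J'SJ)XC\mathbf{1} + \mathbf{1}'S\mathbf{1} - 2\,\mathrm{tr}\{X(J'S\mathbf{1}\mathbf{1}'J)\}. \tag{7.51c}$$
[For symmetric matrices, $\mathrm{tr}\{YT\} = \sum_{\alpha,\beta} Y_{\alpha\beta}T_{\alpha\beta}$. So the trace can be computed as sum(sum(Y.*T)).]
All the terms that do not depend on $X$ (highlighted in red in the original typesetting) are constants and can be precomputed and stored for easy access. The point is that there is no need to construct the $n \times n$ matrix $\Delta^{-1}$ in order to compute $\mathrm{tr}\{\Delta^{-1}QS\}$; we can work with the $o \times o$ matrix $X$ instead.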
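The block structure can be verified on a toy configuration. The following Python sketch assumes two observed demes with sample counts 2 and 1 and an arbitrary effective resistance of 2 between them; it solves the small system $(BC - I)X = B$ from equation (7.49), assembles $JXJ' - I$ as in equation (7.47), and checks the result against the full distance matrix. The helper names and the numerical values are illustrative assumptions.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def solve_matrix(a, b):
    """Solve a X = b for the matrix X by Gauss-Jordan elimination."""
    n = len(a)
    aug = [a[i][:] + b[i][:] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def minus_identity(a):
    """Return a - I for a square matrix a."""
    return [[a[i][j] - (1.0 if i == j else 0.0) for j in range(len(a))]
            for i in range(len(a))]


resistance = [[0.0, 2.0], [2.0, 0.0]]      # R, o x o with o = 2 observed demes
indicator = [[1, 0], [1, 0], [0, 1]]       # J, n x o with n = 3 individuals
b = [[r / 4 + 1.0 for r in row] for row in resistance]   # B = R/4 + 11'
counts = matmul(transpose(indicator), indicator)          # C = J'J

x = solve_matrix(minus_identity(matmul(b, counts)), b)    # (BC - I) X = B
delta = minus_identity(matmul(matmul(indicator, b), transpose(indicator)))
delta_inverse = minus_identity(matmul(matmul(indicator, x), transpose(indicator)))
product = matmul(delta, delta_inverse)     # should be the 3 x 3 identity
```

Only a 2 x 2 system is solved here, even though the distance matrix is 3 x 3; with many individuals per deme the saving grows accordingly.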
Next we show how to compute efficiently the generalized determinant $\mathrm{Det}\{-\Delta^{-1}Q\}$. Since $\Delta$ is conditionally negative definite (and nonnegative),
$$\mathrm{Det}\{-\Delta^{-1}Q\} = \frac{(\mathbf{1}'\mathbf{1})/(\mathbf{1}'\Delta^{-1}\mathbf{1})}{-\det\{-\Delta\}}. \tag{7.52}$$
[This follows from $\mathrm{Det}\{\Sigma^{-1}Q\} = \dfrac{(\mathbf{1}'\mathbf{1})/(\mathbf{1}'\Sigma^{-1}\mathbf{1})}{\det\{\Sigma\}}$ together with $\det\{\Sigma\} = (1 - \mathbf{1}'\Delta^{-1}\mathbf{1}/k)\det\{-k\Delta\}$ and $\mathbf{1}'\Sigma^{-1}\mathbf{1} = (1 - k/\mathbf{1}'\Delta^{-1}\mathbf{1})^{-1}$.]
Furthermore, $\Delta$ has one positive eigenvalue and $n - 1$ negative eigenvalues, as we show in Appendix 7.3. Therefore, $-\det\{-\Delta\}$ is guaranteed to be positive and it is sufficient to compute $|\det\{\Delta\}|$, or equivalently, find the eigenvalues of $\Delta$. Since $\Delta = JBJ' - I$,
$$\mathrm{eig}\{\Delta\} = \mathrm{eig}\{JBJ'\} - 1, \tag{7.53}$$
where $JBJ'$ is a block matrix and thus it has $o$ nontrivial eigenvalues besides 0, which has multiplicity $n - o$. Furthermore, for any vector $v \in \mathbb{R}^o$,
$$\Delta(Jv) = JBCv - Jv = J(BC - I)v. \tag{7.54}$$
Therefore, if $(v, \lambda)$ is an eigenpair for $BC - I$, then $(Jv, \lambda)$ is an eigenpair for $\Delta$. That is, the $o$ nontrivial eigenvalues of $\Delta$ are equal to the eigenvalues of $BC - I$.
7.6 Markov chain Monte Carlo
Expected coalescence times in a stepping-stone model are determined by the migration rates between demes and the coalescence rates within demes according to equation (2.15). Throughout, we assume that the coalescence rate is the same for all demes and migration is symmetric. The approximation in terms of effective resistances on an undirected graph given by equation (2.28) makes the symmetry assumption explicitly and the equal size assumption implicitly. To use either expression for computing effective distances, we need to specify a migration rate for each undirected edge $(\alpha, \beta)$ in the grid $(V, E)$. We assume that the migration rates are piecewise constant and we model them in terms of a colored Voronoi tiling $\mathcal{V}$ of the habitat $\mathcal{H}$. Under the tessellation $\mathcal{V}$, each tile has its own migration log rate and all edges within a tile share this parameter.
Since the spatial structure of the population is unknown, an appropriate Voronoi tessellation of the habitat must be estimated given the data. We use a version of the method based on colored Voronoi tessellations implemented in GENELAND [Guillot et al., 2005]. The main difference is that the ''colors'' in GENELAND are cluster indices; in our framework the ''colors'' are log (base 10) migration rates as edges within the same tile share a common rate to encourage locally smooth migration surfaces.
7.6.1 Updating the number of tiles $T$ with birth/death moves
Unlike the log rates and locations of tiles, the number of tiles presents a transdimensional inference problem because adding or removing a tile changes the dimensionality of the parameter space. For such a problem we can use the birth-death Markov chain Monte Carlo algorithm (BD-MCMC) which has been applied to other variable dimension problems such as a mixture model with unknown number of components [Stephens, 2000, van Lieshout, 2000].
Assume that the Markov chain is currently in state (π‘, Ξπ‘) with π‘ Voronoi tiles andparameters Ξπ‘, and that there are two options for the next move: with probability π(π‘)the proposed move is (π‘ + 1, Ξπ‘+1), i.e., the birth of a tile; with probability 1 β π(π‘) theproposedmove is (π‘β1, Ξπ‘β1), i.e., the death of a tile. Since we consider only these twomoves, we assume that they occur with equal probability: π(1) = 1A death event is impossible with only one
tile.and π(π‘) = 1βπ(π‘) =
12 for π‘ > 1. For a given number of tiles π‘, the model parameters include the migrationlog rates {βπ1, β¦ , βππ‘} and locations {π’1, β¦ , π’π‘}, as well as common parameters π thatdo not depend on the tessellation and are not updated during a birth/death move. Let
Ξπ‘ = (βπ1, β¦ , βππ‘, π’1, β¦ , π’π‘, π). (7.55)
A full Bayesian model for the Voronoi tiling is specified by the likelihood on pairwise distances (4.4) together with the following prior distributions for the number of tiles and tile-specific parameters:
$$T \mid \lambda \sim \mathrm{Po}(\lambda), \tag{7.56a}$$
$$u \mid T \overset{iid}{\sim} U(\mathcal{H}), \tag{7.56b}$$
$$\ell m \mid T, \omega \overset{iid}{\sim} N(\overline{\ell m}, \sigma^2_m), \tag{7.56c}$$
where $T$ is the number of tiles, $(u, \ell m)$ are the tile centers and log rates, respectively, and $\omega = (\overline{\ell m}, \sigma^2_m)$ are hyperparameters: the mean log rate $\overline{\ell m}$ and the variance $\sigma^2_m$. The intensity (Poisson rate) $\lambda$ controls the spatial organization. This prior specification implies that rates and locations are a priori independent.
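To make the prior concrete, the following Python sketch draws one tiling from a prior of this form: a Poisson number of tiles, uniformly distributed tile centers, and normally distributed log rates. Two simplifications are assumptions of the sketch rather than statements from the text: the habitat is taken to be the unit square, and the Poisson draw is repeated until at least one tile is obtained. All function and parameter names are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Po(rate) by multiplying uniforms (Knuth's method)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_tiling_prior(rate, mean_log_rate, sd_log_rate, rng,
                        habitat=((0.0, 1.0), (0.0, 1.0))):
    """Draw tile centers and log rates from the prior.

    Assumptions of this sketch: a rectangular habitat and at least one tile.
    """
    tiles = 0
    while tiles == 0:
        tiles = sample_poisson(rate, rng)
    (x_min, x_max), (y_min, y_max) = habitat
    centers = [(rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))
               for _ in range(tiles)]
    log_rates = [rng.gauss(mean_log_rate, sd_log_rate) for _ in range(tiles)]
    return centers, log_rates


rng = random.Random(0)
centers, log_rates = sample_tiling_prior(5.0, 0.0, 1.0, rng)
```

Because centers and log rates are drawn separately, the sketch also reflects the statement that rates and locations are a priori independent.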
It is convenient to denote the component parameters (location and log rate) by $\theta_t = (u_t, \ell m_t)$. Since the tiles are not ordered,
$$p(T, u, \ell m \mid \lambda, \omega) \equiv p(T, \theta_1, \ldots, \theta_T \mid \lambda, \omega) \tag{7.57a}$$
$$= p(T \mid \lambda) \times T! \times p(\theta_1 \mid \omega) \cdots p(\theta_T \mid \omega) \tag{7.57b}$$
That is, conditional on the number of tiles $T$, the $\theta_t$s are independent and identically distributed from a product distribution with density
$$p(\theta \mid \omega) \propto 1\{\theta^{(1)} \in \mathcal{H}\} \cdot N(\theta^{(2)} ; \overline{\ell m}, \sigma^2_m). \tag{7.58}$$
Note that the prior is invariant under relabeling of the tiles, i.e.,
$$p(T, \theta_1, \ldots, \theta_T \mid \lambda, \omega) = p(T, \theta_{\pi(1)}, \ldots, \theta_{\pi(T)} \mid \lambda, \omega) \tag{7.59}$$
for every permutation $\pi$ of the indices $1, \ldots, T$. That is, the tile parameters are exchangeable.
Next we construct a birth-death MCMC that allows only two types of moves: the birth of a new tile and the death of an existing tile (when $T > 1$). Suppose that the current state is $(t, \theta_1, \dots, \theta_t)$. If the proposal is a birth, we add a new tile with log rate $\ell m_{t+1} \sim \mathrm{N}(\overline{\ell m}, \sigma_m^2)$ and location $u_{t+1} \sim \mathrm{U}(\mathcal{H})$. We denote the birth density by $b(t) = p(\theta_{t+1} \mid \omega)$. If the proposal is a death, we select a tile to remove uniformly at random, i.e., with probability $d(t) = 1/t$.

To guarantee that the birth-death chain is reversible and the stationary distribution is the posterior $p(T, \theta_1, \dots, \theta_T \mid z, \lambda, \omega, \xi)$ given observed data $z$, we choose the acceptance probabilities $\alpha(\cdot, \cdot)$ so that they satisfy the detailed balance condition:

$$q(t)\, b(t)\, p(t, \theta_1, \dots, \theta_t \mid z, \lambda, \omega, \xi)\, \alpha(t, t+1) = [1 - q(t+1)]\, d(t+1)\, p(t+1, \theta_1, \dots, \theta_{t+1} \mid z, \lambda, \omega, \xi)\, \alpha(t+1, t). \tag{7.60}$$

Since $q(t) = q(t+1) = 1/2$,

$$R(t) = \frac{\alpha(t, t+1)}{\alpha(t+1, t)} = \frac{d(t+1)}{b(t)} \cdot \frac{p(t+1, \theta_1, \dots, \theta_{t+1} \mid z, \lambda, \omega, \xi)}{p(t, \theta_1, \dots, \theta_t \mid z, \lambda, \omega, \xi)} \tag{7.61a}$$
$$= \frac{d(t+1)}{b(t)} \cdot \frac{p(t+1 \mid \lambda)}{p(t \mid \lambda)} \cdot \frac{p(\theta_1, \dots, \theta_{t+1} \mid t+1, \omega)}{p(\theta_1, \dots, \theta_t \mid t, \omega)} \cdot \frac{l_{t+1}(z; \theta_{1:t+1}, \xi)}{l_t(z; \theta_{1:t}, \xi)} \tag{7.61b}$$
$$= \frac{\lambda}{t+1} \cdot \frac{l_{t+1}(z; \theta_{1:t+1}, \xi)}{l_t(z; \theta_{1:t}, \xi)}, \tag{7.61c}$$

where the last step applies equation (7.57) with $p(t \mid \lambda) = \lambda^t e^{-\lambda}/t!$, and $\alpha(t, t+1) = \min\{R(t), 1\}$. Therefore, the following algorithm simulates a Markov chain with stationary distribution $p(T, \theta_{1:T} \mid z, \lambda, \omega, \xi)$ where, for simplicity of notation, we write $\theta_{1:t} = (\theta_1, \dots, \theta_t)$.

1. Choose between a birth event and a death event, with equal probability.

2. If a birth is proposed, its location, migration log rate and coalescence log rate are sampled from the priors, and the acceptance probability is

$$\alpha(T, T+1) = \min\left\{ \frac{\lambda}{T+1} \cdot \frac{l_{T+1}(z; \Theta_{T+1})}{l_T(z; \Theta_T)},\ 1 \right\}. \tag{7.62}$$

3. If a death is proposed, a tile to be removed is selected uniformly at random, and the acceptance probability is

$$\alpha(T+1, T) = \min\left\{ \frac{T+1}{\lambda} \cdot \frac{l_T(z; \Theta_T)}{l_{T+1}(z; \Theta_{T+1})},\ 1 \right\} \tag{7.63}$$

because a deletion move is the reverse of an addition move [Byers and Raftery, 2002].
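The two acceptance probabilities above can be sketched on the log scale as follows. This is an illustrative Python sketch, not the MATLAB implementation described later; the function names and the convention of passing log-likelihood values as plain numbers are assumptions made for the example.

```python
import math


def log_accept_birth(lam, n_tiles, loglik_proposed, loglik_current):
    """Log acceptance probability for a birth move from n_tiles to n_tiles + 1 tiles.

    Implements min{ lam / (T + 1) * likelihood ratio, 1 } on the log scale.
    """
    log_ratio = math.log(lam) - math.log(n_tiles + 1) + (loglik_proposed - loglik_current)
    return min(log_ratio, 0.0)


def log_accept_death(lam, n_tiles, loglik_proposed, loglik_current):
    """Log acceptance probability for a death move from n_tiles to n_tiles - 1 tiles.

    Implements min{ (T + 1) / lam * likelihood ratio, 1 } with T + 1 = n_tiles.
    A death is impossible when only one tile remains.
    """
    if n_tiles <= 1:
        return -math.inf
    log_ratio = math.log(n_tiles) - math.log(lam) + (loglik_proposed - loglik_current)
    return min(log_ratio, 0.0)
```

Because the birth proposal is drawn from the prior, the prior densities cancel and only the Poisson intensity, the number of tiles and the likelihood ratio enter the ratio; the death ratio is the reciprocal of the corresponding birth ratio.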
7.6.2 Updating the Voronoi centers (for a fixed number of tiles $T$)

This is a Metropolis-Hastings symmetric random-walk update. Sequentially, for each tile $t$, we propose a new center $u_t^{*}$. The proposal distribution is bivariate normal centered at the current value $u_t^{c} = (x_t, y_t)$ [with correlation 0]. The proposal is accepted with probability

$$\alpha = \min\left\{ \frac{p(u_t^{*} \mid D, \Theta \setminus u_t)}{p(u_t^{c} \mid D, \Theta \setminus u_t)},\ 1 \right\} = \min\left\{ \frac{l(D; \Theta^{*})\, p(u^{*})}{l(D; \Theta^{c})\, p(u^{c})},\ 1 \right\}. \tag{7.64}$$

The prior distribution of the center locations $u = (u_t : t = 1, \dots, T)$ is uniform over the domain $\mathcal{H}$,

$$p(u) \propto \mathbf{1}\{u_t \in \mathcal{H} : t = 1, \dots, T\}. \tag{7.65}$$

On the log scale, $\log(\alpha) = -\infty$ if at least one component of $u^{*}$ falls outside of the domain $\mathcal{H}$. Otherwise,

$$\log(\alpha) = \min\{\ell(\Theta^{*} \mid D) - \ell(\Theta^{c} \mid D),\ 0\}, \tag{7.66}$$

where $\ell(\Theta \mid D)$ denotes the log-likelihood.
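A minimal Python sketch of this update is given below, assuming for simplicity that the habitat is an axis-aligned rectangle (as in Section 7.7.1) and that the log-likelihood values are supplied by the caller. The function names, the rectangle arguments and the step size `step_sd` are hypothetical and are not taken from the thesis code.

```python
import math
import random


def propose_center(center, step_sd, rng):
    """Symmetric random-walk proposal: independent normal steps in x and y."""
    x, y = center
    return (rng.gauss(x, step_sd), rng.gauss(y, step_sd))


def log_accept_center(proposed, lower_left, upper_right, loglik_proposed, loglik_current):
    """Log acceptance probability for a proposed tile center.

    Returns -inf when the proposal leaves the (assumed rectangular) habitat,
    otherwise min{log-likelihood difference, 0}.
    """
    inside = (lower_left[0] <= proposed[0] <= upper_right[0]
              and lower_left[1] <= proposed[1] <= upper_right[1])
    if not inside:
        return -math.inf
    return min(loglik_proposed - loglik_current, 0.0)


def accept(log_alpha, rng):
    """Accept with probability exp(log_alpha); 1 - random() lies in (0, 1]."""
    return math.log(1.0 - rng.random()) <= log_alpha
```

Working on the log scale avoids underflow when likelihood values are very small, and a proposal outside the habitat is rejected without evaluating the likelihood.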
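7.6.3 Updating the log-transformed migration rates $\ell m$

We assume that the migration log (base 10) rates are normally distributed with common mean $\overline{\ell m}$ and variance $\sigma_m^2$,

$$\ell m_t \mid \overline{\ell m}, \sigma_m^2 \overset{\text{iid}}{\sim} \mathrm{N}(\overline{\ell m}, \sigma_m^2), \tag{7.67}$$

or in an equivalent parametrization,

$$\ell m_t = \overline{\ell m} + e_t, \qquad e_t \overset{\text{iid}}{\sim} \mathrm{N}(0, \sigma_m^2), \tag{7.68}$$

where $\overline{\ell m}$ is the mean log rate and $e_t$ is the effect of tile $t$, relative to the mean. The second parametrization is more convenient because it allows scaling all migration rates simultaneously by adjusting $\overline{\ell m}$.

We choose a vague prior for the hyperparameters $\overline{\ell m}, \sigma_m^2$, assuming prior independence of location and scale,

$$\overline{\ell m} \sim \mathrm{U}(\mathit{low}, \mathit{upp}), \tag{7.69}$$
$$\sigma_m^2 \sim \text{Inv-G}(a/2, b/2). \tag{7.70}$$

That is, the hyperprior on $(\overline{\ell m}, \sigma_m^2)$ is semi-conjugate.

To simulate a Markov chain with stationary distribution $p(T, u, \ell m \mid z, \lambda, \omega)$,

1. Update each tile effect $e_t$ in turn (or all tile effects at once) with a Metropolis-Hastings step and a random-walk proposal. That is, we draw a new migration log rate parameter $\ell m_t^{*}$ from a normal distribution centered at the current value $\ell m_t^{c}$ for each tile in the current Voronoi decomposition and accept the proposal $\ell m^{*} = \{\ell m_t^{*} : t = 1, \dots, T\}$ with probability

$$\alpha = \min\left\{ \frac{l(D; \Theta \setminus \ell m, \ell m^{*})\, p(\ell m^{*} \mid \overline{\ell m}, \sigma_m^2)}{l(D; \Theta \setminus \ell m, \ell m^{c})\, p(\ell m^{c} \mid \overline{\ell m}, \sigma_m^2)},\ 1 \right\}. \tag{7.71}$$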
2. Update the mean migration log rate $\overline{\ell m}$ with a Metropolis-Hastings step and a random-walk proposal.

3. Update the common log rate variance $\sigma_m^2$ with a Gibbs step by sampling from its full conditional distribution:

$$p(\sigma_m^2 \mid D, \Theta) \propto \text{Inv-G}(a/2, b/2) \prod_{t=1}^{T} \mathrm{N}(e_t; 0, \sigma_m^2) \tag{7.72a}$$
$$\propto \left\{\frac{1}{\sigma_m^2}\right\}^{a/2+1} \exp\left\{-\frac{b}{2\sigma_m^2}\right\} \times \prod_{t=1}^{T} \left\{\frac{1}{\sigma_m^2}\right\}^{1/2} \exp\left\{-\frac{e_t^2}{2\sigma_m^2}\right\} \tag{7.72b}$$
$$\propto \left\{\frac{1}{\sigma_m^2}\right\}^{a/2+T/2+1} \exp\left\{-\frac{1}{2\sigma_m^2}\,(b + s_m^2)\right\}, \tag{7.72c}$$

where $s_m^2 = \sum_{t=1}^{T} e_t^2$ is the sum of squares for the relative tile effects on the log scale. Because we conveniently choose the conjugate inverse-gamma prior for $\sigma_m^2$, we can update this parameter by drawing

$$\sigma_m^2 \sim \text{Inv-G}\big((a + T)/2,\ (b + s_m^2)/2\big). \tag{7.73}$$
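The Gibbs draw for the variance can be sketched in Python as below. This is only an illustration under stated assumptions: the function names are hypothetical, and the inverse-gamma draw is obtained by inverting a gamma variate, which relies on the fact that if $G$ is gamma distributed with a given shape and rate, then $1/G$ is inverse-gamma distributed with the same shape and with scale equal to that rate.

```python
import random


def variance_full_conditional(a, b, tile_effects):
    """Shape and scale of the inverse-gamma full conditional of the variance.

    With prior Inv-G(a/2, b/2) and T tile effects e_t ~ N(0, sigma^2),
    the full conditional is Inv-G((a + T)/2, (b + sum of e_t^2)/2).
    """
    shape = (a + len(tile_effects)) / 2.0
    scale = (b + sum(e * e for e in tile_effects)) / 2.0
    return shape, scale


def draw_inverse_gamma(shape, scale, rng):
    """Draw from Inv-G(shape, scale) as the reciprocal of a gamma variate.

    random.gammavariate(alpha, beta) uses beta as a scale parameter,
    so a gamma variate with rate `scale` needs beta = 1 / scale.
    """
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)
```

For example, with `a = 2`, `b = 1` and tile effects `[1.0, -2.0, 0.5]`, the full conditional has shape 2.5 and scale 3.125.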
7.6.4 Updating the degrees of freedom $k$

Here we consider updating the degrees of freedom $k$. The proposal distribution is

$$k^{*} \sim \mathrm{N}(k^{c}, v), \tag{7.74}$$

where $k^{c}$ is the current value and $v$ is the proposal variance.

Since the Wishart degrees of freedom for an $n \times n$ matrix is a real number $k$ that satisfies $k > n - 1$, the support of this parameter is $(n, p)$. If the proposed value $k^{*}$ is not valid,

$$\log\left\{ \frac{p(k^{*})}{p(k^{c})} \right\} = -\infty \tag{7.75}$$

and the proposal is rejected. Otherwise, it is accepted with probability

$$\alpha = \min\left\{ \frac{p(k^{*})\, l(D; \Theta \setminus k, k^{*})}{p(k^{c})\, l(D; \Theta \setminus k, k^{c})},\ 1 \right\}. \tag{7.76}$$

Here $l(D; \Theta \setminus k, k^{*})$ is the likelihood for the given value of $k$ with the rest of the parameters $\Theta \setminus k$ fixed to their current values. The prior on the degrees of freedom is uniform on the log scale, i.e.,

$$p(k) \propto \frac{1}{k}. \tag{7.77}$$

Since $k$ is bounded, the prior is proper with normalizing constant $\log(p) - \log(n)$.
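As a small illustration, the prior and the resulting log acceptance probability can be written as follows. This Python sketch assumes that the support is the open interval between a lower bound `n` and an upper bound `p`, that the current value is valid, and that log-likelihood values are supplied by the caller; the function names are hypothetical.

```python
import math


def log_prior_df(k, n, p):
    """Log density of the prior p(k) = 1 / (k * (log p - log n)) on (n, p)."""
    if not (n < k < p):
        return -math.inf
    return -math.log(k) - math.log(math.log(p) - math.log(n))


def log_accept_df(k_proposed, k_current, n, p, loglik_proposed, loglik_current):
    """Log acceptance probability for a random-walk proposal of the degrees of freedom.

    An invalid proposal has log prior -inf and is always rejected.
    """
    log_prior_proposed = log_prior_df(k_proposed, n, p)
    if log_prior_proposed == -math.inf:
        return -math.inf
    log_ratio = (log_prior_proposed - log_prior_df(k_current, n, p)
                 + loglik_proposed - loglik_current)
    return min(log_ratio, 0.0)
```

With equal likelihoods, the prior ratio reduces to $k^{c}/k^{*}$, so proposals toward smaller values are always accepted and proposals toward larger values are accepted with probability $k^{c}/k^{*}$.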
7.6.5 Updating the scale nuisance parameter $\sigma^{*}$

The nuisance parameter $\sigma^{*} = k\sigma^2$ can be efficiently updated with a Gibbs step if we choose the conjugate prior distribution, $\text{Inv-G}(c/2, d/2)$. Then the full conditional is also inverse gamma, with shape and scale parameters given by

$$c^{*} = c + k(n - 1), \tag{7.78a}$$
$$d^{*} = d + k\, \mathrm{tr}\{\Delta^{-1} D\}. \tag{7.78b}$$

For microsatellites, $p(\sigma_1^{*}, \dots, \sigma_p^{*} \mid D, \Theta)$ factorizes into the full conditional of each site-specific scale parameter $\sigma_s^{*}$, so there is no loss of efficiency in estimating a separate scale parameter for each of a small number of microsatellites.
7.7 MATLAB implementation
7.7.1 Triangular (isometric) grid
Suppose that the genotyped individuals are sampled within a rectangular region $\mathcal{H}$ bounded by $(x_0, y_0)$ on the bottom left and $(x_1, y_1)$ on the top right. By convention, $x$ denotes longitudes and $y$ latitudes.

To initialize the program, we specify the dimensions $n_x \times n_y$ of a triangular grid $(V, E)$ to tile the habitat $\mathcal{H}$. By definition, a triangular grid is formed by dividing the plane regularly into equilateral triangles. The resulting grid is regular but not strictly isometric, unless $n_x$ and $n_y$ are chosen to match the size of the habitat.

[Margin figure: the six neighbors of a vertex in the triangular grid, labeled W, E, NW, NE, SW, SE.]
7.7.2 Data structures
Here I describe the MATLAB implementation and data structures. The problem is specified in terms of

• an $n_x \times n_y$ triangular grid $(V, E)$ which spans the habitat $\mathcal{H}$;

• an $(n_x n_y) \times (n_x n_y)$ symmetric matrix $M$ of migration rates.

The order of $(V, E)$ is $|V| = n_x n_y$ and the size is $n_e \equiv |E| = (n_x - 1) n_y + (2 n_x - 1)(n_y - 1)$. Both the grid $(V, E)$ and the migration matrix $M$ are very sparse because each vertex $v \in V$ has at most six neighbors and

$$M = (m_{\alpha\beta}) = \begin{cases} m_{\alpha\beta} & \text{if } (\alpha, \beta) \in E, \\ 0 & \text{otherwise,} \end{cases} \tag{7.79}$$

where $m_{\alpha\beta} = 2 N_0 \tilde{m}_{\alpha\beta}$ and $N_0$ is the coalescent timescale.

That is, $(V, E)$ and $M$ together describe a weighted graph $G = (V, E, M)$. It is not required that $M$ be symmetric; the linear system for $\Delta$ is valid as long as $(V, E)$ is connected: if all demes communicate, the sample will eventually coalesce, i.e., the distance between lineages is finite. This guarantees that

$$\Delta = (\delta_{\alpha\beta}^2) \quad \text{with} \quad \delta_{\alpha\beta}^2 < \infty \ \text{for } (\alpha, \beta) \in V \times V. \tag{7.80}$$

Although the two matrices have the same size, $M$ is sparse but $\Delta$ is full and hence might be expensive to compute. With a denser grid $(V, E)$, few of the demes are sampled from and computing the entire distance matrix $\Delta$ is not necessary. To compute the likelihood of the data, we need only the sample distance matrix $\Delta$.
In the rest of this section, let $n_v = n_x n_y$ be the number of demes and $n_p = \binom{n_v}{2} = n_v(n_v - 1)/2$ be the number of unique pairs of demes. The number of unknowns is $n_v + n_p$, the number of within-deme coalescence times plus the number of between-demes coalescence times.

Vertex set representation: The vertices $V$ are stored in an $n_v \times 2$ matrix Vcoord, with the $x$ (longitude) coordinates in the first column and the $y$ (latitude) coordinates in the second column. The locations of the Voronoi sites $S$ are stored similarly in Scoord. The triangular grid $(V, E)$ is fixed, so Vcoord does not change. On the other hand, the Voronoi decomposition of $\mathcal{H}$ is updated regularly, which is reflected by (row) changes in Scoord.

The two matrices are used to update the Voronoi tessellation whenever a tile moves its location. Recall that by definition the Voronoi tile (cell) $V(s)$ consists of the points closer to $s$ than to any other site.

euDist = rdist(Vcoord,Scoord);      % Compute all distances between the demes in Vcoord and the sites in Scoord.
[temp,Colors] = min(euDist,[],2);   % For each deme v in V, find the closest Voronoi site s in S.

The vector Colors indicates which tile each deme falls into.

Edge set representation: The edges $E$ are stored in an $n_v \times 6$ matrix Edges. There is one row for each vertex (deme) and the columns are its six adjacent vertices, in the order W, NW, NE, E, SE, SW (clockwise). Vertices are identified by their row index in Edges. If the deme does not have a neighbor in some positions, the corresponding entries of Edges are set to 0. The number of nonzero entries is twice the number of edges, $2 n_e$.

Rate parameters representation: The backward migration matrix is stored in an $n_v \times n_v$ sparse matrix Mrates with $2 n_e$ nonzero elements.
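To make the grid and tile bookkeeping concrete, the following Python sketch builds a small triangular grid and assigns each deme to its nearest Voronoi site. It is an illustration rather than a translation of the MATLAB code: the function names are hypothetical, vertices are numbered row by row starting from 1, and every second row is assumed to be shifted by half a unit, which is the layout implied by the edge lists in the ms commands of Section 7.8.

```python
import math


def build_triangular_grid(nx, ny):
    """Return (coords, edges) for an nx-by-ny triangular grid.

    coords maps a 1-based vertex id to (x, y); edges is a sorted list of
    undirected edges (i, j) with i < j.
    """
    def vid(x, y):
        return y * nx + x + 1

    coords, edges = {}, set()
    for y in range(ny):
        shift = 0.5 if y % 2 == 1 else 0.0  # assumed: odd rows are shifted right
        for x in range(nx):
            coords[vid(x, y)] = (x + shift, y * math.sqrt(3) / 2)
            if x + 1 < nx:  # edge to the neighbor in the same row
                edges.add((vid(x, y), vid(x + 1, y)))
            if y + 1 < ny:  # edges to the (up to two) neighbors in the next row
                above = (x - 1, x) if y % 2 == 0 else (x, x + 1)
                for xa in above:
                    if 0 <= xa < nx:
                        edges.add((vid(x, y), vid(xa, y + 1)))
    return coords, sorted(edges)


def assign_tiles(coords, sites):
    """Map each vertex id to the index of its closest Voronoi site."""
    colors = {}
    for v, (x, y) in coords.items():
        dist2 = [(x - sx) ** 2 + (y - sy) ** 2 for sx, sy in sites]
        colors[v] = dist2.index(min(dist2))
    return colors
```

For a 5 × 4 grid this yields 43 edges, in agreement with $(n_x - 1) n_y + (2 n_x - 1)(n_y - 1)$, and `assign_tiles` plays the role of the `Colors` vector.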
7.7.3 Computing coalescence distances
Our MCMC implementation requires repeatedly solving a system of linear equations $Ax = b$. The matrix $A = [A_1; A_2]$ is large, sparse, nearly symmetric and positive definite. The regularity of the grid $G$ gives $A$ its structure and sparseness.

$A_1$ represents the $n_v$ within-deme equations

$$(q_\alpha + m_\alpha)\, T_{\alpha\alpha} - \sum_{\gamma \in \mathrm{nbr}(\alpha)} m_{\alpha\gamma} T_{\alpha\gamma} = 1, \tag{7.81}$$

and $A_2$ represents the $n_p$ between-demes equations

$$(m_\alpha + m_\beta)\, T_{\alpha\beta} - \sum_{\gamma \in \mathrm{nbr}(\alpha)} m_{\alpha\gamma} T_{\beta\gamma} - \sum_{\gamma \in \mathrm{nbr}(\beta)} m_{\beta\gamma} T_{\alpha\gamma} = 2. \tag{7.82}$$

Here $\mathrm{nbr}(\alpha) = \{\gamma \in V : (\alpha, \gamma) \in E\}$ is the set of vertices adjacent to $\alpha$ and $m_\alpha = \sum_{\gamma \in \mathrm{nbr}(\alpha)} m_{\alpha\gamma}$ is the rate of migration into $\alpha$. The equations also show that $b = [\mathbf{1}_{n_v};\ 2 \cdot \mathbf{1}_{n_p}]$, where $\mathbf{1}_n$ is the vector of $n$ ones.

The matrix $A$ is positive definite because

$$A_2 = \mathcal{L}_2(\{m_{\alpha\beta}\}), \tag{7.83}$$
$$A_1 = \mathrm{diag}\{q\} + \mathcal{L}_1(\{m_{\alpha\beta}\}). \tag{7.84}$$

The Laplacian matrices $\mathcal{L}_1, \mathcal{L}_2$ are functions of only the migration rates and $\mathcal{L} = [\mathcal{L}_1; \mathcal{L}_2]$ is also a Laplacian matrix, and therefore it is positive semidefinite; the positive coalescence rates $q$ on the diagonal of $A_1$ then make $A$ positive definite. We note that the matrix $\mathcal{Q} = -\mathcal{L}$ is the infinitesimal generator of the migration process where the lineages move from deme to deme according to $M$. For a continuous-time stochastic process, the infinitesimal generator is the matrix $\mathcal{Q} = (q_{x,y})$ with entries

$$q_{x,y} = \begin{cases} -q_x & \text{if } x = y, \\ q_x\, p_{x,y} & \text{otherwise,} \end{cases} \tag{7.85}$$

where $q_x$ is the holding rate for state $x$ and $P = (p_{x,y})$ is the transition probability matrix of the embedded jump chain. In this case, the transition probabilities are

$$\frac{m_{\alpha\beta}}{\sum_{\gamma \in \mathrm{nbr}(\alpha)} m_{\alpha\gamma}} = \frac{N_0 \tilde{m}_{\alpha\beta}}{\sum_{\gamma \in \mathrm{nbr}(\alpha)} N_0 \tilde{m}_{\alpha\gamma}} = \frac{\tilde{m}_{\alpha\beta}}{1 - \tilde{m}_{\alpha\alpha}}. \tag{7.86}$$

Solving $Ax = b$, and thus finding all coalescence times at once, has the advantage of reducing numerical errors. Because we use an iterative procedure (preconditioned conjugate gradient), we control how close the approximate solution $\hat{x}$ is to the true solution $x$. If we first solve for $x_2$ and then substitute to find $x_1$, numerical errors in $x_2$ are propagated in $x_1$.
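For a very small graph, the within-deme and between-demes equations can be assembled and solved directly, which makes the structure of the system easy to inspect. The Python sketch below is an illustration under stated assumptions: it uses dense Gaussian elimination rather than the preconditioned conjugate gradient method used in the implementation, demes are indexed from 0, and the function names and the dictionary format for the rates are hypothetical.

```python
from itertools import combinations_with_replacement


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (dense, small systems)."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def coalescence_times(q, m):
    """Solve the within-deme and between-demes equations for all pairs of demes.

    q[a] is the coalescence rate in deme a; m[(a, g)] is the migration rate m_ag
    for each pair of neighboring demes. Returns {(a, b): T_ab} with a <= b.
    """
    n = len(q)
    pairs = list(combinations_with_replacement(range(n), 2))
    index = {pair: i for i, pair in enumerate(pairs)}
    col = lambda a, b: index[(min(a, b), max(a, b))]
    total = [sum(rate for (s, _), rate in m.items() if s == a) for a in range(n)]
    A = [[0.0] * len(pairs) for _ in pairs]
    rhs = [0.0] * len(pairs)
    for (a, b), row in index.items():
        if a == b:  # within-deme equation, right-hand side 1
            A[row][col(a, a)] += q[a] + total[a]
            for (s, g), rate in m.items():
                if s == a:
                    A[row][col(a, g)] -= rate
            rhs[row] = 1.0
        else:  # between-demes equation, right-hand side 2
            A[row][col(a, b)] += total[a] + total[b]
            for (s, g), rate in m.items():
                if s == a:
                    A[row][col(b, g)] -= rate
                if s == b:
                    A[row][col(a, g)] -= rate
            rhs[row] = 2.0
    x = solve_linear_system(A, rhs)
    return {pair: x[i] for pair, i in index.items()}
```

For two demes with $q_1 = q_2 = 1$ and $m_{12} = m_{21} = 1$, the three equations are $2T_{11} - T_{12} = 1$, $2T_{22} - T_{12} = 1$ and $2T_{12} - T_{11} - T_{22} = 2$, with solution $T_{11} = T_{22} = 2$ and $T_{12} = 3$.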
7.7.4 Computing resistance distances
Consider again the matrix of migration rates between neighboring demes, $M$. Let $L$ be its Laplacian matrix,

$$L = \mathrm{diag}\{M \mathbf{1}\} - M. \tag{7.87}$$

The effective resistance $R_{\alpha\beta}$ between a pair of demes $(\alpha, \beta)$ is equal to the $\beta$th element of the vector $x$ given by

$$L_{-\alpha}\, x = e_\beta, \tag{7.88}$$

where $L_{-\alpha}$ is the Laplacian $L$ with the $\alpha$th row and column removed, and $e_\beta$ is the standard basis vector with a 1 at the $\beta$th coordinate and 0s elsewhere. This method can be optimized by solving $L_{-\alpha} X = E$ where $E = (e_\beta)$, so that we compute the effective resistance between $\alpha$ and all other demes with a single matrix operation.

There are other methods to compute $R$, e.g., with a single matrix inversion [Babić et al., 2002]

$$H = (L + n_v^{-1} J)^{-1}, \tag{7.89}$$
$$R_{\alpha\beta} = H_{\alpha\alpha} + H_{\beta\beta} - 2 H_{\alpha\beta}, \tag{7.90}$$

where $J$ is the $n_v \times n_v$ matrix of ones. This method is more efficient for sparser grids where most demes are sampled from.

Figure 7.1: This is a 5 × 4 regular triangular grid. If line weight is proportional to migration rate, this pattern corresponds to uniform migration with equal deme sizes.

Figure 7.2: Barrier to migration. If line weights are proportional to migration rates, this pattern corresponds to a barrier across the middle of the habitat.
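The first method (removing the row and column of the reference deme and solving one linear system) can be sketched in Python as follows. This is an illustrative sketch only: it assumes a symmetric rate matrix given as a dense list of lists with a zero diagonal, uses dense Gaussian elimination, indexes demes from 0, and the function names are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (dense, small systems)."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def laplacian(M):
    """L = diag(M 1) - M for a rate matrix M with zero diagonal."""
    n = len(M)
    return [[(sum(M[i]) if i == j else 0.0) - M[i][j] for j in range(n)] for i in range(n)]


def effective_resistance(M, alpha, beta):
    """Effective resistance between demes alpha and beta.

    Removes row and column alpha from the Laplacian, solves for the unit
    vector e_beta, and reads off the entry that corresponds to beta.
    """
    if alpha == beta:
        return 0.0
    L = laplacian(M)
    keep = [i for i in range(len(M)) if i != alpha]
    grounded = [[L[i][j] for j in keep] for i in keep]
    e_beta = [1.0 if i == beta else 0.0 for i in keep]
    x = solve_linear_system(grounded, e_beta)
    return x[keep.index(beta)]
```

On a path of three demes with unit rates, the resistance between the two end demes is 2 (two unit resistors in series); on a triangle with unit rates, the resistance between any two demes is 2/3.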
7.8 Simulations with ms
Here we describe how we produce samples from a spatially distributed population that evolves under Kimura's stepping-stone model using the program ms [Hudson, 2002]. For all simulations, we first construct a regular triangular grid $(V, E)$ of $n_x \times n_y$ demes (vertices) with coordinates $(x_\alpha, y_\alpha)$. The spatial information is not explicitly used by ms and instead we set $M_{\alpha\beta} = 0 = M_{\beta\alpha}$ for all pairs of demes such that $(\alpha, \beta) \notin E$. We also specify the size of each deme, $N_\alpha$, and the migration rates across each edge in the two opposite directions, $M_{\alpha\beta}$ and $M_{\beta\alpha}$. Migration is not necessarily symmetric but is conservative [i.e., it preserves deme sizes]. Both the deme sizes $N_\alpha = N_0 \nu_\alpha$ and the migration rates $M_{\alpha\beta} = 4 N_0 \tilde{m}_{\alpha\beta}$ are relative to the coalescent timescale $N_0$. [That is, $\nu_\alpha$ is the relative size of deme $\alpha$ and $\tilde{m}_{\alpha\beta}$ is the backward migration fraction from $\alpha$ to $\beta$ per generation.] The input arguments are $\nu_\alpha$ and $M_{\alpha\beta}$, respectively.
7.8.1 Spatial structure due to constant migration
Here edges in the middle of the habitat have a migration rate that is an order of magnitude lower than the rate of the edges on either side. [The rates range from 0.3 to 3.] This pattern is a barrier to actual migration and it results in a barrier to effective migration.
ms 20 1 -s 1 -I 20 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
-m 1 2 3.0 -m 2 1 3.0 -m 1 6 3.0 -m 6 1 3.0 -m 2 3 1.7 -m 3 2 1.7
-m 2 6 3.0 -m 6 2 3.0 -m 2 7 1.7 -m 7 2 1.7 -m 3 4 0.3 -m 4 3 0.3
-m 3 7 0.3 -m 7 3 0.3 -m 3 8 0.3 -m 8 3 0.3 -m 4 5 1.7 -m 5 4 1.7
-m 4 8 0.3 -m 8 4 0.3 -m 4 9 1.7 -m 9 4 1.7 -m 5 9 3.0 -m 9 5 3.0
-m 5 10 3.0 -m 10 5 3.0 -m 6 7 1.7 -m 7 6 1.7 -m 6 11 3.0 -m 11 6 3.0
-m 6 12 3.0 -m 12 6 3.0 -m 7 8 0.3 -m 8 7 0.3 -m 7 12 1.7 -m 12 7 1.7
-m 7 13 0.3 -m 13 7 0.3 -m 8 9 1.7 -m 9 8 1.7 -m 8 13 0.3 -m 13 8 0.3
-m 8 14 0.3 -m 14 8 0.3 -m 9 10 3.0 -m 10 9 3.0 -m 9 14 1.7
-m 14 9 1.7 -m 9 15 3.0 -m 15 9 3.0 -m 10 15 3.0 -m 15 10 3.0
-m 11 12 3.0 -m 12 11 3.0 -m 11 16 3.0 -m 16 11 3.0 -m 12 13 1.7
-m 13 12 1.7 -m 12 16 3.0 -m 16 12 3.0 -m 12 17 1.7 -m 17 12 1.7
-m 13 14 0.3 -m 14 13 0.3 -m 13 17 0.3 -m 17 13 0.3 -m 13 18 0.3
-m 18 13 0.3 -m 14 15 1.7 -m 15 14 1.7 -m 14 18 0.3 -m 18 14 0.3
-m 14 19 1.7 -m 19 14 1.7 -m 15 19 3.0 -m 19 15 3.0 -m 15 20 3.0
-m 20 15 3.0 -m 16 17 1.7 -m 17 16 1.7 -m 17 18 0.3 -m 18 17 0.3
-m 18 19 1.7 -m 19 18 1.7 -m 19 20 3.0 -m 20 19 3.0
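A command of this form lists one pair of `-m` flags per edge, so it is convenient to generate it from an edge list rather than to type it by hand. The Python sketch below shows one way to do this; the helper names and the input format are hypothetical, and the only ms options used are those that appear in the command above (`-s`, `-I` with a trailing default rate of 0, and `-m i j x`).

```python
def symmetric_migration_flags(edge_rates):
    """Turn [((i, j), rate), ...] into '-m i j rate -m j i rate' flags, one pair per edge."""
    flags = []
    for (i, j), rate in edge_rates:
        flags.append("-m {} {} {:.1f}".format(i, j, rate))
        flags.append("-m {} {} {:.1f}".format(j, i, rate))
    return flags


def ms_command(samples_per_deme, edge_rates, n_reps=1, seg_sites=1):
    """Assemble an ms command line for an island model with per-edge migration rates.

    samples_per_deme lists the number of sampled chromosomes in each deme; the
    trailing 0 after the sample sizes sets the default migration rate, so that
    pairs of demes without an explicit -m flag do not exchange migrants.
    """
    parts = ["ms", str(sum(samples_per_deme)), str(n_reps), "-s", str(seg_sites)]
    parts += ["-I", str(len(samples_per_deme))]
    parts += [str(k) for k in samples_per_deme]
    parts.append("0")
    parts += symmetric_migration_flags(edge_rates)
    return " ".join(parts)
```

For instance, `ms_command([1, 1, 1], [((1, 2), 3.0), ((2, 3), 0.3)])` produces a three-deme command in which the edge between demes 2 and 3 acts as a barrier.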
Figure 7.3: Barrier to effective migration due to differences in effective population size. The demes in bold are 5 times bigger; the edges in red are directed, which is necessary to preserve equilibrium in time.

Figure 7.4: Uniform effective migration even though there are differences in both population size and in migration rates. The demes in bold are 4 times bigger; the edges in red are directed, which is necessary to preserve equilibrium in time.
7.8.2 Spatial structure due to variation in diversity
Here some demes have bigger size and thus lower coalescence rate and higher genetic diversity. In the first version, migration rates are constant but there are differences in effective population size. Since demes in the "east" and "west" of the habitat are 5 times bigger than those in the middle, the effect is a barrier to effective migration that is qualitatively very similar to the true barrier in the previous simulation.

[A few edges are directed, with rate $M_{\alpha\beta} = 0.2$ from a big deme to a small deme and rate $M_{\beta\alpha} = 1$ in the other direction. These edges cross the "boundary" between the areas of high and low diversity and their rates are assigned so that migration is conservative: the same number of migrants are exchanged between $\alpha$ and $\beta$ because $N_\alpha M_{\alpha\beta} = N_\beta M_{\beta\alpha}$.]
ms 20 1 -s 1 -I 20 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
-n 1 5.0 -n 2 5.0 -n 3 1.0 -n 4 1.0 -n 5 1.0 -n 6 5.0 -n 7 1.0
-n 8 1.0 -n 9 1.0 -n 10 5.0 -n 11 5.0 -n 12 1.0 -n 13 1.0 -n 14 1.0
-n 15 5.0 -n 16 1.0 -n 17 1.0 -n 18 1.0 -n 19 5.0 -n 20 5.0
-m 1 2 1.0 -m 2 1 1.0 -m 1 6 1.0 -m 6 1 1.0 -m 2 3 0.2 -m 3 2 1.0
-m 2 6 1.0 -m 6 2 1.0 -m 2 7 0.2 -m 7 2 1.0 -m 3 4 1.0 -m 4 3 1.0
-m 3 7 1.0 -m 7 3 1.0 -m 3 8 1.0 -m 8 3 1.0 -m 4 5 1.0 -m 5 4 1.0
-m 4 8 1.0 -m 8 4 1.0 -m 4 9 1.0 -m 9 4 1.0 -m 5 9 1.0 -m 9 5 1.0
-m 5 10 1.0 -m 10 5 0.2 -m 6 7 0.2 -m 7 6 1.0 -m 6 11 1.0 -m 11 6 1.0
-m 6 12 0.2 -m 12 6 1.0 -m 7 8 1.0 -m 8 7 1.0 -m 7 12 1.0 -m 12 7 1.0
-m 7 13 1.0 -m 13 7 1.0 -m 8 9 1.0 -m 9 8 1.0 -m 8 13 1.0 -m 13 8 1.0
-m 8 14 1.0 -m 14 8 1.0 -m 9 10 1.0 -m 10 9 0.2 -m 9 14 1.0 -m 14 9 1.0
-m 9 15 1.0 -m 15 9 0.2 -m 10 15 1.0 -m 15 10 1.0 -m 11 12 0.2
-m 12 11 1.0 -m 11 16 0.2 -m 16 11 1.0 -m 12 13 1.0 -m 13 12 1.0
-m 12 16 1.0 -m 16 12 1.0 -m 12 17 1.0 -m 17 12 1.0 -m 13 14 1.0
-m 14 13 1.0 -m 13 17 1.0 -m 17 13 1.0 -m 13 18 1.0 -m 18 13 1.0
-m 14 15 1.0 -m 15 14 0.2 -m 14 18 1.0 -m 18 14 1.0 -m 14 19 1.0
-m 19 14 0.2 -m 15 19 1.0 -m 19 15 1.0 -m 15 20 1.0 -m 20 15 1.0
-m 16 17 1.0 -m 17 16 1.0 -m 17 18 1.0 -m 18 17 1.0 -m 18 19 1.0
-m 19 18 0.2 -m 19 20 1.0 -m 20 19 1.0
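Whether a set of `-n` and `-m` flags describes conservative migration can be checked mechanically. The Python sketch below parses the two kinds of flags from a command string and reports every directed edge for which the product of deme size and migration rate differs between the two directions. The function names are hypothetical, demes without an `-n` flag are assumed to have relative size 1, and the parser ignores all other options.

```python
def parse_sizes_and_rates(command):
    """Extract {deme: size} from '-n i x' flags and {(i, j): rate} from '-m i j x' flags."""
    tokens = command.split()
    sizes, rates = {}, {}
    pos = 0
    while pos < len(tokens):
        if tokens[pos] == "-n":
            sizes[int(tokens[pos + 1])] = float(tokens[pos + 2])
            pos += 3
        elif tokens[pos] == "-m":
            rates[(int(tokens[pos + 1]), int(tokens[pos + 2]))] = float(tokens[pos + 3])
            pos += 4
        else:
            pos += 1
    return sizes, rates


def nonconservative_edges(sizes, rates, tol=1e-9):
    """List directed edges (i, j) where size_i * rate_ij differs from size_j * rate_ji."""
    bad = []
    for (i, j), rate_ij in rates.items():
        rate_ji = rates.get((j, i), 0.0)
        if abs(sizes.get(i, 1.0) * rate_ij - sizes.get(j, 1.0) * rate_ji) > tol:
            bad.append((i, j))
    return sorted(bad)
```

Applied to an excerpt of the command above, `-n 2 5.0 -n 3 1.0 -m 2 3 0.2 -m 3 2 1.0`, the check finds no violations because $5 \times 0.2 = 1 \times 1$.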
In the second version, differences in migration rates compensate for differences in deme size because $N_\gamma M_{\gamma\varepsilon} = N_\varepsilon M_{\varepsilon\gamma}$ for all edges $(\gamma, \varepsilon) \in E$. The result is no variation in effective migration although both the deme sizes and the migration rates vary across the habitat.
ms 20 1 -s 1 -I 20 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
-n 1 5.0 -n 2 5.0 -n 3 1.0 -n 4 1.0 -n 5 1.0 -n 6 5.0 -n 7 1.0
-n 8 1.0 -n 9 1.0 -n 10 5.0 -n 11 5.0 -n 12 1.0 -n 13 1.0 -n 14 1.0
-n 15 5.0 -n 16 1.0 -n 17 1.0 -n 18 1.0 -n 19 5.0 -n 20 5.0
-m 1 2 0.2 -m 2 1 0.2 -m 1 6 0.2 -m 6 1 0.2 -m 2 3 0.2 -m 3 2 1.0
-m 2 6 0.2 -m 6 2 0.2 -m 2 7 0.2 -m 7 2 1.0 -m 3 4 1.0 -m 4 3 1.0
-m 3 7 1.0 -m 7 3 1.0 -m 3 8 1.0 -m 8 3 1.0 -m 4 5 1.0 -m 5 4 1.0
-m 4 8 1.0 -m 8 4 1.0 -m 4 9 1.0 -m 9 4 1.0 -m 5 9 1.0 -m 9 5 1.0
-m 5 10 1.0 -m 10 5 0.2 -m 6 7 0.2 -m 7 6 1.0 -m 6 11 0.2 -m 11 6 0.2
-m 6 12 0.2 -m 12 6 1.0 -m 7 8 1.0 -m 8 7 1.0 -m 7 12 1.0 -m 12 7 1.0
-m 7 13 1.0 -m 13 7 1.0 -m 8 9 1.0 -m 9 8 1.0 -m 8 13 1.0 -m 13 8 1.0
-m 8 14 1.0 -m 14 8 1.0 -m 9 10 1.0 -m 10 9 0.2 -m 9 14 1.0 -m 14 9 1.0
-m 9 15 1.0 -m 15 9 0.2 -m 10 15 0.2 -m 15 10 0.2 -m 11 12 0.2
-m 12 11 1.0 -m 11 16 0.2 -m 16 11 1.0 -m 12 13 1.0 -m 13 12 1.0
-m 12 16 1.0 -m 16 12 1.0 -m 12 17 1.0 -m 17 12 1.0 -m 13 14 1.0
-m 14 13 1.0 -m 13 17 1.0 -m 17 13 1.0 -m 13 18 1.0 -m 18 13 1.0
-m 14 15 1.0 -m 15 14 0.2 -m 14 18 1.0 -m 18 14 1.0 -m 14 19 1.0
-m 19 14 0.2 -m 15 19 0.2 -m 19 15 0.2 -m 15 20 0.2 -m 20 15 0.2
-m 16 17 1.0 -m 17 16 1.0 -m 17 18 1.0 -m 18 17 1.0 -m 18 19 1.0
-m 19 18 0.2 -m 19 20 0.2 -m 20 19 0.2

7.8.3 Spatial structure due to a split event

Here the effect of a barrier to effective migration is produced by a past event that zeroes out some migration rates and thus disconnects the "east" and "west" regions of the habitat. The split is instantaneous and occurs $3N_0$ generations back in the past. This creates a barrier in time that is detected as a barrier to effective migration.

Figure 7.5: Barrier to effective migration due to a split in time and otherwise uniform migration rates. The dashed edges are disconnected at the same time in the past.
ms 20 1 -s 1 -I 20 4 3 0 0 0 3 0 0 0 0 0 0 0 0 3 0 0 0 3 4 0
-m 1 2 1.0 -m 2 1 1.0 -m 1 6 1.0 -m 6 1 1.0 -m 2 3 1.0 -m 3 2 1.0
-m 2 6 1.0 -m 6 2 1.0 -m 2 7 1.0 -m 7 2 1.0 -m 5 10 1.0 -m 10 5 1.0
-m 6 7 1.0 -m 7 6 1.0 -m 6 11 1.0 -m 11 6 1.0 -m 6 12 1.0 -m 12 6 1.0
-m 9 10 1.0 -m 10 9 1.0 -m 9 15 1.0 -m 15 9 1.0 -m 10 15 1.0
-m 15 10 1.0 -m 11 12 1.0 -m 12 11 1.0 -m 11 16 1.0 -m 16 11 1.0
-m 14 15 1.0 -m 15 14 1.0 -m 14 19 1.0 -m 19 14 1.0 -m 15 19 1.0
-m 19 15 1.0 -m 15 20 1.0 -m 20 15 1.0 -m 18 19 1.0 -m 19 18 1.0
-m 19 20 1.0 -m 20 19 1.0
-em 3.0 3 7 1.0 -em 3.0 3 8 1.0 -em 3.0 3 4 1.0 -em 3.0 4 3 1.0
-em 3.0 4 8 1.0 -em 3.0 4 9 1.0 -em 3.0 4 5 1.0 -em 3.0 5 4 1.0
-em 3.0 5 9 1.0 -em 3.0 7 12 1.0 -em 3.0 7 13 1.0 -em 3.0 7 8 1.0
-em 3.0 7 3 1.0 -em 3.0 8 7 1.0 -em 3.0 8 13 1.0 -em 3.0 8 14 1.0
-em 3.0 8 9 1.0 -em 3.0 8 4 1.0 -em 3.0 8 3 1.0 -em 3.0 9 8 1.0
-em 3.0 9 14 1.0 -em 3.0 9 5 1.0 -em 3.0 9 4 1.0 -em 3.0 12 16 1.0
-em 3.0 12 17 1.0 -em 3.0 12 13 1.0 -em 3.0 12 7 1.0 -em 3.0 13 12 1.0
-em 3.0 13 17 1.0 -em 3.0 13 18 1.0 -em 3.0 13 14 1.0 -em 3.0 13 8 1.0
-em 3.0 13 7 1.0 -em 3.0 14 13 1.0 -em 3.0 14 18 1.0 -em 3.0 14 9 1.0
-em 3.0 14 8 1.0 -em 3.0 16 17 1.0 -em 3.0 16 12 1.0 -em 3.0 17 16 1.0
-em 3.0 17 18 1.0 -em 3.0 17 13 1.0 -em 3.0 17 12 1.0 -em 3.0 18 17 1.0
-em 3.0 18 14 1.0 -em 3.0 18 13 1.0
8 Bibliography

D. Babić, D. J. Klein, I. Lukovits, S. Nikolić, and N. Trinajstić. Resistance-distance matrix: A computational algorithm and its application. International Journal of Quantum Chemistry, 90(1):166–176, 2002.

M. Bahlo and R. C. Griffiths. Coalescence time for two genes from a subdivided population. Journal of Mathematical Biology, 43(5):397–410, 2001.

R. B. Bapat. Resistance matrix of a weighted graph. MATCH: Communications in Mathematical and in Computer Chemistry, 50:73–82, 2004.

R. B. Bapat and T. E. S. Raghavan. Nonnegative matrices and applications. Cambridge University Press, 1997.

P. Beerli and J. Felsenstein. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proceedings of the National Academy of Sciences (PNAS), 98(8):4563–4568, 2001.

C. A. Brewer, G. W. Hatchard, and M. A. Harrower. ColorBrewer in print: a catalog of color schemes for maps. Cartography and Geographic Information Science, 30(1):5–32, 2003.

S. D. Byers and A. E. Raftery. Bayesian estimation and segmentation of spatial point processes using Voronoi tilings. In Andrew B. Lawson and David G. T. Denison, editors, Spatial Cluster Modeling, pages 109–121. Chapman & Hall, 2002.

L. L. Cavalli-Sforza, P. Menozzi, and A. Piazza. The history and geography of human genes. Princeton University Press, 1994.

A. K. Chandra, P. Raghavan, W. L. Ruzzo, R. Smolensky, and P. Tiwari. The electrical resistance of a graph captures its commute and cover times. Computational Complexity, 6(4):312–340, 1996.

A. G. Clark, M. J. Hubisz, C. D. Bustamante, S. H. Williamson, and R. Nielsen. Ascertainment bias in studies of human genome-wide polymorphism. Genome Research, 15(11):1496–1502, 2005.

C. C. Cockerham. Variance of gene frequencies. Evolution, 23(1):72–84, 1969.

J. Felsenstein. A pain in the torus: Some difficulties with models of isolation by distance. The American Naturalist, 109(967):359–368, 1975.

J. C. Gower. Properties of Euclidean and non-Euclidean distance matrices. Linear Algebra and its Applications, 67(1):81–97, 1985.
G. Guillot, A. Estoup, F. Mortier, and J. F. Cosson. A spatial statistical model for landscape genetics. Genetics, 170(3):1261–1280, 2005.

E. M. Hanks and M. B. Hooten. Circuit theory and model-based inference for landscape connectivity. Journal of the American Statistical Association, 108(501):22–33, 2013.

J. Hey. A multi-dimensional coalescent process applied to multi-allelic selection models and migration models. Theoretical Population Biology, 39(1):30–48, 1991.

M. W. Horton, A. M. Hancock, Y. S. Huang, C. Toomajian, S. Atwell, et al. Genome-wide patterns of genetic variation in worldwide Arabidopsis thaliana accessions from the RegMap panel. Nature Genetics, 44(2):212–216, 2012.

M. J. Hubisz, D. Falush, M. Stephens, and J. K. Pritchard. Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources, 9(5):1322–1332, 2009.

R. R. Hudson. Gene genealogies and the coalescent process. In Douglas Futuyma and Janis Antonovics, editors, Oxford surveys in evolutionary biology, volume 7, pages 1–44. Oxford University Press, 1990.

R. R. Hudson. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338, 2002.

M. Kimura. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61(4):893–903, 1969.

M. Kimura and G. H. Weiss. The stepping stone model of population structure and the decrease of genetic correlation with distance. Genetics, 49(4):561–576, 1964.

J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19(A):27–43, 1982a.

J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications, 13(3):235–248, 1982b.

O. Lao, T. T. Lu, M. Nothnagel, O. Junge, S. Freitag-Wolf, A. Caliebe, et al. Correlation between genetic and geographic structure in Europe. Current Biology, 18(16):1241–1248, 2008.

D. J. Lawson and D. Falush. Population identification using genetic data. Annual Review of Genomics and Human Genetics, 13:337–361, 2012.

J. Y. Lee and S. V. Edwards. Divergence across Australia's Carpentarian barrier: Statistical phylogeography of the red-backed fairy wren Malurus melanocephalus. Evolution, 62(12):3117–3134, 2008.

D. A. Levin, Y. Peres, and E. L. Wilmer. Markov chains and mixing times. American Mathematical Society, 2008.

P. McCullagh. Marginal likelihood for distance matrices. Statistica Sinica, 19:631–649, 2009.

B. H. McRae. Isolation by resistance. Evolution, 60(8):1551–1561, 2006.
B. H. McRae, B. G. Dickson, T. H. Keitt, and V. B. Shah. Using circuit theory to model connectivity in ecology, evolution, and conservation. Ecology, 89(10):2712–2742, 2008.

G. McVean. A genealogical interpretation of principal components analysis. PLoS Genetics, 5(10):e1000686, 2009.

P. Menozzi, A. Piazza, and L. L. Cavalli-Sforza. Synthetic maps of human gene frequencies in Europeans. Science, 201(4358):786–792, 1978.

T. Nagylaki. The strong-migration limit in geographically structured populations. Journal of Mathematical Biology, 9(2):101–114, 1980.

M. Nei. Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences (PNAS), 70(12):3321–3323, 1973.

M. R. Nelson, K. Bryc, K. S. King, A. Indap, A. R. Boyko, J. Novembre, et al. The population reference sample, POPRES: A resource for population, disease, and pharmacological genetics research. The American Journal of Human Genetics, 83(3):347–358, 2008.

R. Nielsen. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics, 154(2):931–942, 2000.

M. Nordborg, T. T. Hu, Y. Ishino, J. Jhaveri, C. Toomajian, H. Zheng, E. Bakker, et al. The pattern of polymorphism in Arabidopsis thaliana. PLoS Biology, 3(7):e196, 2005.

M. Notohara. The coalescent and the genealogical process in geographically structured population. Journal of Mathematical Biology, 29(1):59–75, 1990.

M. Notohara. The strong-migration limit for the genealogical process in geographically structured populations. Journal of Mathematical Biology, 31(2):115–122, 1993.

J. Novembre and M. Stephens. Interpreting principal component analyses of spatial population genetic variation. Nature Genetics, 40(5):646–649, 2008.

J. Novembre, T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko, A. Auton, A. Indap, K. S. King, S. Bergmann, M. R. Nelson, M. Stephens, and C. D. Bustamante. Genes mirror geography within Europe. Nature, 456(7218):98–101, 2008.

A. Okabe, B. Boots, K. Sugihara, and S. N. Chiu. Spatial tessellations: concepts and applications of Voronoi diagrams. Wiley Series in Probability and Statistics. Wiley, 2000.

A. Platt, M. Horton, Y. S. Huang, Y. Li, A. E. Anastasio, et al. The scale of population structure in Arabidopsis thaliana. PLoS Genetics, 6(2):e1000843, 2010.

A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.

A. L. Price, N. A. Zaitlen, D. Reich, and N. Patterson. New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics, 11(7):459–463, 2010.

J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.
N. A. Rosenberg, S. Mahajan, S. Ramachandran, C. Zhao, J. K. Pritchard, and M. W. Feldman. Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genetics, 1(6):e70, 2005.

F. Rousset. Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance. Genetics, 145(4):1219–1228, 1997.

F. Rousset. Genetic structure and selection in subdivided populations. Princeton University Press, 2004.

F. Rousset. GENEPOP'007: a complete re-implementation of the GENEPOP software for Windows and Linux. Molecular Ecology Resources, 8:103–106, 2008.

D. Serre and S. Pääbo. Evidence for gradients of human genetic diversity within and among continents. Genome Research, 14(9):1679–1685, 2004.

M. Slatkin. Inbreeding coefficients and coalescence times. Genetical Research, 58(2):167–175, 1991.

M. Stephens. Bayesian analysis of mixture models with an unknown number of components – an alternative to reversible jump methods. The Annals of Statistics, 28(1):40–74, 2000.

C. Strobeck. Average number of nucleotide differences in a sample from a single subpopulation: a test for population subdivision. Genetics, 117(1):149–153, 1987.

C. Tian, R. M. Plenge, M. Ransom, A. Lee, P. Villoslada, C. Selmi, et al. Analysis and application of European genetic substructure using 300K SNP information. PLoS Genetics, 4(1):e4, 2008.

M. N. M. van Lieshout. Markov point processes and their applications. Imperial College Press, 2000.

A. P. Verbyla. A conditional derivation of residual maximum likelihood. Australian Journal of Statistics, 32(2):227–230, 1990.

C. Wang, Z. A. Szpiech, J. H. Degnan, M. Jakobsson, T. J. Pemberton, J. A. Hardy, A. B. Singleton, and N. A. Rosenberg. Comparing spatial maps of human population-genetic variation using Procrustes analysis. Statistical Applications in Genetics and Molecular Biology, 9(1):Article 13, 2010.

C. Wang, S. Zöllner, and N. A. Rosenberg. A quantitative comparison of the similarity between genes and geography in worldwide human populations. PLoS Genetics, 8(8):e1002886, 2012.

S. K. Wasser, A. M. Shedlock, K. Comstock, E. A. Ostrander, B. Mutayoba, and M. Stephens. Assigning African elephant DNA to geographic region of origin: Applications to the ivory trade. Proceedings of the National Academy of Sciences (PNAS), 101(41):14847–14852, 2004.

S. K. Wasser, C. Mailand, R. Booth, B. Mutayoba, E. Kisamo, B. Clark, and M. Stephens. Using DNA to track the origin of the largest ivory seizure since the 1989 trade ban. Proceedings of the National Academy of Sciences (PNAS), 104(10):4228–4233, 2007.

G. H. Weiss and M. Kimura. A mathematical analysis of the stepping stone model of genetic correlation. Journal of Applied Probability, 2(1):129–149, 1965.

S. Wright. Isolation by distance. Genetics, 28(2):114–138, 1943.