effects of sample selection bias on the accuracy of...

11
INVESTIGATION Effects of Sample Selection Bias on the Accuracy of Population Structure and Ancestry Inference Suyash Shringarpure* and Eric P. Xing ,1 *Department of Genetics, Stanford University, Stanford, California 94305, and School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213 ABSTRACT Population stratication is an important task in genetic analyses. It provides information about the ancestry of individuals and can be an important confounder in genome-wide association studies. Public genotyping projects have made a large number of datasets available for study. However, practical constraints dictate that of a geographical/ethnic population, only a small number of individuals are genotyped. The resulting data are a sample from the entire population. If the distribution of sample sizes is not representative of the populations being sampled, the accuracy of population stratication analyses of the data could be affected. We attempt to understand the effect of biased sampling on the accuracy of population structure analysis and individual ancestry recovery. We examined two commonly used methods for analyses of such datasets, ADMIXTURE and EIGENSOFT, and found that the accuracy of recovery of population structure is affected to a large extent by the sample used for analysis and how representative it is of the underlying populations. Using simulated data and real genotype data from cattle, we show that sample selection bias can affect the results of population structure analyses. We develop a mathematical framework for sample selection bias in models for population structure and also proposed a correction for sample selection bias using auxiliary information about the sample. We demonstrate that such a correction is effective in practice using simulated and real data. KEYWORDS admixture analysis global ancestry biased sampling population structure Population stratication is an important task in genetic analyses. It provides information about the genetic ancestry of individuals and evolutionary history of populations (Rosenberg et al. 2002) and can be used to correct for confounding effects in genetic association studies (Price et al. 2006). A large number of human genetic datasets such as the HAPMAP (Gibbs 2003), Human Genome Diversity Project (Cavalli-Sforza 2005) along with a smaller number from other organisms are available for study. Datasets that sample a number of individuals from a specic region also have been analyzed to look for evidence of population stratication. These datasets contain individuals from geo- graphically and ethnically diverse populations. Due to practical con- straints, only a small number of individuals from each population are genotyped, and the resulting data are a sample from the entire pop- ulation. This often means that the sample selected for analysis is a bi- ased sample from the underlying populations. This problem is also encountered when multiple datasets are combined to detect popula- tion structure analysis with better resolution. We hypothesize that if the distribution of sample sizes is not representative of the populations being sampled, the accuracy of population stratication analyses of the data could be affected because a fundamental assumption of statistical learning algorithms is that the sample available for analysis is representative of the entire population distribution. Although most algorithms are robust to minor violations of this assumption, sampling bias in the case of genetic datasets may be too large for algorithms to accurately recover stratication. In this work, we develop a mathematical framework for modeling sample selection bias in genotype data. Our experiments on simulated data show that accuracy of population stratication and recovery of individual ancestry are affected to a large extent by the sampling bias in the data collection process. Both likelihood-based methods and eigenanalysis show sensitivity to the effects of sampling bias. We show that sample selection bias can affect population structure analysis of genotype data from cattle. We also propose a mathematical framework Copyright © 2014 Shringarpure and Xing doi: 10.1534/g3.113.007633 Manuscript received May 18, 2013; accepted for publication March 10, 2014; published Early Online March 17, 2014. This is an open-access article distributed under the terms of the Creative Commons Attribution Unported License (http://creativecommons.org/licenses/ by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Supporting information is available online at http://www.g3journal.org/lookup/ suppl/doi:10.1534/g3.113.007633/-/DC1. 1 Corresponding author: 8101 Gates-Hillman Center (GHC), SCS, Carnegie Mellon University Pittsburgh, PA 15213. E-mail: [email protected] Volume 4 | May 2014 | 901

Upload: others

Post on 15-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Effects of Sample Selection Bias on the Accuracy of ...epxing/papers/2014/Shringarpure_Xing_GGG...The problem of sample selection bias was first widely studied in econometrics, where

INVESTIGATION

Effects of Sample Selection Bias on the Accuracy ofPopulation Structure and Ancestry InferenceSuyash Shringarpure and Eric P Xingdagger1

Department of Genetics Stanford University Stanford California 94305 and daggerSchool of Computer ScienceCarnegie Mellon University Pittsburgh Pennsylvania 15213

ABSTRACT Population stratification is an important task in genetic analyses It provides information aboutthe ancestry of individuals and can be an important confounder in genome-wide association studies Publicgenotyping projects have made a large number of datasets available for study However practicalconstraints dictate that of a geographicalethnic population only a small number of individuals aregenotyped The resulting data are a sample from the entire population If the distribution of sample sizes isnot representative of the populations being sampled the accuracy of population stratification analyses ofthe data could be affected We attempt to understand the effect of biased sampling on the accuracy ofpopulation structure analysis and individual ancestry recovery We examined two commonly used methodsfor analyses of such datasets ADMIXTURE and EIGENSOFT and found that the accuracy of recovery ofpopulation structure is affected to a large extent by the sample used for analysis and how representative it isof the underlying populations Using simulated data and real genotype data from cattle we show thatsample selection bias can affect the results of population structure analyses We develop a mathematicalframework for sample selection bias in models for population structure and also proposed a correction forsample selection bias using auxiliary information about the sample We demonstrate that such a correctionis effective in practice using simulated and real data

KEYWORDS

admixtureanalysis

global ancestrybiased samplingpopulationstructure

Population stratification is an important task in genetic analyses Itprovides information about the genetic ancestry of individuals andevolutionary history of populations (Rosenberg et al 2002) and can beused to correct for confounding effects in genetic association studies(Price et al 2006) A large number of human genetic datasets suchas the HAPMAP (Gibbs 2003) Human Genome Diversity Project(Cavalli-Sforza 2005) along with a smaller number from other organismsare available for study Datasets that sample a number of individualsfrom a specific region also have been analyzed to look for evidence ofpopulation stratification These datasets contain individuals from geo-graphically and ethnically diverse populations Due to practical con-

straints only a small number of individuals from each population aregenotyped and the resulting data are a sample from the entire pop-ulation This often means that the sample selected for analysis is a bi-ased sample from the underlying populations This problem is alsoencountered when multiple datasets are combined to detect popula-tion structure analysis with better resolution

We hypothesize that if the distribution of sample sizes is notrepresentative of the populations being sampled the accuracy ofpopulation stratification analyses of the data could be affected becausea fundamental assumption of statistical learning algorithms is that thesample available for analysis is representative of the entire populationdistribution Although most algorithms are robust to minor violationsof this assumption sampling bias in the case of genetic datasets maybe too large for algorithms to accurately recover stratification

In this work we develop a mathematical framework for modelingsample selection bias in genotype data Our experiments on simulateddata show that accuracy of population stratification and recovery ofindividual ancestry are affected to a large extent by the sampling biasin the data collection process Both likelihood-based methods andeigenanalysis show sensitivity to the effects of sampling bias We showthat sample selection bias can affect population structure analysis ofgenotype data from cattle We also propose a mathematical framework

Copyright copy 2014 Shringarpure and Xingdoi 101534g3113007633Manuscript received May 18 2013 accepted for publication March 10 2014published Early Online March 17 2014This is an open-access article distributed under the terms of the CreativeCommons Attribution Unported License (httpcreativecommonsorglicensesby30) which permits unrestricted use distribution and reproduction in anymedium provided the original work is properly citedSupporting information is available online at httpwwwg3journalorglookupsuppldoi101534g3113007633-DC11Corresponding author 8101 Gates-Hillman Center (GHC) SCS Carnegie MellonUniversity Pittsburgh PA 15213 E-mail epxingcscmuedu

Volume 4 | May 2014 | 901

to correct for sample selection bias in ancestry inference reduceits effects on ancestry estimates We show how such a correctioncan be implemented in practice and demonstrate its effectivenesson simulated and real data

RELATED WORKWe briefly examine methods that can be used for population structureanalysis and the factors that affect their accuracy We also examinerelated work on addressing the problem of sample selection bias indifferent contexts

Methods of population structure analysisA variety of methods have been developed for detecting populationstructure The two main classes of methods used for detectingpopulation structure are model-based methods and eigenanalysisModel-based methods use an explicit admixture model of how thepopulation sample was formed from its ancestral populations TheSTRUCTURE model by Pritchard et al (2000) was one of the earlymethods of this class that is commonly used Extensions to theSTRUCTURE method have been proposed to account for other ob-served evolutionary processes (Falush et al 2003 Huelsenbeck andAndolfatto 2007 Shringarpure and Xing 2009) The frappe method byTang et al (2005) and the ADMIXTURE method by Alexander et al(2009) are alternative ways of solving the optimization problem un-derlying the STRUCTURE model They allow us to efficiently analyzedatasets of large size

The eigenanalysis methods proposed by Price et al (2006) andPatterson et al (2006) project genetic data from individuals intoa low-dimensional space formed from the eigenvectors of the geneticsample Since they do not assume a specific model of populationevolution they can be used in a variety of evolutionary scenarios

Engelhardt and Stephens (2010) showed that the model-basedapproach and the eigenanalysis-based approach to stratification couldboth be interpreted as different ways of factorizing the genotype ma-trix of the given data which suggests that both methods are relateddespite their apparent differences

Factors affecting the accuracy of stratificationA number of factors are known to affect the accuracy of populationstratification and individual ancestry recovery In one of the early workson model-based methods for population stratification Pritchard et al(2000) showed that the number of loci available for analysis had a sig-nificant effect on the recovery of individual ancestry using STRUCTUREKaeuffer et al (2007) studied the effect of linkage disequilibrium onrecovery of population structure using simulated data McVean (2009)suggested an interpretation of the eigenanalysis method that is the basisof the EIGENSOFT method in terms of the coalescence times of indi-viduals They also examined the performance of eigenanalysis in diversedemographic scenarios In the following we discuss the problem ofsample selection bias and some related work about the effect of biasedsampling on accuracy of population stratification

Sample selection biasA common assumption of statistical algorithms is that the available sampleis representative of the underlying population In reality this assumptionmay not always be true Sample selection bias is any systematic differencebetween the sample and the population It affects the internal validity of ananalysis by leading to inaccurate estimation of relationships betweenvariables It also can affect the external validity of an analysis because theresults from a biased sample may not generalize to the population

The problem of sample selection bias was first widely studied ineconometrics where it appeared as a bias among survey respondersHeckman (1979) provided a method of addressing this problem inlinear regression models by estimating the probability of an individualbeing included in the sample Sample selection bias also has beenaddressed in the statistics and machine learning literature by attemptsto understand its effect on classifiers and how estimation and pre-diction can be made correctly in the presence of sampling bias (Vella1998 Zadrozny 2004 Davidson and Zadrozny 2005 Cortes et al2008) Zadrozny (2004) discusses the properties of learning algorithmsand the effect of sample selection bias on their accuracy It also out-lines a possible way of correcting for sample selection bias providedwe know the bias Sample selection bias is also studied in ecologywhen one is trying to model species distributions using presence-onlydata (Phillips et al 2009)

In population genetics sample selection bias could be a seriousproblem because the estimates of ancestry obtained from stratifi-cation analyses often are used to make inferences about the geneticancestry of the sampled individuals The inferred individualancestries also are used as input in correcting for stratification inassociation studies [for instance in ADMIXMAP (Hoggart et al2003)] Pritchard et al (2000) suggest that detecting stratification isdifficult unless a significant number of unmixed individuals fromeach ancestral (or pseudo-ancestral) population is present in thesample This observation was verified by Tang et al (2005) throughexperiments on a small number of simulated datasets To ourknowledge there has not been a systematic study of the effect ofsample selection bias on the accuracy of population structure re-covery and individual ancestry recovery

We propose to study the effect of sample selection bias on theaccuracy of population stratification and individual ancestry recoveryusing a model-based approach (ADMIXTURE) and an eigenanalysis-based approach (EIGENSOFT) Since the analysis of McVean (2009)provides guidelines on the effect of sampling bias on stratificationaccuracy using eigenanalysis we will focus our attention mainly onprobabilistic models such as ADMIXTURE

A MATHEMATICAL FRAMEWORK FOR SAMPLESELECTION BIASWe will consider the problem of studying genotype data usinga probabilistic model Ordinarily for probabilistic modeling we wouldassume that we have genotypes (g) drawn independently from a dis-tribution D (with domain G) over the feature space G We assume thatour points (g u s) are drawn independently from a distribution Dover G middot U middot S where G is the space of genotypes U are someauxiliary features of the data that are not of direct interest for mod-eling and S is a binary space The variable s controls the selection ofpoints (1 means the point is selected and is observed in our sample0 means the point is not selected) Our observed sample contains onlypoints that have s = 1 We will refer to this as the selected sample andrefer to its distribution as D9

We consider the setting where s is independent of g given u that isP(s|g u) = P(s|u) This setting where the selection is controlled byfeatures different from the genotype we want to model arises fre-quently in real applications In population genetics whether an in-dividual is included in a genotyping study often depends on factorssuch as geographical location

Sample selection bias correctionIt is easy to see that if g is independent of u in the previous settingthen sample selection bias has no effect and the probability of g in the

902 | S Shringarpure and E P Xing

selected sample is the same as probability of g under D (asymptotically)If g and u are not independent then we can write using Bayes rule

Pethg uTHORN frac14 Peths frac14 1THORNPethg ujs frac14 1THORNPeths frac14 1jg uTHORN (1)

frac14 Peths frac14 1THORNPethg ujs frac14 1THORNPeths frac14 1juTHORN (2)

which can be rewritten as

PDethg uTHORN frac14 Peths frac14 1THORNPD9ethg uTHORNPeths frac14 1juTHORN (3)

where D9 represents the distribution of the selected sample Since theterm P(s = 1) is constant with respect to (g u) we can say that

PDethg uTHORN frac14 c middot PD9ethg uTHORNPeths frac14 1juTHORN (4)

where c is a constant that need not be evaluated for tasks such aslearning model parameters Therefore to model PD(g u) accuratelyto a multiplicative constant we can follow the procedure below

1 Compute PD9(g u) using a model learned on the selected sample2 Apply a correction using the term P(s = 1|u) This can be done in

two ways(a) If we know the selection procedure we know P(s = 1|u) and

can directly use it in the model(b) If the selection procedure is not known but we have access to

large number of points for which we know (u s) but not gwe can estimate P(s = 1|u) In the population genetics exam-ple this would correspond to knowing the language or geo-graphical region of an individual and whether or not theycould have been included in the study (since genotypingindividuals to find g for a large number of individuals wouldbe expensive)

However this analysis which can accurately correct for sampleselection bias in the described setting requires a model of both g andu In most applications we are interested in only modeling g and notu For instance although there is interest in modeling the distributionof genotypes distributions of language or geography are not of directinterest in genetics Therefore we consider a similar analysis in thecase where we only model P(g) and attempt to derive a correction forsample selection bias

Approximate correctionWe consider the case when we only want to model P(g) Proceeding ina similar way as before we can write

PethgTHORN frac14 Peths frac14 1THORNPethgjs frac14 1THORNPeths frac14 1jgTHORN (5)

which can be restated as

PDethgTHORN frac14 c middot PD9ethgTHORNPeths frac14 1jgTHORN (6)

A correction for sample selection bias could therefore be found ifwe could estimate P(s = 1|g) However g is typically high-dimensionalmdash

in genetics applications g may have dimensions from 1000 to 1000000P(s = 1|g) is therefore hard to estimate from the small selected sampleWe propose that since g and u are dependent and u typically hasmuch lower dimensionality than g we can approximate P(s = 1|g)by P(s = 1|u) We can therefore write the correction for sampleselection bias as

PDethgTHORN c middot PD9ethgTHORNPeths frac14 1juTHORN (7)

with the quality of the approximation varying as a function of thedependence between g and u In practice we find that the approxi-mate correction method is adequate for most applications since prob-abilistic models are often robust to some differences between the truedistribution of the data and the distribution of the selected sample

It is important to note that even if the selection is determined inreality by the u variables only the correction proposed in Equation 7is only an approximate correction The exact correction would requirecomputing the term P(s = 1|g) which can be written asP

uPeths frac14 1juTHORNPethujgTHORN The second term is a distribution conditionedon g and is hard to specify due to the high dimensionality of g

Implementing correction in learning Although applying the pro-posed correction for accurate probability modeling only requiresan extra multiplication step implementing the correction inlearning models consistent with the true distribution is morecomplex This problem is well-studied as cost-sensitive learning Adiscussion of the ways in which the correction can be applied toclassifiers can be found in Zadrozny et al (2003) In this work wewill use sampling with replacement to implement the correctionTo perform sampling with replacement we sample points in theselected sample (with replacement) with probability proportionalto their correction factor (1p(s = 1|u)) If the selected samplecontains N points the probability of inclusion for the ith point is

given as 1=pethsfrac141juiTHORNPN

jfrac1411=pethsfrac141jujTHORN

Since we sample with replacement our cor-

rected sample can include non-unique points from the selectedsample

METHODSWe will demonstrate the effects of sample selection bias on theaccuracy of ancestry recovery by using experiments on simulated andreal data We will also show how the approximate correction forsample selection bias is effective in practice

Simulation experimentsTo examine the recovery of individual ancestry we simulated datafrom a two-population admixture scenario using populationsfrom HapMap Phase III as the ancestral populations We used 50unrelated individuals each from the CEU (Utah residents withancestry from northern and western Europe) and YRI (Yoruba inIbadan Nigeria) populations as the two ancestral populationsUsing a forward simulator (see File S1) we simulated a 50-50admixture in a single generation followed by six generations ofrandom mating to create a simulated population of 800 diploidindividuals For our experiments we used a 102-Mb region fromchromosome 1 containing 52040 single-nucleotide polymor-phisms (SNPs) The recombination rate was set to be 1028 persite per generation and the mutation rate was set to be 1028 persite per generation Figure 1 shows the proportion of YRI ancestry

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 903

among the individuals in the simulated population From thefigure we can see that most individuals are admixed and there-fore we refer to this simulated population as admixed For ourancestry inference experiments we also included the remainingCEU-unrelated individuals (38 in total) and the remaining YRI-unrelated individuals (50 in total) in the analysis as proxies for theancestral populations In our experiments we will refer to theCEU and YRI individuals as unmixed individuals

We use the proportion of YRI ancestry denoted as ui to representthe ancestry of the ith individual The population used for our experi-ments therefore contained the following three groups

1 100 CEU individuals obtained by duplicating (and in some casestriplicating) 38 unrelated CEU individuals ui = 0

2 100 YRI individuals obtained by duplicating 50 unrelated YRIindividuals ui = 1

3 800 admixed individuals with ui 2 (0 1)

To study the effects of sampling bias we sample individuals fromthe dataset with replacement to generate a dataset of size x + y + zwhere x is the number of CEU individuals z is the number of YRIindividuals and y is the number of admixed individuals By varying xy and z we can generate smaller datasets with different kinds of biasand deviations from the original dataset We choose x and z from103050100 and y from 1030100200400700 Thus the smallestpossible dataset is 101010 (30 individuals) and the largest possibledataset is 100700100 (900 individuals) We will use Sxyz to refer tothe dataset x y z

In this case g comprises the 52040-dimensional genotypes of theindividuals and u comprises the group memberships of each individ-ual according to their ancestry (u 2 1 2 3 with the three groups asdefined earlier) In general the probability distribution of u may notbe known Therefore as a first guess we will assume that u hasa uniform distribution ie PethuTHORN frac14 1

3 for any value of u This assump-tion means that an individual (regardless of inclusion in the studysample) is equally likely to be from the CEU YRI or the admixedpopulation By design s depends only on u and we can write theprobability distribution P(s = 1|u) as

Peths frac14 1ju frac14 1THORN frac14 Pethu frac14 1js frac14 1THORNPeths frac14 1THORNPethu frac14 1THORN frac14 x

x thorn y thorn zPeths frac14 1THORN

1=3

(8)

Peths frac14 1ju frac14 2THORN frac14 Pethu frac14 2js frac14 1THORNPeths frac14 1THORNPethu frac14 2THORN frac14 z

x thorn y thorn zPeths frac14 1THORN

1=3

(9)

Peths frac14 1ju frac14 3THORN frac14 Pethu frac14 3js frac14 1THORNPeths frac14 1THORNPethu frac14 3THORN frac14 y

x thorn y thorn zPeths frac14 1THORN

1=3

(10)

which we can simplify to write

Peths frac14 1ju frac14 1THORN frac14 x middot C (11)

Peths frac14 1ju frac14 2THORN frac14 z middot C (12)

Peths frac14 1ju frac14 3THORN frac14 y middot C (13)

where C is a constant given by3Peths frac14 1THORNx thorn y thorn z

Evaluation measureA fair evaluation of the results for both ADMIXTURE andEIGENSOFT is difficult to achieve because the individual ancestriesproduced by ADMIXTURE and EIGENSOFT are different in natureWith K ancestral populations an individual ancestry vector producedby ADMIXTURE has the form q1 qk such that

PKkfrac141qk frac14 1

Thus it has only K 2 1 independent components An equivalentrepresentation of ancestry can be produced by projecting an individualon the first K2 1 eigenvectors produced by EIGENSOFT The ancestryvectors that we store as the true ancestry when generating the simula-tion data have the same form as those produced by ADMIXTURE

We use squared correlation between the true ancestry and inferredancestry as the measure for evaluating the accuracy of ancestryinference For the dataset Sxyz let Q = u1 ux+y+z denote the trueproportion of YRI ancestry of the individuals Let Q1 = q11 q1x+y+z denote the ancestry proportions in cluster 1 inferred byADMIXTURE for dataset Sxyz with K = 2 where the second subscriptindexes the individuals in the dataset (and therefore we have that Q2 =q21 q2x+y+z = 1 2 q11 1 2 q1x+y+z Then the accuracymetric for ADMIXTURE is correlation(Q Q1)2 (which is equal tocorrelation(Q Q2)2) The corresponding accuracy measure can beconstructed for EIGENSOFT by replacing Q1 by the projections ofthe individual genotypes on the first eigenvector of the Sxyz genotypematrix For robustness we report the mean and standard error of thesquared correlation over 30 datasets for each parameter setting

RESULTS

Sample selection bias in a simulated CEU-YRI admixtureWe examined the effect of biased sampling of individuals by constructingsubsets of the whole dataset from the CEU-YRI admixture describedearlier and measuring the squared correlation between the true andinferred ancestry (File S2) Figure 2A shows the results of this analysiswith the ADMIXTURE software with six subplots Within each sub-plot the number of admixed individuals in the sample remains con-stant and the number of unmixed individuals from the two ancestralpopulations (CEU and YRI) is varied Figure 2B shows the results ofan identical analysis of the same datasets with the EIGENSOFTsoftware

Overall ADMIXTURE recovered individual ancestry better thanEIGENSOFT Previous work has shown that the accuracy of in-dividual ancestry recovery is a function of the FST differentiationbetween the ancestral populations The results we obtain may varyfor datasets with different FST values between the ancestral popula-tions and quantifying this effect will require more study

Figure 1 Proportion of YRI ancestry among individuals in the simulatedpopulation six generations after admixture

904 | S Shringarpure and E P Xing

Figure 2B shows that the EIGENSOFT results exhibit considerableirregular variation within subplots This is likely to be a result of oursampling scheme for creating biased datasets which allows individualsto be present in a dataset more than once This means that theresulting genotype matrix can have lower rank than expected result-ing in lower accuracy of ancestry inference using EIGENSOFT Thishowever does not cause any difficulties in the ADMIXTURE analysisas seen by the regular patterns in Figure 2A and would be expectedbased on the likelihood model underlying ADMIXTURE We willtherefore only use the results in Figure 2B to draw broad conclusionsabout the behavior of EIGENSOFT and not make specific inferencesor recommendations

In the scenarios in which we have few unmixed individuals fromboth ancestral populations Figure 2 A and B show that the accuracyof individual ancestry recovery drops noticeably This effect is evidentin the subplots with 200 400 and 700 admixed individuals In allsubplots in Figure 2A the results show high accuracy when we havea large number of unmixed individuals from at least one ancestral

population Pritchard et al (2000) and Tang et al (2005) have pre-viously noted that a significant number of unmixed individuals fromeach ancestral population is required for accurate recovery of stratifi-cation An initial examination of the results suggests that may besufficient to have a large number (around 502100) of unmixed indi-viduals from just one of the two ancestral populations to be able tocorrectly resolve stratification

We note that this guideline which is relevant when the number ofadmixed individuals is large does not apply if the dataset contains fewadmixed individuals and few unmixed individuals When there arefew admixed individuals both methods perform well (relative to theiraverage performance) even with as few as 10 unmixed individuals inthe dataset Mantel tests reveal that the high correlations obtainedwith few admixed individuals are statistically significant (P 1023) inall cases

To examine the effect of the number of unmixed individuals onthe ancestry recovery in more detail we looked at the subset of thegenerated subsamples which had the same number of unmixed

Figure 2 Squared correlation between thetrue individual ancestry and the individualancestry inferred using (A) ADMIXTURE withK = 2 and (B) EIGENSOFT with the topeigenvalue The different levelplots are drawnfor different number of admixed individuals inthe dataset The X and Y axes of the plots arelogarithmic in scale

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 905

individuals from both ancestral populations ie subsets of the formSxyz where the number of unmixed individuals x2103050100 andthe number of admixed individuals y21030100200400700 Foreach value of x (the number of unmixed individuals from each an-cestral population present in the dataset) we observed the effect ofvarying y (the number of admixed individuals present in the dataset)on the ancestry recovery Figure 3A shows the results for ADMIX-TURE and Figure 3B shows the results for EIGENSOFT When thenumber of unmixed individuals is large the methods recover ancestrywell and the number of admixed individuals has no effect on accu-racy However when the number of unmixed individuals in the sam-ple is small adding more admixed individuals to the sample reducesthe accuracy of the ancestry recovery for both ADMIXTURE andEIGENSOFT In Figure 3 A and B we see a threshold effect due tothe number of unmixed individuals when the number of unmixedindividuals changes from 30 to 100

The high accuracy of ancestry recovery when there are fewadmixed and few unmixed individuals suggests that earlier hypothesesabout the requirement of a large number of unmixed individuals foraccurate ancestry recovery may be an incomplete explanation Theresults in Figure 2A and Figure 3A along with the likelihood modelunderlying ADMIXTURE suggest that the effect on accuracy maydepend on the ratio of the number of admixed individuals to unmixedindividuals from each population in the sample For notational con-venience we will refer to this ratio as tsample = yx

To examine this hypothesis we replot the data used for Figure 3Aby examining the correlation measure as a function of the ratio ofadmixed individuals to unmixed individuals in the sample Figure 4 Aand B show the results for this visualization for ADMIXTURE andEIGENSOFT respectively From the figure we can see that the effectof sample selection bias can be understood using tsample The accuracyof ancestry recovery is high while the value of tsample is less than 5(approximately) and drops as this ratio increases This behavior isindependent of the exact number of unmixed individuals in the data-set and can be observed for both ADMIXTURE and EIGENSOFT

In our experiments we observe that individual ancestry can berecovered perfectly as long as tsample 5 Decreasing the number ofadmixed individuals has no adverse effect on the accuracy of ancestryrecovery The effects of sample selection bias on the accuracy ofancestry recovery in a two-population admixture scenario usingADMIXTURE can thus be explained in two scenarios (i) whentsample 5 sample selection bias has no effect on the accuracy ofindividual ancestry recovery and (ii) when tsample 5 the accuracy ofindividual ancestry measured using the correlation measure decreaseswith tsample

Correction of sample selection bias in simulated dataWe examinedthree methods of correcting sample selection bias in the simulateddata The methods and their results are described herein

Supervision ADMIXTURE can be run in supervised mode whereindividuals of known ancestry can be specified in advance to belong toexactly one of K populations being inferred This supervision is partialsince only nonadmixed individuals can be part of the supervised inputprovided to the algorithm In our experiments we include supervisionby assigning the individuals of known YRI and CEU ancestry to thetwo ancestral populations We find that supervision produces an im-provement that is statistically significant but practically insignificant( 00001 improvement in squared correlation P 22 middot 10216 ina one-sided paired t-test)

Including more SNPs Our analysis of the CEU-YRI admixtureused 52040 SNPs in a 102-Mb region We found that including more

SNPs in the ADMIXTURE analysis reduced the effects of sampleselection bias for identical sample sizes and samples When 116415SNPs are included in the ADMIXTURE analysis we observe no effectsof sample selection bias on the accuracy of ancestry inferenceAlthough we do not thin SNPs in our analysis thinning the originalset of 52040 SNPs by an r2 threshold of 01 produces a set of 35500SNPs Reanalyzing the data with this subset of SNPs produces a smallloss in accuracy (05 in squared correlation P 22 middot 10216 ina one-sided paired t-test) when compared with the original set of52040 SNPs Thus while adding independent SNPs improves accu-racy of ancestry inference adding or removing linked SNPs has littleeffect on accuracy

Resampling correction We implemented the resampling correc-tion in the section Approximate correction by using Equations 11213Since we do not have an informative prior expectation of the truenumbers of admixed individuals or the unmixed individuals we usedthe uninformative initial estimate of all groups of individuals havingequal population sizes We found that in more than 99 of thecorrected datasets the squared correlation between true and inferredancestry was larger than 095 (mean = 098 standard deviation =001) Thus the resampling correction is effective for correcting sampleselection bias

Impact of effective population size on sampleselection biasIn the previous simulation we used a correction based on anassumption of similar population sizes for the different groups Ingeneral the census (observed) population size and the effective

Figure 3 Effect of adding more admixed individuals to the dataset onthe correlation measure of accuracy when using (A) ADMIXTURE withK = 2 and (B) EIGENSOFT with the top eigenvalue The error barsindicate standard error

906 | S Shringarpure and E P Xing

population size of a population need not be equal due to demographicevents that the population may have undergone We attempt todemonstrate that the effect of biased sampling is dependent on theeffective population size and not only on the census population size

We simulated two populations according to the demographicscenario shown in Figure 5 using the coalescent simulatorms (Hudson2002) The founding population and the populations resulting fromthe split 1000 generations before present were set to have identicalsizes (N0 = 10000) Population 2 then undergoes an instantaneouscontraction 795 generations after the split with its population sizebeing reduced to aN0 The contraction factor a was varied acrossdatasets taking values of 099 (nearly no contraction) 05 (moderatecontraction) or 01 (strong contraction) Two-hundred generationsafter the contraction population 2 instantaneously grows back in sizeto N0 The simulation was continued for five generations after therecovery after which 50 diploid individuals were sampled from eachpopulation Each simulation used a mutation rate and recombinationrate of 1028 per bp per generation In each simulation 200 indepen-dent sites of 100 kb were generated producing between 47585 and58024 SNPs per simulation For each value of a (from 0105099)30 independent datasets were generated to evaluate the robustness ofresults (File S3)

The two populations from a single simulation were pooledtogether to form a 5050 admixture in a single generation to create50 admixed individuals This was followed by five generations ofrandom mating in the resulting admixed population using thesimulator described in the Introduction section We then generated

a sample for study by including the 50 individuals from population 150 admixed individuals and a variable number of individuals (x) frompopulation 2 (x 2 5 10 20 30 40 50) ADMIXTURE was used withK = 2 to infer the ancestry proportions for each individual and thesquared correlation coefficient (as described in the section Evaluationmeasure) for the dataset was used a measure of accuracy

Figure 6 shows the results of the ancestry inference with one curvefor each value of the contraction factor a As expected from the pre-vious experiments as the number of individuals from population 2 inthe dataset increases the accuracy of ancestry inference increaseswhich is demonstrated by each curve We also observe that the curvesfor the three different contraction factors have different accuraciesbefore they converge to a accuracy of almost 1

By our experimental design the two ancestral populations haveidentical census population sizes for each simulation and the censussize for population 2 is identical across simulations However due tothe bottleneck the effective population size of population 2 is reducedwith the amount of reduction dependent to the strength of thecontraction (the calculations for reduction in effective population sizefor population 2 can be found in File S1) Figure 6 thus demonstratesthat effective population size and not just census population size hasan impact on how ancestry inference is affected by sampling biasFrom the figure we can see that for the strongest contraction (a =01 producing the maximum reduction in the effective population sizefor population 2 to 036N0) the accuracy of ancestry inference is higheven with only 5 individuals from population 2 Since effective pop-ulation size is a measure of the genetic diversity of a population thissuggests that the reduced genetic diversity produced by the strongcontraction can be adequately captured even with a small numberof individuals For the same number of individuals from population2 across the curves we see that the accuracy is smallest for theweakest contraction (a = 099) and largest for the strongest contrac-tion (a = 01) Thus even when census population sizes are identicalvariation in effective population size affects the impact of samplingbias accuracy of ancestry inference This suggests that sampling biascan be more accurately captured by defining it in terms of effectivepopulation size rather than census population size

Correction of bias Using the estimate for the effective population sizeof population 2 from File S1 we can implement a resampling

Figure 5 A two-population demographic scenario Populations 1 and2 (of size N0 = 104) are formed from a single population of the samesize 1000 generations before present time Population 2 then under-goes a bottleneck for 200 generations followed by a instantaneousrecovery to its original population size

Figure 4 Effect of the ratio of number admixed individuals to unmixedindividuals in the dataset (tsample) on the correlation measure of accu-racy using (A) ADMIXTURE with K = 2 and (B) EIGENSOFT with the topeigenvalue The error bars indicate standard error The X-axis is loga-rithmic in scale

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 907

correction as before We assume that the effective population size forthe admixed population is identical to that for population 1 ie N0 =104 Figure 7 shows the change in accuracy after application of theresampling correction (relative to the accuracies in Figure 6)

From the figure we can see that while the accuracy for the originaldataset is less than 095 the resampling correction produces animprovement in the results denoted by the portions of the curves inbold However beyond those points as the datasets acquire an excessof the unmixed individuals relative to the admixed individuals we seethat the uncorrected accuracy is high (larger than 095) and theresampling correction decreases the accuracy of ancestry inferenceThus a resampling correction is unnecessary when there is an excessof unmixed individuals relative to their expected number fromeffective population size calculations

Sample selection bias in cattle genotype dataTo demonstrate the effects of sample selection bias on a real geneticdataset we used the data from (McTavish et al 2013ab) who gen-otyped 1461 individuals belonging to cattle breeds across the world at47506 SNPs We focus on admixture between African and Europeancattle breeds as seen in their New World descendants For theirexperiments with STRUCTURE McTavish et al (2013b) used a subsetof 1814 SNPs that was common to all their genotyping chips Our datatherefore consists of genotype data at 1814 SNPs from 40 New Worldindividuals (Texas Longhorns) 100 European individuals (Limousinfrom Southern Europe) and 100 African individuals (NrsquoDama andNrsquoDamaXBoran)

We created multiple subsets from the original data by including allthe 40 admixed New World individuals and a variable number (x) ofunmixed individuals from the European and African populations(with the same number from each population) The number of un-mixed individuals chosen was increased gradually from 10 to 100 (x 210 20 30 40 50 60 80 100) For each value of x we generated 30datasets for statistical robustness ADMIXTURE was used with K = 2to infer ancestry proportions for all individuals The squared correla-tion coefficient for the ancestry proportions of the admixed individ-uals was used as a measure of accuracy with the ancestry proportionsinferred on the full data used as a truth set

Figure 8 shows the results of the ancestry inference as a function ofthe number of unmixed individuals (x) We see that as the number of

unmixed individuals increases the accuracy of ancestry inferenceimproves When the number of unmixed individuals is less than 40the squared correlation measure is less than 09

Correction of bias For the cattle data effective population sizes forthe three populations are unknown We assume that all threepopulations have equal effective population size The correction isimplemented using resampling as before The blue curve in Figure 8shows the effect of the correction on accuracy From the figure we cansee that when the number of admixed individuals is larger than thenumber of unmixed individuals (tsample $ 1 as defined previously)the correction improves accuracy but causes a small drop in accuracywhen there are many unmixed individuals

DISCUSSIONBiased sampling of individuals is a result of practical constraints onthe study design of genotypingsequencing projects In this aspectsample selection bias is similar to bias due to SNP ascertainment Itcan also occur as a result of combining multiple genomic datasetsfrom different sources For instance the overlap of three genotypingchips required McTavish et al (2013b) to use only 1814 SNPs for theirSTRUCTURE analysis

In most stratification analyses the recovered ancestry proportionsare used to make inferences about the ancestry of the sample and thegenetic relationships of similaritydifferences between the studypopulations Ancestry proportion estimates are also used in associa-tion studies to account for effects of stratification Therefore it isessential to have accurate recovery of ancestry Our experimentssuggest that sample selection bias can be a problem in accuratepopulation stratification and recovering individual ancestry Oursimulations show that unlike observations from previous studies theaccuracy of individual ancestry recovery is not dependent only on thenumber of unmixed individuals present in the sample We observea threshold effect where the accuracy of ancestry inference is affectedby sample selection bias depending on the ratio of admixedindividuals to unmixed individuals in the sample Our simulationssuggest that the accuracy of ancestry inference is affected when thisratio tsample is less than 5 For real genotype data from cattle we

Figure 7 Change in correlation measure from Figure 6 after theresampling correction taking into account the impact of the contrac-tion on effective population size Each curve is drawn for a single valueof the contraction factor a and shows the mean and standard error ofthe change in correlation measure over 30 datasets The regions of thecurves in bold show regions where accuracy in the uncorrected results(from Figure 6) is less than 095

Figure 6 Impact of effective population size on the correlationmeasure of accuracy of ancestry inference using ADMIXTURE withK = 2 Each curve is drawn for a single value of the contraction factor aand shows the mean and standard error of the correlation measureover 30 datasets

908 | S Shringarpure and E P Xing

observed that a ratio of tsample larger than 1 was sufficient to affectancestry estimates

Our experiments do not directly demonstrate an effect of FST onsample selection bias However using the 1814 SNPs for the genotypedata we were able to observe effects of sample selection bias onAfrican-European admixture in cattle (FST 015) but unable toobserve any effect of sample selection bias on European-Indian ad-mixture (FST 028 results not shown here) This suggests that FSThas an effect on sample selection bias with well-differentiated pop-ulations being easy to separate even with biased sampling The sim-ulations and real data analysis had relatively large FST between the twoancestral populations ( 01 for the simulations 015 for the cattlegenotype data) but vastly different number of SNPs (52040 for thesimulation but only 1814 for the cattle genotype data) The degrada-tion in performance of ancestry inference in the two cases was con-siderably different suggesting that the effect of sample selection biasalso depends on the number of loci available This was also validatedby the observation that doubling the number of SNPs in simulationeliminated the effects of sample selection bias completely

We also showed through simulations that sample selection biasdepends on the effective population size rather than simply the censuspopulation size of the populations studied This suggests that samplingfor genotypingsequencing projects can be improved (in terms ofdiversity of sampling and cost) by taking into account independentknowledge of the demographic history of the populations beingsampled

Although our analyses use two specific methods (ADMIXTUREand EIGENSOFT) the effects we observe are a feature of theassumptions underlying both methods rather than the specificimplementations ADMIXTURE is a representative of the likelihood-based models that assume (a) admixture between ancestral popula-tions and (b) that modern individual genomes are mixture ofcontributions from different ancestral populations This is themodel underlying STRUCTURE and frappe and to a large extentthe extensions mentioned earlier Likelihood-based methods aresusceptible to effects of sample selection bias since each individual isgiven equal weight in the sample a priori This is a result of thefundamental assumption of learning methods that the sampleobserved is representative of the underlying population distributionEigenanalysis which also weighs each individual equally apriori alsosuffers from a similar problem as the likelihood-based method The

sensitivity of eigenanalysis to sample size variation and outliers is well-known and is also reported by McVean (2009) and Novembre andStephens (2008)

ADMIXTURE and EIGENSOFT are unsupervised methods forperforming ancestry inference ie they do not leverage informationabout individuals of known ancestry to estimate their respective modelparameters ADMIXTURE supports a mode (which is referred to asa supervised mode in the documentation) in which some individualscan be ldquolabeledrdquo ie assigned to have ancestry from only a singlecluster However all the individuals labeled or otherwise are usedto learn the allele frequency parameters of the ADMIXTURE modelIn machine learning literature this mode of learning is commonlycalled ldquosemi-supervisedrdquo learning wherein a combination of labeledand unlabeled data is used to learn model parameters In simulationswe observed that this mode of ancestry inference is also affected bysample selection bias

A different class of methods can estimate local (subcontinental)ancestry from dense genotype data (Sankararaman et al 2008 Priceet al 2009 Wall et al 2011 Baran et al 2012) A common theme of allof these methods is the use of external sources of genotype data asldquoreference panelsrdquo (referred to as ldquotraining datardquo in machine learningliterature) which are used to estimate ancestral allele frequencies orancestral haplotypes The estimated parameters (frequencies or hap-lotypes) are then used to infer the ancestry of the individuals in thestudy sample based on their genotypes (referred to as ldquotest datardquo inmachine learning literature) Therefore these methods can be said tobe performing supervised learning Supervised learning using externalsources of data for estimating model parameters is robust to biasedsampling in the study sample but is sensitive to the reference panelsused for training (Zadrozny et al 2003)

We proposed a resampling correction for sample selection biasusing a mathematical framework modeling the sampling process Theproposed correction requires knowledge of some auxiliary informa-tion about the selection criteria that is correlated with the genotypes(bias is present only if selection is dependent on the genotypesdirectly or indirectly) For genetic datasets geography provides onesuch criterion that is easy to acquire during data collection Using thisinformation we proposed a correction that is easy to implement Weshowed using simulation experiments that such a correction iseffective in practice and leads to more accurate results In terms ofcomputational effort ADMIXTURE has a linear dependence on thenumber of individuals The resampling correction will therefore requiretime proportional to the number of individuals in the resampled setHowever a better implementation of the resampling correction forADMIXTURE would be to be weigh the likelihood contribution fromeach individual in the original sample by its resampling correctionfactor (1p(s = 1|u)) This would allow the correction to be implementedwithout any increase in computational complexity of the ADMIXTUREalgorithm The resampling correction implemented naively can causeproblems for EIGENSOFT due to the possibility of colinearity resultingfrom duplication of individuals as evident in Figure 2B A weightedeigenanalysis would avoid this problem without affecting the efficiencyof the analysis

An alternative method for correcting sample selection bias is toadd more markers to the analysis as observed from the simulationAdding more SNPs to the analysis is an appealing solution because itdoes not require any further assumptions that any statisticalcorrection may require with the exception of the assumption of nolinkage between SNPs However this may not always be possible if thestudy dataset is constructed by intersecting multiple datasets contain-ing the individuals of interest genotyped on different platforms

Figure 8 Variation in accuracy of ancestry inference with changingnumber of unmixed individuals The red curve shows accuracy withoutcorrection and the blue curve shows the accuracy when the correctionis applied

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 909

Our simulation experiments used simple two-population admix-ture scenarios to examine the effects of sample selection bias on theaccuracy of stratification In one set of experiments we used twopopulations (CEU and YRI) that were easily separable and examinedthe admixed population resulting after a single recent admixtureevent In reality the demographic processes underlying the evolutionof populations are much more complicated In such scenarios it isreasonable to expect that the stratification problem may be harder toresolve and would suffer from the effects of sample selection bias moreseverely

A limitation of our proposed correction is the requirement forexternal knowledge of the selection criteria for samples In ourexperiments we either knew the nature of the bias (by design in thesimulation experiments) or assumed that it was known In general theaccuracy of the correction method proposed will strongly depend onthe relationship between the selection criteria and the genotypesCorrections factors therefore may be dataset-specific An alternativefuture direction for correcting sample selection bias would be todevelop models of population structure that can also model theauxiliary factors such as geography or language that may determinethe selection process responsible for the biased sampling There maybe occasions where the nature of the sampling bias is partially knownwithout having sufficient information to allow a correction to bedeveloped For instance analyses of the POPRES dataset (Nelson et al2008) often select individuals who are known to have all four grand-parents of the same ancestry (Novembre et al 2008) In this case thecriteria for selection can be stated explicitly but a mathematical for-mulation of the bias is hard to obtain In such scenarios we recom-mend subsamplingresampling datasets to examine the robustness ofancestry inference results to variation in sampling

In summary our experiments suggest that sample selection biasaffects ancestry inference and depends on factors such as effectivepopulation size genetic differentiation between populations numberof loci available and the ratio of admixed samples to unmixedsamples For analyses of human genotype data hundreds of thousandsof SNPs are typically available Therefore we do not expect sampleselection bias to be a problem for ancestry inference in human dataexcept when the source populations for admixture are not well-differentiated Local ancestry methods which can make use of externalgenotype data also provide a way of avoiding the effects of sampleselection bias We expect that sample selection bias will be of interestin ancestry analysis for model organisms where availability of externalgenotype data are limited and most genotyping chips have relativelyfewer loci When the sampling process is relatively well-understooda mathematical correction as we have proposed will help to reduce theeffects of sample selection bias In our experiments a rough guidelineof using approximately identical number of admixed and unmixedindividuals reduces the effects of biased sampling When the samplingprocess cannot be modeled accurately examining the robustness ofancestry estimates to sampling may help avoid inaccurate ancestryestimation

ACKNOWLEDGMENTSThis research was supported in part by NSF Grant DBI-0546594 andNIH Grant R01GM087694

LITERATURE CITEDAlexander D H J Novembre and K Lange 2009 Fast model-based es-

timation of ancestry in unrelated individuals Genome Res 19 1655ndash1664

Baran Y B Pasaniuc S Sankararaman D G Torgerson C Gignoux et al2012 Fast and accurate inference of local ancestry in Latino popula-tions Bioinformatics 28 1359ndash1367

Cavalli-Sforza L L 2005 The Human Genome Diversity Project pastpresent and future Nat Rev Genet 6 333ndash340

Cortes C M Mohri M Riley and A Rostamizadeh 2008 Sample selec-tion bias correction theory Algorithmic Learning Theory 5254 16

Davidson I and B Zadrozny 2005 An improved categorization of clas-sifiers sensitivity on sample selection bias pp 605ndash608 in Fifth IEEEInternational Conference on Data Mining ICDM05 Available athttpieeexploreieeeorgxplarticleDetailsjsptp=amparnumber=1565737ampqueryText3DAn+improved+categorization+of+classifiers+sensitivity+on+sample+selection+bias Accessed March 26 2014

Engelhardt B E and M Stephens 2010 Analysis of population structurea unifying framework and novel methods based on sparse factor analysisPLoS Genet 6 12

Falush D M Stephens and J K Pritchard 2003 Inference of populationstructure using multilocus genotype data linked loci and correlated allelefrequencies Genetics 164 1567ndash1587

Gibbs R A 2003 The International HapMap Project Nature 426 789ndash796Heckman J J 1979 Sample selection bias as a specification error Econo-

metrica 47 153ndash161Hoggart C J E J Parra M D Shriver C Bonilla R A Kittles et al

2003 Control of confounding of genetic associations in stratified pop-ulations Am J Hum Genet 72 1492ndash1504

Hudson R R 2002 Generating samples under a Wright-Fisher neutralmodel of genetic variation Bioinformatics 18 337ndash338

Huelsenbeck J P and P Andolfatto 2007 Inference of population struc-ture under a Dirichlet process prior Genetics 175 1787ndash1802

Kaeuffer R D Reacuteale D W Coltman and D Pontier 2007 Detectingpopulation structure using STRUCTURE software effect of backgroundlinkage disequilibrium Heredity 99 374ndash380

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013a Data from New World cattle show ancestry from multiple in-dependent domestication events Available at httpdatadryadorgresourcedoi105061dryad42tr0 Accessed March 26 2014

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013b New World cattle show ancestry from multiple independentdomestication events Proc Natl Acad Sci USA 110 E1398ndashE1406

McVean G 2009 A genealogical interpretation of principal componentsanalysis PLoS Genet 5 e1000686

Nelson M R K Bryc K S King A Indap A R Boyko et al 2008 ThePopulation Reference Sample POPRES a resource for population dis-ease and pharmacological genetics research Am J Hum Genet 83 347ndash358

Novembre J and M Stephens 2008 Interpreting principal componentanalyses of spatial population genetic variation Nat Genet 40 646ndash649

Novembre J T Johnson K Bryc Z Kutalik A R Boyko et al2008 Genes mirror geography within Europe Nature 456 98ndash101

Patterson N A L Price and D Reich 2006 Population structure andeigenanalysis PLoS Genet 2 e190

Phillips S J M Dudiacutek J Elith C H Graham A Lehmann et al2009 Sample selection bias and presence-only distribution modelsimplications for background and pseudo-absence data Ecol Appl 19181ndash197

Price A L N J Patterson R M Plenge M E Weinblatt N A Shadicket al 2006 Principal components analysis corrects for stratification ingenome-wide association studies Nat Genet 38 904ndash909

Price A L A Tandon N Patterson K C Barnes N Rafaels et al2009 Sensitive detection of chromosomal segments of distinct ancestryin admixed populations PLoS Genet 5 e1000519

Pritchard J K M Stephens and P Donnelly 2000 Inference of populationstructure from multilocus genotype data Genetics 155 945ndash959

Rosenberg N A J K Pritchard J L Weber H M Cann K K Kidd et al2002 Genetic Structure of Human Populations Science 298 2381ndash2385

Sankararaman S G Kimmel E Halperin and M I Jordan 2008 On theinference of ancestries in admixed populations Genome Res 18 668ndash675

910 | S Shringarpure and E P Xing

Shringarpure S and E P Xing 2009 mStruct Inference of PopulationStructure in Light of Both Genetic Admixing and Allele MutationsGenetics 182 575ndash593

Tang H J Peng P Wang and N J Risch 2005 Estimation of individualadmixture analytical and study design considerations Genet Epidemiol28 289ndash301

Vella F 1998 Estimating models with sample selection bias a surveyJ Hum Resour 33 127ndash169

Wall J D R Jiang C Gignoux G K Chen C Eng et al 2011 Geneticvariation in Native Americans inferred from Latino SNP and rese-quencing data Mol Biol Evol 28 2231ndash2237

Zadrozny B 2004 Learning and evaluating classifiers under sample selec-tion bias pp 114 in Twenty-First International Conference on MachineLearning ICML 04 ACM New York NY

Zadrozny B J Langford and N Abe 2003 Cost-sensitive learning by cost-proportionate example weighting pp 435ndash442 in Third IEEE Interna-tional Conference on Data Mining ICDM 2003 Available at httpieeexploreieeeorgxplloginjsptp=amparnumber=1250950ampurl=http3A2F2Fieeexploreieeeorg2Fxpls2Fabs_alljsp3Farnumber3D1250950 Accessed March 27 2014

Communicating editor D-J De Koning

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 911

Page 2: Effects of Sample Selection Bias on the Accuracy of ...epxing/papers/2014/Shringarpure_Xing_GGG...The problem of sample selection bias was first widely studied in econometrics, where

to correct for sample selection bias in ancestry inference reduceits effects on ancestry estimates We show how such a correctioncan be implemented in practice and demonstrate its effectivenesson simulated and real data

RELATED WORKWe briefly examine methods that can be used for population structureanalysis and the factors that affect their accuracy We also examinerelated work on addressing the problem of sample selection bias indifferent contexts

Methods of population structure analysisA variety of methods have been developed for detecting populationstructure The two main classes of methods used for detectingpopulation structure are model-based methods and eigenanalysisModel-based methods use an explicit admixture model of how thepopulation sample was formed from its ancestral populations TheSTRUCTURE model by Pritchard et al (2000) was one of the earlymethods of this class that is commonly used Extensions to theSTRUCTURE method have been proposed to account for other ob-served evolutionary processes (Falush et al 2003 Huelsenbeck andAndolfatto 2007 Shringarpure and Xing 2009) The frappe method byTang et al (2005) and the ADMIXTURE method by Alexander et al(2009) are alternative ways of solving the optimization problem un-derlying the STRUCTURE model They allow us to efficiently analyzedatasets of large size

The eigenanalysis methods proposed by Price et al (2006) andPatterson et al (2006) project genetic data from individuals intoa low-dimensional space formed from the eigenvectors of the geneticsample Since they do not assume a specific model of populationevolution they can be used in a variety of evolutionary scenarios

Engelhardt and Stephens (2010) showed that the model-basedapproach and the eigenanalysis-based approach to stratification couldboth be interpreted as different ways of factorizing the genotype ma-trix of the given data which suggests that both methods are relateddespite their apparent differences

Factors affecting the accuracy of stratificationA number of factors are known to affect the accuracy of populationstratification and individual ancestry recovery In one of the early workson model-based methods for population stratification Pritchard et al(2000) showed that the number of loci available for analysis had a sig-nificant effect on the recovery of individual ancestry using STRUCTUREKaeuffer et al (2007) studied the effect of linkage disequilibrium onrecovery of population structure using simulated data McVean (2009)suggested an interpretation of the eigenanalysis method that is the basisof the EIGENSOFT method in terms of the coalescence times of indi-viduals They also examined the performance of eigenanalysis in diversedemographic scenarios In the following we discuss the problem ofsample selection bias and some related work about the effect of biasedsampling on accuracy of population stratification

Sample selection biasA common assumption of statistical algorithms is that the available sampleis representative of the underlying population In reality this assumptionmay not always be true Sample selection bias is any systematic differencebetween the sample and the population It affects the internal validity of ananalysis by leading to inaccurate estimation of relationships betweenvariables It also can affect the external validity of an analysis because theresults from a biased sample may not generalize to the population

The problem of sample selection bias was first widely studied ineconometrics where it appeared as a bias among survey respondersHeckman (1979) provided a method of addressing this problem inlinear regression models by estimating the probability of an individualbeing included in the sample Sample selection bias also has beenaddressed in the statistics and machine learning literature by attemptsto understand its effect on classifiers and how estimation and pre-diction can be made correctly in the presence of sampling bias (Vella1998 Zadrozny 2004 Davidson and Zadrozny 2005 Cortes et al2008) Zadrozny (2004) discusses the properties of learning algorithmsand the effect of sample selection bias on their accuracy It also out-lines a possible way of correcting for sample selection bias providedwe know the bias Sample selection bias is also studied in ecologywhen one is trying to model species distributions using presence-onlydata (Phillips et al 2009)

In population genetics sample selection bias could be a seriousproblem because the estimates of ancestry obtained from stratifi-cation analyses often are used to make inferences about the geneticancestry of the sampled individuals The inferred individualancestries also are used as input in correcting for stratification inassociation studies [for instance in ADMIXMAP (Hoggart et al2003)] Pritchard et al (2000) suggest that detecting stratification isdifficult unless a significant number of unmixed individuals fromeach ancestral (or pseudo-ancestral) population is present in thesample This observation was verified by Tang et al (2005) throughexperiments on a small number of simulated datasets To ourknowledge there has not been a systematic study of the effect ofsample selection bias on the accuracy of population structure re-covery and individual ancestry recovery

We propose to study the effect of sample selection bias on theaccuracy of population stratification and individual ancestry recoveryusing a model-based approach (ADMIXTURE) and an eigenanalysis-based approach (EIGENSOFT) Since the analysis of McVean (2009)provides guidelines on the effect of sampling bias on stratificationaccuracy using eigenanalysis we will focus our attention mainly onprobabilistic models such as ADMIXTURE

A MATHEMATICAL FRAMEWORK FOR SAMPLESELECTION BIASWe will consider the problem of studying genotype data usinga probabilistic model Ordinarily for probabilistic modeling we wouldassume that we have genotypes (g) drawn independently from a dis-tribution D (with domain G) over the feature space G We assume thatour points (g u s) are drawn independently from a distribution Dover G middot U middot S where G is the space of genotypes U are someauxiliary features of the data that are not of direct interest for mod-eling and S is a binary space The variable s controls the selection ofpoints (1 means the point is selected and is observed in our sample0 means the point is not selected) Our observed sample contains onlypoints that have s = 1 We will refer to this as the selected sample andrefer to its distribution as D9

We consider the setting where s is independent of g given u that isP(s|g u) = P(s|u) This setting where the selection is controlled byfeatures different from the genotype we want to model arises fre-quently in real applications In population genetics whether an in-dividual is included in a genotyping study often depends on factorssuch as geographical location

Sample selection bias correctionIt is easy to see that if g is independent of u in the previous settingthen sample selection bias has no effect and the probability of g in the

902 | S Shringarpure and E P Xing

selected sample is the same as probability of g under D (asymptotically)If g and u are not independent then we can write using Bayes rule

Pethg uTHORN frac14 Peths frac14 1THORNPethg ujs frac14 1THORNPeths frac14 1jg uTHORN (1)

frac14 Peths frac14 1THORNPethg ujs frac14 1THORNPeths frac14 1juTHORN (2)

which can be rewritten as

PDethg uTHORN frac14 Peths frac14 1THORNPD9ethg uTHORNPeths frac14 1juTHORN (3)

where D9 represents the distribution of the selected sample Since theterm P(s = 1) is constant with respect to (g u) we can say that

PDethg uTHORN frac14 c middot PD9ethg uTHORNPeths frac14 1juTHORN (4)

where c is a constant that need not be evaluated for tasks such aslearning model parameters Therefore to model PD(g u) accuratelyto a multiplicative constant we can follow the procedure below

1 Compute PD9(g u) using a model learned on the selected sample2 Apply a correction using the term P(s = 1|u) This can be done in

two ways(a) If we know the selection procedure we know P(s = 1|u) and

can directly use it in the model(b) If the selection procedure is not known but we have access to

large number of points for which we know (u s) but not gwe can estimate P(s = 1|u) In the population genetics exam-ple this would correspond to knowing the language or geo-graphical region of an individual and whether or not theycould have been included in the study (since genotypingindividuals to find g for a large number of individuals wouldbe expensive)

However this analysis which can accurately correct for sampleselection bias in the described setting requires a model of both g andu In most applications we are interested in only modeling g and notu For instance although there is interest in modeling the distributionof genotypes distributions of language or geography are not of directinterest in genetics Therefore we consider a similar analysis in thecase where we only model P(g) and attempt to derive a correction forsample selection bias

Approximate correctionWe consider the case when we only want to model P(g) Proceeding ina similar way as before we can write

PethgTHORN frac14 Peths frac14 1THORNPethgjs frac14 1THORNPeths frac14 1jgTHORN (5)

which can be restated as

PDethgTHORN frac14 c middot PD9ethgTHORNPeths frac14 1jgTHORN (6)

A correction for sample selection bias could therefore be found ifwe could estimate P(s = 1|g) However g is typically high-dimensionalmdash

in genetics applications g may have dimensions from 1000 to 1000000P(s = 1|g) is therefore hard to estimate from the small selected sampleWe propose that since g and u are dependent and u typically hasmuch lower dimensionality than g we can approximate P(s = 1|g)by P(s = 1|u) We can therefore write the correction for sampleselection bias as

PDethgTHORN c middot PD9ethgTHORNPeths frac14 1juTHORN (7)

with the quality of the approximation varying as a function of thedependence between g and u In practice we find that the approxi-mate correction method is adequate for most applications since prob-abilistic models are often robust to some differences between the truedistribution of the data and the distribution of the selected sample

It is important to note that even if the selection is determined inreality by the u variables only the correction proposed in Equation 7is only an approximate correction The exact correction would requirecomputing the term P(s = 1|g) which can be written asP

uPeths frac14 1juTHORNPethujgTHORN The second term is a distribution conditionedon g and is hard to specify due to the high dimensionality of g

Implementing correction in learning Although applying the pro-posed correction for accurate probability modeling only requiresan extra multiplication step implementing the correction inlearning models consistent with the true distribution is morecomplex This problem is well-studied as cost-sensitive learning Adiscussion of the ways in which the correction can be applied toclassifiers can be found in Zadrozny et al (2003) In this work wewill use sampling with replacement to implement the correctionTo perform sampling with replacement we sample points in theselected sample (with replacement) with probability proportionalto their correction factor (1p(s = 1|u)) If the selected samplecontains N points the probability of inclusion for the ith point is

given as 1=pethsfrac141juiTHORNPN

jfrac1411=pethsfrac141jujTHORN

Since we sample with replacement our cor-

rected sample can include non-unique points from the selectedsample

METHODSWe will demonstrate the effects of sample selection bias on theaccuracy of ancestry recovery by using experiments on simulated andreal data We will also show how the approximate correction forsample selection bias is effective in practice

Simulation experimentsTo examine the recovery of individual ancestry we simulated datafrom a two-population admixture scenario using populationsfrom HapMap Phase III as the ancestral populations We used 50unrelated individuals each from the CEU (Utah residents withancestry from northern and western Europe) and YRI (Yoruba inIbadan Nigeria) populations as the two ancestral populationsUsing a forward simulator (see File S1) we simulated a 50-50admixture in a single generation followed by six generations ofrandom mating to create a simulated population of 800 diploidindividuals For our experiments we used a 102-Mb region fromchromosome 1 containing 52040 single-nucleotide polymor-phisms (SNPs) The recombination rate was set to be 1028 persite per generation and the mutation rate was set to be 1028 persite per generation Figure 1 shows the proportion of YRI ancestry

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 903

among the individuals in the simulated population From thefigure we can see that most individuals are admixed and there-fore we refer to this simulated population as admixed For ourancestry inference experiments we also included the remainingCEU-unrelated individuals (38 in total) and the remaining YRI-unrelated individuals (50 in total) in the analysis as proxies for theancestral populations In our experiments we will refer to theCEU and YRI individuals as unmixed individuals

We use the proportion of YRI ancestry denoted as ui to representthe ancestry of the ith individual The population used for our experi-ments therefore contained the following three groups

1 100 CEU individuals obtained by duplicating (and in some casestriplicating) 38 unrelated CEU individuals ui = 0

2 100 YRI individuals obtained by duplicating 50 unrelated YRIindividuals ui = 1

3 800 admixed individuals with ui 2 (0 1)

To study the effects of sampling bias we sample individuals fromthe dataset with replacement to generate a dataset of size x + y + zwhere x is the number of CEU individuals z is the number of YRIindividuals and y is the number of admixed individuals By varying xy and z we can generate smaller datasets with different kinds of biasand deviations from the original dataset We choose x and z from103050100 and y from 1030100200400700 Thus the smallestpossible dataset is 101010 (30 individuals) and the largest possibledataset is 100700100 (900 individuals) We will use Sxyz to refer tothe dataset x y z

In this case g comprises the 52040-dimensional genotypes of theindividuals and u comprises the group memberships of each individ-ual according to their ancestry (u 2 1 2 3 with the three groups asdefined earlier) In general the probability distribution of u may notbe known Therefore as a first guess we will assume that u hasa uniform distribution ie PethuTHORN frac14 1

3 for any value of u This assump-tion means that an individual (regardless of inclusion in the studysample) is equally likely to be from the CEU YRI or the admixedpopulation By design s depends only on u and we can write theprobability distribution P(s = 1|u) as

Peths frac14 1ju frac14 1THORN frac14 Pethu frac14 1js frac14 1THORNPeths frac14 1THORNPethu frac14 1THORN frac14 x

x thorn y thorn zPeths frac14 1THORN

1=3

(8)

Peths frac14 1ju frac14 2THORN frac14 Pethu frac14 2js frac14 1THORNPeths frac14 1THORNPethu frac14 2THORN frac14 z

x thorn y thorn zPeths frac14 1THORN

1=3

(9)

Peths frac14 1ju frac14 3THORN frac14 Pethu frac14 3js frac14 1THORNPeths frac14 1THORNPethu frac14 3THORN frac14 y

x thorn y thorn zPeths frac14 1THORN

1=3

(10)

which we can simplify to write

Peths frac14 1ju frac14 1THORN frac14 x middot C (11)

Peths frac14 1ju frac14 2THORN frac14 z middot C (12)

Peths frac14 1ju frac14 3THORN frac14 y middot C (13)

where C is a constant given by3Peths frac14 1THORNx thorn y thorn z

Evaluation measureA fair evaluation of the results for both ADMIXTURE andEIGENSOFT is difficult to achieve because the individual ancestriesproduced by ADMIXTURE and EIGENSOFT are different in natureWith K ancestral populations an individual ancestry vector producedby ADMIXTURE has the form q1 qk such that

PKkfrac141qk frac14 1

Thus it has only K 2 1 independent components An equivalentrepresentation of ancestry can be produced by projecting an individualon the first K2 1 eigenvectors produced by EIGENSOFT The ancestryvectors that we store as the true ancestry when generating the simula-tion data have the same form as those produced by ADMIXTURE

We use squared correlation between the true ancestry and inferredancestry as the measure for evaluating the accuracy of ancestryinference For the dataset Sxyz let Q = u1 ux+y+z denote the trueproportion of YRI ancestry of the individuals Let Q1 = q11 q1x+y+z denote the ancestry proportions in cluster 1 inferred byADMIXTURE for dataset Sxyz with K = 2 where the second subscriptindexes the individuals in the dataset (and therefore we have that Q2 =q21 q2x+y+z = 1 2 q11 1 2 q1x+y+z Then the accuracymetric for ADMIXTURE is correlation(Q Q1)2 (which is equal tocorrelation(Q Q2)2) The corresponding accuracy measure can beconstructed for EIGENSOFT by replacing Q1 by the projections ofthe individual genotypes on the first eigenvector of the Sxyz genotypematrix For robustness we report the mean and standard error of thesquared correlation over 30 datasets for each parameter setting

RESULTS

Sample selection bias in a simulated CEU-YRI admixtureWe examined the effect of biased sampling of individuals by constructingsubsets of the whole dataset from the CEU-YRI admixture describedearlier and measuring the squared correlation between the true andinferred ancestry (File S2) Figure 2A shows the results of this analysiswith the ADMIXTURE software with six subplots Within each sub-plot the number of admixed individuals in the sample remains con-stant and the number of unmixed individuals from the two ancestralpopulations (CEU and YRI) is varied Figure 2B shows the results ofan identical analysis of the same datasets with the EIGENSOFTsoftware

Overall ADMIXTURE recovered individual ancestry better thanEIGENSOFT Previous work has shown that the accuracy of in-dividual ancestry recovery is a function of the FST differentiationbetween the ancestral populations The results we obtain may varyfor datasets with different FST values between the ancestral popula-tions and quantifying this effect will require more study

Figure 1 Proportion of YRI ancestry among individuals in the simulatedpopulation six generations after admixture

904 | S Shringarpure and E P Xing

Figure 2B shows that the EIGENSOFT results exhibit considerableirregular variation within subplots This is likely to be a result of oursampling scheme for creating biased datasets which allows individualsto be present in a dataset more than once This means that theresulting genotype matrix can have lower rank than expected result-ing in lower accuracy of ancestry inference using EIGENSOFT Thishowever does not cause any difficulties in the ADMIXTURE analysisas seen by the regular patterns in Figure 2A and would be expectedbased on the likelihood model underlying ADMIXTURE We willtherefore only use the results in Figure 2B to draw broad conclusionsabout the behavior of EIGENSOFT and not make specific inferencesor recommendations

In the scenarios in which we have few unmixed individuals fromboth ancestral populations Figure 2 A and B show that the accuracyof individual ancestry recovery drops noticeably This effect is evidentin the subplots with 200 400 and 700 admixed individuals In allsubplots in Figure 2A the results show high accuracy when we havea large number of unmixed individuals from at least one ancestral

population Pritchard et al (2000) and Tang et al (2005) have pre-viously noted that a significant number of unmixed individuals fromeach ancestral population is required for accurate recovery of stratifi-cation An initial examination of the results suggests that may besufficient to have a large number (around 502100) of unmixed indi-viduals from just one of the two ancestral populations to be able tocorrectly resolve stratification

We note that this guideline which is relevant when the number ofadmixed individuals is large does not apply if the dataset contains fewadmixed individuals and few unmixed individuals When there arefew admixed individuals both methods perform well (relative to theiraverage performance) even with as few as 10 unmixed individuals inthe dataset Mantel tests reveal that the high correlations obtainedwith few admixed individuals are statistically significant (P 1023) inall cases

To examine the effect of the number of unmixed individuals onthe ancestry recovery in more detail we looked at the subset of thegenerated subsamples which had the same number of unmixed

Figure 2 Squared correlation between thetrue individual ancestry and the individualancestry inferred using (A) ADMIXTURE withK = 2 and (B) EIGENSOFT with the topeigenvalue The different levelplots are drawnfor different number of admixed individuals inthe dataset The X and Y axes of the plots arelogarithmic in scale

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 905

individuals from both ancestral populations ie subsets of the formSxyz where the number of unmixed individuals x2103050100 andthe number of admixed individuals y21030100200400700 Foreach value of x (the number of unmixed individuals from each an-cestral population present in the dataset) we observed the effect ofvarying y (the number of admixed individuals present in the dataset)on the ancestry recovery Figure 3A shows the results for ADMIX-TURE and Figure 3B shows the results for EIGENSOFT When thenumber of unmixed individuals is large the methods recover ancestrywell and the number of admixed individuals has no effect on accu-racy However when the number of unmixed individuals in the sam-ple is small adding more admixed individuals to the sample reducesthe accuracy of the ancestry recovery for both ADMIXTURE andEIGENSOFT In Figure 3 A and B we see a threshold effect due tothe number of unmixed individuals when the number of unmixedindividuals changes from 30 to 100

The high accuracy of ancestry recovery when there are fewadmixed and few unmixed individuals suggests that earlier hypothesesabout the requirement of a large number of unmixed individuals foraccurate ancestry recovery may be an incomplete explanation Theresults in Figure 2A and Figure 3A along with the likelihood modelunderlying ADMIXTURE suggest that the effect on accuracy maydepend on the ratio of the number of admixed individuals to unmixedindividuals from each population in the sample For notational con-venience we will refer to this ratio as tsample = yx

To examine this hypothesis we replot the data used for Figure 3Aby examining the correlation measure as a function of the ratio ofadmixed individuals to unmixed individuals in the sample Figure 4 Aand B show the results for this visualization for ADMIXTURE andEIGENSOFT respectively From the figure we can see that the effectof sample selection bias can be understood using tsample The accuracyof ancestry recovery is high while the value of tsample is less than 5(approximately) and drops as this ratio increases This behavior isindependent of the exact number of unmixed individuals in the data-set and can be observed for both ADMIXTURE and EIGENSOFT

In our experiments we observe that individual ancestry can berecovered perfectly as long as tsample 5 Decreasing the number ofadmixed individuals has no adverse effect on the accuracy of ancestryrecovery The effects of sample selection bias on the accuracy ofancestry recovery in a two-population admixture scenario usingADMIXTURE can thus be explained in two scenarios (i) whentsample 5 sample selection bias has no effect on the accuracy ofindividual ancestry recovery and (ii) when tsample 5 the accuracy ofindividual ancestry measured using the correlation measure decreaseswith tsample

Correction of sample selection bias in simulated dataWe examinedthree methods of correcting sample selection bias in the simulateddata The methods and their results are described herein

Supervision ADMIXTURE can be run in supervised mode whereindividuals of known ancestry can be specified in advance to belong toexactly one of K populations being inferred This supervision is partialsince only nonadmixed individuals can be part of the supervised inputprovided to the algorithm In our experiments we include supervisionby assigning the individuals of known YRI and CEU ancestry to thetwo ancestral populations We find that supervision produces an im-provement that is statistically significant but practically insignificant( 00001 improvement in squared correlation P 22 middot 10216 ina one-sided paired t-test)

Including more SNPs Our analysis of the CEU-YRI admixtureused 52040 SNPs in a 102-Mb region We found that including more

SNPs in the ADMIXTURE analysis reduced the effects of sampleselection bias for identical sample sizes and samples When 116415SNPs are included in the ADMIXTURE analysis we observe no effectsof sample selection bias on the accuracy of ancestry inferenceAlthough we do not thin SNPs in our analysis thinning the originalset of 52040 SNPs by an r2 threshold of 01 produces a set of 35500SNPs Reanalyzing the data with this subset of SNPs produces a smallloss in accuracy (05 in squared correlation P 22 middot 10216 ina one-sided paired t-test) when compared with the original set of52040 SNPs Thus while adding independent SNPs improves accu-racy of ancestry inference adding or removing linked SNPs has littleeffect on accuracy

Resampling correction We implemented the resampling correc-tion in the section Approximate correction by using Equations 11213Since we do not have an informative prior expectation of the truenumbers of admixed individuals or the unmixed individuals we usedthe uninformative initial estimate of all groups of individuals havingequal population sizes We found that in more than 99 of thecorrected datasets the squared correlation between true and inferredancestry was larger than 095 (mean = 098 standard deviation =001) Thus the resampling correction is effective for correcting sampleselection bias

Impact of effective population size on sampleselection biasIn the previous simulation we used a correction based on anassumption of similar population sizes for the different groups Ingeneral the census (observed) population size and the effective

Figure 3 Effect of adding more admixed individuals to the dataset onthe correlation measure of accuracy when using (A) ADMIXTURE withK = 2 and (B) EIGENSOFT with the top eigenvalue The error barsindicate standard error

906 | S Shringarpure and E P Xing

population size of a population need not be equal due to demographicevents that the population may have undergone We attempt todemonstrate that the effect of biased sampling is dependent on theeffective population size and not only on the census population size

We simulated two populations according to the demographicscenario shown in Figure 5 using the coalescent simulatorms (Hudson2002) The founding population and the populations resulting fromthe split 1000 generations before present were set to have identicalsizes (N0 = 10000) Population 2 then undergoes an instantaneouscontraction 795 generations after the split with its population sizebeing reduced to aN0 The contraction factor a was varied acrossdatasets taking values of 099 (nearly no contraction) 05 (moderatecontraction) or 01 (strong contraction) Two-hundred generationsafter the contraction population 2 instantaneously grows back in sizeto N0 The simulation was continued for five generations after therecovery after which 50 diploid individuals were sampled from eachpopulation Each simulation used a mutation rate and recombinationrate of 1028 per bp per generation In each simulation 200 indepen-dent sites of 100 kb were generated producing between 47585 and58024 SNPs per simulation For each value of a (from 0105099)30 independent datasets were generated to evaluate the robustness ofresults (File S3)

The two populations from a single simulation were pooledtogether to form a 5050 admixture in a single generation to create50 admixed individuals This was followed by five generations ofrandom mating in the resulting admixed population using thesimulator described in the Introduction section We then generated

a sample for study by including the 50 individuals from population 150 admixed individuals and a variable number of individuals (x) frompopulation 2 (x 2 5 10 20 30 40 50) ADMIXTURE was used withK = 2 to infer the ancestry proportions for each individual and thesquared correlation coefficient (as described in the section Evaluationmeasure) for the dataset was used a measure of accuracy

Figure 6 shows the results of the ancestry inference with one curvefor each value of the contraction factor a As expected from the pre-vious experiments as the number of individuals from population 2 inthe dataset increases the accuracy of ancestry inference increaseswhich is demonstrated by each curve We also observe that the curvesfor the three different contraction factors have different accuraciesbefore they converge to a accuracy of almost 1

By our experimental design the two ancestral populations haveidentical census population sizes for each simulation and the censussize for population 2 is identical across simulations However due tothe bottleneck the effective population size of population 2 is reducedwith the amount of reduction dependent to the strength of thecontraction (the calculations for reduction in effective population sizefor population 2 can be found in File S1) Figure 6 thus demonstratesthat effective population size and not just census population size hasan impact on how ancestry inference is affected by sampling biasFrom the figure we can see that for the strongest contraction (a =01 producing the maximum reduction in the effective population sizefor population 2 to 036N0) the accuracy of ancestry inference is higheven with only 5 individuals from population 2 Since effective pop-ulation size is a measure of the genetic diversity of a population thissuggests that the reduced genetic diversity produced by the strongcontraction can be adequately captured even with a small numberof individuals For the same number of individuals from population2 across the curves we see that the accuracy is smallest for theweakest contraction (a = 099) and largest for the strongest contrac-tion (a = 01) Thus even when census population sizes are identicalvariation in effective population size affects the impact of samplingbias accuracy of ancestry inference This suggests that sampling biascan be more accurately captured by defining it in terms of effectivepopulation size rather than census population size

Correction of bias Using the estimate for the effective population sizeof population 2 from File S1 we can implement a resampling

Figure 5 A two-population demographic scenario Populations 1 and2 (of size N0 = 104) are formed from a single population of the samesize 1000 generations before present time Population 2 then under-goes a bottleneck for 200 generations followed by a instantaneousrecovery to its original population size

Figure 4 Effect of the ratio of number admixed individuals to unmixedindividuals in the dataset (tsample) on the correlation measure of accu-racy using (A) ADMIXTURE with K = 2 and (B) EIGENSOFT with the topeigenvalue The error bars indicate standard error The X-axis is loga-rithmic in scale

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 907

correction as before We assume that the effective population size forthe admixed population is identical to that for population 1 ie N0 =104 Figure 7 shows the change in accuracy after application of theresampling correction (relative to the accuracies in Figure 6)

From the figure we can see that while the accuracy for the originaldataset is less than 095 the resampling correction produces animprovement in the results denoted by the portions of the curves inbold However beyond those points as the datasets acquire an excessof the unmixed individuals relative to the admixed individuals we seethat the uncorrected accuracy is high (larger than 095) and theresampling correction decreases the accuracy of ancestry inferenceThus a resampling correction is unnecessary when there is an excessof unmixed individuals relative to their expected number fromeffective population size calculations

Sample selection bias in cattle genotype dataTo demonstrate the effects of sample selection bias on a real geneticdataset we used the data from (McTavish et al 2013ab) who gen-otyped 1461 individuals belonging to cattle breeds across the world at47506 SNPs We focus on admixture between African and Europeancattle breeds as seen in their New World descendants For theirexperiments with STRUCTURE McTavish et al (2013b) used a subsetof 1814 SNPs that was common to all their genotyping chips Our datatherefore consists of genotype data at 1814 SNPs from 40 New Worldindividuals (Texas Longhorns) 100 European individuals (Limousinfrom Southern Europe) and 100 African individuals (NrsquoDama andNrsquoDamaXBoran)

We created multiple subsets from the original data by including allthe 40 admixed New World individuals and a variable number (x) ofunmixed individuals from the European and African populations(with the same number from each population) The number of un-mixed individuals chosen was increased gradually from 10 to 100 (x 210 20 30 40 50 60 80 100) For each value of x we generated 30datasets for statistical robustness ADMIXTURE was used with K = 2to infer ancestry proportions for all individuals The squared correla-tion coefficient for the ancestry proportions of the admixed individ-uals was used as a measure of accuracy with the ancestry proportionsinferred on the full data used as a truth set

Figure 8 shows the results of the ancestry inference as a function ofthe number of unmixed individuals (x) We see that as the number of

unmixed individuals increases the accuracy of ancestry inferenceimproves When the number of unmixed individuals is less than 40the squared correlation measure is less than 09

Correction of bias For the cattle data effective population sizes forthe three populations are unknown We assume that all threepopulations have equal effective population size The correction isimplemented using resampling as before The blue curve in Figure 8shows the effect of the correction on accuracy From the figure we cansee that when the number of admixed individuals is larger than thenumber of unmixed individuals (tsample $ 1 as defined previously)the correction improves accuracy but causes a small drop in accuracywhen there are many unmixed individuals

DISCUSSIONBiased sampling of individuals is a result of practical constraints onthe study design of genotypingsequencing projects In this aspectsample selection bias is similar to bias due to SNP ascertainment Itcan also occur as a result of combining multiple genomic datasetsfrom different sources For instance the overlap of three genotypingchips required McTavish et al (2013b) to use only 1814 SNPs for theirSTRUCTURE analysis

In most stratification analyses the recovered ancestry proportionsare used to make inferences about the ancestry of the sample and thegenetic relationships of similaritydifferences between the studypopulations Ancestry proportion estimates are also used in associa-tion studies to account for effects of stratification Therefore it isessential to have accurate recovery of ancestry Our experimentssuggest that sample selection bias can be a problem in accuratepopulation stratification and recovering individual ancestry Oursimulations show that unlike observations from previous studies theaccuracy of individual ancestry recovery is not dependent only on thenumber of unmixed individuals present in the sample We observea threshold effect where the accuracy of ancestry inference is affectedby sample selection bias depending on the ratio of admixedindividuals to unmixed individuals in the sample Our simulationssuggest that the accuracy of ancestry inference is affected when thisratio tsample is less than 5 For real genotype data from cattle we

Figure 7 Change in correlation measure from Figure 6 after theresampling correction taking into account the impact of the contrac-tion on effective population size Each curve is drawn for a single valueof the contraction factor a and shows the mean and standard error ofthe change in correlation measure over 30 datasets The regions of thecurves in bold show regions where accuracy in the uncorrected results(from Figure 6) is less than 095

Figure 6 Impact of effective population size on the correlationmeasure of accuracy of ancestry inference using ADMIXTURE withK = 2 Each curve is drawn for a single value of the contraction factor aand shows the mean and standard error of the correlation measureover 30 datasets

908 | S Shringarpure and E P Xing

observed that a ratio of tsample larger than 1 was sufficient to affectancestry estimates

Our experiments do not directly demonstrate an effect of FST onsample selection bias However using the 1814 SNPs for the genotypedata we were able to observe effects of sample selection bias onAfrican-European admixture in cattle (FST 015) but unable toobserve any effect of sample selection bias on European-Indian ad-mixture (FST 028 results not shown here) This suggests that FSThas an effect on sample selection bias with well-differentiated pop-ulations being easy to separate even with biased sampling The sim-ulations and real data analysis had relatively large FST between the twoancestral populations ( 01 for the simulations 015 for the cattlegenotype data) but vastly different number of SNPs (52040 for thesimulation but only 1814 for the cattle genotype data) The degrada-tion in performance of ancestry inference in the two cases was con-siderably different suggesting that the effect of sample selection biasalso depends on the number of loci available This was also validatedby the observation that doubling the number of SNPs in simulationeliminated the effects of sample selection bias completely

We also showed through simulations that sample selection biasdepends on the effective population size rather than simply the censuspopulation size of the populations studied This suggests that samplingfor genotypingsequencing projects can be improved (in terms ofdiversity of sampling and cost) by taking into account independentknowledge of the demographic history of the populations beingsampled

Although our analyses use two specific methods (ADMIXTUREand EIGENSOFT) the effects we observe are a feature of theassumptions underlying both methods rather than the specificimplementations ADMIXTURE is a representative of the likelihood-based models that assume (a) admixture between ancestral popula-tions and (b) that modern individual genomes are mixture ofcontributions from different ancestral populations This is themodel underlying STRUCTURE and frappe and to a large extentthe extensions mentioned earlier Likelihood-based methods aresusceptible to effects of sample selection bias since each individual isgiven equal weight in the sample a priori This is a result of thefundamental assumption of learning methods that the sampleobserved is representative of the underlying population distributionEigenanalysis which also weighs each individual equally apriori alsosuffers from a similar problem as the likelihood-based method The

sensitivity of eigenanalysis to sample size variation and outliers is well-known and is also reported by McVean (2009) and Novembre andStephens (2008)

ADMIXTURE and EIGENSOFT are unsupervised methods forperforming ancestry inference ie they do not leverage informationabout individuals of known ancestry to estimate their respective modelparameters ADMIXTURE supports a mode (which is referred to asa supervised mode in the documentation) in which some individualscan be ldquolabeledrdquo ie assigned to have ancestry from only a singlecluster However all the individuals labeled or otherwise are usedto learn the allele frequency parameters of the ADMIXTURE modelIn machine learning literature this mode of learning is commonlycalled ldquosemi-supervisedrdquo learning wherein a combination of labeledand unlabeled data is used to learn model parameters In simulationswe observed that this mode of ancestry inference is also affected bysample selection bias

A different class of methods can estimate local (subcontinental)ancestry from dense genotype data (Sankararaman et al 2008 Priceet al 2009 Wall et al 2011 Baran et al 2012) A common theme of allof these methods is the use of external sources of genotype data asldquoreference panelsrdquo (referred to as ldquotraining datardquo in machine learningliterature) which are used to estimate ancestral allele frequencies orancestral haplotypes The estimated parameters (frequencies or hap-lotypes) are then used to infer the ancestry of the individuals in thestudy sample based on their genotypes (referred to as ldquotest datardquo inmachine learning literature) Therefore these methods can be said tobe performing supervised learning Supervised learning using externalsources of data for estimating model parameters is robust to biasedsampling in the study sample but is sensitive to the reference panelsused for training (Zadrozny et al 2003)

We proposed a resampling correction for sample selection biasusing a mathematical framework modeling the sampling process Theproposed correction requires knowledge of some auxiliary informa-tion about the selection criteria that is correlated with the genotypes(bias is present only if selection is dependent on the genotypesdirectly or indirectly) For genetic datasets geography provides onesuch criterion that is easy to acquire during data collection Using thisinformation we proposed a correction that is easy to implement Weshowed using simulation experiments that such a correction iseffective in practice and leads to more accurate results In terms ofcomputational effort ADMIXTURE has a linear dependence on thenumber of individuals The resampling correction will therefore requiretime proportional to the number of individuals in the resampled setHowever a better implementation of the resampling correction forADMIXTURE would be to be weigh the likelihood contribution fromeach individual in the original sample by its resampling correctionfactor (1p(s = 1|u)) This would allow the correction to be implementedwithout any increase in computational complexity of the ADMIXTUREalgorithm The resampling correction implemented naively can causeproblems for EIGENSOFT due to the possibility of colinearity resultingfrom duplication of individuals as evident in Figure 2B A weightedeigenanalysis would avoid this problem without affecting the efficiencyof the analysis

An alternative method for correcting sample selection bias is toadd more markers to the analysis as observed from the simulationAdding more SNPs to the analysis is an appealing solution because itdoes not require any further assumptions that any statisticalcorrection may require with the exception of the assumption of nolinkage between SNPs However this may not always be possible if thestudy dataset is constructed by intersecting multiple datasets contain-ing the individuals of interest genotyped on different platforms

Figure 8 Variation in accuracy of ancestry inference with changingnumber of unmixed individuals The red curve shows accuracy withoutcorrection and the blue curve shows the accuracy when the correctionis applied

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 909

Our simulation experiments used simple two-population admix-ture scenarios to examine the effects of sample selection bias on theaccuracy of stratification In one set of experiments we used twopopulations (CEU and YRI) that were easily separable and examinedthe admixed population resulting after a single recent admixtureevent In reality the demographic processes underlying the evolutionof populations are much more complicated In such scenarios it isreasonable to expect that the stratification problem may be harder toresolve and would suffer from the effects of sample selection bias moreseverely

A limitation of our proposed correction is the requirement forexternal knowledge of the selection criteria for samples In ourexperiments we either knew the nature of the bias (by design in thesimulation experiments) or assumed that it was known In general theaccuracy of the correction method proposed will strongly depend onthe relationship between the selection criteria and the genotypesCorrections factors therefore may be dataset-specific An alternativefuture direction for correcting sample selection bias would be todevelop models of population structure that can also model theauxiliary factors such as geography or language that may determinethe selection process responsible for the biased sampling There maybe occasions where the nature of the sampling bias is partially knownwithout having sufficient information to allow a correction to bedeveloped For instance analyses of the POPRES dataset (Nelson et al2008) often select individuals who are known to have all four grand-parents of the same ancestry (Novembre et al 2008) In this case thecriteria for selection can be stated explicitly but a mathematical for-mulation of the bias is hard to obtain In such scenarios we recom-mend subsamplingresampling datasets to examine the robustness ofancestry inference results to variation in sampling

In summary our experiments suggest that sample selection biasaffects ancestry inference and depends on factors such as effectivepopulation size genetic differentiation between populations numberof loci available and the ratio of admixed samples to unmixedsamples For analyses of human genotype data hundreds of thousandsof SNPs are typically available Therefore we do not expect sampleselection bias to be a problem for ancestry inference in human dataexcept when the source populations for admixture are not well-differentiated Local ancestry methods which can make use of externalgenotype data also provide a way of avoiding the effects of sampleselection bias We expect that sample selection bias will be of interestin ancestry analysis for model organisms where availability of externalgenotype data are limited and most genotyping chips have relativelyfewer loci When the sampling process is relatively well-understooda mathematical correction as we have proposed will help to reduce theeffects of sample selection bias In our experiments a rough guidelineof using approximately identical number of admixed and unmixedindividuals reduces the effects of biased sampling When the samplingprocess cannot be modeled accurately examining the robustness ofancestry estimates to sampling may help avoid inaccurate ancestryestimation

ACKNOWLEDGMENTSThis research was supported in part by NSF Grant DBI-0546594 andNIH Grant R01GM087694

LITERATURE CITEDAlexander D H J Novembre and K Lange 2009 Fast model-based es-

timation of ancestry in unrelated individuals Genome Res 19 1655ndash1664

Baran Y B Pasaniuc S Sankararaman D G Torgerson C Gignoux et al2012 Fast and accurate inference of local ancestry in Latino popula-tions Bioinformatics 28 1359ndash1367

Cavalli-Sforza L L 2005 The Human Genome Diversity Project pastpresent and future Nat Rev Genet 6 333ndash340

Cortes C M Mohri M Riley and A Rostamizadeh 2008 Sample selec-tion bias correction theory Algorithmic Learning Theory 5254 16

Davidson I and B Zadrozny 2005 An improved categorization of clas-sifiers sensitivity on sample selection bias pp 605ndash608 in Fifth IEEEInternational Conference on Data Mining ICDM05 Available athttpieeexploreieeeorgxplarticleDetailsjsptp=amparnumber=1565737ampqueryText3DAn+improved+categorization+of+classifiers+sensitivity+on+sample+selection+bias Accessed March 26 2014

Engelhardt B E and M Stephens 2010 Analysis of population structurea unifying framework and novel methods based on sparse factor analysisPLoS Genet 6 12

Falush D M Stephens and J K Pritchard 2003 Inference of populationstructure using multilocus genotype data linked loci and correlated allelefrequencies Genetics 164 1567ndash1587

Gibbs R A 2003 The International HapMap Project Nature 426 789ndash796Heckman J J 1979 Sample selection bias as a specification error Econo-

metrica 47 153ndash161Hoggart C J E J Parra M D Shriver C Bonilla R A Kittles et al

2003 Control of confounding of genetic associations in stratified pop-ulations Am J Hum Genet 72 1492ndash1504

Hudson R R 2002 Generating samples under a Wright-Fisher neutralmodel of genetic variation Bioinformatics 18 337ndash338

Huelsenbeck J P and P Andolfatto 2007 Inference of population struc-ture under a Dirichlet process prior Genetics 175 1787ndash1802

Kaeuffer R D Reacuteale D W Coltman and D Pontier 2007 Detectingpopulation structure using STRUCTURE software effect of backgroundlinkage disequilibrium Heredity 99 374ndash380

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013a Data from New World cattle show ancestry from multiple in-dependent domestication events Available at httpdatadryadorgresourcedoi105061dryad42tr0 Accessed March 26 2014

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013b New World cattle show ancestry from multiple independentdomestication events Proc Natl Acad Sci USA 110 E1398ndashE1406

McVean G 2009 A genealogical interpretation of principal componentsanalysis PLoS Genet 5 e1000686

Nelson M R K Bryc K S King A Indap A R Boyko et al 2008 ThePopulation Reference Sample POPRES a resource for population dis-ease and pharmacological genetics research Am J Hum Genet 83 347ndash358

Novembre J and M Stephens 2008 Interpreting principal componentanalyses of spatial population genetic variation Nat Genet 40 646ndash649

Novembre J T Johnson K Bryc Z Kutalik A R Boyko et al2008 Genes mirror geography within Europe Nature 456 98ndash101

Patterson N A L Price and D Reich 2006 Population structure andeigenanalysis PLoS Genet 2 e190

Phillips S J M Dudiacutek J Elith C H Graham A Lehmann et al2009 Sample selection bias and presence-only distribution modelsimplications for background and pseudo-absence data Ecol Appl 19181ndash197

Price A L N J Patterson R M Plenge M E Weinblatt N A Shadicket al 2006 Principal components analysis corrects for stratification ingenome-wide association studies Nat Genet 38 904ndash909

Price A L A Tandon N Patterson K C Barnes N Rafaels et al2009 Sensitive detection of chromosomal segments of distinct ancestryin admixed populations PLoS Genet 5 e1000519

Pritchard J K M Stephens and P Donnelly 2000 Inference of populationstructure from multilocus genotype data Genetics 155 945ndash959

Rosenberg N A J K Pritchard J L Weber H M Cann K K Kidd et al2002 Genetic Structure of Human Populations Science 298 2381ndash2385

Sankararaman S G Kimmel E Halperin and M I Jordan 2008 On theinference of ancestries in admixed populations Genome Res 18 668ndash675

910 | S Shringarpure and E P Xing

Shringarpure S and E P Xing 2009 mStruct Inference of PopulationStructure in Light of Both Genetic Admixing and Allele MutationsGenetics 182 575ndash593

Tang H J Peng P Wang and N J Risch 2005 Estimation of individualadmixture analytical and study design considerations Genet Epidemiol28 289ndash301

Vella F 1998 Estimating models with sample selection bias a surveyJ Hum Resour 33 127ndash169

Wall J D R Jiang C Gignoux G K Chen C Eng et al 2011 Geneticvariation in Native Americans inferred from Latino SNP and rese-quencing data Mol Biol Evol 28 2231ndash2237

Zadrozny B 2004 Learning and evaluating classifiers under sample selec-tion bias pp 114 in Twenty-First International Conference on MachineLearning ICML 04 ACM New York NY

Zadrozny B J Langford and N Abe 2003 Cost-sensitive learning by cost-proportionate example weighting pp 435ndash442 in Third IEEE Interna-tional Conference on Data Mining ICDM 2003 Available at httpieeexploreieeeorgxplloginjsptp=amparnumber=1250950ampurl=http3A2F2Fieeexploreieeeorg2Fxpls2Fabs_alljsp3Farnumber3D1250950 Accessed March 27 2014

Communicating editor D-J De Koning

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 911

Page 3: Effects of Sample Selection Bias on the Accuracy of ...epxing/papers/2014/Shringarpure_Xing_GGG...The problem of sample selection bias was first widely studied in econometrics, where

selected sample is the same as probability of g under D (asymptotically)If g and u are not independent then we can write using Bayes rule

Pethg uTHORN frac14 Peths frac14 1THORNPethg ujs frac14 1THORNPeths frac14 1jg uTHORN (1)

frac14 Peths frac14 1THORNPethg ujs frac14 1THORNPeths frac14 1juTHORN (2)

which can be rewritten as

PDethg uTHORN frac14 Peths frac14 1THORNPD9ethg uTHORNPeths frac14 1juTHORN (3)

where D9 represents the distribution of the selected sample Since theterm P(s = 1) is constant with respect to (g u) we can say that

PDethg uTHORN frac14 c middot PD9ethg uTHORNPeths frac14 1juTHORN (4)

where c is a constant that need not be evaluated for tasks such aslearning model parameters Therefore to model PD(g u) accuratelyto a multiplicative constant we can follow the procedure below

1 Compute PD9(g u) using a model learned on the selected sample2 Apply a correction using the term P(s = 1|u) This can be done in

two ways(a) If we know the selection procedure we know P(s = 1|u) and

can directly use it in the model(b) If the selection procedure is not known but we have access to

large number of points for which we know (u s) but not gwe can estimate P(s = 1|u) In the population genetics exam-ple this would correspond to knowing the language or geo-graphical region of an individual and whether or not theycould have been included in the study (since genotypingindividuals to find g for a large number of individuals wouldbe expensive)

However this analysis which can accurately correct for sampleselection bias in the described setting requires a model of both g andu In most applications we are interested in only modeling g and notu For instance although there is interest in modeling the distributionof genotypes distributions of language or geography are not of directinterest in genetics Therefore we consider a similar analysis in thecase where we only model P(g) and attempt to derive a correction forsample selection bias

Approximate correctionWe consider the case when we only want to model P(g) Proceeding ina similar way as before we can write

PethgTHORN frac14 Peths frac14 1THORNPethgjs frac14 1THORNPeths frac14 1jgTHORN (5)

which can be restated as

PDethgTHORN frac14 c middot PD9ethgTHORNPeths frac14 1jgTHORN (6)

A correction for sample selection bias could therefore be found ifwe could estimate P(s = 1|g) However g is typically high-dimensionalmdash

in genetics applications g may have dimensions from 1000 to 1000000P(s = 1|g) is therefore hard to estimate from the small selected sampleWe propose that since g and u are dependent and u typically hasmuch lower dimensionality than g we can approximate P(s = 1|g)by P(s = 1|u) We can therefore write the correction for sampleselection bias as

PDethgTHORN c middot PD9ethgTHORNPeths frac14 1juTHORN (7)

with the quality of the approximation varying as a function of thedependence between g and u In practice we find that the approxi-mate correction method is adequate for most applications since prob-abilistic models are often robust to some differences between the truedistribution of the data and the distribution of the selected sample

It is important to note that even if the selection is determined inreality by the u variables only the correction proposed in Equation 7is only an approximate correction The exact correction would requirecomputing the term P(s = 1|g) which can be written asP

uPeths frac14 1juTHORNPethujgTHORN The second term is a distribution conditionedon g and is hard to specify due to the high dimensionality of g

Implementing correction in learning Although applying the pro-posed correction for accurate probability modeling only requiresan extra multiplication step implementing the correction inlearning models consistent with the true distribution is morecomplex This problem is well-studied as cost-sensitive learning Adiscussion of the ways in which the correction can be applied toclassifiers can be found in Zadrozny et al (2003) In this work wewill use sampling with replacement to implement the correctionTo perform sampling with replacement we sample points in theselected sample (with replacement) with probability proportionalto their correction factor (1p(s = 1|u)) If the selected samplecontains N points the probability of inclusion for the ith point is

given as 1=pethsfrac141juiTHORNPN

jfrac1411=pethsfrac141jujTHORN

Since we sample with replacement our cor-

rected sample can include non-unique points from the selectedsample

METHODSWe will demonstrate the effects of sample selection bias on theaccuracy of ancestry recovery by using experiments on simulated andreal data We will also show how the approximate correction forsample selection bias is effective in practice

Simulation experimentsTo examine the recovery of individual ancestry we simulated datafrom a two-population admixture scenario using populationsfrom HapMap Phase III as the ancestral populations We used 50unrelated individuals each from the CEU (Utah residents withancestry from northern and western Europe) and YRI (Yoruba inIbadan Nigeria) populations as the two ancestral populationsUsing a forward simulator (see File S1) we simulated a 50-50admixture in a single generation followed by six generations ofrandom mating to create a simulated population of 800 diploidindividuals For our experiments we used a 102-Mb region fromchromosome 1 containing 52040 single-nucleotide polymor-phisms (SNPs) The recombination rate was set to be 1028 persite per generation and the mutation rate was set to be 1028 persite per generation Figure 1 shows the proportion of YRI ancestry

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 903

among the individuals in the simulated population From thefigure we can see that most individuals are admixed and there-fore we refer to this simulated population as admixed For ourancestry inference experiments we also included the remainingCEU-unrelated individuals (38 in total) and the remaining YRI-unrelated individuals (50 in total) in the analysis as proxies for theancestral populations In our experiments we will refer to theCEU and YRI individuals as unmixed individuals

We use the proportion of YRI ancestry denoted as ui to representthe ancestry of the ith individual The population used for our experi-ments therefore contained the following three groups

1 100 CEU individuals obtained by duplicating (and in some casestriplicating) 38 unrelated CEU individuals ui = 0

2 100 YRI individuals obtained by duplicating 50 unrelated YRIindividuals ui = 1

3 800 admixed individuals with ui 2 (0 1)

To study the effects of sampling bias we sample individuals fromthe dataset with replacement to generate a dataset of size x + y + zwhere x is the number of CEU individuals z is the number of YRIindividuals and y is the number of admixed individuals By varying xy and z we can generate smaller datasets with different kinds of biasand deviations from the original dataset We choose x and z from103050100 and y from 1030100200400700 Thus the smallestpossible dataset is 101010 (30 individuals) and the largest possibledataset is 100700100 (900 individuals) We will use Sxyz to refer tothe dataset x y z

In this case g comprises the 52040-dimensional genotypes of theindividuals and u comprises the group memberships of each individ-ual according to their ancestry (u 2 1 2 3 with the three groups asdefined earlier) In general the probability distribution of u may notbe known Therefore as a first guess we will assume that u hasa uniform distribution ie PethuTHORN frac14 1

3 for any value of u This assump-tion means that an individual (regardless of inclusion in the studysample) is equally likely to be from the CEU YRI or the admixedpopulation By design s depends only on u and we can write theprobability distribution P(s = 1|u) as

Peths frac14 1ju frac14 1THORN frac14 Pethu frac14 1js frac14 1THORNPeths frac14 1THORNPethu frac14 1THORN frac14 x

x thorn y thorn zPeths frac14 1THORN

1=3

(8)

Peths frac14 1ju frac14 2THORN frac14 Pethu frac14 2js frac14 1THORNPeths frac14 1THORNPethu frac14 2THORN frac14 z

x thorn y thorn zPeths frac14 1THORN

1=3

(9)

Peths frac14 1ju frac14 3THORN frac14 Pethu frac14 3js frac14 1THORNPeths frac14 1THORNPethu frac14 3THORN frac14 y

x thorn y thorn zPeths frac14 1THORN

1=3

(10)

which we can simplify to write

Peths frac14 1ju frac14 1THORN frac14 x middot C (11)

Peths frac14 1ju frac14 2THORN frac14 z middot C (12)

Peths frac14 1ju frac14 3THORN frac14 y middot C (13)

where C is a constant given by3Peths frac14 1THORNx thorn y thorn z

Evaluation measureA fair evaluation of the results for both ADMIXTURE andEIGENSOFT is difficult to achieve because the individual ancestriesproduced by ADMIXTURE and EIGENSOFT are different in natureWith K ancestral populations an individual ancestry vector producedby ADMIXTURE has the form q1 qk such that

PKkfrac141qk frac14 1

Thus it has only K 2 1 independent components An equivalentrepresentation of ancestry can be produced by projecting an individualon the first K2 1 eigenvectors produced by EIGENSOFT The ancestryvectors that we store as the true ancestry when generating the simula-tion data have the same form as those produced by ADMIXTURE

We use squared correlation between the true ancestry and inferredancestry as the measure for evaluating the accuracy of ancestryinference For the dataset Sxyz let Q = u1 ux+y+z denote the trueproportion of YRI ancestry of the individuals Let Q1 = q11 q1x+y+z denote the ancestry proportions in cluster 1 inferred byADMIXTURE for dataset Sxyz with K = 2 where the second subscriptindexes the individuals in the dataset (and therefore we have that Q2 =q21 q2x+y+z = 1 2 q11 1 2 q1x+y+z Then the accuracymetric for ADMIXTURE is correlation(Q Q1)2 (which is equal tocorrelation(Q Q2)2) The corresponding accuracy measure can beconstructed for EIGENSOFT by replacing Q1 by the projections ofthe individual genotypes on the first eigenvector of the Sxyz genotypematrix For robustness we report the mean and standard error of thesquared correlation over 30 datasets for each parameter setting

RESULTS

Sample selection bias in a simulated CEU-YRI admixtureWe examined the effect of biased sampling of individuals by constructingsubsets of the whole dataset from the CEU-YRI admixture describedearlier and measuring the squared correlation between the true andinferred ancestry (File S2) Figure 2A shows the results of this analysiswith the ADMIXTURE software with six subplots Within each sub-plot the number of admixed individuals in the sample remains con-stant and the number of unmixed individuals from the two ancestralpopulations (CEU and YRI) is varied Figure 2B shows the results ofan identical analysis of the same datasets with the EIGENSOFTsoftware

Overall ADMIXTURE recovered individual ancestry better thanEIGENSOFT Previous work has shown that the accuracy of in-dividual ancestry recovery is a function of the FST differentiationbetween the ancestral populations The results we obtain may varyfor datasets with different FST values between the ancestral popula-tions and quantifying this effect will require more study

Figure 1 Proportion of YRI ancestry among individuals in the simulatedpopulation six generations after admixture

904 | S Shringarpure and E P Xing

Figure 2B shows that the EIGENSOFT results exhibit considerableirregular variation within subplots This is likely to be a result of oursampling scheme for creating biased datasets which allows individualsto be present in a dataset more than once This means that theresulting genotype matrix can have lower rank than expected result-ing in lower accuracy of ancestry inference using EIGENSOFT Thishowever does not cause any difficulties in the ADMIXTURE analysisas seen by the regular patterns in Figure 2A and would be expectedbased on the likelihood model underlying ADMIXTURE We willtherefore only use the results in Figure 2B to draw broad conclusionsabout the behavior of EIGENSOFT and not make specific inferencesor recommendations

In the scenarios in which we have few unmixed individuals fromboth ancestral populations Figure 2 A and B show that the accuracyof individual ancestry recovery drops noticeably This effect is evidentin the subplots with 200 400 and 700 admixed individuals In allsubplots in Figure 2A the results show high accuracy when we havea large number of unmixed individuals from at least one ancestral

population Pritchard et al (2000) and Tang et al (2005) have pre-viously noted that a significant number of unmixed individuals fromeach ancestral population is required for accurate recovery of stratifi-cation An initial examination of the results suggests that may besufficient to have a large number (around 502100) of unmixed indi-viduals from just one of the two ancestral populations to be able tocorrectly resolve stratification

We note that this guideline which is relevant when the number ofadmixed individuals is large does not apply if the dataset contains fewadmixed individuals and few unmixed individuals When there arefew admixed individuals both methods perform well (relative to theiraverage performance) even with as few as 10 unmixed individuals inthe dataset Mantel tests reveal that the high correlations obtainedwith few admixed individuals are statistically significant (P 1023) inall cases

To examine the effect of the number of unmixed individuals onthe ancestry recovery in more detail we looked at the subset of thegenerated subsamples which had the same number of unmixed

Figure 2 Squared correlation between thetrue individual ancestry and the individualancestry inferred using (A) ADMIXTURE withK = 2 and (B) EIGENSOFT with the topeigenvalue The different levelplots are drawnfor different number of admixed individuals inthe dataset The X and Y axes of the plots arelogarithmic in scale

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 905

individuals from both ancestral populations ie subsets of the formSxyz where the number of unmixed individuals x2103050100 andthe number of admixed individuals y21030100200400700 Foreach value of x (the number of unmixed individuals from each an-cestral population present in the dataset) we observed the effect ofvarying y (the number of admixed individuals present in the dataset)on the ancestry recovery Figure 3A shows the results for ADMIX-TURE and Figure 3B shows the results for EIGENSOFT When thenumber of unmixed individuals is large the methods recover ancestrywell and the number of admixed individuals has no effect on accu-racy However when the number of unmixed individuals in the sam-ple is small adding more admixed individuals to the sample reducesthe accuracy of the ancestry recovery for both ADMIXTURE andEIGENSOFT In Figure 3 A and B we see a threshold effect due tothe number of unmixed individuals when the number of unmixedindividuals changes from 30 to 100

The high accuracy of ancestry recovery when there are fewadmixed and few unmixed individuals suggests that earlier hypothesesabout the requirement of a large number of unmixed individuals foraccurate ancestry recovery may be an incomplete explanation Theresults in Figure 2A and Figure 3A along with the likelihood modelunderlying ADMIXTURE suggest that the effect on accuracy maydepend on the ratio of the number of admixed individuals to unmixedindividuals from each population in the sample For notational con-venience we will refer to this ratio as tsample = yx

To examine this hypothesis we replot the data used for Figure 3Aby examining the correlation measure as a function of the ratio ofadmixed individuals to unmixed individuals in the sample Figure 4 Aand B show the results for this visualization for ADMIXTURE andEIGENSOFT respectively From the figure we can see that the effectof sample selection bias can be understood using tsample The accuracyof ancestry recovery is high while the value of tsample is less than 5(approximately) and drops as this ratio increases This behavior isindependent of the exact number of unmixed individuals in the data-set and can be observed for both ADMIXTURE and EIGENSOFT

In our experiments we observe that individual ancestry can berecovered perfectly as long as tsample 5 Decreasing the number ofadmixed individuals has no adverse effect on the accuracy of ancestryrecovery The effects of sample selection bias on the accuracy ofancestry recovery in a two-population admixture scenario usingADMIXTURE can thus be explained in two scenarios (i) whentsample 5 sample selection bias has no effect on the accuracy ofindividual ancestry recovery and (ii) when tsample 5 the accuracy ofindividual ancestry measured using the correlation measure decreaseswith tsample

Correction of sample selection bias in simulated dataWe examinedthree methods of correcting sample selection bias in the simulateddata The methods and their results are described herein

Supervision ADMIXTURE can be run in supervised mode whereindividuals of known ancestry can be specified in advance to belong toexactly one of K populations being inferred This supervision is partialsince only nonadmixed individuals can be part of the supervised inputprovided to the algorithm In our experiments we include supervisionby assigning the individuals of known YRI and CEU ancestry to thetwo ancestral populations We find that supervision produces an im-provement that is statistically significant but practically insignificant( 00001 improvement in squared correlation P 22 middot 10216 ina one-sided paired t-test)

Including more SNPs Our analysis of the CEU-YRI admixtureused 52040 SNPs in a 102-Mb region We found that including more

SNPs in the ADMIXTURE analysis reduced the effects of sampleselection bias for identical sample sizes and samples When 116415SNPs are included in the ADMIXTURE analysis we observe no effectsof sample selection bias on the accuracy of ancestry inferenceAlthough we do not thin SNPs in our analysis thinning the originalset of 52040 SNPs by an r2 threshold of 01 produces a set of 35500SNPs Reanalyzing the data with this subset of SNPs produces a smallloss in accuracy (05 in squared correlation P 22 middot 10216 ina one-sided paired t-test) when compared with the original set of52040 SNPs Thus while adding independent SNPs improves accu-racy of ancestry inference adding or removing linked SNPs has littleeffect on accuracy

Resampling correction We implemented the resampling correc-tion in the section Approximate correction by using Equations 11213Since we do not have an informative prior expectation of the truenumbers of admixed individuals or the unmixed individuals we usedthe uninformative initial estimate of all groups of individuals havingequal population sizes We found that in more than 99 of thecorrected datasets the squared correlation between true and inferredancestry was larger than 095 (mean = 098 standard deviation =001) Thus the resampling correction is effective for correcting sampleselection bias

Impact of effective population size on sampleselection biasIn the previous simulation we used a correction based on anassumption of similar population sizes for the different groups Ingeneral the census (observed) population size and the effective

Figure 3 Effect of adding more admixed individuals to the dataset onthe correlation measure of accuracy when using (A) ADMIXTURE withK = 2 and (B) EIGENSOFT with the top eigenvalue The error barsindicate standard error

906 | S Shringarpure and E P Xing

population size of a population need not be equal due to demographicevents that the population may have undergone We attempt todemonstrate that the effect of biased sampling is dependent on theeffective population size and not only on the census population size

We simulated two populations according to the demographicscenario shown in Figure 5 using the coalescent simulatorms (Hudson2002) The founding population and the populations resulting fromthe split 1000 generations before present were set to have identicalsizes (N0 = 10000) Population 2 then undergoes an instantaneouscontraction 795 generations after the split with its population sizebeing reduced to aN0 The contraction factor a was varied acrossdatasets taking values of 099 (nearly no contraction) 05 (moderatecontraction) or 01 (strong contraction) Two-hundred generationsafter the contraction population 2 instantaneously grows back in sizeto N0 The simulation was continued for five generations after therecovery after which 50 diploid individuals were sampled from eachpopulation Each simulation used a mutation rate and recombinationrate of 1028 per bp per generation In each simulation 200 indepen-dent sites of 100 kb were generated producing between 47585 and58024 SNPs per simulation For each value of a (from 0105099)30 independent datasets were generated to evaluate the robustness ofresults (File S3)

The two populations from a single simulation were pooledtogether to form a 5050 admixture in a single generation to create50 admixed individuals This was followed by five generations ofrandom mating in the resulting admixed population using thesimulator described in the Introduction section We then generated

a sample for study by including the 50 individuals from population 150 admixed individuals and a variable number of individuals (x) frompopulation 2 (x 2 5 10 20 30 40 50) ADMIXTURE was used withK = 2 to infer the ancestry proportions for each individual and thesquared correlation coefficient (as described in the section Evaluationmeasure) for the dataset was used a measure of accuracy

Figure 6 shows the results of the ancestry inference with one curvefor each value of the contraction factor a As expected from the pre-vious experiments as the number of individuals from population 2 inthe dataset increases the accuracy of ancestry inference increaseswhich is demonstrated by each curve We also observe that the curvesfor the three different contraction factors have different accuraciesbefore they converge to a accuracy of almost 1

By our experimental design the two ancestral populations haveidentical census population sizes for each simulation and the censussize for population 2 is identical across simulations However due tothe bottleneck the effective population size of population 2 is reducedwith the amount of reduction dependent to the strength of thecontraction (the calculations for reduction in effective population sizefor population 2 can be found in File S1) Figure 6 thus demonstratesthat effective population size and not just census population size hasan impact on how ancestry inference is affected by sampling biasFrom the figure we can see that for the strongest contraction (a =01 producing the maximum reduction in the effective population sizefor population 2 to 036N0) the accuracy of ancestry inference is higheven with only 5 individuals from population 2 Since effective pop-ulation size is a measure of the genetic diversity of a population thissuggests that the reduced genetic diversity produced by the strongcontraction can be adequately captured even with a small numberof individuals For the same number of individuals from population2 across the curves we see that the accuracy is smallest for theweakest contraction (a = 099) and largest for the strongest contrac-tion (a = 01) Thus even when census population sizes are identicalvariation in effective population size affects the impact of samplingbias accuracy of ancestry inference This suggests that sampling biascan be more accurately captured by defining it in terms of effectivepopulation size rather than census population size

Correction of bias Using the estimate for the effective population sizeof population 2 from File S1 we can implement a resampling

Figure 5 A two-population demographic scenario Populations 1 and2 (of size N0 = 104) are formed from a single population of the samesize 1000 generations before present time Population 2 then under-goes a bottleneck for 200 generations followed by a instantaneousrecovery to its original population size

Figure 4 Effect of the ratio of number admixed individuals to unmixedindividuals in the dataset (tsample) on the correlation measure of accu-racy using (A) ADMIXTURE with K = 2 and (B) EIGENSOFT with the topeigenvalue The error bars indicate standard error The X-axis is loga-rithmic in scale

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 907

correction as before We assume that the effective population size forthe admixed population is identical to that for population 1 ie N0 =104 Figure 7 shows the change in accuracy after application of theresampling correction (relative to the accuracies in Figure 6)

From the figure we can see that while the accuracy for the originaldataset is less than 095 the resampling correction produces animprovement in the results denoted by the portions of the curves inbold However beyond those points as the datasets acquire an excessof the unmixed individuals relative to the admixed individuals we seethat the uncorrected accuracy is high (larger than 095) and theresampling correction decreases the accuracy of ancestry inferenceThus a resampling correction is unnecessary when there is an excessof unmixed individuals relative to their expected number fromeffective population size calculations

Sample selection bias in cattle genotype dataTo demonstrate the effects of sample selection bias on a real geneticdataset we used the data from (McTavish et al 2013ab) who gen-otyped 1461 individuals belonging to cattle breeds across the world at47506 SNPs We focus on admixture between African and Europeancattle breeds as seen in their New World descendants For theirexperiments with STRUCTURE McTavish et al (2013b) used a subsetof 1814 SNPs that was common to all their genotyping chips Our datatherefore consists of genotype data at 1814 SNPs from 40 New Worldindividuals (Texas Longhorns) 100 European individuals (Limousinfrom Southern Europe) and 100 African individuals (NrsquoDama andNrsquoDamaXBoran)

We created multiple subsets from the original data by including allthe 40 admixed New World individuals and a variable number (x) ofunmixed individuals from the European and African populations(with the same number from each population) The number of un-mixed individuals chosen was increased gradually from 10 to 100 (x 210 20 30 40 50 60 80 100) For each value of x we generated 30datasets for statistical robustness ADMIXTURE was used with K = 2to infer ancestry proportions for all individuals The squared correla-tion coefficient for the ancestry proportions of the admixed individ-uals was used as a measure of accuracy with the ancestry proportionsinferred on the full data used as a truth set

Figure 8 shows the results of the ancestry inference as a function ofthe number of unmixed individuals (x) We see that as the number of

unmixed individuals increases the accuracy of ancestry inferenceimproves When the number of unmixed individuals is less than 40the squared correlation measure is less than 09

Correction of bias For the cattle data effective population sizes forthe three populations are unknown We assume that all threepopulations have equal effective population size The correction isimplemented using resampling as before The blue curve in Figure 8shows the effect of the correction on accuracy From the figure we cansee that when the number of admixed individuals is larger than thenumber of unmixed individuals (tsample $ 1 as defined previously)the correction improves accuracy but causes a small drop in accuracywhen there are many unmixed individuals

DISCUSSIONBiased sampling of individuals is a result of practical constraints onthe study design of genotypingsequencing projects In this aspectsample selection bias is similar to bias due to SNP ascertainment Itcan also occur as a result of combining multiple genomic datasetsfrom different sources For instance the overlap of three genotypingchips required McTavish et al (2013b) to use only 1814 SNPs for theirSTRUCTURE analysis

In most stratification analyses the recovered ancestry proportionsare used to make inferences about the ancestry of the sample and thegenetic relationships of similaritydifferences between the studypopulations Ancestry proportion estimates are also used in associa-tion studies to account for effects of stratification Therefore it isessential to have accurate recovery of ancestry Our experimentssuggest that sample selection bias can be a problem in accuratepopulation stratification and recovering individual ancestry Oursimulations show that unlike observations from previous studies theaccuracy of individual ancestry recovery is not dependent only on thenumber of unmixed individuals present in the sample We observea threshold effect where the accuracy of ancestry inference is affectedby sample selection bias depending on the ratio of admixedindividuals to unmixed individuals in the sample Our simulationssuggest that the accuracy of ancestry inference is affected when thisratio tsample is less than 5 For real genotype data from cattle we

Figure 7 Change in correlation measure from Figure 6 after theresampling correction taking into account the impact of the contrac-tion on effective population size Each curve is drawn for a single valueof the contraction factor a and shows the mean and standard error ofthe change in correlation measure over 30 datasets The regions of thecurves in bold show regions where accuracy in the uncorrected results(from Figure 6) is less than 095

Figure 6 Impact of effective population size on the correlationmeasure of accuracy of ancestry inference using ADMIXTURE withK = 2 Each curve is drawn for a single value of the contraction factor aand shows the mean and standard error of the correlation measureover 30 datasets

908 | S Shringarpure and E P Xing

observed that a ratio of tsample larger than 1 was sufficient to affectancestry estimates

Our experiments do not directly demonstrate an effect of FST onsample selection bias However using the 1814 SNPs for the genotypedata we were able to observe effects of sample selection bias onAfrican-European admixture in cattle (FST 015) but unable toobserve any effect of sample selection bias on European-Indian ad-mixture (FST 028 results not shown here) This suggests that FSThas an effect on sample selection bias with well-differentiated pop-ulations being easy to separate even with biased sampling The sim-ulations and real data analysis had relatively large FST between the twoancestral populations ( 01 for the simulations 015 for the cattlegenotype data) but vastly different number of SNPs (52040 for thesimulation but only 1814 for the cattle genotype data) The degrada-tion in performance of ancestry inference in the two cases was con-siderably different suggesting that the effect of sample selection biasalso depends on the number of loci available This was also validatedby the observation that doubling the number of SNPs in simulationeliminated the effects of sample selection bias completely

We also showed through simulations that sample selection biasdepends on the effective population size rather than simply the censuspopulation size of the populations studied This suggests that samplingfor genotypingsequencing projects can be improved (in terms ofdiversity of sampling and cost) by taking into account independentknowledge of the demographic history of the populations beingsampled

Although our analyses use two specific methods (ADMIXTUREand EIGENSOFT) the effects we observe are a feature of theassumptions underlying both methods rather than the specificimplementations ADMIXTURE is a representative of the likelihood-based models that assume (a) admixture between ancestral popula-tions and (b) that modern individual genomes are mixture ofcontributions from different ancestral populations This is themodel underlying STRUCTURE and frappe and to a large extentthe extensions mentioned earlier Likelihood-based methods aresusceptible to effects of sample selection bias since each individual isgiven equal weight in the sample a priori This is a result of thefundamental assumption of learning methods that the sampleobserved is representative of the underlying population distributionEigenanalysis which also weighs each individual equally apriori alsosuffers from a similar problem as the likelihood-based method The

sensitivity of eigenanalysis to sample size variation and outliers is well-known and is also reported by McVean (2009) and Novembre andStephens (2008)

ADMIXTURE and EIGENSOFT are unsupervised methods forperforming ancestry inference ie they do not leverage informationabout individuals of known ancestry to estimate their respective modelparameters ADMIXTURE supports a mode (which is referred to asa supervised mode in the documentation) in which some individualscan be ldquolabeledrdquo ie assigned to have ancestry from only a singlecluster However all the individuals labeled or otherwise are usedto learn the allele frequency parameters of the ADMIXTURE modelIn machine learning literature this mode of learning is commonlycalled ldquosemi-supervisedrdquo learning wherein a combination of labeledand unlabeled data is used to learn model parameters In simulationswe observed that this mode of ancestry inference is also affected bysample selection bias

A different class of methods can estimate local (subcontinental)ancestry from dense genotype data (Sankararaman et al 2008 Priceet al 2009 Wall et al 2011 Baran et al 2012) A common theme of allof these methods is the use of external sources of genotype data asldquoreference panelsrdquo (referred to as ldquotraining datardquo in machine learningliterature) which are used to estimate ancestral allele frequencies orancestral haplotypes The estimated parameters (frequencies or hap-lotypes) are then used to infer the ancestry of the individuals in thestudy sample based on their genotypes (referred to as ldquotest datardquo inmachine learning literature) Therefore these methods can be said tobe performing supervised learning Supervised learning using externalsources of data for estimating model parameters is robust to biasedsampling in the study sample but is sensitive to the reference panelsused for training (Zadrozny et al 2003)

We proposed a resampling correction for sample selection biasusing a mathematical framework modeling the sampling process Theproposed correction requires knowledge of some auxiliary informa-tion about the selection criteria that is correlated with the genotypes(bias is present only if selection is dependent on the genotypesdirectly or indirectly) For genetic datasets geography provides onesuch criterion that is easy to acquire during data collection Using thisinformation we proposed a correction that is easy to implement Weshowed using simulation experiments that such a correction iseffective in practice and leads to more accurate results In terms ofcomputational effort ADMIXTURE has a linear dependence on thenumber of individuals The resampling correction will therefore requiretime proportional to the number of individuals in the resampled setHowever a better implementation of the resampling correction forADMIXTURE would be to be weigh the likelihood contribution fromeach individual in the original sample by its resampling correctionfactor (1p(s = 1|u)) This would allow the correction to be implementedwithout any increase in computational complexity of the ADMIXTUREalgorithm The resampling correction implemented naively can causeproblems for EIGENSOFT due to the possibility of colinearity resultingfrom duplication of individuals as evident in Figure 2B A weightedeigenanalysis would avoid this problem without affecting the efficiencyof the analysis

An alternative method for correcting sample selection bias is toadd more markers to the analysis as observed from the simulationAdding more SNPs to the analysis is an appealing solution because itdoes not require any further assumptions that any statisticalcorrection may require with the exception of the assumption of nolinkage between SNPs However this may not always be possible if thestudy dataset is constructed by intersecting multiple datasets contain-ing the individuals of interest genotyped on different platforms

Figure 8 Variation in accuracy of ancestry inference with changingnumber of unmixed individuals The red curve shows accuracy withoutcorrection and the blue curve shows the accuracy when the correctionis applied

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 909

Our simulation experiments used simple two-population admix-ture scenarios to examine the effects of sample selection bias on theaccuracy of stratification In one set of experiments we used twopopulations (CEU and YRI) that were easily separable and examinedthe admixed population resulting after a single recent admixtureevent In reality the demographic processes underlying the evolutionof populations are much more complicated In such scenarios it isreasonable to expect that the stratification problem may be harder toresolve and would suffer from the effects of sample selection bias moreseverely

A limitation of our proposed correction is the requirement forexternal knowledge of the selection criteria for samples In ourexperiments we either knew the nature of the bias (by design in thesimulation experiments) or assumed that it was known In general theaccuracy of the correction method proposed will strongly depend onthe relationship between the selection criteria and the genotypesCorrections factors therefore may be dataset-specific An alternativefuture direction for correcting sample selection bias would be todevelop models of population structure that can also model theauxiliary factors such as geography or language that may determinethe selection process responsible for the biased sampling There maybe occasions where the nature of the sampling bias is partially knownwithout having sufficient information to allow a correction to bedeveloped For instance analyses of the POPRES dataset (Nelson et al2008) often select individuals who are known to have all four grand-parents of the same ancestry (Novembre et al 2008) In this case thecriteria for selection can be stated explicitly but a mathematical for-mulation of the bias is hard to obtain In such scenarios we recom-mend subsamplingresampling datasets to examine the robustness ofancestry inference results to variation in sampling

In summary our experiments suggest that sample selection biasaffects ancestry inference and depends on factors such as effectivepopulation size genetic differentiation between populations numberof loci available and the ratio of admixed samples to unmixedsamples For analyses of human genotype data hundreds of thousandsof SNPs are typically available Therefore we do not expect sampleselection bias to be a problem for ancestry inference in human dataexcept when the source populations for admixture are not well-differentiated Local ancestry methods which can make use of externalgenotype data also provide a way of avoiding the effects of sampleselection bias We expect that sample selection bias will be of interestin ancestry analysis for model organisms where availability of externalgenotype data are limited and most genotyping chips have relativelyfewer loci When the sampling process is relatively well-understooda mathematical correction as we have proposed will help to reduce theeffects of sample selection bias In our experiments a rough guidelineof using approximately identical number of admixed and unmixedindividuals reduces the effects of biased sampling When the samplingprocess cannot be modeled accurately examining the robustness ofancestry estimates to sampling may help avoid inaccurate ancestryestimation

ACKNOWLEDGMENTSThis research was supported in part by NSF Grant DBI-0546594 andNIH Grant R01GM087694

LITERATURE CITEDAlexander D H J Novembre and K Lange 2009 Fast model-based es-

timation of ancestry in unrelated individuals Genome Res 19 1655ndash1664

Baran Y B Pasaniuc S Sankararaman D G Torgerson C Gignoux et al2012 Fast and accurate inference of local ancestry in Latino popula-tions Bioinformatics 28 1359ndash1367

Cavalli-Sforza L L 2005 The Human Genome Diversity Project pastpresent and future Nat Rev Genet 6 333ndash340

Cortes C M Mohri M Riley and A Rostamizadeh 2008 Sample selec-tion bias correction theory Algorithmic Learning Theory 5254 16

Davidson I and B Zadrozny 2005 An improved categorization of clas-sifiers sensitivity on sample selection bias pp 605ndash608 in Fifth IEEEInternational Conference on Data Mining ICDM05 Available athttpieeexploreieeeorgxplarticleDetailsjsptp=amparnumber=1565737ampqueryText3DAn+improved+categorization+of+classifiers+sensitivity+on+sample+selection+bias Accessed March 26 2014

Engelhardt B E and M Stephens 2010 Analysis of population structurea unifying framework and novel methods based on sparse factor analysisPLoS Genet 6 12

Falush D M Stephens and J K Pritchard 2003 Inference of populationstructure using multilocus genotype data linked loci and correlated allelefrequencies Genetics 164 1567ndash1587

Gibbs R A 2003 The International HapMap Project Nature 426 789ndash796Heckman J J 1979 Sample selection bias as a specification error Econo-

metrica 47 153ndash161Hoggart C J E J Parra M D Shriver C Bonilla R A Kittles et al

2003 Control of confounding of genetic associations in stratified pop-ulations Am J Hum Genet 72 1492ndash1504

Hudson R R 2002 Generating samples under a Wright-Fisher neutralmodel of genetic variation Bioinformatics 18 337ndash338

Huelsenbeck J P and P Andolfatto 2007 Inference of population struc-ture under a Dirichlet process prior Genetics 175 1787ndash1802

Kaeuffer R D Reacuteale D W Coltman and D Pontier 2007 Detectingpopulation structure using STRUCTURE software effect of backgroundlinkage disequilibrium Heredity 99 374ndash380

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013a Data from New World cattle show ancestry from multiple in-dependent domestication events Available at httpdatadryadorgresourcedoi105061dryad42tr0 Accessed March 26 2014

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013b New World cattle show ancestry from multiple independentdomestication events Proc Natl Acad Sci USA 110 E1398ndashE1406

McVean G 2009 A genealogical interpretation of principal componentsanalysis PLoS Genet 5 e1000686

Nelson M R K Bryc K S King A Indap A R Boyko et al 2008 ThePopulation Reference Sample POPRES a resource for population dis-ease and pharmacological genetics research Am J Hum Genet 83 347ndash358

Novembre J and M Stephens 2008 Interpreting principal componentanalyses of spatial population genetic variation Nat Genet 40 646ndash649

Novembre J T Johnson K Bryc Z Kutalik A R Boyko et al2008 Genes mirror geography within Europe Nature 456 98ndash101

Patterson N A L Price and D Reich 2006 Population structure andeigenanalysis PLoS Genet 2 e190

Phillips S J M Dudiacutek J Elith C H Graham A Lehmann et al2009 Sample selection bias and presence-only distribution modelsimplications for background and pseudo-absence data Ecol Appl 19181ndash197

Price A L N J Patterson R M Plenge M E Weinblatt N A Shadicket al 2006 Principal components analysis corrects for stratification ingenome-wide association studies Nat Genet 38 904ndash909

Price A L A Tandon N Patterson K C Barnes N Rafaels et al2009 Sensitive detection of chromosomal segments of distinct ancestryin admixed populations PLoS Genet 5 e1000519

Pritchard J K M Stephens and P Donnelly 2000 Inference of populationstructure from multilocus genotype data Genetics 155 945ndash959

Rosenberg N A J K Pritchard J L Weber H M Cann K K Kidd et al2002 Genetic Structure of Human Populations Science 298 2381ndash2385

Sankararaman S G Kimmel E Halperin and M I Jordan 2008 On theinference of ancestries in admixed populations Genome Res 18 668ndash675

910 | S Shringarpure and E P Xing

Shringarpure S and E P Xing 2009 mStruct Inference of PopulationStructure in Light of Both Genetic Admixing and Allele MutationsGenetics 182 575ndash593

Tang H J Peng P Wang and N J Risch 2005 Estimation of individualadmixture analytical and study design considerations Genet Epidemiol28 289ndash301

Vella F 1998 Estimating models with sample selection bias a surveyJ Hum Resour 33 127ndash169

Wall J D R Jiang C Gignoux G K Chen C Eng et al 2011 Geneticvariation in Native Americans inferred from Latino SNP and rese-quencing data Mol Biol Evol 28 2231ndash2237

Zadrozny B 2004 Learning and evaluating classifiers under sample selec-tion bias pp 114 in Twenty-First International Conference on MachineLearning ICML 04 ACM New York NY

Zadrozny B J Langford and N Abe 2003 Cost-sensitive learning by cost-proportionate example weighting pp 435ndash442 in Third IEEE Interna-tional Conference on Data Mining ICDM 2003 Available at httpieeexploreieeeorgxplloginjsptp=amparnumber=1250950ampurl=http3A2F2Fieeexploreieeeorg2Fxpls2Fabs_alljsp3Farnumber3D1250950 Accessed March 27 2014

Communicating editor D-J De Koning

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 911

Page 4: Effects of Sample Selection Bias on the Accuracy of ...epxing/papers/2014/Shringarpure_Xing_GGG...The problem of sample selection bias was first widely studied in econometrics, where

among the individuals in the simulated population From thefigure we can see that most individuals are admixed and there-fore we refer to this simulated population as admixed For ourancestry inference experiments we also included the remainingCEU-unrelated individuals (38 in total) and the remaining YRI-unrelated individuals (50 in total) in the analysis as proxies for theancestral populations In our experiments we will refer to theCEU and YRI individuals as unmixed individuals

We use the proportion of YRI ancestry denoted as ui to representthe ancestry of the ith individual The population used for our experi-ments therefore contained the following three groups

1 100 CEU individuals obtained by duplicating (and in some casestriplicating) 38 unrelated CEU individuals ui = 0

2 100 YRI individuals obtained by duplicating 50 unrelated YRIindividuals ui = 1

3 800 admixed individuals with ui 2 (0 1)

To study the effects of sampling bias we sample individuals fromthe dataset with replacement to generate a dataset of size x + y + zwhere x is the number of CEU individuals z is the number of YRIindividuals and y is the number of admixed individuals By varying xy and z we can generate smaller datasets with different kinds of biasand deviations from the original dataset We choose x and z from103050100 and y from 1030100200400700 Thus the smallestpossible dataset is 101010 (30 individuals) and the largest possibledataset is 100700100 (900 individuals) We will use Sxyz to refer tothe dataset x y z

In this case g comprises the 52040-dimensional genotypes of theindividuals and u comprises the group memberships of each individ-ual according to their ancestry (u 2 1 2 3 with the three groups asdefined earlier) In general the probability distribution of u may notbe known Therefore as a first guess we will assume that u hasa uniform distribution ie PethuTHORN frac14 1

3 for any value of u This assump-tion means that an individual (regardless of inclusion in the studysample) is equally likely to be from the CEU YRI or the admixedpopulation By design s depends only on u and we can write theprobability distribution P(s = 1|u) as

Peths frac14 1ju frac14 1THORN frac14 Pethu frac14 1js frac14 1THORNPeths frac14 1THORNPethu frac14 1THORN frac14 x

x thorn y thorn zPeths frac14 1THORN

1=3

(8)

Peths frac14 1ju frac14 2THORN frac14 Pethu frac14 2js frac14 1THORNPeths frac14 1THORNPethu frac14 2THORN frac14 z

x thorn y thorn zPeths frac14 1THORN

1=3

(9)

Peths frac14 1ju frac14 3THORN frac14 Pethu frac14 3js frac14 1THORNPeths frac14 1THORNPethu frac14 3THORN frac14 y

x thorn y thorn zPeths frac14 1THORN

1=3

(10)

which we can simplify to write

Peths frac14 1ju frac14 1THORN frac14 x middot C (11)

Peths frac14 1ju frac14 2THORN frac14 z middot C (12)

Peths frac14 1ju frac14 3THORN frac14 y middot C (13)

where C is a constant given by3Peths frac14 1THORNx thorn y thorn z

Evaluation measureA fair evaluation of the results for both ADMIXTURE andEIGENSOFT is difficult to achieve because the individual ancestriesproduced by ADMIXTURE and EIGENSOFT are different in natureWith K ancestral populations an individual ancestry vector producedby ADMIXTURE has the form q1 qk such that

PKkfrac141qk frac14 1

Thus it has only K 2 1 independent components An equivalentrepresentation of ancestry can be produced by projecting an individualon the first K2 1 eigenvectors produced by EIGENSOFT The ancestryvectors that we store as the true ancestry when generating the simula-tion data have the same form as those produced by ADMIXTURE

We use squared correlation between the true ancestry and inferredancestry as the measure for evaluating the accuracy of ancestryinference For the dataset Sxyz let Q = u1 ux+y+z denote the trueproportion of YRI ancestry of the individuals Let Q1 = q11 q1x+y+z denote the ancestry proportions in cluster 1 inferred byADMIXTURE for dataset Sxyz with K = 2 where the second subscriptindexes the individuals in the dataset (and therefore we have that Q2 =q21 q2x+y+z = 1 2 q11 1 2 q1x+y+z Then the accuracymetric for ADMIXTURE is correlation(Q Q1)2 (which is equal tocorrelation(Q Q2)2) The corresponding accuracy measure can beconstructed for EIGENSOFT by replacing Q1 by the projections ofthe individual genotypes on the first eigenvector of the Sxyz genotypematrix For robustness we report the mean and standard error of thesquared correlation over 30 datasets for each parameter setting

RESULTS

Sample selection bias in a simulated CEU-YRI admixtureWe examined the effect of biased sampling of individuals by constructingsubsets of the whole dataset from the CEU-YRI admixture describedearlier and measuring the squared correlation between the true andinferred ancestry (File S2) Figure 2A shows the results of this analysiswith the ADMIXTURE software with six subplots Within each sub-plot the number of admixed individuals in the sample remains con-stant and the number of unmixed individuals from the two ancestralpopulations (CEU and YRI) is varied Figure 2B shows the results ofan identical analysis of the same datasets with the EIGENSOFTsoftware

Overall ADMIXTURE recovered individual ancestry better thanEIGENSOFT Previous work has shown that the accuracy of in-dividual ancestry recovery is a function of the FST differentiationbetween the ancestral populations The results we obtain may varyfor datasets with different FST values between the ancestral popula-tions and quantifying this effect will require more study

Figure 1 Proportion of YRI ancestry among individuals in the simulatedpopulation six generations after admixture

904 | S Shringarpure and E P Xing

Figure 2B shows that the EIGENSOFT results exhibit considerableirregular variation within subplots This is likely to be a result of oursampling scheme for creating biased datasets which allows individualsto be present in a dataset more than once This means that theresulting genotype matrix can have lower rank than expected result-ing in lower accuracy of ancestry inference using EIGENSOFT Thishowever does not cause any difficulties in the ADMIXTURE analysisas seen by the regular patterns in Figure 2A and would be expectedbased on the likelihood model underlying ADMIXTURE We willtherefore only use the results in Figure 2B to draw broad conclusionsabout the behavior of EIGENSOFT and not make specific inferencesor recommendations

In the scenarios in which we have few unmixed individuals fromboth ancestral populations Figure 2 A and B show that the accuracyof individual ancestry recovery drops noticeably This effect is evidentin the subplots with 200 400 and 700 admixed individuals In allsubplots in Figure 2A the results show high accuracy when we havea large number of unmixed individuals from at least one ancestral

population Pritchard et al (2000) and Tang et al (2005) have pre-viously noted that a significant number of unmixed individuals fromeach ancestral population is required for accurate recovery of stratifi-cation An initial examination of the results suggests that may besufficient to have a large number (around 502100) of unmixed indi-viduals from just one of the two ancestral populations to be able tocorrectly resolve stratification

We note that this guideline which is relevant when the number ofadmixed individuals is large does not apply if the dataset contains fewadmixed individuals and few unmixed individuals When there arefew admixed individuals both methods perform well (relative to theiraverage performance) even with as few as 10 unmixed individuals inthe dataset Mantel tests reveal that the high correlations obtainedwith few admixed individuals are statistically significant (P 1023) inall cases

To examine the effect of the number of unmixed individuals onthe ancestry recovery in more detail we looked at the subset of thegenerated subsamples which had the same number of unmixed

Figure 2 Squared correlation between thetrue individual ancestry and the individualancestry inferred using (A) ADMIXTURE withK = 2 and (B) EIGENSOFT with the topeigenvalue The different levelplots are drawnfor different number of admixed individuals inthe dataset The X and Y axes of the plots arelogarithmic in scale

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 905

individuals from both ancestral populations ie subsets of the formSxyz where the number of unmixed individuals x2103050100 andthe number of admixed individuals y21030100200400700 Foreach value of x (the number of unmixed individuals from each an-cestral population present in the dataset) we observed the effect ofvarying y (the number of admixed individuals present in the dataset)on the ancestry recovery Figure 3A shows the results for ADMIX-TURE and Figure 3B shows the results for EIGENSOFT When thenumber of unmixed individuals is large the methods recover ancestrywell and the number of admixed individuals has no effect on accu-racy However when the number of unmixed individuals in the sam-ple is small adding more admixed individuals to the sample reducesthe accuracy of the ancestry recovery for both ADMIXTURE andEIGENSOFT In Figure 3 A and B we see a threshold effect due tothe number of unmixed individuals when the number of unmixedindividuals changes from 30 to 100

The high accuracy of ancestry recovery when there are fewadmixed and few unmixed individuals suggests that earlier hypothesesabout the requirement of a large number of unmixed individuals foraccurate ancestry recovery may be an incomplete explanation Theresults in Figure 2A and Figure 3A along with the likelihood modelunderlying ADMIXTURE suggest that the effect on accuracy maydepend on the ratio of the number of admixed individuals to unmixedindividuals from each population in the sample For notational con-venience we will refer to this ratio as tsample = yx

To examine this hypothesis we replot the data used for Figure 3Aby examining the correlation measure as a function of the ratio ofadmixed individuals to unmixed individuals in the sample Figure 4 Aand B show the results for this visualization for ADMIXTURE andEIGENSOFT respectively From the figure we can see that the effectof sample selection bias can be understood using tsample The accuracyof ancestry recovery is high while the value of tsample is less than 5(approximately) and drops as this ratio increases This behavior isindependent of the exact number of unmixed individuals in the data-set and can be observed for both ADMIXTURE and EIGENSOFT

In our experiments we observe that individual ancestry can berecovered perfectly as long as tsample 5 Decreasing the number ofadmixed individuals has no adverse effect on the accuracy of ancestryrecovery The effects of sample selection bias on the accuracy ofancestry recovery in a two-population admixture scenario usingADMIXTURE can thus be explained in two scenarios (i) whentsample 5 sample selection bias has no effect on the accuracy ofindividual ancestry recovery and (ii) when tsample 5 the accuracy ofindividual ancestry measured using the correlation measure decreaseswith tsample

Correction of sample selection bias in simulated dataWe examinedthree methods of correcting sample selection bias in the simulateddata The methods and their results are described herein

Supervision ADMIXTURE can be run in supervised mode whereindividuals of known ancestry can be specified in advance to belong toexactly one of K populations being inferred This supervision is partialsince only nonadmixed individuals can be part of the supervised inputprovided to the algorithm In our experiments we include supervisionby assigning the individuals of known YRI and CEU ancestry to thetwo ancestral populations We find that supervision produces an im-provement that is statistically significant but practically insignificant( 00001 improvement in squared correlation P 22 middot 10216 ina one-sided paired t-test)

Including more SNPs Our analysis of the CEU-YRI admixtureused 52040 SNPs in a 102-Mb region We found that including more

SNPs in the ADMIXTURE analysis reduced the effects of sampleselection bias for identical sample sizes and samples When 116415SNPs are included in the ADMIXTURE analysis we observe no effectsof sample selection bias on the accuracy of ancestry inferenceAlthough we do not thin SNPs in our analysis thinning the originalset of 52040 SNPs by an r2 threshold of 01 produces a set of 35500SNPs Reanalyzing the data with this subset of SNPs produces a smallloss in accuracy (05 in squared correlation P 22 middot 10216 ina one-sided paired t-test) when compared with the original set of52040 SNPs Thus while adding independent SNPs improves accu-racy of ancestry inference adding or removing linked SNPs has littleeffect on accuracy

Resampling correction We implemented the resampling correc-tion in the section Approximate correction by using Equations 11213Since we do not have an informative prior expectation of the truenumbers of admixed individuals or the unmixed individuals we usedthe uninformative initial estimate of all groups of individuals havingequal population sizes We found that in more than 99 of thecorrected datasets the squared correlation between true and inferredancestry was larger than 095 (mean = 098 standard deviation =001) Thus the resampling correction is effective for correcting sampleselection bias

Impact of effective population size on sampleselection biasIn the previous simulation we used a correction based on anassumption of similar population sizes for the different groups Ingeneral the census (observed) population size and the effective

Figure 3 Effect of adding more admixed individuals to the dataset onthe correlation measure of accuracy when using (A) ADMIXTURE withK = 2 and (B) EIGENSOFT with the top eigenvalue The error barsindicate standard error

906 | S Shringarpure and E P Xing

population size of a population need not be equal due to demographicevents that the population may have undergone We attempt todemonstrate that the effect of biased sampling is dependent on theeffective population size and not only on the census population size

We simulated two populations according to the demographicscenario shown in Figure 5 using the coalescent simulatorms (Hudson2002) The founding population and the populations resulting fromthe split 1000 generations before present were set to have identicalsizes (N0 = 10000) Population 2 then undergoes an instantaneouscontraction 795 generations after the split with its population sizebeing reduced to aN0 The contraction factor a was varied acrossdatasets taking values of 099 (nearly no contraction) 05 (moderatecontraction) or 01 (strong contraction) Two-hundred generationsafter the contraction population 2 instantaneously grows back in sizeto N0 The simulation was continued for five generations after therecovery after which 50 diploid individuals were sampled from eachpopulation Each simulation used a mutation rate and recombinationrate of 1028 per bp per generation In each simulation 200 indepen-dent sites of 100 kb were generated producing between 47585 and58024 SNPs per simulation For each value of a (from 0105099)30 independent datasets were generated to evaluate the robustness ofresults (File S3)

The two populations from a single simulation were pooledtogether to form a 5050 admixture in a single generation to create50 admixed individuals This was followed by five generations ofrandom mating in the resulting admixed population using thesimulator described in the Introduction section We then generated

a sample for study by including the 50 individuals from population 150 admixed individuals and a variable number of individuals (x) frompopulation 2 (x 2 5 10 20 30 40 50) ADMIXTURE was used withK = 2 to infer the ancestry proportions for each individual and thesquared correlation coefficient (as described in the section Evaluationmeasure) for the dataset was used a measure of accuracy

Figure 6 shows the results of the ancestry inference with one curvefor each value of the contraction factor a As expected from the pre-vious experiments as the number of individuals from population 2 inthe dataset increases the accuracy of ancestry inference increaseswhich is demonstrated by each curve We also observe that the curvesfor the three different contraction factors have different accuraciesbefore they converge to a accuracy of almost 1

By our experimental design the two ancestral populations haveidentical census population sizes for each simulation and the censussize for population 2 is identical across simulations However due tothe bottleneck the effective population size of population 2 is reducedwith the amount of reduction dependent to the strength of thecontraction (the calculations for reduction in effective population sizefor population 2 can be found in File S1) Figure 6 thus demonstratesthat effective population size and not just census population size hasan impact on how ancestry inference is affected by sampling biasFrom the figure we can see that for the strongest contraction (a =01 producing the maximum reduction in the effective population sizefor population 2 to 036N0) the accuracy of ancestry inference is higheven with only 5 individuals from population 2 Since effective pop-ulation size is a measure of the genetic diversity of a population thissuggests that the reduced genetic diversity produced by the strongcontraction can be adequately captured even with a small numberof individuals For the same number of individuals from population2 across the curves we see that the accuracy is smallest for theweakest contraction (a = 099) and largest for the strongest contrac-tion (a = 01) Thus even when census population sizes are identicalvariation in effective population size affects the impact of samplingbias accuracy of ancestry inference This suggests that sampling biascan be more accurately captured by defining it in terms of effectivepopulation size rather than census population size

Correction of bias Using the estimate for the effective population sizeof population 2 from File S1 we can implement a resampling

Figure 5 A two-population demographic scenario Populations 1 and2 (of size N0 = 104) are formed from a single population of the samesize 1000 generations before present time Population 2 then under-goes a bottleneck for 200 generations followed by a instantaneousrecovery to its original population size

Figure 4 Effect of the ratio of number admixed individuals to unmixedindividuals in the dataset (tsample) on the correlation measure of accu-racy using (A) ADMIXTURE with K = 2 and (B) EIGENSOFT with the topeigenvalue The error bars indicate standard error The X-axis is loga-rithmic in scale

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 907

correction as before We assume that the effective population size forthe admixed population is identical to that for population 1 ie N0 =104 Figure 7 shows the change in accuracy after application of theresampling correction (relative to the accuracies in Figure 6)

From the figure we can see that while the accuracy for the originaldataset is less than 095 the resampling correction produces animprovement in the results denoted by the portions of the curves inbold However beyond those points as the datasets acquire an excessof the unmixed individuals relative to the admixed individuals we seethat the uncorrected accuracy is high (larger than 095) and theresampling correction decreases the accuracy of ancestry inferenceThus a resampling correction is unnecessary when there is an excessof unmixed individuals relative to their expected number fromeffective population size calculations

Sample selection bias in cattle genotype dataTo demonstrate the effects of sample selection bias on a real geneticdataset we used the data from (McTavish et al 2013ab) who gen-otyped 1461 individuals belonging to cattle breeds across the world at47506 SNPs We focus on admixture between African and Europeancattle breeds as seen in their New World descendants For theirexperiments with STRUCTURE McTavish et al (2013b) used a subsetof 1814 SNPs that was common to all their genotyping chips Our datatherefore consists of genotype data at 1814 SNPs from 40 New Worldindividuals (Texas Longhorns) 100 European individuals (Limousinfrom Southern Europe) and 100 African individuals (NrsquoDama andNrsquoDamaXBoran)

We created multiple subsets from the original data by including allthe 40 admixed New World individuals and a variable number (x) ofunmixed individuals from the European and African populations(with the same number from each population) The number of un-mixed individuals chosen was increased gradually from 10 to 100 (x 210 20 30 40 50 60 80 100) For each value of x we generated 30datasets for statistical robustness ADMIXTURE was used with K = 2to infer ancestry proportions for all individuals The squared correla-tion coefficient for the ancestry proportions of the admixed individ-uals was used as a measure of accuracy with the ancestry proportionsinferred on the full data used as a truth set

Figure 8 shows the results of the ancestry inference as a function ofthe number of unmixed individuals (x) We see that as the number of

unmixed individuals increases the accuracy of ancestry inferenceimproves When the number of unmixed individuals is less than 40the squared correlation measure is less than 09

Correction of bias For the cattle data effective population sizes forthe three populations are unknown We assume that all threepopulations have equal effective population size The correction isimplemented using resampling as before The blue curve in Figure 8shows the effect of the correction on accuracy From the figure we cansee that when the number of admixed individuals is larger than thenumber of unmixed individuals (tsample $ 1 as defined previously)the correction improves accuracy but causes a small drop in accuracywhen there are many unmixed individuals

DISCUSSIONBiased sampling of individuals is a result of practical constraints onthe study design of genotypingsequencing projects In this aspectsample selection bias is similar to bias due to SNP ascertainment Itcan also occur as a result of combining multiple genomic datasetsfrom different sources For instance the overlap of three genotypingchips required McTavish et al (2013b) to use only 1814 SNPs for theirSTRUCTURE analysis

In most stratification analyses the recovered ancestry proportionsare used to make inferences about the ancestry of the sample and thegenetic relationships of similaritydifferences between the studypopulations Ancestry proportion estimates are also used in associa-tion studies to account for effects of stratification Therefore it isessential to have accurate recovery of ancestry Our experimentssuggest that sample selection bias can be a problem in accuratepopulation stratification and recovering individual ancestry Oursimulations show that unlike observations from previous studies theaccuracy of individual ancestry recovery is not dependent only on thenumber of unmixed individuals present in the sample We observea threshold effect where the accuracy of ancestry inference is affectedby sample selection bias depending on the ratio of admixedindividuals to unmixed individuals in the sample Our simulationssuggest that the accuracy of ancestry inference is affected when thisratio tsample is less than 5 For real genotype data from cattle we

Figure 7 Change in correlation measure from Figure 6 after theresampling correction taking into account the impact of the contrac-tion on effective population size Each curve is drawn for a single valueof the contraction factor a and shows the mean and standard error ofthe change in correlation measure over 30 datasets The regions of thecurves in bold show regions where accuracy in the uncorrected results(from Figure 6) is less than 095

Figure 6 Impact of effective population size on the correlationmeasure of accuracy of ancestry inference using ADMIXTURE withK = 2 Each curve is drawn for a single value of the contraction factor aand shows the mean and standard error of the correlation measureover 30 datasets

908 | S Shringarpure and E P Xing

observed that a ratio of tsample larger than 1 was sufficient to affectancestry estimates

Our experiments do not directly demonstrate an effect of FST onsample selection bias However using the 1814 SNPs for the genotypedata we were able to observe effects of sample selection bias onAfrican-European admixture in cattle (FST 015) but unable toobserve any effect of sample selection bias on European-Indian ad-mixture (FST 028 results not shown here) This suggests that FSThas an effect on sample selection bias with well-differentiated pop-ulations being easy to separate even with biased sampling The sim-ulations and real data analysis had relatively large FST between the twoancestral populations ( 01 for the simulations 015 for the cattlegenotype data) but vastly different number of SNPs (52040 for thesimulation but only 1814 for the cattle genotype data) The degrada-tion in performance of ancestry inference in the two cases was con-siderably different suggesting that the effect of sample selection biasalso depends on the number of loci available This was also validatedby the observation that doubling the number of SNPs in simulationeliminated the effects of sample selection bias completely

We also showed through simulations that sample selection biasdepends on the effective population size rather than simply the censuspopulation size of the populations studied This suggests that samplingfor genotypingsequencing projects can be improved (in terms ofdiversity of sampling and cost) by taking into account independentknowledge of the demographic history of the populations beingsampled

Although our analyses use two specific methods (ADMIXTUREand EIGENSOFT) the effects we observe are a feature of theassumptions underlying both methods rather than the specificimplementations ADMIXTURE is a representative of the likelihood-based models that assume (a) admixture between ancestral popula-tions and (b) that modern individual genomes are mixture ofcontributions from different ancestral populations This is themodel underlying STRUCTURE and frappe and to a large extentthe extensions mentioned earlier Likelihood-based methods aresusceptible to effects of sample selection bias since each individual isgiven equal weight in the sample a priori This is a result of thefundamental assumption of learning methods that the sampleobserved is representative of the underlying population distributionEigenanalysis which also weighs each individual equally apriori alsosuffers from a similar problem as the likelihood-based method The

sensitivity of eigenanalysis to sample size variation and outliers is well-known and is also reported by McVean (2009) and Novembre andStephens (2008)

ADMIXTURE and EIGENSOFT are unsupervised methods forperforming ancestry inference ie they do not leverage informationabout individuals of known ancestry to estimate their respective modelparameters ADMIXTURE supports a mode (which is referred to asa supervised mode in the documentation) in which some individualscan be ldquolabeledrdquo ie assigned to have ancestry from only a singlecluster However all the individuals labeled or otherwise are usedto learn the allele frequency parameters of the ADMIXTURE modelIn machine learning literature this mode of learning is commonlycalled ldquosemi-supervisedrdquo learning wherein a combination of labeledand unlabeled data is used to learn model parameters In simulationswe observed that this mode of ancestry inference is also affected bysample selection bias

A different class of methods can estimate local (subcontinental)ancestry from dense genotype data (Sankararaman et al 2008 Priceet al 2009 Wall et al 2011 Baran et al 2012) A common theme of allof these methods is the use of external sources of genotype data asldquoreference panelsrdquo (referred to as ldquotraining datardquo in machine learningliterature) which are used to estimate ancestral allele frequencies orancestral haplotypes The estimated parameters (frequencies or hap-lotypes) are then used to infer the ancestry of the individuals in thestudy sample based on their genotypes (referred to as ldquotest datardquo inmachine learning literature) Therefore these methods can be said tobe performing supervised learning Supervised learning using externalsources of data for estimating model parameters is robust to biasedsampling in the study sample but is sensitive to the reference panelsused for training (Zadrozny et al 2003)

We proposed a resampling correction for sample selection biasusing a mathematical framework modeling the sampling process Theproposed correction requires knowledge of some auxiliary informa-tion about the selection criteria that is correlated with the genotypes(bias is present only if selection is dependent on the genotypesdirectly or indirectly) For genetic datasets geography provides onesuch criterion that is easy to acquire during data collection Using thisinformation we proposed a correction that is easy to implement Weshowed using simulation experiments that such a correction iseffective in practice and leads to more accurate results In terms ofcomputational effort ADMIXTURE has a linear dependence on thenumber of individuals The resampling correction will therefore requiretime proportional to the number of individuals in the resampled setHowever a better implementation of the resampling correction forADMIXTURE would be to be weigh the likelihood contribution fromeach individual in the original sample by its resampling correctionfactor (1p(s = 1|u)) This would allow the correction to be implementedwithout any increase in computational complexity of the ADMIXTUREalgorithm The resampling correction implemented naively can causeproblems for EIGENSOFT due to the possibility of colinearity resultingfrom duplication of individuals as evident in Figure 2B A weightedeigenanalysis would avoid this problem without affecting the efficiencyof the analysis

An alternative method for correcting sample selection bias is toadd more markers to the analysis as observed from the simulationAdding more SNPs to the analysis is an appealing solution because itdoes not require any further assumptions that any statisticalcorrection may require with the exception of the assumption of nolinkage between SNPs However this may not always be possible if thestudy dataset is constructed by intersecting multiple datasets contain-ing the individuals of interest genotyped on different platforms

Figure 8 Variation in accuracy of ancestry inference with changingnumber of unmixed individuals The red curve shows accuracy withoutcorrection and the blue curve shows the accuracy when the correctionis applied

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 909

Our simulation experiments used simple two-population admix-ture scenarios to examine the effects of sample selection bias on theaccuracy of stratification In one set of experiments we used twopopulations (CEU and YRI) that were easily separable and examinedthe admixed population resulting after a single recent admixtureevent In reality the demographic processes underlying the evolutionof populations are much more complicated In such scenarios it isreasonable to expect that the stratification problem may be harder toresolve and would suffer from the effects of sample selection bias moreseverely

A limitation of our proposed correction is the requirement forexternal knowledge of the selection criteria for samples In ourexperiments we either knew the nature of the bias (by design in thesimulation experiments) or assumed that it was known In general theaccuracy of the correction method proposed will strongly depend onthe relationship between the selection criteria and the genotypesCorrections factors therefore may be dataset-specific An alternativefuture direction for correcting sample selection bias would be todevelop models of population structure that can also model theauxiliary factors such as geography or language that may determinethe selection process responsible for the biased sampling There maybe occasions where the nature of the sampling bias is partially knownwithout having sufficient information to allow a correction to bedeveloped For instance analyses of the POPRES dataset (Nelson et al2008) often select individuals who are known to have all four grand-parents of the same ancestry (Novembre et al 2008) In this case thecriteria for selection can be stated explicitly but a mathematical for-mulation of the bias is hard to obtain In such scenarios we recom-mend subsamplingresampling datasets to examine the robustness ofancestry inference results to variation in sampling

In summary our experiments suggest that sample selection biasaffects ancestry inference and depends on factors such as effectivepopulation size genetic differentiation between populations numberof loci available and the ratio of admixed samples to unmixedsamples For analyses of human genotype data hundreds of thousandsof SNPs are typically available Therefore we do not expect sampleselection bias to be a problem for ancestry inference in human dataexcept when the source populations for admixture are not well-differentiated Local ancestry methods which can make use of externalgenotype data also provide a way of avoiding the effects of sampleselection bias We expect that sample selection bias will be of interestin ancestry analysis for model organisms where availability of externalgenotype data are limited and most genotyping chips have relativelyfewer loci When the sampling process is relatively well-understooda mathematical correction as we have proposed will help to reduce theeffects of sample selection bias In our experiments a rough guidelineof using approximately identical number of admixed and unmixedindividuals reduces the effects of biased sampling When the samplingprocess cannot be modeled accurately examining the robustness ofancestry estimates to sampling may help avoid inaccurate ancestryestimation

ACKNOWLEDGMENTSThis research was supported in part by NSF Grant DBI-0546594 andNIH Grant R01GM087694

LITERATURE CITEDAlexander D H J Novembre and K Lange 2009 Fast model-based es-

timation of ancestry in unrelated individuals Genome Res 19 1655ndash1664

Baran Y B Pasaniuc S Sankararaman D G Torgerson C Gignoux et al2012 Fast and accurate inference of local ancestry in Latino popula-tions Bioinformatics 28 1359ndash1367

Cavalli-Sforza L L 2005 The Human Genome Diversity Project pastpresent and future Nat Rev Genet 6 333ndash340

Cortes C M Mohri M Riley and A Rostamizadeh 2008 Sample selec-tion bias correction theory Algorithmic Learning Theory 5254 16

Davidson I and B Zadrozny 2005 An improved categorization of clas-sifiers sensitivity on sample selection bias pp 605ndash608 in Fifth IEEEInternational Conference on Data Mining ICDM05 Available athttpieeexploreieeeorgxplarticleDetailsjsptp=amparnumber=1565737ampqueryText3DAn+improved+categorization+of+classifiers+sensitivity+on+sample+selection+bias Accessed March 26 2014

Engelhardt B E and M Stephens 2010 Analysis of population structurea unifying framework and novel methods based on sparse factor analysisPLoS Genet 6 12

Falush D M Stephens and J K Pritchard 2003 Inference of populationstructure using multilocus genotype data linked loci and correlated allelefrequencies Genetics 164 1567ndash1587

Gibbs R A 2003 The International HapMap Project Nature 426 789ndash796Heckman J J 1979 Sample selection bias as a specification error Econo-

metrica 47 153ndash161Hoggart C J E J Parra M D Shriver C Bonilla R A Kittles et al

2003 Control of confounding of genetic associations in stratified pop-ulations Am J Hum Genet 72 1492ndash1504

Hudson R R 2002 Generating samples under a Wright-Fisher neutralmodel of genetic variation Bioinformatics 18 337ndash338

Huelsenbeck J P and P Andolfatto 2007 Inference of population struc-ture under a Dirichlet process prior Genetics 175 1787ndash1802

Kaeuffer R D Reacuteale D W Coltman and D Pontier 2007 Detectingpopulation structure using STRUCTURE software effect of backgroundlinkage disequilibrium Heredity 99 374ndash380

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013a Data from New World cattle show ancestry from multiple in-dependent domestication events Available at httpdatadryadorgresourcedoi105061dryad42tr0 Accessed March 26 2014

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013b New World cattle show ancestry from multiple independentdomestication events Proc Natl Acad Sci USA 110 E1398ndashE1406

McVean G 2009 A genealogical interpretation of principal componentsanalysis PLoS Genet 5 e1000686

Nelson M R K Bryc K S King A Indap A R Boyko et al 2008 ThePopulation Reference Sample POPRES a resource for population dis-ease and pharmacological genetics research Am J Hum Genet 83 347ndash358

Novembre J and M Stephens 2008 Interpreting principal componentanalyses of spatial population genetic variation Nat Genet 40 646ndash649

Novembre J T Johnson K Bryc Z Kutalik A R Boyko et al2008 Genes mirror geography within Europe Nature 456 98ndash101

Patterson N A L Price and D Reich 2006 Population structure andeigenanalysis PLoS Genet 2 e190

Phillips S J M Dudiacutek J Elith C H Graham A Lehmann et al2009 Sample selection bias and presence-only distribution modelsimplications for background and pseudo-absence data Ecol Appl 19181ndash197

Price A L N J Patterson R M Plenge M E Weinblatt N A Shadicket al 2006 Principal components analysis corrects for stratification ingenome-wide association studies Nat Genet 38 904ndash909

Price A L A Tandon N Patterson K C Barnes N Rafaels et al2009 Sensitive detection of chromosomal segments of distinct ancestryin admixed populations PLoS Genet 5 e1000519

Pritchard J K M Stephens and P Donnelly 2000 Inference of populationstructure from multilocus genotype data Genetics 155 945ndash959

Rosenberg N A J K Pritchard J L Weber H M Cann K K Kidd et al2002 Genetic Structure of Human Populations Science 298 2381ndash2385

Sankararaman S G Kimmel E Halperin and M I Jordan 2008 On theinference of ancestries in admixed populations Genome Res 18 668ndash675

910 | S Shringarpure and E P Xing

Shringarpure S and E P Xing 2009 mStruct Inference of PopulationStructure in Light of Both Genetic Admixing and Allele MutationsGenetics 182 575ndash593

Tang H J Peng P Wang and N J Risch 2005 Estimation of individualadmixture analytical and study design considerations Genet Epidemiol28 289ndash301

Vella F 1998 Estimating models with sample selection bias a surveyJ Hum Resour 33 127ndash169

Wall J D R Jiang C Gignoux G K Chen C Eng et al 2011 Geneticvariation in Native Americans inferred from Latino SNP and rese-quencing data Mol Biol Evol 28 2231ndash2237

Zadrozny B 2004 Learning and evaluating classifiers under sample selec-tion bias pp 114 in Twenty-First International Conference on MachineLearning ICML 04 ACM New York NY

Zadrozny B J Langford and N Abe 2003 Cost-sensitive learning by cost-proportionate example weighting pp 435ndash442 in Third IEEE Interna-tional Conference on Data Mining ICDM 2003 Available at httpieeexploreieeeorgxplloginjsptp=amparnumber=1250950ampurl=http3A2F2Fieeexploreieeeorg2Fxpls2Fabs_alljsp3Farnumber3D1250950 Accessed March 27 2014

Communicating editor D-J De Koning

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 911

Page 5: Effects of Sample Selection Bias on the Accuracy of ...epxing/papers/2014/Shringarpure_Xing_GGG...The problem of sample selection bias was first widely studied in econometrics, where

Figure 2B shows that the EIGENSOFT results exhibit considerableirregular variation within subplots This is likely to be a result of oursampling scheme for creating biased datasets which allows individualsto be present in a dataset more than once This means that theresulting genotype matrix can have lower rank than expected result-ing in lower accuracy of ancestry inference using EIGENSOFT Thishowever does not cause any difficulties in the ADMIXTURE analysisas seen by the regular patterns in Figure 2A and would be expectedbased on the likelihood model underlying ADMIXTURE We willtherefore only use the results in Figure 2B to draw broad conclusionsabout the behavior of EIGENSOFT and not make specific inferencesor recommendations

In the scenarios in which we have few unmixed individuals fromboth ancestral populations Figure 2 A and B show that the accuracyof individual ancestry recovery drops noticeably This effect is evidentin the subplots with 200 400 and 700 admixed individuals In allsubplots in Figure 2A the results show high accuracy when we havea large number of unmixed individuals from at least one ancestral

population Pritchard et al (2000) and Tang et al (2005) have pre-viously noted that a significant number of unmixed individuals fromeach ancestral population is required for accurate recovery of stratifi-cation An initial examination of the results suggests that may besufficient to have a large number (around 502100) of unmixed indi-viduals from just one of the two ancestral populations to be able tocorrectly resolve stratification

We note that this guideline which is relevant when the number ofadmixed individuals is large does not apply if the dataset contains fewadmixed individuals and few unmixed individuals When there arefew admixed individuals both methods perform well (relative to theiraverage performance) even with as few as 10 unmixed individuals inthe dataset Mantel tests reveal that the high correlations obtainedwith few admixed individuals are statistically significant (P 1023) inall cases

To examine the effect of the number of unmixed individuals onthe ancestry recovery in more detail we looked at the subset of thegenerated subsamples which had the same number of unmixed

Figure 2 Squared correlation between thetrue individual ancestry and the individualancestry inferred using (A) ADMIXTURE withK = 2 and (B) EIGENSOFT with the topeigenvalue The different levelplots are drawnfor different number of admixed individuals inthe dataset The X and Y axes of the plots arelogarithmic in scale

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 905

individuals from both ancestral populations ie subsets of the formSxyz where the number of unmixed individuals x2103050100 andthe number of admixed individuals y21030100200400700 Foreach value of x (the number of unmixed individuals from each an-cestral population present in the dataset) we observed the effect ofvarying y (the number of admixed individuals present in the dataset)on the ancestry recovery Figure 3A shows the results for ADMIX-TURE and Figure 3B shows the results for EIGENSOFT When thenumber of unmixed individuals is large the methods recover ancestrywell and the number of admixed individuals has no effect on accu-racy However when the number of unmixed individuals in the sam-ple is small adding more admixed individuals to the sample reducesthe accuracy of the ancestry recovery for both ADMIXTURE andEIGENSOFT In Figure 3 A and B we see a threshold effect due tothe number of unmixed individuals when the number of unmixedindividuals changes from 30 to 100

The high accuracy of ancestry recovery when there are fewadmixed and few unmixed individuals suggests that earlier hypothesesabout the requirement of a large number of unmixed individuals foraccurate ancestry recovery may be an incomplete explanation Theresults in Figure 2A and Figure 3A along with the likelihood modelunderlying ADMIXTURE suggest that the effect on accuracy maydepend on the ratio of the number of admixed individuals to unmixedindividuals from each population in the sample For notational con-venience we will refer to this ratio as tsample = yx

To examine this hypothesis we replot the data used for Figure 3Aby examining the correlation measure as a function of the ratio ofadmixed individuals to unmixed individuals in the sample Figure 4 Aand B show the results for this visualization for ADMIXTURE andEIGENSOFT respectively From the figure we can see that the effectof sample selection bias can be understood using tsample The accuracyof ancestry recovery is high while the value of tsample is less than 5(approximately) and drops as this ratio increases This behavior isindependent of the exact number of unmixed individuals in the data-set and can be observed for both ADMIXTURE and EIGENSOFT

In our experiments we observe that individual ancestry can berecovered perfectly as long as tsample 5 Decreasing the number ofadmixed individuals has no adverse effect on the accuracy of ancestryrecovery The effects of sample selection bias on the accuracy ofancestry recovery in a two-population admixture scenario usingADMIXTURE can thus be explained in two scenarios (i) whentsample 5 sample selection bias has no effect on the accuracy ofindividual ancestry recovery and (ii) when tsample 5 the accuracy ofindividual ancestry measured using the correlation measure decreaseswith tsample

Correction of sample selection bias in simulated dataWe examinedthree methods of correcting sample selection bias in the simulateddata The methods and their results are described herein

Supervision ADMIXTURE can be run in supervised mode whereindividuals of known ancestry can be specified in advance to belong toexactly one of K populations being inferred This supervision is partialsince only nonadmixed individuals can be part of the supervised inputprovided to the algorithm In our experiments we include supervisionby assigning the individuals of known YRI and CEU ancestry to thetwo ancestral populations We find that supervision produces an im-provement that is statistically significant but practically insignificant( 00001 improvement in squared correlation P 22 middot 10216 ina one-sided paired t-test)

Including more SNPs Our analysis of the CEU-YRI admixtureused 52040 SNPs in a 102-Mb region We found that including more

SNPs in the ADMIXTURE analysis reduced the effects of sampleselection bias for identical sample sizes and samples When 116415SNPs are included in the ADMIXTURE analysis we observe no effectsof sample selection bias on the accuracy of ancestry inferenceAlthough we do not thin SNPs in our analysis thinning the originalset of 52040 SNPs by an r2 threshold of 01 produces a set of 35500SNPs Reanalyzing the data with this subset of SNPs produces a smallloss in accuracy (05 in squared correlation P 22 middot 10216 ina one-sided paired t-test) when compared with the original set of52040 SNPs Thus while adding independent SNPs improves accu-racy of ancestry inference adding or removing linked SNPs has littleeffect on accuracy

Resampling correction We implemented the resampling correc-tion in the section Approximate correction by using Equations 11213Since we do not have an informative prior expectation of the truenumbers of admixed individuals or the unmixed individuals we usedthe uninformative initial estimate of all groups of individuals havingequal population sizes We found that in more than 99 of thecorrected datasets the squared correlation between true and inferredancestry was larger than 095 (mean = 098 standard deviation =001) Thus the resampling correction is effective for correcting sampleselection bias

Impact of effective population size on sampleselection biasIn the previous simulation we used a correction based on anassumption of similar population sizes for the different groups Ingeneral the census (observed) population size and the effective

Figure 3 Effect of adding more admixed individuals to the dataset onthe correlation measure of accuracy when using (A) ADMIXTURE withK = 2 and (B) EIGENSOFT with the top eigenvalue The error barsindicate standard error

906 | S Shringarpure and E P Xing

population size of a population need not be equal due to demographicevents that the population may have undergone We attempt todemonstrate that the effect of biased sampling is dependent on theeffective population size and not only on the census population size

We simulated two populations according to the demographicscenario shown in Figure 5 using the coalescent simulatorms (Hudson2002) The founding population and the populations resulting fromthe split 1000 generations before present were set to have identicalsizes (N0 = 10000) Population 2 then undergoes an instantaneouscontraction 795 generations after the split with its population sizebeing reduced to aN0 The contraction factor a was varied acrossdatasets taking values of 099 (nearly no contraction) 05 (moderatecontraction) or 01 (strong contraction) Two-hundred generationsafter the contraction population 2 instantaneously grows back in sizeto N0 The simulation was continued for five generations after therecovery after which 50 diploid individuals were sampled from eachpopulation Each simulation used a mutation rate and recombinationrate of 1028 per bp per generation In each simulation 200 indepen-dent sites of 100 kb were generated producing between 47585 and58024 SNPs per simulation For each value of a (from 0105099)30 independent datasets were generated to evaluate the robustness ofresults (File S3)

The two populations from a single simulation were pooledtogether to form a 5050 admixture in a single generation to create50 admixed individuals This was followed by five generations ofrandom mating in the resulting admixed population using thesimulator described in the Introduction section We then generated

a sample for study by including the 50 individuals from population 150 admixed individuals and a variable number of individuals (x) frompopulation 2 (x 2 5 10 20 30 40 50) ADMIXTURE was used withK = 2 to infer the ancestry proportions for each individual and thesquared correlation coefficient (as described in the section Evaluationmeasure) for the dataset was used a measure of accuracy

Figure 6 shows the results of the ancestry inference with one curvefor each value of the contraction factor a As expected from the pre-vious experiments as the number of individuals from population 2 inthe dataset increases the accuracy of ancestry inference increaseswhich is demonstrated by each curve We also observe that the curvesfor the three different contraction factors have different accuraciesbefore they converge to a accuracy of almost 1

By our experimental design the two ancestral populations haveidentical census population sizes for each simulation and the censussize for population 2 is identical across simulations However due tothe bottleneck the effective population size of population 2 is reducedwith the amount of reduction dependent to the strength of thecontraction (the calculations for reduction in effective population sizefor population 2 can be found in File S1) Figure 6 thus demonstratesthat effective population size and not just census population size hasan impact on how ancestry inference is affected by sampling biasFrom the figure we can see that for the strongest contraction (a =01 producing the maximum reduction in the effective population sizefor population 2 to 036N0) the accuracy of ancestry inference is higheven with only 5 individuals from population 2 Since effective pop-ulation size is a measure of the genetic diversity of a population thissuggests that the reduced genetic diversity produced by the strongcontraction can be adequately captured even with a small numberof individuals For the same number of individuals from population2 across the curves we see that the accuracy is smallest for theweakest contraction (a = 099) and largest for the strongest contrac-tion (a = 01) Thus even when census population sizes are identicalvariation in effective population size affects the impact of samplingbias accuracy of ancestry inference This suggests that sampling biascan be more accurately captured by defining it in terms of effectivepopulation size rather than census population size

Correction of bias Using the estimate for the effective population sizeof population 2 from File S1 we can implement a resampling

Figure 5 A two-population demographic scenario Populations 1 and2 (of size N0 = 104) are formed from a single population of the samesize 1000 generations before present time Population 2 then under-goes a bottleneck for 200 generations followed by a instantaneousrecovery to its original population size

Figure 4 Effect of the ratio of number admixed individuals to unmixedindividuals in the dataset (tsample) on the correlation measure of accu-racy using (A) ADMIXTURE with K = 2 and (B) EIGENSOFT with the topeigenvalue The error bars indicate standard error The X-axis is loga-rithmic in scale

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 907

correction as before We assume that the effective population size forthe admixed population is identical to that for population 1 ie N0 =104 Figure 7 shows the change in accuracy after application of theresampling correction (relative to the accuracies in Figure 6)

From the figure we can see that while the accuracy for the originaldataset is less than 095 the resampling correction produces animprovement in the results denoted by the portions of the curves inbold However beyond those points as the datasets acquire an excessof the unmixed individuals relative to the admixed individuals we seethat the uncorrected accuracy is high (larger than 095) and theresampling correction decreases the accuracy of ancestry inferenceThus a resampling correction is unnecessary when there is an excessof unmixed individuals relative to their expected number fromeffective population size calculations

Sample selection bias in cattle genotype dataTo demonstrate the effects of sample selection bias on a real geneticdataset we used the data from (McTavish et al 2013ab) who gen-otyped 1461 individuals belonging to cattle breeds across the world at47506 SNPs We focus on admixture between African and Europeancattle breeds as seen in their New World descendants For theirexperiments with STRUCTURE McTavish et al (2013b) used a subsetof 1814 SNPs that was common to all their genotyping chips Our datatherefore consists of genotype data at 1814 SNPs from 40 New Worldindividuals (Texas Longhorns) 100 European individuals (Limousinfrom Southern Europe) and 100 African individuals (NrsquoDama andNrsquoDamaXBoran)

We created multiple subsets from the original data by including allthe 40 admixed New World individuals and a variable number (x) ofunmixed individuals from the European and African populations(with the same number from each population) The number of un-mixed individuals chosen was increased gradually from 10 to 100 (x 210 20 30 40 50 60 80 100) For each value of x we generated 30datasets for statistical robustness ADMIXTURE was used with K = 2to infer ancestry proportions for all individuals The squared correla-tion coefficient for the ancestry proportions of the admixed individ-uals was used as a measure of accuracy with the ancestry proportionsinferred on the full data used as a truth set

Figure 8 shows the results of the ancestry inference as a function ofthe number of unmixed individuals (x) We see that as the number of

unmixed individuals increases the accuracy of ancestry inferenceimproves When the number of unmixed individuals is less than 40the squared correlation measure is less than 09

Correction of bias For the cattle data effective population sizes forthe three populations are unknown We assume that all threepopulations have equal effective population size The correction isimplemented using resampling as before The blue curve in Figure 8shows the effect of the correction on accuracy From the figure we cansee that when the number of admixed individuals is larger than thenumber of unmixed individuals (tsample $ 1 as defined previously)the correction improves accuracy but causes a small drop in accuracywhen there are many unmixed individuals

DISCUSSIONBiased sampling of individuals is a result of practical constraints onthe study design of genotypingsequencing projects In this aspectsample selection bias is similar to bias due to SNP ascertainment Itcan also occur as a result of combining multiple genomic datasetsfrom different sources For instance the overlap of three genotypingchips required McTavish et al (2013b) to use only 1814 SNPs for theirSTRUCTURE analysis

In most stratification analyses the recovered ancestry proportionsare used to make inferences about the ancestry of the sample and thegenetic relationships of similaritydifferences between the studypopulations Ancestry proportion estimates are also used in associa-tion studies to account for effects of stratification Therefore it isessential to have accurate recovery of ancestry Our experimentssuggest that sample selection bias can be a problem in accuratepopulation stratification and recovering individual ancestry Oursimulations show that unlike observations from previous studies theaccuracy of individual ancestry recovery is not dependent only on thenumber of unmixed individuals present in the sample We observea threshold effect where the accuracy of ancestry inference is affectedby sample selection bias depending on the ratio of admixedindividuals to unmixed individuals in the sample Our simulationssuggest that the accuracy of ancestry inference is affected when thisratio tsample is less than 5 For real genotype data from cattle we

Figure 7 Change in correlation measure from Figure 6 after theresampling correction taking into account the impact of the contrac-tion on effective population size Each curve is drawn for a single valueof the contraction factor a and shows the mean and standard error ofthe change in correlation measure over 30 datasets The regions of thecurves in bold show regions where accuracy in the uncorrected results(from Figure 6) is less than 095

Figure 6 Impact of effective population size on the correlationmeasure of accuracy of ancestry inference using ADMIXTURE withK = 2 Each curve is drawn for a single value of the contraction factor aand shows the mean and standard error of the correlation measureover 30 datasets

908 | S Shringarpure and E P Xing

observed that a ratio of tsample larger than 1 was sufficient to affectancestry estimates

Our experiments do not directly demonstrate an effect of FST onsample selection bias However using the 1814 SNPs for the genotypedata we were able to observe effects of sample selection bias onAfrican-European admixture in cattle (FST 015) but unable toobserve any effect of sample selection bias on European-Indian ad-mixture (FST 028 results not shown here) This suggests that FSThas an effect on sample selection bias with well-differentiated pop-ulations being easy to separate even with biased sampling The sim-ulations and real data analysis had relatively large FST between the twoancestral populations ( 01 for the simulations 015 for the cattlegenotype data) but vastly different number of SNPs (52040 for thesimulation but only 1814 for the cattle genotype data) The degrada-tion in performance of ancestry inference in the two cases was con-siderably different suggesting that the effect of sample selection biasalso depends on the number of loci available This was also validatedby the observation that doubling the number of SNPs in simulationeliminated the effects of sample selection bias completely

We also showed through simulations that sample selection biasdepends on the effective population size rather than simply the censuspopulation size of the populations studied This suggests that samplingfor genotypingsequencing projects can be improved (in terms ofdiversity of sampling and cost) by taking into account independentknowledge of the demographic history of the populations beingsampled

Although our analyses use two specific methods (ADMIXTUREand EIGENSOFT) the effects we observe are a feature of theassumptions underlying both methods rather than the specificimplementations ADMIXTURE is a representative of the likelihood-based models that assume (a) admixture between ancestral popula-tions and (b) that modern individual genomes are mixture ofcontributions from different ancestral populations This is themodel underlying STRUCTURE and frappe and to a large extentthe extensions mentioned earlier Likelihood-based methods aresusceptible to effects of sample selection bias since each individual isgiven equal weight in the sample a priori This is a result of thefundamental assumption of learning methods that the sampleobserved is representative of the underlying population distributionEigenanalysis which also weighs each individual equally apriori alsosuffers from a similar problem as the likelihood-based method The

sensitivity of eigenanalysis to sample size variation and outliers is well-known and is also reported by McVean (2009) and Novembre andStephens (2008)

ADMIXTURE and EIGENSOFT are unsupervised methods forperforming ancestry inference ie they do not leverage informationabout individuals of known ancestry to estimate their respective modelparameters ADMIXTURE supports a mode (which is referred to asa supervised mode in the documentation) in which some individualscan be ldquolabeledrdquo ie assigned to have ancestry from only a singlecluster However all the individuals labeled or otherwise are usedto learn the allele frequency parameters of the ADMIXTURE modelIn machine learning literature this mode of learning is commonlycalled ldquosemi-supervisedrdquo learning wherein a combination of labeledand unlabeled data is used to learn model parameters In simulationswe observed that this mode of ancestry inference is also affected bysample selection bias

A different class of methods can estimate local (subcontinental)ancestry from dense genotype data (Sankararaman et al 2008 Priceet al 2009 Wall et al 2011 Baran et al 2012) A common theme of allof these methods is the use of external sources of genotype data asldquoreference panelsrdquo (referred to as ldquotraining datardquo in machine learningliterature) which are used to estimate ancestral allele frequencies orancestral haplotypes The estimated parameters (frequencies or hap-lotypes) are then used to infer the ancestry of the individuals in thestudy sample based on their genotypes (referred to as ldquotest datardquo inmachine learning literature) Therefore these methods can be said tobe performing supervised learning Supervised learning using externalsources of data for estimating model parameters is robust to biasedsampling in the study sample but is sensitive to the reference panelsused for training (Zadrozny et al 2003)

We proposed a resampling correction for sample selection biasusing a mathematical framework modeling the sampling process Theproposed correction requires knowledge of some auxiliary informa-tion about the selection criteria that is correlated with the genotypes(bias is present only if selection is dependent on the genotypesdirectly or indirectly) For genetic datasets geography provides onesuch criterion that is easy to acquire during data collection Using thisinformation we proposed a correction that is easy to implement Weshowed using simulation experiments that such a correction iseffective in practice and leads to more accurate results In terms ofcomputational effort ADMIXTURE has a linear dependence on thenumber of individuals The resampling correction will therefore requiretime proportional to the number of individuals in the resampled setHowever a better implementation of the resampling correction forADMIXTURE would be to be weigh the likelihood contribution fromeach individual in the original sample by its resampling correctionfactor (1p(s = 1|u)) This would allow the correction to be implementedwithout any increase in computational complexity of the ADMIXTUREalgorithm The resampling correction implemented naively can causeproblems for EIGENSOFT due to the possibility of colinearity resultingfrom duplication of individuals as evident in Figure 2B A weightedeigenanalysis would avoid this problem without affecting the efficiencyof the analysis

An alternative method for correcting sample selection bias is toadd more markers to the analysis as observed from the simulationAdding more SNPs to the analysis is an appealing solution because itdoes not require any further assumptions that any statisticalcorrection may require with the exception of the assumption of nolinkage between SNPs However this may not always be possible if thestudy dataset is constructed by intersecting multiple datasets contain-ing the individuals of interest genotyped on different platforms

Figure 8 Variation in accuracy of ancestry inference with changingnumber of unmixed individuals The red curve shows accuracy withoutcorrection and the blue curve shows the accuracy when the correctionis applied

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 909

Our simulation experiments used simple two-population admix-ture scenarios to examine the effects of sample selection bias on theaccuracy of stratification In one set of experiments we used twopopulations (CEU and YRI) that were easily separable and examinedthe admixed population resulting after a single recent admixtureevent In reality the demographic processes underlying the evolutionof populations are much more complicated In such scenarios it isreasonable to expect that the stratification problem may be harder toresolve and would suffer from the effects of sample selection bias moreseverely

A limitation of our proposed correction is the requirement forexternal knowledge of the selection criteria for samples In ourexperiments we either knew the nature of the bias (by design in thesimulation experiments) or assumed that it was known In general theaccuracy of the correction method proposed will strongly depend onthe relationship between the selection criteria and the genotypesCorrections factors therefore may be dataset-specific An alternativefuture direction for correcting sample selection bias would be todevelop models of population structure that can also model theauxiliary factors such as geography or language that may determinethe selection process responsible for the biased sampling There maybe occasions where the nature of the sampling bias is partially knownwithout having sufficient information to allow a correction to bedeveloped For instance analyses of the POPRES dataset (Nelson et al2008) often select individuals who are known to have all four grand-parents of the same ancestry (Novembre et al 2008) In this case thecriteria for selection can be stated explicitly but a mathematical for-mulation of the bias is hard to obtain In such scenarios we recom-mend subsamplingresampling datasets to examine the robustness ofancestry inference results to variation in sampling

In summary our experiments suggest that sample selection biasaffects ancestry inference and depends on factors such as effectivepopulation size genetic differentiation between populations numberof loci available and the ratio of admixed samples to unmixedsamples For analyses of human genotype data hundreds of thousandsof SNPs are typically available Therefore we do not expect sampleselection bias to be a problem for ancestry inference in human dataexcept when the source populations for admixture are not well-differentiated Local ancestry methods which can make use of externalgenotype data also provide a way of avoiding the effects of sampleselection bias We expect that sample selection bias will be of interestin ancestry analysis for model organisms where availability of externalgenotype data are limited and most genotyping chips have relativelyfewer loci When the sampling process is relatively well-understooda mathematical correction as we have proposed will help to reduce theeffects of sample selection bias In our experiments a rough guidelineof using approximately identical number of admixed and unmixedindividuals reduces the effects of biased sampling When the samplingprocess cannot be modeled accurately examining the robustness ofancestry estimates to sampling may help avoid inaccurate ancestryestimation

ACKNOWLEDGMENTSThis research was supported in part by NSF Grant DBI-0546594 andNIH Grant R01GM087694

LITERATURE CITEDAlexander D H J Novembre and K Lange 2009 Fast model-based es-

timation of ancestry in unrelated individuals Genome Res 19 1655ndash1664

Baran Y B Pasaniuc S Sankararaman D G Torgerson C Gignoux et al2012 Fast and accurate inference of local ancestry in Latino popula-tions Bioinformatics 28 1359ndash1367

Cavalli-Sforza L L 2005 The Human Genome Diversity Project pastpresent and future Nat Rev Genet 6 333ndash340

Cortes C M Mohri M Riley and A Rostamizadeh 2008 Sample selec-tion bias correction theory Algorithmic Learning Theory 5254 16

Davidson I and B Zadrozny 2005 An improved categorization of clas-sifiers sensitivity on sample selection bias pp 605ndash608 in Fifth IEEEInternational Conference on Data Mining ICDM05 Available athttpieeexploreieeeorgxplarticleDetailsjsptp=amparnumber=1565737ampqueryText3DAn+improved+categorization+of+classifiers+sensitivity+on+sample+selection+bias Accessed March 26 2014

Engelhardt B E and M Stephens 2010 Analysis of population structurea unifying framework and novel methods based on sparse factor analysisPLoS Genet 6 12

Falush D M Stephens and J K Pritchard 2003 Inference of populationstructure using multilocus genotype data linked loci and correlated allelefrequencies Genetics 164 1567ndash1587

Gibbs R A 2003 The International HapMap Project Nature 426 789ndash796Heckman J J 1979 Sample selection bias as a specification error Econo-

metrica 47 153ndash161Hoggart C J E J Parra M D Shriver C Bonilla R A Kittles et al

2003 Control of confounding of genetic associations in stratified pop-ulations Am J Hum Genet 72 1492ndash1504

Hudson R R 2002 Generating samples under a Wright-Fisher neutralmodel of genetic variation Bioinformatics 18 337ndash338

Huelsenbeck J P and P Andolfatto 2007 Inference of population struc-ture under a Dirichlet process prior Genetics 175 1787ndash1802

Kaeuffer R D Reacuteale D W Coltman and D Pontier 2007 Detectingpopulation structure using STRUCTURE software effect of backgroundlinkage disequilibrium Heredity 99 374ndash380

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013a Data from New World cattle show ancestry from multiple in-dependent domestication events Available at httpdatadryadorgresourcedoi105061dryad42tr0 Accessed March 26 2014

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013b New World cattle show ancestry from multiple independentdomestication events Proc Natl Acad Sci USA 110 E1398ndashE1406

McVean G 2009 A genealogical interpretation of principal componentsanalysis PLoS Genet 5 e1000686

Nelson M R K Bryc K S King A Indap A R Boyko et al 2008 ThePopulation Reference Sample POPRES a resource for population dis-ease and pharmacological genetics research Am J Hum Genet 83 347ndash358

Novembre J and M Stephens 2008 Interpreting principal componentanalyses of spatial population genetic variation Nat Genet 40 646ndash649

Novembre J T Johnson K Bryc Z Kutalik A R Boyko et al2008 Genes mirror geography within Europe Nature 456 98ndash101

Patterson N A L Price and D Reich 2006 Population structure andeigenanalysis PLoS Genet 2 e190

Phillips S J M Dudiacutek J Elith C H Graham A Lehmann et al2009 Sample selection bias and presence-only distribution modelsimplications for background and pseudo-absence data Ecol Appl 19181ndash197

Price A L N J Patterson R M Plenge M E Weinblatt N A Shadicket al 2006 Principal components analysis corrects for stratification ingenome-wide association studies Nat Genet 38 904ndash909

Price A L A Tandon N Patterson K C Barnes N Rafaels et al2009 Sensitive detection of chromosomal segments of distinct ancestryin admixed populations PLoS Genet 5 e1000519

Pritchard J K M Stephens and P Donnelly 2000 Inference of populationstructure from multilocus genotype data Genetics 155 945ndash959

Rosenberg N A J K Pritchard J L Weber H M Cann K K Kidd et al2002 Genetic Structure of Human Populations Science 298 2381ndash2385

Sankararaman S G Kimmel E Halperin and M I Jordan 2008 On theinference of ancestries in admixed populations Genome Res 18 668ndash675

910 | S Shringarpure and E P Xing

Shringarpure S and E P Xing 2009 mStruct Inference of PopulationStructure in Light of Both Genetic Admixing and Allele MutationsGenetics 182 575ndash593

Tang H J Peng P Wang and N J Risch 2005 Estimation of individualadmixture analytical and study design considerations Genet Epidemiol28 289ndash301

Vella F 1998 Estimating models with sample selection bias a surveyJ Hum Resour 33 127ndash169

Wall J D R Jiang C Gignoux G K Chen C Eng et al 2011 Geneticvariation in Native Americans inferred from Latino SNP and rese-quencing data Mol Biol Evol 28 2231ndash2237

Zadrozny B 2004 Learning and evaluating classifiers under sample selec-tion bias pp 114 in Twenty-First International Conference on MachineLearning ICML 04 ACM New York NY

Zadrozny B J Langford and N Abe 2003 Cost-sensitive learning by cost-proportionate example weighting pp 435ndash442 in Third IEEE Interna-tional Conference on Data Mining ICDM 2003 Available at httpieeexploreieeeorgxplloginjsptp=amparnumber=1250950ampurl=http3A2F2Fieeexploreieeeorg2Fxpls2Fabs_alljsp3Farnumber3D1250950 Accessed March 27 2014

Communicating editor D-J De Koning

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 911

Page 6: Effects of Sample Selection Bias on the Accuracy of ...epxing/papers/2014/Shringarpure_Xing_GGG...The problem of sample selection bias was first widely studied in econometrics, where

individuals from both ancestral populations ie subsets of the formSxyz where the number of unmixed individuals x2103050100 andthe number of admixed individuals y21030100200400700 Foreach value of x (the number of unmixed individuals from each an-cestral population present in the dataset) we observed the effect ofvarying y (the number of admixed individuals present in the dataset)on the ancestry recovery Figure 3A shows the results for ADMIX-TURE and Figure 3B shows the results for EIGENSOFT When thenumber of unmixed individuals is large the methods recover ancestrywell and the number of admixed individuals has no effect on accu-racy However when the number of unmixed individuals in the sam-ple is small adding more admixed individuals to the sample reducesthe accuracy of the ancestry recovery for both ADMIXTURE andEIGENSOFT In Figure 3 A and B we see a threshold effect due tothe number of unmixed individuals when the number of unmixedindividuals changes from 30 to 100

The high accuracy of ancestry recovery when there are fewadmixed and few unmixed individuals suggests that earlier hypothesesabout the requirement of a large number of unmixed individuals foraccurate ancestry recovery may be an incomplete explanation Theresults in Figure 2A and Figure 3A along with the likelihood modelunderlying ADMIXTURE suggest that the effect on accuracy maydepend on the ratio of the number of admixed individuals to unmixedindividuals from each population in the sample For notational con-venience we will refer to this ratio as tsample = yx

To examine this hypothesis we replot the data used for Figure 3Aby examining the correlation measure as a function of the ratio ofadmixed individuals to unmixed individuals in the sample Figure 4 Aand B show the results for this visualization for ADMIXTURE andEIGENSOFT respectively From the figure we can see that the effectof sample selection bias can be understood using tsample The accuracyof ancestry recovery is high while the value of tsample is less than 5(approximately) and drops as this ratio increases This behavior isindependent of the exact number of unmixed individuals in the data-set and can be observed for both ADMIXTURE and EIGENSOFT

In our experiments we observe that individual ancestry can berecovered perfectly as long as tsample 5 Decreasing the number ofadmixed individuals has no adverse effect on the accuracy of ancestryrecovery The effects of sample selection bias on the accuracy ofancestry recovery in a two-population admixture scenario usingADMIXTURE can thus be explained in two scenarios (i) whentsample 5 sample selection bias has no effect on the accuracy ofindividual ancestry recovery and (ii) when tsample 5 the accuracy ofindividual ancestry measured using the correlation measure decreaseswith tsample

Correction of sample selection bias in simulated dataWe examinedthree methods of correcting sample selection bias in the simulateddata The methods and their results are described herein

Supervision ADMIXTURE can be run in supervised mode whereindividuals of known ancestry can be specified in advance to belong toexactly one of K populations being inferred This supervision is partialsince only nonadmixed individuals can be part of the supervised inputprovided to the algorithm In our experiments we include supervisionby assigning the individuals of known YRI and CEU ancestry to thetwo ancestral populations We find that supervision produces an im-provement that is statistically significant but practically insignificant( 00001 improvement in squared correlation P 22 middot 10216 ina one-sided paired t-test)

Including more SNPs Our analysis of the CEU-YRI admixtureused 52040 SNPs in a 102-Mb region We found that including more

SNPs in the ADMIXTURE analysis reduced the effects of sampleselection bias for identical sample sizes and samples When 116415SNPs are included in the ADMIXTURE analysis we observe no effectsof sample selection bias on the accuracy of ancestry inferenceAlthough we do not thin SNPs in our analysis thinning the originalset of 52040 SNPs by an r2 threshold of 01 produces a set of 35500SNPs Reanalyzing the data with this subset of SNPs produces a smallloss in accuracy (05 in squared correlation P 22 middot 10216 ina one-sided paired t-test) when compared with the original set of52040 SNPs Thus while adding independent SNPs improves accu-racy of ancestry inference adding or removing linked SNPs has littleeffect on accuracy

Resampling correction We implemented the resampling correc-tion in the section Approximate correction by using Equations 11213Since we do not have an informative prior expectation of the truenumbers of admixed individuals or the unmixed individuals we usedthe uninformative initial estimate of all groups of individuals havingequal population sizes We found that in more than 99 of thecorrected datasets the squared correlation between true and inferredancestry was larger than 095 (mean = 098 standard deviation =001) Thus the resampling correction is effective for correcting sampleselection bias

Impact of effective population size on sampleselection biasIn the previous simulation we used a correction based on anassumption of similar population sizes for the different groups Ingeneral the census (observed) population size and the effective

Figure 3 Effect of adding more admixed individuals to the dataset onthe correlation measure of accuracy when using (A) ADMIXTURE withK = 2 and (B) EIGENSOFT with the top eigenvalue The error barsindicate standard error

906 | S Shringarpure and E P Xing

population size of a population need not be equal due to demographicevents that the population may have undergone We attempt todemonstrate that the effect of biased sampling is dependent on theeffective population size and not only on the census population size

We simulated two populations according to the demographicscenario shown in Figure 5 using the coalescent simulatorms (Hudson2002) The founding population and the populations resulting fromthe split 1000 generations before present were set to have identicalsizes (N0 = 10000) Population 2 then undergoes an instantaneouscontraction 795 generations after the split with its population sizebeing reduced to aN0 The contraction factor a was varied acrossdatasets taking values of 099 (nearly no contraction) 05 (moderatecontraction) or 01 (strong contraction) Two-hundred generationsafter the contraction population 2 instantaneously grows back in sizeto N0 The simulation was continued for five generations after therecovery after which 50 diploid individuals were sampled from eachpopulation Each simulation used a mutation rate and recombinationrate of 1028 per bp per generation In each simulation 200 indepen-dent sites of 100 kb were generated producing between 47585 and58024 SNPs per simulation For each value of a (from 0105099)30 independent datasets were generated to evaluate the robustness ofresults (File S3)

The two populations from a single simulation were pooledtogether to form a 5050 admixture in a single generation to create50 admixed individuals This was followed by five generations ofrandom mating in the resulting admixed population using thesimulator described in the Introduction section We then generated

a sample for study by including the 50 individuals from population 150 admixed individuals and a variable number of individuals (x) frompopulation 2 (x 2 5 10 20 30 40 50) ADMIXTURE was used withK = 2 to infer the ancestry proportions for each individual and thesquared correlation coefficient (as described in the section Evaluationmeasure) for the dataset was used a measure of accuracy

Figure 6 shows the results of the ancestry inference with one curvefor each value of the contraction factor a As expected from the pre-vious experiments as the number of individuals from population 2 inthe dataset increases the accuracy of ancestry inference increaseswhich is demonstrated by each curve We also observe that the curvesfor the three different contraction factors have different accuraciesbefore they converge to a accuracy of almost 1

By our experimental design the two ancestral populations haveidentical census population sizes for each simulation and the censussize for population 2 is identical across simulations However due tothe bottleneck the effective population size of population 2 is reducedwith the amount of reduction dependent to the strength of thecontraction (the calculations for reduction in effective population sizefor population 2 can be found in File S1) Figure 6 thus demonstratesthat effective population size and not just census population size hasan impact on how ancestry inference is affected by sampling biasFrom the figure we can see that for the strongest contraction (a =01 producing the maximum reduction in the effective population sizefor population 2 to 036N0) the accuracy of ancestry inference is higheven with only 5 individuals from population 2 Since effective pop-ulation size is a measure of the genetic diversity of a population thissuggests that the reduced genetic diversity produced by the strongcontraction can be adequately captured even with a small numberof individuals For the same number of individuals from population2 across the curves we see that the accuracy is smallest for theweakest contraction (a = 099) and largest for the strongest contrac-tion (a = 01) Thus even when census population sizes are identicalvariation in effective population size affects the impact of samplingbias accuracy of ancestry inference This suggests that sampling biascan be more accurately captured by defining it in terms of effectivepopulation size rather than census population size

Correction of bias Using the estimate for the effective population sizeof population 2 from File S1 we can implement a resampling

Figure 5 A two-population demographic scenario Populations 1 and2 (of size N0 = 104) are formed from a single population of the samesize 1000 generations before present time Population 2 then under-goes a bottleneck for 200 generations followed by a instantaneousrecovery to its original population size

Figure 4 Effect of the ratio of number admixed individuals to unmixedindividuals in the dataset (tsample) on the correlation measure of accu-racy using (A) ADMIXTURE with K = 2 and (B) EIGENSOFT with the topeigenvalue The error bars indicate standard error The X-axis is loga-rithmic in scale

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 907

correction as before We assume that the effective population size forthe admixed population is identical to that for population 1 ie N0 =104 Figure 7 shows the change in accuracy after application of theresampling correction (relative to the accuracies in Figure 6)

From the figure we can see that while the accuracy for the originaldataset is less than 095 the resampling correction produces animprovement in the results denoted by the portions of the curves inbold However beyond those points as the datasets acquire an excessof the unmixed individuals relative to the admixed individuals we seethat the uncorrected accuracy is high (larger than 095) and theresampling correction decreases the accuracy of ancestry inferenceThus a resampling correction is unnecessary when there is an excessof unmixed individuals relative to their expected number fromeffective population size calculations

Sample selection bias in cattle genotype dataTo demonstrate the effects of sample selection bias on a real geneticdataset we used the data from (McTavish et al 2013ab) who gen-otyped 1461 individuals belonging to cattle breeds across the world at47506 SNPs We focus on admixture between African and Europeancattle breeds as seen in their New World descendants For theirexperiments with STRUCTURE McTavish et al (2013b) used a subsetof 1814 SNPs that was common to all their genotyping chips Our datatherefore consists of genotype data at 1814 SNPs from 40 New Worldindividuals (Texas Longhorns) 100 European individuals (Limousinfrom Southern Europe) and 100 African individuals (NrsquoDama andNrsquoDamaXBoran)

We created multiple subsets from the original data by including allthe 40 admixed New World individuals and a variable number (x) ofunmixed individuals from the European and African populations(with the same number from each population) The number of un-mixed individuals chosen was increased gradually from 10 to 100 (x 210 20 30 40 50 60 80 100) For each value of x we generated 30datasets for statistical robustness ADMIXTURE was used with K = 2to infer ancestry proportions for all individuals The squared correla-tion coefficient for the ancestry proportions of the admixed individ-uals was used as a measure of accuracy with the ancestry proportionsinferred on the full data used as a truth set

Figure 8 shows the results of the ancestry inference as a function ofthe number of unmixed individuals (x) We see that as the number of

unmixed individuals increases the accuracy of ancestry inferenceimproves When the number of unmixed individuals is less than 40the squared correlation measure is less than 09

Correction of bias For the cattle data effective population sizes forthe three populations are unknown We assume that all threepopulations have equal effective population size The correction isimplemented using resampling as before The blue curve in Figure 8shows the effect of the correction on accuracy From the figure we cansee that when the number of admixed individuals is larger than thenumber of unmixed individuals (tsample $ 1 as defined previously)the correction improves accuracy but causes a small drop in accuracywhen there are many unmixed individuals

DISCUSSIONBiased sampling of individuals is a result of practical constraints onthe study design of genotypingsequencing projects In this aspectsample selection bias is similar to bias due to SNP ascertainment Itcan also occur as a result of combining multiple genomic datasetsfrom different sources For instance the overlap of three genotypingchips required McTavish et al (2013b) to use only 1814 SNPs for theirSTRUCTURE analysis

In most stratification analyses the recovered ancestry proportionsare used to make inferences about the ancestry of the sample and thegenetic relationships of similaritydifferences between the studypopulations Ancestry proportion estimates are also used in associa-tion studies to account for effects of stratification Therefore it isessential to have accurate recovery of ancestry Our experimentssuggest that sample selection bias can be a problem in accuratepopulation stratification and recovering individual ancestry Oursimulations show that unlike observations from previous studies theaccuracy of individual ancestry recovery is not dependent only on thenumber of unmixed individuals present in the sample We observea threshold effect where the accuracy of ancestry inference is affectedby sample selection bias depending on the ratio of admixedindividuals to unmixed individuals in the sample Our simulationssuggest that the accuracy of ancestry inference is affected when thisratio tsample is less than 5 For real genotype data from cattle we

Figure 7 Change in correlation measure from Figure 6 after theresampling correction taking into account the impact of the contrac-tion on effective population size Each curve is drawn for a single valueof the contraction factor a and shows the mean and standard error ofthe change in correlation measure over 30 datasets The regions of thecurves in bold show regions where accuracy in the uncorrected results(from Figure 6) is less than 095

Figure 6 Impact of effective population size on the correlationmeasure of accuracy of ancestry inference using ADMIXTURE withK = 2 Each curve is drawn for a single value of the contraction factor aand shows the mean and standard error of the correlation measureover 30 datasets

908 | S Shringarpure and E P Xing

observed that a ratio of tsample larger than 1 was sufficient to affectancestry estimates

Our experiments do not directly demonstrate an effect of FST onsample selection bias However using the 1814 SNPs for the genotypedata we were able to observe effects of sample selection bias onAfrican-European admixture in cattle (FST 015) but unable toobserve any effect of sample selection bias on European-Indian ad-mixture (FST 028 results not shown here) This suggests that FSThas an effect on sample selection bias with well-differentiated pop-ulations being easy to separate even with biased sampling The sim-ulations and real data analysis had relatively large FST between the twoancestral populations ( 01 for the simulations 015 for the cattlegenotype data) but vastly different number of SNPs (52040 for thesimulation but only 1814 for the cattle genotype data) The degrada-tion in performance of ancestry inference in the two cases was con-siderably different suggesting that the effect of sample selection biasalso depends on the number of loci available This was also validatedby the observation that doubling the number of SNPs in simulationeliminated the effects of sample selection bias completely

We also showed through simulations that sample selection biasdepends on the effective population size rather than simply the censuspopulation size of the populations studied This suggests that samplingfor genotypingsequencing projects can be improved (in terms ofdiversity of sampling and cost) by taking into account independentknowledge of the demographic history of the populations beingsampled

Although our analyses use two specific methods (ADMIXTUREand EIGENSOFT) the effects we observe are a feature of theassumptions underlying both methods rather than the specificimplementations ADMIXTURE is a representative of the likelihood-based models that assume (a) admixture between ancestral popula-tions and (b) that modern individual genomes are mixture ofcontributions from different ancestral populations This is themodel underlying STRUCTURE and frappe and to a large extentthe extensions mentioned earlier Likelihood-based methods aresusceptible to effects of sample selection bias since each individual isgiven equal weight in the sample a priori This is a result of thefundamental assumption of learning methods that the sampleobserved is representative of the underlying population distributionEigenanalysis which also weighs each individual equally apriori alsosuffers from a similar problem as the likelihood-based method The

sensitivity of eigenanalysis to sample size variation and outliers is well-known and is also reported by McVean (2009) and Novembre andStephens (2008)

ADMIXTURE and EIGENSOFT are unsupervised methods forperforming ancestry inference ie they do not leverage informationabout individuals of known ancestry to estimate their respective modelparameters ADMIXTURE supports a mode (which is referred to asa supervised mode in the documentation) in which some individualscan be ldquolabeledrdquo ie assigned to have ancestry from only a singlecluster However all the individuals labeled or otherwise are usedto learn the allele frequency parameters of the ADMIXTURE modelIn machine learning literature this mode of learning is commonlycalled ldquosemi-supervisedrdquo learning wherein a combination of labeledand unlabeled data is used to learn model parameters In simulationswe observed that this mode of ancestry inference is also affected bysample selection bias

A different class of methods can estimate local (subcontinental)ancestry from dense genotype data (Sankararaman et al 2008 Priceet al 2009 Wall et al 2011 Baran et al 2012) A common theme of allof these methods is the use of external sources of genotype data asldquoreference panelsrdquo (referred to as ldquotraining datardquo in machine learningliterature) which are used to estimate ancestral allele frequencies orancestral haplotypes The estimated parameters (frequencies or hap-lotypes) are then used to infer the ancestry of the individuals in thestudy sample based on their genotypes (referred to as ldquotest datardquo inmachine learning literature) Therefore these methods can be said tobe performing supervised learning Supervised learning using externalsources of data for estimating model parameters is robust to biasedsampling in the study sample but is sensitive to the reference panelsused for training (Zadrozny et al 2003)

We proposed a resampling correction for sample selection biasusing a mathematical framework modeling the sampling process Theproposed correction requires knowledge of some auxiliary informa-tion about the selection criteria that is correlated with the genotypes(bias is present only if selection is dependent on the genotypesdirectly or indirectly) For genetic datasets geography provides onesuch criterion that is easy to acquire during data collection Using thisinformation we proposed a correction that is easy to implement Weshowed using simulation experiments that such a correction iseffective in practice and leads to more accurate results In terms ofcomputational effort ADMIXTURE has a linear dependence on thenumber of individuals The resampling correction will therefore requiretime proportional to the number of individuals in the resampled setHowever a better implementation of the resampling correction forADMIXTURE would be to be weigh the likelihood contribution fromeach individual in the original sample by its resampling correctionfactor (1p(s = 1|u)) This would allow the correction to be implementedwithout any increase in computational complexity of the ADMIXTUREalgorithm The resampling correction implemented naively can causeproblems for EIGENSOFT due to the possibility of colinearity resultingfrom duplication of individuals as evident in Figure 2B A weightedeigenanalysis would avoid this problem without affecting the efficiencyof the analysis

An alternative method for correcting sample selection bias is toadd more markers to the analysis as observed from the simulationAdding more SNPs to the analysis is an appealing solution because itdoes not require any further assumptions that any statisticalcorrection may require with the exception of the assumption of nolinkage between SNPs However this may not always be possible if thestudy dataset is constructed by intersecting multiple datasets contain-ing the individuals of interest genotyped on different platforms

Figure 8 Variation in accuracy of ancestry inference with changingnumber of unmixed individuals The red curve shows accuracy withoutcorrection and the blue curve shows the accuracy when the correctionis applied

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 909

Our simulation experiments used simple two-population admix-ture scenarios to examine the effects of sample selection bias on theaccuracy of stratification In one set of experiments we used twopopulations (CEU and YRI) that were easily separable and examinedthe admixed population resulting after a single recent admixtureevent In reality the demographic processes underlying the evolutionof populations are much more complicated In such scenarios it isreasonable to expect that the stratification problem may be harder toresolve and would suffer from the effects of sample selection bias moreseverely

A limitation of our proposed correction is the requirement forexternal knowledge of the selection criteria for samples In ourexperiments we either knew the nature of the bias (by design in thesimulation experiments) or assumed that it was known In general theaccuracy of the correction method proposed will strongly depend onthe relationship between the selection criteria and the genotypesCorrections factors therefore may be dataset-specific An alternativefuture direction for correcting sample selection bias would be todevelop models of population structure that can also model theauxiliary factors such as geography or language that may determinethe selection process responsible for the biased sampling There maybe occasions where the nature of the sampling bias is partially knownwithout having sufficient information to allow a correction to bedeveloped For instance analyses of the POPRES dataset (Nelson et al2008) often select individuals who are known to have all four grand-parents of the same ancestry (Novembre et al 2008) In this case thecriteria for selection can be stated explicitly but a mathematical for-mulation of the bias is hard to obtain In such scenarios we recom-mend subsamplingresampling datasets to examine the robustness ofancestry inference results to variation in sampling

In summary our experiments suggest that sample selection biasaffects ancestry inference and depends on factors such as effectivepopulation size genetic differentiation between populations numberof loci available and the ratio of admixed samples to unmixedsamples For analyses of human genotype data hundreds of thousandsof SNPs are typically available Therefore we do not expect sampleselection bias to be a problem for ancestry inference in human dataexcept when the source populations for admixture are not well-differentiated Local ancestry methods which can make use of externalgenotype data also provide a way of avoiding the effects of sampleselection bias We expect that sample selection bias will be of interestin ancestry analysis for model organisms where availability of externalgenotype data are limited and most genotyping chips have relativelyfewer loci When the sampling process is relatively well-understooda mathematical correction as we have proposed will help to reduce theeffects of sample selection bias In our experiments a rough guidelineof using approximately identical number of admixed and unmixedindividuals reduces the effects of biased sampling When the samplingprocess cannot be modeled accurately examining the robustness ofancestry estimates to sampling may help avoid inaccurate ancestryestimation

ACKNOWLEDGMENTSThis research was supported in part by NSF Grant DBI-0546594 andNIH Grant R01GM087694

LITERATURE CITEDAlexander D H J Novembre and K Lange 2009 Fast model-based es-

timation of ancestry in unrelated individuals Genome Res 19 1655ndash1664

Baran Y B Pasaniuc S Sankararaman D G Torgerson C Gignoux et al2012 Fast and accurate inference of local ancestry in Latino popula-tions Bioinformatics 28 1359ndash1367

Cavalli-Sforza L L 2005 The Human Genome Diversity Project pastpresent and future Nat Rev Genet 6 333ndash340

Cortes C M Mohri M Riley and A Rostamizadeh 2008 Sample selec-tion bias correction theory Algorithmic Learning Theory 5254 16

Davidson I and B Zadrozny 2005 An improved categorization of clas-sifiers sensitivity on sample selection bias pp 605ndash608 in Fifth IEEEInternational Conference on Data Mining ICDM05 Available athttpieeexploreieeeorgxplarticleDetailsjsptp=amparnumber=1565737ampqueryText3DAn+improved+categorization+of+classifiers+sensitivity+on+sample+selection+bias Accessed March 26 2014

Engelhardt B E and M Stephens 2010 Analysis of population structurea unifying framework and novel methods based on sparse factor analysisPLoS Genet 6 12

Falush D M Stephens and J K Pritchard 2003 Inference of populationstructure using multilocus genotype data linked loci and correlated allelefrequencies Genetics 164 1567ndash1587

Gibbs R A 2003 The International HapMap Project Nature 426 789ndash796Heckman J J 1979 Sample selection bias as a specification error Econo-

metrica 47 153ndash161Hoggart C J E J Parra M D Shriver C Bonilla R A Kittles et al

2003 Control of confounding of genetic associations in stratified pop-ulations Am J Hum Genet 72 1492ndash1504

Hudson R R 2002 Generating samples under a Wright-Fisher neutralmodel of genetic variation Bioinformatics 18 337ndash338

Huelsenbeck J P and P Andolfatto 2007 Inference of population struc-ture under a Dirichlet process prior Genetics 175 1787ndash1802

Kaeuffer R D Reacuteale D W Coltman and D Pontier 2007 Detectingpopulation structure using STRUCTURE software effect of backgroundlinkage disequilibrium Heredity 99 374ndash380

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013a Data from New World cattle show ancestry from multiple in-dependent domestication events Available at httpdatadryadorgresourcedoi105061dryad42tr0 Accessed March 26 2014

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013b New World cattle show ancestry from multiple independentdomestication events Proc Natl Acad Sci USA 110 E1398ndashE1406

McVean G 2009 A genealogical interpretation of principal componentsanalysis PLoS Genet 5 e1000686

Nelson M R K Bryc K S King A Indap A R Boyko et al 2008 ThePopulation Reference Sample POPRES a resource for population dis-ease and pharmacological genetics research Am J Hum Genet 83 347ndash358

Novembre J and M Stephens 2008 Interpreting principal componentanalyses of spatial population genetic variation Nat Genet 40 646ndash649

Novembre J T Johnson K Bryc Z Kutalik A R Boyko et al2008 Genes mirror geography within Europe Nature 456 98ndash101

Patterson N A L Price and D Reich 2006 Population structure andeigenanalysis PLoS Genet 2 e190

Phillips S J M Dudiacutek J Elith C H Graham A Lehmann et al2009 Sample selection bias and presence-only distribution modelsimplications for background and pseudo-absence data Ecol Appl 19181ndash197

Price A L N J Patterson R M Plenge M E Weinblatt N A Shadicket al 2006 Principal components analysis corrects for stratification ingenome-wide association studies Nat Genet 38 904ndash909

Price A L A Tandon N Patterson K C Barnes N Rafaels et al2009 Sensitive detection of chromosomal segments of distinct ancestryin admixed populations PLoS Genet 5 e1000519

Pritchard J K M Stephens and P Donnelly 2000 Inference of populationstructure from multilocus genotype data Genetics 155 945ndash959

Rosenberg N A J K Pritchard J L Weber H M Cann K K Kidd et al2002 Genetic Structure of Human Populations Science 298 2381ndash2385

Sankararaman S G Kimmel E Halperin and M I Jordan 2008 On theinference of ancestries in admixed populations Genome Res 18 668ndash675

910 | S Shringarpure and E P Xing

Shringarpure S and E P Xing 2009 mStruct Inference of PopulationStructure in Light of Both Genetic Admixing and Allele MutationsGenetics 182 575ndash593

Tang H J Peng P Wang and N J Risch 2005 Estimation of individualadmixture analytical and study design considerations Genet Epidemiol28 289ndash301

Vella F 1998 Estimating models with sample selection bias a surveyJ Hum Resour 33 127ndash169

Wall J D R Jiang C Gignoux G K Chen C Eng et al 2011 Geneticvariation in Native Americans inferred from Latino SNP and rese-quencing data Mol Biol Evol 28 2231ndash2237

Zadrozny B 2004 Learning and evaluating classifiers under sample selec-tion bias pp 114 in Twenty-First International Conference on MachineLearning ICML 04 ACM New York NY

Zadrozny B J Langford and N Abe 2003 Cost-sensitive learning by cost-proportionate example weighting pp 435ndash442 in Third IEEE Interna-tional Conference on Data Mining ICDM 2003 Available at httpieeexploreieeeorgxplloginjsptp=amparnumber=1250950ampurl=http3A2F2Fieeexploreieeeorg2Fxpls2Fabs_alljsp3Farnumber3D1250950 Accessed March 27 2014

Communicating editor D-J De Koning

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 911

Page 7: Effects of Sample Selection Bias on the Accuracy of ...epxing/papers/2014/Shringarpure_Xing_GGG...The problem of sample selection bias was first widely studied in econometrics, where

population size of a population need not be equal due to demographicevents that the population may have undergone We attempt todemonstrate that the effect of biased sampling is dependent on theeffective population size and not only on the census population size

We simulated two populations according to the demographicscenario shown in Figure 5 using the coalescent simulatorms (Hudson2002) The founding population and the populations resulting fromthe split 1000 generations before present were set to have identicalsizes (N0 = 10000) Population 2 then undergoes an instantaneouscontraction 795 generations after the split with its population sizebeing reduced to aN0 The contraction factor a was varied acrossdatasets taking values of 099 (nearly no contraction) 05 (moderatecontraction) or 01 (strong contraction) Two-hundred generationsafter the contraction population 2 instantaneously grows back in sizeto N0 The simulation was continued for five generations after therecovery after which 50 diploid individuals were sampled from eachpopulation Each simulation used a mutation rate and recombinationrate of 1028 per bp per generation In each simulation 200 indepen-dent sites of 100 kb were generated producing between 47585 and58024 SNPs per simulation For each value of a (from 0105099)30 independent datasets were generated to evaluate the robustness ofresults (File S3)

The two populations from a single simulation were pooledtogether to form a 5050 admixture in a single generation to create50 admixed individuals This was followed by five generations ofrandom mating in the resulting admixed population using thesimulator described in the Introduction section We then generated

a sample for study by including the 50 individuals from population 150 admixed individuals and a variable number of individuals (x) frompopulation 2 (x 2 5 10 20 30 40 50) ADMIXTURE was used withK = 2 to infer the ancestry proportions for each individual and thesquared correlation coefficient (as described in the section Evaluationmeasure) for the dataset was used a measure of accuracy

Figure 6 shows the results of the ancestry inference with one curvefor each value of the contraction factor a As expected from the pre-vious experiments as the number of individuals from population 2 inthe dataset increases the accuracy of ancestry inference increaseswhich is demonstrated by each curve We also observe that the curvesfor the three different contraction factors have different accuraciesbefore they converge to a accuracy of almost 1

By our experimental design the two ancestral populations haveidentical census population sizes for each simulation and the censussize for population 2 is identical across simulations However due tothe bottleneck the effective population size of population 2 is reducedwith the amount of reduction dependent to the strength of thecontraction (the calculations for reduction in effective population sizefor population 2 can be found in File S1) Figure 6 thus demonstratesthat effective population size and not just census population size hasan impact on how ancestry inference is affected by sampling biasFrom the figure we can see that for the strongest contraction (a =01 producing the maximum reduction in the effective population sizefor population 2 to 036N0) the accuracy of ancestry inference is higheven with only 5 individuals from population 2 Since effective pop-ulation size is a measure of the genetic diversity of a population thissuggests that the reduced genetic diversity produced by the strongcontraction can be adequately captured even with a small numberof individuals For the same number of individuals from population2 across the curves we see that the accuracy is smallest for theweakest contraction (a = 099) and largest for the strongest contrac-tion (a = 01) Thus even when census population sizes are identicalvariation in effective population size affects the impact of samplingbias accuracy of ancestry inference This suggests that sampling biascan be more accurately captured by defining it in terms of effectivepopulation size rather than census population size

Correction of bias Using the estimate for the effective population sizeof population 2 from File S1 we can implement a resampling

Figure 5 A two-population demographic scenario Populations 1 and2 (of size N0 = 104) are formed from a single population of the samesize 1000 generations before present time Population 2 then under-goes a bottleneck for 200 generations followed by a instantaneousrecovery to its original population size

Figure 4 Effect of the ratio of number admixed individuals to unmixedindividuals in the dataset (tsample) on the correlation measure of accu-racy using (A) ADMIXTURE with K = 2 and (B) EIGENSOFT with the topeigenvalue The error bars indicate standard error The X-axis is loga-rithmic in scale

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 907

correction as before We assume that the effective population size forthe admixed population is identical to that for population 1 ie N0 =104 Figure 7 shows the change in accuracy after application of theresampling correction (relative to the accuracies in Figure 6)

From the figure we can see that while the accuracy for the originaldataset is less than 095 the resampling correction produces animprovement in the results denoted by the portions of the curves inbold However beyond those points as the datasets acquire an excessof the unmixed individuals relative to the admixed individuals we seethat the uncorrected accuracy is high (larger than 095) and theresampling correction decreases the accuracy of ancestry inferenceThus a resampling correction is unnecessary when there is an excessof unmixed individuals relative to their expected number fromeffective population size calculations

Sample selection bias in cattle genotype dataTo demonstrate the effects of sample selection bias on a real geneticdataset we used the data from (McTavish et al 2013ab) who gen-otyped 1461 individuals belonging to cattle breeds across the world at47506 SNPs We focus on admixture between African and Europeancattle breeds as seen in their New World descendants For theirexperiments with STRUCTURE McTavish et al (2013b) used a subsetof 1814 SNPs that was common to all their genotyping chips Our datatherefore consists of genotype data at 1814 SNPs from 40 New Worldindividuals (Texas Longhorns) 100 European individuals (Limousinfrom Southern Europe) and 100 African individuals (NrsquoDama andNrsquoDamaXBoran)

We created multiple subsets from the original data by including allthe 40 admixed New World individuals and a variable number (x) ofunmixed individuals from the European and African populations(with the same number from each population) The number of un-mixed individuals chosen was increased gradually from 10 to 100 (x 210 20 30 40 50 60 80 100) For each value of x we generated 30datasets for statistical robustness ADMIXTURE was used with K = 2to infer ancestry proportions for all individuals The squared correla-tion coefficient for the ancestry proportions of the admixed individ-uals was used as a measure of accuracy with the ancestry proportionsinferred on the full data used as a truth set

Figure 8 shows the results of the ancestry inference as a function ofthe number of unmixed individuals (x) We see that as the number of

unmixed individuals increases the accuracy of ancestry inferenceimproves When the number of unmixed individuals is less than 40the squared correlation measure is less than 09

Correction of bias For the cattle data effective population sizes forthe three populations are unknown We assume that all threepopulations have equal effective population size The correction isimplemented using resampling as before The blue curve in Figure 8shows the effect of the correction on accuracy From the figure we cansee that when the number of admixed individuals is larger than thenumber of unmixed individuals (tsample $ 1 as defined previously)the correction improves accuracy but causes a small drop in accuracywhen there are many unmixed individuals

DISCUSSIONBiased sampling of individuals is a result of practical constraints onthe study design of genotypingsequencing projects In this aspectsample selection bias is similar to bias due to SNP ascertainment Itcan also occur as a result of combining multiple genomic datasetsfrom different sources For instance the overlap of three genotypingchips required McTavish et al (2013b) to use only 1814 SNPs for theirSTRUCTURE analysis

In most stratification analyses the recovered ancestry proportionsare used to make inferences about the ancestry of the sample and thegenetic relationships of similaritydifferences between the studypopulations Ancestry proportion estimates are also used in associa-tion studies to account for effects of stratification Therefore it isessential to have accurate recovery of ancestry Our experimentssuggest that sample selection bias can be a problem in accuratepopulation stratification and recovering individual ancestry Oursimulations show that unlike observations from previous studies theaccuracy of individual ancestry recovery is not dependent only on thenumber of unmixed individuals present in the sample We observea threshold effect where the accuracy of ancestry inference is affectedby sample selection bias depending on the ratio of admixedindividuals to unmixed individuals in the sample Our simulationssuggest that the accuracy of ancestry inference is affected when thisratio tsample is less than 5 For real genotype data from cattle we

Figure 7 Change in correlation measure from Figure 6 after theresampling correction taking into account the impact of the contrac-tion on effective population size Each curve is drawn for a single valueof the contraction factor a and shows the mean and standard error ofthe change in correlation measure over 30 datasets The regions of thecurves in bold show regions where accuracy in the uncorrected results(from Figure 6) is less than 095

Figure 6 Impact of effective population size on the correlationmeasure of accuracy of ancestry inference using ADMIXTURE withK = 2 Each curve is drawn for a single value of the contraction factor aand shows the mean and standard error of the correlation measureover 30 datasets

908 | S Shringarpure and E P Xing

observed that a ratio of tsample larger than 1 was sufficient to affectancestry estimates

Our experiments do not directly demonstrate an effect of FST onsample selection bias However using the 1814 SNPs for the genotypedata we were able to observe effects of sample selection bias onAfrican-European admixture in cattle (FST 015) but unable toobserve any effect of sample selection bias on European-Indian ad-mixture (FST 028 results not shown here) This suggests that FSThas an effect on sample selection bias with well-differentiated pop-ulations being easy to separate even with biased sampling The sim-ulations and real data analysis had relatively large FST between the twoancestral populations ( 01 for the simulations 015 for the cattlegenotype data) but vastly different number of SNPs (52040 for thesimulation but only 1814 for the cattle genotype data) The degrada-tion in performance of ancestry inference in the two cases was con-siderably different suggesting that the effect of sample selection biasalso depends on the number of loci available This was also validatedby the observation that doubling the number of SNPs in simulationeliminated the effects of sample selection bias completely

We also showed through simulations that sample selection biasdepends on the effective population size rather than simply the censuspopulation size of the populations studied This suggests that samplingfor genotypingsequencing projects can be improved (in terms ofdiversity of sampling and cost) by taking into account independentknowledge of the demographic history of the populations beingsampled

Although our analyses use two specific methods (ADMIXTUREand EIGENSOFT) the effects we observe are a feature of theassumptions underlying both methods rather than the specificimplementations ADMIXTURE is a representative of the likelihood-based models that assume (a) admixture between ancestral popula-tions and (b) that modern individual genomes are mixture ofcontributions from different ancestral populations This is themodel underlying STRUCTURE and frappe and to a large extentthe extensions mentioned earlier Likelihood-based methods aresusceptible to effects of sample selection bias since each individual isgiven equal weight in the sample a priori This is a result of thefundamental assumption of learning methods that the sampleobserved is representative of the underlying population distributionEigenanalysis which also weighs each individual equally apriori alsosuffers from a similar problem as the likelihood-based method The

sensitivity of eigenanalysis to sample size variation and outliers is well-known and is also reported by McVean (2009) and Novembre andStephens (2008)

ADMIXTURE and EIGENSOFT are unsupervised methods forperforming ancestry inference ie they do not leverage informationabout individuals of known ancestry to estimate their respective modelparameters ADMIXTURE supports a mode (which is referred to asa supervised mode in the documentation) in which some individualscan be ldquolabeledrdquo ie assigned to have ancestry from only a singlecluster However all the individuals labeled or otherwise are usedto learn the allele frequency parameters of the ADMIXTURE modelIn machine learning literature this mode of learning is commonlycalled ldquosemi-supervisedrdquo learning wherein a combination of labeledand unlabeled data is used to learn model parameters In simulationswe observed that this mode of ancestry inference is also affected bysample selection bias

A different class of methods can estimate local (subcontinental)ancestry from dense genotype data (Sankararaman et al 2008 Priceet al 2009 Wall et al 2011 Baran et al 2012) A common theme of allof these methods is the use of external sources of genotype data asldquoreference panelsrdquo (referred to as ldquotraining datardquo in machine learningliterature) which are used to estimate ancestral allele frequencies orancestral haplotypes The estimated parameters (frequencies or hap-lotypes) are then used to infer the ancestry of the individuals in thestudy sample based on their genotypes (referred to as ldquotest datardquo inmachine learning literature) Therefore these methods can be said tobe performing supervised learning Supervised learning using externalsources of data for estimating model parameters is robust to biasedsampling in the study sample but is sensitive to the reference panelsused for training (Zadrozny et al 2003)

We proposed a resampling correction for sample selection biasusing a mathematical framework modeling the sampling process Theproposed correction requires knowledge of some auxiliary informa-tion about the selection criteria that is correlated with the genotypes(bias is present only if selection is dependent on the genotypesdirectly or indirectly) For genetic datasets geography provides onesuch criterion that is easy to acquire during data collection Using thisinformation we proposed a correction that is easy to implement Weshowed using simulation experiments that such a correction iseffective in practice and leads to more accurate results In terms ofcomputational effort ADMIXTURE has a linear dependence on thenumber of individuals The resampling correction will therefore requiretime proportional to the number of individuals in the resampled setHowever a better implementation of the resampling correction forADMIXTURE would be to be weigh the likelihood contribution fromeach individual in the original sample by its resampling correctionfactor (1p(s = 1|u)) This would allow the correction to be implementedwithout any increase in computational complexity of the ADMIXTUREalgorithm The resampling correction implemented naively can causeproblems for EIGENSOFT due to the possibility of colinearity resultingfrom duplication of individuals as evident in Figure 2B A weightedeigenanalysis would avoid this problem without affecting the efficiencyof the analysis

An alternative method for correcting sample selection bias is toadd more markers to the analysis as observed from the simulationAdding more SNPs to the analysis is an appealing solution because itdoes not require any further assumptions that any statisticalcorrection may require with the exception of the assumption of nolinkage between SNPs However this may not always be possible if thestudy dataset is constructed by intersecting multiple datasets contain-ing the individuals of interest genotyped on different platforms

Figure 8 Variation in accuracy of ancestry inference with changingnumber of unmixed individuals The red curve shows accuracy withoutcorrection and the blue curve shows the accuracy when the correctionis applied

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 909

Our simulation experiments used simple two-population admix-ture scenarios to examine the effects of sample selection bias on theaccuracy of stratification In one set of experiments we used twopopulations (CEU and YRI) that were easily separable and examinedthe admixed population resulting after a single recent admixtureevent In reality the demographic processes underlying the evolutionof populations are much more complicated In such scenarios it isreasonable to expect that the stratification problem may be harder toresolve and would suffer from the effects of sample selection bias moreseverely

A limitation of our proposed correction is the requirement forexternal knowledge of the selection criteria for samples In ourexperiments we either knew the nature of the bias (by design in thesimulation experiments) or assumed that it was known In general theaccuracy of the correction method proposed will strongly depend onthe relationship between the selection criteria and the genotypesCorrections factors therefore may be dataset-specific An alternativefuture direction for correcting sample selection bias would be todevelop models of population structure that can also model theauxiliary factors such as geography or language that may determinethe selection process responsible for the biased sampling There maybe occasions where the nature of the sampling bias is partially knownwithout having sufficient information to allow a correction to bedeveloped For instance analyses of the POPRES dataset (Nelson et al2008) often select individuals who are known to have all four grand-parents of the same ancestry (Novembre et al 2008) In this case thecriteria for selection can be stated explicitly but a mathematical for-mulation of the bias is hard to obtain In such scenarios we recom-mend subsamplingresampling datasets to examine the robustness ofancestry inference results to variation in sampling

In summary our experiments suggest that sample selection biasaffects ancestry inference and depends on factors such as effectivepopulation size genetic differentiation between populations numberof loci available and the ratio of admixed samples to unmixedsamples For analyses of human genotype data hundreds of thousandsof SNPs are typically available Therefore we do not expect sampleselection bias to be a problem for ancestry inference in human dataexcept when the source populations for admixture are not well-differentiated Local ancestry methods which can make use of externalgenotype data also provide a way of avoiding the effects of sampleselection bias We expect that sample selection bias will be of interestin ancestry analysis for model organisms where availability of externalgenotype data are limited and most genotyping chips have relativelyfewer loci When the sampling process is relatively well-understooda mathematical correction as we have proposed will help to reduce theeffects of sample selection bias In our experiments a rough guidelineof using approximately identical number of admixed and unmixedindividuals reduces the effects of biased sampling When the samplingprocess cannot be modeled accurately examining the robustness ofancestry estimates to sampling may help avoid inaccurate ancestryestimation

ACKNOWLEDGMENTSThis research was supported in part by NSF Grant DBI-0546594 andNIH Grant R01GM087694

LITERATURE CITEDAlexander D H J Novembre and K Lange 2009 Fast model-based es-

timation of ancestry in unrelated individuals Genome Res 19 1655ndash1664

Baran Y B Pasaniuc S Sankararaman D G Torgerson C Gignoux et al2012 Fast and accurate inference of local ancestry in Latino popula-tions Bioinformatics 28 1359ndash1367

Cavalli-Sforza L L 2005 The Human Genome Diversity Project pastpresent and future Nat Rev Genet 6 333ndash340

Cortes C M Mohri M Riley and A Rostamizadeh 2008 Sample selec-tion bias correction theory Algorithmic Learning Theory 5254 16

Davidson I and B Zadrozny 2005 An improved categorization of clas-sifiers sensitivity on sample selection bias pp 605ndash608 in Fifth IEEEInternational Conference on Data Mining ICDM05 Available athttpieeexploreieeeorgxplarticleDetailsjsptp=amparnumber=1565737ampqueryText3DAn+improved+categorization+of+classifiers+sensitivity+on+sample+selection+bias Accessed March 26 2014

Engelhardt B E and M Stephens 2010 Analysis of population structurea unifying framework and novel methods based on sparse factor analysisPLoS Genet 6 12

Falush D M Stephens and J K Pritchard 2003 Inference of populationstructure using multilocus genotype data linked loci and correlated allelefrequencies Genetics 164 1567ndash1587

Gibbs R A 2003 The International HapMap Project Nature 426 789ndash796Heckman J J 1979 Sample selection bias as a specification error Econo-

metrica 47 153ndash161Hoggart C J E J Parra M D Shriver C Bonilla R A Kittles et al

2003 Control of confounding of genetic associations in stratified pop-ulations Am J Hum Genet 72 1492ndash1504

Hudson R R 2002 Generating samples under a Wright-Fisher neutralmodel of genetic variation Bioinformatics 18 337ndash338

Huelsenbeck J P and P Andolfatto 2007 Inference of population struc-ture under a Dirichlet process prior Genetics 175 1787ndash1802

Kaeuffer R D Reacuteale D W Coltman and D Pontier 2007 Detectingpopulation structure using STRUCTURE software effect of backgroundlinkage disequilibrium Heredity 99 374ndash380

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013a Data from New World cattle show ancestry from multiple in-dependent domestication events Available at httpdatadryadorgresourcedoi105061dryad42tr0 Accessed March 26 2014

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013b New World cattle show ancestry from multiple independentdomestication events Proc Natl Acad Sci USA 110 E1398ndashE1406

McVean G 2009 A genealogical interpretation of principal componentsanalysis PLoS Genet 5 e1000686

Nelson M R K Bryc K S King A Indap A R Boyko et al 2008 ThePopulation Reference Sample POPRES a resource for population dis-ease and pharmacological genetics research Am J Hum Genet 83 347ndash358

Novembre J and M Stephens 2008 Interpreting principal componentanalyses of spatial population genetic variation Nat Genet 40 646ndash649

Novembre J T Johnson K Bryc Z Kutalik A R Boyko et al2008 Genes mirror geography within Europe Nature 456 98ndash101

Patterson N A L Price and D Reich 2006 Population structure andeigenanalysis PLoS Genet 2 e190

Phillips S J M Dudiacutek J Elith C H Graham A Lehmann et al2009 Sample selection bias and presence-only distribution modelsimplications for background and pseudo-absence data Ecol Appl 19181ndash197

Price A L N J Patterson R M Plenge M E Weinblatt N A Shadicket al 2006 Principal components analysis corrects for stratification ingenome-wide association studies Nat Genet 38 904ndash909

Price A L A Tandon N Patterson K C Barnes N Rafaels et al2009 Sensitive detection of chromosomal segments of distinct ancestryin admixed populations PLoS Genet 5 e1000519

Pritchard J K M Stephens and P Donnelly 2000 Inference of populationstructure from multilocus genotype data Genetics 155 945ndash959

Rosenberg N A J K Pritchard J L Weber H M Cann K K Kidd et al2002 Genetic Structure of Human Populations Science 298 2381ndash2385

Sankararaman S G Kimmel E Halperin and M I Jordan 2008 On theinference of ancestries in admixed populations Genome Res 18 668ndash675

910 | S Shringarpure and E P Xing

Shringarpure S and E P Xing 2009 mStruct Inference of PopulationStructure in Light of Both Genetic Admixing and Allele MutationsGenetics 182 575ndash593

Tang H J Peng P Wang and N J Risch 2005 Estimation of individualadmixture analytical and study design considerations Genet Epidemiol28 289ndash301

Vella F 1998 Estimating models with sample selection bias a surveyJ Hum Resour 33 127ndash169

Wall J D R Jiang C Gignoux G K Chen C Eng et al 2011 Geneticvariation in Native Americans inferred from Latino SNP and rese-quencing data Mol Biol Evol 28 2231ndash2237

Zadrozny B 2004 Learning and evaluating classifiers under sample selec-tion bias pp 114 in Twenty-First International Conference on MachineLearning ICML 04 ACM New York NY

Zadrozny B J Langford and N Abe 2003 Cost-sensitive learning by cost-proportionate example weighting pp 435ndash442 in Third IEEE Interna-tional Conference on Data Mining ICDM 2003 Available at httpieeexploreieeeorgxplloginjsptp=amparnumber=1250950ampurl=http3A2F2Fieeexploreieeeorg2Fxpls2Fabs_alljsp3Farnumber3D1250950 Accessed March 27 2014

Communicating editor D-J De Koning

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 911

Page 8: Effects of Sample Selection Bias on the Accuracy of ...epxing/papers/2014/Shringarpure_Xing_GGG...The problem of sample selection bias was first widely studied in econometrics, where

correction as before We assume that the effective population size forthe admixed population is identical to that for population 1 ie N0 =104 Figure 7 shows the change in accuracy after application of theresampling correction (relative to the accuracies in Figure 6)

From the figure we can see that while the accuracy for the originaldataset is less than 095 the resampling correction produces animprovement in the results denoted by the portions of the curves inbold However beyond those points as the datasets acquire an excessof the unmixed individuals relative to the admixed individuals we seethat the uncorrected accuracy is high (larger than 095) and theresampling correction decreases the accuracy of ancestry inferenceThus a resampling correction is unnecessary when there is an excessof unmixed individuals relative to their expected number fromeffective population size calculations

Sample selection bias in cattle genotype dataTo demonstrate the effects of sample selection bias on a real geneticdataset we used the data from (McTavish et al 2013ab) who gen-otyped 1461 individuals belonging to cattle breeds across the world at47506 SNPs We focus on admixture between African and Europeancattle breeds as seen in their New World descendants For theirexperiments with STRUCTURE McTavish et al (2013b) used a subsetof 1814 SNPs that was common to all their genotyping chips Our datatherefore consists of genotype data at 1814 SNPs from 40 New Worldindividuals (Texas Longhorns) 100 European individuals (Limousinfrom Southern Europe) and 100 African individuals (NrsquoDama andNrsquoDamaXBoran)

We created multiple subsets from the original data by including allthe 40 admixed New World individuals and a variable number (x) ofunmixed individuals from the European and African populations(with the same number from each population) The number of un-mixed individuals chosen was increased gradually from 10 to 100 (x 210 20 30 40 50 60 80 100) For each value of x we generated 30datasets for statistical robustness ADMIXTURE was used with K = 2to infer ancestry proportions for all individuals The squared correla-tion coefficient for the ancestry proportions of the admixed individ-uals was used as a measure of accuracy with the ancestry proportionsinferred on the full data used as a truth set

Figure 8 shows the results of the ancestry inference as a function ofthe number of unmixed individuals (x) We see that as the number of

unmixed individuals increases the accuracy of ancestry inferenceimproves When the number of unmixed individuals is less than 40the squared correlation measure is less than 09

Correction of bias For the cattle data effective population sizes forthe three populations are unknown We assume that all threepopulations have equal effective population size The correction isimplemented using resampling as before The blue curve in Figure 8shows the effect of the correction on accuracy From the figure we cansee that when the number of admixed individuals is larger than thenumber of unmixed individuals (tsample $ 1 as defined previously)the correction improves accuracy but causes a small drop in accuracywhen there are many unmixed individuals

DISCUSSIONBiased sampling of individuals is a result of practical constraints onthe study design of genotypingsequencing projects In this aspectsample selection bias is similar to bias due to SNP ascertainment Itcan also occur as a result of combining multiple genomic datasetsfrom different sources For instance the overlap of three genotypingchips required McTavish et al (2013b) to use only 1814 SNPs for theirSTRUCTURE analysis

In most stratification analyses the recovered ancestry proportionsare used to make inferences about the ancestry of the sample and thegenetic relationships of similaritydifferences between the studypopulations Ancestry proportion estimates are also used in associa-tion studies to account for effects of stratification Therefore it isessential to have accurate recovery of ancestry Our experimentssuggest that sample selection bias can be a problem in accuratepopulation stratification and recovering individual ancestry Oursimulations show that unlike observations from previous studies theaccuracy of individual ancestry recovery is not dependent only on thenumber of unmixed individuals present in the sample We observea threshold effect where the accuracy of ancestry inference is affectedby sample selection bias depending on the ratio of admixedindividuals to unmixed individuals in the sample Our simulationssuggest that the accuracy of ancestry inference is affected when thisratio tsample is less than 5 For real genotype data from cattle we

Figure 7 Change in correlation measure from Figure 6 after theresampling correction taking into account the impact of the contrac-tion on effective population size Each curve is drawn for a single valueof the contraction factor a and shows the mean and standard error ofthe change in correlation measure over 30 datasets The regions of thecurves in bold show regions where accuracy in the uncorrected results(from Figure 6) is less than 095

Figure 6 Impact of effective population size on the correlationmeasure of accuracy of ancestry inference using ADMIXTURE withK = 2 Each curve is drawn for a single value of the contraction factor aand shows the mean and standard error of the correlation measureover 30 datasets

908 | S Shringarpure and E P Xing

observed that a ratio of tsample larger than 1 was sufficient to affectancestry estimates

Our experiments do not directly demonstrate an effect of FST onsample selection bias However using the 1814 SNPs for the genotypedata we were able to observe effects of sample selection bias onAfrican-European admixture in cattle (FST 015) but unable toobserve any effect of sample selection bias on European-Indian ad-mixture (FST 028 results not shown here) This suggests that FSThas an effect on sample selection bias with well-differentiated pop-ulations being easy to separate even with biased sampling The sim-ulations and real data analysis had relatively large FST between the twoancestral populations ( 01 for the simulations 015 for the cattlegenotype data) but vastly different number of SNPs (52040 for thesimulation but only 1814 for the cattle genotype data) The degrada-tion in performance of ancestry inference in the two cases was con-siderably different suggesting that the effect of sample selection biasalso depends on the number of loci available This was also validatedby the observation that doubling the number of SNPs in simulationeliminated the effects of sample selection bias completely

We also showed through simulations that sample selection biasdepends on the effective population size rather than simply the censuspopulation size of the populations studied This suggests that samplingfor genotypingsequencing projects can be improved (in terms ofdiversity of sampling and cost) by taking into account independentknowledge of the demographic history of the populations beingsampled

Although our analyses use two specific methods (ADMIXTUREand EIGENSOFT) the effects we observe are a feature of theassumptions underlying both methods rather than the specificimplementations ADMIXTURE is a representative of the likelihood-based models that assume (a) admixture between ancestral popula-tions and (b) that modern individual genomes are mixture ofcontributions from different ancestral populations This is themodel underlying STRUCTURE and frappe and to a large extentthe extensions mentioned earlier Likelihood-based methods aresusceptible to effects of sample selection bias since each individual isgiven equal weight in the sample a priori This is a result of thefundamental assumption of learning methods that the sampleobserved is representative of the underlying population distributionEigenanalysis which also weighs each individual equally apriori alsosuffers from a similar problem as the likelihood-based method The

sensitivity of eigenanalysis to sample size variation and outliers is well-known and is also reported by McVean (2009) and Novembre andStephens (2008)

ADMIXTURE and EIGENSOFT are unsupervised methods forperforming ancestry inference ie they do not leverage informationabout individuals of known ancestry to estimate their respective modelparameters ADMIXTURE supports a mode (which is referred to asa supervised mode in the documentation) in which some individualscan be ldquolabeledrdquo ie assigned to have ancestry from only a singlecluster However all the individuals labeled or otherwise are usedto learn the allele frequency parameters of the ADMIXTURE modelIn machine learning literature this mode of learning is commonlycalled ldquosemi-supervisedrdquo learning wherein a combination of labeledand unlabeled data is used to learn model parameters In simulationswe observed that this mode of ancestry inference is also affected bysample selection bias

A different class of methods can estimate local (subcontinental)ancestry from dense genotype data (Sankararaman et al 2008 Priceet al 2009 Wall et al 2011 Baran et al 2012) A common theme of allof these methods is the use of external sources of genotype data asldquoreference panelsrdquo (referred to as ldquotraining datardquo in machine learningliterature) which are used to estimate ancestral allele frequencies orancestral haplotypes The estimated parameters (frequencies or hap-lotypes) are then used to infer the ancestry of the individuals in thestudy sample based on their genotypes (referred to as ldquotest datardquo inmachine learning literature) Therefore these methods can be said tobe performing supervised learning Supervised learning using externalsources of data for estimating model parameters is robust to biasedsampling in the study sample but is sensitive to the reference panelsused for training (Zadrozny et al 2003)

We proposed a resampling correction for sample selection biasusing a mathematical framework modeling the sampling process Theproposed correction requires knowledge of some auxiliary informa-tion about the selection criteria that is correlated with the genotypes(bias is present only if selection is dependent on the genotypesdirectly or indirectly) For genetic datasets geography provides onesuch criterion that is easy to acquire during data collection Using thisinformation we proposed a correction that is easy to implement Weshowed using simulation experiments that such a correction iseffective in practice and leads to more accurate results In terms ofcomputational effort ADMIXTURE has a linear dependence on thenumber of individuals The resampling correction will therefore requiretime proportional to the number of individuals in the resampled setHowever a better implementation of the resampling correction forADMIXTURE would be to be weigh the likelihood contribution fromeach individual in the original sample by its resampling correctionfactor (1p(s = 1|u)) This would allow the correction to be implementedwithout any increase in computational complexity of the ADMIXTUREalgorithm The resampling correction implemented naively can causeproblems for EIGENSOFT due to the possibility of colinearity resultingfrom duplication of individuals as evident in Figure 2B A weightedeigenanalysis would avoid this problem without affecting the efficiencyof the analysis

An alternative method for correcting sample selection bias is toadd more markers to the analysis as observed from the simulationAdding more SNPs to the analysis is an appealing solution because itdoes not require any further assumptions that any statisticalcorrection may require with the exception of the assumption of nolinkage between SNPs However this may not always be possible if thestudy dataset is constructed by intersecting multiple datasets contain-ing the individuals of interest genotyped on different platforms

Figure 8 Variation in accuracy of ancestry inference with changingnumber of unmixed individuals The red curve shows accuracy withoutcorrection and the blue curve shows the accuracy when the correctionis applied

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 909

Our simulation experiments used simple two-population admix-ture scenarios to examine the effects of sample selection bias on theaccuracy of stratification In one set of experiments we used twopopulations (CEU and YRI) that were easily separable and examinedthe admixed population resulting after a single recent admixtureevent In reality the demographic processes underlying the evolutionof populations are much more complicated In such scenarios it isreasonable to expect that the stratification problem may be harder toresolve and would suffer from the effects of sample selection bias moreseverely

A limitation of our proposed correction is the requirement forexternal knowledge of the selection criteria for samples In ourexperiments we either knew the nature of the bias (by design in thesimulation experiments) or assumed that it was known In general theaccuracy of the correction method proposed will strongly depend onthe relationship between the selection criteria and the genotypesCorrections factors therefore may be dataset-specific An alternativefuture direction for correcting sample selection bias would be todevelop models of population structure that can also model theauxiliary factors such as geography or language that may determinethe selection process responsible for the biased sampling There maybe occasions where the nature of the sampling bias is partially knownwithout having sufficient information to allow a correction to bedeveloped For instance analyses of the POPRES dataset (Nelson et al2008) often select individuals who are known to have all four grand-parents of the same ancestry (Novembre et al 2008) In this case thecriteria for selection can be stated explicitly but a mathematical for-mulation of the bias is hard to obtain In such scenarios we recom-mend subsamplingresampling datasets to examine the robustness ofancestry inference results to variation in sampling

In summary our experiments suggest that sample selection biasaffects ancestry inference and depends on factors such as effectivepopulation size genetic differentiation between populations numberof loci available and the ratio of admixed samples to unmixedsamples For analyses of human genotype data hundreds of thousandsof SNPs are typically available Therefore we do not expect sampleselection bias to be a problem for ancestry inference in human dataexcept when the source populations for admixture are not well-differentiated Local ancestry methods which can make use of externalgenotype data also provide a way of avoiding the effects of sampleselection bias We expect that sample selection bias will be of interestin ancestry analysis for model organisms where availability of externalgenotype data are limited and most genotyping chips have relativelyfewer loci When the sampling process is relatively well-understooda mathematical correction as we have proposed will help to reduce theeffects of sample selection bias In our experiments a rough guidelineof using approximately identical number of admixed and unmixedindividuals reduces the effects of biased sampling When the samplingprocess cannot be modeled accurately examining the robustness ofancestry estimates to sampling may help avoid inaccurate ancestryestimation

ACKNOWLEDGMENTSThis research was supported in part by NSF Grant DBI-0546594 andNIH Grant R01GM087694

LITERATURE CITEDAlexander D H J Novembre and K Lange 2009 Fast model-based es-

timation of ancestry in unrelated individuals Genome Res 19 1655ndash1664

Baran Y B Pasaniuc S Sankararaman D G Torgerson C Gignoux et al2012 Fast and accurate inference of local ancestry in Latino popula-tions Bioinformatics 28 1359ndash1367

Cavalli-Sforza L L 2005 The Human Genome Diversity Project pastpresent and future Nat Rev Genet 6 333ndash340

Cortes C M Mohri M Riley and A Rostamizadeh 2008 Sample selec-tion bias correction theory Algorithmic Learning Theory 5254 16

Davidson I and B Zadrozny 2005 An improved categorization of clas-sifiers sensitivity on sample selection bias pp 605ndash608 in Fifth IEEEInternational Conference on Data Mining ICDM05 Available athttpieeexploreieeeorgxplarticleDetailsjsptp=amparnumber=1565737ampqueryText3DAn+improved+categorization+of+classifiers+sensitivity+on+sample+selection+bias Accessed March 26 2014

Engelhardt B E and M Stephens 2010 Analysis of population structurea unifying framework and novel methods based on sparse factor analysisPLoS Genet 6 12

Falush D M Stephens and J K Pritchard 2003 Inference of populationstructure using multilocus genotype data linked loci and correlated allelefrequencies Genetics 164 1567ndash1587

Gibbs R A 2003 The International HapMap Project Nature 426 789ndash796Heckman J J 1979 Sample selection bias as a specification error Econo-

metrica 47 153ndash161Hoggart C J E J Parra M D Shriver C Bonilla R A Kittles et al

2003 Control of confounding of genetic associations in stratified pop-ulations Am J Hum Genet 72 1492ndash1504

Hudson R R 2002 Generating samples under a Wright-Fisher neutralmodel of genetic variation Bioinformatics 18 337ndash338

Huelsenbeck J P and P Andolfatto 2007 Inference of population struc-ture under a Dirichlet process prior Genetics 175 1787ndash1802

Kaeuffer R D Reacuteale D W Coltman and D Pontier 2007 Detectingpopulation structure using STRUCTURE software effect of backgroundlinkage disequilibrium Heredity 99 374ndash380

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013a Data from New World cattle show ancestry from multiple in-dependent domestication events Available at httpdatadryadorgresourcedoi105061dryad42tr0 Accessed March 26 2014

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013b New World cattle show ancestry from multiple independentdomestication events Proc Natl Acad Sci USA 110 E1398ndashE1406

McVean G 2009 A genealogical interpretation of principal componentsanalysis PLoS Genet 5 e1000686

Nelson M R K Bryc K S King A Indap A R Boyko et al 2008 ThePopulation Reference Sample POPRES a resource for population dis-ease and pharmacological genetics research Am J Hum Genet 83 347ndash358

Novembre J and M Stephens 2008 Interpreting principal componentanalyses of spatial population genetic variation Nat Genet 40 646ndash649

Novembre J T Johnson K Bryc Z Kutalik A R Boyko et al2008 Genes mirror geography within Europe Nature 456 98ndash101

Patterson N A L Price and D Reich 2006 Population structure andeigenanalysis PLoS Genet 2 e190

Phillips S J M Dudiacutek J Elith C H Graham A Lehmann et al2009 Sample selection bias and presence-only distribution modelsimplications for background and pseudo-absence data Ecol Appl 19181ndash197

Price A L N J Patterson R M Plenge M E Weinblatt N A Shadicket al 2006 Principal components analysis corrects for stratification ingenome-wide association studies Nat Genet 38 904ndash909

Price A L A Tandon N Patterson K C Barnes N Rafaels et al2009 Sensitive detection of chromosomal segments of distinct ancestryin admixed populations PLoS Genet 5 e1000519

Pritchard J K M Stephens and P Donnelly 2000 Inference of populationstructure from multilocus genotype data Genetics 155 945ndash959

Rosenberg N A J K Pritchard J L Weber H M Cann K K Kidd et al2002 Genetic Structure of Human Populations Science 298 2381ndash2385

Sankararaman S G Kimmel E Halperin and M I Jordan 2008 On theinference of ancestries in admixed populations Genome Res 18 668ndash675

910 | S Shringarpure and E P Xing

Shringarpure S and E P Xing 2009 mStruct Inference of PopulationStructure in Light of Both Genetic Admixing and Allele MutationsGenetics 182 575ndash593

Tang H J Peng P Wang and N J Risch 2005 Estimation of individualadmixture analytical and study design considerations Genet Epidemiol28 289ndash301

Vella F 1998 Estimating models with sample selection bias a surveyJ Hum Resour 33 127ndash169

Wall J D R Jiang C Gignoux G K Chen C Eng et al 2011 Geneticvariation in Native Americans inferred from Latino SNP and rese-quencing data Mol Biol Evol 28 2231ndash2237

Zadrozny B 2004 Learning and evaluating classifiers under sample selec-tion bias pp 114 in Twenty-First International Conference on MachineLearning ICML 04 ACM New York NY

Zadrozny B J Langford and N Abe 2003 Cost-sensitive learning by cost-proportionate example weighting pp 435ndash442 in Third IEEE Interna-tional Conference on Data Mining ICDM 2003 Available at httpieeexploreieeeorgxplloginjsptp=amparnumber=1250950ampurl=http3A2F2Fieeexploreieeeorg2Fxpls2Fabs_alljsp3Farnumber3D1250950 Accessed March 27 2014

Communicating editor D-J De Koning

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 911

Page 9: Effects of Sample Selection Bias on the Accuracy of ...epxing/papers/2014/Shringarpure_Xing_GGG...The problem of sample selection bias was first widely studied in econometrics, where

observed that a ratio of tsample larger than 1 was sufficient to affectancestry estimates

Our experiments do not directly demonstrate an effect of FST onsample selection bias However using the 1814 SNPs for the genotypedata we were able to observe effects of sample selection bias onAfrican-European admixture in cattle (FST 015) but unable toobserve any effect of sample selection bias on European-Indian ad-mixture (FST 028 results not shown here) This suggests that FSThas an effect on sample selection bias with well-differentiated pop-ulations being easy to separate even with biased sampling The sim-ulations and real data analysis had relatively large FST between the twoancestral populations ( 01 for the simulations 015 for the cattlegenotype data) but vastly different number of SNPs (52040 for thesimulation but only 1814 for the cattle genotype data) The degrada-tion in performance of ancestry inference in the two cases was con-siderably different suggesting that the effect of sample selection biasalso depends on the number of loci available This was also validatedby the observation that doubling the number of SNPs in simulationeliminated the effects of sample selection bias completely

We also showed through simulations that sample selection biasdepends on the effective population size rather than simply the censuspopulation size of the populations studied This suggests that samplingfor genotypingsequencing projects can be improved (in terms ofdiversity of sampling and cost) by taking into account independentknowledge of the demographic history of the populations beingsampled

Although our analyses use two specific methods (ADMIXTUREand EIGENSOFT) the effects we observe are a feature of theassumptions underlying both methods rather than the specificimplementations ADMIXTURE is a representative of the likelihood-based models that assume (a) admixture between ancestral popula-tions and (b) that modern individual genomes are mixture ofcontributions from different ancestral populations This is themodel underlying STRUCTURE and frappe and to a large extentthe extensions mentioned earlier Likelihood-based methods aresusceptible to effects of sample selection bias since each individual isgiven equal weight in the sample a priori This is a result of thefundamental assumption of learning methods that the sampleobserved is representative of the underlying population distributionEigenanalysis which also weighs each individual equally apriori alsosuffers from a similar problem as the likelihood-based method The

sensitivity of eigenanalysis to sample size variation and outliers is well-known and is also reported by McVean (2009) and Novembre andStephens (2008)

ADMIXTURE and EIGENSOFT are unsupervised methods forperforming ancestry inference ie they do not leverage informationabout individuals of known ancestry to estimate their respective modelparameters ADMIXTURE supports a mode (which is referred to asa supervised mode in the documentation) in which some individualscan be ldquolabeledrdquo ie assigned to have ancestry from only a singlecluster However all the individuals labeled or otherwise are usedto learn the allele frequency parameters of the ADMIXTURE modelIn machine learning literature this mode of learning is commonlycalled ldquosemi-supervisedrdquo learning wherein a combination of labeledand unlabeled data is used to learn model parameters In simulationswe observed that this mode of ancestry inference is also affected bysample selection bias

A different class of methods can estimate local (subcontinental)ancestry from dense genotype data (Sankararaman et al 2008 Priceet al 2009 Wall et al 2011 Baran et al 2012) A common theme of allof these methods is the use of external sources of genotype data asldquoreference panelsrdquo (referred to as ldquotraining datardquo in machine learningliterature) which are used to estimate ancestral allele frequencies orancestral haplotypes The estimated parameters (frequencies or hap-lotypes) are then used to infer the ancestry of the individuals in thestudy sample based on their genotypes (referred to as ldquotest datardquo inmachine learning literature) Therefore these methods can be said tobe performing supervised learning Supervised learning using externalsources of data for estimating model parameters is robust to biasedsampling in the study sample but is sensitive to the reference panelsused for training (Zadrozny et al 2003)

We proposed a resampling correction for sample selection biasusing a mathematical framework modeling the sampling process Theproposed correction requires knowledge of some auxiliary informa-tion about the selection criteria that is correlated with the genotypes(bias is present only if selection is dependent on the genotypesdirectly or indirectly) For genetic datasets geography provides onesuch criterion that is easy to acquire during data collection Using thisinformation we proposed a correction that is easy to implement Weshowed using simulation experiments that such a correction iseffective in practice and leads to more accurate results In terms ofcomputational effort ADMIXTURE has a linear dependence on thenumber of individuals The resampling correction will therefore requiretime proportional to the number of individuals in the resampled setHowever a better implementation of the resampling correction forADMIXTURE would be to be weigh the likelihood contribution fromeach individual in the original sample by its resampling correctionfactor (1p(s = 1|u)) This would allow the correction to be implementedwithout any increase in computational complexity of the ADMIXTUREalgorithm The resampling correction implemented naively can causeproblems for EIGENSOFT due to the possibility of colinearity resultingfrom duplication of individuals as evident in Figure 2B A weightedeigenanalysis would avoid this problem without affecting the efficiencyof the analysis

An alternative method for correcting sample selection bias is toadd more markers to the analysis as observed from the simulationAdding more SNPs to the analysis is an appealing solution because itdoes not require any further assumptions that any statisticalcorrection may require with the exception of the assumption of nolinkage between SNPs However this may not always be possible if thestudy dataset is constructed by intersecting multiple datasets contain-ing the individuals of interest genotyped on different platforms

Figure 8 Variation in accuracy of ancestry inference with changingnumber of unmixed individuals The red curve shows accuracy withoutcorrection and the blue curve shows the accuracy when the correctionis applied

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 909

Our simulation experiments used simple two-population admix-ture scenarios to examine the effects of sample selection bias on theaccuracy of stratification In one set of experiments we used twopopulations (CEU and YRI) that were easily separable and examinedthe admixed population resulting after a single recent admixtureevent In reality the demographic processes underlying the evolutionof populations are much more complicated In such scenarios it isreasonable to expect that the stratification problem may be harder toresolve and would suffer from the effects of sample selection bias moreseverely

A limitation of our proposed correction is the requirement forexternal knowledge of the selection criteria for samples In ourexperiments we either knew the nature of the bias (by design in thesimulation experiments) or assumed that it was known In general theaccuracy of the correction method proposed will strongly depend onthe relationship between the selection criteria and the genotypesCorrections factors therefore may be dataset-specific An alternativefuture direction for correcting sample selection bias would be todevelop models of population structure that can also model theauxiliary factors such as geography or language that may determinethe selection process responsible for the biased sampling There maybe occasions where the nature of the sampling bias is partially knownwithout having sufficient information to allow a correction to bedeveloped For instance analyses of the POPRES dataset (Nelson et al2008) often select individuals who are known to have all four grand-parents of the same ancestry (Novembre et al 2008) In this case thecriteria for selection can be stated explicitly but a mathematical for-mulation of the bias is hard to obtain In such scenarios we recom-mend subsamplingresampling datasets to examine the robustness ofancestry inference results to variation in sampling

In summary our experiments suggest that sample selection biasaffects ancestry inference and depends on factors such as effectivepopulation size genetic differentiation between populations numberof loci available and the ratio of admixed samples to unmixedsamples For analyses of human genotype data hundreds of thousandsof SNPs are typically available Therefore we do not expect sampleselection bias to be a problem for ancestry inference in human dataexcept when the source populations for admixture are not well-differentiated Local ancestry methods which can make use of externalgenotype data also provide a way of avoiding the effects of sampleselection bias We expect that sample selection bias will be of interestin ancestry analysis for model organisms where availability of externalgenotype data are limited and most genotyping chips have relativelyfewer loci When the sampling process is relatively well-understooda mathematical correction as we have proposed will help to reduce theeffects of sample selection bias In our experiments a rough guidelineof using approximately identical number of admixed and unmixedindividuals reduces the effects of biased sampling When the samplingprocess cannot be modeled accurately examining the robustness ofancestry estimates to sampling may help avoid inaccurate ancestryestimation

ACKNOWLEDGMENTSThis research was supported in part by NSF Grant DBI-0546594 andNIH Grant R01GM087694

LITERATURE CITEDAlexander D H J Novembre and K Lange 2009 Fast model-based es-

timation of ancestry in unrelated individuals Genome Res 19 1655ndash1664

Baran Y B Pasaniuc S Sankararaman D G Torgerson C Gignoux et al2012 Fast and accurate inference of local ancestry in Latino popula-tions Bioinformatics 28 1359ndash1367

Cavalli-Sforza L L 2005 The Human Genome Diversity Project pastpresent and future Nat Rev Genet 6 333ndash340

Cortes C M Mohri M Riley and A Rostamizadeh 2008 Sample selec-tion bias correction theory Algorithmic Learning Theory 5254 16

Davidson I and B Zadrozny 2005 An improved categorization of clas-sifiers sensitivity on sample selection bias pp 605ndash608 in Fifth IEEEInternational Conference on Data Mining ICDM05 Available athttpieeexploreieeeorgxplarticleDetailsjsptp=amparnumber=1565737ampqueryText3DAn+improved+categorization+of+classifiers+sensitivity+on+sample+selection+bias Accessed March 26 2014

Engelhardt B E and M Stephens 2010 Analysis of population structurea unifying framework and novel methods based on sparse factor analysisPLoS Genet 6 12

Falush D M Stephens and J K Pritchard 2003 Inference of populationstructure using multilocus genotype data linked loci and correlated allelefrequencies Genetics 164 1567ndash1587

Gibbs R A 2003 The International HapMap Project Nature 426 789ndash796Heckman J J 1979 Sample selection bias as a specification error Econo-

metrica 47 153ndash161Hoggart C J E J Parra M D Shriver C Bonilla R A Kittles et al

2003 Control of confounding of genetic associations in stratified pop-ulations Am J Hum Genet 72 1492ndash1504

Hudson R R 2002 Generating samples under a Wright-Fisher neutralmodel of genetic variation Bioinformatics 18 337ndash338

Huelsenbeck J P and P Andolfatto 2007 Inference of population struc-ture under a Dirichlet process prior Genetics 175 1787ndash1802

Kaeuffer R D Reacuteale D W Coltman and D Pontier 2007 Detectingpopulation structure using STRUCTURE software effect of backgroundlinkage disequilibrium Heredity 99 374ndash380

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013a Data from New World cattle show ancestry from multiple in-dependent domestication events Available at httpdatadryadorgresourcedoi105061dryad42tr0 Accessed March 26 2014

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013b New World cattle show ancestry from multiple independentdomestication events Proc Natl Acad Sci USA 110 E1398ndashE1406

McVean G 2009 A genealogical interpretation of principal componentsanalysis PLoS Genet 5 e1000686

Nelson M R K Bryc K S King A Indap A R Boyko et al 2008 ThePopulation Reference Sample POPRES a resource for population dis-ease and pharmacological genetics research Am J Hum Genet 83 347ndash358

Novembre J and M Stephens 2008 Interpreting principal componentanalyses of spatial population genetic variation Nat Genet 40 646ndash649

Novembre J T Johnson K Bryc Z Kutalik A R Boyko et al2008 Genes mirror geography within Europe Nature 456 98ndash101

Patterson N A L Price and D Reich 2006 Population structure andeigenanalysis PLoS Genet 2 e190

Phillips S J M Dudiacutek J Elith C H Graham A Lehmann et al2009 Sample selection bias and presence-only distribution modelsimplications for background and pseudo-absence data Ecol Appl 19181ndash197

Price A L N J Patterson R M Plenge M E Weinblatt N A Shadicket al 2006 Principal components analysis corrects for stratification ingenome-wide association studies Nat Genet 38 904ndash909

Price A L A Tandon N Patterson K C Barnes N Rafaels et al2009 Sensitive detection of chromosomal segments of distinct ancestryin admixed populations PLoS Genet 5 e1000519

Pritchard J K M Stephens and P Donnelly 2000 Inference of populationstructure from multilocus genotype data Genetics 155 945ndash959

Rosenberg N A J K Pritchard J L Weber H M Cann K K Kidd et al2002 Genetic Structure of Human Populations Science 298 2381ndash2385

Sankararaman S G Kimmel E Halperin and M I Jordan 2008 On theinference of ancestries in admixed populations Genome Res 18 668ndash675

910 | S Shringarpure and E P Xing

Shringarpure S and E P Xing 2009 mStruct Inference of PopulationStructure in Light of Both Genetic Admixing and Allele MutationsGenetics 182 575ndash593

Tang H J Peng P Wang and N J Risch 2005 Estimation of individualadmixture analytical and study design considerations Genet Epidemiol28 289ndash301

Vella F 1998 Estimating models with sample selection bias a surveyJ Hum Resour 33 127ndash169

Wall J D R Jiang C Gignoux G K Chen C Eng et al 2011 Geneticvariation in Native Americans inferred from Latino SNP and rese-quencing data Mol Biol Evol 28 2231ndash2237

Zadrozny B 2004 Learning and evaluating classifiers under sample selec-tion bias pp 114 in Twenty-First International Conference on MachineLearning ICML 04 ACM New York NY

Zadrozny B J Langford and N Abe 2003 Cost-sensitive learning by cost-proportionate example weighting pp 435ndash442 in Third IEEE Interna-tional Conference on Data Mining ICDM 2003 Available at httpieeexploreieeeorgxplloginjsptp=amparnumber=1250950ampurl=http3A2F2Fieeexploreieeeorg2Fxpls2Fabs_alljsp3Farnumber3D1250950 Accessed March 27 2014

Communicating editor D-J De Koning

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 911

Page 10: Effects of Sample Selection Bias on the Accuracy of ...epxing/papers/2014/Shringarpure_Xing_GGG...The problem of sample selection bias was first widely studied in econometrics, where

Our simulation experiments used simple two-population admix-ture scenarios to examine the effects of sample selection bias on theaccuracy of stratification In one set of experiments we used twopopulations (CEU and YRI) that were easily separable and examinedthe admixed population resulting after a single recent admixtureevent In reality the demographic processes underlying the evolutionof populations are much more complicated In such scenarios it isreasonable to expect that the stratification problem may be harder toresolve and would suffer from the effects of sample selection bias moreseverely

A limitation of our proposed correction is the requirement forexternal knowledge of the selection criteria for samples In ourexperiments we either knew the nature of the bias (by design in thesimulation experiments) or assumed that it was known In general theaccuracy of the correction method proposed will strongly depend onthe relationship between the selection criteria and the genotypesCorrections factors therefore may be dataset-specific An alternativefuture direction for correcting sample selection bias would be todevelop models of population structure that can also model theauxiliary factors such as geography or language that may determinethe selection process responsible for the biased sampling There maybe occasions where the nature of the sampling bias is partially knownwithout having sufficient information to allow a correction to bedeveloped For instance analyses of the POPRES dataset (Nelson et al2008) often select individuals who are known to have all four grand-parents of the same ancestry (Novembre et al 2008) In this case thecriteria for selection can be stated explicitly but a mathematical for-mulation of the bias is hard to obtain In such scenarios we recom-mend subsamplingresampling datasets to examine the robustness ofancestry inference results to variation in sampling

In summary our experiments suggest that sample selection biasaffects ancestry inference and depends on factors such as effectivepopulation size genetic differentiation between populations numberof loci available and the ratio of admixed samples to unmixedsamples For analyses of human genotype data hundreds of thousandsof SNPs are typically available Therefore we do not expect sampleselection bias to be a problem for ancestry inference in human dataexcept when the source populations for admixture are not well-differentiated Local ancestry methods which can make use of externalgenotype data also provide a way of avoiding the effects of sampleselection bias We expect that sample selection bias will be of interestin ancestry analysis for model organisms where availability of externalgenotype data are limited and most genotyping chips have relativelyfewer loci When the sampling process is relatively well-understooda mathematical correction as we have proposed will help to reduce theeffects of sample selection bias In our experiments a rough guidelineof using approximately identical number of admixed and unmixedindividuals reduces the effects of biased sampling When the samplingprocess cannot be modeled accurately examining the robustness ofancestry estimates to sampling may help avoid inaccurate ancestryestimation

ACKNOWLEDGMENTSThis research was supported in part by NSF Grant DBI-0546594 andNIH Grant R01GM087694

LITERATURE CITEDAlexander D H J Novembre and K Lange 2009 Fast model-based es-

timation of ancestry in unrelated individuals Genome Res 19 1655ndash1664

Baran Y B Pasaniuc S Sankararaman D G Torgerson C Gignoux et al2012 Fast and accurate inference of local ancestry in Latino popula-tions Bioinformatics 28 1359ndash1367

Cavalli-Sforza L L 2005 The Human Genome Diversity Project pastpresent and future Nat Rev Genet 6 333ndash340

Cortes C M Mohri M Riley and A Rostamizadeh 2008 Sample selec-tion bias correction theory Algorithmic Learning Theory 5254 16

Davidson I and B Zadrozny 2005 An improved categorization of clas-sifiers sensitivity on sample selection bias pp 605ndash608 in Fifth IEEEInternational Conference on Data Mining ICDM05 Available athttpieeexploreieeeorgxplarticleDetailsjsptp=amparnumber=1565737ampqueryText3DAn+improved+categorization+of+classifiers+sensitivity+on+sample+selection+bias Accessed March 26 2014

Engelhardt B E and M Stephens 2010 Analysis of population structurea unifying framework and novel methods based on sparse factor analysisPLoS Genet 6 12

Falush D M Stephens and J K Pritchard 2003 Inference of populationstructure using multilocus genotype data linked loci and correlated allelefrequencies Genetics 164 1567ndash1587

Gibbs R A 2003 The International HapMap Project Nature 426 789ndash796Heckman J J 1979 Sample selection bias as a specification error Econo-

metrica 47 153ndash161Hoggart C J E J Parra M D Shriver C Bonilla R A Kittles et al

2003 Control of confounding of genetic associations in stratified pop-ulations Am J Hum Genet 72 1492ndash1504

Hudson R R 2002 Generating samples under a Wright-Fisher neutralmodel of genetic variation Bioinformatics 18 337ndash338

Huelsenbeck J P and P Andolfatto 2007 Inference of population struc-ture under a Dirichlet process prior Genetics 175 1787ndash1802

Kaeuffer R D Reacuteale D W Coltman and D Pontier 2007 Detectingpopulation structure using STRUCTURE software effect of backgroundlinkage disequilibrium Heredity 99 374ndash380

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013a Data from New World cattle show ancestry from multiple in-dependent domestication events Available at httpdatadryadorgresourcedoi105061dryad42tr0 Accessed March 26 2014

McTavish E J J E Decker R D Schnabel J F Taylor and D M Hillis2013b New World cattle show ancestry from multiple independentdomestication events Proc Natl Acad Sci USA 110 E1398ndashE1406

McVean G 2009 A genealogical interpretation of principal componentsanalysis PLoS Genet 5 e1000686

Nelson M R K Bryc K S King A Indap A R Boyko et al 2008 ThePopulation Reference Sample POPRES a resource for population dis-ease and pharmacological genetics research Am J Hum Genet 83 347ndash358

Novembre J and M Stephens 2008 Interpreting principal componentanalyses of spatial population genetic variation Nat Genet 40 646ndash649

Novembre J T Johnson K Bryc Z Kutalik A R Boyko et al2008 Genes mirror geography within Europe Nature 456 98ndash101

Patterson N A L Price and D Reich 2006 Population structure andeigenanalysis PLoS Genet 2 e190

Phillips S J M Dudiacutek J Elith C H Graham A Lehmann et al2009 Sample selection bias and presence-only distribution modelsimplications for background and pseudo-absence data Ecol Appl 19181ndash197

Price A L N J Patterson R M Plenge M E Weinblatt N A Shadicket al 2006 Principal components analysis corrects for stratification ingenome-wide association studies Nat Genet 38 904ndash909

Price A L A Tandon N Patterson K C Barnes N Rafaels et al2009 Sensitive detection of chromosomal segments of distinct ancestryin admixed populations PLoS Genet 5 e1000519

Pritchard J K M Stephens and P Donnelly 2000 Inference of populationstructure from multilocus genotype data Genetics 155 945ndash959

Rosenberg N A J K Pritchard J L Weber H M Cann K K Kidd et al2002 Genetic Structure of Human Populations Science 298 2381ndash2385

Sankararaman S G Kimmel E Halperin and M I Jordan 2008 On theinference of ancestries in admixed populations Genome Res 18 668ndash675

910 | S Shringarpure and E P Xing

Shringarpure S and E P Xing 2009 mStruct Inference of PopulationStructure in Light of Both Genetic Admixing and Allele MutationsGenetics 182 575ndash593

Tang H J Peng P Wang and N J Risch 2005 Estimation of individualadmixture analytical and study design considerations Genet Epidemiol28 289ndash301

Vella F 1998 Estimating models with sample selection bias a surveyJ Hum Resour 33 127ndash169

Wall J D R Jiang C Gignoux G K Chen C Eng et al 2011 Geneticvariation in Native Americans inferred from Latino SNP and rese-quencing data Mol Biol Evol 28 2231ndash2237

Zadrozny B 2004 Learning and evaluating classifiers under sample selec-tion bias pp 114 in Twenty-First International Conference on MachineLearning ICML 04 ACM New York NY

Zadrozny B J Langford and N Abe 2003 Cost-sensitive learning by cost-proportionate example weighting pp 435ndash442 in Third IEEE Interna-tional Conference on Data Mining ICDM 2003 Available at httpieeexploreieeeorgxplloginjsptp=amparnumber=1250950ampurl=http3A2F2Fieeexploreieeeorg2Fxpls2Fabs_alljsp3Farnumber3D1250950 Accessed March 27 2014

Communicating editor D-J De Koning

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 911

Page 11: Effects of Sample Selection Bias on the Accuracy of ...epxing/papers/2014/Shringarpure_Xing_GGG...The problem of sample selection bias was first widely studied in econometrics, where

Shringarpure S and E P Xing 2009 mStruct Inference of PopulationStructure in Light of Both Genetic Admixing and Allele MutationsGenetics 182 575ndash593

Tang H J Peng P Wang and N J Risch 2005 Estimation of individualadmixture analytical and study design considerations Genet Epidemiol28 289ndash301

Vella F 1998 Estimating models with sample selection bias a surveyJ Hum Resour 33 127ndash169

Wall J D R Jiang C Gignoux G K Chen C Eng et al 2011 Geneticvariation in Native Americans inferred from Latino SNP and rese-quencing data Mol Biol Evol 28 2231ndash2237

Zadrozny B 2004 Learning and evaluating classifiers under sample selec-tion bias pp 114 in Twenty-First International Conference on MachineLearning ICML 04 ACM New York NY

Zadrozny B J Langford and N Abe 2003 Cost-sensitive learning by cost-proportionate example weighting pp 435ndash442 in Third IEEE Interna-tional Conference on Data Mining ICDM 2003 Available at httpieeexploreieeeorgxplloginjsptp=amparnumber=1250950ampurl=http3A2F2Fieeexploreieeeorg2Fxpls2Fabs_alljsp3Farnumber3D1250950 Accessed March 27 2014

Communicating editor D-J De Koning

Volume 4 May 2014 | Biased Sampling in Ancestry Inference | 911