simulating genes in genome-wide association studies

Simulating Genes in GWAS Kevin R. Thornton Ecology and Evolutionary Biology UC Irvine slides will be available at http://www.slideshare.net/molpopgen http://www.molpopgen.org

Upload: kevin-thornton

Post on 14-Jun-2015


Category:

Science



DESCRIPTION

Talk given to the UCI Genetic Epidemiology Research Group (GERI, http://www.geri.uci.edu/) on May 16, 2014. Recent results on power to detect associations in growing populations + need for better statistical tests.

TRANSCRIPT

Page 1: Simulating Genes in Genome-wide Association Studies

Simulating Genes in GWAS

Kevin R. Thornton Ecology and Evolutionary Biology

UC Irvine

slides will be available at http://www.slideshare.net/molpopgen

http://www.molpopgen.org

Page 2: Simulating Genes in Genome-wide Association Studies

Acknowledgements

Tony Long Andrew Foran Jaleal Sanjak

Page 3: Simulating Genes in Genome-wide Association Studies

from the analyses described above, and consideration of an expanded reference group, described below.

Bipolar disorder (BD). Bipolar disorder (BD; manic depressive illness (ref. 26)) refers to an episodic recurrent pathological disturbance in mood (affect) ranging from extreme elation or mania to severe depression and usually accompanied by disturbances in thinking and behaviour: psychotic features (delusions and hallucinations) often occur. Pathogenesis is poorly understood but there is robust evidence for a substantial genetic contribution to risk (refs 27, 28). The estimated sibling recurrence risk (λs) is 7–10 and heritability 80–90% (refs 27, 28). The definition of BD phenotype is based solely on clinical features because, as yet, psychiatry lacks validating diagnostic tests such as those available for many physical illnesses. Indeed, a major goal of molecular genetics approaches to psychiatric illness is an improvement in diagnostic classification that will follow identification of the biological systems that underpin the clinical syndromes. The phenotype definition that we have used includes individuals that have suffered one or more episodes of pathologically elevated mood (see Methods), a criterion that captures the clinical spectrum of bipolar mood variation that shows familial aggregation (ref. 29).

Several genomic regions have been implicated in linkage studies (ref. 30) and, recently, replicated evidence implicating specific genes has been reported. Increasing evidence suggests an overlap in genetic susceptibility with schizophrenia, a psychotic disorder with many similarities to BD. In particular association findings have been reported with both disorders at DAOA (D-amino acid oxidase activator), DISC1 (disrupted in schizophrenia 1), NRG1 (neuregulin1) and DTNBP1 (dystrobrevin binding protein 1) (ref. 31).

The strongest signal in BD was with rs420259 at chromosome 16p12 (genotypic test P = 6.3 × 10⁻⁸; Table 3) and the best-fitting genetic model was recessive (Supplementary Table 8). Although recognizing that this signal was not additionally supported by the expanded reference group analysis (see below and Supplementary Table 9) and that independent replication is essential, we note that several genes at this locus could have pathological relevance to BD (Fig. 5). These include PALB2 (partner and localizer of BRCA2), which is involved in stability of key nuclear structures including chromatin and the nuclear matrix; NDUFAB1 (NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1), which encodes a subunit of complex I of the mitochondrial respiratory chain; and DCTN5 (dynactin 5), which encodes a protein involved in intracellular transport that is known to interact with the gene ‘disrupted in schizophrenia 1’ (DISC1) (ref. 32), the latter having been implicated in susceptibility to bipolar disorder as well as schizophrenia (ref. 33).

Of the four regions showing association at P < 5 × 10⁻⁷ in the expanded reference group analysis (Supplementary Table 9), it is of interest that the closest gene to the signal at rs1526805 (P = 2.2 × 10⁻⁷) is KCNC2 which encodes the Shaw-related voltage-gated potassium channel. Ion channelopathies are well-recognized as causes of episodic central nervous system disease, including seizures, ataxias

[Figure: seven stacked genome-wide scan panels, one per disease (Type 2 diabetes, Coronary artery disease, Crohn’s disease, Hypertension, Rheumatoid arthritis, Type 1 diabetes, Bipolar disorder). x axis: Chromosome (1–22 and X); y axis: −log10(P), 0 to 15.]

Figure 4 | Genome-wide scan for seven diseases. For each of seven diseases, −log10 of the trend test P value for quality-control-positive SNPs, excluding those in each disease that were excluded for having poor clustering after visual inspection, are plotted against position on each chromosome. Chromosomes are shown in alternating colours for clarity, with P values < 1 × 10⁻⁵ highlighted in green. All panels are truncated at −log10(P value) = 15, although some markers (for example, in the MHC in T1D and RA) exceed this significance threshold.
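The plotting rule in the Figure 4 caption (transform each P value to −log10, truncate at 15, highlight P < 1 × 10⁻⁵) can be sketched in a few lines. This is only an illustration of the caption's rule, not the authors' plotting code; the function name `manhattan_points`, its parameters, and the example P values are hypothetical, and P values are assumed to be strictly positive.

```python
import math


def manhattan_points(p_values, cap=15.0, highlight_below=1e-5):
    """Return (y, highlighted) pairs for a list of trend-test P values.

    Hypothetical helper: y is -log10(P) truncated at `cap`, and
    `highlighted` marks P values below `highlight_below`.
    Assumes every P value is > 0.
    """
    points = []
    for p in p_values:
        y = min(-math.log10(p), cap)  # truncate the y axis at `cap`
        points.append((y, p < highlight_below))
    return points


# Made-up P values, including one that exceeds the truncation limit.
example = [0.5, 1e-3, 6.3e-8, 1e-20]
for p, (y, flag) in zip(example, manhattan_points(example)):
    print(f"P={p:g}  -log10(P)={y:.2f}  highlighted={flag}")
```

Truncating rather than dropping extreme markers keeps strong signals such as the MHC visible while stopping them from compressing the rest of the panel.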

Source: Burton et al., Nature, Vol 447, 7 June 2007, p. 666 (Articles). doi:10.1038/nature05911. ©2007 Nature Publishing Group

Page 4: Simulating Genes in Genome-wide Association Studies

The questions arise as to why so much of the heritability is apparently unexplained by initial GWA findings, and why it is important. It is important because a substantial proportion of individual differences in disease susceptibility is known to be due to genetic factors, and understanding this genetic variation may contribute to better prevention, diagnosis and treatment of disease. It is important to recognize, however, that few investigators expected these studies immediately to find all of the variants associated with common diseases, or even most of them; the hope was that they would at least find some (ref. 16). Limitations in the design of early GWAS, such as imprecise phenotyping and the use of control groups of questionable comparability, may have reduced estimates of effect sizes while preserving some ability to identify associated variants (ref. 17). These studies have considerably surpassed early expectations, reproducibly identifying hundreds of variants in many dozens of traits, but for many traits they have explained only a small proportion of estimated heritability (ref. 18).

Many explanations for this missing heritability have been suggested, including much larger numbers of variants of smaller effect yet to be found; rarer variants (possibly with larger effects) that are poorly detected by available genotyping arrays that focus on variants present in 5% or more of the population; structural variants poorly captured by existing arrays; low power to detect gene–gene interactions; and inadequate accounting for shared environment among relatives. Consensus is lacking, however, on approaches and priorities for research to examine what has been termed ‘dark matter’ of genome-wide association, dark matter in the sense that one is sure it exists, can detect its influence, but simply cannot ‘see’ it (yet). Here we examine potential sources of missing heritability and propose research strategies to illuminate the genetics of complex diseases.

Heritability and allelic architecture of complex traits

It is reasonable to assume that allelic architecture (number, type, effect size and frequency of susceptibility variants) may differ across traits, and that missing heritability may take a different form for different diseases (ref. 19), but at present our understanding is too limited to distinguish these possibilities. Age-related macular degeneration may provide the best example of a common disease in which heritability is substantially explained by a small number of common variants of large effect (ref. 20), but for other conditions, such as Crohn’s disease, the proportion of heritability explained is not nearly so large despite a much larger number of identified variants (ref. 21) (Table 1). There are no obvious differences between these two traits in genetic architecture as predicted from clinical and epidemiological data that would explain the differences observed in their allelic architecture. Some apparent differences may simply be due to differences in the stage of investigation across traits. Studies in several conditions have clearly demonstrated that the number of detected variants increases with increasing sample size (refs 22–24).

Population genetic theory suggests an explanation for the paucity of variants explaining a large proportion of disease predisposition, in that decreased reproductive fitness should typically act to reduce the frequencies of high-risk variants. This might explain the relative lack of variants detected so far for some neuropsychiatric conditions, such as autism spectrum disorders, given their low reproductive fitness (ref. 25). Yet for a condition such as type 1 diabetes, which has a similar prevalence, familial risk, early onset and poor reproductive fitness (at least before the discovery of insulin therapy), more than 40 loci have already been reported; this might be because the overall sample sizes studied in type 1 diabetes have been very large (refs 26, 27). Present-day reproductive fitness may correlate poorly with the forces that have shaped variation throughout human evolution; moreover focusing on the reproductive effects of a single disease ignores the pleiotropic effects (effects of the same variant on multiple characteristics or disease risks) of multiple alleles influencing that condition simultaneously with many other conditions (ref. 28).

Selection might also be responsible for keeping genetic effect sizes low, as variants of larger effect may be selected against and eventually disappear (ref. 19). Long-term stabilizing selection minimizes the production of individuals at the extremes of a trait (ref. 29), in part by reducing the additive genetic effects of alleles already present or those arising de novo by mutation (ref. 30) to levels potentially beneath the ability of studies of feasible size to detect them. Selection may also contribute to differences in the ability to detect loci in different complex diseases, if genetic susceptibility to some diseases is more strongly affected by selection than other diseases, or if environmental perturbations vary in intensity across diseases. Immune and infectious agents have been recognized as among the strongest selection pressures in human evolution (ref. 31), and immune-related genes have been strongly implicated in Crohn’s disease and other immune-mediated diseases (ref. 3), suggesting either that pleiotropic effects of these variants reduce the efficiency of negative selection, or that strong environmental perturbation in modern societies might expose the disease risk associated with these variants. Selection may thus explain why disease allele frequencies are low and allelic effects are small, but this should manifest as low, rather than missing, heritability.

A probable contributor to the small genetic effect sizes observed so far is that current investigations have incompletely surveyed the potential causal variants within each gene. Relative risks observed for marker SNPs may underestimate the actual risks associated with the true causal variants. Notably, 11 out of 30 genes implicated as carrying common variants associated with lipid levels also carry known rare alleles of large effect identified in Mendelian dyslipidemias, including ABCA1, PCSK9 and LDLR (refs 22, 32), suggesting that genes containing common variants with modest effects on complex traits may also contain rare variants with larger effects.

An important consideration is that the overwhelming majority of GWAS and other genetic studies have been limited to European ancestry populations, whereas genetic variation is greatest in populations of recent African ancestry (ref. 2), and studies in non-Europeans have yielded intriguing new variants (refs 33, 34). Studies of populations of recent African ancestry in particular are likely to increase the yield of rare variants and narrow the large chromosomal regions of association identified in the ‘younger’ population due to extended linkage disequilibrium, or the tendency for adjacent genetic loci to be inherited together (ref. 31). Isolated populations may also be of value given their potential to be enriched in unique variants (ref. 35).

The accuracy of current heritability estimates is also important, because experimentally identified variants could never explain all the variance in an erroneously inflated heritability estimate. Heritability of quantitative traits, formally defined as the proportion of phenotypic variance in a population attributable to additive genetic factors (narrow-sense heritability, h² (ref. 36)) is typically estimated from

Table 1 | Estimates of heritability and number of loci for several complex traits

Disease | Number of loci | Proportion of heritability explained | Heritability measure
Age-related macular degeneration (ref. 72) | 5 | 50% | Sibling recurrence risk
Crohn’s disease (ref. 21) | 32 | 20% | Genetic risk (liability)
Systemic lupus erythematosus (ref. 73) | 6 | 15% | Sibling recurrence risk
Type 2 diabetes (ref. 74) | 18 | 6% | Sibling recurrence risk
HDL cholesterol (ref. 75) | 7 | 5.2% | Residual* phenotypic variance
Height (ref. 15) | 40 | 5% | Phenotypic variance
Early onset myocardial infarction (ref. 76) | 9 | 2.8% | Phenotypic variance
Fasting glucose (ref. 77) | 4 | 1.5% | Phenotypic variance

*Residual is after adjustment for age, gender, diabetes.

Source: Manolio et al., Nature, Vol 461, 8 October 2009, p. 748 (Reviews). doi:10.1038/nature08494. ©2009 Macmillan Publishers Limited. All rights reserved

Page 5: Simulating Genes in Genome-wide Association Studies

NHGRI GWA Catalog

www.genome.gov/GWAStudies

www.ebi.ac.uk/fgpt/gwas/

Published Genome-Wide Associations through 12/2012

Published GWA at p ≤ 5 × 10⁻⁸ for 17 trait categories

Page 6: Simulating Genes in Genome-wide Association Studies

Figure 2. Frequency distributions of a) the risk allele frequency of the most associated SNPs listed in the GWAS Catalog [1] for the diseases in Table 3. b) MAF of all SNPs simulated under the coalescence model, c) MAF of SNPs used in analyses to be representative of SNPs included in GWAS. d–f) Coupled allele of most associated SNP from simulations of 1, 9, or 36 causal variants in a 100 kb region. doi:10.1371/journal.pbio.1000579.g002

Source: Wray et al., PLoS Biology (www.plosbiology.org), January 2011, Volume 9, Issue 1, e1000579, p. 3. doi:10.1371/journal.pbio.1000579

Page 7: Simulating Genes in Genome-wide Association Studies

Unsurprisingly, since the GWAS method is primarily powered for common alleles, risk allele frequencies were well above 5% (median risk allele frequency 36%, interquartile range, IQR, 21%–53%) in the populations analyzed as well as in the HapMap populations (CEU: 37%, 21–54%; YRI: 33%, 13–65%; combined JPT+CHB 32%, 13–58%; Fig. S1).

The 531 reported SNP-trait associations represented 465 unique TASs; 43% (n = 199) of which were located in an intergenic region, 45% (n = 208) were intronic, 9% (n = 41) were nonsynonymous, 2% (n = 10) were in a 5′ or 3′ untranslated region, and 2% (n = 7) were synonymous, according to the University of California Santa Cruz Genome Browser (5). Discrete traits were the focus of 227 (43%) of the 531 SNP-trait associations, which had associated odds ratios (ORs) ranging from 1.04 to 29.4 (median 1.33, IQR 1.20–1.61; Fig. S2). Among the discrete traits, the range of ORs was similar between nonsynonymous and other TASs; however, the right tail of the OR distribution for nonsynonymous TASs was slightly skewed toward higher values. The highest ORs were reported for pigmentation traits [Fig. 1; MC1R and hair color (6) and OCA2 and eye color (6)]. SNP-trait associations were also distributed widely across diseases of high population prevalence, including heart disease, obesity, diabetes, and cancer (Table S2). Trait prevalence was not associated with the magnitude of ORs and risk allele frequencies, which were similar between the 10 most prevalent traits and all others combined (median ORs 1.26 and 1.29, respectively; median risk allele frequencies, 40% and 35%, respectively).

Among genes or regions harboring TASs that were reported in multiple studies of discrete traits, 18 were associated with seemingly distinct traits that may suggest clues toward common etiologic pathways (Table 1). Several TASs were located in previously characterized candidate genes, such as APOE, HLA, KCNJ11, PPARG, and CARD15, and were detected through GWAS at comparable effect sizes and stronger levels of statistical significance (Table S3). In these instances, GWAS-identified SNPs served as reasonable positive controls for known disease-associated genetic variants.

Functional Analysis. To assess the underlying functionality at the trait/disease-associated genetic loci, we systematically mapped all TASPs (reported index TASs with an association p value < 5.0 × 10⁻⁸ and all HapMap phase II CEU SNPs in LD [r² threshold 0.9]) to 20 nonmutually exclusive genomic annotation sets (Table S4). For each annotation set, we did the following. For every unique TAS block, we determined whether any TASPs mapped to the annotation set. If none mapped, we did not count the block. However, if one or more TASPs mapped, then we counted 1 per block. To compute the odds of a TAS block mapping to the annotation set, we divided the number of unique TAS blocks that were counted in the annotation set (n) by the number of TAS blocks that were not counted (N − n). To evaluate whether any annotation set was significantly enriched or depleted for TAS blocks, we compared the observed odds with the expected odds calculated from 100 control datasets comprised of randomly selected SNPs and their LD partners. Importantly, the mapping and counting strategies were consistent across both the test and the control datasets to ensure a fair comparison. Further, the generation of the control datasets took into account the representation biases on the genotyping arrays that were used to identify the TASs (SI Text).
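The block-counting odds described above, n / (N − n), and its comparison against control datasets can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the excerpt does not say how the 100 control datasets are combined into an expected odds, so averaging the per-dataset odds is an assumption made here, and the function names and all counts are hypothetical.

```python
from statistics import mean


def block_odds(n_mapped, n_total):
    """Odds that a block maps to an annotation set: n / (N - n).

    Assumes at least one block does not map (N > n).
    """
    return n_mapped / (n_total - n_mapped)


def enrichment_odds_ratio(n_mapped, n_total, control_mapped, control_total):
    """Observed odds divided by the expected odds from control datasets.

    ASSUMPTION: the expected odds is taken as the mean of the odds in
    each control dataset; the source text does not specify this step.
    """
    observed = block_odds(n_mapped, n_total)
    expected = mean(block_odds(c, control_total) for c in control_mapped)
    return observed / expected


# Made-up counts: 20 of 100 test blocks map to the annotation set,
# versus 10 of 110 blocks in each of 100 control datasets.
ratio = enrichment_odds_ratio(20, 100, [10] * 100, 110)
print(f"enrichment odds ratio: {ratio:.2f}")
```

A ratio above 1 would suggest enrichment and below 1 depletion; the confidence intervals and p values reported in the text require the spread across control datasets, which this sketch does not model.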

For 9 annotation sets (nonsynonymous sites, 1kb promoters, 5kb promoters, most conserved sequences (MCSs), 3′ UTRs, microRNA target sites, Introns, CpG islands and experimentally validated regulatory regions from ORegAnno), the 95% confidence interval (CI) of the OR excluded 1.0 and the enrichment p values were < 0.05 (Fig. S3), indicating that these categories may be significantly enriched for TAS blocks. Nonsynonymous sites had the strongest signal for enrichment (OR = 3.9 [2.2–7.0], p = 3.5 × 10⁻⁷). After restricting the analysis to only those nonsynonymous SNPs predicted by PolyPhen (7) to be potentially deleterious (which reduces the sample size by approximately 65%), TAS blocks were even more strongly enriched (OR = 5.2 [1.8–15.3], p = 0.001). Thirty nonsynonymous TASPs that are predicted to be potentially deleterious [by PolyPhen and an unpublished method, CDPred (P. Cherukuri and J. Mullikin, personal communication)] were identified as attractive candidates for functional follow-up (Table 2).

To examine the possibility that signals in other annotation sets might not represent bona fide TAS block enrichment, but rather a “hitchhiking” effect whereby TASPs closely linked with non-

[Figure: scatter plot of odds ratio (y axis, log scale) against reported risk allele frequency, % (x axis, 0 to 100). Labeled points: OCA2, eye color; MC1R, hair color; LOXL1, exfoliation glaucoma.]

Fig. 1. Published odds ratios for discrete traits by reported risk allele frequencies. Labeled SNP-trait associations are those with the highest ORs. Note that the y axis is on the log scale.

Source: Hindorff et al., PNAS, June 9, 2009, vol. 106, no. 23, p. 9363 (Genetics). www.pnas.org/cgi/doi/10.1073/pnas.0903103106

Page 8: Simulating Genes in Genome-wide Association Studies

effects, we believe that current approaches in quantitative genetics coupled with gathering adequate data will dissect additional genetic variance within populations. The goal of explaining heritable effects is not purely academic. Until more of the variation expected from family studies is explained by direct analysis of the genome, there remains the possibility that there is a fundamental misunderstanding in our knowledge and conceptual framework. Identification of specific genomic variants that underpin individual differences provides the foundation for prediction, risk profiling, and personalized medicine; for identifying pathways and new potential drug targets; for classifying disease subtypes; for improving and maintaining food sources; and for understanding the influence of selection and the maintenance of diversity in the natural world.

Complex trait variation

The genomic variation observed within a population is the result of the evolutionary forces of mutation, genetic drift, recombination, and natural selection in the evolutionary past [24], which is something that we do not know, particularly given the extent of pleiotropy across traits (Box 1). A range of genetic architectures, in terms of the exact number, effect size, and frequency of causal variants, may be consistent with current findings in humans [30]. Linkage studies and GWAS have identified many thousands of significant associations across more than 500 human phenotypes (Box 1), and it is clear that, for any given trait, genetic variance is likely contributed from a large number of loci across the entire allele frequency spectrum.

Some researchers suggest that ‘synthetic associations’, where associations at common single nucleotide polymorphisms (SNPs) reflect linkage disequilibrium (LD) with multiple rare variants, underlie many GWAS results and that drawing conclusions regarding genetic architecture from GWAS is not justified [31,32]. Although there are examples of ‘synthetic associations’ [33], they cannot explain all GWAS results [3,34,35]. Converging lines of evidence suggest a contribution from variants of >5%

Box 1. The distribution of genetic variants across allele frequency

The variance explained by a single causal variant depends upon its effect size and its frequency within the population. Under neutrality and random mating, the allele frequency distribution is approximately proportional to Equation I [23]:

$$\frac{1}{p(1 - p)} \qquad \text{[I]}$$

and the genetic variance contributed by a single variant is (Equation II):

$$2p(1 - p)a^{2} \qquad \text{[II]}$$

where p is the frequency of the causal variant and a is the effect size on an arbitrary scale. Under a neutral model, this implies that most variants are rare, but most of the genetic variance is due to common variants [24].
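The neutral-model claim in Box 1 (most variants are rare, but most of the genetic variance comes from common variants) can be checked numerically from Equations I and II. The sketch below is illustrative only. It assumes the same effect size a for every variant, evaluates frequencies on a discrete grid, and uses a hypothetical minor allele frequency cutoff of 0.05 for "rare"; the function names are not from the source.

```python
def neutral_density(p):
    """Unnormalised neutral allele-frequency density, 1 / [p(1 - p)] (Equation I)."""
    return 1.0 / (p * (1.0 - p))


def variant_variance(p, a=1.0):
    """Genetic variance contributed by one variant, 2p(1 - p)a^2 (Equation II)."""
    return 2.0 * p * (1.0 - p) * a * a


def rare_shares(maf_cutoff=0.05, steps=1000):
    """Share of variants, and of genetic variance, with MAF below the cutoff.

    ASSUMPTIONS: equal effect size for all variants, and a frequency grid
    of 1/steps ... (steps - 1)/steps standing in for the continuous density.
    """
    grid = [i / steps for i in range(1, steps)]
    density = [neutral_density(p) for p in grid]
    variance = [d * variant_variance(p) for d, p in zip(density, grid)]
    rare = [min(p, 1.0 - p) < maf_cutoff for p in grid]
    variant_share = sum(d for d, r in zip(density, rare) if r) / sum(density)
    variance_share = sum(v for v, r in zip(variance, rare) if r) / sum(variance)
    return variant_share, variance_share


variant_share, variance_share = rare_shares()
print(f"share of variants with MAF < 0.05: {variant_share:.2f}")
print(f"share of variance from MAF < 0.05: {variance_share:.2f}")
```

Because the product of Equations I and II is the constant 2a², each frequency interval contributes variance in proportion to its width, so the narrow rare-frequency range holds most of the variants but only a small share of the variance.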

The effect of directional selection is to increase the amount of variation explained by rare variants, because natural selection should minimize the frequency of deleterious variants in the population [24]. Therefore, for any phenotype, many causal variants will be rare, and the proportion of population-level genetic variance in complex phenotypes attributable to variants across the allele frequency spectrum will depend upon the strength of selection in our evolutionary past. The problem is that this is something that we do not know. Additionally, newly arising mutations can have pleiotropic effects on multiple phenotypes and the effect (size and/or direction) of a given mutation may not be the same for all traits. Moreover, each of the traits affected may be associated with fitness in different ways and, thus, held at frequencies that are intermediate between two phenotypes (e.g., balancing selection).

The distribution of GWAS findings to date, obtained from the Published GWAS Catalogue, across allele frequency is shown in Figure I for studies from 2008 on a selection of traits each of which is given a different color. For quantitative traits (Figure IA), the absolute effect is plotted against the minor allele frequency, and for complex common diseases (Figure IB), the odds ratio is plotted against the risk allele frequency. Each of the 38 quantitative traits and 43 disease traits are represented by different colors. There is an ascertainment bias in that the power of detection is proportional to pa², but it is clear that, for each complex trait, variance is contributed from the entire allele frequency spectrum. This highlights the scarcity of low-frequency variants identified by GWAS for quantitative traits and complex disease in humans. Detecting these variants will require a combination of greater sample size, better genotyping, and improved phenotyping.

[Figure I: two panels. (A) Absolute effect (SD units) against minor allele frequency (<0.001 to 0.5). (B) Odds ratio against risk allele frequency (<0.001 to 1).]

Figure I. For quantitative traits (A), the absolute effect is plotted against the minor allele frequency, whereas for complex common diseases (B), the odds ratio is plotted against the risk allele frequency. Each of the 38 quantitative traits and 43 disease traits are represented by different colors. Abbreviation: SD, standard deviation.

Source: Robinson et al., Trends in Genetics (Opinion), TIGS-1106, p. 2. http://dx.doi.org/10.1016/j.tig.2014.02.003

Page 9: Simulating Genes in Genome-wide Association Studies

provide important clues about the evolutionary history and underlying molecular mechanisms of certain TASPs.

Several limitations of the underlying catalog data should be noted. We extracted all eligible associations from published articles and SI Text, but the number and quality of reported SNP associations is dependent upon the preferences of the individual author and journal. Also, the studies within the catalog generally test only those SNPs that are detectable via commonly used genotyping platforms in participants who tend to be from European-descent populations. The GWAS data are likely to be subject to varying degrees of upward bias in effect size estimates (the “winner’s curse” phenomenon), particularly to the extent that estimates from the GWAS discovery population, who may be less representative of the general population, influence those reported in our catalog. Nonetheless, in several instances in which known candidate SNPs have been previously identified, GWAS of the same trait tended to confirm these findings with similar effect sizes and stronger levels of statistical significance. Finally, TASs reported in published GWAS suffer from “lead TAS bias”; generally 1 or 2 TASs out of a cluster are selected from the initial study, often based on likely functional significance such as a conserved nonsynonymous site, for association analysis in the replication sample. To minimize the effect of this bias, we analyzed TAS blocks, which include the lead SNPs and their known LD partners based on HapMap phase II data. However, the true impact of the bias is difficult to quantify and it may still exert a slight effect on the enrichment/depletion signals especially for categories such as nonsynonymous sites.

An important question is to what extent GWAS have identified genetic variants likely to be of clinical or public health importance, particularly for developing preventive or therapeutic interventions. Answering this question must await better functional characterization of TASs or the true causative variants they may be tagging, evidence of effective interventions, and identification of potential modifiers of SNP-trait associations (1). However, the current study contributes empiric bounds on the expectations for the effect sizes and allele frequencies of TASs that can be identified from GWAS. It also highlights the distribution of promising SNP-trait associations across a wide variety of traits of substantial public health interest, such as obesity, hypertension, coronary artery disease, and cancer. Our results may guide future studies by highlighting genetic variants that are of particular interest from a descriptive, association, evolutionary, or functional perspective (such as predictions of TASP-mediated allele-specific transcription factor binding sites) and suggesting hypotheses for future study. Our description of GWAS-identified variants builds upon the important work previously targeted toward candidate genes, adding to a more complete picture of the contribution of common genetic variation to common diseases. It is clear, however, that the proportion of heritability explained by common variation for most common diseases to date is modest at best (17). As the power of the GWAS approach increases with access to more samples, and as the types of methods to test for genetic associations expand to include copy number variants and rarer alleles, more associations will likely be identified and timely analyses similar to those presented here will continue to update our knowledge of the influence of genomic structure and function on complex diseases.

[Figure: odds ratio (y axis, log scale, 1 to 10) by annotation set (x axis): Non-synonymous sites, Promoters (1kb), Promoters (5kb), 5′ UTRs, 3′ UTRs, miRTS, Intronic regions, Intergenic regions, Intergenic TFBSs, CpG islands, PReMod sites, ORegAnno elements, EAR regions, MCSs, HARs, PSGs. Panel title: Enrichment/depletion analysis after adjusting for “hitchhiking” effects from non-synonymous sites.]

Fig. 2. Odds ratios for TAS block enrichment/depletion analysis after adjusting for “hitchhiking” effects from nonsynonymous sites. Four annotation sets (Splice sites, Validated enhancers, EvoFold elements, and noncoding RNAs) are not represented here because no TAS blocks mapped to these annotation sets. The blue circle represents the point estimate of the odds ratio (OR) and the red lines represent the 95% CI. Possible “hitchhiking” effects from nonsynonymous sites are reduced by discarding any TASP/control SNP in LD with a nonsynonymous SNP (r² threshold 0.6). For an explanation of the annotation sets on the x axis, we refer the reader to Table S4. Note that the y axis is on the log scale. Nonsynonymous OR computation is not adjusted for “hitchhiking” effects.

Source: Hindorff et al., PNAS, p. 9366. www.pnas.org/cgi/doi/10.1073/pnas.0903103106

Page 10: Simulating Genes in Genome-wide Association Studies

Observation | Interpretation

Missing H | Lots

Uniform frequencies of “hits” | Common associations exist

Rare hits have larger OR | Rare alleles may have larger effects

Larger OR in genes | Genes matter

Page 11: Simulating Genes in Genome-wide Association Studies

Observation Interpretation

Rare hits have larger OR

Rare alleles may have larger effects

Disease is harmful with respect to fitness

(in the evolutionary sense).

Larger OR in genes Genes matter

Page 12: Simulating Genes in Genome-wide Association Studies

[Figure 3: two panels. (a) Frequency of observations (y axis) against risk allele frequency of most significant common SNP (x axis), shown for simulated hits and GWAS hits. (b) Causal variant frequency (y axis, 0.005 to 0.020) against common SNP frequency (x axis, 0.1 to 0.5), with an odds ratio legend running from 2 to > 9.]

Figure 3 | Inconsistency between genome-wide association study results and rare variant expectations. a | The frequency distribution of risk allele frequencies (shown in light red) for 414 common variant associations with 17 diseases is only slightly skewed towards lower-frequency variants. By contrast, simulations (light green bars; in this case, assuming up to nine rare causal variants inducing the common variant association with SNPs at the same frequency as observed on common genotyping platforms) result in a marked left-skew with a peak for common variants whose frequency is less than 10%. (The skew is even stronger if only a single causal variant is responsible.) The observed data are thus not immediately consistent with the rare variant model. b | Part of the problem with synthetic associations is that they would explain too much heritability if they were pervasively responsible for common variant effects. This is due to the relationship between allele frequency, maximum possible linkage disequilibrium (LD) and the amount of variance explained [19]. The plot shows the expected odds ratio due to a rare variant of the indicated frequency (from 0.5% to 2%) if it increases the odds ratio at a common SNP (with which it is in maximum possible LD) by 1.1-fold. Intermediate effect sizes (2 < odds ratio < 5) require combined causal variant frequencies in excess of 1%. As the number of rare variants increases, the likelihood that they are in high LD with the common variant also drops, further reducing the probability that they can explain observed common variant association. Suppose that a disease has a prevalence of 1%. Then ten causal variants that are each at a frequency of 1% would result in 20% of people carrying a causal variant. If the penetrance is 5%, then 1% of people would have the disease, and these 10 variants would completely explain the genetic risk. Similarly, if 100 causal variants were each at 0.1% frequency, it would take ~10 such variants to induce each single common variant association with an observed odds ratio of 1.1. If large genome-wide association studies (GWASs) detect dozens of such common loci, and they were actually due to LD with rare variants, then the heritability would be explained several times over. Alternatively, if hundreds of very rare causal variants are not in LD with common variants, we do not expect to see significant GWAS associations. Data taken from REF. 19.
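The carrier arithmetic in the caption can be checked with a few lines of Python. This is a minimal sketch, assuming the caption's rare-allele approximation (carrier fraction ≈ 2 × number of variants × allele frequency, because each person has two chromosomes). The function names `carrier_fraction` and `prevalence` are illustrative and do not come from the source.

```python
def carrier_fraction(n_variants, allele_freq):
    """Approximate fraction of diploid people carrying at least one causal allele.

    Assumes alleles are rare enough that carrying two is negligible, so the
    fraction is about 2 * n_variants * allele_freq.
    """
    return 2 * n_variants * allele_freq


def prevalence(n_variants, allele_freq, penetrance):
    """Disease prevalence if only carriers are at risk, each with the given penetrance."""
    return carrier_fraction(n_variants, allele_freq) * penetrance


# Caption example: ten causal variants at 1% each, penetrance 5%.
carriers = carrier_fraction(10, 0.01)   # about 0.20 of people
disease = prevalence(10, 0.01, 0.05)    # about 0.01, i.e. the assumed 1% prevalence
print(f"carriers: {carriers:.2f}, prevalence: {disease:.3f}")
```

Under these assumptions the ten variants already account for the whole 1% prevalence, which is the caption's point about explaining heritability "several times over" when many such loci are detected.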

Decanalization: The notion that genetic systems evolved to be buffered but that large effect mutations or environmental change can overcome this buffering, thereby increasing the genetic variance.

Genomic selection: The use of genetic markers that are spread throughout the genome to select individuals with desired predicted breeding values.

Predicted breeding value: The estimated phenotype of progeny of individuals that have a particular genotype.

the consistency of common variant effects is that they are actually due to the common variants themselves or to unobserved common variants in high LD across all populations.

Arguments in favour of the infinitesimal model

The infinitesimal model underpins standard quantitative genetic theory. Just as evolutionary theory provides a strong argument in favour of rare variants, standard quantitative genetic theory provides ample support for the infinitesimal model [7,8]. Whatever the causes of the maintenance of genetic variance may be, the consistent observation is that all diseases have moderately high heritability, and so purifying selection has been unable to purge the population of disease-promoting variants [2]. At face value, the existence of dozens of susceptibility alleles for metabolic and immunological diseases with effect sizes that are just not detected for psychological diseases implies a difference in genetic architecture between the two categories of conditions. This may imply different intensities of purifying selection, although other models, including decanalization [69], are also compatible with the data. Because most of the genetic variance remains unexplained, it is a priori just as likely to exist in the form of rare or common alleles, and the fact is that there is nothing about GWAS findings that is inconsistent with the infinitesimal model of many variants of very small effect across the full allele frequency spectrum. This model has served applied quantitative geneticists as well as evolutionary biologists for close to a century and, in a sense, it can be regarded as the null model that needs to be disproved before it is abandoned.

Common variants collectively capture the majority of the genetic variance in GWASs. Direct empirical support for the infinitesimal model comes from genomic variance analyses [70,71]. Animal breeders have been using genomic selection methods with great success for the past decade [72], basing their selection of sires and dams on the overall predicted breeding value, which is determined from the full set of genomic markers that capture variation distributed throughout the genome. Similarly, in humans, by taking all nominally significant SNPs rather than just the significant ones from GWASs, it is possible to capture much more of the genetic variance than is explained by the highly significant loci [73,74] (BOX 3). A multivariate version of this approach, which is implemented by regression of phenotypic similarity on genetic relatedness, also implies that common variants capture most of the genetic variants [71]. Furthermore, partitioning of the genetic variance

REVIEWS

140 | FEBRUARY 2012 | VOLUME 13 www.nature.com/reviews/genetics

© 2012 Macmillan Publishers Limited. All rights reserved

doi:10.1038/nrg3118 (Gibson)

Page 13: Simulating Genes in Genome-wide Association Studies

[Figure 3 of Gibson (2012), repeated from the previous slide. a | Frequency of observations versus risk allele frequency of most significant common SNP, for simulated hits and GWAS hits. b | Causal variant frequency versus common SNP frequency, by odds ratio.]

The multiplicative model

$$G = \prod_i (1 + e_i)$$

Risch & colleagues, Pritchard, countless others

Page 14: Simulating Genes in Genome-wide Association Studies

The multiplicative model

$$G = \prod_i (1 + e_i)$$

[Figure: contour plot over Causative mutations on paternal allele (0 to 10) and Causative mutations on maternal allele (0 to 10); contour levels 0.2 to 1.4.]

Risch & colleagues, Pritchard, countless others
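The multiplicative model on this slide can be sketched in Python as follows. This is only an illustration of the formula $G = \prod_i (1 + e_i)$: the function name `multiplicative_genetic_value` and the example effect sizes are assumptions, not taken from the talk.

```python
import math


def multiplicative_genetic_value(effects):
    """Return G = prod_i (1 + e_i) for the effect sizes e_i of the causative mutations carried.

    With no causative mutations the product is empty and G = 1.
    """
    return math.prod(1.0 + e for e in effects)


# Hypothetical example: two mutations, each with effect size 0.5.
print(multiplicative_genetic_value([0.5, 0.5]))
```

Because the effects multiply, it does not matter under this model whether the mutations sit on the paternal or the maternal allele; only the set of effect sizes carried matters.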

Page 15: Simulating Genes in Genome-wide Association Studies

WWHD? (What would Haldane do?)

Genotype          AA      Aa        aa
Mating frequency  $p^2$   $2pq$     $q^2$
Fitness           $1$     $1 - sh$  $1 - 2s$

$$\hat{q} = \frac{u}{sh}$$

$$\hat{q} \approx \sqrt{\frac{u}{s}} \quad \text{as } h \to 0$$

Haldane, DOI: 10.1017/S0305004100015644
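The two equilibrium expressions on this slide are easy to evaluate numerically. The sketch below simply codes the formulas as written on the slide, $\hat{q} = u/(sh)$ and $\hat{q} \approx \sqrt{u/s}$ as $h \to 0$. The function names and the parameter values in the example are hypothetical.

```python
import math


def q_hat_partial_dominance(u, s, h):
    """Equilibrium frequency of the deleterious allele, q = u / (s * h), for h > 0."""
    return u / (s * h)


def q_hat_recessive(u, s):
    """Approximate equilibrium frequency as h -> 0, q = sqrt(u / s)."""
    return math.sqrt(u / s)


# Hypothetical parameters: mutation rate 1e-5, selection coefficient 0.01.
u, s = 1e-5, 0.01
print(q_hat_partial_dominance(u, s, h=0.5))  # partially dominant case
print(q_hat_recessive(u, s))                 # recessive limit
```

For the same u and s, the recessive limit gives a much higher equilibrium frequency, because selection rarely sees the allele in homozygous form.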

Page 16: Simulating Genes in Genome-wide Association Studies

Mutation at rate u (per gamete per generation)

“A” allele

[Schematic: three X marks]

“a” allele is heterogeneous in its molecular origin

Trans-heterozygotes are at risk. Phenotype has (weak) effect on individual fitness.

doi:10.1371/journal.pgen.1003258 (Thornton et al.)

Page 17: Simulating Genes in Genome-wide Association Studies

Effect sizes $\sim \mathrm{Exp}(\lambda)$

[Figure: density (0.0 to 7.5) versus Effect size (0.0 to 0.9).]

$h_i$ = effect of haplotype. Additive over causative mutations.

doi:10.1371/journal.pgen.1003258 (Thornton et al.)
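A haplotype effect of this kind could be simulated as below. This is a sketch under stated assumptions: λ is treated as the mean of the exponential distribution (later slides label the axis "Mean effect size (λ)"), and the function names are invented for illustration. Python's `random.expovariate` takes a rate, so the mean is inverted.

```python
import random


def draw_effect_sizes(n_mutations, mean_effect, rng):
    """Draw one exponentially distributed effect size per causative mutation.

    random.expovariate expects the rate parameter, which is 1 / mean.
    """
    return [rng.expovariate(1.0 / mean_effect) for _ in range(n_mutations)]


def haplotype_effect(effect_sizes):
    """Effect of a haplotype, h_i: additive over its causative mutations."""
    return sum(effect_sizes)


rng = random.Random(42)                      # fixed seed so the sketch is repeatable
effects = draw_effect_sizes(3, 0.1, rng)     # three mutations, mean effect 0.1
print(effects, haplotype_effect(effects))
```

A haplotype with no causative mutations has effect 0, which matters for the genotype-level model on the next slide.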

Page 18: Simulating Genes in Genome-wide Association Studies

$$G_{i,j} = \sqrt{h_i \times h_j}$$

(geometric mean)

[Figure: contour plot over Causative mutations on paternal allele (0 to 10) and Causative mutations on maternal allele (0 to 10); contour levels 0.05 to 0.4.]

$$P_{i,j} = G_{i,j} + N(0, \sigma)$$

$$w = e^{-\frac{(P_{i,j})^2}{2\sigma_S^2}}$$

doi:10.1371/journal.pgen.1003258 (Thornton et al.)
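The three formulas on this slide chain together as genetic value, phenotype, then fitness. The following sketch codes them directly. It assumes that σ in $N(0, \sigma)$ is a standard deviation, which the slide does not state, and all function names are illustrative.

```python
import math
import random


def genetic_value(h_i, h_j):
    """G_ij: geometric mean of the two haplotype effects."""
    return math.sqrt(h_i * h_j)


def phenotype(h_i, h_j, sigma, rng):
    """P_ij = G_ij plus Gaussian noise with mean 0 and standard deviation sigma (assumed)."""
    return genetic_value(h_i, h_j) + rng.gauss(0.0, sigma)


def fitness(p, sigma_s):
    """Gaussian fitness function, w = exp(-P^2 / (2 * sigma_S^2))."""
    return math.exp(-(p ** 2) / (2.0 * sigma_s ** 2))


rng = random.Random(7)
print(genetic_value(0.2, 0.8))   # both haplotypes carry mutations
print(genetic_value(0.0, 0.8))   # one mutation-free haplotype gives G = 0
print(fitness(phenotype(0.2, 0.8, sigma=0.1, rng=rng), sigma_s=1.0))
```

The second call shows why only trans-heterozygotes are at risk: if either haplotype has effect 0, the geometric mean is 0 regardless of the other haplotype.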

Page 19: Simulating Genes in Genome-wide Association Studies

Aside: simulation tools

• C++ library for rapid forward simulation

• Available from https://github.com/molpopgen/fwdpp

• Preprint on arXiv at http://arxiv.org/abs/1401.3786
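To give a sense of what "forward simulation" means, here is a deliberately tiny one-locus Wright-Fisher sketch in Python. It is not fwdpp's API and says nothing about how that library is implemented. It assumes the fitness scheme from the Haldane slide (1, 1 − sh, 1 − 2s), one-way mutation at rate u, and binomial sampling of 2N gametes each generation. All names are hypothetical.

```python
import random


def wright_fisher_forward(n_diploids, u, s, h, generations, rng):
    """Simulate the frequency of the 'a' allele forward in time; return the final frequency."""
    two_n = 2 * n_diploids
    count_a = 0                                   # start with no 'a' alleles
    for _ in range(generations):
        q = count_a / two_n
        p = 1.0 - q
        # Selection: genotype fitnesses 1 (AA), 1 - sh (Aa), 1 - 2s (aa).
        w_bar = p * p + 2 * p * q * (1 - s * h) + q * q * (1 - 2 * s)
        q_sel = (p * q * (1 - s * h) + q * q * (1 - 2 * s)) / w_bar
        # Mutation: A -> a at rate u per gamete per generation.
        q_mut = q_sel + (1.0 - q_sel) * u
        # Drift: draw 2N gametes for the next generation.
        count_a = sum(1 for _ in range(two_n) if rng.random() < q_mut)
    return count_a / two_n


print(wright_fisher_forward(100, 1e-3, 0.01, 0.5, 50, random.Random(0)))
```

Every generation touches every gamete, which is why run time and memory grow with population size in the benchmarks that follow.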

Page 20: Simulating Genes in Genome-wide Association Studies

[Figure, four panels: Mean run time (days) and Mean peak memory use (Mb) versus Population size (N diploids: 1000, 10000, 50000), for θ = ρ = 100 and θ = ρ = 500. Programs compared: sfs_code, SLiM, fwdpp (gamete-based), fwdpp (individual-based).]

http://arxiv.org/abs/1401.3786 (Thornton)

Page 21: Simulating Genes in Genome-wide Association Studies

[Figure, two rows of panels for 2Nsh = 1, 2Nsh = 10, and 2Nsh = 100: Mean run time (hours) and Mean peak memory use (megabytes) versus Proportion of new mutations that are deleterious (0.1, 0.5, 1). Simulations compared: fwdpp (gamete-based), fwdpp (individual-based), SLiM.]

http://arxiv.org/abs/1401.3786 (Thornton)

Page 22: Simulating Genes in Genome-wide Association Studies

Selection is weak

[Figure: Relative fitness (0.70 to 1.00) versus Mean effect size (λ) (0.0 to 0.5). Legend: Population mean fitness; Average fitness of a case; Average minimum fitness.]

doi:10.1371/journal.pgen.1003258 (Thornton et al.)

Page 23: Simulating Genes in Genome-wide Association Studies

Heritability plateaus

[Figure: Broad-sense heritability (0.00 to 0.06) versus Mean effect size (λ) (0.0 to 0.5).]

doi:10.1371/journal.pgen.1003258 (Thornton et al.)

Page 24: Simulating Genes in Genome-wide Association Studies

Rare alleles

[Figure: Proportion (0.0 to 0.4) versus Derived allele frequency (axis ticks 1, 5, 10); λ = 0.25.]

doi:10.1371/journal.pgen.1003258 (Thornton et al.)

Page 25: Simulating Genes in Genome-wide Association Studies

GWAS have poor power

[Figure: Power (0.0 to 0.8) versus Mean effect size (λ) (0.0 to 0.5). Legend: GWAS; GWAS, no recombination; resequencing; resequencing, no recombination.]

doi:10.1371/journal.pgen.1003258 (Thornton et al.)

Page 26: Simulating Genes in Genome-wide Association Studies

Compare model to data…

[Figure 3 of Gibson (2012), repeated from the earlier slide. a | Frequency of observations versus risk allele frequency of most significant common SNP, for simulated hits and GWAS hits. b | Causal variant frequency versus common SNP frequency, by odds ratio.]

doi:10.1038/nrg3118 (Gibson)

Figure 2. Frequency distributions of a) the risk allele frequency of the most associated SNPs listed in the GWAS Catalog [1] for the diseases in Table 3. b) MAF of all SNPs simulated under the coalescence model, c) MAF of SNPs used in analyses to be representative of SNPs included in GWAS. d–f) Coupled allele of most associated SNP from simulations of 1, 9, or 36 causal variants in a 100 kb region. doi:10.1371/journal.pbio.1000579.g002

PLoS Biology | www.plosbiology.org 3 January 2011 | Volume 9 | Issue 1 | e1000579

doi:10.1371/journal.pbio.1000579 (Wray et al.)

Page 27: Simulating Genes in Genome-wide Association Studies

…reveals a pretty good fit

[Figure 2 of Wray et al. (doi:10.1371/journal.pbio.1000579), caption as on the previous slide.]

[Figure: Mean number of markers (0 to 10) versus MAF of most significant marker (in cases) (0 to 0.5); n = 36.899; λ = 0.05.]

(Based on simulating imperfect SNP chips)

Page 28: Simulating Genes in Genome-wide Association Studies

“Burden” tests do badly…

[Figure, two panels: Power (0.0 to 1.0) versus Mean effect size (λ) (0.0 to 0.5). Legend of one panel: GWAS; GWAS, no recombination; Resequencing; Resequencing, no recombination. Legend of the other panel: 50, 100, 200, and 250 markers, each with and without recombination. Panel labels: Madsen and Browning (2009); Li and Leal (2008).]

doi:10.1371/journal.pgen.1003258 (Thornton et al.)
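For readers unfamiliar with burden tests, the idea is to collapse rare alleles in a region into one score per individual and compare cases with controls. The sketch below is a deliberately simplified burden statistic. It is not the weighted-sum test of Madsen and Browning (2009) or the test of Li and Leal (2008), and the function names, the MAF cutoff, and the toy data are all assumptions made for illustration.

```python
def burden_scores(genotypes, maf, maf_cutoff=0.05):
    """Count rare minor alleles per individual.

    genotypes: one list per individual of minor-allele counts (0, 1, or 2) per marker.
    maf: minor allele frequency of each marker, in the same order.
    """
    rare = [k for k, f in enumerate(maf) if f < maf_cutoff]
    return [sum(g[k] for k in rare) for g in genotypes]


def mean_burden_difference(case_genotypes, control_genotypes, maf, maf_cutoff=0.05):
    """Mean rare-allele count in cases minus the mean in controls."""
    cases = burden_scores(case_genotypes, maf, maf_cutoff)
    controls = burden_scores(control_genotypes, maf, maf_cutoff)
    return sum(cases) / len(cases) - sum(controls) / len(controls)


# Toy data: three markers, of which the first and third are rare.
maf = [0.01, 0.3, 0.02]
cases = [[1, 2, 0], [0, 1, 1]]
controls = [[0, 2, 0], [0, 0, 0]]
print(mean_burden_difference(cases, controls, maf))
```

A statistic of this kind has power only if cases carry more rare alleles in total than controls, which is the assumption the next slide questions.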

Page 29: Simulating Genes in Genome-wide Association Studies

…because the model is wrong.

[Figure: Mean number of causative mutations per diploid (0 to 8) versus Mean effect size (λ) (0.0 to 0.5). Legend: Controls; Cases; Controls (rares); Cases (rares).]

doi:10.1371/journal.pgen.1003258 (Thornton et al.)

Page 30: Simulating Genes in Genome-wide Association Studies

SKAT does ok

[Figure: Power (0.0 to 1.0) versus Mean effect size (λ) (0.0 to 0.5). Legend: Resequencing, default weights and optimal p-values; GWAS, default weights and optimal p-values; Resequencing, Madsen-Browning weights and optimal p-values; GWAS, Madsen-Browning weights and optimal p-values.]

doi:10.1371/journal.pgen.1003258 (Thornton et al.)

Page 31: Simulating Genes in Genome-wide Association Studies

Manhattan plots

[Figure, two panels: −log10(p) (0 to 15) versus Position (kbp) (0 to 100). Legend: Common; Common, causative; Rare; Rare, causative.]

Methods), and excluded 153 individuals on this basis. We next looked for evidence of population heterogeneity by studying allele frequency differences between the 12 broad geographical regions (defined in Supplementary Fig. 4). The results for these 11-d.f. tests and associated quantile-quantile plots are shown in Fig. 2. Widespread small differences in allele frequencies are evident as an increased slope of the line (Fig. 2b); in addition, a few loci show much larger differences (Fig. 2a and Supplementary Fig. 6).

Thirteen genomic regions showing strong geographical variation are listed in Table 1, and Supplementary Fig. 7 shows the way in which their allele frequencies vary geographically. The predominant pattern is variation along a NW/SE axis. The most likely cause for these marked geographical differences is natural selection, most plausibly in populations ancestral to those now in the UK. Variation due to selection has previously been implicated at LCT (lactase) and major histocompatibility complex (MHC) [7–9], and within-UK differentiation at 4p14 has been found independently [10], but others seem to be new findings. All but three of the regions contain known genes. Aside from evolutionary interest, genes showing evidence of natural selection are particularly interesting for the biology of traits such as infectious diseases; possible targets for selection include NADSYN1 (NAD synthetase 1) at 11q13, which could have a role in prevention of pellagra, as well as TLR1 (toll-like receptor 1) at 4p14, for which a role in the biology of tuberculosis and leprosy has been suggested [10].

There may be important population structure that is not well captured by current geographical region of residence. Present implementations of strongly model-based approaches such as STRUCTURE [11,12] are impracticable for data sets of this size, and we reverted to the classical method of principal components [13,14], using a subset of 197,175 SNPs chosen to reduce inter-locus linkage disequilibrium. Nevertheless, four of the first six principal components clearly picked up effects attributable to local linkage disequilibrium rather than genome-wide structure. The remaining two components show the same predominant geographical trend from NW to SE but, perhaps unsurprisingly, London is set somewhat apart (Supplementary Fig. 8).

The overall effect of population structure on our association results seems to be small, once recent migrants from outside Europe are excluded. Estimates of over-dispersion of the association trend test statistics (usually denoted λ; ref. 15) ranged from 1.03 and 1.05 for RA and T1D, respectively, to 1.08–1.11 for the remaining diseases. Some of this over-dispersion could be due to factors other than structure, and this possibility is supported by the fact that inclusion of the two ancestry informative principal components as covariates in the association tests reduced the over-dispersion estimates only slightly (Supplementary Table 6), as did stratification by geographical region. This impression is confirmed on noting that P values with and without correction for structure are similar (Supplementary Fig. 9). We conclude that, for most of the genome, population structure has at most a small confounding effect in our study, and as a consequence the analyses reported below do not correct for structure. In principle, apparent associations in the few genomic regions identified in Table 1 as showing strong geographical differentiation should be interpreted with caution, but none arose in our analyses.
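The over-dispersion estimate λ mentioned above is commonly computed as the median of the observed 1-d.f. test statistics divided by the median of a chi-squared distribution with 1 degree of freedom. The sketch below assumes that median-ratio definition; the excerpt does not say exactly how its estimates were obtained. The function name `inflation_factor` is illustrative.

```python
from statistics import NormalDist, median

# The median of a chi-squared variable with 1 d.f. is the square of the
# 75th percentile of a standard normal, roughly 0.455.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def inflation_factor(chi2_stats):
    """Over-dispersion estimate: median observed statistic / expected median under the null."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Hypothetical statistics whose median is 10% above the null expectation.
stats = [CHI2_1DF_MEDIAN * 1.1] * 7
print(round(inflation_factor(stats), 2))
```

Values close to 1, such as the 1.03 to 1.11 quoted in the excerpt, indicate that the bulk of the test statistics behave roughly as expected under the null.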

Disease association results

We assessed evidence for association in several ways (see Methods for details), drawing on both classical and bayesian statistical approaches. For polymorphic SNPs on the Affymetrix chip, we performed trend tests (1 degree of freedom [16]) and general genotype tests (2 degrees of freedom [16], referred to as genotypic) between each case collection and the pooled controls, and calculated analogous Bayes factors. There are examples from animal models where genetic effects act differently in males and females [17], and to assess this in our data we applied a

[Figure 2 panels. a: −log10(P) (0 to 15) by chromosome (1 to 22, X). b: Observed test statistic versus Expected chi-squared value.]

Figure 2 | Genome-wide picture of geographic variation. a, P values for the 11-d.f. test for difference in SNP allele frequencies between geographical regions, within the 9 collections. SNPs have been excluded using the project quality control filters described in Methods. Green dots indicate SNPs with a P value < 1 × 10^-5. b, Quantile-quantile plots of these test statistics. SNPs at which the test statistic exceeds 100 are represented by triangles at the top of the plot, and the shaded region is the 95% concentration band (see Methods). Also shown in blue is the quantile-quantile plot resulting from removal of all SNPs in the 13 most differentiated regions (Table 1).

Table 1 | Highly differentiated SNPs

Chromosome  Genes                                                       Region (Mb)    SNP         Position     P value
2q21        LCT                                                         135.16–136.82  rs1042712   136,379,576  5.54 × 10^-13
4p14        TLR1, TLR6, TLR10                                           38.51–38.74    rs7696175   38,643,552   1.51 × 10^-12
4q28        -                                                           137.97–138.01  rs1460133   137,999,953  4.43 × 10^-08
6p25        IRF4                                                        0.32–0.42      rs9378805   362,727      5.39 × 10^-13
6p21        HLA                                                         31.10–31.55    rs3873375   31,359,339   1.07 × 10^-11
9p24        DMRT1                                                       0.86–0.88      rs11790408  866,418      4.96 × 10^-07
11p15       NAV2                                                        19.55–19.70    rs12295525  19,661,808   7.44 × 10^-08
11q13       NADSYN1, DHCR7                                              70.78–70.93    rs12797951  70,820,914   3.01 × 10^-08
12p13       DYRK4, AKAP3, NDUFA9, RAD51AP1, GALNT8                      4.37–4.82      rs10774241  4,553,727    2.73 × 10^-08
14q12       HECTD1, AP4S1, STRN3                                        30.41–31.03    rs17449560  30,598,823   1.46 × 10^-07
19q13       GIPR, SNRPD2, QPCTL, SIX5, DMPK, DMWD, RSHL1, SYMPK, FOXA3  50.84–51.09    rs3760843   50,980,546   4.19 × 10^-07
20q12       -                                                           38.30–38.77    rs2143877   38,526,309   1.12 × 10^-09
Xp22        -                                                           2.06–2.08      rs6644913   2,061,160    1.23 × 10^-07

Properties of SNPs that show large allele frequency differences between samples of individuals from 12 regions across Great Britain. Regions showing differentiated SNPs are given with details of the SNP with the smallest P value in each region for differentiation on the 11-d.f. test of differences in SNP allele frequencies between geographical regions, within the 9 collections. Cluster plots for these SNPs have been examined visually. Signal plots appear in Supplementary Information. Positions are in NCBI build-35 coordinates.

NATURE | Vol 447 | 7 June 2007 | ARTICLES | 663

©2007 Nature Publishing Group

doi:10.1371/journal.pgen.1003258 (Thornton et al.)

doi:10.1038/nature05911 (Burton et al.)

Page 32: Simulating Genes in Genome-wide Association Studies

A new association test

Methods), and excluded 153 individuals on this basis. We nextlooked for evidence of population heterogeneity by studying allelefrequency differences between the 12 broad geographical regions(defined in Supplementary Fig. 4). The results for these 11-d.f. testsand associated quantile-quantile plots are shown in Fig. 2. Wide-spread small differences in allele frequencies are evident as anincreased slope of the line (Fig. 2b); in addition, a few loci show muchlarger differences (Fig. 2a and Supplementary Fig. 6).

Thirteen genomic regions showing strong geographical variationare listed in Table 1, and Supplementary Fig. 7 shows the way in whichtheir allele frequencies vary geographically. The predominant patternis variation along a NW/SE axis. The most likely cause for thesemarked geographical differences is natural selection, most plausiblyin populations ancestral to those now in the UK. Variation due toselection has previously been implicated at LCT (lactase) and majorhistocompatibility complex (MHC)7–9, and within-UK differentiationat 4p14 has been found independently10, but others seem to be newfindings. All but three of the regions contain known genes. Aside from

evolutionary interest, genes showing evidence of natural selection areparticularly interesting for the biology of traits such as infectious dis-eases; possible targets for selection include NADSYN1 (NAD synthe-tase 1) at 11q13, which could have a role in prevention of pellagra, aswell as TLR1 (toll-like receptor 1) at 4p14, for which a role in thebiology of tuberculosis and leprosy has been suggested10.

There may be important population structure that is not wellcaptured by current geographical region of residence. Presentimplementations of strongly model-based approaches such asSTRUCTURE11,12 are impracticable for data sets of this size, and wereverted to the classical method of principal components13,14, using asubset of 197,175 SNPs chosen to reduce inter-locus linkage disequi-librium. Nevertheless, four of the first six principal componentsclearly picked up effects attributable to local linkage disequilibriumrather than genome-wide structure. The remaining two componentsshow the same predominant geographical trend from NW to SE but,perhaps unsurprisingly, London is set somewhat apart (Supplemen-tary Fig. 8).

The overall effect of population structure on our associationresults seems to be small, once recent migrants from outsideEurope are excluded. Estimates of over-dispersion of the associationtrend test statistics (usually denoted l; ref. 15) ranged from 1.03 and1.05 for RA and T1D, respectively, to 1.08–1.11 for the remainingdiseases. Some of this over-dispersion could be due to factors otherthan structure, and this possibility is supported by the fact that inclu-sion of the two ancestry informative principal components as cov-ariates in the association tests reduced the over-dispersion estimatesonly slightly (Supplementary Table 6), as did stratification by geo-graphical region. This impression is confirmed on noting thatP values with and without correction for structure are similar(Supplementary Fig. 9). We conclude that, for most of the genome,population structure has at most a small confounding effect in ourstudy, and as a consequence the analyses reported below do notcorrect for structure. In principle, apparent associations in the fewgenomic regions identified in Table 1 as showing strong geographicaldifferentiation should be interpreted with caution, but none arose inour analyses.

Disease association results

We assessed evidence for association in several ways (see Methods for details), drawing on both classical and bayesian statistical approaches. For polymorphic SNPs on the Affymetrix chip, we performed trend tests (1 degree of freedom16) and general genotype tests (2 degrees of freedom16, referred to as genotypic) between each case collection and the pooled controls, and calculated analogous Bayes factors. There are examples from animal models where genetic effects act differently in males and females17, and to assess this in our data we applied a

[Figure 2 panels: a, −log10(P) by chromosome; b, observed test statistic against expected chi-squared value.]

Figure 2 | Genome-wide picture of geographic variation. a, P values for the 11-d.f. test for difference in SNP allele frequencies between geographical regions, within the 9 collections. SNPs have been excluded using the project quality control filters described in Methods. Green dots indicate SNPs with a P value < 1 × 10^−5. b, Quantile-quantile plots of these test statistics. SNPs at which the test statistic exceeds 100 are represented by triangles at the top of the plot, and the shaded region is the 95% concentration band (see Methods). Also shown in blue is the quantile-quantile plot resulting from removal of all SNPs in the 13 most differentiated regions (Table 1).

Table 1 | Highly differentiated SNPs

| Chromosome | Genes | Region (Mb) | SNP | Position | P value |
|---|---|---|---|---|---|
| 2q21 | LCT | 135.16–136.82 | rs1042712 | 136,379,576 | 5.54 × 10^−13 |
| 4p14 | TLR1, TLR6, TLR10 | 38.51–38.74 | rs7696175 | 38,643,552 | 1.51 × 10^−12 |
| 4q28 | | 137.97–138.01 | rs1460133 | 137,999,953 | 4.43 × 10^−8 |
| 6p25 | IRF4 | 0.32–0.42 | rs9378805 | 362,727 | 5.39 × 10^−13 |
| 6p21 | HLA | 31.10–31.55 | rs3873375 | 31,359,339 | 1.07 × 10^−11 |
| 9p24 | DMRT1 | 0.86–0.88 | rs11790408 | 866,418 | 4.96 × 10^−7 |
| 11p15 | NAV2 | 19.55–19.70 | rs12295525 | 19,661,808 | 7.44 × 10^−8 |
| 11q13 | NADSYN1, DHCR7 | 70.78–70.93 | rs12797951 | 70,820,914 | 3.01 × 10^−8 |
| 12p13 | DYRK4, AKAP3, NDUFA9, RAD51AP1, GALNT8 | 4.37–4.82 | rs10774241 | 4,553,727 | 2.73 × 10^−8 |
| 14q12 | HECTD1, AP4S1, STRN3 | 30.41–31.03 | rs17449560 | 30,598,823 | 1.46 × 10^−7 |
| 19q13 | GIPR, SNRPD2, QPCTL, SIX5, DMPK, DMWD, RSHL1, SYMPK, FOXA3 | 50.84–51.09 | rs3760843 | 50,980,546 | 4.19 × 10^−7 |
| 20q12 | | 38.30–38.77 | rs2143877 | 38,526,309 | 1.12 × 10^−9 |
| Xp22 | | 2.06–2.08 | rs6644913 | 2,061,160 | 1.23 × 10^−7 |

Properties of SNPs that show large allele frequency differences between samples of individuals from 12 regions across Great Britain. Regions showing differentiated SNPs are given with details of the SNP with the smallest P value in each region for differentiation on the 11-d.f. test of differences in SNP allele frequencies between geographical regions, within the 9 collections. Cluster plots for these SNPs have been examined visually. Signal plots appear in Supplementary Information. Positions are in NCBI build-35 coordinates.

NATURE | Vol 447 | 7 June 2007 | ARTICLES | 663

©2007 Nature Publishing Group

$$ESM_K = \sum_{i=1}^{K} \left( -\log_{10}(p_i) + \log_{10}\frac{i}{K} \right)$$

Thornton et al., doi:10.1371/journal.pgen.1003258

Page 33: Simulating Genes in Genome-wide Association Studies

ESM is a more powerful test

[Plot: power against mean effect size (λ); series: GWAS; GWAS, no recombination; resequencing; resequencing, no recombination.]

(Caveat: requires permutation to get p-values)

Thornton et al., doi:10.1371/journal.pgen.1003258
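A minimal sketch of the ESM statistic and the permutation step named in the caveat, assuming the formula as it appears on the slide (sum over the K smallest p-values of −log10(p_i) + log10(i/K)). The function names are hypothetical, and the per-marker test is a simple 1-d.f. trend test chosen for illustration, not necessarily the test used by Thornton et al. The denominator is left as a parameter so it can be changed if the published definition differs from the slide.

```python
import math
import random


def esm_statistic(pvalues, k, denom=None):
    """ESM_K over the k smallest p-values.

    denom is the denominator inside log10(i / denom); it defaults to k,
    following the slide (an assumption, check the paper before relying on it).
    """
    if denom is None:
        denom = k
    ranked = sorted(pvalues)[:k]
    # Guard against log10(0) when a p-value underflows to zero.
    return sum(-math.log10(max(p, 1e-300)) + math.log10(i / denom)
               for i, p in enumerate(ranked, start=1))


def trend_pvalue(genotypes, status):
    """1-d.f. trend test: N * r^2 referred to chi-squared with 1 d.f."""
    n = len(genotypes)
    mg = sum(genotypes) / n
    ms = sum(status) / n
    sgg = sum((g - mg) ** 2 for g in genotypes)
    sss = sum((s - ms) ** 2 for s in status)
    if sgg == 0 or sss == 0:
        return 1.0  # monomorphic marker or a single phenotype class
    sgs = sum((g - mg) * (s - ms) for g, s in zip(genotypes, status))
    chi2 = n * sgs * sgs / (sgg * sss)
    # Survival function of chi-squared(1) is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def esm_permutation_pvalue(markers, status, k, n_perm=1000, seed=0):
    """Permute case/control labels to obtain a p-value for ESM_K.

    markers: list of per-marker genotype lists (0/1/2 copies per individual).
    status:  list of 0/1 case-control labels, one per individual.
    """
    rng = random.Random(seed)

    def stat(labels):
        return esm_statistic([trend_pvalue(m, labels) for m in markers], k)

    observed = stat(status)
    labels = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if stat(labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)
```

Permuting the labels rather than the p-values keeps the correlation among linked markers intact, which is why the test needs the genotype data and not only summary statistics.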

Page 34: Simulating Genes in Genome-wide Association Studies

Running ESM on real data

• We think we can implement ESM using a mix of the PLINK toolkit plus some custom programs.

• We need data to test it out on.

• There are very few modern GWAS available for reanalysis.

• Lack of data sharing hurts the field.

Page 35: Simulating Genes in Genome-wide Association Studies

Rare alleles and missing heritability

• Current tests are underpowered

• Heterogeneity means that GWAS “hits” tag few causative mutations

• Causative mutations that are tagged tend to be (relatively) common. These “common” mutations have effect sizes much smaller than the typical causative mutation that segregates

Page 36: Simulating Genes in Genome-wide Association Studies

[Plot, panels labeled 0.010, 0.025, 0.050, 0.075, 0.100, 0.125, 0.175, 0.250, 0.350, 0.500: mean number of causative singletons per individual against number of copies of derived allele at focal SNP (0, 1, 2); focal SNP: most significant marker or unassociated SNP.]

Thornton et al., doi:10.1371/journal.pgen.1003258

Page 37: Simulating Genes in Genome-wide Association Studies

Population growth

[Plot: population size against time, from past to present.]

Page 38: Simulating Genes in Genome-wide Association Studies

H^2 insensitive to growth

[Plot: mean broad-sense heritability against average effect size of new mutation; models: constant, growth.]

Unpublished

Page 39: Simulating Genes in Genome-wide Association Studies

Consistent with recent findings from other groups

©2014 Nature America, Inc. All rights reserved.

2 | ADVANCE ONLINE PUBLICATION | NATURE GENETICS

ANALYSIS

But despite these substantial shifts in the overall frequency spectrum, the impact on genetic load—namely, the mean number of deleterious variants per individual and thus the average fitness—is much more subtle.

In the semidominant case, the individual burden is essentially unaffected by these demographic events (Fig. 1c,d). With growth, the increased number of segregating sites is balanced exactly by a decrease in the mean frequency (with the converse being true for the bottleneck model) so that the number of variants per individual stays constant. This kind of balance is predicted by classic mutation-selection balance models18 and can be shown to hold for general changes in population size, provided that selection is strong and deleterious alleles are at least partially dominant (Supplementary Note).
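The balance described above can be illustrated with the textbook deterministic approximation for mutation-selection balance. This is a sketch under assumed notation (genotype fitnesses 1, 1 − hs and 1 − s, per-site mutation rate u, L sites), not a derivation taken from the paper's Supplementary Note.

```latex
\[
\hat{q} \approx \frac{u}{hs} \quad (h > 0),
\qquad
E[\text{deleterious alleles per individual}] \approx 2L\hat{q} = \frac{2Lu}{hs}.
\]
% Illustrative values (assumed): h = 1/2, s = 0.01, u = 10^{-8}, L = 10^{6}
\[
\hat{q} \approx \frac{10^{-8}}{0.5 \times 0.01} = 2 \times 10^{-6},
\qquad
2L\hat{q} \approx 2 \times 10^{6} \times 2 \times 10^{-6} = 4.
\]
```

Population size does not appear in either expression, so in this approximation growth can change how many sites segregate and at what frequencies without changing the expected number of deleterious alleles carried by an individual.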

The behavior of the recessive model is more complicated (Fig. 1e,f). In the bottleneck model, the mean number of deleterious variants per individual drops by 60% as a result of the bottleneck. This drop is due to the loss of rare alleles. However, during the bottleneck, some deleterious alleles drift to higher frequencies11,19, contributing disproportionately to the number of homozygotes. This causes a transient increase in the number of deleterious homozygous sites per individual, i.e., the recessive load. Meanwhile, population growth has a less pronounced effect on recessive variation, leaving the mean number of deleterious alleles per individual unchanged but causing a slight decrease in load.

More generally, the manner in which demography affects individual load varies with the degree of dominance and the strength of selection (Fig. 2, Supplementary Note and Supplementary Table 1). The behavior of these models can be classified into three selection regimes: strong, weak and effectively neutral. In the case of strong selection, i.e., where selection is much stronger than drift (approximately s > 10^−3 for semidominant mutations), deleterious variants are extremely unlikely to fix, and virtually all of the genetic load is due to segregating variation. In this range, we infer that human demography has had no impact on semidominant load (and, more generally, for mutations with at least some dominance component) and has had only small effects on recessive load.

The case of weak selection, where drift and selection have comparable effects, is more complex, as fixed alleles may contribute appreciably to load, and the steady-state load depends on population size20. However, the approach to the steady state is very slow, being limited by both the time to fixation (on the order of 4N generations) and the mutational input (on the order of 1/2NU generations, where U is the mutation rate). For both the semidominant and recessive cases, population growth is too recent to have substantially decreased the load. Recent growth increases the input of new deleterious mutations, but this effect is counterbalanced by the fact that the new deleterious mutations are proportionally rarer, as well as by the input of beneficial mutations. The bottleneck in Europeans is estimated to have occurred further in the past and at much lower population sizes5 (Supplementary Fig. 1), thus increasing its effect. In this case, the increase in drift causes segregating deleterious alleles to increase in frequency, sometimes reaching fixation, and results in a slight increase in load (Supplementary Fig. 2). The out-of-Africa bottleneck should thus lead to a slight increase of load in Europeans, most notably for recessive sites.

In the effectively neutral range, where selection has negligible effects on the population dynamics, segregating variation contributes negligibly, and hence the load does not change with demography. Thus, across all three selection regimes, recent human demographic history is likely to have had virtually no impact on genetic load at partially dominant sites and only weak effects at recessive sites.

Analysis of exome data

To test these predictions, we analyzed two recent data sets of exome sequences from individuals of west African and European descent. Previous work comparing load in different populations has produced conflicting conclusions depending on the data set, choice of measures and functional annotations used. For example, Lohmueller et al.11 reported that there is “proportionally more deleterious variation in European than in African populations.” Similarly, Tennessen et al.5 found that European Americans had more nonreference genotypes when they used a conservative classification of deleterious sites but

[Figure 1 panels a–f. a, b: population size against time (generations) for the bottleneck and growth models. c–f (semidominant and recessive rows): number per MB against time since beginning of bottleneck or growth (generations), for number of segregating sites, number of rare segregating sites, number of deleterious alleles per individual, number of rare deleterious alleles per individual, and load (number of deleterious alleles per individual; number of homozygous sites per individual).]

Figure 1 Time course of load and other key aspects of variation through a bottleneck and exponential growth. (a,b) The bottleneck (a) and exponential growth (b). (c–f) The expected number of variants and alleles per MB assuming semidominant mutations (c,d) or recessive mutations (e,f) with s = 1% and a mutation rate per site per generation of 10−8.

Simons et al., doi:10.1038/ng.2896

Page 40: Simulating Genes in Genome-wide Association Studies

Power is affected

[Plot: frequency in population against effect size of segregating causative mutation; models: constant, growth.]

[Plot: power against mean effect size of causative mutation; statistics: ESM50, Logit, SKAT; models: constant, growth.]

Unpublished

Page 41: Simulating Genes in Genome-wide Association Studies

Excellent fit to empirical data

[Histogram: number of markers against frequency of most-associated marker.]

Unpublished

Page 42: Simulating Genes in Genome-wide Association Studies

Implications

• Power to detect regions with modest effects on risk (4-5% contribution to broad-sense heritability) is very low in growing populations

• The explanatory power of simple models is probably far from exhausted

Page 43: Simulating Genes in Genome-wide Association Studies

Implications

• Much more likely to detect loci with mutations of modest effect

• Underlying distribution of mean effect size across loci is completely unknown in any system

[Plot: power against mean effect size of causative mutation; statistics: ESM50, Logit, SKAT; models: constant, growth.]

Unpublished

Page 44: Simulating Genes in Genome-wide Association Studies

Future work

• Multilocus models with epistasis

• Machine learning approaches: do they work?

• Develop new simulation tools

• Make simulation output available

• Implement ESM test for analyzing real GWAS data

Page 45: Simulating Genes in Genome-wide Association Studies

Other work in the lab

• Copy number variation in Drosophila: doi: 10.1093/molbev/msu124

• Detecting TE insertions using paired-end data in Drosophila: doi: 10.1093/molbev/mst129

• Modeling experimental evolution: doi: 10.1093/molbev/msu048

• Structural variation and variation in gene expression