evolutionary forces shaping human genetic variation #1000gb
DESCRIPTION
Talk I gave at the 1000 Genomes Project and Beyond meeting in Cambridge, UK.
TRANSCRIPT
EVOLUTIONARY FORCES SHAPING HUMAN GENETIC VARIATION
Ryan D. Hernandez
twitter: @rdhernand
http://www.finca.org
[Figure: Human colonization of the world. Map of generalized migration routes (numbered 1 to 6) with migration dates and fossil or artifact sites. Labels: Africa, Asia, Australia, Europe, North America, South America, Nile River, Red Sea, Andaman Islands, Equator.]
Migration dates shown: 200,000 years ago; 70,000-50,000 years ago; 50,000 years ago; 40,000 years ago; 40,000-30,000 years ago; 20,000-15,000 years ago; 15,000-12,000 years ago.
Fossil or artifact sites:
Meadowcroft: 19,000-12,000 years ago
Kennewick: 9,500 years ago
Spirit Cave: 9,500-9,400 years ago
Monte Verde: 14,800 years ago
Niah Cave: 40,000 years ago
Qafzeh: 100,000 years ago
Lake Mungo: 45,000 years ago
Malakunanja: 50,000 years ago
Omo Kibish (oldest modern human): 195,000 years ago
Pestera cu Oase: 35,000 years ago
Yana River: 30,000 years ago
Zhoukoudian (Shandingdong): 11,000 years ago
Minatogawa: 18,000 years ago
Clovis: 13,500 years ago
Klasies River Mouth: 120,000 years ago
SOURCES: SUSAN ANTON, NEW YORK UNIVERSITY; ALISON BROOKS, GEORGE WASHINGTON UNIVERSITY; PETER FORSTER, UNIVERSITY OF CAMBRIDGE; JAMES F. O'CONNELL, UNIVERSITY OF UTAH; STEPHEN OPPENHEIMER, OXFORD UNIVERSITY; SPENCER WELLS, NATIONAL GEOGRAPHIC SOCIETY; OFER BAR-YOSEF, HARVARD UNIVERSITY
NGM MAPS. © 2006 National Geographic Society. All rights reserved. http://ngm.nationalgeographic.com
The Effect of “Positive Selection”
[Figure: putatively neutral diversity levels around a “selective sweep”. Mutation classes shown: Adaptive, Neutral, Nearly Neutral, Mildly Deleterious, Fairly Deleterious, Strongly Deleterious.]
• Repeated fixation of functional mutations in coding regions over evolutionary timescales can lead to a disproportional number of amino acid substitutions relative to observed polymorphisms.
• This can be summarized by a 2x2 table and analyzed using the McDonald-Kreitman test:
              Non-Syn   Syn
Fixed         F_N       F_S
Polymorphic   P_N       P_S
1000 Genomes Project Data
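As a rough illustration of the McDonald-Kreitman test on the four counts in the 2x2 table (fixed and polymorphic, non-synonymous and synonymous), here is a minimal Python sketch. The function names and the example counts are assumptions made for illustration and are not taken from the talk. It reports the neutrality index, a simple alpha = 1 - NI estimate, and a two-sided Fisher's exact p-value computed with the standard library only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        # when the row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Add up every table that is no more probable than the observed one.
    p_value = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                  if p <= p_obs * (1 + 1e-9))
    return min(1.0, p_value)


def mk_test(fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn):
    """Return (neutrality index, alpha, p-value) for one MK table."""
    ni = (poly_nonsyn / poly_syn) / (fixed_nonsyn / fixed_syn)
    alpha = 1.0 - ni  # rough estimate of the adaptive fraction of substitutions
    p = fisher_exact_two_sided(fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn)
    return ni, alpha, p


# Hypothetical counts for a single gene (not taken from the talk).
ni, alpha, p = mk_test(fixed_nonsyn=7, fixed_syn=17, poly_nonsyn=2, poly_syn=42)
print(f"NI = {ni:.3f}, alpha = {alpha:.3f}, Fisher p = {p:.4f}")
```

A neutrality index below 1 indicates an excess of non-synonymous fixed differences relative to polymorphism. The sketch does not guard against zero counts, which are common for single genes and are one reason per-gene tests are noisy.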
SnIPRE: an improvement to MKT
Since few SNPs and substitutions are usually observed per gene, MKT can be noisy. Pooling observations across the genome using a mixed effects model vastly increases power.
Eilertson et al, 2012
SnIPREASR in 1000 Genomes Project
Human-chimp divergence
Pos Sel   Conserved
410       8027
• Conserved genes are either neutral or under purifying selection.
SnIPREASR: an improvement to SnIPRE
• Alignments are generated using MOSAIC, a program we developed that rigorously integrates putative orthologs from an arbitrary number of sources.
• Using PAML, we perform AIC-based model selection to infer the substitutions along the human lineage since our divergence from chimp.
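The AIC-based model selection step can be sketched as follows. This is only an illustration of the criterion itself (AIC = 2k - 2 ln L, lower is better); the model names, log-likelihoods and parameter counts are hypothetical placeholders and do not come from PAML output or from the talk.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def select_by_aic(fits):
    """fits maps a model name to (log-likelihood, number of free parameters)."""
    scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fits of three candidate substitution models to one alignment.
fits = {
    "model_A": (-1520.4, 5),
    "model_B": (-1512.1, 9),
    "model_C": (-1511.8, 12),
}
best, scores = select_by_aic(fits)
print(best, scores)
```

In this made-up example the most parameter-rich model has the highest likelihood but is not selected, because the AIC penalty outweighs its small gain in fit.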
pythonhosted.org/bio-MOSAIC/
Maher & Hernandez (arXiv)
[Figure: species labels Human, Chimp, Orang, Gorilla, …]
Cyrus Maher
SnIPREASR works well for positive selection
• Simulations: Human-specific substitutions; Gutenkunst et al. demographic model.
• 𝛾 is the population scaled selection coefficient.
• SnIPREASR is best-powered to estimate values of 𝛾>0.
ASR removes genes positively selected in chimp
                     Human-chimp divergence
Human only (ASR)     Pos Sel   Conserved   Total
Pos Sel              343       0           343
Conserved            67        8027        8094
Total                410       8027
• Conserved genes are either neutral or under purifying selection.
• 67/410 (16%) of genes identified as positively selected when comparing human-chimp are conserved along the human lineage.
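The counts on this slide can be checked with a few lines of arithmetic. The sketch below copies the four cell counts from the slide; the dictionary layout and variable names are assumptions made for illustration.

```python
# Rows: call along the human lineage only (ASR).
# Columns: call from human-chimp divergence.
table = {
    ("pos_sel", "pos_sel"): 343,
    ("pos_sel", "conserved"): 0,
    ("conserved", "pos_sel"): 67,
    ("conserved", "conserved"): 8027,
}

# Genes called positively selected from human-chimp divergence (column total).
col_pos_sel = sum(v for (row, col), v in table.items() if col == "pos_sel")
# Genes called conserved along the human lineage (row total).
row_conserved = sum(v for (row, col), v in table.items() if row == "conserved")

# Fraction of human-chimp positives that are conserved on the human lineage.
fraction_reclassified = table[("conserved", "pos_sel")] / col_pos_sel
print(col_pos_sel, row_conserved, f"{fraction_reclassified:.1%}")
```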
Positively selected genes dominated by smell & response to pathogens
GOrilla
[Figure: neutral diversity levels around an amino acid substitution, shown for n substitutions. Annotations: “Reflects the fraction of amino acid substitutions that are adaptive”; “Reflects the typical strength of selection”.]
The footprint of adaptive amino acid substitutions
• Goal: compare the pattern around amino acid substitutions to the pattern around synonymous substitutions.
Hernandez et al. Science (2011)
Observed Patterns of Diversity Around Human Substitutions
Hernandez et al. Science (2011)
The Effect of Negative Selection
Genetic diversity reduced: π = f0 π0 (decrease in effective population size [Ne])
[Figure: putatively neutral diversity levels. Mutation classes shown: Adaptive, Neutral, Nearly Neutral, Mildly Deleterious, Fairly Deleterious, Strongly Deleterious.]
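To make the relation between reduced diversity and reduced effective population size concrete, here is a small sketch. It assumes the standard neutral expectation π = 4·Ne·μ for a diploid population, which the slide does not state, and the values of Ne, μ and f0 are hypothetical.

```python
def expected_diversity(ne, mu):
    """Expected neutral diversity per site, pi = 4 * Ne * mu (diploid population)."""
    return 4 * ne * mu


# Hypothetical values chosen only for illustration.
ne = 10_000   # effective population size without linked selection
mu = 1.2e-8   # mutation rate per site per generation
f0 = 0.8      # fraction of diversity remaining under negative selection

pi0 = expected_diversity(ne, mu)
# Scaling Ne by f0 scales diversity by the same factor, so pi = f0 * pi0.
pi_reduced = expected_diversity(f0 * ne, mu)
print(pi0, pi_reduced, pi_reduced / pi0)
```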
Background Selection in Humans
[Figure legend: Observed, Predicted]
Hernandez, et al. Science (2011).
BGS correlates with Fst at neutral sites
4 - Demography and BGS Across 1000 Genomes
4 - Population Differentiation as a Function of BGS
The decrease in Ne locally across the genome as a result of BGS (inferred [2] by the value B, in which lower values indicate stronger BGS) may impact the rate of genetic drift at specific loci. To investigate this effect, we measured FST between TGP populations as a function of BGS strength. Our results suggest that the strength of BGS is a predictor of population differentiation, with an increase in genetic drift driving this effect.
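Measuring FST as a function of B could look roughly like the sketch below. The poster does not say which FST estimator was used, so Hudson's estimator with a ratio of sums across SNPs is swapped in here as an assumption. B is taken on the 0 to 1000 scale used in the poster's bins, and the SNP data are hypothetical.

```python
from collections import defaultdict


def hudson_terms(p1, n1, p2, n2):
    """Numerator and denominator of Hudson's FST estimator at one SNP.

    p1, p2: sample allele frequencies; n1, n2: sampled chromosomes.
    """
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den


def fst_by_b_bin(sites, bin_width=25, b_max=1000):
    """Ratio-of-sums FST per B bin; sites holds (B, p1, n1, p2, n2) tuples."""
    last_bin = b_max // bin_width - 1
    sums = defaultdict(lambda: [0.0, 0.0])
    for b, p1, n1, p2, n2 in sites:
        num, den = hudson_terms(p1, n1, p2, n2)
        key = min(int(b) // bin_width, last_bin)  # B = 1000 joins the top bin
        sums[key][0] += num
        sums[key][1] += den
    return {k * bin_width: n / d for k, (n, d) in sorted(sums.items()) if d > 0}


# Hypothetical SNPs: (B value, freq in pop 1, n1, freq in pop 2, n2).
sites = [
    (10, 0.9, 100, 0.1, 100),
    (12, 0.8, 100, 0.2, 100),
    (990, 0.5, 100, 0.5, 100),
    (1000, 0.55, 100, 0.45, 100),
]
fst = fst_by_b_bin(sites)
print(fst)  # keys are the lower edges of the B bins
```

Summing numerators and denominators within a bin before dividing avoids giving rare SNPs undue weight, which is why the sketch does not average per-SNP ratios.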
Background Selection Influences Population Specific Patterns of Diversity and the Site-Frequency Spectrum in Humans
Raul Torres and Ryan D. Hernandez
University of California, San Francisco
1 - Introduction
With the observations of thousands of genomes from global populations, worldwide patterns of human genetic diversity can be more fully described and the genome-wide impacts of natural selection on these patterns better understood. In many studies, genomic scans have revealed linked selection, which manifests itself as a local trough in diversity, to be one important driver of such patterns. However, it is less well understood how linked selection interacts with other drivers of population diversity, such as demography. Utilizing thousands of whole genomes from phase 3 of the 1000 Genomes Project (TGP) [1], we investigate the impact of one mode of linked selection, background selection (BGS), on population-specific patterns of genetic diversity as a function of demographic history. We also observe that BGS influences the site-frequency spectrum (SFS), increasing the proportion of rare variants. This observation is especially important because many inferences of demography use the SFS and can be biased if they are performed on regions of the genome under strong BGS.
2 - Background Selection
Through purifying selection, neutral variants linked to a negatively selected site will be eliminated from the population in a process called ‘background selection’ (BGS). Due to recombination and variation in the magnitude of selection on deleterious variants, the strength of BGS will vary across the genome, resulting in differing amounts of reduction in genetic diversity. Below, we show the variation in the strength of BGS across chromosome 22, as estimated by McVicker et al. 2009 [2].
6 - Conclusions
The complex demographic history of humans has influenced how background selection drives the patterns of diversity that exist within specific populations. Using data from the 1000 Genomes Project, we are able to recapitulate patterns that are evident in both forward simulations and population genetic theory. We also observe that linked selection, through the mechanism of BGS, can have a large impact on distorting the SFS. We caution that even strictly neutral sites can suffer from such distortions merely by being on the same background as another site under selection, and that care should be taken to account for such distortions when performing inference using the SFS.
References:
1. Abecasis G, et al. 2012. Nature. 491: 56-65.
2. McVicker G, et al. 2009. PLOS Genet. 5(5): e1000471.
3. Boyko AR, et al. 2008. PLOS Genet. 4(5): e1000083.
4. Gutenkunst RN, et al. 2009. PLOS Genet. 5(10): e1000695.
5. Hernandez RD. 2008. Bioinformatics. 24(23): 2786-2787.
Acknowledgements: R.T. is supported by an NIGMS Initiative for Maximizing Student Development grant (R25-GM56847), an NSF Graduate Research Fellowship and a UCSF Earle C. Anthony Travel Award.
5 - Forward Simulations of Demography and BGS
[Figure: Demographic Model. Population sizes for AFR, EUR and ASN relative to the ancestral size Na; labels shown: Na, 1.8 x Na, 0.28 x Na, 3.4 x Na, 5.8 x Na, 0.15 x Na, 0.08 x Na.]
[Figure: Selection Model. Deleterious loci of length 100 bp under a uniform recombination rate; distribution of selection coefficients γ ~ Γ(α, β). Legend: neutral variant, deleterious variant, BGS.]
[Figure: bar labels CONST, AFR, CEU (EUR), ASN; y-axis: π/π0, 0.60 to 0.90.]
[Figure: Strength of Background Selection Across Chromosome 22. x-axis: position, 10,000,000 to 40,000,000; y-axis: 0 to 0.8.]
Using a distribution of fitness effects and a demographic model inferred from previous studies [3,4], we ran forward simulations using SFS_CODE [5] to estimate how human demography affects the reduction in genetic diversity caused by BGS. The effects of BGS are strongest for those populations that have experienced sharp population bottlenecks (i.e., Europeans and Asians). However, the expected reduction in diversity due to BGS across all human populations is still greater than for a simulated population of constant size, illustrating the importance of population expansions for determining the overall strength of BGS.
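The selection model in these simulations draws selection coefficients from a gamma distribution, γ ~ Γ(α, β). The sketch below illustrates only that sampling step in plain Python and is not SFS_CODE. The values of α and β are hypothetical, and treating β as a scale (rather than a rate) parameter is an assumption, since the poster does not specify the parameterization.

```python
import random


def sample_gamma_dfe(n, alpha, beta, seed=0):
    """Draw n deleterious population-scaled selection coefficients.

    Magnitudes follow Gamma(shape=alpha, scale=beta); the sign is negative
    because the simulated mutations are deleterious.
    """
    rng = random.Random(seed)
    return [-rng.gammavariate(alpha, beta) for _ in range(n)]


# Hypothetical shape and scale, chosen only for illustration.
alpha, beta = 0.2, 1000.0
gammas = sample_gamma_dfe(20_000, alpha, beta)
mean_gamma = sum(gammas) / len(gammas)
print(f"mean gamma = {mean_gamma:.1f} (expected about {-alpha * beta:.1f})")
```

With a shape parameter below 1 most draws are close to zero while a few are very large, so a gamma distribution can cover nearly neutral through strongly deleterious mutations in one model.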
[Figure: Estimate of Reduction in Diversity Due to BGS across African, European, and Asian Populations. y-axis: π BGS / π0.]
To test whether demography affects the relative patterns of diversity (π) caused by BGS, as predicted by our simulations, we measured the ratio of diversity across different bins of BGS strength, B, in the human genome in each of 20 populations from phase 3 of the TGP. We observe that BGS causes the greatest relative reduction of genetic diversity in non-African populations. To limit potential biases from coverage, high-coverage (exome) and low-coverage regions of the genome were analyzed separately.
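The diversity ratio between B bins can be sketched as below. The function computes mean pairwise diversity per site from derived allele counts; the input layout, sample size and counts are hypothetical and only illustrate the calculation, not the poster's actual pipeline.

```python
def pi_per_site(derived_counts, n_chrom, n_sites):
    """Mean pairwise diversity per site from derived allele counts.

    derived_counts: derived allele count at each segregating site
    n_chrom: number of sampled chromosomes
    n_sites: number of callable sites in the bin (including monomorphic ones)
    """
    pairs = n_chrom * (n_chrom - 1) / 2
    # A site with k derived copies differs in k * (n - k) of all pairs.
    het = sum(k * (n_chrom - k) / pairs for k in derived_counts)
    return het / n_sites


# Hypothetical data for one population with 10 sampled chromosomes.
strong_bgs = pi_per_site([1, 1, 2], n_chrom=10, n_sites=1000)      # low B
weak_bgs = pi_per_site([1, 2, 5, 5, 3], n_chrom=10, n_sites=1000)  # high B
ratio = strong_bgs / weak_bgs
print(f"pi(strong BGS) / pi(weak BGS) = {ratio:.3f}")
```

Dividing by the number of callable sites in each bin matters because the strong-BGS and weak-BGS bins generally cover different amounts of sequence.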
[Figure: FST vs. background selection, three panels: African vs. Asian, African vs. European, European vs. Asian. x-axis: B value in bins from 0-24 to 975-1000 (BGS strength); y-axis: FST, 0.10 to 0.22 (population differentiation).]
[Figure: π by population (ESN, LWK, YRI, MSL, GWD, IBS, CEU, TSI, GBR, FIN, CHB, JPT, CHS, KHV, CDX, BEB, PJL, ITU, STU, GIH; grouped as AFR, EUR, EASN, SASN) for B bins 0-50, 476-525 and 951-1000. Panels: High-Coverage (Exome) and Low-Coverage.]
[Figure: π strong-BGS / π weak-BGS, i.e. (B=0-50)/(B=951-1000), by population. Panels: High-Coverage (Exome) and Low-Coverage.]
4 - BGS Skews the SFS Towards Rare Variants
Purifying selection on linked sites can cause distortions in gene genealogies, leading to potential skews in the site-frequency spectrum. To investigate these effects, we measured the SFS as a function of B separately across the high-coverage and low-coverage regions of phase 3 TGP populations. We observed a marked increase in the number of rare variants, especially singletons, in both datasets as a function of BGS strength. This pattern is stronger in non-African than in African populations.
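Comparing the SFS between B bins can be sketched as follows. The code builds a proportional SFS from derived allele counts and then takes the ratio of singleton proportions between a strong-BGS and a weak-BGS bin. The function names and input counts are hypothetical.

```python
from collections import Counter


def proportional_sfs(derived_counts, n_chrom):
    """Proportion of segregating sites at each derived allele count 1..n_chrom-1."""
    tally = Counter(k for k in derived_counts if 0 < k < n_chrom)
    total = sum(tally.values())
    return [tally.get(k, 0) / total for k in range(1, n_chrom)]


def singleton_ratio(strong_counts, weak_counts, n_chrom):
    """Singleton proportion in the strong-BGS bin over that in the weak-BGS bin."""
    strong = proportional_sfs(strong_counts, n_chrom)[0]
    weak = proportional_sfs(weak_counts, n_chrom)[0]
    return strong / weak


# Hypothetical derived allele counts with 6 sampled chromosomes.
strong_counts = [1, 1, 1, 2, 3]           # sites in a low-B (strong BGS) bin
weak_counts = [1, 1, 2, 2, 3, 4, 5, 5]    # sites in a high-B (weak BGS) bin
sfs_strong = proportional_sfs(strong_counts, n_chrom=6)
ratio = singleton_ratio(strong_counts, weak_counts, n_chrom=6)
print(sfs_strong, ratio)
```

A ratio above 1 corresponds to the excess of singletons under strong BGS described above. The sketch assumes each bin contains at least one segregating site and at least one singleton in the weak-BGS bin.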
[Figure: site-frequency spectra for YRI, CHS, TSI and ITU in B bins 0-50, 476-525 and 951-1000. x-axis: derived allele count (log scale, 1 to 150); y-axis: frequency. Panels: Low-Coverage and High-Coverage.]
[Figure: Ratio of Singleton Frequency in Strong BGS Bin vs. Weak BGS Bin, by population (grouped as AFR, EUR, EASN, SASN), for Low-Coverage and High-Coverage data. y-axis: ratio, 1.15 to 1.45.]
[Figure: Estimate [2] of the Strength of Background Selection Across Chromosome 22. y-axis: Fraction of Diversity Reduced Due to BGS (1-B).]
www.postersession.com
4 - Demography and BGS Across 1000 Genomes !
4 - Population Differentiation as a Function of BGS!The decrease in Ne locally across the genome as a result of BGS (inferred2 by the value, B, in which lower values indicate stronger BGS) may impact the rate of genetic drift at specific loci. To investigate this effect, we measured FST between TGP populations as a function of BGS strength. Our results suggest that the strength of BGS is a predictor of population differentiation, with an increase in genetic drift driving this effect.
5 - Forward Simulations of Demography and BGS!
With the observations of thousands of genomes from global populations, worldwide patterns of human genetic diversity can be more fully described and the genome-wide impacts of natural selection on these patterns better understood. In many studies, genomic scans have revealed linked-selection, which manifests itself as a local trough in diversity, to be one important driver of such patterns. However, it less well understood how the actions of linked-selection interact with other drivers of population diversity, such as demography. Utilizing thousands of whole-genomes from phase 3 of the 1000 Genomes Project (TGP) 1, we investigate the impact of one mode of linked-selection, background selection (BGS), on population specific patterns of genetic diversity as a function of demographic history. Additionally, we also observe that BGS influences the site-frequency spectrum (SFS), generating an increase towards the proportion of rare variants. This observation is especially crucial since many inferences of demography utilize the SFS and can be biased if they are performed on regions of the genome under strong BGS.
1 - Introduction!
Background Selection Influences Population Specific Patterns of Diversity and the Site-
Frequency Spectrum in Humans !!
Raul Torres and Ryan D. Hernandez!University of California, San Francisco!
2 - Background Selection!Through purifying selection, neutral variants linked to a negatively selected site will be eliminated from the population in a process called ‘background selection’ (BGS). Due to recombination and variation in the magnitude of selection on deleterious variants, the strength of BGS will vary across the genome, resulting in differing amounts of reduction in genetic diversity. Below, we show the variation in the strength of BGS across chromosome 22, as estimated by McVicker et al. 20082.
6 - Conclusions! The complex demographic history of humans has influenced the action of background selection on driving the patterns of diversity that exist within specific populations. Using data from the 1000 Genomes Project, we are able to recapitulate patterns that are evident by both forward simulations and population genetic theory. We also observe that linked selection, through the mechanism of BGS, can have a large impact on distorting the SFS. We caution that even strictly neutral sites can suffer from such distortions through the mere process of being on the same background as another site under selection and that care should be taken to account for such distortions when performing inference using the SFS.
References: 1. Abecasis G, et al. 2012. Nature. 491: 56-65. 2. McVicker G, et al. 2009. PLOS Genet. 5(5): e1000471. 3. Boyko AR, et al. 2008. PLOS Genet. 4(5): e1000083. 4. Gutenkunst RN, et al. 2009. PLOS Genet. 5(10): e1000695. 5. Hernandez RD. 2008. Bioinformatics. 24(23): 2786-2787. Acknowledgements: R.T. is supported by an NIGMS Initiative for Maximizing Student Development grant (R25-GM56847), an NSF Graduate Research Fellowship and a UCSF Earle C. Anthony Travel Award.
Na!
1.8 x Na!
0.28 x Na!
1.8 x Na! 3.4 x Na! 5.8 x Na!
0.15 x Na !
0.08 x Na !
AFR! EUR! ASN!
. . .�Deleterious locus:!!
Distribution of selection !coefficients: γ � Γ(α, β)
Length = 100 bp!
Uniform recombination
rate!
Demographic Model!Selection Model!
0.60
0.65
0.70
0.75
0.80
0.85
0.90
ππππ0
CONST AFR CEU ASNEUR! ASN!
Strength of Background Selection Across Chromosome 22
Position
Due
to B
ackg
roun
d Se
lect
ion
10000000 20000000 30000000 40000000
0.8
0.6
0.4
0.2
0
AFR!CONST!
Neutral Variant
Deleterious Variant
BGS!
Using a distribution of fitness effects and a demographic model inferred from previous studies3,4, we ran forward simulations using SFS_CODE5 to estimate the effect of human demography on determining the reduction in genetic diversity caused by BGS, observing that the effects of BGS are strongest for those populations that have experienced sharp population bottlenecks (i.e., Europeans and Asians). However, the expected reduction in diversity due to BGS across all human populations is still greater than for a simulated population of constant size, illustrating the importance of population expansions for determining the overall strength of BGS.
Estimate of Reduction in Diversity Due to BGS across African,
European, and Asian Populations!
π BG
S/π 0"
To observe if demography impacts the relative patterns of diversity (π) caused by BGS as predicted by our simulations, we measured the ratio of diversity across different bins of BGS strength, B, in the human genome in each of 20 populations from phase 3 of the TGP. We observe that BGS causes the greatest relative reduction of genetic diversity in non-African populations. To limit potential biases from coverage, high-coverage (exome) and low-coverage regions of the genome were analyzed separately.
π stro
ng -B
GS/
πw
eak-
BG
S"
π"
AFR!!
EUR!!
EASN!!
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ● ● ●
●
●
●
●● ●
●●
●
● ●●
● ●
● ● ● ●
● ● ●
●
Fst (estimator method) vs. Background SelectionAfrican vs. Asian
0.10
0.12
0.14
0.16
0.18
0.20
0.22
0−24 225−249 475−499 725−749 975−1000
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
● ● ●●
●
● ●
●
●
●●
●● ●
●
●●
● ●
●
●●
●
Fst (estimator method) vs. Background SelectionAfrican vs. European
0.10
0.12
0.14
0.16
0.18
0.20
0.22
0−24 225−249 475−499 725−749 975−1000
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●● ●
●● ● ● ●
●
●
●
● ● ● ●●
●●
● ●
● ●
●●
●
Fst (estimator method) vs. Background SelectionEuropean vs. Asian
0.10
0.12
0.14
0.16
0.18
0.20
0.22
0−24 225−249 475−499 725−749 975−1000
B value!
F ST!
BGS strength!
popu
latio
n di
ffere
ntia
tion!
F ST!
F ST!
B value! B value!
African vs. Asian! African vs. European! European vs. Asian!
High-Coverage (Exome)!
Position!
0.00
040.
0006
0.00
080.
0010
0.00
12
● ● ● ● ●
● ● ● ● ●● ● ● ● ●
● ● ● ● ●
ESN
LWK
YRI
MSL
GW
D
IBS
CEU TS
I
GBR FI
N
CH
B
JPT
CH
S
KHV
CD
X
BEB
PJL
ITU
STU
GIH
●
B : 0−50B : 476−525B : 951−1000
0.00
040.
0006
0.00
080.
0010
● ● ● ● ●
● ● ● ● ●● ● ● ● ●
● ● ● ● ●
ESN
LWK
YRI
MSL
GW
D
IBS
CEU TS
I
GBR FI
N
CH
B
JPT
CH
S
KHV
CD
X
BEB
PJL
ITU
STU
GIH
●
B : 0−50B : 476−525B : 951−1000
SASN!!
0.32
0.34
0.36
0.38
0.40
ESN
LWK
YRI
MSL
GWD
IBS
CEU TSI
GBR FIN
CHB
JPT
CHS
KHV
CDX
BEB
PJL
ITU
STU
GIH
(B=0−50)/(B=951−1000)
AFR!!
EUR!!
EASN!!
SASN!!
AFR!!
EUR!!
EASN!!
SASN!!
0.40
0.42
0.44
0.46
ESN
LWK
YRI
MSL
GWD
IBS
CEU TSI
GBR FIN
CHB
JPT
CHS
KHV
CDX
BEB
PJL
ITU
STU
GIH
(B=0−50)/(B=951−1000)
π stro
ng -B
GS/
πw
eak-
BG
S"
π"
AFR!!
EUR!!
EASN!!
SASN!!
High-Coverage (Exome)!
Low-Coverage!Low-Coverage!
4 - BGS Skews the SFS Towards Rare Variants!Purifying selection on linked sites can cause distortions in gene geneologies, leading to potential skews in the site-frequency spectrum. To investigate these effects, we measured the SFS as a function of B separately across the high-coverage and low-coverage regions of phase 3 TGP populations. We observed a marked increase in the number of of rare variants, especially singletons, in both datasets as a function of BGS strength. This pattern is amplified in non-African vs. African populations.
Derived Allele Count (log-scale)!
frequ
ency!
frequ
ency!
0.0
0.1
0.2
0.3
0.4
0.5
YRI
1 2 3 5 10 25 50 150
B : 0−50B : 476−525B : 951−1000
0.0
0.1
0.2
0.3
0.4
CHS
1 2 3 5 10 25 50 150
B : 0−50B : 476−525B : 951−1000
CHS!
0.0
0.1
0.2
0.3
0.4
0.5
TSI
1 2 3 5 10 25 50 150
B : 0−50B : 476−525B : 951−1000
Derived Allele Count (log-scale)!
0.0
0.1
0.2
0.3
0.4
0.5
CHS
1 2 3 5 10 25 50 150
B : 0−50B : 476−525B : 951−1000
Derived Allele Count (log-scale)!
0.0
0.1
0.2
0.3
0.4
0.5
ITU
1 2 3 5 10 25 50 150
B : 0−50B : 476−525B : 951−1000
Derived Allele Count (log-scale)!
0.0
0.1
0.2
0.3
0.4
ITU
1 2 3 5 10 25 50 150
B : 0−50B : 476−525B : 951−1000
0.0
0.1
0.2
0.3
0.4
YRI
1 2 3 5 10 25 50 150
B : 0−50B : 476−525B : 951−1000
YRI!
0.0
0.1
0.2
0.3
0.4
TSI
1 2 3 5 10 25 50 150
B : 0−50B : 476−525B : 951−1000
TSI!
Low-!Coverage!
High-!Coverage!
Figure: Ratio of Singleton Frequency in Strong BGS Bin vs. Weak BGS Bin, by population (AFR, EUR, EASN, SASN), for Low-Coverage and High-Coverage data. y-axis: ratio (1.15 to 1.45).
Figure: Estimate [2] of the Strength of Background Selection Across Chromosome 22. y-axis: Fraction of Diversity Reduced Due to BGS (1-B), annotated from strong to weak.
• Neutral sites defined as PhyloP ∈ (−1.2, 1.2)
Raul Torres
BGS in the human genome
Low Coverage WGS
High Coverage exome
4 - Demography and BGS Across 1000 Genomes
4 - Population Differentiation as a Function of BGS
BGS lowers Ne locally across the genome (as inferred [2] from the value B, where lower values indicate stronger BGS), which may increase the rate of genetic drift at specific loci. To investigate this effect, we measured FST between TGP populations as a function of BGS strength. Our results suggest that the strength of BGS predicts population differentiation, with increased genetic drift driving this effect.
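One way to measure FST within bins of B is sketched below. The poster does not say which FST estimator was used, so Hudson's estimator (computed as a ratio of sums across sites) is an assumption here, as are the record layout and function names. The 25-unit bin width follows the x-axis labels of the FST figure (0−24, ..., 975−1000).

```python
from collections import defaultdict


def hudson_fst(sites):
    """Hudson's FST as a ratio of sums over sites.

    `sites` yields (p1, n1, p2, n2): the allele frequency and the number
    of sampled chromosomes in each of two populations.
    """
    num = den = 0.0
    for p1, n1, p2, n2 in sites:
        num += ((p1 - p2) ** 2
                - p1 * (1 - p1) / (n1 - 1)
                - p2 * (1 - p2) / (n2 - 1))
        den += p1 * (1 - p2) + p2 * (1 - p1)
    return num / den if den else float("nan")


def fst_by_b_bin(records, bin_width=25, b_max=1000):
    """Group sites by B bin and return {bin_start: FST}.

    `records` yields (b_value, p1, n1, p2, n2); this layout is hypothetical.
    B = b_max is folded into the last bin (975-1000 for the defaults).
    """
    last_start = b_max - bin_width
    grouped = defaultdict(list)
    for b, p1, n1, p2, n2 in records:
        start = min(int(b) // bin_width * bin_width, last_start)
        grouped[start].append((p1, n1, p2, n2))
    return {start: hudson_fst(s) for start, s in sorted(grouped.items())}


# Toy records: a fixed difference at strong BGS, equal frequencies at weak BGS.
toy_records = [(10, 1.0, 100, 0.0, 100), (1000, 0.5, 101, 0.5, 101)]
toy_fst = fst_by_b_bin(toy_records)
```

Summing numerators and denominators before dividing keeps rare variants from dominating the per-bin estimate, which matters because the SFS itself changes across B bins.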
5 - Forward Simulations of Demography and BGS
Background Selection Influences Population-Specific Patterns of Diversity and the Site-Frequency Spectrum in Humans
Raul Torres and Ryan D. Hernandez
University of California, San Francisco
1 - Introduction
With observations of thousands of genomes from global populations, worldwide patterns of human genetic diversity can be described more fully and the genome-wide impacts of natural selection on these patterns better understood. In many studies, genomic scans have revealed linked selection, which manifests as a local trough in diversity, to be an important driver of such patterns. However, it is less well understood how linked selection interacts with other drivers of population diversity, such as demography. Using thousands of whole genomes from phase 3 of the 1000 Genomes Project (TGP) [1], we investigate the impact of one mode of linked selection, background selection (BGS), on population-specific patterns of genetic diversity as a function of demographic history. We also observe that BGS influences the site-frequency spectrum (SFS), increasing the proportion of rare variants. This observation is especially important because many inferences of demography rely on the SFS and can be biased if they are performed on regions of the genome under strong BGS.
2 - Background Selection
Through purifying selection, neutral variants linked to a negatively selected site are eliminated from the population, a process called 'background selection' (BGS). Because of recombination and variation in the magnitude of selection on deleterious variants, the strength of BGS varies across the genome, producing different reductions in genetic diversity. Below, we show the variation in the strength of BGS across chromosome 22, as estimated by McVicker et al. 2009 [2].
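To make the dependence on recombination and selection strength concrete, the sketch below evaluates the classic analytical approximation for B = π/π0 at a neutral site (Hudson and Kaplan 1995; Nordborg et al. 1996), in which each deleterious site contributes u·t / (t + r(1 − t))² to the exponent. This is theory for illustration only, not the estimation procedure of McVicker et al. [2] or the simulations on this poster. The parameter names, the toy values, and the use of distance × per-bp rate as the recombination fraction are assumptions. The 0−1000 B bins used elsewhere on the poster appear to be B multiplied by 1000.

```python
import math


def b_value(focal_pos, deleterious_sites, u, t, r_per_bp):
    """Approximate B = pi / pi0 at a neutral site under background selection.

    focal_pos         position of the neutral site (bp)
    deleterious_sites positions of sites under purifying selection (bp)
    u                 deleterious mutation rate per site per generation
    t                 selection coefficient against heterozygotes (s*h)
    r_per_bp          recombination rate per bp per generation; the
                      recombination fraction is taken as distance * rate,
                      which is only reasonable over short distances
    """
    exponent = 0.0
    for pos in deleterious_sites:
        r = r_per_bp * abs(pos - focal_pos)
        exponent += u * t / (t + r * (1 - t)) ** 2
    return math.exp(-exponent)


# Toy example: one 100 bp deleterious locus, evaluated at two distances.
locus = range(0, 100)
b_near = b_value(100, locus, u=1e-5, t=0.001, r_per_bp=1e-8)
b_far = b_value(1_000_000, locus, u=1e-5, t=0.001, r_per_bp=1e-8)
```

B approaches 1 as the neutral site moves away from the deleterious locus, and 1 − B is the fraction of diversity removed, which is the quantity plotted across chromosome 22.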
6 - Conclusions
The complex demographic history of humans has influenced how background selection shapes the patterns of diversity within specific populations. Using data from the 1000 Genomes Project, we recapitulate patterns predicted by both forward simulations and population genetic theory. We also observe that linked selection, through the mechanism of BGS, can strongly distort the SFS. We caution that even strictly neutral sites can suffer such distortions merely by sitting on the same background as another site under selection, and that care should be taken to account for them when performing inference using the SFS.
References:
1. Abecasis G, et al. 2012. Nature. 491: 56-65.
2. McVicker G, et al. 2009. PLOS Genet. 5(5): e1000471.
3. Boyko AR, et al. 2008. PLOS Genet. 4(5): e1000083.
4. Gutenkunst RN, et al. 2009. PLOS Genet. 5(10): e1000695.
5. Hernandez RD. 2008. Bioinformatics. 24(23): 2786-2787.
Acknowledgements: R.T. is supported by an NIGMS Initiative for Maximizing Student Development grant (R25-GM56847), an NSF Graduate Research Fellowship and a UCSF Earle C. Anthony Travel Award.
Figure: Demographic Model for AFR, EUR, and ASN. Population sizes are given relative to the ancestral size Na (labels: Na, 1.8 x Na, 0.28 x Na, 1.8 x Na, 3.4 x Na, 5.8 x Na, 0.15 x Na, 0.08 x Na).
Figure: Selection Model. Deleterious locus of length 100 bp; distribution of selection coefficients γ ~ Γ(α, β); uniform recombination rate. Legend: Neutral Variant, Deleterious Variant, BGS.
Figure: π/π0 (0.60 to 0.90) for CONST, AFR, CEU (EUR), and ASN.
Figure: Strength of Background Selection Across Chromosome 22. x-axis: Position (10000000 to 40000000); y-axis label ends "Due to Background Selection" (0 to 0.8).
Using a distribution of fitness effects and a demographic model inferred from previous studies [3,4], we ran forward simulations with SFS_CODE [5] to estimate how human demography affects the reduction in genetic diversity caused by BGS. The effects of BGS are strongest in populations that have experienced sharp bottlenecks (i.e., Europeans and Asians). However, the expected reduction in diversity due to BGS in every human population is still greater than in a simulated population of constant size, illustrating the importance of population expansions for determining the overall strength of BGS.
Figure: Estimate of Reduction in Diversity Due to BGS across African, European, and Asian Populations. y-axis: π_BGS/π_0.
To test whether demography affects the relative patterns of diversity (π) caused by BGS, as predicted by our simulations, we measured the ratio of diversity between bins of BGS strength, B, in the human genome in each of 20 populations from phase 3 of the TGP. BGS causes the greatest relative reduction of genetic diversity in non-African populations. To limit potential biases from coverage, high-coverage (exome) and low-coverage regions of the genome were analyzed separately.
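The diversity ratio between B bins can be sketched as follows. This is an illustration under stated assumptions rather than the poster's pipeline: π is computed per site from derived allele counts with the usual unbiased pairwise estimator, and the input layout, function names, and toy numbers are hypothetical.

```python
def pi_per_site(derived_counts, n_chrom, n_sites):
    """Mean pairwise diversity per site.

    derived_counts  derived allele counts at the variable sites in a region
    n_chrom         number of sampled chromosomes
    n_sites         total callable sites in the region, variable or not
    """
    pairs = n_chrom * (n_chrom - 1)
    total = sum(2 * k * (n_chrom - k) / pairs for k in derived_counts)
    return total / n_sites


def diversity_ratio(strong_bin, weak_bin, n_chrom):
    """pi(strong BGS) / pi(weak BGS); each bin is (derived_counts, n_sites)."""
    pi_strong = pi_per_site(strong_bin[0], n_chrom, strong_bin[1])
    pi_weak = pi_per_site(weak_bin[0], n_chrom, weak_bin[1])
    return pi_strong / pi_weak


# Toy bins for a sample of 4 chromosomes, 10 callable sites each.
toy_strong = ([1], 10)     # one singleton
toy_weak = ([2, 2], 10)    # two intermediate-frequency variants
toy_pi_ratio = diversity_ratio(toy_strong, toy_weak, n_chrom=4)
```

Dividing by the number of callable sites in each bin, not only the variable ones, is what makes the two bins comparable when they cover different amounts of sequence.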
Figure: π(strong BGS) / π(weak BGS) by population group (AFR, EUR, EASN).
Figure: FST (estimator method) vs. Background Selection, in three panels: African vs. Asian, African vs. European, European vs. Asian. x-axis: B value in bins from 0−24 to 975−1000 (BGS strength); y-axis: FST (population differentiation), 0.10 to 0.22.
Figure: High-Coverage (Exome). Diversity (π) by population (ESN, LWK, YRI, MSL, GWD, IBS, CEU, TSI, GBR, FIN, CHB, JPT, CHS, KHV, CDX, BEB, PJL, ITU, STU, GIH) for three B bins (B: 0−50, 476−525, 951−1000), in two panels with y-axis ranges 0.0004 to 0.0012 and 0.0004 to 0.0010.
Modeling assumptions impact results
Figure: π/π0 vs. distance (bp), from -20000 to 20000, in four panels: Multiplicative, 2Ns = -5; Additive, 2Ns = -5; Multiplicative, 2Ns ~ Γ(α, β); Additive, 2Ns ~ Γ(α, β). One curve per L (5e4, 5e5, 1e6, 1e7, 5e7, 1e8).
Lawrence Uricchio
Complex signatures of selection
• Soft selective sweeps result in multiple haplotypes increasing in frequency.
Soft Sweep
Zach Szpiech
Extended Multiple Haplotype Homozygosity
-- haplotype sample size
-- set of distinct haplotypes from the locus to marker x
-- ith most frequent haplotype
-- number of haplotypes
Figure: EHH vs. distance from core (bp); curves: Adaptive, Neutral.
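For orientation, the sketch below computes conventional extended haplotype homozygosity (EHH) from the quantities listed above: the sample size and the counts of each distinct haplotype between the core and marker x. It is the standard single-allele EHH, offered as a stand-in, and it should not be read as the multiple-haplotype statistic presented on this slide. The function name, argument layout, and toy haplotypes are assumptions.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, marker, core_allele=1):
    """Conventional EHH from the core site out to `marker`.

    haplotypes   list of equal-length 0/1 sequences, one per chromosome
    core, marker column indices of the core site and the target marker
    core_allele  EHH is computed among carriers of this core allele
    """
    lo, hi = sorted((core, marker))
    carriers = [tuple(h[lo:hi + 1]) for h in haplotypes
                if h[core] == core_allele]
    n = len(carriers)
    if n < 2:
        return 0.0
    # Probability that two randomly drawn carriers are identical
    # over the whole interval between the core and the marker.
    classes = Counter(carriers)
    return sum(comb(k, 2) for k in classes.values()) / comb(n, 2)


toy_haps = [
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
]
toy_ehh = [ehh(toy_haps, core=0, marker=m) for m in range(3)]
```

EHH starts at 1 at the core and decays with distance as recombination and mutation break haplotypes apart; a slower decay around an adaptive allele than around a neutral one is the contrast the figure shows.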
SelScan: Szpiech & Hernandez (arXiv)
Sorry, redacted for now… More coming soon!!
Power
Figure: % increase in power over iHS, in three panels: Constant Demography (s = 0.01), African Demography (s = 0.01), European Demography (s = 0.01). x-axis: Frequency at which selection begins (0, 0.01, 0.02, 0.05, 0.10); one curve per Sampling Frequency (0.70, 0.80, 0.90).
Figure: Power, for the same three demographies, x-axis, and Sampling Frequency curves. y-axis: 0% to 100%.
A genomic approach to detecting selection
• Most SNPs are non-coding.
• Most regulatory elements do not act on the nearest gene.
• We can use genome-wide signatures of selection to infer selection on genes using eQTL information.
ARTICLE
Sherlock: Detecting Gene-Disease Associations by Matching Patterns of Expression QTL and GWAS
Xin He,1,2 Chris K. Fuller,1 Yi Song,1 Qingying Meng,3 Bin Zhang,4 Xia Yang,3 and Hao Li1,*
Genetic mapping of complex diseases to date depends on variations inside or close to the genes that perturb their activities. A strong body of evidence suggests that changes in gene expression play a key role in complex diseases and that numerous loci perturb gene expression in trans. The information in trans variants, however, has largely been ignored in the current analysis paradigm. Here we present a statistical framework for genetic mapping by utilizing collective information in both cis and trans variants. We reason that for a disease-associated gene, any genetic variation that perturbs its expression is also likely to influence the disease risk. Thus, the expression quantitative trait loci (eQTL) of the gene, which constitute a unique "genetic signature," should overlap significantly with the set of loci associated with the disease. We translate this idea into a computational algorithm (named Sherlock) to search for gene-disease associations from GWASs, taking advantage of independent eQTL data. Application of this strategy to Crohn disease and type 2 diabetes predicts a number of genes with possible disease roles, including several predictions supported by solid experimental evidence. Importantly, predicted genes are often implicated by multiple trans eQTL with moderate associations. These genes are far from any GWAS association signals and thus cannot be identified from the GWAS alone. Our approach allows analysis of association data from a new perspective and is applicable to any complex phenotype. It is readily generalizable to molecular traits other than gene expression, such as metabolites, noncoding RNAs, and epigenetic modifications.
Introduction
Recent application of genome-wide association studies (GWASs) to complex human diseases led to the discovery that the majority of disease-associated variants (estimated to be as high as 88%) are located in noncoding sequences, potentially affecting gene expression rather than protein function.1,2 Because of the complexity of gene regulation, the expression of a gene can be modulated by mutations in cis (proximal to the gene) and/or in trans (distal to the gene or on different chromosomes, such as upstream transcription/chromatin factors, distal regulatory elements, etc.).3,4
As a result of a large mutational target size (primarily because of mutations in trans) and the buffering of gene regulatory systems that helps tolerate expression changes, genetic variants altering expression levels are common in populations.5 Indeed, many studies of expression quantitative trait loci (eQTL) demonstrate that the expression of most genes is influenced by multiple loci, most of which act in trans.6–9 Despite their individually small effect sizes, trans eQTL are collectively important for variation of gene expression and by some estimates account for a larger proportion of the heritability of gene expression than do cis eQTL.8,10
Because of their prevalence in the human population, expression variations, especially those in trans, provide systematic perturbations of the gene regulatory networks underlying various complex phenotypes, and as such might reveal important information about the genetic basis of these phenotypes. Thus there is a pressing need to develop a general framework to mine the collective information in both cis- and trans-expression QTL in the context of association studies. So far, information from trans variations has largely been ignored because only cis variants can be assigned to their target genes based on proximity by using the GWAS data alone. The growing collection of eQTL data for various human tissues makes it possible to associate trans variants with target genes.8 Although previous studies demonstrated the utility of eQTL data for aiding the analysis of association studies,9,11 most of these used only cis eQTL located close to the genes. This reflects some fundamental difficulties of utilizing information in trans. Because trans eQTL are usually much weaker than those in cis, the statistical signal of an individual trans eQTL is difficult to detect; it may fall far below the genome-wide threshold.9 Another major challenge is the pleiotropic effect of trans variation. A gene's trans perturbation may come from the mutation of a regulatory molecule, but this mutation may also affect multiple other genes.
Here we present a general strategy to infer genes whose perturbations contribute to the etiology of complex diseases by tapping into statistical information provided by both cis and trans variations affecting gene expression. Although individual variants are often weak and not particularly informative, the overall pattern of expression variants of a gene can provide a strong statistical signal. A unique aspect of this strategy is that because we utilize trans variants far from target genes, it is possible to identify important genes distal to any GWAS association signals and thus impossible to detect with GWAS alone.
1Department of Biochemistry and Biophysics, University of California at San Francisco, San Francisco, CA 94143, USA; 2Lane Center of Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA; 3Department of Integrative Biology and Physiology, University of California at Los Angeles, Los Angeles, CA 90095, USA; 4Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, NY 10029, USA
*Correspondence: [email protected]
http://dx.doi.org/10.1016/j.ajhg.2013.03.022. ©2013 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 92, 667–680, May 2, 2013 667
He et al. AJHG (2013)
Detecting selection on regulatory networks
The likelihood function at each SNP at a given Z is computed by summing over the hidden variables U_i and V_i:

$$P(x_i, y_i \mid Z) = \sum_{U_i, V_i} P(U_i)\, P(V_i \mid Z, U_i)\, P(x_i \mid U_i)\, P(y_i \mid V_i). \qquad \text{(Equation 3)}$$

U_i is a Bernoulli random variable with the success probability α (the prior probability of a SNP being associated with the expression trait). The information about the gene-disease relationship is encoded in the conditional probability P(V_i | Z, U_i). When Z = 0 or when Z = 1 and U_i = 0 (a false eSNP), V_i is a Bernoulli random variable with the success probability β (the prior probability of a SNP being associated with the phenotype). When Z = 1 and U_i = 1, according to our discussion before, V_i should also be 1. The probability terms P(x_i | U_i) and P(y_i | V_i) reflect the distribution of p values under the null or alternative hypothesis, and we derive these distributions in Appendix A.

The Bayes factor defined in Equation 2 can be expressed in terms of the following variables:

$$B_{i,x} = \frac{P(x_i \mid U_i = 1)}{P(x_i \mid U_i = 0)}, \qquad B_{i,y} = \frac{P(y_i \mid V_i = 1)}{P(y_i \mid V_i = 0)}. \qquad \text{(Equation 4)}$$

These are the Bayes factors measuring the association of the ith SNP with the expression and the phenotypic trait, respectively. We show, in Appendix B, that the BF of the ith SNP in Equation 2 is given by

$$B_i = \frac{1 - \alpha}{1 - \alpha + \alpha B_{i,x}} + \frac{\alpha B_{i,x}}{1 - \alpha + \alpha B_{i,x}} \cdot \frac{B_{i,y}}{1 - \beta + \beta B_{i,y}}. \qquad \text{(Equation 5)}$$

Thus the Bayes factor of the gene being tested depends only on the parameters α, β, and the SNP-level Bayes factors. If Bayesian inference has been performed in both the eQTL and GWAS analysis, it is straightforward to combine the resulting BFs to obtain the BF for the gene.
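Equation 5 is simple enough to evaluate directly. The sketch below implements the SNP-level Bayes factor exactly as written there. Combining SNPs by summing log Bayes factors assumes the SNPs are independent, which is an assumption of this sketch because Equation 2 is not reproduced in the excerpt. The function names and the default α and β values are hypothetical and are not taken from the Sherlock software.

```python
import math


def snp_bayes_factor(b_x, b_y, alpha, beta):
    """Equation 5: Bayes factor of one SNP for Z = 1 versus Z = 0.

    b_x, b_y  SNP-level Bayes factors for the expression trait and the
              phenotype (Equation 4)
    alpha     prior probability that a SNP is associated with expression
    beta      prior probability that a SNP is associated with the phenotype
    """
    denom_x = 1 - alpha + alpha * b_x
    denom_y = 1 - beta + beta * b_y
    return (1 - alpha) / denom_x + (alpha * b_x / denom_x) * (b_y / denom_y)


def gene_log_bayes_factor(snp_bfs, alpha=1e-3, beta=1e-3):
    """Sum of log SNP-level Bayes factors for one gene.

    Assumes independent SNPs; `snp_bfs` is an iterable of (b_x, b_y) pairs.
    The default priors are placeholders for illustration only.
    """
    return sum(math.log(snp_bayes_factor(bx, by, alpha, beta))
               for bx, by in snp_bfs)


# A SNP with no phenotype evidence (b_y = 1) is uninformative: B_i = 1.
neutral_bf = snp_bayes_factor(b_x=50.0, b_y=1.0, alpha=0.5, beta=0.5)
# A SNP supported in both the eQTL and GWAS data favors Z = 1.
matched_bf = snp_bayes_factor(b_x=3.0, b_y=3.0, alpha=0.5, beta=0.5)
```

The first case shows why unmatched eQTL carry no weight by themselves: when B_{i,y} = 1 the two terms of Equation 5 sum to 1 regardless of B_{i,x}.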
Figure 1. The Sherlock Algorithm: Matching Genetic Signatures of Gene Expression Traits to that of the Disease to Identify Gene-Disease Associations
(A) Perturbation of the expression level of a disease-associated gene at any of its eQTL changes the disease risk, and thus the eQTL tend to be associated with the complex disease as well (the dashed lines). The eQTL associations may contain false positives, so we use binary indicator variables, U, to represent the true SNP-gene expression relationship; similarly we use indicator variables, V, for the SNP-disease relationship. Z is a binary variable indicating whether the expression trait influences the disease risk.
(B) Hypothetic genome-wide association plots of the causal expression trait (top) and a complex disease (bottom). The genetic signature of the gene expression trait partially overlaps with that of the disease. Red arrows indicate the matched loci.
(C) Alignment of genetic signatures of a gene expression trait and the phenotype. Three different scenarios are shown, represented by the green, red, and black boxes.
(D) The probabilistic model representing the dependency of the variables. The semantics of the variables U, V, and Z are shown in (A). When Z = 0, U and V are independent (top). When Z = 1, V depends on both Z and U; if U = 1, then V is also likely to be 1 (bottom). The association statistics of a SNP with respect to the gene expression trait and the disease (x and y) depend on the hidden variables U and V. Shaded and open circles indicate observed and latent variables, respectively.
He et al. AJHG (2013)
Detecting selection on regulatory networks
Rank   GENE        BF
1      IFNAR2      14.0216
2      DARS        13.3106
3      RARRES2     12.7859
4      SLC25A43    11.8157
5**    EXT1        11.4169
6      FAM20B      11.3852
7**    MICB        11.2997
8**    MICA        11.2997
9**    HLA-B       11.2997
10**   HLA-C       11.2359
11     RHBDL1      11.1828
12**   RBMS3       11.142
13     FNBP1       11.1387
14     P4HB        10.8784
15**   SOX5        10.8667
16     KCNK3       10.7472
17     RGS20       10.5487
18     MPST        10.5474
19**   HLA-DPB1    10.4441
20     QSOX1       10.4326
21**   IL16        10.4201
22**   SYT17       10.3908
23     MALL        10.3165
24**   CRTC1       10.2577
25     MEMO1       10.2574
26     ISOC2       10.2464
27     PCF11       10.0775
28     XKR8        10.0043
29     RNF216L     10.0043
30**   SCG2        10.0012
** indicates genes in GWAS association database for complex phenotype
Selection on standing variation driven by response to pathogens
Description P-value FDR q-value
cytokine-mediated signaling pathway 5.92E-06 6.26E-02
immune effector process 7.47E-06 3.95E-02
regulation of immune system process 7.47E-06 2.64E-02
regulation of defense response to virus 8.53E-06 2.26E-02
lymphocyte costimulation 9.36E-06 1.98E-02
T cell costimulation 9.36E-06 1.65E-02
GOrilla
Haplotype-based selection signals recapitulate geography
Figure: PCA of the top 1% of windows; PC1 (14.4%) vs. PC2 (12.6%). Populations: ACB, ASW, CDX, CEU, CHB, CHS, CLM, FIN, GBR, GIH, IBS, JPT, KHV, LWK, MKK, MXL, PEL, PUR, TSI, YRI.
• TGP samples with phased OMNI genotype data
• Used iHS
• 100kb windows for each population are coded 1 if selection score is in top 1% (0 otherwise)
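The window coding in the last bullet can be sketched as below. This covers only the 0/1 coding step; the per-window scores, the function names, and the tie handling are assumptions for illustration, and the toy scores stand in for whatever iHS-based window score was actually used.

```python
def code_top_windows(scores, top_fraction=0.01):
    """Code each window 1 if its score is in the top fraction, else 0.

    `scores` holds one selection score per 100kb window for a single
    population. Windows tied with the cutoff score are all coded 1, so
    slightly more than `top_fraction` of windows can be flagged.
    """
    n_top = max(1, round(len(scores) * top_fraction))
    cutoff = sorted(scores, reverse=True)[n_top - 1]
    return [1 if s >= cutoff else 0 for s in scores]


def build_indicator_matrix(scores_by_pop, top_fraction=0.01):
    """Return {population: 0/1 vector over windows}, one row per population."""
    return {pop: code_top_windows(s, top_fraction)
            for pop, s in scores_by_pop.items()}


# Toy scores for 200 windows in two hypothetical populations.
toy_scores = {
    "POP_A": [float(i) for i in range(200)],
    "POP_B": [float(199 - i) for i in range(200)],
}
toy_matrix = build_indicator_matrix(toy_scores)
```

The resulting populations-by-windows matrix is the input to the PCA; with NumPy, for example, one could center its columns and take a singular value decomposition with `numpy.linalg.svd` to obtain the principal components.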
Conclusions
• Many complex signatures of selection in the human genome.
• Mixtures of positive and negative selection
• Complicated modes of selection (including soft sweeps)
• Predominant signature of ancient human-lineage selection seems to be from olfactory processes
• Recent selection on standing variation associated with complex traits, including pathogen response.
Thanks!
1000 Genomes Project Consortium
Funding: NHGRI; QB3; CHARM; CTSI
Nicolas Strauli
Cyrus Maher
Raul Torres
Lawrence Uricchio
Zach Szpiech