TRANSCRIPT
Using genetics to study human history and natural selection
David Reich, Harvard Medical School Department of Genetics
Broad Institute
[Slide graphic: a long stretch of raw DNA sequence, followed by a column of two-letter variants (tc, ga, gc)]
A 2-part talk:
Section 1: How human history affects human genetic variation
Section 2: Detecting selection by the pattern of genetic variation and finding disease genes
How does human history affect genetic variation?
A genome-wide survey of Linkage Disequilibrium
Section 1
Linkage disequilibrium is a phenomenon whereby genetic variants are associated: people who have one tend to have a second as well
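To make the definition concrete, here is a minimal sketch of how the association between two variant sites is commonly quantified from haplotype counts, using the standard statistics D, D' (plotted later in this talk as |D'|) and r². The function name `ld_stats` and the example counts are illustrative assumptions, not data from the talk.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r2) for two biallelic sites.

    Arguments are counts of the four possible haplotypes:
    A or a at the first site, B or b at the second site.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n          # frequency of allele A at site 1
    p_B = (n_AB + n_aB) / n          # frequency of allele B at site 2

    # D: excess of the AB haplotype over what independence would predict
    D = p_AB - p_A * p_B

    # D' rescales D by the largest value it could take given the allele frequencies
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    D_prime = D / d_max if d_max > 0 else 0.0

    # r2: squared correlation between the alleles at the two sites
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = D * D / denom if denom > 0 else 0.0
    return D, D_prime, r2


# Hypothetical counts: partial association, then only three haplotypes present
print(ld_stats(40, 10, 10, 40))
print(ld_stats(30, 0, 20, 50))
```

In the second call only three of the four possible haplotypes are present, so |D'| is 1 even though r² is below 1. This is why |D'| = 1 is often read as "no recombination observed between the two sites".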
Linkage Disequilibrium Explained
[Figure: chromosomes within a population descending from a common ancestor, with variations (including a disease mutation) emerging over time up to the present]
What Determines Extent of LD?
[Figure: chromosomes carrying a disease-causing mutation, with time points labeled 2,000 generations ago, 1,000 generations ago, and present]
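The age dependence can be sketched with the textbook relation D_t = D_0 (1 - c)^t, where c is the recombination fraction between two sites and t is the number of generations. The code below is a rough illustration only: the rate of 1e-8 recombinations per base pair per generation (about 1 cM per Mb) is an assumed round figure, the linear conversion from distance to c only holds for short distances, and the function names are hypothetical.

```python
def ld_remaining(distance_bp, generations, rate_per_bp=1e-8):
    """Expected fraction of the initial LD (D_t / D_0) still present."""
    c = distance_bp * rate_per_bp        # recombination fraction per generation
    return (1.0 - c) ** generations


def half_decay_distance(generations, rate_per_bp=1e-8):
    """Distance in bp at which expected LD has dropped to half its initial value."""
    c_half = 1.0 - 0.5 ** (1.0 / generations)
    return c_half / rate_per_bp


for gens in (1000, 2000):
    print(gens,
          round(ld_remaining(100_000, gens), 3),
          round(half_decay_distance(gens)))
```

Under these assumptions, LD around a mutation that arose 2,000 generations ago extends about half as far (roughly 35 kb to the half-decay point) as LD around one that arose 1,000 generations ago (roughly 69 kb).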
How Far Does Association (LD) Extend Between Neighboring Common Sites?
[Figure: distance scale from 0 kb to 160 kb (marks at 5, 10, 20, 40, and 80 kb) showing the range of uncertainty]
• Theoretical: 3-8 kb
Strategy for Assessing Extent of LD
• 19 regions
• 44 Caucasian samples from Utah
• a great deal of DNA sequencing per sample
[Figure: sampling scheme by distance from core single nucleotide polymorphism (SNP); axis labels 5, 5, 10, 20, 40, 80]
[Figure: linkage disequilibrium (|D'|, 0 to 1) versus distance between SNPs (1 kb to 160 kb, plus unlinked), comparing the data with the previous theoretical prediction]
A Genome-Wide Assessment of Linkage Disequilibrium
Disease Gene Mapping
Human history
MYSTERY: What explains the long-range LD?
Important event in population history?
Positive Control: 48 Swedes
Identical pattern to Utah
[Figure: linkage disequilibrium (D') versus distance between SNPs (5 kb to 160 kb), showing the Utah LD curve, Sweden LD, and Sweden LD with sign of D' set by Utah]
96 Nigerians (Yoruba)
Much Less LD
Associations in Africans a SUBSET of those in Caucasians
MUST be influenced by population history
[Figure: linkage disequilibrium (D') versus distance between SNPs (5 kb to 160 kb), showing the Utah LD curve, Nigeria LD, and Nigeria LD with sign of D' set by Utah]
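One plausible reading of "sign of D' set by Utah" is sketched below: D' is computed in the second population, and its sign is taken relative to whichever alleles were coupled in the reference (Utah) population. If the same associations are present in both populations, the signed value stays positive; if the associations are unrelated, signed values would tend to average toward zero across many SNP pairs. This is an interpretation, not the talk's documented procedure, and the function names and counts are hypothetical.

```python
def d_and_signed_d_prime(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, signed D') for two biallelic sites from haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    D = n_AB / n - p_A * p_B
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return D, (D / d_max if d_max > 0 else 0.0)


def d_prime_signed_by_reference(test_counts, ref_counts):
    """D' in the test population, positive when it agrees with the reference."""
    _, dp_test = d_and_signed_d_prime(*test_counts)
    d_ref, _ = d_and_signed_d_prime(*ref_counts)
    if d_ref == 0:
        return 0.0                      # reference gives no direction
    return dp_test if d_ref > 0 else -dp_test


reference = (40, 10, 10, 40)            # hypothetical reference counts, A coupled with B
print(d_prime_signed_by_reference((30, 20, 20, 30), reference))  # same direction
print(d_prime_signed_by_reference((20, 30, 30, 20), reference))  # opposite direction
```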
Confirmation of less LD in Africans from Direct DNA Sequencing
[Figure: mean |D'| versus distance (500 bp to 160 kb), Nigerian versus Utah samples, with numeric labels on the data points]
Anna Di Rienzo also shows this pattern
More evidence from Genotyping ~5,000 SNPs (Gabriel et al. 2002)
[Figure: mean |D'| versus distance (0 to 150,000 bp) for Caucasian, African-American, Asian, and Yoruban samples]
K. Kidd, J. Kidd, Sarah Tishkoff also show this
Explanation: Bottleneck or ‘Founder Effect’ in History of North Europeans
What was this event?
(1) Out of Africa?
(2) Founding of Europe?
[Figure: an ancestral population giving rise to Yoruba ancestors and North Europeans]
• likely <10 founding chromosomes ~100,000 years ago
Open Mysteries
• What caused the bottleneck event? “Out of Africa” migration?
• How many people were involved? When did it occur?
• Can we better understand when the founder event occurred, and how many people were involved?
Acknowledgements for Section 1
Collaborators: Michele Cargill, Stacey Bolk, James Ireland, Pardis C. Sabeti, Daniel J. Richter, Thomas Lavery, Rose Kouyoumjian, Shelli F. Farhadian, Ryk Ward, Eric S. Lander
Samples: Leif Groop, Richard Cooper, Charles Rotimi
Using Long-Range Linkage Disequilibrium to Detect Positive Selection in the Genome
Section 2
Overview
1. The difficulty of detecting genomic regions affected by natural selection
2. The long-range haplotype test
3. Results for two genes: G6PD and CD40 ligand
Existing formal tests for selection
DNA sequence analysis (weak):
• Tajima’s D
• HKA test
• McDonald and Kreitman
• Fu and Li’s D
• Ka/Ks ratio
Genotyping-based tests: not general at present
Our test is based on the relationship between allele frequency and extent of linkage disequilibrium
No selection
Old alleles: • low or high frequency • short-range LD
Young alleles: • low frequency • long-range LD
Positive Selection
Young alleles: • high frequency • long-range LD
The signal of selection
[Figure: linkage disequilibrium (homozygosity) versus frequency, contrasting neutrality with positive selection]
Paradigm of the Core Region
[Figure: a gene with core haplotypes numbered 1 to 5]
Long-range multi-SNP haplotypes
[Figure: core haplotypes 1 to 5 at a gene, with core markers and long-range markers (SNPs C/T, A/G, A/G, C/T, C/T, C/T), illustrating decay of LD]
Long-range multi-SNP haplotypes
Decay of homozygosity (probability, at any distance, that any two haplotypes that start out the same have all the same SNP genotypes)
[Figure: haplotypes sharing one core haplotype branching at successive core and long-range markers (SNPs C/T, A/G, A/G, C/T, C/T, C/T), with homozygosity labels of 100%, 75%, 35%, and 18%]
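The homozygosity definition on this slide can be sketched directly: among chromosomes that share a core haplotype, count the fraction of pairs that are still identical at every marker out to a given distance. The code below is a minimal illustration with made-up haplotypes (not the talk's data); the function name `ehh` and the string encoding of haplotypes are assumptions.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, n_markers):
    """Homozygosity of chromosomes sharing a core haplotype.

    haplotypes: list of strings, one per chromosome, giving the alleles at
                the long-range markers ordered outward from the core.
    n_markers:  how many markers away from the core to include.
    Returns the probability that two randomly chosen chromosomes are
    identical at all of the first n_markers markers.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two chromosomes")
    # Group chromosomes that are identical out to this distance
    groups = Counter(h[:n_markers] for h in haplotypes)
    identical_pairs = sum(comb(count, 2) for count in groups.values())
    return identical_pairs / comb(n, 2)


# Four hypothetical chromosomes carrying the same core haplotype
chromosomes = ["CT", "CT", "CC", "TC"]
for k in range(3):
    print(k, ehh(chromosomes, k))
```

At the core itself (zero markers) every pair is identical, so the value is 1. It can only stay the same or fall as more distant markers are included.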
Two genes associated with malaria resistance
G6PD (1960’s)
• well established association to malaria resistance
• selection demonstrated in 2001 by Tishkoff et al.
CD40 ligand (2002)
• recent association by Sabeti et al.
• involved in immune regulation
Experimental Design
• G6PD (11 SNPs in core, 14 at long distances): markers spanning -480 kb to +220 kb around the gene
• CD40 ligand (TNFSF5; 7 SNPs in core, 14 at long distances): markers spanning -180 kb to +520 kb around the gene
[Figure: both regions drawn with the telomere marked]
Experimental Design
DNA samples from 231 African men:
• Yoruba (Nigeria)
• Beni (Nigeria)
• Shona (Zimbabwe)
Perfect phase (X chromosome: men carry a single copy, so haplotypes are read directly)
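Core haplotypes
G6PD: core haplotypes 1 to 9; Africans (230), non-Africans (95); “A-” protective haplotype marked
CD40 ligand: core haplotypes 1 to 6; Africans (231), non-Africans (91)
[Table: per-haplotype counts, run together in extraction: G6PD "38 72 428281441 5" and "46113 17"; CD40 ligand "591 97830 1" and "77 21 7 7"]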
G6PD: long-range haplotype diversity
[Figure: panels for G6PD core haplotypes 1, 3, 4, 5, 6, 7, and 8, plus one panel with a truncated label ("G6PD-corehap"); core haplotype 8 is the “A-” protective haplotype]
G6PD: homozygosity vs. distance
[Figure: EHH (extended haplotype homozygosity) versus distance from the core region (kb)]
G6PD: computer simulation vs. data
[Figure: relative EHH versus core haplotype frequency]
Core haplotype 8: P << 0.0008
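Relative EHH compares the homozygosity of one core haplotype with that of the other core haplotypes at the same locus, so that a region with generally low recombination does not by itself look like selection. The sketch below uses one common formulation (EHH of the tested core divided by the pooled EHH of all other cores). Whether this matches the talk's exact computation is an assumption, and the names and haplotypes are hypothetical.

```python
from collections import Counter
from math import comb


def pair_counts(haplotypes, n_markers):
    """Return (identical pairs, total pairs) out to n_markers from the core."""
    groups = Counter(h[:n_markers] for h in haplotypes)
    identical = sum(comb(c, 2) for c in groups.values())
    return identical, comb(len(haplotypes), 2)


def relative_ehh(by_core, tested, n_markers):
    """EHH of the tested core haplotype divided by pooled EHH of the others."""
    same, total = pair_counts(by_core[tested], n_markers)
    ehh_tested = same / total

    other_same = other_total = 0
    for core, haps in by_core.items():
        if core != tested:
            s, t = pair_counts(haps, n_markers)
            other_same += s
            other_total += t
    ehh_others = other_same / other_total
    return float("inf") if ehh_others == 0 else ehh_tested / ehh_others


# Hypothetical long-range haplotypes grouped by core haplotype
by_core = {
    "hapA": ["CT", "CT", "CT", "CC"],
    "hapB": ["CT", "TC", "CC"],
    "hapC": ["TT", "TT", "CT"],
}
print(relative_ehh(by_core, "hapA", 2))
```

A value well above 1 at a given distance means the tested core haplotype has kept more long-range homozygosity than the other haplotypes at the locus.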
G6PD: P-values from simulation
[Figure: P-value versus distance from the core region (kb)]
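A P-value "from simulation" is typically an empirical one: the observed statistic is ranked against values simulated without selection. The sketch below compares an observed relative EHH only against simulated core haplotypes of similar frequency. The frequency window, the add-one correction, the function name, and all numbers are illustrative assumptions rather than the talk's actual procedure.

```python
def empirical_p_value(obs_freq, obs_rehh, simulated, freq_window=0.05):
    """Fraction of frequency-matched neutral simulations at least as extreme.

    simulated: list of (core haplotype frequency, relative EHH) pairs
               produced by simulations without selection.
    """
    # Keep only simulated haplotypes whose frequency is close to the observed one
    matched = [rehh for freq, rehh in simulated
               if abs(freq - obs_freq) <= freq_window]
    exceed = sum(1 for rehh in matched if rehh >= obs_rehh)
    # Add-one correction keeps the estimate above zero with finite simulations
    return (exceed + 1) / (len(matched) + 1)


# Hypothetical simulated (frequency, relative EHH) pairs
sims = [(0.18, 1.2), (0.20, 0.9), (0.22, 2.5),
        (0.21, 1.1), (0.50, 6.0), (0.19, 1.0)]
print(empirical_p_value(0.20, 2.0, sims))
```

With a finite number of simulations the smallest reportable value is 1 / (number of matched simulations + 1), so very small P-values require many simulated haplotypes.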
G6PD also stands out in comparison to 7 control regions
[Figure: relative EHH versus core haplotype frequency, G6PD compared with 7 control regions]
CD40 ligand: long-range haplotype diversity
[Figure: panels for core haplotypes 1 to 5, with core haplotype 4 called out separately]
CD40 ligand: homozygosity vs. distance
[Figure: EHH (extended haplotype homozygosity) versus distance from the core region (kb)]
CD40 ligand: computer simulation vs. data
[Figure: relative EHH versus core haplotype frequency]
Core haplotype 4: P << 0.0011
CD40 ligand: P-values from simulation
[Figure: P-value versus distance from the core region (kb)]
CD40 ligand also stands out in comparison to 7 control regions
[Figure: relative EHH versus core haplotype frequency, CD40 ligand compared with 7 control regions]
Long-range linkage disequilibrium also gives a direct estimate of the date
Malaria resistance arose in last 10,000 years in Africa
• ~2,500 years ago for G6PD
• ~6,500 years ago for CD40 ligand
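One way to turn long-range homozygosity into a date is the approximation Pr[homozygous] = e^(-2rg), where r is the genetic distance from the core (in Morgans) and g is the number of generations since the common ancestor. Solving for g gives g = -ln(Pr) / (2r). The sketch below assumes this relation, a 25-year generation time, and purely hypothetical input values. It is not a reconstruction of how the dates on this slide were obtained.

```python
import math


def generations_since_origin(homozygosity, genetic_distance_morgans):
    """Solve homozygosity = exp(-2 * r * g) for g."""
    if not 0.0 < homozygosity <= 1.0:
        raise ValueError("homozygosity must be in (0, 1]")
    if genetic_distance_morgans <= 0.0:
        raise ValueError("genetic distance must be positive")
    return -math.log(homozygosity) / (2.0 * genetic_distance_morgans)


def years_since_origin(homozygosity, genetic_distance_morgans,
                       years_per_generation=25.0):
    """Convert generations to years using an assumed generation time."""
    g = generations_since_origin(homozygosity, genetic_distance_morgans)
    return g * years_per_generation


# Hypothetical inputs: homozygosity 0.6 at 0.2 cM (0.002 Morgans) from the core
print(round(generations_since_origin(0.6, 0.002), 1))
print(round(years_since_origin(0.6, 0.002)))
```

The estimate is sensitive to the local recombination rate and to the generation time, so both assumptions should be stated alongside any date obtained this way.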
Traditional tests fail to detect the effect
• Tajima’s D
• HKA test
• McDonald and Kreitman
• Fu and Li’s D
• Ka/Ks ratio
Not significant in our data. The long-range haplotype test is a powerful way to detect selection in the last 10,000 years.
Conclusions: Powerful general approach for detecting selection
[Figure: core haplotypes numbered 1 to 5]
Screen the genome for Positive Selection
Conclusions: Genome-wide screen for natural selection
We can find disease genes without patients!
What’s coming…
1. Generalization of the long-range haplotype test
2. Application of the approach genome-wide
• Haplotype map data set
• Disease gene screen data sets
Acknowledgements for Section 2
Pardis C. Sabeti, John Higgins, Haninah Z.P. Levine, Daniel J. Richter, Stephen F. Schaffner, Stacey Gabriel, Jill V. Platko, Nicholas J. Patterson, Gavin J. McDonald, Hans C. Ackerman, Sarah J. Campbell, David Altshuler, Richard Cooper, Ryk Ward, Eric S. Lander
Note
The 3rd section of the talk is not included here because it presents data that have not yet been published.