EPI 511, Advanced Population and Medical Genetics
TRANSCRIPT
Alkes Price
Harvard School of Public Health
January 24 & January 26, 2017
EPI 511, Advanced Population and Medical Genetics
Week 1:
• Intro + HapMap / 1000 Genomes
• Linkage Disequilibrium
EPI 511: Course structure
Week 1: HapMap, 1000G / Linkage disequilibrium
Week 2: Population structure and admixture
Week 3: Population stratification
Week 4: Fine-mapping / Natural selection
Week 5: Heritability / Genetic risk prediction
Week 6: Mixed models / Rare variant analysis
Week 7: Functional interpretation
EPI 511: How to address the instructor
Alkes
Dr. Price
Professor Price
Honorable Professor Price
Honorable Distinguished Dr. Professor Price
EPI 511: Office Hours
Instructor: Alkes
Office Hours: Thu 3:30-4:30pm, Building 2, Room 211
Email Address: [email protected]
(Please put EPI511 in the subject of your email)
Teaching Assistant: Armin
Office Hours: Fri + Mon 2-3pm, Building 2, Room 209
Email Address: [email protected]
EPI 511: Course components
• Advance reading 1 required paper + 1 optional paper per course session
• Lecture + Discussion discussants: each student to sign up as discussant for 1 class
Video of each class will be posted on
the course www site <1hr after class.
EPI 511: Course components
• Advance reading 1 required paper + 1 optional paper per course session
• Lecture + Discussion discussants: each student to sign up as discussant for 1 class
• Experiences 5 take-home projects due Tue Jan 31, …, Tue Feb 28
• short Research Paper due Fri Mar 10
• self-assessment Opportunity
20min exam (date will not be announced in advance)
EPI 511: Outcome measures
• Advance reading (0% of course grade) 1 required paper + 1 optional paper per course session
• Lecture + Discussion (0% of course grade) discussants: each student to sign up as discussant for 1 class
• Experiences (60% of course grade) 5 take-home projects (data and programming intensive)
Approaches to Scientific Understanding
Love is Understanding.
-- Madonna
Data is Understanding.
-- Alkes
Approaches to Scientific Understanding
Understanding Data requires Fixing Bugs.
Genetics + data + programming = bright future
Gewin 2007 Nature Hayden 2012 Nature
EPI 511: Outcome measures
• Advance reading (0% of course grade) 1 required paper + 1 optional paper per course session
• Lecture + Discussion (0% of course grade) discussants: each student to sign up as discussant for 1 class
• Experiences (60% of course grade) 5 take-home projects (data and programming intensive)
• short Research Paper (40% of course grade) 1,000-1,500 words (suggested topics provided on Feb 16)
• self-assessment Opportunity (0% of course grade)
20min exam (date will not be announced in advance)
EPI 511: Policy on group work
Experiences (60% of course grade) 5 take-home projects (data and programming intensive)
• OK to discuss experiences with your colleagues
• Each piece of code you write should be your own
short Research Paper (40% of course grade) 1,000-1,500 words (suggested topics provided on Feb 16)
• OK to discuss the project with your colleagues
• Each piece of code you write should be your own
• Each piece of text you write should be your own
EPI 511, Advanced Population and Medical Genetics
Week 1:
• Introduction + HapMap Project
• Linkage Disequilibrium
Outline
1. Introduction to Population Genetics
2. HapMap and HapMap2 projects
3. FST
4. HapMap3 and 1000 Genomes projects
What is Population Genetics?
Population genetics is the study of genetic variation
both within and between human populations.
Are different human populations
actually genetically different?
Slightly.
5-7% of worldwide human genetic variation is due to
genetic differences between human populations.
The remaining 93-95% of human genetic variation is due to
genetic variation within human populations
(Rosenberg et al. 2002 Science).
Why study differences between
human populations?
• Learn about human migration patterns and ancient history.
• Improve our power to identify and localize disease genes.
- Use differences in linkage disequilibrium for fine-mapping.
- Avoid false positives due to population stratification.
- Signals of natural selection at genes related to disease.
Rosenberg et al. 2010 Nat Rev Genet
Bustamante et al. 2011 Nature; also see Popejoy & Fullerton 2016 Nature
Williams et al. 2014 Nature
Does “race” exist?
Worldwide patterns of human genetic variation are best
described using continuous clines instead of discrete clusters.
(Serre & Paabo 2004 Genome Res)
Racial classifications are inadequate descriptors of the
distribution of human genetic variation.
(Tishkoff & Kidd 2004 Nat Genet)
For a fun time: go to a population genetics party and ask,
Isn’t it politically incorrect to study
differences between human populations?
No. It is not politically incorrect.
“Studies of human population genetics have generated the
strongest proof that there is no scientific basis for racism.”
(Cavalli-Sforza 2005 Nat Rev Genet)
also see Cavalli-Sforza et al. 1994 The History and Geography of Human Genes
Outline
1. Introduction to Population Genetics
2. HapMap and HapMap2 projects
3. FST
4. HapMap3 and 1000 Genomes projects
The International HapMap Project (International HapMap Consortium 2005 Nature)
CEU (European) CHB (Chinese)
JPT (Japanese) YRI (Nigerian)
Code   Ancestry            Location   Samples
CEU    northern European   USA        90
CHB    Chinese             China      45
JPT    Japanese            Japan      44
YRI    Yoruba              Nigeria    90
The International HapMap Project: 270 samples from 4 populations
Phase I HapMap: >1,000,000 SNPs (International HapMap Consortium 2005 Nature)
Phase II HapMap: >3,000,000 SNPs (International HapMap Consortium 2007 Nature)
What is a SNP?
Rosenberg & Nordborg 2002 Nat Rev Genet
A Single Nucleotide Polymorphism (SNP) is a letter of the
genome that differs in different individuals (e.g. G/T).
Each SNP corresponds to one single mutation event in history,
e.g. G mutated to T in one single ancestor.
G = ancestral allele, T = derived allele.
Coalescent tree
What is a SNP: physical position
Each SNP has a physical position on a chromosome.
SNP          chrom.   physical position (bp)
rs10910034   1        2165898
rs1713712    1        2166021
…            …        …
What is a SNP: physical vs. genetic position
Each SNP has a physical and genetic position on a chromosome.
SNP          chrom.   physical position (bp)   genetic position (Morgans)
rs10910034   1        2165898                  0.01904785
rs1713712    1        2166021                  0.01904814
…            …        …                        …
1 recombination event per Morgan per generation.
Genome-wide recombination rate is about 1cM / Mb.
[cM = centiMorgan = 1/100 Morgan, Mb = Megabase = 10^6 bp]
Thus, 1 Morgan is roughly 100Mb = 10^8 bp on average.
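As a rough worked example of the unit conversions above, the two SNPs in the table can be used to compute a local recombination rate in cM/Mb. This is only a sketch: the function name is illustrative and not part of any course software, and the positions are copied from the slide.

```python
# Local recombination rate between the two example SNPs listed on the slide.
# The helper name is hypothetical; the positions are copied from the table.

def local_rate_cm_per_mb(bp1, bp2, morgans1, morgans2):
    """Return the recombination rate (cM/Mb) between two markers."""
    genetic_cm = (morgans2 - morgans1) * 100.0  # Morgans -> centiMorgans
    physical_mb = (bp2 - bp1) / 1e6             # base pairs -> Megabases
    return genetic_cm / physical_mb

rate = local_rate_cm_per_mb(2165898, 2166021, 0.01904785, 0.01904814)
print(f"{rate:.2f} cM/Mb")
```

For these two SNPs the local rate comes out at about 0.24 cM/Mb, below the genome-wide average of about 1 cM/Mb.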
HapMap project: Summary of main results
• 3.1 million SNPs successfully genotyped using Perlegen
genotyping technology (Hinds et al. 2005 Science).
• These 3.1 million SNPs: about 30% of all common SNPs
(defined as SNPs with minor allele frequency >5%).
“Properties of SNPs are influenced by discovery sampling …
HapMap relied on nearly any piece of information available.”
Clark et al. 2005 Genome Res; also see Keinan et al. 2007 Nat Genet
Summary of main results, continued
• Understanding genetic differences between populations.
• Patterns of linkage disequilibrium both within and across
populations.
• Most common SNPs in the human genome are in strong
linkage disequilibrium with at least one HapMap SNP
[avg r^2 ≥ 0.90 in 10 sequenced ENCODE regions].
Genetic differences between HapMap populations (International HapMap Consortium 2005 and 2007 Nature)
77% frequency
68% frequency
50% frequency C allele of rs10910034
Genetic differences between HapMap populations (International HapMap Consortium 2005 and 2007 Nature)
FST = 0.19
FST = 0.11
FST = 0.16
Note: FST accounts for
sampling error due to
finite sample size.
Populations can be distinguished using
a large number of genetic markers
Principal Components Analysis
using 100 markers
Populations can be distinguished using
a large number of genetic markers
using 3 million markers
Principal Components Analysis
Outline
1. Introduction to Population Genetics
2. HapMap and HapMap2 projects
3. FST
4. HapMap3 and 1000 Genomes projects
Genetic differences between HapMap populations (International HapMap Consortium 2005 and 2007 Nature)
FST = 0.19
FST = 0.11
FST = 0.16
Defining vs. Estimating FST
• FST is an underlying parameter that depends on the two
populations, but does not depend on a particular finite sample.
• F̂ST is an estimate of the underlying FST that depends on a
particular finite sample that is analyzed.
Weir & Hill 2002 Annu Rev Genet, Bhatia et al. 2013 Genome Res
Defining FST
Definition:
• The FST between two populations is the value such that the
allele frequency difference between the two populations has
mean 0 and variance 2 FST p(1 – p), where p is the allele
frequency in the ancestral population.
[Figure: an ancestral population with allele frequency p splits into two populations with allele frequencies p1 and p2; each branch is labeled with variance FST p(1 – p).]
e.g. p1 ~ N(p, FST p(1 – p))
or p1 ~ Beta(p(1 – FST)/FST, (1 – p)(1 – FST)/FST)
Weir & Hill 2002 Annu Rev Genet, Bhatia et al. 2013 Genome Res
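The Beta model in the definition can be checked by simulation. The sketch below assumes that p1 and p2 are drawn independently given the ancestral frequency p (an assumption made here for illustration), and uses only the Python standard library; the function name and the values p = 0.5, FST = 0.1 are illustrative choices.

```python
import random

def simulate_frequency_difference(p, fst, n_sims=100_000, seed=0):
    """Draw p1 and p2 independently from Beta(p(1-FST)/FST, (1-p)(1-FST)/FST)
    and return the empirical mean and variance of p1 - p2."""
    rng = random.Random(seed)
    a = p * (1.0 - fst) / fst
    b = (1.0 - p) * (1.0 - fst) / fst
    diffs = [rng.betavariate(a, b) - rng.betavariate(a, b) for _ in range(n_sims)]
    mean = sum(diffs) / n_sims
    var = sum((d - mean) ** 2 for d in diffs) / n_sims
    return mean, var

p, fst = 0.5, 0.1
mean, var = simulate_frequency_difference(p, fst)
expected_var = 2 * fst * p * (1 - p)  # variance stated in the definition
print(f"mean = {mean:.4f}, variance = {var:.4f}, expected variance = {expected_var:.4f}")
```

The simulated mean should be close to 0 and the simulated variance close to 2 FST p(1 – p), since a Beta distribution with these parameters has mean p and variance FST p(1 – p).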
Defining FST
Definition:
• The FST between two populations is the value such that the
allele frequency difference between the two populations has
mean 0 and variance 2FSTp(1 – p), where p is the allele
frequency in the ancestral population.
OR
• The FST between two populations is equal to the proportion
of genotypic variance in a set of N individuals from each
population that is attributable to population differences.
Weir & Hill 2002 Annu Rev Genet, Bhatia et al. 2013 Genome Res
Defining FST
Theorem 1:
• The FST between two populations is the value such that the
allele frequency difference between the two populations has
mean 0 and variance 2FSTp(1 – p), where p is the allele
frequency in the ancestral population.
=>
• The FST between two populations is equal to the proportion
of genotypic variance in a set of N individuals from each
population that is attributable to population differences.
Defining FST
Proof: Let pavg = (p1 + p2)/2.
Total genotypic variance is 2pavg(1 – pavg) ≈ 2p(1 – p)
[Note that individuals are diploid: genotype = 0 or 1 or 2.
Binomial sampling with n=2.]
Genotypic variance attributable to population differences:
Suppose we have N data points with value 2p1, N with value 2p2
After subtracting the average value (p1 + p2), we have
N data points with value (p1 – p2), N with value (p2 – p1).
Since p1 and p2 each have variance FSTp(1 – p), it follows that
(p1 – p2) and (p2 – p1) each have variance 2FSTp(1 – p)
2FSTp(1 – p) / 2p(1 – p) = FST. Q.E.D.
Defining FST
Theorem 1′:
• The FST between two populations is the value such that the
allele frequency difference between the two populations has
mean 0 and variance 2FSTp(1 – p), where p is the allele
frequency in the ancestral population.
=>
• The proportion of genotypic variance in a set of
αN individuals from population 1 and (1 – α)N individuals
from population 2 that is attributable to population differences
is equal to 4α(1 – α) · FST.
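A small numerical check of the 4α(1 – α) · FST factor, following the same steps as the proof of Theorem 1: each individual is represented by its expected genotype, and (p1 – p2)^2 is set equal to its expected value 2 FST p(1 – p). The function name, the sample size n and the chosen values of α are illustrative assumptions.

```python
import math

def between_population_fraction(alpha, p1, p2, p, n=1000):
    """Fraction of the genotypic variance 2p(1-p) attributable to population
    differences, with alpha*n individuals from pop. 1 and the rest from pop. 2."""
    n1 = round(alpha * n)
    n2 = n - n1
    # Represent each individual by its expected genotype (2 x allele frequency).
    values = [2 * p1] * n1 + [2 * p2] * n2
    mean = sum(values) / n
    between = sum((v - mean) ** 2 for v in values) / n
    return between / (2 * p * (1 - p))

fst, p = 0.1, 0.5
# Choose p1, p2 so that (p1 - p2)^2 equals its expectation 2*FST*p*(1-p).
half_diff = math.sqrt(2 * fst * p * (1 - p)) / 2
p1, p2 = p + half_diff, p - half_diff

for alpha in (0.5, 0.25, 0.1):
    observed = between_population_fraction(alpha, p1, p2, p)
    predicted = 4 * alpha * (1 - alpha) * fst
    print(f"alpha = {alpha}: observed = {observed:.4f}, predicted = {predicted:.4f}")
```

With α = 0.5 the factor 4α(1 – α) equals 1 and the result reduces to Theorem 1.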
Genetic differences between HapMap populations (International HapMap Consortium 2005 and 2007 Nature)
FST = 0.19: [2 FST p(1 – p)]^(1/2) = 0.31 for p = 0.5
FST = 0.11: [2 FST p(1 – p)]^(1/2) = 0.23 for p = 0.5
FST = 0.16: [2 FST p(1 – p)]^(1/2) = 0.28 for p = 0.5
Genetic distances (FST) between
European American subpopulations
(Ashkenazi, Northwest Eur., Southeast Eur.)
FST = 0.009: [2 FST p(1 – p)]^(1/2) = 0.067 for p = 0.5
FST = 0.005: [2 FST p(1 – p)]^(1/2) = 0.050 for p = 0.5
FST = 0.004: [2 FST p(1 – p)]^(1/2) = 0.045 for p = 0.5
Price, Butler et al. 2008 PLoS Genet
Genetic distances (FST) between
East Asian subpopulations
Chinese vs. Japanese: FST = 0.007
[2 FST p(1 – p)]^(1/2) = 0.059 for p = 0.5
International HapMap Consortium 2007 Nature
Genetic distances (FST) between
West African subpopulations
Yoruba (Nigeria) vs. Luhya (Kenya): FST = 0.008
[2 FST p(1 – p)]^(1/2) = 0.063 for p = 0.5
International HapMap3 Consortium 2010 Nature
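The square-root values quoted on the preceding slides are the standard deviation of the allele frequency difference p1 – p2 implied by Var(p1 – p2) = 2 FST p(1 – p). A minimal sketch that recomputes them for the FST values listed above (the function name is illustrative):

```python
import math

def typical_frequency_difference(fst, p=0.5):
    """Standard deviation of p1 - p2 implied by Var(p1 - p2) = 2*FST*p*(1-p)."""
    return math.sqrt(2 * fst * p * (1 - p))

# FST values quoted on the slides above.
for fst_value in (0.19, 0.16, 0.11, 0.009, 0.008, 0.007, 0.005, 0.004):
    sd = typical_frequency_difference(fst_value)
    print(f"FST = {fst_value}: typical difference = {sd:.3f} for p = 0.5")
```

For a SNP with ancestral frequency 0.5, an FST of 0.19 corresponds to a typical frequency difference of about 0.31, while an FST of 0.004 corresponds to about 0.045.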
How do we estimate FST?
p1 and p2 are allele frequencies in 2 populations
Var(p1 – p2) = 2 FST p(1 – p).
Thus, estimate FST = Var((p1 – p2) / [2p(1 – p)]^(1/2))
= E((p1 – p2)^2 / [2p(1 – p)]).
A PROBLEM: we don’t get to observe p (ancestral frequency)
SOLUTION: approximate p ≈ pavg = (p1 + p2)/2.
A BIGGER PROBLEM: we don’t get to observe p1 and p2.
We only get to observe sample allele frequencies p̂1 and p̂2
in sample sizes N1 (from pop. 1) and N2 (from pop. 2).
SOLUTION:
Since Var(p̂1 – p̂2) ≈ [2 FST + 1/(2N1) + 1/(2N2)] p(1 – p), estimate
F̂ST = E([(p̂1 – p̂2)^2 – (1/(2N1) + 1/(2N2)) p(1 – p)] / [2p(1 – p)])
(where we approximate p ≈ (p̂1 + p̂2)/2)
OR, as a ratio of sums (where i indexes SNPs):
F̂ST = Σi [(p̂i1 – p̂i2)^2 – (1/(2N1) + 1/(2N2)) pi(1 – pi)] / Σi [2pi(1 – pi)]
some details omitted; see Bhatia et al. 2013 Genome Res
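The ratio-of-sums estimator above can be sketched in a few lines of Python. This follows the simplified formula on the slide (with p approximated by the average of the two sample frequencies), not the full estimator of Bhatia et al. 2013. The simulation helper, its parameter values, and the Beta model used to generate the data are assumptions made only to exercise the function.

```python
import random

def estimate_fst(freqs1, freqs2, n1, n2):
    """Ratio-of-sums FST estimate from per-SNP sample allele frequencies.

    n1, n2 are the numbers of diploid individuals sampled from each population.
    """
    correction = 1.0 / (2 * n1) + 1.0 / (2 * n2)
    numerator = 0.0
    denominator = 0.0
    for f1, f2 in zip(freqs1, freqs2):
        p = (f1 + f2) / 2.0  # approximate the ancestral frequency
        numerator += (f1 - f2) ** 2 - correction * p * (1.0 - p)
        denominator += 2.0 * p * (1.0 - p)
    return numerator / denominator

def sample_frequency(rng, true_freq, n_individuals):
    """Sample allele frequency from 2N binomial draws (diploid individuals)."""
    n_alleles = 2 * n_individuals
    return sum(rng.random() < true_freq for _ in range(n_alleles)) / n_alleles

def simulate(fst, n1, n2, n_snps, seed=0):
    """Hypothetical data generator: Beta drift model plus binomial sampling."""
    rng = random.Random(seed)
    freqs1, freqs2 = [], []
    for _ in range(n_snps):
        p = rng.uniform(0.1, 0.9)  # ancestral frequency (arbitrary range)
        a = p * (1.0 - fst) / fst
        b = (1.0 - p) * (1.0 - fst) / fst
        freqs1.append(sample_frequency(rng, rng.betavariate(a, b), n1))
        freqs2.append(sample_frequency(rng, rng.betavariate(a, b), n2))
    return freqs1, freqs2

freqs1, freqs2 = simulate(fst=0.1, n1=50, n2=50, n_snps=2000)
fst_hat = estimate_fst(freqs1, freqs2, 50, 50)
print(f"estimated FST = {fst_hat:.3f} (data simulated with FST = 0.1)")
```

The estimate should land near, but not exactly at, the simulated value, both because of sampling noise and because of the approximations in the simplified formula.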
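Drift vs. Divergence
[Figure: population tree for YRI, CHB and CEU. Drift (FST) values on the tree: 0.02, 0.04, 0.07, 0.10. Divergence (per 1000bp of DNA) for the pairs YRI YRI, CEU CEU, CHB CHB: 0.84, 0.60, 0.57. Sample IDs shown: NA18488, NA06989, NA18597.]
Keinan et al. 2007 Nat Genet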
Drift vs. Divergence
[Figure: population tree for YRI, CHB and CEU. Drift (FST) values on the tree: 0.02, 0.04, 0.05, 0.10. Divergence (generations) for the pairs YRI YRI, CEU CEU, CHB CHB: ~30K gen., ~20K gen., ~20K gen. Sample IDs shown: NA18488, NA06989, NA18597.]
Based on mut. rate 1.2–1.8 x 10^-8
(Kong et al. 2012 Nature,
Sun et al. 2012 Nat Genet)
Keinan et al. 2007 Nat Genet
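One plausible reading of how the two versions of this figure relate: if two chromosomes differ at a fraction d of sites and mutations accumulate on both lineages at rate μ per bp per generation, then d = 2μT, so T = d / (2μ). The sketch below applies this to the divergence values per 1000bp quoted above. The midpoint μ = 1.5 x 10^-8 and the assignment of the three values to YRI, CEU and CHB pairs are assumptions for illustration.

```python
def divergence_to_generations(div_per_kb, mut_rate_per_bp):
    """Generations to the common ancestor of two chromosomes, assuming
    per-bp divergence d = 2 * mu * T (both lineages accumulate mutations)."""
    return (div_per_kb / 1000.0) / (2.0 * mut_rate_per_bp)

mu = 1.5e-8  # assumed midpoint of the 1.2-1.8 x 10^-8 range quoted above

# Divergence per 1000bp copied from the figure; pair labels are assumed.
for label, div in (("YRI", 0.84), ("CEU", 0.60), ("CHB", 0.57)):
    generations = divergence_to_generations(div, mu)
    print(f"{label}: about {generations / 1000:.0f}K generations")
```

Under these assumptions the three values come out at roughly 28K, 20K and 19K generations, in line with the ~30K and ~20K labels on the figure.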
Outline
1. Introduction to Population Genetics
2. HapMap and HapMap2 projects
3. FST
4. HapMap3 and 1000 Genomes projects
CEU northern European USA 90
CHB Chinese China 45
JPT Japanese Japan 44
YRI Yoruba Nigeria 90
HapMap: 270 samples from 4 populations
Affymetrix and
Illumina chips
Perkel 2008 Nat Methods
The HapMap Project:
Work is done, relax on beach?
Beyond HapMap: what the world still needs
• Larger sample sizes for analyses of linkage disequilibrium
• More complete representation of world population diversity
e.g. South Asian and Native American genetic variation
• Analyses of copy number variation (CNV)
• Low-frequency variants (minor allele frequency <5%)
The International HapMap3 Project:
1,260 samples from 11 diverse populations
International HapMap3 Consortium 2010 Nature
Code   Ancestry            Location   Samples
CEU    northern European   USA        180
CHB    Chinese             China      90
JPT    Japanese            Japan      90
YRI    Yoruba              Nigeria    180
TSI    Tuscan              Italy      90
CHD    Chinese             USA        100
LWK    Luhya               Kenya      90
MKK    Maasai              Kenya      180
ASW    African-American    USA        90
MXL    Mexican-American    USA        90
GIH    Gujarati-American   USA        90
HapMap3: 1,260 samples from 11 populations
The HapMap3 project
• Larger sample sizes for analyses of linkage disequilibrium
• More complete representation of world population diversity
e.g. South Asian and Native American genetic variation
• Analyses of copy number variation (CNV)
• Low-frequency variants (minor allele frequency <5%)
International HapMap3 Consortium 2010 Nature
Data generation: SNPs and CNVs
Affymetrix 6.0 array
900K SNPs
940K copy-number probes
Illumina Infinium 1M array
1M SNPs, of which
80K targeted at CNV regions
1.5M SNPs passed QC in all populations
(99.3% concordance for 250K SNPs on both arrays)
Note: only 1.5M SNPs, versus 3.1 million SNPs in HapMap2
International HapMap3 Consortium 2010 Nature
Not all HapMap3 populations are
similar to a population from HapMap
HapMap3 population        Closest pop. from HapMap   FST
TSI (Tuscan)              CEU                        0.004
CHD (Chinese)             CHB                        0.001
LWK (Luhya)               YRI                        0.008
MKK (Maasai)              YRI                        0.03
ASW (African-American)    YRI                        0.01
MXL (Mexican-American)    CEU                        0.04
GIH (Gujarati-American)   CEU                        0.04
Approaches to Scientific Understanding
Love is Understanding.
-- Madonna
Data is Understanding.
-- Alkes
HapMap3 data: individual files
CEU.ind:
NA06989 F CEU
NA11891 M CEU
NA11843 M CEU
NA12341 F CEU
NA12739 M CEU
…
[sample ID] [sex] [popname]
HapMap3 data: SNP files
CEU.snp:
rs10458597 1 0.0 554484 C T
rs2185539 1 0.0 556738 C T
rs11240767 1 0.0 718814 C T
rs12564807 1 0.0 724325 A G
rs3131972 1 0.0 742584 G A
…
[SNP ID] [chr] [0.0] [position] [ref] [var]
HapMap3 data: genotype files
CEU.geno:
2222222222… [Each line is 1 SNP, each column is 1 indiv.]
2222222222…
2222222222…
2222222222…
1121212112…
…
[Number of copies of reference allele: 0 or 1 or 2.
9 denotes missing data.]
Note: the HapMap3 data files for this course are restricted to
~700K SNPs that are common (MAF>5%) in every population.
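A minimal sketch of reading these file formats in Python, based only on the column descriptions on the slides above. The function names are illustrative, the example works on in-memory strings rather than the course files, and pairing the fifth .snp line with the fifth .geno line (truncated to the 10 individuals shown) is an assumption.

```python
def parse_snp_line(line):
    """Split one .snp line into fields, following the column order on the slide:
    [SNP ID] [chr] [0.0] [position] [ref] [var]."""
    snp_id, chrom, _third_column, position, ref, var = line.split()
    return {"id": snp_id, "chr": chrom, "position": int(position),
            "ref": ref, "var": var}

def parse_geno_line(line):
    """Convert one .geno line (one SNP, one character per individual) into a
    list of reference allele counts; 9 (missing data) becomes None."""
    return [None if ch == "9" else int(ch) for ch in line.strip()]

def reference_allele_frequency(genotypes):
    """Reference allele frequency among individuals with non-missing data."""
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return None
    return sum(observed) / (2 * len(observed))

# Lines copied from the slides; only the first 10 individuals are shown there.
snp = parse_snp_line("rs3131972 1 0.0 742584 G A")
genotypes = parse_geno_line("1121212112")
freq = reference_allele_frequency(genotypes)
print(snp["id"], snp["ref"], f"frequency = {freq:.2f}")
```

Missing genotypes are excluded from both the allele count and the denominator, so a SNP with some 9s is still assigned a frequency from the individuals that were typed.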
Beyond HapMap: what the world still needs
• Larger sample sizes for analyses of linkage disequilibrium
• More complete representation of world population diversity
e.g. South Asian and Native American genetic variation
• Analyses of copy number polymorphisms (CNV)
• Low-frequency variants (minor allele frequency <5%)
Common Disease/Common Variant hypothesis
Lander 1996 Science; Reich & Lander 2001 Trends Genet
reviewed in Gibson 2012 Nat Rev Genet, Visscher et al. 2012 Am J Hum Genet
“For common diseases, there will be one or a few
predominating disease alleles with relatively high frequencies at
each of the major underlying disease loci”
Are rare and low-frequency variants important?
Visscher et al. 2012 Am J Hum Genet; Gibson 2012 Nat Rev Genet; Kaiser 2012 Science
(to be continued, Thu of Week 6)
HapMap3 1Mb pilot sequencing study
and 1000 Genomes pilot projects
International HapMap3 Consortium 2010 Nature
1000 Genomes Project Consortium 2010 Nature
• HapMap3 pilot sequencing: 10 100kb regions spanning 1Mb (high coverage: Sanger sequencing)
692 individuals from 10 HapMap3 populations
• 1000 Genomes Trio pilot project: Genome-wide (high coverage: 42x)
6 individuals (one CEU trio and one YRI trio)
• 1000 Genomes Low-coverage pilot project: Genome-wide (low coverage: 2x-6x)
179 individuals from CEU, YRI, CHB, JPT populations
• 1000 Genomes Exon pilot project: 8,140 exons spanning 1.4Mb from 906 genes (high coverage: >50x)
697 individuals from 7 HapMap3 populations
Sample size and SNP discovery (per Mb)
International HapMap3 Consortium 2010 Nature
The 1000 Genomes (1000G) Project
Sequence the entire genomes of 1,092 individuals:
379 of European ancestry (Europe and USA)
286 of East Asian ancestry (Asia)
246 of African ancestry (Africa and USA)
181 of Latino ancestry (Latin America and USA)
Use next-generation sequencing technologies (~4x coverage):
e.g. Illumina, 454, SOLiD (read lengths 25-400bp)
(Metzker 2010 Nat Rev Genet, Davey et al. 2011 Nat Rev Genet,
also see Nielsen et al. 2011 Nat Rev Genet)
1000 Genomes Project Consortium 2012 Nature
1000G project: Summary of main results
• 38 million SNPs discovered and successfully genotyped.
Most of these are rare and low-frequency variants.
• The 38 million SNPs include
99.7% of all SNPs with minor allele frequency 5%
98% of all SNPs with minor allele frequency 1% ***
50% of all SNPs with minor allele frequency 0.1%
based on an independent UK European sample.
***: stated goal to identify >95% of SNPs with frequency 1%
was successfully achieved.
1000 Genomes Project Consortium 2012 Nature
Common variants are shared across populations,
but rare variants are often population-private
1000 Genomes Project Consortium 2012 Nature
1000G project: the final phase
Sequence the entire genomes of 2,504 individuals:
503 of European ancestry (Europe and USA)
504 of East Asian ancestry (Asia)
661 of African ancestry (Africa and USA)
347 of Latino ancestry (Latin America and USA)
489 of South Asian ancestry (South Asia and USA)
Use next-generation sequencing technologies (~7x coverage):
Illumina only (read lengths 70-400bp only)
85 million SNPs, of which 64 million have MAF<0.5%
Related resource: UK10K project: 7x WGS of 3,781 UK samples
(UK10K Consortium 2015 Nature; also see Gudbjartsson et al. 2015 Nat Genet, McCarthy et al. 2016 Nat Genet)
1000 Genomes Project Consortium 2015 Nature
What about rare variants?
• The 1000G project has identified most low-frequency variants
(minor allele frequency 1%-5%). These variants can be placed
on genotyping arrays or imputed (see Thu of Week 1)
• Rare variants: most have not been identified by 1000 Genomes!
Must sequence disease samples directly.
Past focus has been mostly on exome sequencing, but
now shifting to whole-genome sequencing.
(to be continued, Thu of Week 6)
Kiezun et al. 2012 Nat Genet, Tennessen et al. 2012 Science, Pasaniuc et al. 2012 Nat Genet,
Purcell et al. 2014 Nature, Do et al. 2015 Nature, Cai et al. 2015 Nature. Reviewed in
Goldstein et al. 2013 Nat Rev Genet, Lee et al. 2014 Am J Hum Genet, Zuk et al. 2014 PNAS
Conclusions
• Human populations are slightly genetically different.
These differences may be important for disease mapping.
(see Thu slides: Linkage Disequilibrium.)
• FST quantifies differences between human populations.
• HapMap, HapMap2, HapMap3 and 1000 Genomes projects
provide a valuable resource for common & low-frequency
variants (but most rare variants have not yet been identified).
EPI 511, Advanced Population and Medical Genetics
Week 1:
• Intro + HapMap / 1000 Genomes
• Linkage Disequilibrium
EPI 511: Course components
• Advance reading 1 required paper + 1 optional paper per course session
• Lecture + Discussion discussants: each student to sign up as discussant for 1 class
Outline
1. Introduction to Linkage Disequilibrium
2. LD and Tag SNPs
3. LD and imputation
4. LD and fine-mapping
Linkage Disequilibrium
Definition: Linkage Disequilibrium (LD) refers to
correlations between genotypes of nearby markers.
Linkage Disequilibrium Association Studies
Linkage Disequilibrium Linkage Mapping
(reviewed in Ott et al. 2015 Nat Rev Genet)
Linkage Disequilibrium: Example
[Figure: DNA sequences (two letters per position) of 8 individuals along a stretch of the 3-billion-letter genome, with three variable positions marked SNP 1, SNP 2 and SNP 3. SNP 1 and SNP 2: YES, in LD. SNP 3: NOT in LD.]
Linkage Disequilibrium: Example
[Figure: the same kind of display with alleles coded 0/1. SNP 1 and SNP 2: r^2=1, in LD. SNP 3: r^2=0, NOT in LD.]
r^2 is squared correlation
Linkage Disequilibrium: Example
[Figure: 0/1-coded sequences of the 8 individuals with a different SNP 3. SNP 1 and SNP 2: r^2=1, in LD. SNP 3: r^2=0.7, partial LD.]
r^2 is squared correlation
Linkage Disequilibrium: Example
Individuals
        1 2 3 4 5 6 7 8
        0 0 0 0 0 0 0 0
SNP 1   1 0 2 0 1 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
SNP 2   1 0 2 0 1 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
SNP 3   0 0 2 0 1 0 0 0
        ... … … … … … … …   (3 billion letters)
SNP 1 and SNP 2: r^2=1, in LD
SNP 3: r^2=0.7, partial LD
r^2 is squared correlation
Genotypes vs. Haplotypes: phasing
Individuals
1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0
1 0 2 0 1 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
1 0 2 0 1 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 2 0 1 0 0 0
Individuals
1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0
PHASING
Genotypes Haplotypes
Stephens et al. 2001 Am J Hum Genet, Browning et al. 2011 Nat Rev Genet,
Williams et al. 2012 Am J Hum Genet, Delaneau et al. 2013 Nat Methods,
Loh et al. 2016a Nat Genet, Loh et al. 2016b Nat Genet
Genotypes vs. Haplotypes: phasing
Individuals
1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0
1 0 2 0 1 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
1 0 2 0 1 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 2 0 1 0 0 0
Individuals
1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0
PHASING
Genotypes Haplotypes
Stephens et al. 2001 Am J Hum Genet, Browning et al. 2011 Nat Rev Genet,
Williams et al. 2012 Am J Hum Genet, Delaneau et al. 2013 Nat Methods,
Loh et al. 2016a Nat Genet, Loh et al. 2016b Nat Genet
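Why phasing is a real inference problem: a genotype vector with several heterozygous sites is consistent with more than one pair of haplotypes. The sketch below simply enumerates the possibilities for individuals 3 and 5 of the example (genotypes at SNP 1, SNP 2, SNP 3). The function name `consistent_phasings` is illustrative and not taken from any of the cited methods, which go further and pick the most plausible pair using the haplotypes seen in other individuals.

```python
from itertools import product

def consistent_phasings(genotypes):
    """List the unordered haplotype pairs (h1, h2) with h1[i] + h2[i] == genotypes[i]."""
    het = [i for i, g in enumerate(genotypes) if g == 1]
    pairs = set()
    for bits in product([0, 1], repeat=len(het)):
        # Homozygous sites are unambiguous: genotype 0 -> (0, 0), genotype 2 -> (1, 1).
        h1 = [g // 2 for g in genotypes]
        h2 = list(h1)
        # Heterozygous sites: one haplotype carries the 1, the other the 0.
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

ind5 = [1, 1, 1]   # individual 5: heterozygous at all three SNPs
ind3 = [2, 2, 2]   # individual 3: homozygous at all three SNPs

phasings_5 = consistent_phasings(ind5)   # four possible haplotype pairs
phasings_3 = consistent_phasings(ind3)   # a single possible pair
```

With k heterozygous sites there are 2^(k-1) candidate haplotype pairs, which is why homozygous individuals are trivial to phase and multiply heterozygous individuals are not.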
Genotypes vs. Haplotypes: phasing
Individuals
1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0
1 0 2 0 1 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
1 0 2 0 1 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 2 0 1 0 0 0
Individuals
1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0
PHASING
Genotypes Haplotypes
Fact: r2 between SNP1 and SNP2 (phased haplotype data) equals
r2 between SNP1 and SNP2 (unphased genotype data),
assuming Hardy-Weinberg equilibrium holds
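The Fact can be checked on the 8-individual example. This is a minimal sketch: the helper names `r2` and `to_genotypes` are ours, and the rows are copied from the haplotype matrix on this slide. In such a tiny sample Hardy-Weinberg equilibrium does not hold exactly (individual 3 is homozygous at all three SNPs), so the two estimates agree for SNP 1 vs. SNP 2 but differ somewhat for SNP 1 vs. SNP 3.

```python
def r2(x, y):
    """Squared Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mx * my
    vx = sum(a * a for a in x) / n - mx * mx
    vy = sum(b * b for b in y) / n - my * my
    return cov * cov / (vx * vy)

def to_genotypes(haps):
    """Add up the two haplotypes of each individual (consecutive entries)."""
    return [haps[i] + haps[i + 1] for i in range(0, len(haps), 2)]

# Rows 2, 5 and 9 of the haplotype matrix (two haplotypes per individual).
snp1_h = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
snp2_h = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
snp3_h = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

snp1_g, snp2_g, snp3_g = (to_genotypes(h) for h in (snp1_h, snp2_h, snp3_h))

r2_hap_12 = r2(snp1_h, snp2_h)   # 1.0
r2_gen_12 = r2(snp1_g, snp2_g)   # 1.0
r2_hap_13 = r2(snp1_h, snp3_h)   # 9/13, about 0.69
r2_gen_13 = r2(snp1_g, snp3_g)   # 49/62, about 0.79
```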
Linkage Disequilibrium: Haplotype Blocks
Individuals
1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0   SNP 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0   SNP 2
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0   SNP 3
...   (3 billion letters)
These 3 SNPs form a “haplotype block” with two main haplotypes
LD with phased haplotypes: r2 vs. D′
Slatkin 2008 Nat Rev Genet
Consider two SNPs with frequencies pA and pB of alleles A, B.
Let gA refer to # copies (0, 1) of allele A for the first SNP.
Let gB refer to # copies (0, 1) of allele B for the second SNP.
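$$r^2 = \frac{[E(g_A g_B) - E(g_A)E(g_B)]^2}{Var(g_A)\,Var(g_B)} = \frac{(p_{AB} - p_A p_B)^2}{p_A(1-p_A)\,p_B(1-p_B)}$$
where pAB is the frequency of haplotypes carrying both allele A and allele B.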
LD with phased haplotypes: r2 vs. D′
Slatkin 2008 Nat Rev Genet
Consider two SNPs with frequencies pA and pB of alleles A, B.
Suppose pA < pB < 0.5. r2 and D′ are maximized when pAB = pA.
e.g. pA = 0.25, pB = 0.4, pAB = 0.25 => r2 = 0.5, D′ = 1
$$D' = \frac{p_{AB} - p_A p_B}{p_A - p_A p_B} \le 1$$
$$r^2 = \frac{(p_{AB} - p_A p_B)^2}{p_A(1-p_A)\,p_B(1-p_B)} \le \frac{p_A - p_A p_B}{p_B - p_A p_B}$$
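The worked example (pA = 0.25, pB = 0.4, pAB = 0.25) can be checked numerically. In this sketch the function name `ld_stats` is ours, and D′ uses the usual sign-dependent normalization of D by its maximum possible magnitude, which reduces to the slide's denominator pA − pA·pB when pA < pB < 0.5 and D > 0.

```python
def ld_stats(p_a, p_b, p_ab):
    """Return (r2, D') from allele frequencies p_a, p_b and haplotype frequency p_ab."""
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    if d >= 0:
        d_max = min(p_a * (1 - p_b), p_b * (1 - p_a))
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return r2, abs(d) / d_max

r2_value, d_prime = ld_stats(0.25, 0.4, 0.25)   # r2 = 0.5, D' = 1
```

The example shows the practical difference between the two measures: D′ reaches 1 whenever the rarer allele always appears together with the other allele, while r2 reaches 1 only if the two allele frequencies also match.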
LD with unphased diploid genotypes
Slatkin 2008 Nat Rev Genet
Consider two SNPs with frequencies pA and pB of alleles A, B.
Let gA refer to # copies (0, 1, 2) of allele A for the first SNP.
Let gB refer to # copies (0, 1, 2) of allele B for the second SNP.
$$D' = \frac{p_{AB} - p_A p_B}{p_A - p_A p_B} \le 1$$
D′ cannot be directly computed, since pAB relies on phased data!
$$r^2 = \frac{[E(g_A g_B) - E(g_A)E(g_B)]^2}{Var(g_A)\,Var(g_B)} = \ldots$$
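A sketch of why the genotype-based r2 matches the haplotype-based formula, assuming random mating (Hardy-Weinberg equilibrium), so that each genotype is the sum of two independent haplotypes. The shorthand D = pAB − pA·pB is introduced here for convenience.

```latex
\begin{align*}
E(g_A) &= 2p_A, \qquad Var(g_A) = 2p_A(1-p_A), \qquad Var(g_B) = 2p_B(1-p_B),\\
E(g_A g_B) - E(g_A)E(g_B) &= 2\,(p_{AB} - p_A p_B) = 2D,\\
r^2 &= \frac{(2D)^2}{2p_A(1-p_A)\cdot 2p_B(1-p_B)}
     = \frac{(p_{AB} - p_A p_B)^2}{p_A(1-p_A)\,p_B(1-p_B)}.
\end{align*}
```

So r2 can be estimated from unphased genotypes as an ordinary squared correlation, whereas D′ needs pAB itself.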
Approaches to Scientific Understanding
Love is Understanding.
-- Madonna
Data is Understanding.
-- Alkes
Linkage Disequilibrium: Haplotype Blocks
Slatkin 2008 Nat Rev Genet
Haplotype blocks in 216kb region (MHC, chr 6)
x-axis = y-axis = SNP position in region (0 kb to 200 kb)
D′ and L are measures of LD (related to r2)
Red indicates high LD; black indicates low LD
Also see Haploview program, Barrett et al. 2005 Bioinformatics
Linkage Disequilibrium: Haplotype Blocks
Europeans
and Asians
Africans
Gabriel et al. 2002 Science
also see Reich 2001 Nature, Daly 2001 Nat Genet
Linkage Disequilibrium: Haplotype Blocks
African chromosomes: 50% of the genome lies in
haplotype blocks >22kb.
Europeans and Asians: 50% of the genome lies in
haplotype blocks >44kb.
Longer haplotype blocks in Europeans/Asians due to
out-of-Africa population bottleneck: descended from
small number of ancestors who left Africa 60-40 kya.
Gabriel et al. 2002 Science
also see Reich 2001 Nature, Daly 2001 Nat Genet
A brief history of modern humans
Cavalli-Sforza & Feldman 2003 Nat Genet; also see Ramachandran et al. 2005 PNAS,
Mellars 2006 Science, Armitage et al. 2011 Science, Henn et al. 2012 PNAS
A brief history of modern humans, contradicted
Green et al. 2010 Science, Reich et al. 2010 Nature, Meyer et al. 2012 Science,
Sankararaman et al. 2014 Nature, Vernot & Akey 2014 Science
reviewed in Racimo et al. 2015 Nat Rev Genet
• All non-African populations have ~2% of their genomes
descended from Neanderthals.
• Melanesian populations have ~5% of their genomes
descended from Denisovans, a relative of Neanderthals.
Population bottlenecks increase LD
population
bottleneck
population
bottleneck
Cavalli-Sforza & Feldman 2003 Nat Genet; also see Ramachandran et al. 2005 PNAS,
Mellars 2006 Science, Armitage et al. 2011 Science, Henn et al. 2012 PNAS
Population bottlenecks increase LD
due to subsampling haplotypes (genetic drift)
[Figure: haplotype matrix for individuals 1-8 before the bottleneck, two haplotypes per individual, along the genome's 3 billion letters]
SNP 2 and SNP 3: r2=0, NOT in LD
r2 is squared correlation
Population bottlenecks increase LD
due to subsampling haplotypes (genetic drift)
[Figure: haplotype matrix after the bottleneck; only three of the eight individuals remain]
SNP 2 and SNP 3: r2=0.5, partial LD
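The jump from r2=0 to r2=0.5 can be reproduced from the figure's own haplotypes. In this sketch the (SNP 2, SNP 3) allele pairs are read off the slide's matrix; the ordering of individuals within the list is an assumption (it does not affect r2), and the helper names `r2` and `ld` are ours.

```python
def r2(x, y):
    """Squared Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mx * my
    vx = sum(a * a for a in x) / n - mx * mx
    vy = sum(b * b for b in y) / n - my * my
    return cov * cov / (vx * vy)

# (SNP 2, SNP 3) alleles on the two haplotypes of each of the 8 individuals.
before = [
    [(0, 0), (1, 0)], [(0, 1), (0, 1)], [(0, 0), (0, 1)], [(1, 1), (1, 0)],
    [(0, 1), (0, 1)], [(0, 0), (1, 1)], [(0, 0), (0, 1)], [(0, 0), (0, 0)],
]
# The three individuals that remain after the bottleneck in the figure.
survivors = [before[3], before[7], before[5]]

def ld(individuals):
    """r2 between SNP 2 and SNP 3 across all haplotypes of the given individuals."""
    haps = [h for ind in individuals for h in ind]
    return r2([a for a, _ in haps], [b for _, b in haps])

r2_before = ld(before)      # 0.0: SNP 2 and SNP 3 are not in LD
r2_after = ld(survivors)    # 0.5: partial LD created purely by subsampling
```

No recombination rate or mutation changed here; the LD appears only because a handful of haplotypes were sampled, which is the genetic drift effect the slide describes.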
Population bottlenecks increase LD
due to subsampling haplotypes (genetic drift)
[Figure: haplotype matrix for individuals 1-8 after the bottleneck, two haplotypes per individual, along the genome's 3 billion letters]
SNP 2 and SNP 3: r2=0.5, partial LD
r2 is squared correlation
Population bottlenecks increase LD
Conrad et al. 2006 Nat Genet
Average number of haplotypes per genomic region
Outline
1. Introduction to Linkage Disequilibrium
2. LD and Tag SNPs
3. LD and imputation
4. LD and fine-mapping
Linkage Disequilibrium and tag SNPs
Individuals
Cases Controls
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
SNP 1: causal SNP
3 billion
letters
Direct association: genotype SNP1 in Cases and Controls.
Linkage Disequilibrium and tag SNPs
Individuals
Cases Controls
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
SNP 1
3 billion
letters
Indirect association: genotype SNP2 in Cases and Controls.
If SNP1 affects disease risk, then SNP2 will also be associated!
SNP 2
r2=1,
in LD
Linkage Disequilibrium and tag SNPs
Individuals
Cases Controls
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
SNP 1
3 billion
letters
Indirect association: genotype SNP3 in Cases and Controls.
If SNP1 affects disease risk, then SNP3 will also be associated!
SNP 3
r2=0.7,
partial
LD
SNP 2
Linkage Disequilibrium and tag SNPs
Theorem 2 (Pritchard and Przeworski 2001 Am J Hum Genet):
If SNP1 is causal and LD(SNP1,SNP2) = r2, then
Power of an association study of SNP1 with N samples =
Power of an association study of SNP2 with N/r2 samples.
Proof:
Let g1 and g2 be genotypes of SNP1 and SNP2 respectively
and π be phenotype, all normalized to mean 0 and variance 1.
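Armitage Trend Test ($\chi^2 = N\rho(g,\pi)^2$; Armitage 1955 Biometrics):
SNP1 with N samples: $N\rho(g_1,\pi)^2 = N\,E(g_1\pi)^2$
SNP2 with N/r2 samples: $(N/r^2)\,\rho(g_2,\pi)^2 = (N/r^2)\,E(g_2\pi)^2$
$= (N/r^2)\,E([r g_1 + (g_2 - r g_1)]\,\pi)^2$
$= (N/r^2)\,E(r g_1\pi)^2 = N\,E(g_1\pi)^2$. Q.E.D.
(The last step uses the fact that the residual g2 − r·g1 is uncorrelated with π when SNP1 is the causal SNP.)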
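The theorem can be checked exactly on a toy population. The sketch below makes several assumptions that are not in the slides: alleles are coded per haplotype (0/1) rather than per genotype, the phenotype is β·g1 plus independent ±1 noise, β = 0.5 and N = 1000 are arbitrary, and the haplotype frequencies reuse the earlier example (pA = 0.25, pB = 0.4, pAB = 0.25, so r2 = 0.5). Correlations are computed by enumerating the population with weights, so no random sampling is involved.

```python
from math import sqrt

def wcorr(xs, ys, ws):
    """Weighted Pearson correlation (weights are probabilities summing to 1)."""
    mx = sum(w * x for x, w in zip(xs, ws))
    my = sum(w * y for y, w in zip(ys, ws))
    cov = sum(w * (x - mx) * (y - my) for x, y, w in zip(xs, ys, ws))
    vx = sum(w * (x - mx) ** 2 for x, w in zip(xs, ws))
    vy = sum(w * (y - my) ** 2 for y, w in zip(ys, ws))
    return cov / sqrt(vx * vy)

p_a, p_b, p_ab = 0.25, 0.4, 0.25   # SNP1 (causal) and SNP2 (tag)
beta = 0.5                         # hypothetical effect of SNP1 on the phenotype
hap_freq = {(1, 1): p_ab, (1, 0): p_a - p_ab,
            (0, 1): p_b - p_ab, (0, 0): 1 - p_a - p_b + p_ab}

g1s, g2s, phenos, ws = [], [], [], []
for (g1, g2), freq in hap_freq.items():
    for noise in (-1.0, 1.0):          # noise is independent of both SNPs
        g1s.append(g1)
        g2s.append(g2)
        phenos.append(beta * g1 + noise)
        ws.append(freq * 0.5)

r2_ld = wcorr(g1s, g2s, ws) ** 2       # LD between SNP1 and SNP2
rho1_sq = wcorr(g1s, phenos, ws) ** 2  # causal SNP vs. phenotype
rho2_sq = wcorr(g2s, phenos, ws) ** 2  # tag SNP vs. phenotype

n = 1000
chi2_snp1 = n * rho1_sq                # expected statistic, SNP1 with N samples
chi2_snp2 = (n / r2_ld) * rho2_sq      # expected statistic, SNP2 with N/r2 samples
```

Here the tag SNP needs twice as many samples (N/r2 = 2N) to reach the same expected test statistic as the causal SNP.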
Linkage Disequilibrium: Haplotype Blocks
[Figure: haplotypes in a haplotype block, each labeled Case or Control, with the risk haplotype marked]
Question: Which SNP to genotype?
Answer: Choose 1 SNP per haplotype block,
and take advantage of indirect association!
Linkage Disequilibrium: Haplotype Blocks
[Figure: haplotypes in a haplotype block, each labeled Case or Control, with the risk haplotype marked]
Needed: a resource describing the haplotypes
at each location in the genome.
The International HapMap Project: 270 samples from 4 populations
CEU European USA 90 30 trios
YRI Yoruba Nigeria 90 30 trios
CHB Chinese China 45 unrelated
JPT Japanese Japan 45 unrelated
Genetic differences between populations are small
[Figure: two populations compared at two SNPs 11kb apart on chr 1]
C allele of rs10910034: 68% frequency | 50% frequency
A allele of rs260509: 52% frequency | 51% frequency
LD differences between populations are large!
[Figure: two populations compared at two SNPs 11kb apart on chr 1]
C allele of rs10910034: 68% frequency | 50% frequency
A allele of rs260509: 52% frequency | 51% frequency
LD between the two SNPs: r2 = 0.97 | r2 = 0.34
HapMap project: a resource for “SNP tagging”
Individuals
1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0   SNP 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0   SNP 2
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0   SNP 3
(3 billion letters)
SNP1 “tags” this entire haplotype block at an r2 of 0.7
HapMap project: a resource for “SNP tagging”
How to select SNPs to genotype in an association study:
• Choose genomic region(s) of interest.
• Look up HapMap SNPs in the genomic region(s).
• Choose a subset of HapMap SNPs which “tag” haplotype
blocks in the genomic region(s).
(e.g. Tagger algorithm, de Bakker et al. 2005 Nat Genet)
Note: because LD patterns vary by population, it is
important to choose tag SNPs using a HapMap population
similar to the population in the association study.
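The selection step can be illustrated with a simple greedy rule: repeatedly pick the SNP that covers the most still-untagged SNPs at the chosen r2 threshold. This is a simplified sketch, not the Tagger algorithm itself; the function name `greedy_tags` is ours, and the r2 values are those of the 3-SNP example above (SNP 1 and SNP 2 identical, r2 = 0.7 with SNP 3).

```python
def greedy_tags(r2, threshold):
    """Pick tag SNPs until every SNP has r2 >= threshold with at least one tag."""
    snps = sorted(r2)
    untagged = set(snps)
    tags = []
    while untagged:
        def coverage(s):
            # Number of still-untagged SNPs that candidate s would tag.
            return sum(1 for t in untagged if r2[s][t] >= threshold)
        best = max(snps, key=coverage)   # ties go to the first SNP by name
        tags.append(best)
        untagged -= {t for t in untagged if r2[best][t] >= threshold}
    return tags

r2_matrix = {
    "SNP1": {"SNP1": 1.0, "SNP2": 1.0, "SNP3": 0.7},
    "SNP2": {"SNP1": 1.0, "SNP2": 1.0, "SNP3": 0.7},
    "SNP3": {"SNP1": 0.7, "SNP2": 0.7, "SNP3": 1.0},
}

tags_strict = greedy_tags(r2_matrix, 0.8)   # ['SNP1', 'SNP3']
tags_loose = greedy_tags(r2_matrix, 0.7)    # ['SNP1']
```

At a threshold of 0.7 a single SNP tags the whole block; tightening the threshold to 0.8 requires a second tag, which is the trade-off between genotyping cost and power.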
HapMap project: a resource for “SNP tagging”
International HapMap Consortium 2007 Nature; also see Barrett et al. 2006 Nat Genet,
Smith et al. 2006 Genomics, International HapMap Consortium 2005 Nature
How many “tag SNPs” are required?
For the entire genome, the answer is:
Thus, to choose tag SNPs at an r2 of 0.8, we need roughly
1 SNP per 3kb in YRI, or 1 SNP per 5kb in CEU or CHB+JPT
Things aren’t always what they seem
• Estimating LD using a small number of HapMap samples
may lead to overfitting.
• HapMap SNPs are not a random subset of SNPs.
Bhangale et al. 2008 Nat Genet
Things aren’t always what they seem
According to International HapMap Consortium 2007 Nature:
82% of common SNPs are tagged at r2 ≥ 0.8 by Affymetrix 6.0
According to Bhangale et al. 2008 Nat Genet:
66% of common SNPs are tagged at r2 ≥ 0.8 by Affymetrix 6.0
Bhangale et al. 2008 Nat Genet
Multi-SNP tagging
Haplotype        1  2  3  4   [freq. 25% for each haplotype]
SNP1 (causal)    A  A  C  C
SNP2             A  C  C  A
SNP3             A  C  A  C
SNP2 vs. SNP1 and SNP3 vs. SNP1: r2=0, NOT in LD
Multi-SNP tagging
Haplotype        1    2    3    4   [freq. 25% for each haplotype]
SNP1 (causal)    A    A    C    C
SNP2+3           A+A  C+C  C+A  A+C
SNP2+3 vs. SNP1: r2=1, YES in LD
Multi-SNP tagging
Pe’er et al. 2006 Nat Genet
also see Zaitlen et al. 2007 Am J Hum Genet
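The four-haplotype example can be verified directly. In this sketch alleles are coded as 1 for C and 0 for A (an arbitrary choice), and the combined SNP2+3 predictor is coded as 1 when the two alleles differ; the helper name `r2` is ours.

```python
def r2(x, y):
    """Squared Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mx * my
    vx = sum(a * a for a in x) / n - mx * mx
    vy = sum(b * b for b in y) / n - my * my
    return cov * cov / (vx * vy)

# The four equally frequent haplotypes, written as (SNP1, SNP2, SNP3).
haplotypes = [("A", "A", "A"), ("A", "C", "C"), ("C", "C", "A"), ("C", "A", "C")]

snp1 = [int(h[0] == "C") for h in haplotypes]
snp2 = [int(h[1] == "C") for h in haplotypes]
snp3 = [int(h[2] == "C") for h in haplotypes]
snp23 = [int(h[1] != h[2]) for h in haplotypes]   # two-SNP predictor

r2_single_2 = r2(snp1, snp2)    # 0.0
r2_single_3 = r2(snp1, snp3)    # 0.0
r2_multi = r2(snp1, snp23)      # 1.0
```

Neither SNP alone carries any information about the causal SNP, yet the pair determines it completely, which is the motivation for multi-SNP tagging.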
Outline
1. Introduction to Linkage Disequilibrium
2. LD and Tag SNPs
3. LD and imputation
4. LD and fine-mapping
What is imputation?
Marchini et al. 2007 Nat Genet, Howie et al. 2009 PLoS Genet, Li et al. 2010
Genet Epidemiol, Howie et al. 2012 Nat Genet, Fuchsberger et al. 2015 Bioinformatics
Imputation: Why try?
r2 = 0.8
Causal SNP
Marchini et al. 2007 Nat Genet, Howie et al. 2009 PLoS Genet, Li et al. 2010
Genet Epidemiol, Howie et al. 2012 Nat Genet, Fuchsberger et al. 2015 Bioinformatics
Imputation: Why try?
• Increase power to detect disease association at untyped causal SNP
(imputed causal SNP may have stronger association than tag SNP)
• Enable meta-analysis of studies on Affymetrix + Illumina chips
• Improve genotype data quality
Imputation: Algorithms
Hidden Markov Model (HMM) based approaches:
• IMPUTE (Marchini et al. 2007 Nat Genet, Howie et al. 2009 PLoS Genet,
Howie et al. 2012 Nat Genet)
• MACH (Li et al. 2010 Genet Epidemiol)
• fastPHASE/BIMBAM (Scheet/Stephens 2006 AJHG, Servin/Stephens 2007
PLoS Genet, Guan/Stephens 2008 PLoS Genet)
• GEDI (Kennedy et al. 2008 ISBRA)
Localized Haplotype Clustering:
• BEAGLE (Browning/Browning 2007 AJHG, Browning/Browning 2009 AJHG)
Likelihood-based approaches:
• UNPHASED (Dudbridge 2008 Hum Hered)
• SNPMStat (Lin et al. 2008 AJHG)
reviewed in Marchini et al. 2010 Nat Rev Genet; also see Li et al. 2009 ARGHG
Imputation: What do the algorithms output?
Integer-valued genotypes at untyped SNPs
e.g. genotype = 2
OR
Continuous genotype dosages at untyped SNPs
e.g. genotype dosage = 1.79
OR
Continuous genotype probabilities at untyped SNPs
e.g. genotype probabilities P(0) = 0.01, P(1) = 0.19, P(2) = 0.80
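The three output formats are related: the dosage is the expected allele count under the genotype probabilities, and the integer genotype is the most probable one. A minimal sketch using the slide's numbers (variable names are ours):

```python
# Genotype probabilities at one untyped SNP for one individual.
probs = {0: 0.01, 1: 0.19, 2: 0.80}

# Dosage = expected number of copies of the allele.
dosage = sum(g * p for g, p in probs.items())    # 0.19 + 2 * 0.80 = 1.79

# Integer-valued ("best guess") genotype = most probable genotype.
best_guess = max(probs, key=probs.get)           # 2
```

Going from probabilities to a dosage, or from a dosage to an integer genotype, discards information about imputation uncertainty at each step.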
Imputation: People do it.
reviewed in Marchini et al. 2010 Nat Rev Genet; also see Li et al. 2009 ARGHG
HMM-based imputation approaches
[Figure: reference haplotypes hap1-hap5 above the haplotype being imputed (Imp.), whose untyped positions are marked "?"]
reviewed in Marchini et al. 2010 Nat Rev Genet; also see Li et al. 2009 ARGHG
Note: current paradigm is to first phase the data, then run imputation on
phased data (Howie et al. 2012 Nat Genet, Fuchsberger et al. 2015 Bioinformatics)
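A toy version of the idea behind the figure: treat the (already phased) target haplotype as a mosaic copied from the reference haplotypes, with occasional switches between them and occasional copying errors, and run the forward-backward algorithm to get a posterior allele dosage at untyped positions. This is a minimal sketch in the spirit of HMM-based methods, not the implementation of IMPUTE, MACH or any other tool listed; the function name `impute_dosage`, the uniform switch model, and the parameter values `switch=0.05` and `err=0.01` are all assumptions made for illustration.

```python
def impute_dosage(ref_haps, target, switch=0.05, err=0.01):
    """Posterior mean allele at each site of a haploid target (None = untyped)."""
    k, m = len(ref_haps), len(target)

    def emit(j, s):
        # Probability of the observed target allele if copying reference haplotype j.
        if target[s] is None:
            return 1.0                      # untyped site: no information
        return 1.0 - err if ref_haps[j][s] == target[s] else err

    def step(vec, j):
        # Stay on haplotype j, or switch to a uniformly chosen haplotype.
        return (1.0 - switch) * vec[j] + switch * sum(vec) / k

    # Forward pass.
    fwd = [[emit(j, 0) / k for j in range(k)]]
    for s in range(1, m):
        fwd.append([step(fwd[-1], j) * emit(j, s) for j in range(k)])

    # Backward pass (the transition matrix is symmetric, so step() is reused).
    bwd = [[1.0] * k for _ in range(m)]
    for s in range(m - 2, -1, -1):
        nxt = [bwd[s + 1][j] * emit(j, s + 1) for j in range(k)]
        bwd[s] = [step(nxt, j) for j in range(k)]

    dosages = []
    for s in range(m):
        post = [fwd[s][j] * bwd[s][j] for j in range(k)]
        total = sum(post)
        dosages.append(sum(p * ref_haps[j][s] for j, p in enumerate(post)) / total)
    return dosages

ref_haps = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0],
]
target = [1, 1, None, 1, 1]     # middle SNP is untyped
dosages = impute_dosage(ref_haps, target)
imputed = dosages[2]            # close to 1: the target looks like the second haplotype
```

Real tools add recombination maps, much larger reference panels, and many computational shortcuts, but the output has the same form: a probability or dosage rather than a hard call.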
Measuring imputation accuracy
Concordance rate: % of genotypes (or alleles) imputed correctly
• Natural analogue of genotyping error rate in QC analyses
• Concordance rate is often in the range of 95-99%.
Squared correlation (r2) between true and imputed genotype
• Natural analogue of r2 between causal SNP and tag SNP
• r2 << concordance rate, particularly for rare SNPs.
Normalized difference between true and imputed allele frequency
• Measures whether imputation is biased towards ref or var allele
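Why r2 can be far below the concordance rate for rare SNPs: when almost everyone is homozygous for the common allele, nearly all genotypes are "correct" even if the few carriers are imputed poorly. The data below are made up for illustration (100 individuals, 2 true carriers, one of them missed and one non-carrier wrongly called a carrier).

```python
def r2(x, y):
    """Squared Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mx * my
    vx = sum(a * a for a in x) / n - mx * mx
    vy = sum(b * b for b in y) / n - my * my
    return cov * cov / (vx * vy)

# Hypothetical rare SNP: true and imputed genotypes for 100 individuals.
true_g = [1, 1, 0] + [0] * 97
imputed_g = [1, 0, 1] + [0] * 97

concordance = sum(t == i for t, i in zip(true_g, imputed_g)) / len(true_g)   # 0.98
accuracy_r2 = r2(true_g, imputed_g)                                          # about 0.24
```

A 98% concordance rate looks excellent, but an association test at this SNP would behave as if only about a quarter of the samples had been genotyped.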
Imputation using HapMap data
International HapMap3 Consortium 2010 Nature
common SNPs imputed using HapMap2 CEU (N=120): r2 = 0.95
common SNPs imputed using HapMap3 CEU+TSI (N=410): r2 = 0.96
(European-ancestry WTCCC samples, Affymetrix & Illumina chips)
Imputation using HapMap data
International HapMap3 Consortium 2010 Nature
x-axis: MAF<5% SNPs, imputed using HapMap2 CEU (N=120)
y-axis: MAF<5% SNPs, imputed using HapMap3 CEU+TSI (N=410)
Low-coverage sequencing + imputation increases power vs. genotyping arrays
Effective sample size of a GWAS with a $300,000 budget:
                     Cost per sample   Actual #samples   Average imputation r2   Effective #samples
Illumina 1M array    $400              750               1.00                    750
0.4x sequencing      $83*              3,600             0.81**                  2,900
0.1x sequencing      $43*              7,000             0.64**                  4,500
*Based on sample preparation cost of $30/sample, which is conservatively
double the $15/sample reported by Rohland & Reich 2012 Genome Res,
and on $133 per 1x sequencing (Illumina Network cost).
**Imputation r2 attained at Illumina 1M SNPs by downsampling reads from
real off-target exome sequencing data. Relative performance of
low-coverage sequencing will be even higher at non-Illumina 1M SNPs.
Pasaniuc et al. 2012 Nat Genet; also see Cai et al. 2015 Nature, Davies et al. 2016 Nat Genet
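The table's arithmetic can be retraced in a few lines: cost per sample = preparation + coverage × cost per 1x, actual samples = budget / cost, and effective samples = actual samples × imputation r2 (the N/r2 rule from the tag SNP theorem, read in reverse). Variable names are ours, and the results match the slide only up to its rounding.

```python
BUDGET = 300_000      # total GWAS budget in dollars
PREP_COST = 30        # sample preparation, dollars per sample
COST_PER_1X = 133     # sequencing cost in dollars per 1x coverage

designs = {
    "Illumina 1M array": {"cost": 400, "r2": 1.00},
    "0.4x sequencing": {"cost": round(PREP_COST + 0.4 * COST_PER_1X), "r2": 0.81},
    "0.1x sequencing": {"cost": round(PREP_COST + 0.1 * COST_PER_1X), "r2": 0.64},
}

for design in designs.values():
    design["n"] = BUDGET / design["cost"]            # samples the budget can buy
    design["n_eff"] = design["n"] * design["r2"]     # effective sample size

for name, design in designs.items():
    print(f"{name}: ${design['cost']}/sample, "
          f"N = {design['n']:.0f}, effective N = {design['n_eff']:.0f}")
```

Even at r2 = 0.64, the cheaper design buys enough extra samples to give several times the effective sample size of the array.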
Outline
1. Introduction to Linkage Disequilibrium
2. LD and Tag SNPs
3. LD and imputation
4. LD and fine-mapping (to be continued, Tue of Week 4)
Definition of fine-mapping
Manhattan plot from Ikram et al. 2010 PLoS Genet
Which of these SNPs on chr 6 is the biologically causal SNP?
(Ditto for chr 5, 8, 12, 19)
WTCCC fine-mapping study
Maller et al. 2012 Nat Genet
GWAS in Europeans
SNP1: P-value = 10^-8
LD and fine-mapping in Europeans
TCF7L2 locus in T2D: 1 top signal
Maller et al. 2012 Nat Genet
Fine-mapping in Europeans
SNP1: P-value = 10^-8 CAUSAL??
SNP2: P-value = 10^-8 CAUSAL??
LD and fine-mapping in Europeans
FTO locus in T2D: many top signals
Maller et al. 2012 Nat Genet
LD and cross-population fine-mapping
Fine-mapping in Europeans        Fine-mapping in Africans
SNP1: P-value = 10^-8            SNP1: P-value = 10^-5   CAUSAL
SNP2: P-value = 10^-8            SNP2: P-value = 0.62
SNP3: P-value = 0.41             SNP3: P-value = 10^-5
LD in Europeans:
r2    SNP1  SNP2  SNP3
SNP1  1.00  0.99  0.08
SNP2  0.99  1.00  0.07
SNP3  0.08  0.07  1.00
LD in Africans:
r2    SNP1  SNP2  SNP3
SNP1  1.00  0.12  0.98
SNP2  0.12  1.00  0.14
SNP3  0.98  0.14  1.00
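The logic of the slide is a set intersection: each population leaves a different set of SNPs indistinguishable from the causal one, and only the causal SNP should be associated in both. A minimal sketch using the slide's P-values; the cutoff `THRESHOLD = 1e-4` is a hypothetical choice for this toy example, and real multi-ethnic fine-mapping methods (cited on the next slide) are statistical rather than a hard intersection.

```python
p_values = {
    "Europeans": {"SNP1": 1e-8, "SNP2": 1e-8, "SNP3": 0.41},
    "Africans": {"SNP1": 1e-5, "SNP2": 0.62, "SNP3": 1e-5},
}
THRESHOLD = 1e-4   # hypothetical cutoff for calling a SNP "associated"

hits_by_pop = {
    pop: {snp for snp, p in pvals.items() if p < THRESHOLD}
    for pop, pvals in p_values.items()
}

# SNPs associated in every population remain candidates for the causal SNP.
candidates = set.intersection(*hits_by_pop.values())   # {'SNP1'}
```

In Europeans SNP2 rides along with SNP1 (r2 = 0.99), in Africans SNP3 does (r2 = 0.98), and only SNP1 survives in both.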
LD and multi-ethnic fine-mapping
Zaitlen*, Pasaniuc* et al. 2010 Am J Hum Genet
also see Morris 2011 Genet Epidemiol, Udler et al. 2009 Hum Mol Genet,
Wu et al. 2013 PLoS Genet, Peters et al. 2013 PLoS Genet, Liu et al. 2016 Am J Hum Genet
Conclusions
• Linkage Disequilibrium is good, because we can tag most
common SNPs using chips with 1,000,000 SNPs or less.
• Linkage Disequilibrium is good, because we can infer
imputed genotypes at most common HapMap SNPs.
• Linkage Disequilibrium is bad, because it leads to
ambiguity as to the causal SNP when doing fine-mapping.
• Studying multiple populations, especially Africans (low LD),
can improve our ability to localize causal variants.