Population Structure and Association Analysis
02-715 Advanced Topics in Computational Genomics
Population Structure and Association Analysis
• Population structure in data causes false positives
– Samples in the case population are usually more closely related to each other
– Any SNP that is more prevalent in the case population may be found significantly associated with the trait, even without a causal link.
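A small arithmetic sketch of how uneven sampling alone can shift allele frequencies between cases and controls. The allele frequencies and mixing proportions below are made-up illustrative numbers, and `pooled_frequency` is a hypothetical helper name.

```python
def pooled_frequency(mix, freqs):
    """Allele frequency in a sample drawn from several populations.

    mix:   fraction of the sample coming from each population
    freqs: allele frequency of the SNP in each population
    """
    return sum(w * p for w, p in zip(mix, freqs))

# Hypothetical SNP with no effect on the trait.
freqs = (0.8, 0.2)                                   # populations 1 and 2
case_freq = pooled_frequency((2 / 3, 1 / 3), freqs)  # cases over-sample population 1
control_freq = pooled_frequency((0.5, 0.5), freqs)   # controls are balanced

print(case_freq, control_freq)
```

The case and control frequencies differ only because the two groups mix the populations in different proportions, which is the kind of difference a naive association test would flag.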
Accounting for Population Structure in Association Analysis
• Association mapping needs to account for population structure.
• Careful study design: represent each population in the case and control groups in a balanced way.
– Can be hard to control
– Cryptic (hidden) population structure can still bias the results
Family-based Design vs. Population-based Design
• Family-based studies
– The effect of population structure can be controlled by the use of parents' genotypes.
– In practice, collecting genotypes from multiple individuals in a family can be hard (e.g., late-onset diseases).
• Population-based design
– Data collection is easier for a large number of unrelated individuals than for a large number of families.
– The control samples can be reused in different studies.
Accounting for Population Structure in Association Analysis
• Family-based method
– Transmission disequilibrium test (TDT)
• Population-based methods
– Genomic control (Devlin & Roeder, Biometrics 1999)
– Structured association (Pritchard et al., AJHG 2000)
– EigenStrat: principal component analysis (Price et al., Nature Genetics 2006)
Transmission Disequilibrium Test (TDT)
| Transmitted allele | Non-transmitted M | Non-transmitted m | Total |
| --- | --- | --- | --- |
| M | a | b | a+b |
| m | c | d | c+d |
| Total | a+c | b+d | 2N |

• Genotype affected individuals and their parents (trio)
• Null hypothesis: (b/(b+c), c/(b+c)) is compatible with (0.5, 0.5)
• Test statistic is given as $(b-c)^2/(b+c)$
• The non-transmitted alleles play the role of controls
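A minimal sketch of the TDT statistic from the table above. It assumes the statistic is compared against a chi-square distribution with 1 degree of freedom (the standard McNemar-type reference); the function name and the example counts are illustrative only.

```python
import math

def tdt_statistic(b: int, c: int):
    """Return the TDT statistic (b - c)^2 / (b + c) and a 1-df chi-square p-value.

    b: parents who transmitted M and did not transmit m
    c: parents who transmitted m and did not transmit M
    """
    if b + c == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Made-up counts: 60 transmissions of M versus 40 of m from heterozygous parents.
stat, p = tdt_statistic(b=60, c=40)
print(stat, p)
```

Only the off-diagonal cells b and c enter the statistic, since homozygous parents (cells a and d) carry no information about preferential transmission.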
Genomic Control (GC)
• Idea: Use the SNPs that are not associated with the trait to remove the effect of population stratification
• Genotype data consist of
– Candidate genes to be tested
– L supplementary loci (null loci) for estimating the inflation factor λ
• GC uses the inflation factor λ to correct the association statistic of the SNP in the candidate gene
• Limitation: the inflation factor λ is assumed to be the same across the genome, ignoring population admixture
Devlin & Roeder, Biometrics 1999
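A hedged sketch of the genomic control idea. It assumes the median-based estimator (median of the null-locus chi-square statistics divided by the median of a 1-df chi-square, about 0.455) and assumes λ is floored at 1 so statistics are never inflated by the correction. Function names and the numbers are illustrative, not taken from the slides.

```python
import math
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def inflation_factor(null_stats):
    """Estimate lambda from the chi-square statistics at the L null loci."""
    lam = statistics.median(null_stats) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)  # assumption: never scale statistics upward

def gc_corrected(candidate_stat, lam):
    """Divide a candidate statistic by lambda; return (statistic, 1-df p-value)."""
    corrected = candidate_stat / lam
    return corrected, math.erfc(math.sqrt(corrected / 2.0))

null_stats = [0.2, 0.5, 0.91, 1.3, 2.4]  # made-up statistics at 5 null loci
lam = inflation_factor(null_stats)
stat, p = gc_corrected(8.0, lam)
print(lam, stat, p)
```

Because a single λ rescales every candidate statistic, the sketch also makes the stated limitation visible: loci with unusually strong ancestry differentiation receive the same correction as all others.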
STRAT: Structured Association (Pritchard et al., AJHG 2000)
• Idea: Within each subpopulation, an association between a genetic marker and the trait is a true association.
• Two-stage method
– Step 1: Using Structure (Pritchard et al., Genetics 2000) and unlinked genetic markers,
  • estimate the population structure
  • assign sampled individuals to putative subpopulations
– Step 2:
  • Test for association within the subpopulations inferred in Step 1
• Limitation
– Running Structure is computationally demanding
Pritchard et al., AJHG 2000
STRAT: Step 2
• Given the ancestry proportions $q_k^{(i)}$ for population k and individual i, as estimated by STRUCTURE
• H0: the probability model for the genotypes c under the null hypothesis of no association
• H1: the probability model for the genotypes c under the alternative hypothesis of association
STRAT: Step 2
• Likelihood ra8o test:
– Large values indicate that the alternative hypothesis explains the data better.
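The formulas for H0, H1, and the likelihood ratio did not survive in this transcript. As a rough illustration only, the sketch below computes a log-likelihood ratio for the simplified special case in which every individual is assigned entirely to one subpopulation and the locus is biallelic. This is an assumption made for the example and not the full STRAT model, which uses fractional ancestry proportions. All names and counts are hypothetical.

```python
import math

def binom_loglik(count_a, n_alleles):
    """Maximized binomial log-likelihood (dropping the constant term)."""
    if n_alleles == 0 or count_a in (0, n_alleles):
        return 0.0
    p = count_a / n_alleles
    return count_a * math.log(p) + (n_alleles - count_a) * math.log(1 - p)

def stratified_log_lr(counts):
    """Log-likelihood ratio of H1 versus H0, summed over subpopulations.

    counts: one entry per subpopulation, each
            ((case_A, case_total), (control_A, control_total)) in allele counts.
    H0: one allele frequency per subpopulation, shared by cases and controls.
    H1: separate case and control frequencies within each subpopulation.
    """
    log_lr = 0.0
    for (case_a, case_n), (ctrl_a, ctrl_n) in counts:
        ll_h1 = binom_loglik(case_a, case_n) + binom_loglik(ctrl_a, ctrl_n)
        ll_h0 = binom_loglik(case_a + ctrl_a, case_n + ctrl_n)
        log_lr += ll_h1 - ll_h0
    return log_lr

# Frequencies differ between subpopulations but not between cases and controls.
null_like = stratified_log_lr([((80, 200), (40, 100)), ((20, 100), (40, 200))])
# Cases and controls differ within a subpopulation.
assoc_like = stratified_log_lr([((120, 200), (40, 100))])
print(null_like, assoc_like)
```

In the first example the pooled case and control frequencies differ (1/3 versus about 0.27), yet the within-subpopulation ratio stays at zero, which is the point of testing inside each inferred subpopulation.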
Simulation Studies: No Admixture
• Assume two discrete populations
• Simulate genotypes of 150 affected and 150 control individuals at 100 unlinked loci
– With sample size N, we have 2N chromosomes
– Assume the two populations split 0.05N generations ago without migration
– Controls: half of the controls came from each of the two subpopulations
– Affected group: 100 from population 1, 50 from population 2
STRAT: Simulation Results
• Rejection rates under the null hypothesis of no association
• $p_1$, $p_2$: allele frequencies for populations 1 and 2 at the given locus
Simulation Studies: With Admixture
• Assume two discrete populations
• Simulate genotypes of 500 affected and 500 control individuals at 150 unlinked microsatellite loci
– With sample size N, we have 2N chromosomes
– Assume the two populations split 0.15N generations ago, followed by two generations of admixing
– Controls: random draws from the whole population
– Affected group: random draws from the whole population, assuming a disease risk model for grandparents
Structure: Simulation Results
• Learning population structure using genotypes from two recently admixed populations
– Dashed line: case group
STRAT: Simulation Results
• Rejection rates under the null hypothesis
• $p_1$, $p_2$: allele frequencies for populations 1 and 2 at the given locus
TDT vs. STRAT
• TDT
– Requires genotyping parents of the affected offspring
• STRAT
– Requires genotypes for additional loci to infer population structure with STRUCTURE
EigenStrat
• Structured association approach
• Step 1: Run PCA on genotype data to infer the population structure
• Step 2: Perform association analysis after correcting for the population effects in genotype/phenotype data
• Advantage: lower computational cost than STRAT
EigenStrat: Structured Association with PCA
• Step 1: (Inferring Ancestry) PCA is applied to genotype data to infer continuous axes of genetic variation
Price et al., Nature Genetics 2006
What are the new axes?
[Figure: scatter plot of Original Variable A against Original Variable B, with the PC 1 and PC 2 axes overlaid]
• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis
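A minimal sketch of inferring axes of variation from a genotype matrix with PCA, computed through an SVD. It assumes genotypes coded as 0/1/2 allele counts and, as a simplification, scales each SNP by its sample standard deviation, which may differ from the normalization used by EigenStrat itself. The toy data and the function name `top_pcs` are made up.

```python
import numpy as np

def top_pcs(genotypes: np.ndarray, k: int) -> np.ndarray:
    """Return an (individuals x k) matrix holding the top-k axes of variation.

    genotypes: (individuals x SNPs) array of 0/1/2 allele counts.
    """
    g = genotypes.astype(float)
    g = g - g.mean(axis=0)             # center each SNP
    sd = g.std(axis=0)
    g = g[:, sd > 0] / sd[sd > 0]      # drop monomorphic SNPs, scale the rest
    # The left singular vectors of the normalized matrix are the eigenvectors
    # of the individual-by-individual covariance matrix.
    u, _, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :k]

rng = np.random.default_rng(0)
# Toy data: two groups of 50 individuals with very different allele frequencies.
freqs = np.vstack([np.full(200, 0.2), np.full(200, 0.8)])
pop = np.repeat([0, 1], 50)
geno = rng.binomial(2, freqs[pop])     # shape (100, 200)
pcs = top_pcs(geno, k=2)
print(pcs[:3, 0], pcs[-3:, 0])
```

In this toy setting the first axis separates the two groups, so each individual's coordinate on it acts as a continuous ancestry score.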
EigenStrat: Structured Association with PCA
• Step 2: (Removing Ancestry Effects) Genotype at a candidate SNP and phenotype are continuously adjusted by amounts attributable to ancestry along each axis
• Step 3: (Association Test)
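A hedged sketch of Steps 2 and 3. The adjustment is written here as a least-squares regression of the genotype and of the phenotype on the ancestry axes, and the statistic as (N − K − 1) times the squared correlation of the two residual vectors, which is how the EigenStrat test is commonly described. The transcript does not show the slide's formula, so treat both as assumptions. The toy data assume a single ancestry axis that is known exactly.

```python
import numpy as np

def residualize(x: np.ndarray, pcs: np.ndarray) -> np.ndarray:
    """Remove from x the part explained by the ancestry axes plus an intercept."""
    design = np.column_stack([np.ones(len(x)), pcs])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - design @ coef

def eigenstrat_like_stat(genotype, phenotype, pcs):
    """(N - K - 1) times the squared correlation of the adjusted vectors."""
    n, k = pcs.shape
    g_adj = residualize(np.asarray(genotype, dtype=float), pcs)
    y_adj = residualize(np.asarray(phenotype, dtype=float), pcs)
    r = np.corrcoef(g_adj, y_adj)[0, 1]
    return (n - k - 1) * r ** 2

rng = np.random.default_rng(1)
n = 400
pop = np.repeat([0, 1], n // 2)
pcs = (pop - 0.5).reshape(-1, 1)       # assumed ancestry axis
# Genotype and phenotype both depend on ancestry, but not on each other.
genotype = rng.binomial(2, np.where(pop == 0, 0.2, 0.8))
phenotype = rng.binomial(1, np.where(pop == 0, 0.3, 0.7))

naive = eigenstrat_like_stat(genotype, phenotype, np.empty((n, 0)))  # no correction
adjusted = eigenstrat_like_stat(genotype, phenotype, pcs)
print(naive, adjusted)
```

Passing an empty set of axes gives the uncorrected statistic, so the two printed values show how much of the apparent association is attributable to ancestry.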
Simulation Procedure
• Given $F_{ST}$, for each SNP
– Draw an ancestral population allele frequency p from the uniform distribution on [0.1, 0.9]
– Allele frequencies for populations 1 and 2, $p_1$ and $p_2$, are drawn from $\mathrm{Beta}\left(p(1-F_{ST})/F_{ST},\ (1-p)(1-F_{ST})/F_{ST}\right)$
– Draw SNPs using population allele frequencies $p_1$ and $p_2$
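The procedure above can be sketched as follows. Drawing each genotype as a Binomial(2, p) count assumes Hardy-Weinberg equilibrium within each population, which the slide does not state explicitly. The function name, sample sizes, and the value of $F_{ST}$ are illustrative.

```python
import numpy as np

def simulate_snps(n_snps, fst, n_per_pop, rng):
    """Simulate 0/1/2 genotypes for two populations at n_snps SNPs."""
    p = rng.uniform(0.1, 0.9, size=n_snps)           # ancestral allele frequencies
    a = p * (1 - fst) / fst
    b = (1 - p) * (1 - fst) / fst
    p_pop = rng.beta(a, b, size=(2, n_snps))         # row 0 is p1, row 1 is p2
    # Assumption: each genotype is the sum of two independent allele draws.
    geno = [rng.binomial(2, p_pop[k], size=(n_per_pop, n_snps)) for k in range(2)]
    return p, p_pop, np.vstack(geno)

rng = np.random.default_rng(42)
p, p_pop, geno = simulate_snps(n_snps=1000, fst=0.01, n_per_pop=50, rng=rng)
print(geno.shape, float(np.abs(p_pop - p).mean()))
```

The Beta distribution used here has mean p and variance $F_{ST}\,p(1-p)$, so smaller $F_{ST}$ values keep $p_1$ and $p_2$ closer to the ancestral frequency.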
Simulation Study
• Discrete populations vs. admixed populations
• Moderate vs. extreme ancestry differences between cases and controls
– Moderate: control (40% population 1, 60% population 2), case (60% population 1, 40% population 2)
– Extreme: control (0% population 1, 100% population 2), case (50% population 1, 50% population 2)
• Datasets with candidate loci selected as follows
– Random SNPs (no associations)
– Differentiated SNPs (a large difference in allele frequencies between populations, but no associations)
  • Allele frequency 0.8 for population 1, 0.2 for population 2
– Causal SNPs
Simulation Results
| # SNPs required | $F_{ST}$ |
| --- | --- |
| 20,000 | 0.005 |
| 50,000 | 0.002 |
| 100,000 | 0.001 |

• To correct for population stratification, a greater number of SNPs is required for less differentiated populations
PCA for Population Structure Discovery
[Figure: PCA plot. Axis labels: "Genetic variation between northwest and southeast Europe" and "Genetic variation between two southeast European populations"]
European American Dataset
• 488 European Americans genotyped at 116,204 SNPs
• A mutation in the LCT gene is 100% associated with the lactase persistence phenotype
– This mutation was not included in this dataset
– Look for an indirect association between the phenotype and a nearby SNP, rs3769005, which is in 90% LD with the LCT mutation based on HapMap data
• The region on chromosome 2 surrounding the LCT gene is highly associated with the phenotype due to the strong selective sweep in that region.
Association Results for SNPs Outside of Chromosome 2 (LCT gene)
Summary
• Genomic Control
– Cannot handle the effect of admixed populations
• STRAT: structured association with STRUCTURE
– Uses a generative model that explicitly models admixture
– Computationally demanding
• EigenStrat
– Does not provide intuition behind the admixing process
– Significantly lower computational cost than STRAT