EPI 511, Advanced Population and Medical Genetics

Alkes Price

Harvard School of Public Health

January 24 & January 26, 2017


Week 1:

• Intro + HapMap / 1000 Genomes

• Linkage Disequilibrium

EPI 511: Course structure

Week 1: HapMap, 1000G / Linkage disequilibrium

Week 2: Population structure and admixture

Week 3: Population stratification

Week 4: Fine-mapping / Natural selection

Week 5: Heritability / Genetic risk prediction

Week 6: Mixed models / Rare variant analysis

Week 7: Functional interpretation

EPI 511: How to address the instructor

Alkes

Dr. Price

Professor Price

Honorable Professor Price

Honorable Distinguished Dr. Professor Price

EPI 511: Office Hours

Instructor: Alkes

Office Hours: Thu 3:30-4:30pm, Building 2, Room 211

Email Address: [email protected]

(Please put EPI511 in the subject of your email)

Teaching Assistant: Armin

Office Hours: Fri + Mon 2-3pm, Building 2, Room 209










http://sysbiophd.harvard.edu/people/2014/armin-schoech

EPI 511: Course components

• Advance reading: 1 required paper + 1 optional paper per course session

• Lecture + Discussion: discussants: each student to sign up as discussant for 1 class

Video of each class will be posted on the course www site <1hr after class.

• Experiences: 5 take-home projects due Tue Jan 31, …, Tue Feb 28

• short Research Paper: due Fri Mar 10

• self-assessment Opportunity: 20min exam (date will not be announced in advance)

EPI 511: Outcome measures

• Advance reading (0% of course grade): 1 required paper + 1 optional paper per course session

• Lecture + Discussion (0% of course grade): discussants: each student to sign up as discussant for 1 class

• Experiences (60% of course grade): 6 take-home projects (data and programming intensive)

Approaches to Scientific Understanding

Love is Understanding.

-- Madonna

Data is Understanding.

-- Alkes


Understanding Data requires Fixing Bugs.

Genetics + data + programming = bright future

Gewin 2007 Nature Hayden 2012 Nature





• short Research Paper (40% of course grade): 1,000-1,500 words (suggested topics provided on Feb 16)

• self-assessment Opportunity (0% of course grade): 20min exam (date will not be announced in advance)

EPI 511: Policy on group work

Experiences (60% of course grade) 6 take-home projects (data and programming intensive)

• OK to discuss experiences with your colleagues

• Each piece of code you write should be your own

short Research Paper (40% of course grade) 1,000-1,500 words (suggested topics provided on Feb 16)

• OK to discuss the project with your colleagues

• Each piece of code you write should be your own

• Each piece of text you write should be your own


Week 1:

• Introduction + HapMap Project


Outline

1. Introduction to Population Genetics

2. HapMap and HapMap2 projects

3. FST

4. HapMap3 and 1000 Genomes projects

What is Population Genetics?

Population genetics is the study of genetic variation

both within and between human populations.

Are different human populations

actually genetically different?


Slightly.

5-7% of worldwide human genetic variation is due to

genetic differences between human populations.

The remaining 93-95% of human genetic variation is due to

genetic variation within human populations

(Rosenberg et al. 2002 Science).

Why study differences between human populations?

• Learn about human migration patterns and ancient history.

• Improve our power to identify and localize disease genes.

(Rosenberg et al. 2010 Nat Rev Genet; Bustamante et al. 2011 Nature; also see Popejoy & Fullerton 2016 Nature; Williams et al. 2014 Nature)

- Use differences in linkage disequilibrium for fine-mapping.

- Avoid false positives due to population stratification.

- Signals of natural selection at genes related to disease.

Does “race” exist?


Worldwide patterns of human genetic variation are best

described using continuous clines instead of discrete clusters.

(Serre & Paabo 2004 Genome Res)

Racial classifications are inadequate descriptors of the

distribution of human genetic variation.

(Tishkoff & Kidd 2004 Nat Genet)

For a fun time: go to a population genetics party and ask,

Isn’t it politically incorrect to study

differences between human populations?



No. It is not politically incorrect.




“Studies of human population genetics have generated the

strongest proof that there is no scientific basis for racism.”

(Cavalli-Sforza 2005 Nat Rev Genet)

also see Cavalli-Sforza et al. 1994 The History and Geography of Human Genes



The International HapMap Project (International HapMap Consortium 2005 Nature)

CEU (European) CHB (Chinese)

JPT (Japanese) YRI (Nigerian)

CEU northern European USA 90

CHB Chinese China 45

JPT Japanese Japan 44

YRI Yoruba Nigeria 90

The International HapMap Project: 270 samples from 4 populations




Phase I HapMap:

>1,000,000 SNPs




Phase II HapMap:

>3,000,000 SNPs

What is a SNP?

A Single Nucleotide Polymorphism (SNP) is a letter of the

genome that differs in different individuals (e.g. G/T).

What is a SNP?

Each SNP corresponds to one single mutation event in history,

e.g. G mutated to T in one single ancestor.

G = ancestral allele, T = derived allele.

[Figure: coalescent tree (Rosenberg & Nordborg 2002 Nat Rev Genet)]

What is a SNP: physical position

Each SNP has a physical position on a chromosome.

[SNP ID] [chrom.] [physical position (bp)]

rs10910034 1 2165898

rs1713712 1 2166021

… … …

What is a SNP: physical vs. genetic position

Each SNP has a physical and genetic position on a chromosome.

[SNP ID] [chrom.] [physical position (bp)] [genetic position (Morgans)]

rs10910034 1 2165898 0.01904785

rs1713712 1 2166021 0.01904814

… … … …

1 recombination event per Morgan per generation.

Genome-wide recombination rate is about 1cM / Mb.

[cM = centiMorgan = 1/100 Morgan, Mb = Megabase = 10^6 bp]

Thus, 1 Morgan is roughly 100Mb = 10^8 bp on average.
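As a small illustration of the physical vs. genetic position columns, the sketch below computes the local recombination rate implied by the two example SNPs. The function name `local_rate_cm_per_mb` is hypothetical (not part of any course code); the positions are the ones listed on the slide.

```python
def local_rate_cm_per_mb(pos1_bp, pos2_bp, gen1_morgans, gen2_morgans):
    """Recombination rate (cM/Mb) implied by two SNPs' physical and genetic positions."""
    delta_cm = (gen2_morgans - gen1_morgans) * 100.0  # Morgans -> centiMorgans
    delta_mb = (pos2_bp - pos1_bp) / 1e6              # bp -> Megabases
    return delta_cm / delta_mb


# rs10910034 and rs1713712, taken from the table above
rate = local_rate_cm_per_mb(2165898, 2166021, 0.01904785, 0.01904814)
print(round(rate, 2))  # about 0.24 cM/Mb, below the genome-wide average of ~1 cM/Mb
```

The local rate between two nearby SNPs can differ substantially from the genome-wide average, which is why the genetic map is stored per SNP rather than derived from physical position.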

HapMap project: Summary of main results

• 3.1 million SNPs successfully genotyped using Perlegen

genotyping technology (Hinds et al. 2005 Science).

• These 3.1 million SNPs: about 30% of all common SNPs

(defined as SNPs with minor allele frequency >5%).





HapMap: 270 samples from 4 populations

Affymetrix and

Illumina chips


“Properties of SNPs are influenced by discovery sampling …

HapMap relied on nearly any piece of information available.”

Clark et al. 2005 Genome Res; also see Keinan et al. 2007 Nat Genet

Summary of main results, continued

• Understanding genetic differences between populations.

• Patterns of linkage disequilibrium both within and across

populations.

• Most common SNPs in the human genome are in strong

linkage disequilibrium with at least one HapMap SNP

[avg r² ≥ 0.90 in 10 sequenced ENCODE regions].

Genetic differences between HapMap populations (International HapMap Consortium 2005 and 2007 Nature)

77% frequency

68% frequency

50% frequency C allele of rs10910034


FST = 0.19

FST = 0.11

FST = 0.16

Note: FST accounts for

sampling error due to

finite sample size.

Populations can be distinguished using

a large number of genetic markers

[Figure: Principal Components Analysis using 100 markers]

[Figure: Principal Components Analysis using 3 million markers]

Outline



3. FST




Defining vs. Estimating FST

• FST is an underlying parameter that depends on the two

populations, but does not depend on a particular finite sample.

• F̂ST (FST with a hat) is an estimate of the underlying FST that depends on a

particular finite sample that is analyzed.

Weir & Hill 2002 Annu Rev Genet, Bhatia et al. 2013 Genome Res

Defining FST

Definition:

• The FST between two populations is the value such that the

allele frequency difference between the two populations has

mean 0 and variance 2FSTp(1 – p), where p is the allele

frequency in the ancestral population.

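[Figure: ancestral population with allele frequency p splits into two populations with allele frequencies p1 and p2; each branch contributes variance FSTp(1 – p).]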


Defining FST

Definition:





p1 ~ N(p, FSTp(1 – p))




Defining FST

Definition:





p1 ~ Beta(p(1 – FST)/FST, (1 – p)(1 – FST)/FST)
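The Beta form of the definition can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `simulate_population_freqs` and the chosen values of p and FST are illustrative assumptions, not part of the lecture.

```python
import random
import statistics


def simulate_population_freqs(p, fst, n_draws, seed=0):
    """Draw population allele frequencies p1 ~ Beta(p(1-FST)/FST, (1-p)(1-FST)/FST)."""
    rng = random.Random(seed)
    a = p * (1.0 - fst) / fst
    b = (1.0 - p) * (1.0 - fst) / fst
    return [rng.betavariate(a, b) for _ in range(n_draws)]


p, fst = 0.3, 0.1  # illustrative values
draws = simulate_population_freqs(p, fst, n_draws=20_000)
print(statistics.fmean(draws))      # should be close to p
print(statistics.pvariance(draws))  # should be close to fst * p * (1 - p) = 0.021
```

The Beta parameters are chosen so that the mean is p and the variance is exactly FSTp(1 – p), matching the normal approximation on the previous slide while keeping p1 between 0 and 1.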




Defining FST

Definition:





OR

• The FST between two populations is equal to the proportion

of genotypic variance in a set of N individuals from each

population that is attributable to population differences.


Defining FST

Theorem 1:





=>

• The FST between two populations is equal to the proportion

of genotypic variance in a set of N individuals from each

population that is attributable to population differences.

Defining FST

Proof: Let pavg = (p1 + p2)/2.

Total genotypic variance is 2pavg(1 – pavg) ≈ 2p(1 – p)

[Note that individuals are diploid: genotype = 0 or 1 or 2.

Binomial sampling with n=2.]

Defining FST





Genotypic variance attributable to population differences:

Suppose we have N data points with value 2p1, N with value 2p2

After subtracting the average value (p1 + p2), we have

N data points with value (p1 – p2), N with value (p2 – p1).

Since p1 and p2 each have variance FSTp(1 – p), it follows that

(p1 – p2) and (p2 – p1) each have variance 2FSTp(1 – p)


2FSTp(1 – p) / 2p(1 – p) = FST. Q.E.D.

Defining FST

Theorem 1′:





=>

• The proportion of genotypic variance in a set of

αN individuals from population 1 and (1 – α)N individuals

from population 2 that is attributable to population differences

is equal to 4α(1 – α) · FST.
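Theorem 1′ follows from the same argument as Theorem 1 with unequal group sizes. A sketch of the algebra, using the same symbols as the proof of Theorem 1 (group means 2p1 and 2p2, total genotypic variance approximately 2p(1 – p)):

```latex
\begin{align*}
\text{overall mean genotype} &= 2\alpha p_1 + 2(1-\alpha)p_2 \\
\text{between-population variance}
  &= \alpha\,[2(1-\alpha)(p_1-p_2)]^2 + (1-\alpha)\,[2\alpha(p_2-p_1)]^2 \\
  &= 4\alpha(1-\alpha)\,(p_1-p_2)^2 \\
E\!\left[4\alpha(1-\alpha)(p_1-p_2)^2\right]
  &= 4\alpha(1-\alpha)\cdot 2F_{ST}\,p(1-p) \\
\frac{4\alpha(1-\alpha)\cdot 2F_{ST}\,p(1-p)}{2p(1-p)}
  &= 4\alpha(1-\alpha)\,F_{ST}
\end{align*}
```

Setting α = 1/2 recovers Theorem 1, since 4α(1 – α) = 1.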


FST = 0.19: [2FSTp(1 – p)]^(1/2) = 0.31 for p = 0.5

FST = 0.11: [2FSTp(1 – p)]^(1/2) = 0.23 for p = 0.5

FST = 0.16: [2FSTp(1 – p)]^(1/2) = 0.28 for p = 0.5
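The quantity [2FSTp(1 – p)]^(1/2) is the standard deviation of the allele frequency difference p1 – p2, so it gives a feel for what a given FST means in practice. A minimal sketch (the function name `typical_freq_difference` is a hypothetical helper, not course code):

```python
import math


def typical_freq_difference(fst, p=0.5):
    """Standard deviation of (p1 - p2) implied by Var(p1 - p2) = 2*FST*p*(1 - p)."""
    return math.sqrt(2.0 * fst * p * (1.0 - p))


for fst in (0.19, 0.11, 0.16):
    print(fst, round(typical_freq_difference(fst), 2))
```

For a SNP with ancestral frequency 0.5, FST values of 0.11 to 0.19 correspond to typical frequency differences of roughly 0.23 to 0.31.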

Genetic distances (FST) between

European American subpopulations

Ashkenazi

Northwest Eur. Southeast Eur.

FST = 0.009 FST = 0.004

FST = 0.005

Price, Butler et al. 2008 PLoS Genet



FST = 0.009: [2FSTp(1 – p)]^(1/2) = 0.067 for p = 0.5

FST = 0.005: [2FSTp(1 – p)]^(1/2) = 0.050 for p = 0.5

FST = 0.004: [2FSTp(1 – p)]^(1/2) = 0.045 for p = 0.5


Genetic distances (FST) between East Asian subpopulations

Chinese vs. Japanese: FST = 0.007

[2FSTp(1 – p)]^(1/2) = 0.059 for p = 0.5

International HapMap Consortium 2007 Nature


Genetic distances (FST) between West African subpopulations

Yoruba (Nigeria) vs. Luhya (Kenya): FST = 0.008

[2FSTp(1 – p)]^(1/2) = 0.063 for p = 0.5

International HapMap3 Consortium 2010 Nature

How do we estimate FST?

p1 and p2 are allele frequencies in 2 populations

Var(p1 – p2) = 2FSTp(1 – p).

Thus, estimate FST = Var((p1 – p2) / [2p(1 – p)]^(1/2))

= E((p1 – p2)^2 / [2p(1 – p)]).




A PROBLEM: we don’t get to observe p (ancestral frequency)

SOLUTION: approximate p ≈ pavg = (p1 + p2)/2.




A BIGGER PROBLEM: we don’t get to observe p1 and p2.

We only get to observe sample allele frequencies p̂1 and p̂2

in sample sizes N1 (from pop. 1) and N2 (from pop. 2).




SOLUTION:

Since Var(p̂1 – p̂2) ≈ [2FST + 1/(2N1) + 1/(2N2)] p(1 – p), estimate

F̂ST = E([(p̂1 – p̂2)^2 – (1/(2N1) + 1/(2N2))p(1 – p)] / [2p(1 – p)])

(where we approximate p ≈ (p̂1 + p̂2)/2)

some details omitted; see Bhatia et al. 2013 Genome Res



OR, as a ratio of sums over SNPs (where i indexes SNPs):

F̂ST = Σi [(p̂i1 – p̂i2)^2 – (1/(2N1) + 1/(2N2))pi(1 – pi)] / Σi [2pi(1 – pi)]

some details omitted; see Bhatia et al. 2013 Genome Res
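The ratio-of-sums formula translates directly into code. The sketch below follows the simplified estimator on the slide, with p approximated by the average of the two sample frequencies; it is not the full estimator of Bhatia et al. 2013, whose omitted details are not reproduced here. The function name `estimate_fst` is hypothetical.

```python
def estimate_fst(p1_hat, p2_hat, n1, n2):
    """Ratio-of-sums FST estimate from per-SNP sample allele frequencies.

    p1_hat, p2_hat: sample allele frequencies per SNP in populations 1 and 2
    n1, n2: number of diploid individuals sampled from each population
    """
    correction = 1.0 / (2 * n1) + 1.0 / (2 * n2)  # sampling-noise term
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1_hat, p2_hat):
        p = (a + b) / 2.0                 # approximation p ~ (p1_hat + p2_hat)/2
        het = p * (1.0 - p)
        numerator += (a - b) ** 2 - correction * het
        denominator += 2.0 * het
    return numerator / denominator


# One SNP with sample frequencies 0.6 and 0.4 in 50 individuals per population
print(estimate_fst([0.6], [0.4], n1=50, n2=50))
```

Summing numerator and denominator separately before dividing means that SNPs with very low heterozygosity do not dominate the estimate, which is the motivation for the ratio-of-sums form.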

Drift vs. Divergence

[Figure: population tree for YRI, CHB and CEU. Drift (FST) values on the tree: 0.02, 0.04, 0.07, 0.10. Divergence (per 1000bp of DNA) between chromosomes from YRI, CEU and CHB (samples NA18488, NA06989, NA18597): 0.84, 0.60, 0.57.]

Keinan et al. 2007 Nat Genet

Drift vs. Divergence

[Figure: population tree for YRI, CHB and CEU. Drift (FST) values on the tree: 0.02, 0.04, 0.05, 0.10. Divergence (generations) between chromosomes from YRI, CEU and CHB (samples NA18488, NA06989, NA18597): ~30K gen., ~20K gen., ~20K gen.]

Based on mut. rate 1.2–1.8 x 10^-8 (Kong et al. 2012 Nature, Sun et al. 2012 Nat Genet)

Keinan et al. 2007 Nat Genet







HapMap: 270 samples from 4 populations

Affymetrix and

Illumina chips

Perkel 2008 Nat Methods

The HapMap Project:

Work is done, relax on beach?

Beyond HapMap: what the world still needs

• Larger sample sizes for analyses of linkage disequilibrium

• More complete representation of world population diversity

e.g. South Asian and Native American genetic variation

• Analyses of copy number variation (CNV)

• Low-frequency variants (minor allele frequency <5%)

The International HapMap3 Project:

1,260 samples from 11 diverse populations






TSI Tuscan Italy 90

CHD Chinese USA 100

LWK Luhya Kenya 90

MKK Maasai Kenya 180

ASW African-American USA 90

MXL Mexican-American USA 90

GIH Gujarati-American USA 90

HapMap3: 1,260 samples from 11 populations

The HapMap3 project




• Analyses of copy number variation (CNV)



Data generation: SNPs and CNVs

Affymetrix 6.0 array

900K SNPs

940K copy-number probes

Illumina Infinium 1M array

1M SNPs, of which

80K targeted at CNV regions

1.5M SNPs passed QC in all populations

(99.3% concordance for 250K SNPs on both arrays)

Note: only 1.5M SNPs, versus 3.1 million SNPs in HapMap2


Not all HapMap3 populations are

similar to a population from HapMap

[HapMap3 population] [Closest pop. from HapMap] [FST]

TSI (Tuscan) CEU 0.004

CHD (Chinese) CHB 0.001

LWK (Luhya) YRI 0.008

MKK (Maasai) YRI 0.03

ASW (African-American) YRI 0.01

MXL (Mexican-American) CEU 0.04

GIH (Gujarati-American) CEU 0.04




HapMap3 data: individual files

CEU.ind:

NA06989 F CEU

NA11891 M CEU

NA11843 M CEU

NA12341 F CEU

NA12739 M CEU

…

[sample ID] [sex] [popname]

HapMap3 data: SNP files

CEU.snp:

rs10458597 1 0.0 554484 C T

rs2185539 1 0.0 556738 C T

rs11240767 1 0.0 718814 C T

rs12564807 1 0.0 724325 A G

rs3131972 1 0.0 742584 G A

…

[SNP ID] [chr] [0.0] [position] [ref] [var]

HapMap3 data: genotype files

CEU.geno:

2222222222… [Each line is 1 SNP, each column is 1 indiv.]

2222222222…

2222222222…

2222222222…

1121212112…

…

[Number of copies of reference allele: 0 or 1 or 2.

9 denotes missing data.]

Note: the HapMap3 data files for this course are restricted to

~700K SNPs that are common (MAF>5%) in every population.
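The three file types can be parsed with a few lines of code. This is a minimal sketch based only on the column layouts shown above; the function names are hypothetical, and treating the third `.snp` column (shown as 0.0) as a genetic position is an assumption.

```python
def read_ind_line(line):
    """Parse one .ind line: [sample ID] [sex] [popname]."""
    sample_id, sex, popname = line.split()
    return {"id": sample_id, "sex": sex, "pop": popname}


def read_snp_line(line):
    """Parse one .snp line: [SNP ID] [chr] [0.0] [position] [ref] [var]."""
    snp_id, chrom, genpos, physpos, ref, var = line.split()
    return {"id": snp_id, "chr": chrom,
            "genetic_pos": float(genpos),  # assumption: third column is a genetic position
            "position": int(physpos), "ref": ref, "var": var}


def parse_geno_line(line):
    """Parse one .geno line: one character per individual, 9 = missing (None)."""
    return [None if ch == "9" else int(ch) for ch in line.strip()]


def reference_allele_frequency(genotypes):
    """Reference allele frequency over non-missing genotypes (None if all are missing)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(called) / (2.0 * len(called))


snp = read_snp_line("rs3131972 1 0.0 742584 G A")
geno = parse_geno_line("1121212112")  # first 10 individuals of the fifth SNP shown above
print(snp["id"], reference_allele_frequency(geno))
```

Each genotype counts copies of the reference allele out of 2, so the frequency divides by twice the number of non-missing individuals.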



Common Disease/Common Variant hypothesis

Lander 1996 Science; Reich & Lander 2001 Trends Genet

reviewed in Gibson 2012 Nat Rev Genet, Visscher et al. 2012 Am J Hum Genet

“For common diseases, there will be one or a few

predominating disease alleles with relatively high frequencies at

each of the major underlying disease loci”

Are rare and low-frequency variants important?

(to be continued, Thu of Week 6)

Visscher et al. 2012 Am J Hum Genet, Gibson 2012 Nat Rev Genet, Kaiser 2012 Science

HapMap3 1Mb pilot sequencing study

and 1000 Genomes pilot projects


1000 Genomes Project Consortium 2010 Nature

• HapMap3 pilot sequencing: ten 100kb regions spanning 1Mb (high coverage: Sanger sequencing)

692 individuals from 10 HapMap3 populations

• 1000 Genomes Trio pilot project: Genome-wide (high coverage: 42x)

6 individuals (one CEU trio and one YRI trio)

• 1000 Genomes Low-coverage pilot project: Genome-wide (low coverage: 2x-6x)

179 individuals from CEU, YRI, CHB, JPT populations

• 1000 Genomes Exon pilot project: 8,140 exons spanning 1.4Mb from 906 genes (high coverage: >50x)

697 individuals from 7 HapMap3 populations

Sample size and SNP discovery (per Mb)


The 1000 Genomes (1000G) Project

Sequence the entire genomes of 1,092 individuals:

379 of European ancestry (Europe and USA)

286 of East Asian ancestry (Asia)

246 of African ancestry (Africa and USA)

181 of Latino ancestry (Latin America and USA)

Use next-generation sequencing technologies (~4x coverage):

e.g. Illumina, 454, SOLiD (read lengths 25-400bp)

(Metzker 2010 Nat Rev Genet, Davey et al. 2011 Nat Rev Genet,

also see Nielsen et al. 2011 Nat Rev Genet)


1000G project: Summary of main results

• 38 million SNPs discovered and successfully genotyped.

Most of these are rare and low-frequency variants.

• The 38 million SNPs include

99.7% of all SNPs with minor allele frequency 5%

98% of all SNPs with minor allele frequency 1% ***

50% of all SNPs with minor allele frequency 0.1%

based on an independent UK European sample.

***: stated goal to identify >95% of SNPs with frequency 1%

was successfully achieved.


Common variants are shared across populations,

but rare variants are often population-private


1000G project: the final phase






489 of South Asian ancestry (South Asia and USA)


Illumina only (read lengths 70-400bp only)

85 million SNPs, of which 64 million have MAF<0.5%

Related resource: UK10K project: 7x WGS of 3,781 UK samples

(UK10K Consortium 2015 Nature; also see Gudbjartsson et al. 2015 Nat Genet)



1000 Genomes Project Consortium 2015 Nature; also see UK10K Consortium

2015 Nature, Gudbjartsson et al. 2015 Nat Genet, McCarthy et al. 2016 Nat Genet

What about rare variants?

• The 1000G project has identified most low-frequency variants

(minor allele frequency 1%-5%). These variants can be placed

on genotyping arrays or imputed (see Thu of Week 1)


• Rare variants: most have not been identified by 1000 Genomes!

Must sequence disease samples directly.

Past focus has been mostly on exome sequencing, but

now shifting to whole-genome sequencing.


Kiezun et al. 2012 Nat Genet, Tennessen et al. 2012 Science, Pasaniuc et al. 2012 Nat Genet,

Purcell et al. 2014 Nature, Do et al. 2015 Nature, Cai et al. 2015 Nature. Reviewed in

Goldstein et al. 2013 Nat Rev Genet, Lee et al. 2014 Am J Hum Genet, Zuk et al. 2014 PNAS

Conclusions

• Human populations are slightly genetically different.

These differences may be important for disease mapping.

(see Thu slides: Linkage Disequilibrium.)

• FST quantifies differences between human populations.

• HapMap, HapMap2, HapMap3 and 1000 Genomes projects

provide a valuable resource for common & low-frequency

variants (but most rare variants have not yet been identified).


Week 1:

• Intro + HapMap / 1000 Genomes






Outline

1. Introduction to Linkage Disequilibrium

2. LD and Tag SNPs

3. LD and imputation

4. LD and fine-mapping

Linkage Disequilibrium

Definition: Linkage Disequilibrium (LD) refers to

correlations between genotypes of nearby markers.

Linkage Disequilibrium Association Studies

Linkage Disequilibrium Linkage Mapping

(reviewed in Ott et al. 2015 Nat Rev Genet)

Linkage Disequilibrium

Linkage Disequilibrium: Example

Individuals (8 individuals, two haplotypes each; one row per position, out of 3 billion letters)

        A A   A A   A A   A A   A A   A A   A A   A A
SNP 1   G A   G G   G G   A A   G G   G A   G G   G A
        T T   T T   T T   T T   T T   T T   T T   T T
        A A   A A   A A   A A   A A   A A   A A   A A
SNP 2   C G   C C   C C   G G   C C   C G   C C   C C
        T T   T T   T T   T T   T T   T T   T T   T T
        G G   G G   G G   G G   G G   G G   G G   G G
        C C   T T   C T   T C   T T   C T   C T   C C
        A A   A A   A A   A A   A A   A A   A A   A A
        ...   ...   ...   ...   ...   ...   ...   ...

YES, in LD (SNP 1 and SNP 2)


[Same 8 individuals as on the previous slide, now with SNP 3 (C/T) marked.]

SNP 1 and SNP 2: YES, in LD

SNP 3: NOT in LD


Individuals (8 individuals, two haplotypes each; alleles coded 0/1; one row per position, out of 3 billion letters)

        0 0   0 0   0 0   0 0   0 0   0 0   0 0   0 0
SNP 1   0 1   0 0   0 0   1 1   0 0   0 1   0 0   0 0
        0 0   0 0   0 0   0 0   0 0   0 0   0 0   0 0
        0 0   0 0   0 0   0 0   0 0   0 0   0 0   0 0
SNP 2   0 1   0 0   0 0   1 1   0 0   0 1   0 0   0 0
        0 0   0 0   0 0   0 0   0 0   0 0   0 0   0 0
        0 0   0 0   0 0   0 0   0 0   0 0   0 0   0 0
SNP 3   0 0   1 1   0 1   1 0   1 1   0 1   0 1   0 0
        0 0   0 0   0 0   0 0   0 0   0 0   0 0   0 0
        ...   ...   ...   ...   ...   ...   ...   ...

SNP 1 and SNP 2: r²=1, in LD

SNP 3: r²=0, NOT in LD

r² is squared correlation


Individuals (two haplotypes each; alleles coded 0/1; one row per position, out of 3 billion letters)

         1     2     3     4     5     6     7     8
        0 0   0 0   0 0   0 0   0 0   0 0   0 0   0 0
SNP 1   0 1   0 0   1 1   0 0   0 1   0 0   0 0   0 0
        0 0   0 0   0 0   0 0   0 0   0 0   0 0   0 0
        0 0   0 0   0 0   0 0   0 0   0 0   0 0   0 0
SNP 2   0 1   0 0   1 1   0 0   0 1   0 0   0 0   0 0
        0 0   0 0   0 0   0 0   0 0   0 0   0 0   0 0
        0 0   0 0   0 0   0 0   0 0   0 0   0 0   0 0
        0 0   0 0   0 0   0 0   0 0   0 0   0 0   0 0
SNP 3   0 0   0 0   1 1   0 0   0 1   0 0   0 0   0 0
        ...   ...   ...   ...   ...   ...   ...   ...

SNP 1 and SNP 2: r²=1, in LD

SNP 3: r²=0.7, partial LD



Individuals (genotypes; one row per position, out of 3 billion letters)

        1 2 3 4 5 6 7 8
        0 0 0 0 0 0 0 0
SNP 1   1 0 2 0 1 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
SNP 2   1 0 2 0 1 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
SNP 3   0 0 2 0 1 0 0 0
        ... … … … … … … …

SNP 1 and SNP 2: r²=1, in LD

SNP 3: r²=0.7, partial LD


Genotypes vs. Haplotypes: phasing

Individuals

1 2 3 4 5 6 7 8

0 0 0 0 0 0 0 0

1 0 2 0 1 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

1 0 2 0 1 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 2 0 1 0 0 0

Individuals

1 2 3 4 5 6 7 8

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0

PHASING

Genotypes Haplotypes

Stephens et al. 2001 Am J Hum Genet, Browning et al. 2011 Nat Rev Genet,

Williams et al. 2012 Am J Hum Genet, Delaneau et al. 2013 Nat Methods,

Loh et al. 2016a Nat Genet, Loh et al. 2016b Nat Genet


Individuals

1 2 3 4 5 6 7 8

0 0 0 0 0 0 0 0

1 0 2 0 1 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

1 0 2 0 1 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 2 0 1 0 0 0

Individuals

1 2 3 4 5 6 7 8

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0

PHASING


Stephens et al. 2001 Am J Hum Genet, Browning et al. 2011 Nat Rev Genet,

Williams et al. 2012 Am J Hum Genet, Delaneau et al. 2013 Nat Methods,

Loh et al. 2016a Nat Genet, Loh et al. 2016b Nat Genet




Fact: r2 between SNP1 and SNP2 (phased haplotype data) equals

r2 between SNP1 and SNP2 (unphased genotype data),

assuming Hardy-Weinberg equilibrium holds
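To make the comparison concrete, the sketch below computes r² as a squared Pearson correlation, once from the phased haplotype rows for SNP 1 and SNP 3 shown on the phasing slide and once from the corresponding unphased genotypes. The helper names `r_squared` and `to_genotypes` are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov * cov / (var_x * var_y)


def to_genotypes(haps):
    """Collapse consecutive pairs of haplotype alleles into genotypes (0, 1 or 2)."""
    return [haps[i] + haps[i + 1] for i in range(0, len(haps), 2)]


# 16 phased chromosomes (8 individuals) for SNP 1 and SNP 3, copied from the slide
hap_snp1 = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
hap_snp3 = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

r2_hap = r_squared(hap_snp1, hap_snp3)
r2_geno = r_squared(to_genotypes(hap_snp1), to_genotypes(hap_snp3))
print(round(r2_hap, 3), round(r2_geno, 3))
```

In this 8-individual toy sample the two values are similar but not identical (about 0.69 from haplotypes versus about 0.79 from genotypes), because Hardy-Weinberg equilibrium does not hold exactly in such a small sample; the equality stated above is an expectation under that assumption.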

Linkage Disequilibrium: Haplotype Blocks

[Figure: haplotypes of 8 individuals (two each) with SNP 1, SNP 2 and SNP 3 marked, out of 3 billion letters; same data as in the r²=0.7 example.]

These 3 SNPs form a “haplotype block” with two main haplotypes

LD with phased haplotypes: r2 vs. D′

Slatkin 2008 Nat Rev Genet

Consider two SNPs with frequencies pA and pB of alleles A, B.

Let gA refer to # copies (0, 1) of allele A for the first SNP.

Let gB refer to # copies (0, 1) of allele B for the second SNP.

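r^2 = \frac{[E(g_A g_B) - E(g_A)E(g_B)]^2}{\mathrm{Var}(g_A)\,\mathrm{Var}(g_B)} = \frac{(p_{AB} - p_A p_B)^2}{p_A(1 - p_A)\,p_B(1 - p_B)}

(where pAB is the frequency of haplotypes carrying both allele A and allele B)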




Suppose pA < pB < 0.5.

r^2 = \frac{(p_{AB} - p_A p_B)^2}{p_A(1 - p_A)\,p_B(1 - p_B)}

D' = \frac{p_{AB} - p_A p_B}{p_A - p_A p_B}




Suppose pA < pB < 0.5. r² and D′ are maximized when pAB = pA.

D' = \frac{p_{AB} - p_A p_B}{p_A - p_A p_B} \le 1

r^2 = \frac{(p_{AB} - p_A p_B)^2}{p_A(1 - p_A)\,p_B(1 - p_B)} \le \frac{p_A - p_A p_B}{p_B - p_A p_B}




e.g. pA = 0.25, pB = 0.4, pAB = 0.25 => r² = 0.5, D′ = 1
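The worked example can be reproduced from the haplotype frequencies alone. The sketch below uses the standard definitions (D = pAB – pA·pB, with D′ scaled by the largest |D| attainable given the allele frequencies); the function name `ld_from_haplotype_freqs` is hypothetical, and returning the absolute value of D′ is a convention chosen here, not something stated on the slides.

```python
def ld_from_haplotype_freqs(p_a, p_b, p_ab):
    """Return (r2, |D'|) from allele frequencies p_a, p_b and haplotype frequency p_ab."""
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    if d >= 0:
        d_max = min(p_a * (1 - p_b), p_b * (1 - p_a))
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return r2, d_prime


r2, d_prime = ld_from_haplotype_freqs(p_a=0.25, p_b=0.4, p_ab=0.25)
print(round(r2, 3), round(d_prime, 3))
```

When pA < pB < 0.5 and D is positive, the scaling factor reduces to pA – pA·pB, which is the denominator used on the slide. The example shows that D′ can equal 1 while r² is well below 1, because r² also penalizes mismatched allele frequencies.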

LD with unphased diploid genotypes



Let gA refer to # copies (0, 1, 2) of allele A for the first SNP.

Let gB refer to # copies (0, 1, 2) of allele B for the second SNP.

r^2 = \frac{[E(g_A g_B) - E(g_A)E(g_B)]^2}{\mathrm{Var}(g_A)\,\mathrm{Var}(g_B)} = \dots

D' = \frac{p_{AB} - p_A p_B}{p_A - p_A p_B} \le 1

D′ cannot be directly computed, since pAB relies on phased data!






Haplotype blocks in 216kb region (MHC, chr 6)

[Figure: LD heat map; x-axis = y-axis = SNP position in region (0 kb to 200 kb). Red indicates high LD; black indicates low LD.]

D′ and L are measures of LD (related to r²)

Also see Haploview program, Barrett et al. 2005 Bioinformatics


[Figure panels: Europeans and Asians; Africans]

Gabriel et al. 2002 Science

also see Reich 2001 Nature, Daly 2001 Nat Genet


African chromosomes: 50% of the genome lies in

haplotype blocks >22kb.

Europeans and Asians: 50% of the genome lies in

haplotype blocks >44kb.

Longer haplotype blocks in Europeans/Asians due to

out-of-Africa population bottleneck: descended from

small number of ancestors who left Africa 60-40 kya.

Gabriel et al. 2002 Science

also see Reich 2001 Nature, Daly 2001 Nat Genet

A brief history of modern humans

Cavalli-Sforza & Feldman 2003 Nat Genet; also see Ramachandran et al. 2005 PNAS,

Mellars 2006 Science, Armitage et al. 2011 Science, Henn et al. 2012 PNAS

A brief history of modern humans, contradicted

Green et al. 2010 Science, Reich et al. 2010 Nature, Meyer et al. 2012 Science,

Sankararaman et al. 2014 Nature, Vernot & Akey 2014 Science

reviewed in Racimo et al. 2015 Nat Rev Genet

• All non-African populations have ~2% of their genomes

descended from Neanderthals.

• Melanesian populations have ~5% of their genomes

descended from Denisovans, a relative of Neanderthals.

Population bottlenecks increase LD

[Figure: two population bottlenecks marked]

Cavalli-Sforza & Feldman 2003 Nat Genet; also see Ramachandran et al. 2005 PNAS,

Mellars 2006 Science, Armitage et al. 2011 Science, Henn et al. 2012 PNAS


[Figure: haplotypes of 8 individuals (two each) with SNP 2 and SNP 3 marked, out of 3 billion letters.]

SNP 2 and SNP 3: r²=0, NOT in LD



Population bottlenecks increase LD due to subsampling haplotypes (genetic drift)

[Figure: the same 8 individuals, with SNP 2 and SNP 3 marked.]

SNP 2 and SNP 3: r²=0, NOT in LD




[Only 3 of the 8 individuals are shown; two haplotypes each; one row per position, out of 3 billion letters]

        0 0   0 0   0 0
        1 1   0 0   0 1
        0 0   0 0   0 0
        0 0   0 0   0 0
SNP 2   1 1   0 0   0 1
        0 0   0 0   0 0
        0 0   0 0   0 0
SNP 3   1 0   0 0   0 1
        0 0   0 0   0 0
        ...   ...   ...

SNP 2 and SNP 3: r²=0.5, partial LD



[Figure: haplotypes of 8 individuals (two each) with SNP 2 and SNP 3 marked, out of 3 billion letters.]

SNP 2 and SNP 3: r²=0.5, partial LD


Conrad et al. 2006 Nat Genet

Average number of haplotypes per genomic region

Outline


2. LD and Tag SNPs



Linkage Disequilibrium and tag SNPs

Individuals

Cases Controls

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

SNP 1: causal SNP

3 billion

letters

Direct association: genotype SNP1 in Cases and Controls.


Indirect association: genotype SNP2 in Cases and Controls.

If SNP1 affects disease risk, then SNP2 will also be associated!

SNP 1 and SNP 2: r²=1, in LD


[Figure: same genotype matrix of Cases and Controls, with SNP 1, SNP 2 and SNP 3 marked; SNP 3 has r2=0.7, partial LD]

Indirect association: genotype SNP3 in Cases and Controls.

If SNP1 affects disease risk, then SNP3 will also be associated!
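The r2 values quoted on these slides can be checked directly from the rows of the genotype matrix. The sketch below is only an illustration: it assumes each matrix entry is a 0/1 allele indicator for one individual, and it assumes (from the layout of the extracted slide) that the two identical rows are SNP 1 and SNP 2 and that the last non-zero row is SNP 3. The function name `r_squared` is hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length 0/1 vectors."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("r2 is undefined for a monomorphic SNP")
    return cov * cov / (var_x * var_y)


# Rows copied from the slide matrix; the SNP labels are an assumption.
snp1 = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
snp2 = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
snp3 = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

r2_12 = r_squared(snp1, snp2)  # identical rows
r2_13 = r_squared(snp1, snp3)  # rows differ in one individual
print(round(r2_12, 2), round(r2_13, 2))
```

Under these assumptions the identical rows give r2 = 1 and the third row gives r2 = 9/13, about 0.69, in line with the rounded "r2=0.7, partial LD" label.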


Theorem 2 (Pritchard and Przeworski 2001 Am J Hum Genet):

If SNP1 is causal and LD(SNP1,SNP2) = r2, then

Power of an association study of SNP1 with N samples =

Power of an association study of SNP2 with N/r2 samples.












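Proof:

Let $g_1$ and $g_2$ be the genotypes of SNP1 and SNP2 respectively, and $\pi$ the phenotype, all normalized to mean 0 and variance 1. Let $r$ be the correlation between $g_1$ and $g_2$, so that LD(SNP1,SNP2) = $r^2$.

Armitage Trend Test ($\chi^2 = N\rho(g,\pi)^2$; Armitage 1955 Biometrics):

SNP1 with $N$ samples: $N\rho(g_1,\pi)^2 = N\,E(g_1\pi)^2$

SNP2 with $N/r^2$ samples:

$$(N/r^2)\,\rho(g_2,\pi)^2 = (N/r^2)\,E(g_2\pi)^2 = (N/r^2)\,E\big([r g_1 + (g_2 - r g_1)]\,\pi\big)^2 = (N/r^2)\,E(r g_1 \pi)^2 = N\,E(g_1\pi)^2. \quad \text{Q.E.D.}$$

Here $E(\cdot)^2$ denotes the square of the expectation. The term $E((g_2 - r g_1)\pi)$ drops out because the residual $g_2 - r g_1$ is uncorrelated with $g_1$ and, since SNP1 is the causal SNP, is taken to be uncorrelated with $\pi$.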
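As a numerical illustration of Theorem 2, the sketch below plugs hypothetical values into the trend-test statistic N ρ². The correlation `rho_causal`, the sample size and the function names are assumptions made for the example; the only facts taken from the slides are the N ρ² form of the statistic and the N/r2 rule.

```python
def expected_chisq(n_samples, rho):
    """Trend-test statistic N * rho^2, using the population correlation rho."""
    return n_samples * rho ** 2


def samples_needed_at_tag(n_causal, r2):
    """Sample size at a tag SNP that matches n_causal samples at the causal SNP."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    return n_causal / r2


rho_causal = 0.1   # hypothetical correlation between causal SNP and phenotype
r2 = 0.7           # LD between causal SNP and tag SNP, as in the slide example
n_causal = 1000    # hypothetical sample size

# Correlation of the tag SNP with the phenotype is attenuated by r.
rho_tag = (r2 ** 0.5) * rho_causal
n_tag = samples_needed_at_tag(n_causal, r2)

stat_causal = expected_chisq(n_causal, rho_causal)
stat_tag = expected_chisq(n_tag, rho_tag)
print(round(n_tag), round(stat_causal, 6), round(stat_tag, 6))
```

With r2 = 0.7, about 1,429 samples at the tag SNP give the same expected statistic as 1,000 samples at the causal SNP.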


[Figure: haplotypes carried by Cases and Controls, with the Risk haplotype highlighted]

Question: Which SNP to genotype?

Answer: Choose 1 SNP per haplotype block,

and take advantage of indirect association!


[Figure: haplotypes carried by Cases and Controls, with the Risk haplotype highlighted]

Needed: a resource describing the haplotypes

at each location in the genome.

The International HapMap Project: 270 samples from 4 populations

Code  Population  Location  Samples  Structure
CEU   European    USA       90       30 trios
YRI   Yoruba      Nigeria   90       30 trios
CHB   Chinese     China     45       unrelated
JPT   Japanese    Japan     45       unrelated

Genetic differences between populations are small

C allele of rs10910034: 68% frequency vs. 50% frequency

A allele of rs260509: 52% frequency vs. 51% frequency

(the two SNPs are 11kb apart on chr 1; each pair of values compares two populations)

LD differences between populations are large!

C allele of rs10910034: 68% frequency vs. 50% frequency

A allele of rs260509: 52% frequency vs. 51% frequency

LD between the two SNPs (11kb apart on chr 1): r2 = 0.97 vs. r2 = 0.34

HapMap project: a resource for “SNP tagging”

[Figure: genotype matrix for Individuals 1-8 along the genome (3 billion letters), with SNP 1, SNP 2 and SNP 3 marked within one haplotype block]

SNP1 “tags” this entire haplotype block at an r2 of 0.7


How to select SNPs to genotype in an association study:

• Choose genomic region(s) of interest.

• Look up HapMap SNPs in the genomic region(s).

• Choose a subset of HapMap SNPs which “tag” haplotype

blocks in the genomic region(s).

(e.g. Tagger algorithm, de Bakker et al. 2005 Nat Genet)

Note: because LD patterns vary by population, it is

important to choose tag SNPs using a HapMap population

similar to the population in the association study.
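The selection step above can be sketched as a greedy set-cover over pairwise r2 values. This is a simplified illustration, not the Tagger implementation: the function names, the r2 ≥ 0.8 threshold default and the toy r2 values for two hypothetical populations are all assumptions made for the example.

```python
def symmetric(pairs):
    """Build a symmetric lookup table r2[a][b] from {(a, b): value} pairs."""
    table = {}
    for (a, b), value in pairs.items():
        table.setdefault(a, {})[b] = value
        table.setdefault(b, {})[a] = value
    return table


def greedy_tag_snps(r2, threshold=0.8):
    """Repeatedly pick the SNP that captures the most still-untagged SNPs."""
    snps = sorted(r2)
    untagged = set(snps)
    tags = []
    while untagged:
        def captured(candidate):
            # A SNP captures itself and any untagged SNP in high LD with it.
            return {s for s in untagged
                    if s == candidate or r2[candidate][s] >= threshold}

        best = max(snps, key=lambda s: len(captured(s)))
        tags.append(best)
        untagged -= captured(best)
    return tags


# Toy pairwise r2 values for two hypothetical populations.
pop_a = symmetric({("SNP1", "SNP2"): 0.99, ("SNP1", "SNP3"): 0.08,
                   ("SNP2", "SNP3"): 0.07})
pop_b = symmetric({("SNP1", "SNP2"): 0.12, ("SNP1", "SNP3"): 0.98,
                   ("SNP2", "SNP3"): 0.14})

tags_a = greedy_tag_snps(pop_a)
tags_b = greedy_tag_snps(pop_b)
print(tags_a, tags_b)
```

In this toy example the same three SNPs need different tag sets in the two populations, which is the reason for choosing tags in a reference population similar to the study population.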


International HapMap Consortium 2007 Nature; also see Barrett et al. 2006 Nat Genet,

Smith et al. 2006 Genomics, International HapMap Consortium 2005 Nature

How many “tag SNPs” are required?

For the entire genome, the answer is:

Thus, to choose tag SNPs at an r2 of 0.8, we need roughly

1 SNP per 3kb in YRI, or 1 SNP per 5kb in CEU or CHB+JPT
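As a rough back-of-the-envelope check, the spacings above can be converted into genome-wide tag SNP counts. The sketch assumes a genome length of 3 billion letters, the figure used on the earlier slides; the variable names are hypothetical.

```python
GENOME_LENGTH_BP = 3_000_000_000  # "3 billion letters"

# Approximate tag SNP spacing at r2 >= 0.8, in base pairs, from the slide.
SPACING_BP = {"YRI": 3_000, "CEU": 5_000, "CHB+JPT": 5_000}

tag_snps_needed = {pop: GENOME_LENGTH_BP // spacing
                   for pop, spacing in SPACING_BP.items()}

for pop, count in tag_snps_needed.items():
    print(f"{pop}: about {count:,} tag SNPs")
```

This gives roughly 1,000,000 tag SNPs for YRI and 600,000 for CEU or CHB+JPT.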

Things aren’t always what they seem


• Estimating LD using a small number of HapMap samples

may lead to overfitting.

• HapMap SNPs are not a random subset of SNPs.



Bhangale et al. 2008 Nat Genet


According to International HapMap Consortium 2007 Nature:

82% of common SNPs are tagged at r2 ≥ 0.8 by Affymetrix 6.0

According to Bhangale et al. 2008 Nat Genet:

66% of common SNPs are tagged at r2 ≥ 0.8 by Affymetrix 6.0

Bhangale et al. 2008 Nat Genet

Multi-SNP tagging

Haplotype      1  2  3  4   [freq. 25% for each haplotype]
SNP1 (causal)  A  A  C  C
SNP2           A  C  C  A
SNP3           A  C  A  C

r2=0 between SNP1 and SNP2, and between SNP1 and SNP3: NOT in LD

Multi-SNP tagging

Haplotype      1    2    3    4     [freq. 25% for each haplotype]
SNP1 (causal)  A    A    C    C
SNP2+3         A+A  C+C  C+A  A+C

r2=1 between SNP1 and the SNP2+3 combination: YES in LD

Multi-SNP tagging

Pe’er et al. 2006 Nat Genet

also see Zaitlen et al. 2007 Am J Hum Genet
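The multi-SNP tagging example can be verified numerically from the four equally frequent haplotypes. The sketch below is an illustration only: it codes allele A as 0 and allele C as 1 (an arbitrary choice), and represents the SNP2+3 combination by a hypothetical indicator of whether SNP2 and SNP3 carry different alleles.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov * cov / (var_x * var_y)


def encode(alleles):
    """Code allele A as 0 and allele C as 1 (arbitrary coding)."""
    return [0 if allele == "A" else 1 for allele in alleles]


# Alleles on haplotypes 1-4, each with frequency 25%, from the slide.
snp1 = encode("AACC")
snp2 = encode("ACCA")
snp3 = encode("ACAC")

# Multi-marker predictor: 1 when SNP2 and SNP3 carry different alleles.
snp2_plus_3 = [a ^ b for a, b in zip(snp2, snp3)]

r2_single = (r_squared(snp1, snp2), r_squared(snp1, snp3))
r2_multi = r_squared(snp1, snp2_plus_3)
print(r2_single, r2_multi)
```

Neither SNP2 nor SNP3 alone is correlated with SNP1, but the two-SNP combination predicts SNP1 perfectly, matching the r2=0 and r2=1 labels on the slides.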

Outline


2. LD and Tag SNPs



What is imputation?

Marchini et al. 2007 Nat Genet, Howie et al. 2009 PLoS Genet, Li et al. 2010

Genet Epidemiol, Howie et al. 2012 Nat Genet, Fuchsberger et al. 2015 Bioinformatics



Imputation: Why try?

• Increase power to detect disease association at untyped causal SNP

(imputed causal SNP may have stronger association than tag SNP)

[Figure: tag SNP in partial LD (r2 = 0.8) with the untyped Causal SNP]

• Enable meta-analysis of studies on Affymetrix + Illumina chips

• Improve genotype data quality

Imputation: Algorithms

Hidden Markov Model (HMM) based approaches:

• IMPUTE (Marchini et al. 2007 Nat Genet, Howie et al. 2009 PLoS Genet,

Howie et al. 2012 Nat Genet)

• MACH (Li et al. 2010 Genet Epidemiol)

• fastPHASE/BIMBAM (Scheet/Stephens 2006 AJHG, Servin/Stephens 2007

PLoS Genet, Guan/Stephens 2008 PLoS Genet)

• GEDI (Kennedy et al. 2008 ISBRA)

Localized Haplotype Clustering:

• BEAGLE (Browning/Browning 2007 AJHG, Browning/Browning 2009 AJHG)

Likelihood-based approaches:

• UNPHASED (Dudbridge 2008 Hum Hered)

• SNPMStat (Lin et al. 2008 AJHG)

reviewed in Marchini et al. 2010 Nat Rev Genet; also see Li et al. 2009 ARGHG

Imputation: What do the algorithms output?

Integer-valued genotypes at untyped SNPs

e.g. genotype = 2

OR

Continuous genotype dosages at untyped SNPs

e.g. genotype dosage = 1.79

OR

Continuous genotype probabilities at untyped SNPs

e.g. genotype probabilities P(0) = 0.01, P(1) = 0.19, P(2) = 0.80
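The three output types are related: a dosage is the expected genotype under the genotype probabilities, and an integer-valued genotype can be taken as the most probable one. The sketch below uses the probabilities from the example above; the function names are hypothetical, and taking the most probable genotype is one common convention rather than something the slides specify.

```python
def dosage(probs):
    """Expected allele count: 0*P(0) + 1*P(1) + 2*P(2)."""
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    return sum(genotype * p for genotype, p in enumerate(probs))


def best_guess_genotype(probs):
    """Most probable genotype (0, 1 or 2)."""
    return max(range(len(probs)), key=lambda genotype: probs[genotype])


probs = (0.01, 0.19, 0.80)  # P(0), P(1), P(2) from the slide
example_dosage = dosage(probs)
example_genotype = best_guess_genotype(probs)
print(round(example_dosage, 2), example_genotype)
```

For these probabilities the dosage is 0.19 + 2 × 0.80 = 1.79 and the best-guess genotype is 2, consistent with the examples on the slide.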

Imputation: People do it.


HMM-based imputation approaches

[Figure: reference haplotypes hap1-hap5 and the imputed haplotype (Imp.), with untyped positions marked "?"]

Note: current paradigm is to first phase the data, then run imputation on

phased data (Howie et al. 2012 Nat Genet, Fuchsberger et al. 2015 Bioinformatics)


Measuring imputation accuracy

Concordance rate: % of genotypes (or alleles) imputed correctly

• Natural analogue of genotyping error rate in QC analyses

• Concordance rate is often in the range of 95-99%.

Squared correlation (r2) between true and imputed genotype

• Natural analogue of r2 between causal SNP and tag SNP

• r2 << concordance rate, particularly for rare SNPs.


Normalized difference between true and imputed allele frequency

• Measures whether imputation is biased towards ref or var allele
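To see why r2 can be far below the concordance rate for a rare SNP, consider the toy data below. The data are hypothetical: 100 allele indicators with 2 true carriers, where imputation finds one carrier, misses the other, and adds one false carrier.

```python
def concordance(true, imputed):
    """Fraction of entries imputed correctly."""
    matches = sum(1 for t, i in zip(true, imputed) if t == i)
    return matches / len(true)


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov * cov / (var_x * var_y)


# Hypothetical rare SNP: 2 carriers among 100 entries.
true_alleles = [1, 1, 0] + [0] * 97
imputed_alleles = [1, 0, 1] + [0] * 97  # one miss, one false carrier

rate = concordance(true_alleles, imputed_alleles)
r2 = r_squared(true_alleles, imputed_alleles)
print(rate, round(r2, 2))
```

Here 98% of entries are concordant, yet r2 is only about 0.24, because almost all of the concordance comes from the common allele.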

Imputation using HapMap data


common SNPs imputed using HapMap2 CEU (N=120): r2 = 0.95

common SNPs imputed using HapMap3 CEU+TSI (N=410): r2 = 0.96

(European-ancestry WTCCC samples, Affymetrix & Illumina chips)

[Figure: imputation accuracy for MAF<5% SNPs. x-axis: imputed using HapMap2 CEU (N=120); y-axis: imputed using HapMap3 CEU+TSI (N=410)]

Low-coverage sequencing + imputation

increases power vs. genotyping arrays

Effective sample size of a GWAS with a $300,000 budget:

                    Cost per  Actual     Average        Effective
                    sample    #samples   imputation r2  #samples
Illumina 1M array   $400      750        1.00           750
0.4x sequencing     $83*      3,600      0.81**         2,900
0.1x sequencing     $43*      7,000      0.64**         4,500

Pasaniuc et al. 2012 Nat Genet; also see Cai et al. 2015 Nature, Davies et al. 2016 Nat Genet

*Based on sample preparation cost of $30/sample, which is conservatively

double the $15/sample reported by Rohland & Reich 2012 Genome Res,

and on $133 per 1x sequencing (Illumina Network cost).

**Imputation r2 attained at Illumina 1M SNPs by downsampling reads from

real off-target exome sequencing data. Relative performance of

low-coverage sequencing will be even higher at non-Illumina 1M SNPs.
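The table entries follow from simple arithmetic on the budget, the per-sample cost and the imputation r2. The sketch below reproduces them under the assumption that the effective sample size is the actual sample size multiplied by the average imputation r2, and that per-sample costs are rounded to the dollar as on the slide; the variable and function names are hypothetical.

```python
BUDGET = 300_000     # total GWAS budget in dollars
PREP_COST = 30       # sample preparation cost per sample
COST_PER_1X = 133    # sequencing cost per 1x coverage


def sequencing_cost(coverage):
    """Per-sample cost of low-coverage sequencing, rounded to the dollar."""
    return round(PREP_COST + COST_PER_1X * coverage)


# (cost per sample, average imputation r2) for each design
designs = {
    "Illumina 1M array": (400, 1.00),
    "0.4x sequencing": (sequencing_cost(0.4), 0.81),
    "0.1x sequencing": (sequencing_cost(0.1), 0.64),
}

results = {}
for name, (cost, r2) in designs.items():
    actual = BUDGET / cost
    results[name] = (cost, actual, actual * r2)
    print(f"{name}: ${cost}, {actual:.0f} samples, {actual * r2:.0f} effective")
```

The unrounded values (about 3,614 and 6,977 actual samples, about 2,928 and 4,465 effective samples) agree with the rounded figures in the table.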

Outline


2. LD and Tag SNPs


4. LD and fine-mapping (to be continued, Tue of Week 4)

Definition of fine-mapping

Manhattan plot from Ikram et al. 2010 PLoS Genet

Which of these SNPs on chr 6 is the biologically causal SNP?

(Ditto for chr 5, 8, 12, 19)

WTCCC fine-mapping study

Maller et al. 2012 Nat Genet

LD and fine-mapping in Europeans

TCF7L2 locus in T2D: 1 top signal

GWAS in Europeans

SNP1: P-value = 10^-8

LD and fine-mapping in Europeans

FTO locus in T2D: many top signals

Fine-mapping in Europeans

SNP1: P-value = 10^-8 CAUSAL??

SNP2: P-value = 10^-8 CAUSAL??


LD and cross-population fine-mapping

       Fine-mapping in Europeans   Fine-mapping in Africans
SNP1   P-value = 10^-8             P-value = 10^-5
SNP2   P-value = 10^-8             P-value = 0.62
SNP3   P-value = 0.41              P-value = 10^-5

LD in Europeans

r2     SNP1   SNP2   SNP3
SNP1   1.00   0.99   0.08
SNP2   0.99   1.00   0.07
SNP3   0.08   0.07   1.00

LD in Africans

r2     SNP1   SNP2   SNP3
SNP1   1.00   0.12   0.98
SNP2   0.12   1.00   0.14
SNP3   0.98   0.14   1.00

LD and cross-population fine-mapping

       Fine-mapping in Europeans   Fine-mapping in Africans
SNP1   P-value = 10^-8             P-value = 10^-5   CAUSAL
SNP2   P-value = 10^-8             P-value = 0.62
SNP3   P-value = 0.41              P-value = 10^-5

LD in Europeans

r2     SNP1   SNP2   SNP3
SNP1   1.00   0.99   0.08
SNP2   0.99   1.00   0.07
SNP3   0.08   0.07   1.00

LD in Africans

r2     SNP1   SNP2   SNP3
SNP1   1.00   0.12   0.98
SNP2   0.12   1.00   0.14
SNP3   0.98   0.14   1.00
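The reasoning in this example can be written as a simple intersection: a SNP that is causal in both populations should be associated in both, whereas a SNP that is merely in LD with the causal SNP in one population need not be. The sketch below is a heuristic illustration only, not the statistical method of the papers cited for multi-ethnic fine-mapping; the significance cutoff of 1e-4 and the function name are assumptions made for the example.

```python
# P-values from the slide, per population.
p_values = {
    "Europeans": {"SNP1": 1e-8, "SNP2": 1e-8, "SNP3": 0.41},
    "Africans": {"SNP1": 1e-5, "SNP2": 0.62, "SNP3": 1e-5},
}

THRESHOLD = 1e-4  # hypothetical significance cutoff for this toy example


def associated_snps(pvals, threshold):
    """SNPs whose P-value is at or below the cutoff in one population."""
    return {snp for snp, p in pvals.items() if p <= threshold}


per_population = {pop: associated_snps(pvals, THRESHOLD)
                  for pop, pvals in p_values.items()}

# Candidate causal SNPs: associated in every population studied.
candidates = set.intersection(*per_population.values())
print(per_population, candidates)
```

Europeans alone cannot separate SNP1 from SNP2 (r2 = 0.99), and Africans alone cannot separate SNP1 from SNP3 (r2 = 0.98); only SNP1 is associated in both.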

LD and multi-ethnic fine-mapping

Zaitlen*, Pasaniuc* et al. 2010 Am J Hum Genet

also see Morris 2011 Genet Epidemiol, Udler et al. 2009 Hum Mol Genet,

Wu et al. 2013 PLoS Genet, Peters et al. 2013 PLoS Genet, Liu et al. 2016 Am J Hum Genet

Conclusions

• Linkage Disequilibrium is good, because we can tag most

common SNPs using chips with 1,000,000 SNPs or fewer.

• Linkage Disequilibrium is good, because we can impute

genotypes at most common HapMap SNPs.

• Linkage Disequilibrium is bad, because it leads to

ambiguity as to the causal SNP when doing fine-mapping.

• Studying multiple populations, especially Africans (low LD),

can improve our ability to localize causal variants.

