
Human Genetics and Gene Mapping of Complex Traits

Advanced Genetics, Spring 2019

Human Genetics Series

Thursday 4/4/19

Nancy L. Saccone, Ph.D.

Dept of Genetics

[email protected] / 314-747-3263

What is different about Human Genetics (recall from Cristina Strong's lectures)

• Imprinting – uniquely mammalian

• Trinucleotide repeat diseases – "anticipation"

• Can study complex behaviors and cognition (neurogenetics)

• Extensive sequence variation leads to common/complex

disease

1. Common disease, common variant hypothesis

2. Large # of small-effect variants

3. Large # of large-effect rare variants

4. Combo of genotypic, environmental, epigenetic interactions

Greg Gibson, Nature Rev Genet 2012

Outline and learning objectives

• Human disease mapping approaches:

• Linkage analysis

• Disease models

• Recombination, LOD scores

• Genetic association analyses

• Use of regression-based tests

• "Effect sizes" (beta and odds ratio), variance explained

• Genome-Wide Association Studies (GWAS)

• Multiple test correction: what and why

Mapping disease genes

• Linkage

• quantify co-segregation of trait and genotype in families

• Association

• Common design: case-control sample of unrelateds, analyzed for allele frequency differences

[Figure: genotypes (AA, AC, CC) at one marker in a sample of unrelated cases and a sample of unrelated controls]

LOD score traditionally used to measure statistical evidence for linkage

Comparing Linkage and Association

Linkage mapping:

• Requires family data

• Disease travels with marker allele within families (close genetic distance between disease locus and marker)

• Relationship between same allele and trait need not exist across the full sample (e.g. across different families)

• Robust to allelic heterogeneity: if different mutations occur within the same gene/locus, the method works

• Signals for complex traits tend to be broad (~20 Mb)

Association mapping:

• Unrelated cases/controls OR case/parents OR family design

• Disease is associated with marker allele that may be either causative or in linkage disequilibrium with causal variant

• Works only if association exists at the population level

• Not robust to allelic heterogeneity

• Association signals generally not as broad

Human DNA sequence variation

• Single nucleotide polymorphisms (SNPs)

• Single nucleotide variants (SNVs)

Strand 1: A A C C A T A T C ... C G A T T ...

Strand 2: A A C C A T A T C ... C A A T T ...

Strand 3: A A C C C T A T C ... C G A T T ...

• Provide biallelic markers

• Coding SNPs may directly affect protein products

• Non-coding SNPs still may affect gene regulation or expression

• Low-error, high-throughput technology

• Common in genome

Number of SNPs in dbSNP over time

[Figure legend: solid: cumulative # of non-redundant SNPs; dotted: validated; dashed: double-hit]

From: The Intl HapMap Consortium, Nature 2005, 437:1299-1320.

Number of SNVs in dbSNP over time

[Figure legend: dotted: validated]

From: Fernald et al., Bioinformatics challenges for personalized medicine, Bioinformatics 27:1741-1748, 2011


2017 update from NCBI dbSNP build 151, Oct 6, 2017: RefSNP clusters (rs numbers)

https://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi

[Figure: dbSNP build 151 summary table; highlighted values: 660 and 113]


Questions that we can answer with SNPs:

• Which genetic loci influence risk for common human

diseases/traits? (Disease gene mapping studies,

including GWAS – genome-wide association studies)

• Which genetic loci influence efficacy/safety of drug

therapies? (Pharmacogenetics)

• Population genetics questions

• evidence of selection

• identification of recombination hotspots

Figure from Strachan and Read, Human Molecular Genetics

Example of autosomal dominant (fully penetrant, no phenocopies)

General hallmarks:

All affected have at least one affected parent, so the disease occurs

in all generations above the latest observed case.

The disease does not appear in descendants of two unaffecteds.

Possible molecular explanation: harmful antimorph where one copy

sufficient to cause harm/dysfunction.


Example of autosomal recessive (fully penetrant, no phenocopies).

General hallmarks:

Many/most affecteds have two unaffected parents, so the disease

appears to skip generations.

On average, 1/4 of (carrier x carrier) offspring are affected.

(Affected x unaffected) offspring are usually unaffected (but carriers)

(Affected x affected) offspring are all affected.



Possible molecular explanation:

- loss of function allele - hypomorph or null where one wild-type

allele produces enough protein for normal function, but loss of

both wild-type alleles results in disease.

Classic models of disease

Classical autosomal dominant inheritance (no phenocopies,

fully penetrant).

Penetrance table:

f++ f+D fDD

0 1 1

Often the dominant allele is rare, so that the probability of

homozygous DD individuals occurring is negligible.

Classical autosomal recessive inheritance (no phenocopies,

fully penetrant).

Penetrance table:

f++ f+d fdd

0 0 1

Genetic models of disease

Other examples of penetrance tables (locus-specific):

f++ f+d fdd

0 1 1

0 0 1

0 0 0.9

0.1 1 1

0.1 0.8 0.8

Incomplete/reduced penetrance: when the risk genotype's effect

on phenotype is not always expressed/observed. (e.g. due to

environmental interaction, modifier genes)

Phenocopy: individual who develops the disease/phenotype in the

absence of "the" risk genotype (e.g. through environmental

effects, heterogeneity of genetic effects)
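A penetrance table can be turned into population-level quantities. The sketch below is illustrative only: it assumes Hardy-Weinberg genotype proportions and a hypothetical disease allele frequency of 0.01 (neither is given on the slides), and the function names are made up for this example. It computes the disease prevalence implied by each row of the table above, and the share of affected individuals who carry no disease allele at this locus (phenocopies relative to the ++ genotype).

```python
def genotype_freqs(p_d):
    """Frequencies of (++, +d, dd), ASSUMING Hardy-Weinberg proportions."""
    p_wt = 1.0 - p_d
    return (p_wt * p_wt, 2.0 * p_d * p_wt, p_d * p_d)


def prevalence(penetrances, p_d):
    """Population prevalence: sum over genotypes of frequency x penetrance."""
    return sum(f * g for f, g in zip(penetrances, genotype_freqs(p_d)))


def phenocopy_fraction(penetrances, p_d):
    """Share of affected individuals whose genotype is ++ (no disease allele)."""
    affected_wt = penetrances[0] * genotype_freqs(p_d)[0]
    return affected_wt / prevalence(penetrances, p_d)


P_D = 0.01  # hypothetical disease allele frequency, for illustration only
rows = [(0, 1, 1), (0, 0, 1), (0, 0, 0.9), (0.1, 1, 1), (0.1, 0.8, 0.8)]
for row in rows:
    print(row, "prevalence:", round(prevalence(row, P_D), 6),
          "phenocopy share:", round(phenocopy_fraction(row, P_D), 3))
```

Under these assumptions, rows with f++ = 0 have no phenocopies, while a modest f++ = 0.1 combined with a rare allele means most affected individuals are phenocopies.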

Part I: Human linkage studies

General "linkage screen" approach:

• Recruit families

• Genotype individuals at marker loci along the genome

• If a marker locus is "near" the trait-influencing locus,

the parental alleles from the same grandparent at

these two loci "tend to be inherited together"

(recombination between the two loci is rare)

Need to track co-segregation of trait and markers (number

of recombination events among observed meioses)

Linkage analysis concepts

[Figure: example of meiosis resulting in some recombinant gametes. The parent carries haplotypes A1 B1 and A2 B2, each inherited from one grandparent. After replication into sister chromatids, non-sister chromatids may cross over between the loci, giving gametes A1 B1, A1 B2, A2 B1 and A2 B2.]

Recombination with respect to 2 loci: has occurred when, in the individual's haplotype (i.e. in a gamete inherited from the parent), the alleles at the 2 loci come from the 2 different grandparents.

NOTE: genotype does not always indicate haplotype.

Doubly heterozygous genotype A1A2 B1B2 could have one of 2 phases: A1 B1 / A2 B2 or A1 B2 / A2 B1.

Testing for linkage

• L = likelihood function for the observed data

• LOD score: $\mathrm{LOD} = \log_{10}\left[ L(\hat{\theta}) / L(\tfrac{1}{2}) \right]$

• Traditionally, a LOD >= 3 (or 3.6) is considered "significant"

Let $\theta$ = the probability of recombination between 2 given loci (range, 0 to ½)

Null hypothesis: $\theta$ = ½

Alternative hypothesis: $\theta = \hat{\theta}$, the max likelihood estimate of $\theta$

Linkage analysis in pedigrees

First, how to detect linkage between two genotyped marker loci.

Example: phase known, fully heterozygous parents

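[Pedigree figure: one parent has genotype 1 2 at both marker loci and the other parent has 3 4 at both loci; one child has genotypes 1 3 and 1 4 at the two loci, the other child has 2 3 at both loci]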

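Can explicitly count: 1 R (recombinant), 3 NR (non-recombinant).

Binomial probability distribution gives us a likelihood function for the observed data:

$L(\theta) = \binom{4}{1}\,\theta^{1}(1-\theta)^{3}$

$\hat{\theta}$ = (# recombs) / (# meioses) = 1/4,

LOD = $\log_{10}\left[ L(\tfrac{1}{4}) / L(\tfrac{1}{2}) \right]$ = 0.227 << traditional LOD threshold of 3 or 3.6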


In general, for the case where all recombinants and non-recombinants can be explicitly counted:

• Say we have k recombinants in a total of m meioses. Then

$L(\theta) = \binom{m}{k}\,\theta^{k}(1-\theta)^{m-k}$, where $\hat{\theta} = \dfrac{\#\,\text{recombs}}{\#\,\text{meioses}} = \dfrac{k}{m}$

(Unless k/m > ½, in which case $\hat{\theta}$ = ½)

In cases where we cannot definitively count all recombs/non-recombs (e.g. if phase is not known), the likelihood function becomes more complex.
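The counted-recombinant case is easy to check numerically. The following is a minimal sketch (the function names are illustrative, not from the lecture) that evaluates the binomial likelihood at the maximum likelihood estimate k/m, capped at ½, and at the null value ½, and takes the log10 of the ratio.

```python
import math


def likelihood(theta, k, m):
    """Binomial likelihood of k recombinants among m meioses."""
    return math.comb(m, k) * theta**k * (1.0 - theta) ** (m - k)


def lod_score(k, m):
    """Return (LOD, theta_hat) when recombinants can be counted directly."""
    theta_hat = min(k / m, 0.5)  # the estimate is capped at 1/2
    lod = math.log10(likelihood(theta_hat, k, m) / likelihood(0.5, k, m))
    return lod, theta_hat


lod, theta_hat = lod_score(1, 4)  # the pedigree example: 1 R, 3 NR
print(theta_hat, round(lod, 3))  # 0.25 0.227
```

The binomial coefficient cancels in the ratio, so only the counts of recombinants and non-recombinants matter for the LOD score.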


Next, how do we compute the LOD score indicating

linkage between a disease locus and a marker locus?

Key: must assume a disease model. This allows us to

assign possible genotypes at the disease locus.

Disease model specifies:

disease gene frequency pD

penetrances for the 3 genotypic states


Example: Suppose the disease allele D is fully penetrant, dominant, rare. Suppose phenocopies do not occur.

[Pedigree figure: parents with marker genotypes 1 2 and 3 3; children with marker genotypes 1 3 and 2 3]

Penetrance table:

dd   Dd   DD

0.0  1.0  1.0


The same example, with genotypes at the disease locus assigned under this model:

Father: marker 1 2, disease locus D d

Mother: marker 3 3, disease locus d d

Child 1: marker 1 3, disease locus D d

Child 2: marker 2 3, disease locus d d

Mother not informative.

Phase of father unknown. For simplicity, suppose we know phase of the father is 1 D / 2 d.

Then: 0 recombs, 2 non-recombs

$L(\theta) = \binom{2}{0}\,\theta^{0}(1-\theta)^{2}$
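When the father's phase really is unknown, one common treatment (an assumption here; the slides do not spell it out) is to average the likelihood over the two possible phases, each taken as equally likely. A count of k recombinants under one phase becomes m - k recombinants under the other. The sketch below uses hypothetical function names and a simple grid search for the maximum.

```python
import math


def phase_unknown_likelihood(theta, k, m):
    """Average the likelihood over the two phases, ASSUMED equally likely.

    k is the number of recombinants under one assumed phase; under the
    other phase the same m meioses contain m - k recombinants.
    """
    phase_a = theta**k * (1.0 - theta) ** (m - k)
    phase_b = theta ** (m - k) * (1.0 - theta) ** k
    return 0.5 * (phase_a + phase_b)


def max_lod(k, m, steps=500):
    """Grid search over theta in [0, 1/2]; returns (best LOD, theta at best)."""
    null = phase_unknown_likelihood(0.5, k, m)
    grid = [0.5 * i / steps for i in range(steps + 1)]
    return max(
        (math.log10(phase_unknown_likelihood(t, k, m) / null), t) for t in grid
    )


lod_unknown, theta_best = max_lod(0, 2)
lod_known = math.log10((1.0 - 0.0) ** 2 / 0.5**2)  # phase known, theta_hat = 0
print(round(lod_unknown, 3), round(lod_known, 3))  # 0.301 0.602
```

For this two-child family, not knowing the phase halves the maximum likelihood ratio, which lowers the LOD score by log10(2).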

Linkage analysis for complex diseases

• Often the mode of inheritance (disease model) is unclear

• Would like to perform analysis without assuming a

specific genetic model

• "Model free" approach (more on this soon)

Part II: Genetic Association Testing

Typical statistical analysis models:

Quantitative continuous trait:

linear regression

Dichotomous trait – e.g. case/control:

logistic regression

- more flexible than chi-square / Fisher’s exact test

- can include covariates

- provides estimate of odds ratio

Linear regression

Let y = quantitative trait value:

$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \text{error}$ (the error term is a.k.a. the "residual")

OR let $\hat{y}$ = predicted quantitative trait value:

$\hat{y} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$

$x_1$ = SNP genotype (e.g. # copies of designated allele: 0, 1, 2)

$x_2, \dots, x_n$ are covariate values (e.g. age, sex)

Null hypothesis H0: $\beta_1 = 0$.

The SNP "effect size" is represented by $\beta_1$, the coefficient of $x_1$.

The key test: Is there significant evidence that $\beta_1$ is non-zero?

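Least squares linear regression: general example

$y = \alpha + \beta x$

[Figure: scatter plot of y against x with the fitted line (slope = $\beta$) and the residual deviations of the points from the line]

The least squares solution finds $\alpha$ and $\beta$ that minimize the sum of the squared residuals.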

[Figure: the same scatter plot showing the fitted line (slope = $\beta$) together with a second, poorly fitting line that would NOT minimize the sum of squared residuals]
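For a single predictor, the least squares solution has a closed form. The sketch below uses made-up data points and illustrative function names; it computes the intercept and slope and confirms that a different line gives a larger sum of squared residuals.

```python
def least_squares(xs, ys):
    """Closed-form simple linear regression; returns (alpha, beta)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                 # slope
    alpha = mean_y - beta * mean_x   # intercept
    return alpha, beta


def sum_squared_residuals(alpha, beta, xs, ys):
    """Sum over points of (observed y - predicted y)^2."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only
xs = [0, 0, 1, 1, 2, 2]
ys = [1.0, 1.4, 1.9, 2.3, 3.1, 3.3]

alpha, beta = least_squares(xs, ys)
best = sum_squared_residuals(alpha, beta, xs, ys)
other = sum_squared_residuals(alpha + 0.5, beta - 0.3, xs, ys)
print(alpha, beta, best < other)
```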

SNP Marker Additive Coding:

Genotype x1

1/1 0

1/2 1

2/2 2

Codes number of “2” alleles

Least squares linear regression

$y = \alpha + \beta x$

[Figure: quantitative trait value (y-axis) plotted against number of alleles, 0, 1, 2 (x-axis), with the fitted line; $\beta$ = slope of fitted line]

In this example, as # of coded alleles increases, the trait value increases.

[Figure: the same plot of quantitative trait value (y-axis) against number of alleles, 0, 1, 2 (x-axis), with fitted line $y = \alpha + \beta x$; $\beta$ = slope of fitted line]

$r^2$ = squared correlation coefficient

Indicates proportion of phenotypic variance in y that's explained by x

("Phenotypic variance explained")
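As a small numerical illustration of "variance explained", the sketch below computes the squared correlation coefficient between an additively coded genotype and a trait. The genotypes and trait values are invented for the example, and the function name is illustrative.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical additive genotype codes (0, 1, 2) and trait values
genotypes = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.4, 1.9, 2.3, 3.1, 3.3]

print(round(r_squared(genotypes, trait), 3))
```

This toy data set gives a value far higher than real GWAS hits, where a top SNP typically explains on the order of 1% of phenotypic variance.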

Another use of linear regression:

Model-free linkage analysis

• Based on a simple idea of genetic (allelic) "sharing"

• Among sets of relatives, evaluate

– the sharing of the disease trait AND

– the sharing of genetic marker alleles

• Identity-by-descent (IBD) vs identity-by-state (IBS):

– two alleles of the same form are said to be IBS

– two IBS alleles are IBD if they are descended from the same

ancestral allele

Traditional sib pair linkage analysis

“Model-free / non-parametric”

• Idea: if two sibs are alike in phenotype, they should

be alike in genotype near a trait-influencing locus.

• to measure "alike in genotype" : Identity by descent (IBD). Not the same as identity by state.

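[Figure: two sib pairs, each with parents of genotype 1 3 and 1 2.

Sibs 1 2 and 1 3: IBS=1, IBD=0 (the shared 1 allele was inherited from different parents)

Sibs 1 2 and 2 3: IBS=1, IBD=1 (the shared 2 allele was inherited from the same parent)]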

Sib pair linkage analysis of quantitative traits

• Haseman-Elston regression: Compare IBD sharing to

the squared trait difference of each sib pair.

[Figure: squared trait difference, (trait difference)², on the y-axis plotted against IBD sharing (0, 1, 2) on the x-axis]
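The regression idea can be sketched in a few lines. The example below is a simplified illustration with invented sib-pair data and hypothetical function names, not the full Haseman-Elston procedure: it regresses each pair's squared trait difference on the pair's IBD sharing. A negative slope means that pairs sharing more alleles IBD at the marker have more similar trait values, which is the pattern expected near a trait-influencing locus.

```python
def slope(xs, ys):
    """Least squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical sib pairs: (IBD sharing at the marker, sib 1 trait, sib 2 trait)
sib_pairs = [
    (0, 5.0, 7.0), (0, 6.0, 4.0),
    (1, 5.0, 6.5), (1, 4.0, 5.5),
    (2, 5.0, 5.5), (2, 6.0, 6.5),
]

ibd = [pair[0] for pair in sib_pairs]
squared_diff = [(t1 - t2) ** 2 for _, t1, t2 in sib_pairs]

he_slope = slope(ibd, squared_diff)
print(he_slope)  # negative here: more IBD sharing, smaller squared difference
```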

Example sib-pair based LOD score plot, from Saccone et al., 2000

Logistic regression for dichotomous traits

Let y = 1 if case, 0 if control (2 values)

Let P = probability that y = 1 (case)

Let x1 = genotype (additive coding)

$\mathrm{logit}(P) = \ln\dfrac{P}{1-P} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$

Why the logit?

Logit function

• Usual regression expects a dependent variable that can take on any value, $(-\infty, \infty)$

• A probability is in [0, 1], so not a good dependent variable

• Odds = p/(1-p) is in $[0, \infty)$

• Logit = ln(odds) is in $(-\infty, \infty)$

Think of the shapes of the graphs

• y = x/(1-x) (x in place of P): as x varies from 0 to 1, y varies from 0 to $\infty$

• y = ln(x) varies from $-\infty$ to $\infty$
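The ranges above can be checked directly. This is a minimal sketch with illustrative function names: it maps a probability to odds and to the logit, and maps a logit value back to a probability.

```python
import math


def odds(p):
    """Odds p / (1 - p); defined for 0 <= p < 1."""
    return p / (1.0 - p)


def logit(p):
    """Natural log of the odds; defined for 0 < p < 1."""
    return math.log(odds(p))


def inverse_logit(z):
    """Map any real number back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for p in (0.001, 0.1, 0.5, 0.9, 0.999):
    print(p, round(odds(p), 4), round(logit(p), 4))
```

Probabilities below ½ give negative logits, ½ gives 0, and probabilities near 1 give large positive logits, so the logit covers the whole real line.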

Logistic regression


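Let P = probability that y = 1 (case)

$\mathrm{logit}(P) = \ln\dfrac{P}{1-P} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$

Note that we can exponentiate both sides to get odds = P / (1-P):

$\text{Odds} = \dfrac{P}{1-P} = e^{\alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n}$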

What about the "effect size"?

It's the "odds ratio", and it is still related to $\beta_1$!

Odds ratio

• The number e (=2.718…) is the base of natural logarithms

• $e^{0} = 1$

• $e^{\beta_1}$ is the odds ratio; if $\beta_1 = 0$ then odds ratio is 1 (no effect of the SNP)

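To get odds ratio per copy of the allele ("effect size")

• Full model: $\dfrac{P}{1-P} = e^{\alpha + \beta_1 x_1 + \dots + \beta_n x_n}$

• Odds when $x_1 = 1$ (1 copy of the allele): $P_1/(1-P_1) = e^{\alpha + \beta_1 \cdot 1 + [\beta_2 x_2 + \dots + \beta_n x_n]} = e^{\alpha + \beta_1}\, e^{[\beta_2 x_2 + \dots + \beta_n x_n]}$

• Odds when $x_1 = 0$ (0 copies of the allele): $P_0/(1-P_0) = e^{\alpha + \beta_1 \cdot 0 + [\beta_2 x_2 + \dots + \beta_n x_n]} = e^{\alpha}\, e^{[\beta_2 x_2 + \dots + \beta_n x_n]}$

• Odds Ratio: $\dfrac{P_1/(1-P_1)}{P_0/(1-P_0)} = \dfrac{e^{\alpha + \beta_1}\, e^{[\beta_2 x_2 + \dots + \beta_n x_n]}}{e^{\alpha}\, e^{[\beta_2 x_2 + \dots + \beta_n x_n]}} = e^{\beta_1}$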
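The cancellation in the odds ratio can be confirmed numerically. In the sketch below, the intercept, coefficients and covariate value are hypothetical numbers chosen only for illustration; whatever values are used for the intercept and covariates, the ratio of the odds at one copy versus zero copies of the allele equals e raised to the SNP coefficient.

```python
import math


def odds_from_model(alpha, betas, xs):
    """Odds P/(1-P) implied by logit(P) = alpha + sum(beta_i * x_i)."""
    return math.exp(alpha + sum(b * x for b, x in zip(betas, xs)))


# Hypothetical fitted values: intercept, SNP coefficient, age coefficient
alpha = -1.0
betas = [0.262, 0.03]
age = 40.0

odds_one_copy = odds_from_model(alpha, betas, [1, age])
odds_zero_copies = odds_from_model(alpha, betas, [0, age])
odds_ratio = odds_one_copy / odds_zero_copies

print(round(odds_ratio, 4), round(math.exp(betas[0]), 4))  # the two agree
```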

Logistic regression summary


Let P = probability that y = 1 (case), ranges from 0 to 1

Then logit(P) ranges from $-\infty$ to $\infty$

$\mathrm{logit}(P) = \ln\dfrac{P}{1-P} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$

Odds ratio: $e^{\beta_1}$

Similar to the case of linear regression, can compute an analog to "variance explained," usually also called $r^2$

Displaying GWAS results

“Manhattan plot”

x-axis: chromosomal position

y-axis: -log10(p-value)

p = $1 \times 10^{-8}$ is plotted at y = 8,

p = $5 \times 10^{-8}$ is plotted at y = 7.3
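The y-axis transform is a one-liner. This sketch (function name illustrative) reproduces the two plotted heights quoted above.

```python
import math


def manhattan_height(p_value):
    """Height of a SNP on a Manhattan plot: -log10 of its p-value."""
    return -math.log10(p_value)


for p in (1e-8, 5e-8, 0.05):
    print(p, round(manhattan_height(p), 1))
```

Smaller p-values plot higher, so the strongest association signals appear as the tallest points.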

“Q-Q plot”

Quantile-Quantile plot

Idea: Rank tested SNPs by

association evidence;

compare number of

observed vs expected

associations under the null

at a given significance level

Bloom et al., Ann Am Thorac Soc 2014


– “Q-Q plot”: Quantile-quantile plot

Helps detect systematic bias in data:

Most datapoints should be close to the y=x line

Exception: signals (lowest, most significant p-values)
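A sketch of how the points of a Q-Q plot can be computed, assuming the expected p-value for the SNP of rank i out of n is i/(n+1) under the null (one common plotting convention; others exist, and the slides do not specify one). The function name is illustrative.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs on the -log10 scale.

    Expected values ASSUME uniformly distributed p-values under the null,
    using the plotting position rank / (n + 1).
    """
    n = len(p_values)
    ranked = sorted(p_values)  # smallest (most significant) first
    return [
        (-math.log10((rank + 1) / (n + 1)), -math.log10(p))
        for rank, p in enumerate(ranked)
    ]


# Hypothetical p-values: mostly null-like, plus one strong signal
example = [0.9, 0.75, 0.6, 0.45, 0.3, 0.2, 0.1, 0.05, 1e-9]
for expected, observed in qq_points(example):
    print(round(expected, 2), round(observed, 2))
```

Points with observed close to expected lie near the y = x line; a true signal shows up as a point far above the line at the high end.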


[Figure: example Q-Q plot, from JAMA 299:1335-1344 (2008)]

Multiple test correction

P-value: probability of observing the data or more "extreme" data when the null hypothesis holds = Pr(as or more extreme | null)

For a single test, traditional required threshold is p-value ≤ 0.05

Bonferroni correction

• If n = number of independent tests, then under Bonferroni correction, p-value threshold for "significance" is 0.05/n, or more generally, require $P_{\text{single test}} \le P_{\text{experiment-wide}} / n$

In the GWAS setting – how to determine the number of

"independent tests"?

• Account for linkage disequilibrium (LD), or correlation between

alleles (more on this later)

• Typical threshold corrects for 1 million independent tests, requiring p ≤ $5 \times 10^{-8}$

Why is multiple test correction important?

• Example: suppose you didn't correct for multiple tests?

With 10 independent tests, what's the probability X that we'll see at least one p ≤ 0.05 just by chance, even though the null hypothesis holds?

X = 1 – (chance that all 10 tests have p > 0.05 | null)

= 1 – (chance that one test has p > 0.05 | null)$^{10}$

= $1 - (1 - 0.05)^{10}$

= $1 - (0.95)^{10}$ = 1 – 0.5987 = 0.401

Greater than 40% chance!
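The same calculation for any number of independent tests, as a short sketch (the function name is illustrative):

```python
def prob_at_least_one_false_positive(n_tests, threshold=0.05):
    """Chance of at least one p <= threshold among n independent null tests."""
    return 1.0 - (1.0 - threshold) ** n_tests


for n in (1, 10, 100):
    print(n, round(prob_at_least_one_false_positive(n), 3))
```

With 10 tests this gives about 0.401, as above; with 100 uncorrected tests a false positive is nearly certain.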

• More generally, Sidak correction for N tests:

Require $P_{\text{experiment-wide}} = 1 - (1 - P_{\text{single test}})^{N}$

So $P_{\text{single test}} = 1 - (1 - P_{\text{experiment-wide}})^{1/N}$

To get back to Bonferroni…

• Sidak correction for N tests:

$P_{\text{single test}} = 1 - (1 - P_{\text{experiment-wide}})^{1/N}$

So for 10 tests, what $P_{\text{single test}}$ should we require to get $P_{\text{experiment-wide}}$ = 0.05?

$P_{\text{single test}} = 1 - (1 - P_{\text{experiment-wide}})^{1/N} = 1 - (1 - 0.05)^{1/10} = 0.00511$

This is almost 0.05/10 = 0.005 (!)

In fact Bonferroni approximates the more exact Sidak:

$P_{\text{single test}} = 1 - (1 - P_{\text{experiment-wide}})^{1/N} \approx P_{\text{experiment-wide}} / N$

(via Taylor series expansion)

Aside on multiple testing and the Bonferroni

correction

Optional HW (not to turn in):

If we want $P_{\text{experiment-wide}}$ = 0.05 with n = 100 tests, what should we require for $P_{\text{single test}}$?

Answer: ~ 0.0005128 (almost 0.05/100).

Summary of multiple test corrections

Sidak correction:

$P_{\text{single test}} = 1 - [1 - P_{\text{experiment-wide}}]^{1/n}$

The Bonferroni correction is an approximation of Sidak (Taylor series expansion):

$P_{\text{single test}} = P_{\text{experiment-wide}} / n$

So back to GWAS correction.

The classical choice for $P_{\text{single test}}$ = $5 \times 10^{-8}$.

Corresponds to $P_{\text{experiment-wide}}$ = 0.05 = n × $P_{\text{single test}}$, n = 1,000,000.
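The two corrections can be compared side by side. This sketch (illustrative function names) reproduces the thresholds quoted above for 10 and 100 tests and the genome-wide threshold for one million tests.

```python
def sidak_threshold(experiment_wide, n_tests):
    """Per-test threshold from the Sidak correction."""
    return 1.0 - (1.0 - experiment_wide) ** (1.0 / n_tests)


def bonferroni_threshold(experiment_wide, n_tests):
    """Per-test threshold from the Bonferroni correction."""
    return experiment_wide / n_tests


for n in (10, 100, 1_000_000):
    print(n, sidak_threshold(0.05, n), bonferroni_threshold(0.05, n))
```

Bonferroni is always slightly stricter than Sidak, and the two differ by only a few percent at an experiment-wide level of 0.05.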

GWAS

Successful by several metrics:

• Identifying genetic variants underlying complex diseases

• Highlighting novel genes, pathways, biology

• Motivating functional followup, collaborative meta-analyses

Less successful by other metrics:

• "Top" associated SNPs explain limited phenotypic variance

(typical odds ratios ~ 1.3, variance explained ~ 1%)

But even by that metric, there's good news:

• "Polygenic risk scores" - As a whole, the variation assayed by

GWAS may be able to explain even more of the phenotypic

variance (work of Peter Visscher et al.)

GWASes rely on linkage disequilibrium (LD) to "tag" variation,

and thus must be interpreted in the context of LD: the

signal SNP may be different from the causal SNP.
