human genetics and gene mapping of complex...
TRANSCRIPT
Human Genetics and Gene Mapping of Complex
Traits
Advanced Genetics, Spring 2019
Human Genetics Series
Thursday 4/4/19
Nancy L. Saccone, Ph.D.
Dept of Genetics
[email protected] / 314-747-3263
What is different about Human Genetics(recall from Cristina Strong's lectures)
• Imprinting – uniquely mammalian
• Trinucleotide repeat diseases – "anticipation"
• Can study complex behaviors and cognition (neurogenetics)
• Extensive sequence variation leads to common/complex
disease
1. Common disease, common variant hypothesis
2. Large # of small-effect variants
3. Large # of large-effect rare variants
4. Combo of genotypic, environmental, epigenetic interactions
Greg Gibson, Nature Rev Genet 2012
Outline and learning objectives
• Human disease mapping approaches:
• Linkage analysis
• Disease models
• Recombination, LOD scores
• Genetic association analyses
• Use of regression-based tests
• "effect sizes (beta and odds ratio), variance explained
• Genome-Wide Association Studies (GWAS)
• Multiple test correction: what and why
Mapping disease genes
– Linkage
• quantify co-segregation of trait and genotype in families
• Association
• Common design: case-control sample of unrelateds, analyzed for allele frequency differences
cases controls
ACAC
AC
AC
AA AC
CC
AC
AC
CC
CC
CC
CC
AC
AA
AC
AA
AA
AA
LOD score traditionally
used to measure statistical
evidence for linkage
Comparing Linkage and Association
Linkage mapping: Association mapping:
Requires family data Unrelated cases/controls OR
Case/parents OR family design
Disease travels with marker allele
within families (close genetic
distance between disease locus and
marker)
Disease is associated with marker
allele that may be either causative or
in linkage disequilibrium with causal
variant
Relationship between same allele
and trait need not exist across the
full sample (e.g. across different
families)
Works only if association exists at
the population level
robust to allelic heterogeneity: if
different mutations occur within the
same gene/locus, the method works
not robust to allelic heterogeneity
signals for complex traits tend to be
broad (~20 Mb)
association signals generally not as
broad
Human DNA sequence variation
• Single nucleotide polymorphisms (SNPs)
• Single nucleotide variants (SNVs)
Strand 1: A A C C A T A T C ... C G A T T ...
Strand 2: A A C C A T A T C ... C A A T T ...
Strand 3: A A C C C T A T C ... C G A T T ...
• Provide biallelic markers
• Coding SNPs may directly affect protein products
• Non-coding SNPs still may affect gene regulation or expression
• Low-error, high-throughput technology
• Common in genome
Number of SNPs in dbSNP over timesolid: cumulative # of non-redundant SNPs.
dotted: validated. dashed: double-hit
from: The Intl HapMap Consortium
Nature 2005, 437:1299-1320.
Number of SNVs in dbSNP over timedotted: validated
From: Fernald et al., Bioinformatics challenges for personalized medicine, Bioinformatics 27:1741-1748 2011
Number of SNVs in dbSNP over timedotted: validated
2017Update from NCBI dbSNP build 151, Oct 6, 2017: RefSNP clusters (rs numbers)
https://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi
Number of SNVs in dbSNP over timedotted: validated
2017Update from NCBI dbSNP build 151, Oct 6, 2017: RefSNP clusters (rs numbers)
https://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi
660
113
Questions that we can answer with SNPs:
• Which genetic loci influence risk for common human
diseases/traits? (Disease gene mapping studies,
including GWAS – genome-wide association studies)
• Which genetic loci influence efficacy/safety of drug
therapies? (Pharmacogenetics)
• Population genetics questions
• evidence of selection
• identification of recombination hotspots
Figure from Strachan and Read, Human Molecular Genetics
Example of autosomal dominant (fully penetrant, no phenocopies)
General hallmarks:
All affected have at least one affected parent, so the disease occurs
in all generations above the latest observed case.
The disease does not appear in descendants of two unaffecteds.
Possible molecular explanation: harmful antimorph where one copy
sufficient to cause harm/dysfunction.
Figure from Strachan and Read, Human Molecular Genetics
Example of autosomal recessive (fully penetrant, no phenocopies).
General hallmarks:
Many/most affecteds have two unaffected parents, so the disease
appears to skip generations.
On average, 1/4 of (carrier x carrier) offspring are affected.
(Affected x unaffected) offspring are usually unaffected (but carriers)
(Affected x affected) offspring are all affected.
Figure from Strachan and Read, Human Molecular Genetics
Example of autosomal recessive (fully penetrant, no phenocopies).
Possible molecular explanation:
- loss of function allele - hypomorph or null where one wild-type
allele produces enough protein for normal function, but loss of
both wild-type alleles results in disease.
Classic models of disease
Classical autosomal dominant inheritance (no phenocopies,
fully penetrant).
Penetrance table:
f++ f+D fDD
0 1 1
Often the dominant allele is rare, so that probability of
homozygous dd individuals occurring is negligible.
Classical autosomal recessive inheritance (no phenocopies,
fully penetrant).
Penetrance table:
f++ f+d fdd
0 0 1
Genetic models of disease
Other examples of penetrance tables (locus-specific):
f++ f+d fdd
0 1 1
0 0 1
0 0 0.9
0.1 1 1
0.1 0.8 0.8
Incomplete/reduced penetrance: when the risk genotype's effect
on phenotype is not always expressed/observed. (e.g. due to
environmental interaction, modifier genes)
Phenocopy: individual who develops the disease/phenotype in the
absence of "the" risk genotype (e.g. through environmental
effects, heterogeneity of genetic effects)
Part I: Human linkage studies
General “linkage screen" approach:
• Recruit families
• Genotype individuals at marker loci along the genome
• If a marker locus is "near" the trait-influencing locus,
the parental alleles from the same grandparent at
these two loci "tend to be inherited together"
(recombination between the two loci is rare)
Need to track co-segregation of trait and markers (number
of recombination events among observed meioses)
Linkage analysis concepts
A1
B1
A1
B1
A2
B2
A2
B2
Parent
gametes
Example of meiosis resulting in some recombinant gametes
A1
B2
A1
B1
A2
B1
A2
B2
A1
B1
A2
B2
A1
B1
A1
B2
A2
B1
A2
B2
sister
chromatids
Non-sister chromatids may
cross over between the loci
meiosis
Recombination with respect to 2 loci: has
occurred when in the individual's
haplotype (ie. in a gamete inherited
from the parent), the alleles at the 2
loci come from the 2 different
grandparents.
From one
grandparent
Linkage analysis concepts
A1
B1
A1
B1
A2
B2
A2
B2
Parent
gametes
Example of meiosis resulting in some recombinant gametes
A1
B2
A1
B1
A2
B1
A2
B2
A1
B1
A2
B2
A1
B1
A1
B2
A2
B1
A2
B2
sister
chromatids
Non-sister chromatids may
cross over between the loci
meiosis
NOTE: genotype does not always
indicate haplotype.
Doubly heterozygous genotype
A1A2 B1B2 could have one of 2
phases
From one
grandparent
Testing for linkage
• L = likelihood function for the observed data
• Traditionally, a LOD >= 3 (or 3.6) is considered
"significant"
Let q = the probability of recombination between 2 given loci
(range, 0 to ½)
Null hypothesis: q = ½
Alternative hypothesis: q = 𝜃 , the max likelihood estimate of q
Linkage analysis in pedigrees
First, how to detect linkage between two genotyped marker loci.
Example: phase known, fully heterozygous parents
1 2
1 2
3 4
3 4
1 3
1 4
2 3
2 3
Can explicitly count: 1 R, 3 NR.
Binomial probability distribution gives
us a likelihood function for the
observed data:
q = (# recombs) / (# meioses) = 1/4,
LOD = 0.227 << traditional LOD threshold of 3 or 3.6
Linkage analysis in pedigrees
In general, for the case where all recombinants and non-
recombinants can be explicitly counted:
• Say we have k recombinants in a total of m meioses.
Then
(Unless k/m > ½ , in which case 𝜃 = ½)
In cases where cannot definitively count all recombs/non-
recombs (e.g. if phase is not known), likelihood
function becomes more complex.
L(q) m
k
q
k (1q)mk
where ˆ q # recombs
# meioses
k
m
Linkage analysis in pedigrees
Next, how do we compute the LOD score indicating
linkage between a disease locus and a marker locus?
Key: must assume a disease model. This allows us to
assign possible genotypes at the disease locus.
Disease model specifies:
disease gene frequency pD
penetrances for the 3 genotypic states
Linkage analysis in pedigrees
Example: Suppose the disease allele D is fully penetrant,
dominant, rare. Suppose phenocopies do not occur
1 2 3 3
1 3 2 3
Penetrance table: d d D d DD
0.0 1.0 1.0
Linkage analysis in pedigrees
Example: Suppose the disease allele D is fully penetrant,
dominant, rare. Suppose phenocopies do not occur
1 2
D d
3 3
d d
1 3
D d
2 3
d d
Penetrance table: d d D d DD
0.0 1.0 1.0
Mother not informative
Phase of father unknown.
For simplicity, suppose we
know phase of the father is
1 D / 2 d
Then:
0 recombs, 2 non-recombs
𝐿 𝜃 =20
𝜃0(1 − 𝜃)2
Linkage analysis for complex diseases
• Often the mode of inheritance (disease model) is unclear
• Would like to perform analysis without assuming a
specific genetic model
• "Model free" approach (more on this soon)
Part II: Genetic Association Testing
Typical statistical analysis models:
Quantitative continuous trait:
linear regression
Dichotomous trait – e.g. case/control:
logistic regression
- more flexible than chi-square / Fisher’s exact test
- can include covariates
- provides estimate of odds ratio
Linear regression
Let y = quantitative trait value
OR
predicted quantitative trait value
x1= SNP genotype (e.g. # copies of designated allele: 0,1,2)
x2, …, xn are covariate values (e.g. age, sex)
Null hypothesis H0: b1 = 0.
The SNP “effect size” is represented by b1, the coefficient of x1.
The key test: Is there significant evidence that b1 is non-zero?
nn2211 x...xx + =y bbb
y
error nn2211 x...xx + =y bbb
a.k.a. “residual”
Least squares linear regression: general example
x + =y b
Residual deviations
x
yFitted line,
Slope = b
The least squares solution finds and b that minimize the sum
of the squared residuals.
Least squares linear regression: general example
x + =y b
Residual deviations
x
y
The least squares solution finds and b that minimize the sum
of the squared residuals.
Would NOT minimize the
sum of squared residuals
Fitted line,
Slope = b
SNP Marker Additive Coding:
Genotype x1
1/1 0
1/2 1
2/2 2
Codes number of “2” alleles
Least squares linear regression
x + =y b
b = slope of
Fitted line
0 1 2
x-axis: number of alleles
In this example, as # of coded alleles increases, the trait value
increases
y-a
xis
: quantita
tive t
rait v
alu
e
x + =y b
b = slope of
Fitted line
0 1 2
x-axis: number of alleles
y-a
xis
: quantita
tive t
rait v
alu
e
r2 = squared correlation coefficient
Indicates proportion of phenotypic variance in y
that’s explained by x
“Phenotypic variance explained”
Another use of linear regression:
Model-free linkage analysis
• Based on a simple idea of genetic (allelic) "sharing"
• Among sets of relatives, evaluate
– the sharing of the disease trait AND
– the sharing of genetic marker alleles
• Identity-by-descent (IBD) vs identity-by-state (IBS):
– two alleles of the same form are said to be IBS
– two IBS alleles are IBD if they are descended from the same
ancestral allele
Traditional sib pair linkage analysis
“Model-free / non-parametric”
• Idea: if two sibs are alike in phenotype, they should
be alike in genotype near a trait-influencing locus.
• to measure "alike in genotype" : Identity by descent (IBD). Not the same as identity by state.
1 31 2
1 31 2
1 2
1 2
2 3
1 3
IBS=1
IBD=0
IBS=1
IBD=1
Sib pair linkage analysis of quantitative traits
• Haseman-Elston regression: Compare IBD sharing to
the squared trait difference of each sib pair.
0 1 2
(trait difference)2
IBD
Example sib-pair based LOD score plot., from Saccone et al., 2000
Logistic regression for dichotomous traits
Let y = 1 if case, 0 if control (2 values)
Let P = probability that y = 1 (case)
Let x1 = genotype (additive coding)
Why?
nn2211 x...xx + =P-1
Pln logit(P) bbb
Logit function
• Usual regression expects a dependent
variable that can take on any value, (-,)
• A probability is in [0,1], so not a good
dependent variable
• Odds = p/(1-p) is in [0,)
• Logit = ln(odds) is in (-,)
Think of the shapes of the graphs
• y = x/(1-x) (x in place of P)
As x varies
from 0 to 1,
y varies
from 0 to
• y=ln(x)
varies from
- to
1
Logistic regression
Let y = 1 if case, 0 if control (2 values)
Let P = probability that y = 1 (case)
Note that can exponentiate both sides to get odds = P / (1-P):
nn2211 x...xx + =P-1
Pln logit(P) bbb
ee nn2211 x...xx +
=P-1
POdds
bbb
What about the “effect size”?
It’s the “odds ratio”, and it is still related to b1!
Odds ratio
• The number e (=2.718…) is the base of
natural logarithms
•
• is the odds ratio; if b1=0 then odds
ratio is 1 (no effect of the SNP)
1be
10 e
To get odds ratio per copy of the allele (“effect size”)
• Full model:
• Odds when x1 = 1 (1 copy of the allele)
• Odds when x1 = 0 (0 copies of the allele)
• Odds Ratio:
]...[ 11
1nnxx
eP
P bb
]...[0
]...[00
220
1
11)1/( nnnn xxx
xxeePP
bbbb
1220
1
11 ]...[1
]...[11 )1/(
bbbbb
nnnn xx
xxx
eePP
1
nn22
1nn22
]x...x +[
]x...x +[
00
11
=
1
1 b
bb
bbb
ee
e
)P/(P
)P/(P
Logistic regression summary
Let y = 1 if case, 0 if control (2 values)
Let P = probability that y = 1 (case), ranges from 0 to 1
Then logit(P) ranges from - to
Odds ratio
nn2211 x...xx + =P-1
Pln logit(P) bbb
1be
Similar to case of linear regression, can compute an analog
to “variance explained,” usually also called r2
Displaying GWAS results
“Manhattan plot”
x-axis: chromosomal position
y-axis: -log10(p-value)
p = 1x10-8 is plotted at y=8,
p = 5x10-8 is plotted at y = 7.3
“Q-Q plot”
Quantile-Quantile plot
Idea: Rank tested SNPs by
association evidence;
compare number of
observed vs expected
associations under the null
at a given significance level
Bloom et al., Ann Am Thorac Soc 2014
Displaying GWAS results
– “Q-Q plot”: Quantile-quantile plot
Helps detect systematic bias in data:
Most datapoints should be close to the y=x line
Exception: signals (lowest, most significant p-values)
Displaying GWAS results
– “Q-Q plot”: Quantile-quantile plot
JAMA 299:1335-1344 (2008)
Multiple test correctionP-value: probability of observing the data or more "extreme" data
when the null hypothesis holds = Pr(as or more extreme | null)
For a single test, traditional required threshold is p-value ≤ 0.05
Bonferroni correction
• If n = number of independent tests, then under Bonferroni
correction, p-value threshold for "significance" is 0.05/n , or
more generally, require Psingletest ≤ Pexperimentwide / n
In the GWAS setting – how to determine the number of
"independent tests"?
• Account for linkage disequilibrium (LD), or correlation between
alleles (more on this later)
• Typical threshold corrects for 1 million independent tests,
requiring p ≤ 5 x 10-8
Why is multiple test correction important?
• Example: suppose you didn't correct for multiple tests?
10 independent tests, what's the probability X that we'll see at least
one p ≤ 0.05 just by chance, even though the null hypothesis holds?
X = 1 – (chance that all 10 tests have p > 0.05 | null)
= 1 – (chance that one test has p > 0.05 | null)10
= 1 – (1 – 0.05)10
= 1 – (0.95)10 = 1 – 0.5987 = 0.401
Greater than 40% chance!
• More generally, Sidak correction for N tests:
Require Pexperiment-wide = 1 – (1 – Psingletest)N
So Psingletest = 1 – (1 – Pexperiment-wide)1/N
To get back to Bonferroni…
• Sidak correction for N tests:
Psingletest = 1 – (1 – Pexperiment-wide)1/N
So for 10 tests, what Psingletest should we require
to get Pexperiment-wide = 0.05 ?
Psingletest = 1 – (1 – Pexperiment-wide)1/N = 1 – (1 – 0.05)1/10 = 0.00511
This is almost 0.05/10 = 0.005 (!)
In fact Bonferroni approximates the more exact Sidak:
Psingletest = 1 – (1 – Pexperiment-wide)1/N ~ Pexperiment-wide / N
(via Taylor series expansion)
Aside on multiple testing and the Bonferroni
correction
Optional HW (not to turn in):
If we want experiment-wide = 0.05 with n=100 tests, what should we
require for singletest?
Answer: ~ 0.0005128 (almost 0.05/100).
Summary of multiple test corrections
Sidak correction:
singletest = 1 – [1- experiment-wide]1/n
The Bonferroni correction is an approximation of Sidak (Taylor
series expansion):
singletest = experiment-wide / n
So back to GWAS correction.
The classical choice for singletest = 5 x 10-8.
Corresponds to experiment-wide = 0.05 = n* singletest , n = 1,000,000 .
GWAS
Successful by several metrics:
• Identifying genetic variants underlying complex diseases
• Highlighting novel genes, pathways, biology
• Motivating functional followup, collaborative meta-analyses
Less successful by other metrics:
• "Top" associated SNPs explain limited phenotypic variance
(typical odds ratios ~ 1.3, variance explained ~ 1%)
But even by that metric, there's good news:
• "Polygenic risk scores" - As a whole, the variation assayed by
GWAS may be able to explain even more of the phenotypic
variance (work of Peter Visscher et al.)
GWASes rely on linkage disequilibrium (LD) to "tag" variation,
and thus must be interpreted in the context of LD: the
signal SNP may be different from the causal SNP.