Lecture 18: Association Studies I
Date: 10/24/02. Topics: a mathematical formalism for linkage disequilibrium, allelic association in random populations, and allelic association in case-control populations.
![Page 1: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/1.jpg)
Lecture 18: Association Studies I
Date: 10/24/02
A mathematical formalism for linkage disequilibrium
Allelic association in random populations
Allelic association in case-control populations
![Page 2: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/2.jpg)
Where We Are
We will be covering chapter 4 of Sham. This material is also covered in chapter 8 of Liu’s Statistical Genomics.
![Page 3: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/3.jpg)
The Limitations of Per-Cross Linkage Analysis
Family-based linkage analysis is limited by the family unit.
When restricted to natural populations where families are small (e.g. humans), it is difficult to obtain a sample large enough to resolve small distances between linked loci A and B, or equivalently small recombination fractions $\theta$.
[Figure: two loci, A and B.]
![Page 4: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/4.jpg)
How Many to Detect Recombination with $\theta$ = 0.005?
Suppose you run a coupling-phase two-point backcross with recombination fraction $\theta = 0.005$.
You expect a proportion $(1-\theta)$ of the offspring to be nonrecombinant (e.g. AB and ab).
Suppose you want probability $\gamma = 0.99$ of finding at least one recombinant among the offspring scored.
You seek N such that
$$1 - (1-\theta)^N \ge \gamma$$
![Page 5: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/5.jpg)
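How Many to Detect Recombination with $\theta$ = 0.005?
With $\gamma = 0.99$ the required probability of finding at least one recombinant:
$$
\begin{aligned}
1-(1-\theta)^N &\ge \gamma \\
(1-\theta)^N &\le 1-\gamma \\
N\log(1-\theta) &\le \log(1-\gamma) \\
N &\ge \frac{\log(1-\gamma)}{\log(1-\theta)} = 918.7
\end{aligned}
$$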
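As a quick numerical check of this sample-size calculation, the following Python sketch finds the smallest integer N with 1 − (1 − θ)^N ≥ γ. The function name `offspring_needed` and the symbol γ for the target probability (0.99 on the slide) are illustrative choices, not taken from the lecture.

```python
import math


def offspring_needed(theta: float, gamma: float) -> int:
    """Smallest N such that 1 - (1 - theta)**N >= gamma."""
    # Solve (1 - theta)**N <= 1 - gamma by taking logs; both logs are negative,
    # so dividing flips the inequality and N must be at least the ratio.
    return math.ceil(math.log(1.0 - gamma) / math.log(1.0 - theta))


n_required = offspring_needed(theta=0.005, gamma=0.99)
print(n_required)
```

Rounding the slide's 918.7 up to a whole offspring gives the number of offspring that must be scored.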
![Page 6: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/6.jpg)
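How Many to Achieve Resolution of d = 0.005?
With $n_R$ recombinant and $n_{NR}$ nonrecombinant offspring among the $N$ scored:
$$
\begin{aligned}
l(\theta) &= n_R\log\theta + n_{NR}\log(1-\theta) \\
S(\theta) &= \frac{dl}{d\theta} = \frac{n_R}{\theta} - \frac{n_{NR}}{1-\theta} \\
I(\theta) &= -\mathrm{E}\left[\frac{d^2 l}{d\theta^2}\right] = \frac{\mathrm{E}(n_R)}{\theta^2} + \frac{\mathrm{E}(n_{NR})}{(1-\theta)^2} = \frac{N}{\theta(1-\theta)} \\
I_{\text{unit}}(\theta) &= \frac{1}{\theta(1-\theta)}, \qquad I(\theta) = N\,I_{\text{unit}}(\theta)
\end{aligned}
$$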
![Page 7: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/7.jpg)
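How Many to Achieve Resolution of d = 0.005?
Using $\mathrm{Var}(\hat\theta) \approx \theta(1-\theta)/N$, require the 95% confidence interval for $\theta$ to have width at most $d$:
$$
2(1.96)\sqrt{\frac{\theta(1-\theta)}{N}} \le d
\quad\Longrightarrow\quad
N \ge \frac{3.92^2\,\theta(1-\theta)}{d^2}
$$

| $\theta$ | $N$ |
|---|---|
| 0.4 | 147,518 |
| 0.04 | 23,602 |
| 0.004 | 2,449 |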
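The sample sizes on this slide can be reproduced with a few lines of Python. This is a sketch under the assumption that "resolution d" means the full width of a 95% confidence interval, 2(1.96)·sqrt(θ(1−θ)/N); the function name `offspring_for_resolution` is illustrative.

```python
import math


def offspring_for_resolution(theta: float, d: float, z: float = 1.96) -> int:
    """Smallest N for which the confidence interval 2*z*sqrt(theta*(1-theta)/N) is no wider than d."""
    return math.ceil((2.0 * z) ** 2 * theta * (1.0 - theta) / d ** 2)


for theta in (0.4, 0.04, 0.004):
    print(theta, offspring_for_resolution(theta, d=0.005))
```

Rounding up gives 147518, 23603 and 2449. The middle value differs by one from the slide's 23,602, which appears to have been truncated rather than rounded up.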
![Page 8: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/8.jpg)
Association Studies
Suppose that instead of sampling families, which allow only a single generation of recombination to take place, you let nature take its course and allow many generations to pass before checking for linkage.
You are now performing an association study, and you have the power to detect tight linkage that you could not previously detect, because of a phenomenon known as allelic association.
![Page 9: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/9.jpg)
Allelic Association
[Figure: two chromosomes carrying allele A, one recombinant with respect to the marker loci and one nonrecombinant with respect to the marker loci.]
![Page 10: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/10.jpg)
Allelic Association – Mathematical Definition
Consider two loci A and B.
Suppose there are m alleles at locus A, denoted $A_1, A_2, \ldots, A_m$, with allele frequencies $p_i$.
Suppose there are n alleles at locus B, denoted $B_1, B_2, \ldots, B_n$, with allele frequencies $q_j$.
There are mn possible haplotypes constructed by combining one A allele and one B allele. Denote their frequencies by $h_{ij}$.
![Page 11: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/11.jpg)
Allelic Association – Mathematical Definition
If the two loci are independent (i.e. there is no association of their alleles), then
$$P(A = A_i, B = B_j) = h_{ij} = P(A = A_i)\,P(B = B_j) = p_i q_j$$
If $h_{ij} > p_i q_j$, then there is a positive association between $A_i$ and $B_j$.
If $h_{ij} < p_i q_j$, then there is a negative association between $A_i$ and $B_j$.
![Page 12: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/12.jpg)
Mathematical Formulation for Linkage Disequilibrium
One cause of allelic association is linkage disequilibrium, which we can formalize in a mathematical treatment by defining the recombination fraction $\theta$ between two loci.
What happens in a single generation of random mating?
![Page 13: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/13.jpg)
Mathematical Formulation for Linkage Disequilibrium
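Let $h_{ij}^{k}$ be the frequency of haplotype $A_iB_j$ in generation $k$, and $\theta$ the recombination fraction.
$$
\begin{aligned}
h_{ij}^{1} &= (1-\theta)\,h_{ij}^{0} + \theta\,p_i q_j \\
h_{ij}^{1} - h_{ij}^{0} &= \theta\,(p_i q_j - h_{ij}^{0})
\end{aligned}
$$
Linkage equilibrium: $h_{ij}^{0} = p_i q_j$.
Linkage disequilibrium: $h_{ij}^{0} \ne p_i q_j$.
The amount of change from generation to generation is proportional to $\theta$.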
![Page 14: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/14.jpg)
Rate of Approach to Linkage Equilibrium
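Linkage equilibrium is approached at a geometric rate.
$$
\begin{aligned}
h_{ij}^{1} - p_i q_j &= (1-\theta)\,(h_{ij}^{0} - p_i q_j) \\
h_{ij}^{k} - p_i q_j &= (1-\theta)^{k}\,(h_{ij}^{0} - p_i q_j)
\end{aligned}
$$
$D_{ij} = h_{ij} - p_i q_j$ is termed linkage disequilibrium, rather inappropriately.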
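To get a feel for the geometric rate, the sketch below evaluates D after k generations as (1 − θ)^k times its starting value and reports how many generations it takes for the disequilibrium to halve. The function names and the three θ values are illustrative assumptions, not from the lecture.

```python
import math


def ld_after(d0: float, theta: float, generations: int) -> float:
    """Disequilibrium D_ij after the given number of generations of random mating."""
    return (1.0 - theta) ** generations * d0


def generations_to_fraction(theta: float, fraction: float) -> int:
    """Smallest k with (1 - theta)**k <= fraction."""
    return math.ceil(math.log(fraction) / math.log(1.0 - theta))


for theta in (0.5, 0.05, 0.005):
    print(theta, generations_to_fraction(theta, 0.5))
```

Unlinked loci (θ = 0.5) lose half of their disequilibrium every generation, whereas tightly linked loci retain it for many generations, which is what makes association studies informative about tight linkage.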
![Page 15: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/15.jpg)
Other Causes of Allelic Association
Random genetic drift: Each generation is created by (randomly) sampling from the alleles present in the preceding generation. Because populations are finite, there will be sampling variation introduced in allele and haplotype frequencies.
Mutation: Mutation is another random process that changes allele and haplotype frequencies.
![Page 16: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/16.jpg)
Other Causes of Allelic Association
Founder effect: If a population is initiated by a small group of individuals, this group will tend to be in substantial linkage disequilibrium which will take some time to dissipate.
In addition, if the population subsequently expands, the linkage disequilibrium will take even longer to dissipate. In fact, theory shows that the linkage disequilibrium dissipates at the rate expected for a population with effective size $N_e$ approximately equal to the harmonic mean of the population size over all generations.
![Page 17: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/17.jpg)
Other Causes of Allelic Association
Selection: When the genotype affects the reproductive fitness of individuals in the population, alleles at two loci that act synergistically to improve fitness will tend to occur together on a haplotype that is over-represented in the population.
![Page 18: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/18.jpg)
Other Causes of Allelic Association
Population Subdivision: When the population is divided into subpopulations that do not interbreed for cultural or geographical reasons, genetic drift will differentiate these subpopulations. If the subpopulations are merged in the analysis, then spurious allelic associations will be generated.
![Page 19: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/19.jpg)
Subpopulations - Example
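Columns A and B are allele frequencies; AB is the haplotype frequency. Each subpopulation is in linkage equilibrium (AB = A × B), but the pooled population is not.

| Population | N | A | B | AB |
|---|---|---|---|---|
| Subpopulation 1 | 1000 | 0.2 | 0.8 | 0.16 |
| Subpopulation 2 | 1500 | 0.5 | 0.6 | 0.30 |
| Subpopulation 3 | 5000 | 0.01 | 0.4 | 0.004 |
| Pooled | 7500 | 0.13 | 0.49 | 0.08 |

Expected pooled AB frequency under linkage equilibrium: 0.13 × 0.49 ≈ 0.06.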
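The pooled row of this example can be recomputed directly from the three subpopulations. The sketch below weights each subpopulation by its size, assumes each one is in linkage equilibrium internally (AB = A × B, as in the table), and reports the disequilibrium created purely by merging them. Variable names are illustrative.

```python
# (size, frequency of allele A, frequency of allele B) for each subpopulation
subpopulations = [(1000, 0.2, 0.8), (1500, 0.5, 0.6), (5000, 0.01, 0.4)]

total = sum(size for size, _, _ in subpopulations)
pooled_a = sum(size * a for size, a, _ in subpopulations) / total
pooled_b = sum(size * b for size, _, b in subpopulations) / total
# Within each subpopulation the AB haplotype frequency is the product a * b.
pooled_ab = sum(size * a * b for size, a, b in subpopulations) / total

expected_ab = pooled_a * pooled_b      # value under linkage equilibrium
disequilibrium = pooled_ab - expected_ab

print(round(pooled_a, 2), round(pooled_b, 2), round(pooled_ab, 3))
print(round(expected_ab, 3), round(disequilibrium, 3))
```

The pooled AB frequency exceeds the product of the pooled allele frequencies even though no subpopulation shows any association on its own.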
![Page 20: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/20.jpg)
What is Allelic Association Then?
With so many other causes of allelic association, it is very difficult to determine the cause of a significant allelic association.
One must understand the population in order to use association studies.
If the contributions of random drift, mutation, selection, and population subdivision can be eliminated (or bounded), then any (or any substantial) allelic association can be taken as evidence of linkage.
![Page 21: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/21.jpg)
Linkage Disequilibrium in Human Populations
Human populations are usually assumed to be in approximate linkage equilibrium for unselected marker loci. Therefore, only allelic association due to very tight linkage would be sustained over many generations.
Association studies are used for fine mapping of human diseases and traits. Start with family-based linkage analysis. Finish with an association study.
![Page 22: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/22.jpg)
Association Study on Random Population Sample
Because it is a random sample, it is infeasible to perform such an analysis on rare alleles or traits.
Usually used to analyze linkage of markers: to verify suspected linkage, to obtain more accurate maps, and to test a population for the level of linkage detectable in an association study.
![Page 23: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/23.jpg)
Random Population – General Model
Using the same framework described earlier, there are m(m+1)/2 genotypes possible at locus A, n(n+1)/2 at B. The total number of joint genotypes is the product.
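We observe the counts $n_{ijkl}$ of each of the possible joint genotypes ($A_iA_j$ at locus A and $B_kB_l$ at locus B), which follow a multinomial distribution with parameters $n_{++++}$ and the genotype frequencies $g_{ijkl}$.
$$
\log L_2(g) = \sum_{i,j,k,l} n_{ijkl}\log g_{ijkl}, \qquad
\hat g_{ijkl} = \frac{n_{ijkl}}{n_{++++}}, \qquad
\text{d.f.} = \frac{m(m+1)}{2}\cdot\frac{n(n+1)}{2} - 1
$$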
![Page 24: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/24.jpg)
Random Population – HWE and LE Model
A submodel of the general model given above is that the population is in Hardy-Weinberg and linkage equilibrium.
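Under this submodel $g_{ijkl} = p_i p_j q_k q_l$, with an additional factor of 2 for each heterozygous locus ($i \ne j$ or $k \ne l$).
$$
\log L_0(p, q) = \sum_{i,j,k,l} n_{ijkl}\log g_{ijkl}
$$
$$
\hat p_s = \frac{1}{2n_{++++}}\left(2n_{ss++} + \sum_{i<s} n_{is++} + \sum_{j>s} n_{sj++}\right), \qquad \text{and similarly for } \hat q_s
$$
$$
\text{d.f.} = m + n - 2
$$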
![Page 25: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/25.jpg)
Random Population – Testing HWE & LE
$2(\log L_2 - \log L_0) \sim \chi^2$ with $m(m+1)n(n+1)/4 - n - m + 1$ degrees of freedom.
Jointly tests the assumption of HWE and linkage equilibrium.
![Page 26: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/26.jpg)
Random Population – HWE Model
A submodel of the general model is that the population is in Hardy-Weinberg but not linkage equilibrium.
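$$
g_{ijkl} =
\begin{cases}
h_{ik}^2 & i = j,\ k = l \\
2h_{ik}h_{il} \ \text{or}\ 2h_{ik}h_{jk} & i = j,\ k \ne l \ \text{or}\ i \ne j,\ k = l \\
2h_{ik}h_{jl} + 2h_{il}h_{jk} & \text{otherwise}
\end{cases}
$$
$$
\log L_1(h) = \sum_{i,j,k,l} n_{ijkl}\log g_{ijkl}, \qquad \text{d.f.} = mn - 1
$$
The estimates $\hat h_{ij}$ have no closed form.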
![Page 27: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/27.jpg)
Random Population – Testing LE | HWE
$2(\log L_1 - \log L_0) \sim \chi^2$ with $(m - 1)(n - 1)$ degrees of freedom.
Tests the assumption of linkage equilibrium conditional on the assumption of HWE.
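The degrees of freedom of the two likelihood-ratio tests are differences in the number of free parameters of the nested models. A small sketch that counts them for m alleles at locus A and n alleles at locus B; the function name `model_parameters` is illustrative.

```python
def model_parameters(m: int, n: int) -> tuple[int, int, int]:
    """Free parameters of the general, HWE-only, and HWE-plus-LE models."""
    general = (m * (m + 1) // 2) * (n * (n + 1) // 2) - 1  # joint genotype frequencies
    hwe = m * n - 1                                        # haplotype frequencies
    hwe_le = (m - 1) + (n - 1)                             # allele frequencies
    return general, hwe, hwe_le


general, hwe, hwe_le = model_parameters(2, 2)
print(general - hwe_le)  # joint test of HWE and LE
print(hwe - hwe_le)      # test of LE given HWE
print(general - hwe)     # test of HWE allowing for LD
```

For two biallelic loci these differences are 6, 1 and 5.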
![Page 28: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/28.jpg)
Random Population – Estimating hij
There is missing information because we cannot tell whether the doubly heterozygous genotype $g_{ijkl}$ ($i \ne j$, $k \ne l$) is made up of haplotypes $h_{ik}$ and $h_{jl}$ or haplotypes $h_{il}$ and $h_{jk}$. All other genotypes have identifiable haplotypes.
The stage is set for using the EM algorithm to obtain the maximum likelihood estimates of the $h_{ij}$.
Start by choosing initial values; it is reasonable to start from linkage equilibrium, with each haplotype frequency equal to the product of its allele frequencies.
![Page 29: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/29.jpg)
Random Population – Conditional Probabilities
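For a doubly heterozygous genotype $g_{ijkl}$:
$$
P(h_{ik} \mid g_{ijkl})
= \frac{P(g_{ijkl},\, h_{ik}/h_{jl})}{P(g_{ijkl},\, h_{ik}/h_{jl}) + P(g_{ijkl},\, h_{il}/h_{jk})}
= \frac{h_{ik}h_{jl}}{h_{ik}h_{jl} + h_{il}h_{jk}}
$$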
![Page 30: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/30.jpg)
Random Population – E Step
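With the superscript $(t)$ indexing the EM iteration, for $i \ne j$ and $k \ne l$:
$$
\begin{aligned}
n_{ik/jl}^{(t+1)} &= n_{ijkl}\,\frac{h_{ik}^{(t)}h_{jl}^{(t)}}{h_{ik}^{(t)}h_{jl}^{(t)} + h_{il}^{(t)}h_{jk}^{(t)}} \\
n_{il/jk}^{(t+1)} &= n_{ijkl}\,\frac{h_{il}^{(t)}h_{jk}^{(t)}}{h_{ik}^{(t)}h_{jl}^{(t)} + h_{il}^{(t)}h_{jk}^{(t)}}
\end{aligned}
$$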
![Page 31: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/31.jpg)
Random Population – M Step
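With the superscript $(t)$ indexing the EM iteration, each haplotype frequency is updated by counting copies of the haplotype and dividing by the number of chromosomes:
$$
h_{ik}^{(t+1)} = \frac{1}{2n_{++++}}\left(2n_{iikk} + \sum_{j \ne i} n_{ijkk} + \sum_{l \ne k} n_{iikl} + \sum_{j \ne i,\ l \ne k} n_{ik/jl}^{(t+1)}\right)
$$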
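The E and M steps can be sketched in Python for the simplest case of two biallelic loci, where the only ambiguous genotype is the double heterozygote A1A2 B1B2. This is a minimal illustration rather than the lecture's implementation: the 3×3 layout of `counts`, the fixed iteration count and the example data are all assumptions made for the sketch.

```python
def em_haplotype_freqs(counts, iterations=100):
    """Estimate (h11, h12, h21, h22) by EM.

    counts[a][b] is the number of individuals carrying `a` copies of allele A1
    and `b` copies of allele B1 (a, b in 0, 1, 2).
    """
    two_n = 2 * sum(sum(row) for row in counts)
    # Haplotype copies that can be counted directly (all but the double heterozygote).
    known11 = 2 * counts[2][2] + counts[2][1] + counts[1][2]
    known12 = 2 * counts[2][0] + counts[2][1] + counts[1][0]
    known21 = 2 * counts[0][2] + counts[0][1] + counts[1][2]
    known22 = 2 * counts[0][0] + counts[0][1] + counts[1][0]
    double_het = counts[1][1]

    # Initial values: linkage equilibrium (product of allele frequencies).
    p1 = sum(a * counts[a][b] for a in range(3) for b in range(3)) / two_n
    q1 = sum(b * counts[a][b] for a in range(3) for b in range(3)) / two_n
    h11, h12, h21, h22 = p1 * q1, p1 * (1 - q1), (1 - p1) * q1, (1 - p1) * (1 - q1)

    for _ in range(iterations):
        # E step: split double heterozygotes between the two possible phases.
        cis, trans = h11 * h22, h12 * h21
        weight = cis / (cis + trans) if cis + trans > 0 else 0.5
        n_cis = double_het * weight          # phase A1B1 / A2B2
        n_trans = double_het - n_cis         # phase A1B2 / A2B1
        # M step: haplotype frequency = expected haplotype count / chromosomes.
        h11 = (known11 + n_cis) / two_n
        h22 = (known22 + n_cis) / two_n
        h12 = (known12 + n_trans) / two_n
        h21 = (known21 + n_trans) / two_n
    return h11, h12, h21, h22


# Hypothetical data: 4 A2A2 B2B2, 2 double heterozygotes, 4 A1A1 B1B1.
example = [[4, 0, 0], [0, 2, 0], [0, 0, 4]]
print(em_haplotype_freqs(example))
```

In practice one would iterate until the change in the estimates falls below a tolerance instead of using a fixed number of iterations.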
![Page 32: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/32.jpg)
Random Population - Example
Sham, example 4.4 in section 4.5.
Test 1: $2(\log L_2 - \log L_1) = 9.40$, df = 5, p = 0.09
Test 2: $2(\log L_1 - \log L_0) = 8.89$, df = 1, p = 0.003
Conclusion: There may be linkage disequilibrium (test 2), but there is no substantial evidence for nonrandom mating (test 1).
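The two p-values quoted for this example can be checked without a statistics library, because the upper tail of a chi-squared distribution with an odd number of degrees of freedom has a closed form in terms of the complementary error function. The helper `chi2_upper_tail_odd_df` below is an illustrative sketch, not part of the lecture.

```python
import math


def chi2_upper_tail_odd_df(x: float, df: int) -> float:
    """P(chi-squared with `df` degrees of freedom > x), for odd df only."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form is valid for odd df only")
    tail = math.erfc(math.sqrt(x / 2.0))                       # the df = 1 tail
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)   # first series term
    for j in range(1, (df - 1) // 2 + 1):
        tail += term
        term *= x / (2 * j + 1)                                # next term of the series
    return tail


print(round(chi2_upper_tail_odd_df(9.40, 5), 3))
print(round(chi2_upper_tail_odd_df(8.89, 1), 4))
```

Both values agree with the slide to the precision quoted there.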
![Page 33: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/33.jpg)
Association Study on Case-Control Sample
When a trait is rare, random sampling of the population is inefficient. It is more appropriate to sample affected individuals in the population and compare them to a random sample of unaffected members from the same population.
The approach is to derive the conditional probabilities of marker genotypes given the disease state, thus permitting genotype comparisons between the sample of affected individuals (cases) and the unaffected individuals (control).
![Page 34: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/34.jpg)
Case-Control – Derivation I
Suppose the rare trait is controlled by locus D with alleles D1 and D2.
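Suppose that the penetrance parameters (A = affected) are
$$
f_{11} = P(A \mid D_1D_1), \qquad f_{12} = P(A \mid D_1D_2), \qquad f_{22} = P(A \mid D_2D_2)
$$
And the allele frequencies are
$$
p_1 = P(D_1), \qquad p_2 = P(D_2)
$$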
![Page 35: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/35.jpg)
Case-Control – Derivation II
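$$
\begin{aligned}
P(D_1D_1 \mid A) &= \frac{p_1^2 f_{11}}{K} \\
P(D_1D_2 \mid A) &= \frac{2p_1p_2 f_{12}}{K} \\
P(D_2D_2 \mid A) &= \frac{p_2^2 f_{22}}{K}
\end{aligned}
$$
where $K = p_1^2 f_{11} + 2p_1p_2 f_{12} + p_2^2 f_{22}$.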
![Page 36: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/36.jpg)
Case-Control – Derivation III
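Let $s_{11} = 1 - f_{11}$, $s_{12} = 1 - f_{12}$, $s_{22} = 1 - f_{22}$ and $Q = 1 - K$. Then, for unaffected individuals (U):
$$
\begin{aligned}
P(D_1D_1 \mid U) &= \frac{p_1^2 s_{11}}{Q} \\
P(D_1D_2 \mid U) &= \frac{2p_1p_2 s_{12}}{Q} \\
P(D_2D_2 \mid U) &= \frac{p_2^2 s_{22}}{Q}
\end{aligned}
$$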
![Page 37: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/37.jpg)
Case-Control – Derivation IV
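Allele probabilities at the trait locus, with $K = P(A)$, $Q = 1 - K$ and $s_{ab} = 1 - f_{ab}$:
$$
\begin{aligned}
P(D_1 \mid A) &= \frac{p_1^2 f_{11}}{K} + \frac{1}{2}\cdot\frac{2p_1p_2 f_{12}}{K} = \frac{p_1^2 f_{11} + p_1p_2 f_{12}}{K} \\
P(D_2 \mid A) &= \frac{p_2^2 f_{22}}{K} + \frac{1}{2}\cdot\frac{2p_1p_2 f_{12}}{K} = \frac{p_2^2 f_{22} + p_1p_2 f_{12}}{K} \\
P(D_1 \mid U) &= \frac{p_1^2 s_{11}}{Q} + \frac{1}{2}\cdot\frac{2p_1p_2 s_{12}}{Q} = \frac{p_1^2 s_{11} + p_1p_2 s_{12}}{Q} \\
P(D_2 \mid U) &= \frac{p_2^2 s_{22}}{Q} + \frac{1}{2}\cdot\frac{2p_1p_2 s_{12}}{Q} = \frac{p_2^2 s_{22} + p_1p_2 s_{12}}{Q}
\end{aligned}
$$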
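A numerical sketch of the derivation at the trait locus: given the frequency of D1 and the three penetrances, it computes the prevalence K, the genotype distributions in cases and controls, and the frequency of D1 in each group. The function name and the parameter values (a rare, partly dominant allele) are hypothetical and chosen only for illustration.

```python
def disease_locus_freqs(p1, f11, f12, f22):
    """Genotype distributions and D1 allele frequency among cases and controls."""
    p2 = 1.0 - p1
    # Population genotype frequencies under Hardy-Weinberg equilibrium.
    prior = {"D1D1": p1 * p1, "D1D2": 2 * p1 * p2, "D2D2": p2 * p2}
    penetrance = {"D1D1": f11, "D1D2": f12, "D2D2": f22}
    K = sum(prior[g] * penetrance[g] for g in prior)   # P(affected)
    Q = 1.0 - K                                        # P(unaffected)
    cases = {g: prior[g] * penetrance[g] / K for g in prior}
    controls = {g: prior[g] * (1.0 - penetrance[g]) / Q for g in prior}
    d1_cases = cases["D1D1"] + 0.5 * cases["D1D2"]
    d1_controls = controls["D1D1"] + 0.5 * controls["D1D2"]
    return K, cases, controls, d1_cases, d1_controls


# Hypothetical parameter values, not from the lecture.
K, cases, controls, d1_cases, d1_controls = disease_locus_freqs(0.01, 0.9, 0.5, 0.01)
print(round(K, 6), round(d1_cases, 4), round(d1_controls, 4))
```

With these values the D1 allele is far more common among cases than among controls, which is the difference a case-control comparison exploits.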
![Page 38: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/38.jpg)
Case-Control – Derivation V
Unless we are very lucky though, we are not genotyping at the disease/trait locus, rather at a marker which may or may not be linked to the disease/trait locus.
Thus, we must continue the derivation to determine the conditional genotype probabilities at the marker locus, not the disease/trait locus.
Let the marker locus be called B and suppose it has n possible alleles with frequencies qj.
![Page 39: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/39.jpg)
Case-Control – Derivation VI
The key to the analysis of association is that the trait locus D and the marker locus B may not be in linkage equilibrium. This deviation from linkage equilibrium is quantified through the haplotype frequencies $h_{ij}$ (trait allele $D_i$ with marker allele $B_j$), which are then not simply products of the allele frequencies.
![Page 40: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/40.jpg)
Case-Control – Derivation VII
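With $h_{1i}$ and $h_{2i}$ the frequencies of haplotypes $D_1B_i$ and $D_2B_i$:
$$
\begin{aligned}
P(A, B_iB_i) &= f_{11}h_{1i}^2 + 2f_{12}h_{1i}h_{2i} + f_{22}h_{2i}^2 \\
P(A, B_iB_j) &= 2f_{11}h_{1i}h_{1j} + 2f_{12}(h_{1i}h_{2j} + h_{1j}h_{2i}) + 2f_{22}h_{2i}h_{2j}, \qquad i \ne j
\end{aligned}
$$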
![Page 41: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/41.jpg)
Case-Control – Derivation VIII
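With $s_{ab} = 1 - f_{ab}$, $K = P(A)$ and $Q = 1 - K$, for $i \ne j$:
$$
\begin{aligned}
P(B_iB_i \mid A) &= \frac{f_{11}h_{1i}^2 + 2f_{12}h_{1i}h_{2i} + f_{22}h_{2i}^2}{K} \\
P(B_iB_j \mid A) &= \frac{2f_{11}h_{1i}h_{1j} + 2f_{12}(h_{1i}h_{2j} + h_{1j}h_{2i}) + 2f_{22}h_{2i}h_{2j}}{K} \\
P(B_iB_i \mid U) &= \frac{s_{11}h_{1i}^2 + 2s_{12}h_{1i}h_{2i} + s_{22}h_{2i}^2}{Q} \\
P(B_iB_j \mid U) &= \frac{2s_{11}h_{1i}h_{1j} + 2s_{12}(h_{1i}h_{2j} + h_{1j}h_{2i}) + 2s_{22}h_{2i}h_{2j}}{Q}
\end{aligned}
$$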
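The marker genotype distributions in cases and controls can be computed directly from the haplotype frequencies and penetrances. In the sketch below `h[d][i]` is the frequency of the haplotype carrying trait allele D(d+1) and marker allele B(i+1); the function name, this data layout and all numerical values are hypothetical.

```python
from itertools import combinations_with_replacement


def marker_genotype_freqs(h, f11, f12, f22):
    """Return (cases, controls): marker genotype distributions keyed by (i, j), i <= j."""
    n_alleles = len(h[0])
    affected, unaffected = {}, {}
    for i, j in combinations_with_replacement(range(n_alleles), 2):
        c = 1.0 if i == j else 2.0  # heterozygous marker genotypes arise in two orders
        d11 = c * h[0][i] * h[0][j]                          # joint with D1D1
        d12 = c * (h[0][i] * h[1][j] + h[0][j] * h[1][i])    # joint with D1D2
        d22 = c * h[1][i] * h[1][j]                          # joint with D2D2
        affected[(i, j)] = f11 * d11 + f12 * d12 + f22 * d22
        unaffected[(i, j)] = (1 - f11) * d11 + (1 - f12) * d12 + (1 - f22) * d22
    K = sum(affected.values())      # P(affected)
    Q = sum(unaffected.values())    # P(unaffected)
    cases = {g: v / K for g, v in affected.items()}
    controls = {g: v / Q for g, v in unaffected.items()}
    return cases, controls


# Hypothetical haplotype frequencies: D1 is over-represented on B1 haplotypes.
h_ld = [[0.008, 0.002], [0.392, 0.598]]
cases, controls = marker_genotype_freqs(h_ld, f11=0.9, f12=0.5, f22=0.01)
print({g: round(v, 4) for g, v in cases.items()})
print({g: round(v, 4) for g, v in controls.items()})
```

When the haplotype frequencies are products of the allele frequencies (linkage equilibrium), the two distributions coincide, so any difference between cases and controls at the marker reflects association with the trait locus.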
![Page 42: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/42.jpg)
Case-Control – Hypothesis Testing
Full Model (no assumptions): n(n+1) – 1 degrees of freedom.
Restricted Model 1 (LE): 2n – 2 degrees of freedom.
Restricted Model 2 (LE and HWE): n – 1 degrees of freedom.
![Page 43: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/43.jpg)
Case-Control – Multiple Loci
Suppose that you have genotyped multiple loci and there is one locus underlying the trait of interest.
Then one can set up a model where all the loci are in linkage disequilibrium and a nested model where all the loci are in linkage disequilibrium except the locus of interest.
Compare these nested models using the G statistic (the likelihood-ratio statistic) to test for evidence of linkage of the loci to the disease.
![Page 44: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/44.jpg)
Case-Control – Unknown Genetic Model
When the penetrance parameters and allele frequencies are unknown, the above model gets you nowhere.
Treat the data as a traditional contingency table and test for significant difference in the two populations.
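| Allele | Case | Control |
|---|---|---|
| $B_1$ | ${}_1n_1$ | ${}_2n_1$ |
| $B_2$ | ${}_1n_2$ | ${}_2n_2$ |

With n alleles (one row per allele), the test has n – 1 degrees of freedom.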
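The contingency-table test can be sketched as a plain Pearson chi-squared computation on the allele counts. The table layout (one `(case_count, control_count)` row per allele), the function name and the counts are hypothetical; the sketch assumes every row and column total is nonzero.

```python
def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom for an alleles-by-group table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, df


# Hypothetical allele counts: B1 seen 30 times in cases and 50 times in controls, etc.
allele_counts = [(30, 50), (70, 50)]
statistic, df = pearson_chi2(allele_counts)
print(round(statistic, 3), df)
```

With two groups the degrees of freedom reduce to the number of alleles minus one.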
![Page 45: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/45.jpg)
Case-Control – Polymorphic Loci – The Problem
When the markers are very polymorphic, there are many, many possible genotypes. Not all will be observed in the data, and some cells will have very low counts, even if we consider just allele frequencies.
![Page 46: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/46.jpg)
Case-Control – Polymorphic Loci - Solutions
Group alleles together until the counts are high enough to perform a chi-squared test. Unfortunately, power is lost.
Compare each allele against all the other alleles, resulting in a 2x2 table for each allele. There are multiple tests, so the significance level must be adjusted. This may also reduce power.
![Page 47: Lecture 18: Association Studies I](https://reader035.vdocuments.us/reader035/viewer/2022062322/56814c2b550346895db931a1/html5/thumbnails/47.jpg)
Case-Control – Polymorphic Loci - Solutions
Monte Carlo simulation or exact tests to determine significance of Pearson chi-squared statistic conditional on the marginal totals in the table.
There are other methods. Perhaps you’ll see them in homework...
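One way to carry out the Monte Carlo approach is a permutation test: shuffle the case/control labels over the sampled allele copies, which keeps both sets of marginal totals fixed, and compare the observed Pearson statistic with its permutation distribution. The sketch below is illustrative only; the function names, the number of permutations and the example counts are assumptions, and every allele is assumed to appear at least once.

```python
import random


def pearson_statistic(table):
    """Pearson chi-squared statistic for rows of (case_count, control_count)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return sum(
        (obs - row_totals[r] * col_totals[c] / grand_total) ** 2
        / (row_totals[r] * col_totals[c] / grand_total)
        for r, row in enumerate(table)
        for c, obs in enumerate(row)
    )


def monte_carlo_p_value(table, n_permutations=2000, seed=0):
    """Permutation p-value conditional on the row and column totals of the table."""
    rng = random.Random(seed)
    # One entry per sampled allele copy; the first n_case entries are the cases.
    pool = [a for a, (case, _) in enumerate(table) for _ in range(case)]
    n_case = len(pool)
    pool += [a for a, (_, control) in enumerate(table) for _ in range(control)]
    allele_totals = [case + control for case, control in table]
    observed = pearson_statistic(table)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pool)
        case_counts = [0] * len(table)
        for allele in pool[:n_case]:
            case_counts[allele] += 1
        permuted = [(k, total - k) for k, total in zip(case_counts, allele_totals)]
        if pearson_statistic(permuted) >= observed - 1e-12:
            hits += 1
    # Count the observed table as one of the permutations to avoid a p-value of zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical allele counts for a three-allele marker.
print(monte_carlo_p_value([(30, 50), (60, 45), (10, 5)]))
```

Because the reference distribution is simulated instead of taken from the chi-squared approximation, sparse cells from highly polymorphic markers do not invalidate the test, although the p-value carries Monte Carlo error that shrinks as the number of permutations grows.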