Genetic Linkage Analysis using HMMs
Lecture 7
Prepared by Dan Geiger
Outline
Part I: A quick look at the relevant genetics
Part II: The use of HMMs
Part III: Case study: Werner’s syndrome
Gene hunting: find the genes responsible for a given disease.
Main idea: If a disease is statistically linked with a marker on a chromosome, then tentatively infer that a gene causing the disease is located near that marker.
Chromosome Logical Structure
Locus – the location of markers on the chromosome.
Allele – one variant form (or state) of a gene/marker at a particular locus.
Locus 1: possible alleles A1, A2
Locus 2: possible alleles B1, B2, B3
By markers we mean genes, Single Nucleotide Polymorphisms, Tandem repeats, etc.
Phenotype versus Genotype
The ABO locus determines detectable antigens on the surface of red blood cells.
The 3 major alleles (A,B,O) determine the various ABO blood types.
O is recessive to both A and B, and alleles A and B are codominant.
Phenotype Genotype
A A/A, A/O
B B/B, B/O
AB A/B
O O/O
Note: genotypes are unordered.
Recombination Phenomenon
A recombination between two genes has occurred if the haplotype of the individual contains two alleles that resided in different haplotypes in the individual's parent.
(Haplotype – the alleles at different loci that are received by an individual from one parent.)
Male or female
Gametes: egg or sperm
Homologous chromosomes showing chiasmata
A chiasma (plural: chiasmata) is the cytological expression of recombination (crossing over).
Sister chromatids
Example: ABO and the AK1 marker on Chromosome 9
Recombination fraction = 16/100.
One centimorgan (cM) means one expected recombination per 100 meioses. In our case the distance is 16 cM.
One centimorgan corresponds to approximately 1M nucleotides (with large variance), depending on location and sex.
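Reading a recombination fraction of 16/100 directly as 16 cM treats the two scales as interchangeable, which works well for short distances. As a rough illustration only, the sketch below converts between the two scales with Haldane's map function. That choice is an assumption made here (the slides do not name a map function), and the function names are hypothetical.

```python
import math

def theta_to_cm_haldane(theta):
    """Map distance in cM for a recombination fraction theta (0 <= theta < 0.5),
    assuming Haldane's map function (no crossover interference)."""
    return -50.0 * math.log(1.0 - 2.0 * theta)

def cm_to_theta_haldane(d_cm):
    """Recombination fraction for a map distance given in cM (Haldane)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

theta = 16 / 100  # recombination fraction from the ABO / AK1 example
print(theta_to_cm_haldane(theta))
print(cm_to_theta_haldane(1.0))
```

Under this assumed map function a fraction of 0.16 corresponds to roughly 19 cM, while at 1 cM the two scales nearly coincide.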
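[Figure: pedigree of five individuals (numbered 1-5), each labeled with ABO blood type and AK1 genotype: A, A1/A1; O, A2/A2; A, A1/A2; O, A1/A2; A, A2/A2. Phased two-locus haplotype pairs are shown as O O / A1 A2, A O / A1 A2, A | O / A2 | A2, and O O / A2 A2; one of them is marked "Recombinant". Annotation: "Phase inferred".]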
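Example for Finding Disease Genes
We use a marker with codominant alleles A1/A2.
We speculate a locus with alleles H (healthy) / D (affected).
If the expected number of recombinants is low (close to zero), then the speculated locus and the marker are tentatively physically close.
[Figure: pedigree of five individuals (numbered 1-5), each labeled with disease-locus status and marker genotype: H, A1/A1; D, A2/A2; H, A1/A2; D, A1/A2; H, A2/A2. Phased two-locus haplotype pairs are shown as D D / A1 A2, H D / A1 A2, H | D / A2 | A2, and D D / A2 A2; one of them is marked "Recombinant". Annotation: "Phase inferred".]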
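Recombination cannot be simply counted
[Figure: pedigree of five individuals (numbered 1-5), each labeled with disease-locus status and marker genotype: H, A1/A1; H, A2/A2; H, A1/A2; D, A1/A2; H, A2/A2. Two-locus haplotype pairs are shown as D D / A1 A2, H D / A1 A2, and H | D / A2 | A2. Annotations: "Possible Recombinant", "Phase ???".]
One can compute the probability that a recombination occurred and use this number as if it were the real count.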
Comments about the example
Often:
Pedigrees are larger and more complex.
Not every individual is typed.
Recombinants cannot always be determined.
There are more markers, and they are polymorphic (have more than two alleles).
Genetic Linkage Analysis
The method just described is called genetic linkage analysis. It uses the phenomenon of recombination in families of affected individuals to locate the vicinity of a disease gene.
Genetic distance is measured in centimorgans, and the recombination fraction can differ between males and females.
Next step: Once a suspected area is found, further studies check the 20-50 candidate genes in that area.
$0 \ (\text{linkage}) \le P(\text{recombination}) \le 0.5 \ (\text{no linkage})$
Part II: Mathematics and Algorithms
Using the Maximum Likelihood Approach
The probability of the pedigree data, Pr(data | $\theta$), is a function of the known and unknown recombination fractions (the unknown is denoted by $\theta$).
How can we construct this likelihood function?
The maximum likelihood approach is to seek the value of $\theta$ that maximizes the likelihood function Pr(data | $\theta$). This is the ML estimate.
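For intuition, here is a minimal sketch of the ML idea in the simplest possible setting, which is an assumption and not the pedigree likelihood the slides go on to build: phase is known, and k recombinants are counted directly in n meioses, so the likelihood is binomial in theta. The function names and the grid search are illustrative choices.

```python
import math

def log_likelihood(theta, k, n):
    """Log of theta^k (1 - theta)^(n - k); requires 0 < theta < 1."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

def ml_estimate(k, n, steps=500):
    """Grid search for the maximizing theta over 0.001, 0.002, ..., 0.5."""
    grid = [i / (2 * steps) for i in range(1, steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, k, n))

def lod_score(theta, k, n):
    """log10 of the likelihood ratio against no linkage (theta = 0.5)."""
    return (log_likelihood(theta, k, n) - log_likelihood(0.5, k, n)) / math.log(10)

theta_hat = ml_estimate(k=16, n=100)
print(theta_hat, lod_score(theta_hat, k=16, n=100))
```

In this toy setting the estimate is simply k/n. The grid search is used only because the same pattern carries over to likelihoods that have no closed-form maximum.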
Constructing the Likelihood function
First, we determine the variables describing the problem.
Lijm = Maternal allele at locus i of person j. The values of this variable are the possible alleles li at locus i.
Lijf = Paternal allele at locus i of person j. The values of this variable are the possible alleles li at locus i (same as for Lijm).
Xij = Unordered allele pair at locus i of person j ("the genotype"). The values are pairs of ith-locus alleles (li, l’i).
Yj = person j is affected/not affected ("the phenotype").
Sijm = a binary variable {0,1} that determines which of the mother's two alleles person j receives at locus i. Similarly,
Sijf = a binary variable {0,1} that determines which of the father's two alleles person j receives at locus i.
It remains to specify the joint distribution that governs these variables. HMMs turn out to be a reasonable choice.
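The model
[Figure: Bayesian network drawn as one slice per locus (Locus 1, Locus 2 (Disease), Locus 3, Locus 4). Each slice contains the allele variables Li1m, Li1f, Li2m, Li2f, Li3m, Li3f, the genotype variables Xi1, Xi2, Xi3, and the selector variables Si3m, Si3f; the disease-locus slice also contains the phenotype variables Y1, Y2, Y3.]
This model depicts the qualitative relations between the variables. We will now specify the joint distribution over these variables.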
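Probabilistic Model for Recombination
[Figure: two adjacent slices of the network. Locus 1: S13m, S13f, L11m, L11f, L12m, L12f, L13m, L13f, X11, X12, X13. Locus 2: S23m, S23f, L21m, L21f, L22m, L22f, L23m, L23f, X21, X22, X23, with phenotype nodes Y1, Y2, Y3.]
$$P(s_{23t} \mid s_{13t}) = \begin{pmatrix} 1-\theta & \theta \\ \theta & 1-\theta \end{pmatrix}, \quad \text{where } t \in \{m, f\}$$
$\theta$ is the recombination fraction between loci 2 & 1.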
Details regarding the Loci
The phenotype variables Yj are 0 or 1 (e.g., affected or not affected) and are connected to the Xij variables only at the disease locus. For example, the model of a perfect recessive disease yields the penetrance probabilities:
P(y11 = sick | X11 = (a,a)) = 1
P(y11 = sick | X11 = (A,a)) = 0
P(y11 = sick | X11 = (A,A)) = 0
[Figure: fragment of the network with nodes Li1m, Li1f, Li3m, Xi1, Si3m, Y1.]
P(L11m = a) is the frequency of allele a.
X11 is an unordered allele pair at locus 1 of person 1 ("the data").
P(x11 | l11m, l11f) = 0 or 1, depending on consistency.
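The two local distributions described here are simple enough to write down directly. The sketch below is illustrative: the function names are hypothetical, and genotypes are represented as Python tuples of allele labels, which is an assumption about encoding.

```python
def penetrance_perfect_recessive(genotype):
    """P(sick | genotype) under the perfect recessive model: only (a, a) is affected."""
    return 1.0 if sorted(genotype) == ["a", "a"] else 0.0

def genotype_given_alleles(x, l_m, l_f):
    """P(x | l_m, l_f): 1 if the unordered pair x consists of exactly the
    maternal allele l_m and the paternal allele l_f, and 0 otherwise."""
    return 1.0 if sorted(x) == sorted((l_m, l_f)) else 0.0

print(penetrance_perfect_recessive(("a", "a")))
print(penetrance_perfect_recessive(("A", "a")))
print(genotype_given_alleles(("A", "a"), "a", "A"))
```

Sorting both sides is what makes the genotype unordered: (A, a) and (a, A) are treated as the same observation.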
Hidden Markov Model In our case
[Figure: HMM chain with hidden states S1, S2, S3, ..., Si-1, Si, Si+1 and observations X1, X2, X3, ..., Yi-1, Xi, Xi+1.]
The compounded variable Si = (Si,1,m, …, Si,n,f) is called the inheritance vector. It has $2^{2n}$ states, where n is the number of persons that have parents in the pedigree (non-founders). The compounded variable Xi = (Xi,1, …, Xi,n) is the data regarding locus i. Similarly, for the disease locus we use Yi.
To specify the HMM we now explicate the transition matrices from Si-1 to Si and the matrices P(xi | Si).
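The transition matrix
Recall that we wrote:
$$P(s_{i,j,t} \mid s_{i-1,j,t}) = \begin{pmatrix} 1-\theta_i & \theta_i \\ \theta_i & 1-\theta_i \end{pmatrix}, \quad \text{where } t \in \{m, f\}$$
All $\theta_i$ are usually known except the one before the disease locus. Extending this matrix to the smallest inheritance vector (n=1), we get:
$$P(s_{23m}, s_{23f} \mid s_{13m}, s_{13f}) =
\begin{array}{c|cccc}
 & 00 & 01 & 10 & 11 \\ \hline
00 & (1-\theta_i)^2 & \theta_i(1-\theta_i) & \theta_i(1-\theta_i) & \theta_i^2 \\
01 & \theta_i(1-\theta_i) & (1-\theta_i)^2 & \theta_i^2 & \theta_i(1-\theta_i) \\
10 & \theta_i(1-\theta_i) & \theta_i^2 & (1-\theta_i)^2 & \theta_i(1-\theta_i) \\
11 & \theta_i^2 & \theta_i(1-\theta_i) & \theta_i(1-\theta_i) & (1-\theta_i)^2
\end{array}$$
Let d = Hamming distance between state $s_{i-1}$ and state $s_i$. Then the transition probability is given by $\theta_i^{d} (1-\theta_i)^{2n-d}$.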
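The Hamming-distance rule gives a direct way to build the transition matrix for any number of non-founders. The following is a minimal sketch in plain Python. The function name is hypothetical, and enumerating all states explicitly is only practical for small n, since the matrix has 2^(2n) rows.

```python
from itertools import product

def transition_matrix(theta, n):
    """Transition matrix between inheritance vectors of 2n bits.

    Entry [r][c] is theta^d * (1 - theta)^(2n - d), where d is the Hamming
    distance between state r and state c.
    """
    bits = 2 * n
    states = list(product((0, 1), repeat=bits))
    matrix = []
    for prev in states:
        row = []
        for cur in states:
            d = sum(a != b for a, b in zip(prev, cur))
            row.append(theta ** d * (1 - theta) ** (bits - d))
        matrix.append(row)
    return states, matrix

states, matrix = transition_matrix(theta=0.1, n=1)
for state, row in zip(states, matrix):
    print(state, row)
```

For n = 1 this produces a 4 x 4 matrix over the states 00, 01, 10, 11, and every row sums to 1 because each bit flips independently with probability theta.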
Probability of data in one locus given the inheritance vector (emission probabilities)
[Figure: model for locus 2, with nodes S23m, S23f, L21m, L21f, L22m, L22f, L23m, L23f, X21, X22, X23.]
Model for locus 2
$$P(x_{21}, x_{22}, x_{23} \mid s_{23m}, s_{23f}) = \sum_{l_{21m}, l_{21f}, l_{22m}, l_{22f}} P(l_{21m}) \, P(l_{21f}) \, P(l_{22m}) \, P(l_{22f}) \, P(x_{21} \mid l_{21m}, l_{21f}) \, P(x_{22} \mid l_{22m}, l_{22f}) \sum_{l_{23m}, l_{23f}} P(x_{23} \mid l_{23m}, l_{23f}) \, P(l_{23m} \mid l_{21m}, l_{21f}, s_{23m}) \, P(l_{23f} \mid l_{22m}, l_{22f}, s_{23f})$$
The five last terms are always zero-or-one, namely, indicator functions.
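For a single trio the sum can be evaluated by brute-force enumeration of the founders' ordered alleles. The sketch below follows the slide's dependencies (person 1 is the mother, person 2 the father, person 3 the child), but the function names, the encoding of genotypes as tuples, and the convention that selector value 0 picks a parent's maternal allele and 1 the paternal one are all assumptions made for illustration.

```python
from itertools import product

def unordered(a, b):
    """Unordered genotype as a sorted tuple."""
    return tuple(sorted((a, b)))

def emission_probability(freq, x_mother, x_father, x_child, s_m, s_f):
    """P(x_mother, x_father, x_child | s_m, s_f) for one locus of a trio.

    freq maps allele -> population frequency. A genotype of None means the
    person is untyped. s_m and s_f are the selector bits (0 or 1).
    """
    total = 0.0
    # Enumerate ordered (maternal, paternal) alleles of both founders.
    for mm, mf, fm, ff in product(freq, repeat=4):
        # The child's alleles are fully determined by the selectors.
        child = unordered((mm, mf)[s_m], (fm, ff)[s_f])
        checks = ((x_mother, unordered(mm, mf)),
                  (x_father, unordered(fm, ff)),
                  (x_child, child))
        # Indicator terms: every typed genotype must match the alleles.
        if all(x is None or tuple(sorted(x)) == g for x, g in checks):
            total += freq[mm] * freq[mf] * freq[fm] * freq[ff]
    return total

freq = {"A1": 0.5, "A2": 0.5}
print(emission_probability(freq, ("A1", "A1"), ("A2", "A2"), ("A1", "A2"), 0, 0))
```

Because the last five factors are indicators, the code never multiplies by them explicitly; it just skips allele assignments that are inconsistent with the observed genotypes.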
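Probability of data in the disease locus given the inheritance vector (emission probabilities)
$$P(y_1, y_2, y_3 \mid s_{23m}, s_{23f}) = \sum_{l_{21m}, l_{21f}, l_{22m}, l_{22f}} \ \sum_{l_{23m}, l_{23f}, x_{21}, x_{22}, x_{23}} P(l_{21m}) \, P(l_{21f}) \, P(l_{22m}) \, P(l_{22f}) \, P(x_{21} \mid l_{21m}, l_{21f}) \, P(x_{22} \mid l_{22m}, l_{22f}) \, P(x_{23} \mid l_{23m}, l_{23f}) \, P(l_{23m} \mid l_{21m}, l_{21f}, s_{23m}) \, P(l_{23f} \mid l_{22m}, l_{22f}, s_{23f}) \, P(y_1 \mid x_{21}) \, P(y_2 \mid x_{22}) \, P(y_3 \mid x_{23})$$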
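Finding the best location
[Figure: HMM chain over the markers, with observations X1, X2, ..., Xi-1, Xi, ..., Xlast and hidden states S1, S2, ..., Si-1, Si, ..., Slast; the disease-locus observation Y, with its own hidden state S, is placed between two adjacent markers.]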
Simplest algorithm: For each possible location on the genetic map, place the disease locus there (say, in the middle of the interval between two adjacent markers) and use the forward algorithm to compute the probability of the data given that location. Data here means one assignment for the Xi variables and for Y.
Choose the location with the maximum probability.
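The forward pass at the heart of this procedure can be sketched generically as follows. This is a plain-Python illustration with hypothetical names, using explicit lists for the transition matrices and emission vectors. It is not GeneHunter's implementation, and it makes no attempt at the scaling or state-space reductions a real pedigree would need.

```python
def forward_likelihood(initial, transitions, emissions):
    """Probability of the observed data under an HMM.

    initial[s]           = P(S_1 = s)
    transitions[k][r][s] = P(S_{k+2} = s | S_{k+1} = r), one matrix per interval
    emissions[k][s]      = P(observation at locus k+1 | state s)
    """
    # Initialise with the first locus.
    alpha = [p * e for p, e in zip(initial, emissions[0])]
    # Propagate across each interval, then weight by the next emission.
    for trans, emit in zip(transitions, emissions[1:]):
        alpha = [
            emit[s] * sum(alpha[r] * trans[r][s] for r in range(len(alpha)))
            for s in range(len(emit))
        ]
    return sum(alpha)

T = [[0.9, 0.1], [0.1, 0.9]]
print(forward_likelihood([0.5, 0.5], [T], [[1.0, 0.0], [1.0, 0.0]]))
```

To scan the map with the simplest algorithm, one would rebuild `transitions` and `emissions` with the disease locus inserted at each candidate position, call the function once per position, and keep the position with the largest value.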
Second algorithm: Run the forward-backward algorithm and store the intermediate results. Use these to compute the probability of the data at each location, all at once. Choose the maximum of all options.
At each segment one can try several values for $\theta$ and choose the best, or use EM to learn the best value.
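A sketch of how the stored forward and backward tables can be reused: once they are computed for the marker-only chain, the likelihood with the disease locus inserted into any one interval needs only the two tables at the interval's endpoints. The function names, the list-based representation, and the way the interval is split into a left and a right transition matrix are all assumptions made for illustration.

```python
def forward_table(initial, transitions, emissions):
    """alphas[k][s] = P(observations at loci 0..k, S_k = s)."""
    alphas = [[p * e for p, e in zip(initial, emissions[0])]]
    for trans, emit in zip(transitions, emissions[1:]):
        prev = alphas[-1]
        alphas.append([emit[s] * sum(prev[r] * trans[r][s] for r in range(len(prev)))
                       for s in range(len(emit))])
    return alphas

def backward_table(transitions, emissions):
    """betas[k][r] = P(observations at loci k+1.. | S_k = r)."""
    n_states = len(emissions[-1])
    betas = [[1.0] * n_states]
    for trans, emit in zip(reversed(transitions), reversed(emissions[1:])):
        nxt = betas[0]
        betas.insert(0, [sum(trans[r][s] * emit[s] * nxt[s] for s in range(n_states))
                         for r in range(n_states)])
    return betas

def likelihood_with_disease(alphas, betas, emissions, k,
                            trans_left, trans_right, emit_disease):
    """Likelihood when the disease locus sits between marker k and marker k+1."""
    n_states = len(alphas[k])
    total = 0.0
    for r in range(n_states):
        for u in range(n_states):
            # Left half of the interval, then the disease-locus emission.
            left = alphas[k][r] * trans_left[r][u] * emit_disease[u]
            for s in range(n_states):
                total += left * trans_right[u][s] * emissions[k + 1][s] * betas[k + 1][s]
    return total
```

Each candidate interval then costs a small local sum instead of a full pass over all loci. Trying several values of the unknown recombination fraction only changes `trans_left` and `trans_right`, so the stored tables are reused unchanged.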
Part III: Case study
Werner’s Syndrome
A successful application of genetic linkage analysis using HMM software (GeneHunter)
The Disease
First references in the 1960s
Causes premature ageing
Autosomal recessive
Linkage studies from 1992
WRN gene cloned in 1996
Subsequent discovery of mechanisms involved in wild-type and mutant proteins
One Pedigree’s Data (out of 14)
1 115 0 0 2 1 0 1 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 0 1 0
1 126 0 0 1 1 0 0 1 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 1 0
1 111 0 0 1 1 0 1 0 1 2 0 2 0 3 1 2 1 1 1 3 1 2 1 0 0 1 0 1 0 0 0
1 122 111 115 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 125 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 121 111 115 1 1 2 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 0 1 0 1
1 135 126 122 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 131 121 125 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 141 131 135 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Columns, left to right:
Pedigree number
Individual ID
Father’s ID
Mother’s ID
Sex: 1=male 2=female
Status: 1=healthy 2=diseased
Marker alleles: 0 = unknown marker allele, nonzero = known marker allele
In this data the third column is the father and the fourth the mother: for example, individual 122 lists parents 111 (sex 1) and 115 (sex 2).
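A row of this file can be split into named fields with a few lines of code. The sketch below is illustrative only: the function and key names are hypothetical, and it assumes the column order pedigree, individual, father, mother, sex, status, then marker alleles. That order is inferred from the data itself, where the third-column parent IDs belong to individuals of sex 1.

```python
def parse_pedigree_row(line):
    """Split one whitespace-separated pedigree row into named fields."""
    fields = line.split()
    return {
        "pedigree": fields[0],
        "id": fields[1],
        "father": fields[2],   # "0" marks a founder (no parent in the pedigree)
        "mother": fields[3],
        "sex": int(fields[4]),      # 1 = male, 2 = female
        "status": int(fields[5]),   # 1 = healthy, 2 = diseased
        "alleles": [int(v) for v in fields[6:]],  # 0 = unknown allele
    }

row = parse_pedigree_row("1 141 131 135 2 2 1 2 " + " ".join(["1"] * 24))
print(row["id"], row["status"], len(row["alleles"]))
```

IDs are kept as strings so that they can be used directly as dictionary keys when linking children to their parents.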
Marker File Input
14 0 0 5
0 0.0 0.0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 2
0.995 0.005
1
0 0 1
3 6 # D8S133
0.0200 0.3700 0.4050 0.0050 0.0500 0.0750
...[other 12 markers skipped]...
0 0
10 7.6 7.4 0.9 6.7 1.6 2.5 2.8 2.1 2.8 11.4 1 43.8
1 0.1 0.45
Annotations:
"14": 1 disease locus + 13 markers
"0 0 1": recessive disease, requires 2 mutant genes
"3 6": first marker has 6 alleles
"# D8S133": first marker’s name
"0.0200 0.3700 0.4050 0.0050 0.0500 0.0750": first marker’s founder allele frequencies
"10 7.6 7.4 ... 43.8": recombination distances between markers
Genehunter Output
position   LOD_score    information
 0.00      -1.254417    0.224384
 1.52       2.836135    0.226379
...[other data skipped]...
18.58      13.688599    0.384088
19.92      14.238474    0.401992
21.26      14.718037    0.426818
22.60      15.159389    0.462284
22.92      15.056713    0.462510
23.24      14.928614    0.463208
23.56      14.754848    0.464387
...[other data skipped]...
81.84       1.939215    0.059748
90.60     -11.930449    0.087869
position: putative distance of the disease gene from the first marker, in recombination units.
LOD_score: log likelihood of placing the disease gene at that distance, relative to it being unlinked.
Maximum log likelihood score: 15.159389, at position 22.60 (the most ‘likely’ position).
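Picking out the peak from output of this shape is a one-line maximization once the rows are parsed. The sketch below uses only the rows shown on the slide (the skipped rows are not reconstructed), and the function name is hypothetical.

```python
SHOWN_ROWS = """\
0.00 -1.254417 0.224384
1.52 2.836135 0.226379
18.58 13.688599 0.384088
19.92 14.238474 0.401992
21.26 14.718037 0.426818
22.60 15.159389 0.462284
22.92 15.056713 0.462510
23.24 14.928614 0.463208
23.56 14.754848 0.464387
81.84 1.939215 0.059748
90.60 -11.930449 0.087869
"""

def parse_scores(text):
    """Return a list of (position, lod_score, information) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        rows.append(tuple(float(p) for p in parts))
    return rows

best = max(parse_scores(SHOWN_ROWS), key=lambda row: row[1])
print(best)
```

Among the rows shown, the maximum LOD score is at position 22.60, matching the annotation on the slide.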
Locating the Marker
Marker    Inter-marker distance    Distance from first marker
D8S133    -                         0.0
D8S136    7.6                       7.6
D8S137    7.4                      15.0
D8S131    0.9                      15.9
D8S339    6.7                      22.6
D8S259    1.6                      24.2
FGFR      2.5                      26.7
D8S255    2.8                      29.5
ANK       2.1                      31.6
PLAT      2.8                      34.4
D8S165    11.4                     45.8
D8S166    1.0                      46.8
D8S164    43.8                     90.6
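The distance-from-first-marker column is just the running sum of the inter-marker distances, so the marker nearest to the LOD peak can be looked up programmatically. This is a small sketch with hypothetical variable names. The first marker's name is taken from the marker file (D8S133), and the peak position 22.60 from the GeneHunter output.

```python
from itertools import accumulate

markers = ["D8S133", "D8S136", "D8S137", "D8S131", "D8S339", "D8S259", "FGFR",
           "D8S255", "ANK", "PLAT", "D8S165", "D8S166", "D8S164"]
inter_distances = [7.6, 7.4, 0.9, 6.7, 1.6, 2.5, 2.8, 2.1, 2.8, 11.4, 1.0, 43.8]

# Running sum, rounded to one decimal to hide floating-point noise.
positions = [round(p, 1) for p in accumulate([0.0] + inter_distances)]

peak = 22.60
nearest_marker, nearest_position = min(
    zip(markers, positions), key=lambda mp: abs(mp[1] - peak)
)
print(nearest_marker, nearest_position)
```

The peak coincides with marker D8S339 at 22.6, which is the marker highlighted on the final-location slide.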
Final Location
Marker D8S131
Marker D8S259
location of marker D8S339
WRN Gene final location
Error in location by genetic linkage of about 1.25M base pairs.