Genetic Linkage Analysis using HMMs

Lecture 7

Prepared by Dan Geiger

Outline

Part I: A quick look at the relevant genetics

Part II: The use of HMMs

Part III: Case study: Werner’s syndrome

Gene hunting: find genes responsible for a given disease.

Main idea: If a disease is statistically linked with a marker on a chromosome, then tentatively infer that a gene causing the disease is located near that marker.


Chromosome Logical Structure

Locus – the location of markers on the chromosome.

Allele – one variant form (or state) of a gene/marker at a particular locus.

Locus 1. Possible alleles: A1, A2

Locus 2. Possible alleles: B1, B2, B3

By markers we mean genes, single nucleotide polymorphisms (SNPs), tandem repeats, etc.

Phenotype versus Genotype

The ABO locus determines detectable antigens on the surface of red blood cells.

The 3 major alleles (A, B, O) determine the various ABO blood types.

O is recessive to A and B; equivalently, A and B are dominant over O. Alleles A and B are codominant.

Phenotype   Genotype
A           A/A, A/O
B           B/B, B/O
AB          A/B
O           O/O

Note: genotypes are unordered.
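The phenotype table above can be written as a small function. This is only an illustrative sketch; the function name abo_phenotype is made up for this example.

```python
def abo_phenotype(genotype):
    """Map an unordered ABO genotype, e.g. ("O", "A"), to its blood type."""
    alleles = set(genotype)      # a set ignores order: A/O is the same as O/A
    if alleles == {"A", "B"}:
        return "AB"              # A and B are codominant
    if "A" in alleles:
        return "A"               # A/A or A/O: A is dominant over O
    if "B" in alleles:
        return "B"               # B/B or B/O: B is dominant over O
    return "O"                   # only O/O remains

for g in [("A", "A"), ("A", "O"), ("B", "O"), ("A", "B"), ("O", "O")]:
    print("/".join(g), "->", abo_phenotype(g))
```

Several genotypes map to the same phenotype, which is why a genotype cannot always be read off from a phenotype.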


Recombination Phenomenon

A recombination between 2 genes occurred if the haplotype of the individual contains 2 alleles that resided in different haplotypes in the individual's parent.

(Haplotype – the alleles at different loci that are received by an individual from one parent).

[Figure: a parent (male or female) and its gametes (egg or sperm); homologous chromosomes showing chiasmata, with sister chromatids labeled.]

A chiasma (plural: chiasmata) is the cytological expression of recombination.


Example: ABO and the AK1 marker on Chromosome 9

Recombination fraction = 16/100.

One centimorgan (cM) means one recombination per 100 meioses. In our case the distance is 16 cM.

One centimorgan corresponds to approximately 1M nucleotides (with large variance), depending on location and sex.

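[Figure: pedigree of individuals 1-5. Blood type and AK1 marker genotype per individual, in the order extracted: A, A1/A1; O, A2/A2; A, A1/A2; O, A1/A2; A, A2/A2. Haplotype annotations (ABO alleles over marker alleles): O O over A1 A2; A O over A1 A2; A | O over A2 | A2; O O over A2 A2. Labels: "Recombinant", "Phase inferred".]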


Example for Finding Disease Genes

We use a marker with codominant alleles A1/A2.

We hypothesize a disease locus with alleles H (healthy) / D (affected).

If the expected number of recombinants is low (close to zero), then the hypothesized locus and the marker are tentatively physically close.

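[Figure: the same pedigree of individuals 1-5, now labeled with disease status and marker genotype, in the order extracted: H, A1/A1; D, A2/A2; H, A1/A2; D, A1/A2; H, A2/A2. Haplotype annotations (disease alleles over marker alleles): D D over A1 A2; H D over A1 A2; H | D over A2 | A2; D D over A2 A2. Labels: "Recombinant", "Phase inferred".]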

Recombination cannot be simply counted

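[Figure: the same pedigree of individuals 1-5, in the order extracted: H, A1/A1; H, A2/A2; H, A1/A2; D, A1/A2; H, A2/A2. Haplotype annotations (disease alleles over marker alleles): D D over A1 A2; H D over A1 A2; H | D over A2 | A2. Labels: "Possible Recombinant", "Phase ???".]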

One can compute the probability that a recombination occurred and use this number as if this is the real count.
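A minimal sketch of this idea follows. It assumes a doubly heterozygous parent (H/D at the disease locus, A1/A2 at the marker) whose phase is unknown, a known value of the recombination fraction theta, and that the haplotype each child received from this parent can be observed. These assumptions and the function name are illustrative, not part of the slides.

```python
def expected_recombinants(child_haplotypes, theta):
    """Expected recombinant count among the haplotypes transmitted by one parent.

    child_haplotypes: list of (disease_allele, marker_allele) pairs.
    theta: assumed recombination fraction between the two loci.
    """
    # The two possible phases of the parent; a priori each has probability 1/2.
    phases = [
        {("H", "A1"), ("D", "A2")},
        {("H", "A2"), ("D", "A1")},
    ]
    n = len(child_haplotypes)
    weights, counts = [], []
    for parental in phases:
        # Under this phase, a child haplotype not among the parental ones is recombinant.
        r = sum(1 for h in child_haplotypes if h not in parental)
        weights.append(0.5 * theta**r * (1 - theta)**(n - r))
        counts.append(r)
    total = sum(weights)
    # Average the recombinant count over the posterior distribution of the phase.
    return sum(w * r for w, r in zip(weights, counts)) / total

print(expected_recombinants([("H", "A1"), ("D", "A2"), ("H", "A1")], theta=0.16))
```

With a single child and unknown phase the expected count equals theta itself, so one child alone carries no information about linkage.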


Comments about the example

Often:

Pedigrees are larger and more complex.

Not every individual is typed.

Recombinants cannot always be determined.

There are more markers and they are polymorphic (have more than two alleles).


Genetic Linkage Analysis

The method just described is called genetic linkage analysis. It uses the phenomenon of recombination in families of affected individuals to locate the vicinity of a disease gene.

Genetic distance is measured in centimorgans, and the recombination fraction can differ between males and females.

Next step: Once a suspected area is found, further studies check the 20-50 candidate genes in that area.

$$0\ (\text{linkage}) \;\le\; P(\text{recombination}) \;\le\; 0.5\ (\text{no linkage})$$


Part II: Mathematics and Algorithms


Using the Maximum Likelihood Approach

The probability of pedigree data Pr(data | $\theta$) is a function of the known and unknown recombination fractions (the unknown is denoted by $\theta$).

How can we construct this likelihood function?

The maximum likelihood approach is to seek the value of $\theta$ which maximizes the likelihood function Pr(data | $\theta$). This is the ML estimate.
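As a toy illustration of the ML approach, assume the simplest possible data: n fully informative meioses in which k recombinants were counted directly, so that the likelihood is proportional to $\theta^k (1-\theta)^{n-k}$. Real pedigree likelihoods are more involved. The function names and the grid search below are choices made for this sketch only.

```python
import math

def log_likelihood(theta, recombinants, meioses):
    """log of theta^k * (1 - theta)^(n - k), for 0 < theta <= 0.5."""
    k, n = recombinants, meioses
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

def ml_theta(recombinants, meioses, grid_size=500):
    """Grid search for the maximizing theta over (0, 0.5]."""
    grid = [0.5 * (i + 1) / grid_size for i in range(grid_size)]
    return max(grid, key=lambda t: log_likelihood(t, recombinants, meioses))

def lod(theta, recombinants, meioses):
    """log10 likelihood ratio of theta against no linkage (theta = 0.5)."""
    diff = (log_likelihood(theta, recombinants, meioses)
            - log_likelihood(0.5, recombinants, meioses))
    return diff / math.log(10)

theta_hat = ml_theta(16, 100)   # the ABO / AK1 counts used earlier
print(theta_hat, lod(theta_hat, 16, 100))
```

For directly counted recombinants the maximizer is simply k/n; the grid search is shown because the same pattern still applies when the likelihood has no closed-form maximum.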


Constructing the Likelihood function

First, we determine the variables describing the problem.

Lijm = maternal allele at locus i of person j. The values of this variable are the possible alleles li at locus i.

Lijf = paternal allele at locus i of person j. The values of this variable are the possible alleles li at locus i (same as for Lijm).

Xij = unordered allele pair at locus i of person j (“the genotype”). The values are pairs of ith-locus alleles (li, l’i).

Yj = person j is affected / not affected (“the phenotype”).

Sijm = a binary variable {0,1} that determines which of the mother's two alleles is received from the mother. Similarly, Sijf = a binary variable {0,1} that determines which of the father's two alleles is received from the father.

It remains to specify the joint distribution that governs these variables. HMMs turn out to be a reasonable choice.


The model

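[Figure: network over Locus 1, Locus 2 (Disease), Locus 3 and Locus 4. At each locus i the nodes are Li1m, Li1f, Li2m, Li2f, Li3m, Li3f, Xi1, Xi2, Xi3, Si3m, Si3f; the phenotype nodes Y1, Y2, Y3 are attached at the disease locus.]

This model depicts the qualitative relations between the variables. We will now specify the joint distribution over these variables.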


Probabilistic Model for Recombination

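[Figure: the network for loci 1 and 2, with nodes L11m, L11f, L12m, L12f, L13m, L13f, X11, X12, X13, S13m, S13f at locus 1; L21m, L21f, L22m, L22f, L23m, L23f, X21, X22, X23, S23m, S23f at locus 2; and phenotype nodes Y1, Y2, Y3.]

$$P(s_{23t} \mid s_{13t}, \theta) = \begin{pmatrix} 1-\theta & \theta \\ \theta & 1-\theta \end{pmatrix}, \quad t \in \{m, f\}$$

where $\theta$ is the recombination fraction between loci 2 and 1.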


Details regarding the Loci

The phenotype variables Yj take the values 0 or 1 (e.g., affected or not affected) and are connected to the Xij variables (only at the disease locus). For example, a model of a perfectly recessive disease yields the penetrance probabilities:

P(y11 = sick | X11 = (a,a)) = 1
P(y11 = sick | X11 = (A,a)) = 0
P(y11 = sick | X11 = (A,A)) = 0

[Figure: fragment of the network with nodes Li1m, Li1f, Li3m, Xi1, Si3m and Y1.]

P(L11m = a) is the frequency of allele a.

X11 is an unordered allele pair at locus 1 of person 1 = “the data”.

P(x11 | l11m, l11f) = 0 or 1, depending on consistency.
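The two local probability tables just described are simple enough to write down directly. The sketch below assumes disease alleles named "A" (normal) and "a" (mutant) as in the example above; the function names are hypothetical.

```python
def penetrance_recessive(genotype):
    """P(sick | genotype) for a perfectly recessive disease."""
    # Only the homozygous mutant genotype (a, a) is affected.
    return 1.0 if sorted(genotype) == ["a", "a"] else 0.0

def genotype_consistent(x, l_m, l_f):
    """P(x | l_m, l_f): 1 if the unordered pair x equals {l_m, l_f}, else 0."""
    return 1.0 if sorted(x) == sorted((l_m, l_f)) else 0.0

print(penetrance_recessive(("a", "a")), penetrance_recessive(("A", "a")))
print(genotype_consistent(("A2", "A1"), "A1", "A2"))
```

Sorting both pairs before comparing them is what makes the genotype unordered.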


Hidden Markov Model

In our case:

[Figure: HMM chain with hidden states S1, S2, S3, ..., Si-1, Si, Si+1 and observed variables X1, X2, X3, ..., Yi-1, Xi, Xi+1.]

The compounded variable Si = (Si,1,m, …, Si,n,f) is called the inheritance vector. It has $2^{2n}$ states, where n is the number of persons that have parents in the pedigree (non-founders). The compounded variable Xi = (Xi,1, …, Xi,n) is the data regarding locus i. Similarly, for the disease locus we use Yi.

To specify the HMM we now explicate the transition matrices from Si-1 to Si and the matrices P(xi|Si).
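To make the size of the state space concrete, the inheritance vectors can be enumerated explicitly. This is a small sketch; the function name is made up for illustration.

```python
from itertools import product

def inheritance_vectors(n_nonfounders):
    """All assignments of the 2n selector bits (s_1m, s_1f, ..., s_nm, s_nf)."""
    return list(product((0, 1), repeat=2 * n_nonfounders))

for n in (1, 2, 5):
    print(n, "non-founders ->", len(inheritance_vectors(n)), "hidden states")
```

The count grows as 4^n, which is why this HMM formulation suits pedigrees with few non-founders and many loci.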


The transition matrix

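Recall that we wrote:

$$P(s_{i,j,t} \mid s_{i-1,j,t}, \theta_i) = \begin{pmatrix} 1-\theta_i & \theta_i \\ \theta_i & 1-\theta_i \end{pmatrix}, \quad t \in \{m, f\}$$

All $\theta_i$ are usually known except the one before the disease locus. Extending this matrix to the smallest inheritance vector (n=1), we get (here i = 2):

$$P(s_{23m}, s_{23f} \mid s_{13m}, s_{13f}, \theta_i) = \begin{array}{c|cccc} & 00 & 01 & 10 & 11 \\ \hline 00 & (1-\theta_i)^2 & \theta_i(1-\theta_i) & \theta_i(1-\theta_i) & \theta_i^2 \\ 01 & \theta_i(1-\theta_i) & (1-\theta_i)^2 & \theta_i^2 & \theta_i(1-\theta_i) \\ 10 & \theta_i(1-\theta_i) & \theta_i^2 & (1-\theta_i)^2 & \theta_i(1-\theta_i) \\ 11 & \theta_i^2 & \theta_i(1-\theta_i) & \theta_i(1-\theta_i) & (1-\theta_i)^2 \end{array}$$

Let d = Hamming distance between state $s_{i-1}$ and state $s_i$. Then the transition probability is given by $\theta_i^{d}(1-\theta_i)^{2n-d}$.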
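The Hamming-distance rule gives a direct way to build the transition matrix for any number of non-founders. The following is a minimal sketch with a hypothetical function name; it stores the full matrix, which is only practical for small n.

```python
from itertools import product

def transition_matrix(theta, n_nonfounders):
    """Return (states, matrix) with matrix[a][b] = theta^d * (1 - theta)^(2n - d)."""
    bits = 2 * n_nonfounders
    states = list(product((0, 1), repeat=bits))
    matrix = []
    for prev in states:
        row = []
        for cur in states:
            # d = number of selector bits that changed between adjacent loci
            d = sum(a != b for a, b in zip(prev, cur))
            row.append(theta**d * (1 - theta)**(bits - d))
        matrix.append(row)
    return states, matrix

states, T = transition_matrix(0.1, 1)
for state, row in zip(states, T):
    print(state, [round(p, 4) for p in row])
```

Each row sums to 1, because summing $\theta^d(1-\theta)^{2n-d}$ over all 2n-bit patterns gives $(\theta + (1-\theta))^{2n}$.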


Probability of data in one locus given the inheritance vector (emission probabilities)

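[Figure: model for locus 2, with nodes L21m, L21f, L22m, L22f, L23m, L23f, X21, X22, X23, S23m, S23f.]

$$P(x_{21}, x_{22}, x_{23} \mid s_{23m}, s_{23f}) = \sum_{l_{21m}, l_{21f}, l_{22m}, l_{22f}} P(l_{21m})\, P(l_{21f})\, P(l_{22m})\, P(l_{22f})\, P(x_{21} \mid l_{21m}, l_{21f})\, P(x_{22} \mid l_{22m}, l_{22f}) \sum_{l_{23m}, l_{23f}} P(x_{23} \mid l_{23m}, l_{23f})\, P(l_{23m} \mid l_{21m}, l_{21f}, s_{23m})\, P(l_{23f} \mid l_{22m}, l_{22f}, s_{23f})$$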

The five last terms are always zero-or-one, namely, indicator functions.
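The sum above can be evaluated by brute force for a single trio. The sketch below assumes that a selector value 0 picks the parent's maternal allele and 1 the paternal allele; the slides do not fix this convention. Untyped individuals are passed as None, and all names are illustrative.

```python
from itertools import product

def emission_probability(x1, x2, x3, s_m, s_f, allele_freq):
    """P(x1, x2, x3 | s_m, s_f) at one locus; persons 1 and 2 are the parents of 3.

    x1, x2, x3: unordered genotypes (pairs of alleles), or None if untyped.
    s_m, s_f: selector bits (assumed: 0 = parent's maternal allele, 1 = paternal).
    allele_freq: dict mapping each allele to its founder frequency.
    """
    def consistent(x, a, b):
        # Indicator function: an untyped person is consistent with anything.
        return x is None or sorted(x) == sorted((a, b))

    total = 0.0
    for l1m, l1f, l2m, l2f in product(list(allele_freq), repeat=4):
        if not (consistent(x1, l1m, l1f) and consistent(x2, l2m, l2f)):
            continue
        # The child's alleles are determined by the founders' alleles and the selectors.
        l3m = (l1m, l1f)[s_m]
        l3f = (l2m, l2f)[s_f]
        if consistent(x3, l3m, l3f):
            total += (allele_freq[l1m] * allele_freq[l1f]
                      * allele_freq[l2m] * allele_freq[l2f])
    return total

freq = {"A1": 0.5, "A2": 0.5}
print(emission_probability(("A1", "A1"), ("A2", "A2"), ("A1", "A2"), 0, 0, freq))
```

Because the last five factors are indicators, the sum reduces to adding founder allele frequencies over the consistent assignments, which is what the loop does.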

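Finding the best location

[Figure: HMM chain with hidden states S1, S2, ..., Si, ..., Slast and observations X1, X2, ..., Xi, ..., Xlast, with an additional hidden state S and the phenotype observation Y placed at a candidate location.]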

Simplest algorithm: For each possible location on the genetic map, place the disease locus there (say, in the middle of the interval between two adjacent markers) and compute, using the forward algorithm, the probability of the data given that location. Data here means one assignment for the Xi variables and for Y.

Choose the maximum of all options.
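The forward computation used by the simplest algorithm can be sketched generically. Here the hidden states are indexed 0..K-1 (standing for the inheritance vectors), and the initial distribution, transition matrices and emission vectors are assumed to be given; this input layout is an assumption of the sketch, not the interface of any particular software.

```python
def forward_likelihood(initial, transitions, emissions):
    """P(data) for a chain of loci, computed with the forward algorithm.

    initial[s]           = P(S_1 = s)
    transitions[k][a][b] = P(next state = b | current state = a) between loci k and k+1
    emissions[k][s]      = P(data at locus k | state s)
    """
    n_states = len(initial)
    alpha = [initial[s] * emissions[0][s] for s in range(n_states)]
    for k in range(1, len(emissions)):
        T = transitions[k - 1]
        alpha = [
            emissions[k][b] * sum(alpha[a] * T[a][b] for a in range(n_states))
            for b in range(n_states)
        ]
    return sum(alpha)

T = [[0.9, 0.1], [0.1, 0.9]]
print(forward_likelihood([0.5, 0.5], [T], [[1.0, 0.0], [0.0, 1.0]]))
```

To scan candidate locations, one would insert the disease locus at each location in turn (splitting the transition on that interval in two) and call this function once per location. For long chains the alpha values should be rescaled or kept in log space to avoid underflow.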


Second algorithm: Run the forward-backward algorithm and store intermediate results. Use these to compute the probability of the data at each location, all at once. Choose the maximum of all options.

At each segment one can try several values for $\theta$ and choose the best, or use EM to learn the best value.


Part III: Case study

Werner’s Syndrome

A successful application of genetic linkage analysis using HMM software (GeneHunter)


The Disease

First references in the 1960s

Causes premature ageing

Autosomal recessive

Linkage studies from 1992

WRN gene cloned in 1996

Subsequent discovery of mechanisms involved in wild-type and mutant proteins


One Pedigree’s Data (out of 14)

1 115 0 0 2 1 0 1 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 0 1 0
1 126 0 0 1 1 0 0 1 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 1 0
1 111 0 0 1 1 0 1 0 1 2 0 2 0 3 1 2 1 1 1 3 1 2 1 0 0 1 0 1 0 0 0
1 122 111 115 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 125 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 121 111 115 1 1 2 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 0 1 0 1
1 135 126 122 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 131 121 125 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 141 131 135 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Columns, left to right: pedigree number; individual ID; father’s ID; mother’s ID; sex (1=male, 2=female); status (1=healthy, 2=diseased); then two alleles per marker (0 = unknown marker allele, nonzero = known marker allele).
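A row of this pedigree data can be split into named fields as sketched below. The column order (father's ID before mother's ID) is an assumption, supported by the data itself: the third column of individual 122 refers to individual 111, whose sex is 1 (male). The function and field names are made up for this example.

```python
def parse_pedigree_row(line):
    """Split one whitespace-separated pedigree row into named fields."""
    fields = line.split()
    ped, ind, father, mother, sex, status = (int(v) for v in fields[:6])
    alleles = [int(v) for v in fields[6:]]
    # Two alleles per marker; 0 marks an unknown allele.
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return {
        "pedigree": ped,
        "id": ind,
        "father": father,   # 0 means the parent is not in the pedigree (founder)
        "mother": mother,
        "sex": "male" if sex == 1 else "female",
        "affected": status == 2,
        "genotypes": genotypes,
    }

row = parse_pedigree_row(
    "1 111 0 0 1 1 0 1 0 1 2 0 2 0 3 1 2 1 1 1 3 1 2 1 0 0 1 0 1 0 0 0"
)
print(row["id"], row["sex"], row["affected"], len(row["genotypes"]))
```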


Marker File Input

14 0 0 5
0 0.0 0.0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 2
0.995 0.005
1
0 0 1
3 6 # D8S133
0.0200 0.3700 0.4050 0.0050 0.0500 0.0750

...[other 12 markers skipped]...

0 0
10 7.6 7.4 0.9 6.7 1.6 2.5 2.8 2.1 2.8 11.4 1 43.8
1 0.1 0.45

Annotations:

14 (first line): 1 disease locus + 13 markers.

0 0 1: recessive disease, requires 2 mutant genes.

3 6: first marker has 6 alleles.

# D8S133: first marker’s name.

0.0200 0.3700 ...: first marker founder allele frequencies.

7.6 7.4 0.9 ...: recombination distances between markers.


Genehunter Output

position   LOD_score    information
0.00       -1.254417    0.224384
1.52        2.836135    0.226379
...[other data skipped]...
18.58      13.688599    0.384088
19.92      14.238474    0.401992
21.26      14.718037    0.426818
22.60      15.159389    0.462284
22.92      15.056713    0.462510
23.24      14.928614    0.463208
23.56      14.754848    0.464387
...[other data skipped]...
81.84       1.939215    0.059748
90.60     -11.930449    0.087869

position: putative distance of the disease gene from the first marker, in recombination units.

LOD_score: log likelihood of placing the disease gene at that distance, relative to it being unlinked.

Maximum log likelihood score: 15.159389, at the most ‘likely’ position, 22.60.
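Picking the most likely position from such output amounts to taking the row with the largest LOD score. The sketch below uses only the rows shown on the slide (the skipped rows are not available), and the function name is hypothetical.

```python
GENEHUNTER_ROWS = """\
0.00 -1.254417 0.224384
1.52 2.836135 0.226379
18.58 13.688599 0.384088
19.92 14.238474 0.401992
21.26 14.718037 0.426818
22.60 15.159389 0.462284
22.92 15.056713 0.462510
23.24 14.928614 0.463208
23.56 14.754848 0.464387
81.84 1.939215 0.059748
90.60 -11.930449 0.087869
"""

def best_position(text):
    """Return the (position, LOD score, information) row with the highest LOD score."""
    rows = [
        tuple(float(v) for v in line.split())
        for line in text.splitlines()
        if line.strip()
    ]
    return max(rows, key=lambda row: row[1])

position, lod_score, information = best_position(GENEHUNTER_ROWS)
print(position, lod_score)
```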

Locating the Marker

Marker    Inter-distance    Distance from first
D8S133                       0.0
D8S136     7.6               7.6
D8S137     7.4              15.0
D8S131     0.9              15.9
D8S339     6.7              22.6
D8S259     1.6              24.2
FGFR       2.5              26.7
D8S255     2.8              29.5
ANK        2.1              31.6
PLAT       2.8              34.4
D8S165    11.4              45.8
D8S166     1.0              46.8
D8S164    43.8              90.6
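The "distance from first" column is the running sum of the inter-marker distances, and the marker closest to the peak position can be found by comparing against it. In this sketch the first marker is written as D8S133, following the marker file; the function name is made up.

```python
from itertools import accumulate

MARKERS = ["D8S133", "D8S136", "D8S137", "D8S131", "D8S339", "D8S259", "FGFR",
           "D8S255", "ANK", "PLAT", "D8S165", "D8S166", "D8S164"]
INTER_DISTANCES = [7.6, 7.4, 0.9, 6.7, 1.6, 2.5, 2.8, 2.1, 2.8, 11.4, 1.0, 43.8]

# Cumulative distance of each marker from the first one.
POSITIONS = [0.0] + list(accumulate(INTER_DISTANCES))

def nearest_marker(position):
    """Name of the marker whose map position is closest to the given position."""
    return min(zip(MARKERS, POSITIONS), key=lambda mp: abs(mp[1] - position))[0]

for name, pos in zip(MARKERS, POSITIONS):
    print(f"{name:8s} {pos:5.1f}")
print("Nearest to peak at 22.60:", nearest_marker(22.60))
```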


Final Location

[Figure: map of the region showing marker D8S131, the location of marker D8S339 and marker D8S259, together with the final location of the WRN gene.]

The error in the location found by genetic linkage was about 1.25M base pairs.
