computational problems in genetic linkage analysis dan geiger cs, technion

Post on 09-Jan-2016



Computational Problems in Genetic Linkage Analysis

Dan Geiger, CS, Technion

This talk is based mainly on work with Ma'ayan Fishelson. See bioinfo.cs.technion.ac.il/superlink/ for more details. Some slides are due to Gideon Greenspan.

Course homepage: http://www.cs.technion.ac.il/~fmaayan/cs236524/

2

Requirements

• One homework assignment (10%).

• Mid-term progress report.

• Submission in pairs.

• Well-documented, tested, useful program.

• Self-initiative (by reading and thinking) will be rewarded.

• One or two excellent projects may be selected for continuation next semester under a special-projects course, if desired.

4

Outline

Gene hunting: find genes responsible for a given disease.

Main idea: if a disease is statistically linked with a marker on a chromosome, then tentatively infer that a gene causing the disease is located near that marker.

Part I: Quick look at relevant genetics

Part II: Case study: Werner’s syndrome

Part III: Relevant mathematics

Part IV: Software description / algorithms

5

Human Genome

Most human cells contain 46 chromosomes:

• 2 sex chromosomes (X, Y): XY in males, XX in females.

• 22 pairs of chromosomes named autosomes.

6

Chromosome Logical Structure

Marker – genes, single nucleotide polymorphisms, tandem repeats, etc.

Locus – the location of a marker on the chromosome.

Allele – one variant form (or state) of a gene/marker at a particular locus.

Locus 1: possible alleles A1, A2

Locus 2: possible alleles B1, B2, B3

7

Alleles

b – dominant allele. Namely, (b,b) and (b,w) are Black.

w – recessive allele. Namely, only (w,w) is White.

This is an example of an X-linked trait. For males, b alone is Black and w alone is White.

[Figure labels: genotype, phenotype]

8

Genotypes versus Phenotypes

At each locus (except for sex chromosomes) there are 2 genes. These constitute the individual’s genotype at the locus.

The expression of a genotype is termed a phenotype. For example, hair color, weight, or the presence or absence of a disease.

10

Recombination Phenomenon

A recombination between 2 genes occurred if the haplotype of the individual contains 2 alleles that resided in different haplotypes in the individual's parent.

(Haplotype – the alleles at different loci that are received by an individual from one parent.)

[Figure labels: male or female; gametes (sex cells): egg or sperm]

11

An example: the ABO locus. The ABO locus determines detectable antigens on the surface of red blood cells.

The 3 major alleles (A,B,O) interact to determine the various ABO blood types.

O is recessive to A and B. Alleles A and B are codominant.

Phenotype    Genotype
A            A/A, A/O
B            B/B, B/O
AB           A/B
O            O/O

Note: genotypes are unordered.
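The ABO dominance rules above can be written as a small lookup. This is only an illustrative sketch; the function name `abo_phenotype` is made up for this example and does not come from the talk.

```python
def abo_phenotype(genotype):
    """Map an unordered ABO genotype, e.g. ("O", "A"), to its blood type."""
    alleles = set(genotype)
    if len(genotype) != 2 or not alleles <= {"A", "B", "O"}:
        raise ValueError("genotype must be a pair of alleles from A, B, O")
    if alleles == {"A", "B"}:
        return "AB"   # A and B are codominant
    if "A" in alleles:
        return "A"    # A/A or A/O: O is recessive to A
    if "B" in alleles:
        return "B"    # B/B or B/O: O is recessive to B
    return "O"        # only O/O shows the O phenotype
```

Because the genotype is turned into a set, `("A", "O")` and `("O", "A")` give the same answer, which mirrors the note that genotypes are unordered.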

12

Example: ABO, AK1 on Chromosome 9

The male recombination fraction is 12/100 and the female one is 20/100. Genetic distance is measured in centimorgans (cM): one centimorgan means one expected recombination per 100 meioses.

One centimorgan corresponds to approximately 1M nucleotides (with large variance), depending on the location and sex.

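[Figure: pedigree with individuals 1-5, each labelled with an ABO phenotype (A or O) and an AK1 genotype (A1/A1, A2/A2 or A1/A2). Haplotypes with inferred phase are shown for the offspring, and one offspring is marked as recombinant.]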

13

Example for Finding Disease Genes

We use a marker with codominant alleles A1/A2.

We speculate a locus with alleles H (healthy) / D (affected).

If the expected number of recombinants is low (close to zero), then the speculated locus and the marker are tentatively inferred to be physically close.

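[Figure: pedigree with individuals 1-5, each labelled with a disease status (H or D) and a marker genotype (A1/A1, A2/A2 or A1/A2). Haplotypes with inferred phase are shown for the offspring, and one offspring is marked as recombinant.]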

14

Recombination cannot be simply counted

[Figure: the same pedigree with one individual's disease status changed from D to H. The phase can no longer be inferred ("Phase ???"), so an offspring is only a possible recombinant.]

One can compute the probability that a recombination occurred and use this number as if this is the real count.
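One way to read this statement is as an expected count. The sketch below is an illustration under stated assumptions, not the method used in the talk: a single doubly heterozygous parent with two equally likely phases, n children, of which k would be recombinant under the first phase and n - k under the second. The function name and arguments are hypothetical.

```python
def expected_recombinants(k, n, theta):
    """Expected number of recombinant children when parental phase is unknown.

    Under phase 1, k of the n children are recombinant; under phase 2,
    the other n - k are. Both phases are equally likely a priori.
    """
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    like1 = theta**k * (1 - theta)**(n - k)    # likelihood of the data under phase 1
    like2 = theta**(n - k) * (1 - theta)**k    # likelihood of the data under phase 2
    post1 = like1 / (like1 + like2)            # posterior probability of phase 1
    return post1 * k + (1 - post1) * (n - k)
```

When theta is small and k is much smaller than n - k, the posterior concentrates on phase 1 and the expected count approaches k, which is what direct counting would give if the phase were known.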

15

Comments about the example

Often:

• Pedigrees are larger and more complex.

• Not every individual is typed.

• Recombinants cannot always be determined.

• There are more markers and they are polymorphic (have more than two alleles).

16

Genetic Linkage Analysis

The method just described is called genetic linkage analysis. It uses the phenomenon of recombination in families of affected individuals to locate the vicinity of a disease gene.

Recombination distance is measured in centimorgans and can differ between males and females.

Next step: Once a suspected area is found, further studies check the 20-50 candidate genes in that area.

$0 \;(\text{linkage}) \;\le\; P(\text{recombination}) \;\le\; 0.5 \;(\text{no linkage})$

17

Part II: Case study

Werner’s Syndrome

A successful application of genetic linkage analysis

18

The Disease

• First references in 1960s

• Causes premature ageing

• Autosomal recessive

• Linkage studies from 1992

• WRN gene cloned in 1996

• Subsequent discovery of mechanisms involved in wild-type and mutant proteins

19

One Pedigree’s Data (out of 14)

    1 115 0 0 2 1 0 1 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 0 1 0
    1 126 0 0 1 1 0 0 1 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 1 0
    1 111 0 0 1 1 0 1 0 1 2 0 2 0 3 1 2 1 1 1 3 1 2 1 0 0 1 0 1 0 0 0
    1 122 111 115 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1 125 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1 121 111 115 1 1 2 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 0 1 0 1
    1 135 126 122 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1 131 121 125 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1 141 131 135 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Columns, left to right:

• Pedigree number

• Individual ID

• Father’s ID and mother’s ID (0 for founders)

• Sex: 1=male 2=female

• Status: 1=healthy 2=diseased

• Marker alleles, two per marker: 0 = unknown marker allele, non-zero = known marker allele
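A minimal sketch of reading one such pedigree row. It assumes the column order pedigree, individual, father, mother, sex, status, followed by allele pairs; the father-before-mother order is inferred from the data (individual 111 is male and is listed first among the parents of 122). The names `Person` and `parse_pedigree_line` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Person:
    pedigree: int
    ident: int
    father: int      # 0 for founders (assumed column order)
    mother: int      # 0 for founders
    sex: int         # 1 = male, 2 = female
    status: int      # 1 = healthy, 2 = diseased
    genotypes: list  # one (allele, allele) pair per marker; 0 = unknown


def parse_pedigree_line(line):
    """Split one whitespace-separated pedigree row into a Person record."""
    fields = [int(tok) for tok in line.split()]
    head, alleles = fields[:6], fields[6:]
    if len(head) < 6 or len(alleles) % 2:
        raise ValueError("expected 6 leading fields and alleles in pairs")
    pairs = list(zip(alleles[0::2], alleles[1::2]))
    return Person(*head, genotypes=pairs)
```

For a full row from the slide, `len(person.genotypes)` would be 13, one pair per marker.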

20

Marker File Input

    14 0 0 5
    0 0.0 0.0 0
    1 2 3 4 5 6 7 8 9 10 11 12 13 14
    1 2
    0.995 0.005
    1
    0 0 1
    3 6 # D8S133
    0.0200 0.3700 0.4050 0.0050 0.0500 0.0750

    ...[other 12 markers skipped]...

    0 0
    10 7.6 7.4 0.9 6.7 1.6 2.5 2.8 2.1 2.8 11.4 1 43.8
    1 0.1 0.45

Annotations:

• "14" (first line): 1 disease locus + 13 markers.

• "0 0 1": recessive disease, requires 2 mutant genes.

• "3 6": first marker has 6 alleles.

• "# D8S133": first marker’s name.

• "0.0200 0.3700 0.4050 ...": first marker's founder allele frequencies.

• "10 7.6 7.4 ...": recombination distances between markers.

21

Genehunter Output

    position   LOD_score     information
    0.00       -1.254417     0.224384
    1.52        2.836135     0.226379
    ...[other data skipped]...
    18.58      13.688599     0.384088
    19.92      14.238474     0.401992
    21.26      14.718037     0.426818
    22.60      15.159389     0.462284
    22.92      15.056713     0.462510
    23.24      14.928614     0.463208
    23.56      14.754848     0.464387
    ...[other data skipped]...
    81.84       1.939215     0.059748
    90.60     -11.930449     0.087869

• position: putative distance of the disease gene from the first marker, in recombination units.

• LOD_score: log likelihood of placing the disease gene at that distance, relative to it being unlinked.

• Maximum log likelihood score: 15.159389.

• Most ‘likely’ position: 22.60.
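Picking the most likely position is a maximum over the LOD column. The sketch below uses only the rows shown on the slide (the skipped rows are not available); the names `OUTPUT` and `best_position` are made up for this example.

```python
# Rows copied from the slide: position, LOD score, information content.
OUTPUT = """\
0.00 -1.254417 0.224384
1.52 2.836135 0.226379
18.58 13.688599 0.384088
19.92 14.238474 0.401992
21.26 14.718037 0.426818
22.60 15.159389 0.462284
22.92 15.056713 0.462510
23.24 14.928614 0.463208
23.56 14.754848 0.464387
81.84 1.939215 0.059748
90.60 -11.930449 0.087869
"""


def best_position(text):
    """Return (position, LOD score) of the row with the highest LOD score."""
    rows = [tuple(float(x) for x in line.split())
            for line in text.splitlines() if line.strip()]
    position, lod, _information = max(rows, key=lambda row: row[1])
    return position, lod
```

On these rows the maximum is at position 22.60 with LOD score 15.159389, matching the callouts on the slide.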

22

Final Location

[Figure: map of the region showing marker D8S131, marker D8S259, the location of marker D8S339, and the final location of the WRN gene.]

Error in location by genetic linkage of about 1.25M base pairs.

23

Part III: Relevant Mathematics

24

The Maximum Likelihood Approach

The probability of pedigree data Pr(data | θ) is a function of the known and unknown recombination fractions denoted collectively by θ.

How can we construct this likelihood function?

The maximum likelihood approach is to seek the value of θ which maximizes the likelihood function Pr(data | θ). This value is called the ML estimate.

The main computational difficulty is to compute Pr(data | θ) for a specific value of θ.
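To make the ML idea concrete, here is a toy sketch under a strong simplifying assumption that is not part of the talk: every meiosis can be scored directly, so the data are just k recombinants out of n meioses and the likelihood is θ^k (1-θ)^(n-k). The function names and the grid search are illustrative choices.

```python
import math


def log_likelihood(theta, k, n):
    """Log of theta**k * (1 - theta)**(n - k), for 0 < theta < 1."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)


def ml_estimate(k, n, steps=500):
    """Grid search for the theta in (0, 0.5] that maximizes the likelihood."""
    grid = [0.5 * i / steps for i in range(1, steps + 1)]
    return max(grid, key=lambda theta: log_likelihood(theta, k, n))
```

With 12 recombinants in 100 meioses the search returns a value of about 0.12, i.e. k/n. If k/n exceeds 0.5 the estimate sits at the boundary 0.5, which corresponds to no linkage. In real pedigrees the likelihood has no such closed form, which is why computing Pr(data | θ) is the hard part.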

25

Constructing the Likelihood function

First, we need to determine the variables that describe the problem. There are several possible choices. Some variables we can observe and some we cannot.

Lijm (Lijf) = maternal (paternal) allele at locus i of person j. The values of these variables are the possible alleles li at locus i.

Xij = unordered allele pair at locus i of person j. The values are pairs of ith-locus alleles (li, l’i).

26

Constructing the Likelihood function

[Figure: network fragment with nodes L11m, L11f, X11, S13m and L13m. S13m is the selector of the maternal allele at locus 1 of person 3; L13m is the maternal allele at locus 1 of person 3 (offspring).]

Selector variables Sijm are 0 or 1 depending on which of the mother's two alleles at locus i (her maternal or her paternal one) is transmitted to offspring j.

P(s13m) = ½

P(l13m | l11m, l11f, S13m=0) = 1 if l13m = l11m

P(l13m | l11m, l11f, S13m=1) = 1 if l13m = l11f

P(l13m | l11m, l11f, s13m) = 0 otherwise
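The transmission table above is deterministic once the selector is fixed. A small sketch, with hypothetical function names, where `mat` and `pat` stand for the mother's alleles l11m and l11f:

```python
def p_transmit(child, mat, pat, selector):
    """P(l13m = child | l11m = mat, l11f = pat, s13m = selector)."""
    source = mat if selector == 0 else pat
    return 1.0 if child == source else 0.0


def p_child_allele(child, mat, pat):
    """Sum out the selector, using P(s13m = 0) = P(s13m = 1) = 1/2."""
    return sum(0.5 * p_transmit(child, mat, pat, s) for s in (0, 1))
```

A heterozygous mother passes each of her two alleles with probability 1/2, and a homozygous mother passes her single allele with probability 1.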

27

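Probabilistic model for two loci

[Figure: two copies of the same network, one per locus. Model for locus 1: parental allele nodes L11m, L11f, L12m, L12f; offspring allele nodes L13m, L13f; selector nodes S13m, S13f; genotype nodes X11, X12, X13. Model for locus 2: the corresponding nodes L21m, L21f, L22m, L22f, L23m, L23f, S23m, S23f, X21, X22, X23.]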

28

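Probabilistic model for Recombination

[Figure: the locus 1 and locus 2 networks, now connected through their selector variables (S13m with S23m, S13f with S23f). Figure labels: females, males.]

$$P(s_{23t} \mid s_{13t}, \theta_2) = \begin{pmatrix} 1-\theta_2 & \theta_2 \\ \theta_2 & 1-\theta_2 \end{pmatrix}, \quad \text{where } t \in \{m, f\}$$

θ2 is called the recombination fraction between loci 2 & 1.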
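The selector at locus 2 copies the selector at locus 1 unless a recombination occurs. A minimal sketch of that conditional table, written as a nested list; the function name is hypothetical and the bound check simply enforces the 0 to 0.5 range for a recombination fraction.

```python
def recombination_table(theta):
    """Rows: s13t in {0, 1}; columns: s23t in {0, 1}; entries: P(s23t | s13t, theta)."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("a recombination fraction lies between 0 and 0.5")
    return [[1 - theta, theta],
            [theta, 1 - theta]]
```

At theta = 0 the two selectors are always equal (complete linkage); at theta = 0.5 every entry is 0.5, so the selectors are independent (no linkage).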

29

The Data

The data consists of an assignment to a subset of the variables {Xij}. In other words some (or all) persons are genotyped at some (or all) loci.

31

Constructing the likelihood function

The joint probability is the product over all local probability tables:

$$P(l_{11m}, l_{11f}, x_{11}, l_{12m}, l_{12f}, x_{12}, l_{13m}, l_{13f}, x_{13}, l_{21m}, l_{21f}, x_{21}, l_{22m}, l_{22f}, x_{22}, l_{23m}, l_{23f}, x_{23}, s_{13m}, s_{13f}, s_{23m}, s_{23f} \mid \theta_2)$$

$$= P(l_{11m})\, P(l_{11f})\, P(x_{11} \mid l_{11m}, l_{11f}) \cdots P(s_{13m})\, P(s_{13f})\, P(s_{23m} \mid s_{13m}, \theta_2)\, P(s_{23f} \mid s_{13f}, \theta_2)$$

The probability of the data is the sum over all states of all hidden variables:

$$\text{Prob}(\text{data} \mid \theta_2) = P(x_{11}, x_{12}, x_{13}, x_{21}, x_{22}, x_{23}) = \sum_{l_{11m},\, l_{11f},\, \ldots,\, s_{23f}} \big[ P(l_{11m})\, P(l_{11f})\, P(x_{11} \mid l_{11m}, l_{11f}) \cdots P(s_{13m})\, P(s_{13f})\, P(s_{23m} \mid s_{13m}, \theta_2)\, P(s_{23f} \mid s_{13f}, \theta_2) \big]$$

The result is a function of the recombination fraction. The ML estimate is the θ2 value that maximizes this function.
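A much smaller instance of the same "sum over hidden variables" idea, as a hedged sketch: one meiosis, two loci, a mother whose two haplotypes are known, and the two maternal selectors as the only hidden variables. The function name and the tuple encoding of haplotypes are assumptions made for this illustration.

```python
from itertools import product


def haplotype_likelihood(child, mother, theta):
    """P(child's maternal haplotype | mother's two haplotypes, theta).

    mother = ((a1, b1), (a2, b2)) holds her maternal and paternal
    haplotypes over loci 1 and 2; child = (a, b). The hidden selectors
    s1 (locus 1) and s2 (locus 2) are summed out.
    """
    total = 0.0
    for s1, s2 in product((0, 1), repeat=2):
        p_s1 = 0.5                                 # P(s1)
        p_s2 = 1 - theta if s2 == s1 else theta    # P(s2 | s1, theta)
        transmitted = (mother[s1][0], mother[s2][1])
        if transmitted == child:                   # deterministic transmission
            total += p_s1 * p_s2
    return total
```

For a doubly heterozygous mother, a non-recombinant haplotype gets probability (1-θ)/2 and a recombinant one θ/2. The full pedigree likelihood has the same structure, only with far more hidden variables, which is what makes naive enumeration infeasible.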

32

Modeling Phenotypes

[Figure: the network fragment with nodes L11m, L11f, X11, S13m and L13m, extended with a phenotype node Y11.]

Phenotype variables Yij are 0 or 1 depending on whether a phenotypic trait associated with locus i of person j is observed, e.g., sick versus healthy. For example, a model of a perfectly recessive disease yields the penetrance probabilities:

P(y11 = sick | X11 = (a,a)) = 1

P(y11 = sick | X11 = (A,a)) = 0

P(y11 = sick | X11 = (A,A)) = 0
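The penetrance probabilities of the perfectly recessive model fit in a three-entry table. A minimal sketch with hypothetical names; sorting the pair makes the lookup independent of allele order, since the genotype is unordered.

```python
# P(sick | genotype) for a perfectly recessive disease with alleles A (normal) and a (mutant).
PENETRANCE = {
    ("A", "A"): 0.0,
    ("A", "a"): 0.0,
    ("a", "a"): 1.0,
}


def p_phenotype(sick, genotype):
    """P(y = sick or healthy | x = genotype) under the table above."""
    p_sick = PENETRANCE[tuple(sorted(genotype))]   # "A" sorts before "a"
    return p_sick if sick else 1.0 - p_sick
```

Less clear-cut diseases would replace the 0 and 1 entries with intermediate probabilities; the structure of the table stays the same.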

33

Standard usage of linkage

There are usually 5-15 markers. 20-30% of the persons in large pedigrees are genotyped (namely, their xij is measured). For each genotyped person about 90% of the loci are measured correctly. The recombination fraction between every two loci is known from previous studies (available genetic maps).

The user adds a locus called the “disease locus” and places it between two markers i and i+1. The recombination fraction θ′ between the disease locus and marker i and θ″ between the disease locus and marker i+1 are the unknown parameters being estimated using the likelihood function.

This computation is done for every gap between the given markers on the map. The MLE hints at the whereabouts of a single gene causing the disease (if a single one exists).

34

Part IV: Software and Algorithms

• Fastlink v4.1 (each person’s genotype is one variable)

• Vitesse v1, v2 (only loopless Bayesian networks allowed)

• GeneHunter/Allegro (exponential in number of persons)

• Many more specific packages (e.g., affected siblings)

• Superlink: why is it better?

For a list, see http://linkage.rockefeller.edu/soft/list.html

35

SUPERLINK

Stage 1: each pedigree is translated into a Bayesian network.

Stage 2: value elimination is performed on each pedigree (i.e., some of the impossible values of the variables of the network are eliminated).

Stage 3: an elimination order for the variables is determined, according to some heuristic.

Stage 4: the likelihood of the pedigrees given the θ values is calculated using variable elimination according to the elimination order determined in stage 3.
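Variable elimination itself can be shown on a toy network. The sketch below is a generic textbook illustration, not Superlink's code: a three-variable chain A → B → C with made-up binary probability tables, where A and then B are summed out to obtain P(C), and the result is compared with brute-force enumeration.

```python
from itertools import product

# Made-up conditional probability tables for a chain A -> B -> C (all binary).
p_a = [0.6, 0.4]
p_b_given_a = [[0.7, 0.3], [0.2, 0.8]]   # rows: a, columns: b
p_c_given_b = [[0.9, 0.1], [0.5, 0.5]]   # rows: b, columns: c


def eliminate_chain(c):
    """P(C = c) by eliminating A first, then B."""
    # Eliminate A: m(b) = sum_a P(a) P(b | a)
    m_b = [sum(p_a[a] * p_b_given_a[a][b] for a in (0, 1)) for b in (0, 1)]
    # Eliminate B: P(c) = sum_b m(b) P(c | b)
    return sum(m_b[b] * p_c_given_b[b][c] for b in (0, 1))


def brute_force(c):
    """P(C = c) by summing the full joint over all states of A and B."""
    return sum(p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
               for a, b in product((0, 1), repeat=2))
```

Both functions return the same number; elimination just reorders the sums so that each intermediate table stays small. On a chain any order is cheap, but on pedigree networks the order chosen in stage 3 determines the size of the intermediate tables.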

36

Experiment A

• Same topology (57 people, no loops)

• Increasing number of loci (each one with 4-5 alleles)

• Run time is in seconds.

    File   No. of loci   Superlink   Fastlink   Vitesse
    A0     2             0.03        0.12       0.27
    A1     5             0.1         3.77       0.31
    A2     6             0.14        79.32      0.39
    A3     7             0.42        -          0.69
    A4     8             0.36        -          2.81
    A5     10            1.19        -          84.66
    A6     12            4.65        -          -
    A7     14            3.01        -          -
    A8     18            20.98       -          -
    A9     37            8510.15     -          -
    A10    38            10446.27    -          -
    A11    40            -           -          -

Callouts on the missing entries: over 100 hours; Out-of-memory.

Genehunter: pedigree size too big for Genehunter.

37

Experiment C

• Same topology (5 people, no loops)

• Increasing number of loci (each one with 3-6 alleles)

• Run time is in seconds.

Program columns on the slide: Superlink, Fastlink, Vitesse, Genehunter.

    File   No. of loci   Run times
    D0     100           0.16 (2 l.e.)   0.41 (99 l.e.)
    D1     110           0.2 (2 l.e.)    0.45 (109 l.e.)
    D2     120           0.21 (2 l.e.)   0.48 (119 l.e.)
    D3     130           0.22 (2 l.e.)   0.49 (129 l.e.)
    D4     140           0.24 (2 l.e.)   0.51 (139 l.e.)
    D5     150           0.25 (2 l.e.)   0.53 (149 l.e.)
    D6     160           0.27 (2 l.e.)   0.54 (159 l.e.)
    D7     170           0.3 (2 l.e.)    0.6 (169 l.e.)
    D8     180           0.3 (2 l.e.)    0.59 (179 l.e.)
    D9     190           0.32 (2 l.e.)   0.61 (189 l.e.)
    D10    200           0.34 (2 l.e.)   0.66 (199 l.e.)
    D11    210           0.37 (2 l.e.)   0.67 (209 l.e.)

Callouts on the remaining columns: Out-of-memory; Bus error.

38

The computational task at hand

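$$P(\text{data} \mid \theta) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_1} \prod_{i=1}^{n} P(x_i \mid pa_i)$$

Multidimensional multiplication/summation:

$$Y_{ij} = \sum_{k} \sum_{l,m,n} A_{ikl}\, B_{kjm}\, C_{lmn}$$

Example: matrix multiplication:

$$C_{ij} = \sum_{k} A_{ik} B_{kj}$$

$(A_{50 \times 50}\, B_{50 \times 1})\, C_{1 \times 50}$ versus $A_{50 \times 50}\, (B_{50 \times 1}\, C_{1 \times 50})$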
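The point of the matrix example is that the order of multiplication changes the cost. A small sketch counting scalar multiplications, assuming the shapes A: 50×50, B: 50×1, C: 1×50 (read off the flattened subscripts, so treat them as an assumption) and the usual rows × inner × columns cost of a dense product. The helper name is hypothetical.

```python
def matmul_cost(rows, inner, cols):
    """Scalar multiplications in a dense (rows x inner) by (inner x cols) product."""
    return rows * inner * cols


# (A B) C: first a 50x50 by 50x1 product, then a 50x1 by 1x50 product.
left_first = matmul_cost(50, 50, 1) + matmul_cost(50, 1, 50)

# A (B C): first a 50x1 by 1x50 product, then a 50x50 by 50x50 product.
right_first = matmul_cost(50, 1, 50) + matmul_cost(50, 50, 50)
```

Here `left_first` is 5,000 and `right_first` is 127,500, although both orders give the same matrix. Choosing an elimination order in a Bayesian network is the same kind of decision on a larger scale.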

39

Some options for improving efficiency

1. Performing approximate calculations of the likelihood.

2. Multiplying special probability matrices efficiently.

3. Grouping alleles together and removing inconsistent alleles.

4. Optimizing the elimination order of variables in a Bayesian network.

$$P(\text{data} \mid \theta) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_1} \prod_{i=1}^{n} P(x_i \mid pa_i)$$

40

Projects

 

Project No.   Project Subject

1   Performing approximate likelihood computations by using the Iterative Join-Graph Propagation method. (ps)

2   Performing haplotyping on the input data, i.e., inferring the most likely haplotypes for the individuals in the input pedigrees. (ps)

3   Performing approximate likelihood computations by using a heuristic which ignores extreme markers in the likelihood computation. (ps)
