
Upload: axelle

Post on 09-Jan-2016

Page 1: Computational Problems in Genetic Linkage Analysis Dan Geiger CS, Technion


Computational Problems in Genetic Linkage Analysis

Dan Geiger, CS, Technion

This talk is based mainly on work with Ma'ayan Fishelson. See bioinfo.cs.technion.ac.il/superlink/ for more details. Some slides are due to Gideon Greenspan.

Course homepage: http://www.cs.technion.ac.il/~fmaayan/cs236524/

Page 2

Requirements

• One homework assignment (10%).

• Mid-term progress report.

• Submission in pairs.

• Well-documented, tested, useful program.

• Self-initiative (by reading and thinking) will be rewarded.

• One or two excellent projects may be selected for continuation next semester under the special projects course, if desired.

Page 3

Outline

Gene Hunting: find genes responsible for a given disease.

Main idea: If a disease is statistically linked with a marker on a chromosome, then tentatively infer that a gene causing the disease is located near that marker.

Part I: Quick look at relevant genetics
Part II: Case study: Werner’s syndrome
Part III: Relevant mathematics
Part IV: Software description / algorithms

Page 4

Human Genome

Most human cells contain 46 chromosomes:

• 2 sex chromosomes (X, Y): XY in males, XX in females.

• 22 pairs of chromosomes named autosomes.

Page 5

Chromosome Logical Structure

Marker – genes, single nucleotide polymorphisms, tandem repeats, etc.

Locus – the location of a marker on the chromosome.

Allele – one variant form (or state) of a gene/marker at a particular locus.

Locus 1, possible alleles: A1, A2

Locus 2, possible alleles: B1, B2, B3

Page 6

Alleles

b - dominant allele. Namely, (b,b) and (b,w) are Black.

w - recessive allele. Namely, only (w,w) is White.

This is an example of an X-linked trait. For males, b alone is Black and w alone is White.

[Figure labels: genotype, phenotype]

Page 7

Genotypes versus Phenotypes

At each locus (except for sex chromosomes) there are 2 genes. These constitute the individual’s genotype at the locus.

The expression of a genotype is termed a phenotype. For example, hair color, weight, or the presence or absence of a disease.

Page 8

Recombination Phenomenon

A recombination between 2 genes has occurred if the haplotype of the individual contains 2 alleles that resided in different haplotypes in the individual's parent.

(Haplotype – the alleles at different loci that are received by an individual from one parent.)

[Figure labels: male or female; gametes (sex cells): egg or sperm]

Page 9

An example: the ABO locus.

The ABO locus determines detectable antigens on the surface of red blood cells.

The 3 major alleles (A, B, O) interact to determine the various ABO blood types.

O is recessive to A and B. Alleles A and B are codominant.

Phenotype   Genotype
A           A/A, A/O
B           B/B, B/O
AB          A/B
O           O/O

Note: genotypes are unordered.

Page 10

Example: ABO, AK1 on Chromosome 9

The male recombination fraction is 12/100 and the female one is 20/100. These numbers are expressed in centimorgans. One centimorgan corresponds to one expected recombination per 100 meioses.

One centimorgan corresponds to approximately 1M nucleotides (with large variance), depending on the location and sex.

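[Figure: pedigree of five individuals (1-5), each labeled with an ABO phenotype (A or O) and an AK1 genotype (A1/A1, A2/A2, or A1/A2). Haplotypes are drawn with the phase inferred, and one individual is marked as recombinant.]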

Page 11

Example for Finding Disease Genes

We use a marker with codominant alleles A1/A2.

We speculate a locus with alleles H (Healthy) / D (Diseased).

If the expected number of recombinants is low (close to zero), then the speculated locus and the marker are tentatively physically close.

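[Figure: pedigree of five individuals (1-5), each labeled with a phenotype (H or D) and a marker genotype (A1/A1, A2/A2, or A1/A2). Haplotypes are drawn with the phase inferred, and one individual is marked as recombinant.]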

Page 12

Recombination cannot be simply counted

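[Figure: a similar pedigree of five individuals (1-5), each labeled with a phenotype (H or D) and a marker genotype. Here the phase is unknown ("Phase ???"), so an individual can only be marked as a possible recombinant.]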

One can compute the probability that a recombination occurred and use this number as if it were the real count.

Page 13

Comments about the example

Often:

• Pedigrees are larger and more complex.

• Not every individual is typed.

• Recombinants cannot always be determined.

• There are more markers and they are polymorphic (have more than two alleles).

Page 14

Genetic Linkage Analysis

The method just described is called genetic linkage analysis. It uses the phenomenon of recombination in families of affected individuals to locate the vicinity of a disease gene.

Recombination fraction is measured in centimorgans and can differ between males and females.

Next step: Once a suspected area is found, further studies check the 20-50 candidate genes in that area.

$$0\ (\text{Linkage}) \;\le\; P(\text{Recombination}) \;\le\; 0.5\ (\text{No Linkage})$$

Page 15

Part II: Case study

Werner’s Syndrome

A successful application of genetic linkage analysis

Page 16

The Disease

• First references in the 1960s

• Causes premature ageing

• Autosomal recessive

• Linkage studies from 1992

• WRN gene cloned in 1996

• Subsequent discovery of mechanisms involved in wild-type and mutant proteins

Page 17

One Pedigree’s Data (out of 14)

1 115 0 0 2 1 0 1 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 0 1 0
1 126 0 0 1 1 0 0 1 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 1 0
1 111 0 0 1 1 0 1 0 1 2 0 2 0 3 1 2 1 1 1 3 1 2 1 0 0 1 0 1 0 0 0
1 122 111 115 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 125 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 121 111 115 1 1 2 0 1 2 1 2 3 3 1 2 1 1 1 3 2 2 1 0 0 1 0 1 0 1 0 1
1 135 126 122 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 131 121 125 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 141 131 135 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Each line is one individual. Columns, left to right:

• Pedigree number

• Individual ID

• Father’s ID, then mother’s ID (0 for both in founders; in the data the first parent listed is always male, e.g. 111, and the second female, e.g. 115)

• Sex: 1=male, 2=female

• Status: 1=healthy, 2=diseased

• Marker alleles, two per marker: 0 = unknown marker allele, any other number = known marker allele
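To make the record layout concrete, here is a minimal Python sketch that splits one such line into named fields. The `Person` class and `parse_record` function are hypothetical names introduced only for illustration; the column order (father before mother) is the one read off the data above, and the sample line is the record of individual 141.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Person:
    pedigree: int
    ident: int
    father: int   # 0 means the parent is not in the pedigree (founder)
    mother: int
    sex: int      # 1 = male, 2 = female
    status: int   # 1 = healthy, 2 = diseased
    markers: List[Tuple[int, int]]  # one allele pair per marker, 0 = unknown


def parse_record(line: str) -> Person:
    """Parse one whitespace-separated pedigree record."""
    fields = [int(tok) for tok in line.split()]
    pedigree, ident, father, mother, sex, status = fields[:6]
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("marker alleles must come in pairs")
    # Pair up consecutive alleles: (a1, a2), (a3, a4), ...
    markers = list(zip(alleles[0::2], alleles[1::2]))
    return Person(pedigree, ident, father, mother, sex, status, markers)


# Record of individual 141: first marker typed 1/2, remaining twelve typed 1/1.
record = "1 141 131 135 2 2 " + " ".join(["1 2"] + ["1 1"] * 12)
person = parse_record(record)
```

In this record the affected individual (status 2) is typed at all 13 markers, whereas several ancestors have only zeros.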

Page 18

Marker File Input

14 0 0 5
0 0.0 0.0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 2
0.995 0.005
1
0 0 1
3 6 # D8S133
0.0200 0.3700 0.4050 0.0050 0.0500 0.0750

...[other 12 markers skipped]...

0 0
10 7.6 7.4 0.9 6.7 1.6 2.5 2.8 2.1 2.8 11.4 1 43.8
1 0.1 0.45

Annotations:

• "14" on the first line: 1 disease locus + 13 markers.

• "0 0 1": recessive disease, requires 2 mutant genes.

• "3 6 # D8S133": the first marker has 6 alleles; D8S133 is the first marker’s name.

• "0.0200 ... 0.0750": the first marker’s founder allele frequencies.

• "10 7.6 ... 43.8": recombination distances between markers.

Page 19

Genehunter Output

position   LOD_score     information
 0.00      -1.254417     0.224384
 1.52       2.836135     0.226379
...[other data skipped]...
18.58      13.688599     0.384088
19.92      14.238474     0.401992
21.26      14.718037     0.426818
22.60      15.159389     0.462284
22.92      15.056713     0.462510
23.24      14.928614     0.463208
23.56      14.754848     0.464387
...[other data skipped]...
81.84       1.939215     0.059748
90.60     -11.930449     0.087869

• position: putative distance of the disease gene from the first marker, in recombination units.

• LOD_score: log likelihood of placing the disease gene at that distance, relative to it being unlinked.

• The maximum log likelihood score is 15.159389, at the most ‘likely’ position, 22.60.
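Reading off the most likely position from such output is a simple scan for the largest LOD score. The sketch below is illustrative only: it hard-codes the rows shown on this slide (the skipped rows are not available) and uses hypothetical helper names; it is not part of Genehunter.

```python
# Rows copied from the slide: position, LOD score, information.
GENEHUNTER_ROWS = """\
0.00 -1.254417 0.224384
1.52 2.836135 0.226379
18.58 13.688599 0.384088
19.92 14.238474 0.401992
21.26 14.718037 0.426818
22.60 15.159389 0.462284
22.92 15.056713 0.462510
23.24 14.928614 0.463208
23.56 14.754848 0.464387
81.84 1.939215 0.059748
90.60 -11.930449 0.087869
"""


def parse_rows(text):
    """Turn whitespace-separated lines into (position, lod, information) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        position, lod, information = (float(tok) for tok in line.split())
        rows.append((position, lod, information))
    return rows


rows = parse_rows(GENEHUNTER_ROWS)
# The most likely position is the one with the highest LOD score.
best_position, best_lod, _ = max(rows, key=lambda row: row[1])
print(best_position, best_lod)
```

Among the rows shown, the scan selects position 22.60 with LOD score 15.159389, matching the slide's annotation.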

Page 20

Final Location

[Figure: map showing marker D8S131, marker D8S259, the location of marker D8S339, and the final location of the WRN gene.]

Error in location by genetic linkage of about 1.25M base pairs.

Page 21

Part III: Relevant Mathematics

Page 22

The Maximum Likelihood Approach

The probability of pedigree data Pr(data | θ) is a function of the known and unknown recombination fractions denoted collectively by θ.

How can we construct this likelihood function?

The maximum likelihood approach is to seek the value of θ which maximizes the likelihood function Pr(data | θ). This value is called the ML estimate.

The main computational difficulty is to compute Pr(data | θ) for a specific value of θ.
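As a toy illustration of the ML idea, consider the simplest possible case, which is an assumption made here and not the pedigree likelihood of the talk: recombinants can be counted directly (phase known), so with k recombinants in n meioses the likelihood is θ^k (1-θ)^(n-k). The sketch below maximizes it over a grid of θ values in (0, 0.5] and reports the LOD score relative to θ = 0.5 (no linkage). The function names and the counts k = 1, n = 10 are hypothetical.

```python
import math


def log10_likelihood(theta, k, n):
    """log10 of theta^k * (1 - theta)^(n - k), for 0 < theta <= 0.5."""
    return k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)


def ml_estimate(k, n, steps=500):
    """Grid search for the theta in (0, 0.5] that maximizes the likelihood."""
    grid = [0.5 * i / steps for i in range(1, steps + 1)]  # 0.001, ..., 0.5
    best = max(grid, key=lambda theta: log10_likelihood(theta, k, n))
    # LOD score: log10 likelihood ratio against the unlinked value theta = 0.5.
    lod = log10_likelihood(best, k, n) - log10_likelihood(0.5, k, n)
    return best, lod


theta_hat, lod = ml_estimate(k=1, n=10)
print(theta_hat, round(lod, 4))
```

In this toy case the maximizer is simply k/n; in real pedigrees the phase and many genotypes are unobserved, which is why evaluating Pr(data | θ) is the hard part.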

Page 23

Constructing the Likelihood function

First, we need to determine the variables that describe the problem. There are several possible choices. Some variables we can observe and some we cannot.

Lijm (Lijf) = Maternal (paternal) allele at locus i of person j. The values of these variables are the possible alleles li at locus i.

Xij = Unordered allele pair at locus i of person j. The values are pairs of ith-locus alleles (li, l’i).

Page 24

Constructing the Likelihood function

[Figure: network fragment with nodes L11m, L11f, X11, S13m and L13m. S13m is the selector of the maternal allele at locus 1 of person 3; L13m is the maternal allele at locus 1 of person 3 (offspring).]

Selector variables Sijm are 0 or 1 depending on which of the mother's two alleles is transmitted to person j at locus i.

P(s13m) = ½

P(l13m | l11m, l11f, S13m=0) = 1 if l13m = l11m

P(l13m | l11m, l11f, S13m=1) = 1 if l13m = l11f

P(l13m | l11m, l11f, s13m) = 0 otherwise
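The selector table above is deterministic once the selector is known, and summing the selector out with P(s13m) = ½ gives the familiar Mendelian transmission probabilities. A minimal sketch, with hypothetical function names and allele labels chosen for illustration:

```python
def p_child_maternal(l13m, l11m, l11f, s13m):
    """P(l13m | l11m, l11f, s13m): the selector picks which of the
    mother's alleles (her maternal one if 0, her paternal one if 1)
    is copied to the child."""
    transmitted = l11m if s13m == 0 else l11f
    return 1.0 if l13m == transmitted else 0.0


def p_transmit(allele, l11m, l11f):
    """P(l13m = allele | l11m, l11f), summing out the selector with P(s) = 1/2."""
    return sum(0.5 * p_child_maternal(allele, l11m, l11f, s) for s in (0, 1))


# Heterozygous mother A1/A2: each allele is transmitted with probability 1/2.
print(p_transmit("A1", "A1", "A2"))
# Homozygous mother A1/A1: A1 is transmitted with certainty.
print(p_transmit("A1", "A1", "A1"))
```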

Page 25

Probabilistic model for two loci

[Figure: two copies of the single-locus network, one per locus.
Model for locus 1: L11m, L11f, X11; L12m, L12f, X12; S13m, S13f; L13m, L13f, X13.
Model for locus 2: L21m, L21f, X21; L22m, L22f, X22; S23m, S23f; L23m, L23f, X23.]

Page 26

Probabilistic model for Recombination

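[Figure: the networks for locus 1 (L11m, L11f, X11, L12m, L12f, X12, S13m, S13f, L13m, L13f, X13) and locus 2 (L21m, L21f, X21, L22m, L22f, X22, S23m, S23f, L23m, L23f, X23), now coupled through the selector variables; panel labels: females, males.]

$$P(s_{23t} \mid s_{13t}, \theta_2) = \begin{pmatrix} 1-\theta_2 & \theta_2 \\ \theta_2 & 1-\theta_2 \end{pmatrix}, \quad \text{where } t \in \{m, f\}$$

θ2 is called the recombination fraction between loci 2 & 1.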

Page 27

The Data

The data consists of an assignment to a subset of the variables {Xij}. In other words some (or all) persons are genotyped at some (or all) loci.

Page 28

Constructing the likelihood function

P(l11m, l11f, x11, l12m, l12f, x12, l13m, l13f, x13, l21m, l21f, x21, l22m, l22f, x22, l23m, l23f, x23, s13m, s13f, s23m, s23f | θ2)
= Product over all local probability tables
= P(l11m) P(l11f) P(x11 | l11m, l11f) … P(s13m) P(s13f) P(s23m | s13m, θ2) P(s23f | s13f, θ2)

Probability of data (sum over all states of all hidden variables):

Prob(data | θ2) = P(x11, x12, x13, x21, x22, x23) = Σ over l11m, l11f, …, s23f of [ P(l11m) P(l11f) P(x11 | l11m, l11f) … P(s13m) P(s13f) P(s23m | s13m, θ2) P(s23f | s13f, θ2) ]

The result is a function of the recombination fraction. The ML estimate is the θ2 value that maximizes this function.

Page 29

Modeling Phenotypes

[Figure: network fragment with nodes L11m, L11f, X11, S13m, L13m, and the phenotype node Y11.]

Phenotype variables Yij are 0 or 1 depending on whether a phenotypic trait associated with locus i of person j is observed, e.g., sick versus healthy. For example, a model of a perfect recessive disease yields the penetrance probabilities:

P(y11 = sick | X11 = (a,a)) = 1

P(y11 = sick | X11 = (A,a)) = 0

P(y11 = sick | X11 = (A,A)) = 0
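The penetrance probabilities of the perfect recessive model can be written as a small lookup table keyed by the unordered genotype. This is only a sketch of the table on this slide; the names `PENETRANCE`, `p_sick` and `p_phenotype` are hypothetical.

```python
# P(sick | genotype) for a perfect recessive disease; "a" is the mutant allele.
# Keys are sorted allele pairs, so ("A", "a") also covers the genotype (a, A).
PENETRANCE = {
    ("A", "A"): 0.0,
    ("A", "a"): 0.0,
    ("a", "a"): 1.0,
}


def p_sick(genotype):
    """Penetrance of an unordered genotype such as ("a", "A")."""
    key = tuple(sorted(genotype))  # uppercase sorts before lowercase
    return PENETRANCE[key]


def p_phenotype(y, genotype):
    """P(Y = y | genotype) for y in {"sick", "healthy"}."""
    p = p_sick(genotype)
    return p if y == "sick" else 1.0 - p


print(p_sick(("a", "a")), p_phenotype("healthy", ("a", "A")))
```

Replacing the 0 and 1 entries by intermediate values would model incomplete penetrance with the same structure.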

Page 30

Standard usage of linkage

There are usually 5-15 markers. 20-30% of the persons in large pedigrees are genotyped (namely, their xij is measured). For each genotyped person about 90% of the loci are measured correctly. The recombination fraction between every two loci is known from previous studies (available genetic maps).

The user adds a locus called the “disease locus” and places it between two markers i and i+1. The recombination fraction θ’ between the disease locus and marker i and θ” between the disease locus and marker i+1 are the unknown parameters being estimated using the likelihood function.

This computation is done for every gap between the given markers on the map. The MLE hints at the whereabouts of a single gene causing the disease (if a single one exists).

Page 31

Part IV: Software and Algorithms

• Fastlink v4.1 (each person’s genotype is one variable)

• Vitesse v1, v2 (only loopless Bayesian networks allowed)

• GeneHunter/Allegro (exponential in number of persons)

• Many more specific packages (e.g., affected siblings)

• Superlink: why is it better?

For a list, see http://linkage.rockefeller.edu/soft/list.html

Page 32

SUPERLINK

Stage 1: each pedigree is translated into a Bayesian network. 

Stage 2: value elimination is performed on each pedigree (i.e., some of the impossible values of the variables of the network are eliminated).

Stage 3: an elimination order for the variables is determined, according to some heuristic.

Stage 4: the likelihood of the pedigrees given the θ values is calculated using variable elimination according to the elimination order determined in stage 3.
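Stage 4 can be illustrated with a generic sum-product variable elimination routine. The sketch below is not Superlink's implementation; it is a minimal textbook version under simple assumptions: factors are stored as dense dictionaries, the elimination order is given by hand, and the example network has just two linked selector variables with hypothetical evidence weights standing in for the genotype data.

```python
from itertools import product

# A factor is a pair (variables, table): `variables` is a tuple of names and
# `table` maps each assignment tuple (one value per variable) to a number.


def multiply(f, g, domains):
    """Pointwise product of two factors over the union of their variables."""
    fv, ft = f
    gv, gt = g
    out_vars = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for values in product(*(domains[v] for v in out_vars)):
        env = dict(zip(out_vars, values))
        table[values] = (ft[tuple(env[v] for v in fv)]
                         * gt[tuple(env[v] for v in gv)])
    return out_vars, table


def sum_out(f, var):
    """Sum a variable out of a factor."""
    fv, ft = f
    idx = fv.index(var)
    table = {}
    for values, p in ft.items():
        key = values[:idx] + values[idx + 1:]
        table[key] = table.get(key, 0.0) + p
    return fv[:idx] + fv[idx + 1:], table


def eliminate(factors, order, domains):
    """Sum all variables out in the given order; `order` must list every variable."""
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]
        if not touching:
            continue
        factors = [f for f in factors if var not in f[0]]
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f, domains)
        factors.append(sum_out(prod, var))
    result = 1.0
    for _, table in factors:  # only constant factors remain
        result *= table[()]
    return result


theta = 0.1
domains = {"s1": (0, 1), "s2": (0, 1)}
prior = (("s1",), {(0,): 0.5, (1,): 0.5})
# P(s2 | s1, theta): the selectors differ with probability theta.
link = (("s1", "s2"), {(a, b): (theta if a != b else 1 - theta)
                       for a in (0, 1) for b in (0, 1)})
# Hypothetical evidence weights, standing in for the observed genotypes.
ev1 = (("s1",), {(0,): 0.9, (1,): 0.1})
ev2 = (("s2",), {(0,): 0.2, (1,): 0.8})

likelihood = eliminate([prior, link, ev1, ev2], ["s1", "s2"], domains)
print(likelihood)
```

Any elimination order gives the same number; the order only changes the size of the intermediate factors, which is what the heuristic of stage 3 tries to keep small.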

Page 33

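Experiment A

• Same topology (57 people, no loops)

• Increasing number of loci (each one with 4-5 alleles)

• Run time is in seconds.

File   No. of loci   Superlink   Fastlink   Vitesse   Genehunter
A0     2             0.03        0.12       0.27      -
A1     5             0.1         3.77       0.31      -
A2     6             0.14        79.32      0.39      -
A3     7             0.42        -          0.69      -
A4     8             0.36        -          2.81      -
A5     10            1.19        -          84.66     -
A6     12            4.65        -          -         -
A7     14            3.01        -          -         -
A8     18            20.98       -          -         -
A9     37            8510.15     -          -         -
A10    38            10446.27    -          -         -
A11    40            -           -          -         -

Cells without a run time are marked "-". The slide annotates them with "over 100 hours" and "Out-of-memory", and the Genehunter column with "Pedigree size too big for Genehunter". The column placement of the second value in rows A3-A5 is inferred from the flattened slide text.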

Page 34

Experiment C

• Same topology (5 people, no loops)

• Increasing number of loci (each one with 3-6 alleles)

• Run time is in seconds.

The table columns on the slide are: Files, No. of Loci, and run times for Superlink, Fastlink, Vitesse and Genehunter. Each row lists two run times, given here in column order; the slide also carries the annotations "Out-of-memory" and "Bus error".

File   No. of loci   First run time    Second run time
D0     100           0.16 (2 l.e.)     0.41 (99 l.e.)
D1     110           0.2 (2 l.e.)      0.45 (109 l.e.)
D2     120           0.21 (2 l.e.)     0.48 (119 l.e.)
D3     130           0.22 (2 l.e.)     0.49 (129 l.e.)
D4     140           0.24 (2 l.e.)     0.51 (139 l.e.)
D5     150           0.25 (2 l.e.)     0.53 (149 l.e.)
D6     160           0.27 (2 l.e.)     0.54 (159 l.e.)
D7     170           0.3 (2 l.e.)      0.6 (169 l.e.)
D8     180           0.3 (2 l.e.)      0.59 (179 l.e.)
D9     190           0.32 (2 l.e.)     0.61 (189 l.e.)
D10    200           0.34 (2 l.e.)     0.66 (199 l.e.)
D11    210           0.37 (2 l.e.)     0.67 (209 l.e.)

Page 35

The computational task at hand

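$$P(\text{data} \mid \theta) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_1} \prod_{i=1}^{n} P(x_i \mid pa_i)$$

Multidimensional multiplication/summation:

$$Y_{ij} = \sum_{k} \sum_{l,m,n} A_{ikl}\, B_{kjm}\, C_{lmn}$$

Example: Matrix multiplication:

$$C_{ij} = \sum_{k} A_{ik} B_{kj}$$

$$(A_{50 \times 50}\, B_{50 \times 1})\, C_{1 \times 50} \quad \text{versus} \quad A_{50 \times 50}\, (B_{50 \times 1}\, C_{1 \times 50})$$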
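The point of the matrix example is that the order of multiplication changes the amount of work, just as the elimination order does for the likelihood sum. A short sketch, assuming the textbook cost model in which multiplying a p×q matrix by a q×r matrix takes p·q·r scalar multiplications (the helper name `mult_cost` is hypothetical):

```python
def mult_cost(a, b):
    """Cost and result shape of multiplying an a[0] x a[1] matrix
    by a b[0] x b[1] matrix, counting scalar multiplications."""
    if a[1] != b[0]:
        raise ValueError("inner dimensions must agree")
    return a[0] * a[1] * b[1], (a[0], b[1])


A, B, C = (50, 50), (50, 1), (1, 50)

# (A B) C: first a 50x1 intermediate, then a 50x50 result.
cost_ab, shape_ab = mult_cost(A, B)
cost_ab_c, _ = mult_cost(shape_ab, C)
left_first = cost_ab + cost_ab_c

# A (B C): first a 50x50 intermediate, then a full 50x50 by 50x50 product.
cost_bc, shape_bc = mult_cost(B, C)
cost_a_bc, _ = mult_cost(A, shape_bc)
right_first = cost_bc + cost_a_bc

print(left_first, right_first)  # 5000 127500
```

Both orders give the same matrix, but the first needs 5,000 scalar multiplications and the second 127,500.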

Page 36

Some options for improving efficiency

1. Performing approximate calculations of the likelihood.

2. Multiplying special probability matrices efficiently.

3. Grouping alleles together and removing inconsistent alleles.

4. Optimizing the elimination order of variables in a Bayesian network.

$$P(\text{data} \mid \theta) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_1} \prod_{i=1}^{n} P(x_i \mid pa_i)$$

Page 37

Projects

 

Project No.   Project Subject

1   Performing approximate likelihood computations by using the method Iterative Join-Graph Propagation.

2   Performing haplotyping on the input data, i.e., inferring the most likely haplotypes for the individuals in the input pedigrees.

3   Performing approximate likelihood computations by using a heuristic which ignores extreme markers in the likelihood computation.