
1

A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Zhihong Ding, Vladimir Filkov, Dan Gusfield

Department of Computer Science

University of California, Davis

RECOMB 2005

2

Haplotypes to Genotypes

Each individual has two "copies" of each chromosome.

At each site, each chromosome has one of two states, denoted by 0 and 1.

From haplotypes to genotypes: for each site of an individual, if both haplotypes have state 0, then the genotype has state 0. The same rule applies for state 1. If the two haplotypes have states 0 and 1, or 1 and 0, then the state of the genotype is 2.

3

Haplotypes to Genotypes

Sites:       1 2 3 4 5 6 7 8 9

Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
Genotype:    2 1 2 1 0 0 1 2 0

Each individual has two haplotypes. Merging the haplotypes gives the genotype for the individual.
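The merge rule can be sketched in a few lines of Python. The function name `merge_haplotypes` is an illustrative assumption and does not come from the talk; the data is the nine-site example from this slide.

```python
def merge_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    # Equal states carry over unchanged; differing states become 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


hap1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
hap2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = merge_haplotypes(hap1, hap2)
print(genotype)
```

The printed list can be compared against the genotype row of the slide.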

4

Genotypes to Haplotypes

Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
Genotype:    2 1 2 1 0 0 1 2 0

For each site, if the genotype has state 0 or 1, then the two haplotypes must both have that state (0, 0 or 1, 1, respectively). If the genotype has state 2, the two haplotypes can have states 0, 1 or 1, 0.

5

Haplotype Inference Problem

For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is harder and more expensive to collect than genotype data.

Haplotype Inference Problem: Given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes.

The NIH leads the HapMap project to find common haplotypes in the human population.

6

Haplotype Inference Problem

If the genotype has state 2 at k sites, there are $2^{k-1}$ possible explaining haplotype pairs.

How can we determine which haplotype pair is the original one that generated the genotype?

We need a model of haplotype evolution to help solve the haplotype inference problem.
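The count of explaining pairs can be checked by brute force. The sketch below is an illustration only: the function name `explaining_pairs` is an assumption, and the count $2^{k-1}$ applies when k is at least 1 (a genotype with no 2's has exactly one explaining pair).

```python
from itertools import product


def explaining_pairs(genotype):
    """Return every unordered haplotype pair that merges to `genotype`."""
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    if not het_sites:
        hap = list(genotype)
        return [(hap, hap)]
    pairs = []
    # Fix the first heterozygous site to 0 in the first haplotype so that
    # (h1, h2) and (h2, h1) are not both generated.
    for bits in product((0, 1), repeat=len(het_sites) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het_sites, (0,) + bits):
            h1[site], h2[site] = bit, 1 - bit
        pairs.append((h1, h2))
    return pairs


pairs = explaining_pairs([2, 1, 2, 1, 0, 0, 1, 2, 0])
print(len(pairs))  # three sites have state 2, so 2**(3 - 1) pairs
```

Enumeration like this is exponential in k, which is why a model of haplotype evolution is needed to pick among the candidates.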

7

The Perfect Phylogeny Model of Haplotype Evolution

[Figure: a perfect phylogeny over sites 1 to 5. The root is the ancestral haplotype 00000, site mutations 1 to 5 label the edges, and the extant haplotypes are at the leaves. Node labels shown: 10100, 10000, 01011, 00010, 01010.]

8

Assumptions of Perfect Phylogeny Model

No recombination, only mutation.

Infinite-site assumption: one mutation per site.
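For 0/1 haplotypes with an all-zero ancestral haplotype, a standard way to check whether a set of haplotypes is consistent with these assumptions is a pairwise column test: a perfect phylogeny exists exactly when no two sites show all three of the combinations 0-1, 1-0 and 1-1. The talk does not spell this test out, so the sketch below is an illustration with an assumed function name. It uses the haplotype strings from the tree figure on the previous slide.

```python
from itertools import combinations


def admits_perfect_phylogeny(haplotypes):
    """Pairwise test for 0/1 haplotypes, assuming the root is all zeros."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        # Sites i and j conflict if 0-1, 1-0 and 1-1 all occur together.
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True


tree_like = ["10100", "10000", "01011", "00010", "01010"]
rows = [[int(c) for c in h] for h in tree_like]
print(admits_perfect_phylogeny(rows))
print(admits_perfect_phylogeny([[1, 0], [0, 1], [1, 1]]))
```

The second call uses three haplotypes that cannot be explained with one mutation per site and no recombination, so the test rejects them.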

9

The Perfect Phylogeny Haplotyping (PPH) Problem

Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotype matrix

Site  1 2
a     2 2
b     0 2
c     1 0

Haplotype matrix

Site  1 2
a     1 0
a     0 1
b     0 0
b     0 1
c     1 0
c     1 0

Perfect phylogeny

[Figure: a tree whose root is 00, with one edge labeled 1 and one edge labeled 2. Leaves c, c and a carry haplotype 10; leaves a and b carry haplotype 01; leaf b carries haplotype 00.]
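On an instance this small, the problem statement can be checked by exhaustive search. The sketch below is not the linear-time algorithm of the talk: it tries every combination of explaining pairs and keeps those that pass a pairwise column test, under the assumption that the ancestral haplotype is all zeros (as in the slide's tree). All function names are illustrative.

```python
from itertools import combinations, product


def explaining_pairs(genotype):
    """All unordered 0/1 haplotype pairs that merge to a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        return [(tuple(genotype), tuple(genotype))]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, (0,) + bits):
            h1[site], h2[site] = bit, 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def fits_perfect_phylogeny(haplotypes):
    """Pairwise column test, assuming an all-zero ancestral haplotype."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True


def brute_force_pph(genotypes):
    """Try every combination of explaining pairs; exponential time."""
    solutions = []
    for choice in product(*(explaining_pairs(g) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        if fits_perfect_phylogeny(haplotypes):
            solutions.append(haplotypes)
    return solutions


genotype_matrix = [[2, 2], [0, 2], [1, 0]]  # rows a, b, c from the slide
solutions = brute_force_pph(genotype_matrix)
for haplotypes in solutions:
    print(haplotypes)
```

Each printed solution lists the haplotypes in row order, two per genotype, so it can be compared with the haplotype matrix above.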

10

Prior Work

Several existing algorithms solve the PPH problem, but none of them runs in linear time.

Our contribution: a linear-time algorithm. On large data sets, our implementation is about 250 times faster than the fastest previous algorithm.

11

A P-Class of PPH Solutions

P-Class: maximum common subgraph in all PPH solutions.

Each P-Class consists of two subtrees.

[Figure: a genotype matrix with rows a, b, c, d over sites 1 to 5, next to one PPH solution drawn as a tree from the root with edges labeled 1 to 5 and leaves labeled a,d; a,c; b,d; b,c.]

12

P-Class Property of PPH Solutions

All PPH solutions can be obtained by choosing how to flip each P-Class.

[Figure: the first PPH solution and a second PPH solution, each drawn as a tree from the root with edges labeled 1 to 5 and leaves labeled a,d; a,c; b,c; b,d. The switching points are marked in both trees.]

13

The Key Theorem

Every PPH solution can be obtained by choosing a flip for each P-Class.

Conversely, after fixing one P-Class, every distinct choice of flips of the other P-Classes leads to a distinct PPH solution.

If there are k P-Classes, there are $2^{k-1}$ distinct PPH solutions.

14

Shadow Tree

Contains classes.

Each class in the shadow tree is a subgraph of a P-Class.

Merging classes results in larger classes; classes are never split.

Contains tree edges and shadow edges.
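The merge-only behaviour of classes resembles a disjoint-set (union-find) structure. The sketch below is an analogy only, not the data structure from the talk: it models just the grouping of columns into classes, ignores tree edges, shadow edges and flips, and every name in it (`ClassForest`, `find`, `merge`) is an assumption made for illustration.

```python
class ClassForest:
    """Disjoint-set sketch: classes can be merged but never split."""

    def __init__(self, n_columns):
        # Each column starts in its own class (hypothetical starting state).
        self.parent = list(range(n_columns))

    def find(self, column):
        # Walk to the class representative, halving the path on the way.
        while self.parent[column] != column:
            self.parent[column] = self.parent[self.parent[column]]
            column = self.parent[column]
        return column

    def merge(self, a, b):
        # Joining two classes only ever makes a larger class.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


classes = ClassForest(5)
classes.merge(0, 1)
classes.merge(3, 4)
classes.merge(1, 3)
print(classes.find(4) == classes.find(0))
print(classes.find(2) == classes.find(0))
```

Because there is no split operation, the number of classes can only decrease as rows are processed.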

15

The Algorithm

Process the genotype matrix one row at a time, starting at the first row, and modify the shadow tree.

The genotype matrix contains only entries of value 0 and 2.

16

Overview of the Algorithm for One Row

Procedure FirstPath

Procedure SecondPath

Procedure FixTree

Procedure NewEntries

17

OldEntryList

OldEntryList: the column indices that have an entry of value 2 in this row and also have an entry of value 2 in some previous row.

OldEntryList for row 3: 1, 2, 3, 5

[Figure: the genotype matrix with row 3 highlighted.]
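The definition of OldEntryList can be sketched with one pass over the rows and one flag per column. The function name `old_entry_lists` and the matrix below are hypothetical: the matrix was chosen for illustration so that row 3 yields 1, 2, 3, 5, and it is not claimed to be the matrix shown on the slide.

```python
def old_entry_lists(matrix):
    """For each row, list the 1-based columns whose 2 was seen in an earlier row."""
    seen_two = [False] * len(matrix[0])
    result = []
    for row in matrix:
        result.append([j + 1 for j, v in enumerate(row) if v == 2 and seen_two[j]])
        # Only now record this row's 2's, so a row never counts itself.
        for j, v in enumerate(row):
            if v == 2:
                seen_two[j] = True
    return result


# Hypothetical 0/2 matrix chosen for illustration; not the matrix on the slide.
example = [
    [2, 2, 0, 2, 0],
    [0, 2, 2, 0, 2],
    [2, 2, 2, 0, 2],
    [0, 0, 2, 2, 0],
]
print(old_entry_lists(example)[2])
```

Each row costs time proportional to the number of columns, which is in keeping with the per-row budget of the overall algorithm.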

18

Procedures FirstPath and SecondPath

FirstPath: Construct a first path towards the root of the shadow tree which passes through tree edges of as many columns in OldEntryList as possible.

SecondPath: Construct a second path towards the root of the shadow tree which passes through tree edges of columns in OldEntryList and not on the first path.

19

Shadow Tree After Processing the First Two Rows

OldEntryList for row 3: 1, 2, 3, 5

[Figure: the shadow tree after processing the first two rows, drawn from the root with tree edges and shadow edges for columns 1 to 5, next to the genotype matrix with row 3 highlighted.]

20

Algorithm – FirstPath

OldEntryList: 1, 2, 3, 5

CheckList: 3, 2

Edges 4 and 5 cannot be on the same path to the root in any PPH solution.

[Figure: the shadow tree, drawn from the root with tree edges and shadow edges for columns 1 to 5, with the first path marked.]

21

Algorithm – SecondPath

OldEntryList: 1, 2, 3, 5

CheckList: 3, 2

[Figure: the shadow tree, drawn from the root with tree edges and shadow edges for columns 1 to 5, with the second path marked.]

22

Shadow Tree to PPH Solutions

[Figure: the final shadow tree, drawn from the root with tree edges and shadow edges for columns 1 to 5, shown with the genotype matrix (rows a, b, c, d; sites 1 to 5) and a PPH solution with edges labeled 1 to 5.]

23

Shadow Tree to PPH Solutions

[Figure: the final shadow tree and the second PPH solution, a tree with edges labeled 1 to 5 and leaves labeled a,d; b,c; b,d; a,c.]

24

Implementation – Leaf Count

Leaf count of column i (L[i]): the number of 2's plus twice the number of 1's in column i.

L[i] is the number of leaves below mutation i, in every perfect phylogeny for the genotype matrix.

Along any path to the root in any PPH solution, the successive edges are labeled by columns with strictly increasing leaf counts.

             1  2  3  4
a            1  1  0  0
b            0  2  2  0
c            2  0  2  0
d            2  0  0  2
Leaf count:  4  3  2  1
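The leaf count is a direct column sum and can be reproduced for the example matrix. This is a minimal sketch; the function name `leaf_counts` and the sorting step are illustrative assumptions, not code from the authors' program.

```python
def leaf_counts(matrix):
    """L[i] = (number of 2's) + 2 * (number of 1's) in column i."""
    counts = []
    for column in zip(*matrix):
        counts.append(column.count(2) + 2 * column.count(1))
    return counts


matrix = [
    [1, 1, 0, 0],  # a
    [0, 2, 2, 0],  # b
    [2, 0, 2, 0],  # c
    [2, 0, 0, 2],  # d
]
counts = leaf_counts(matrix)
# Columns ordered from largest to smallest leaf count (1-based indices).
order = sorted(range(1, len(counts) + 1), key=lambda i: -counts[i - 1])
print(counts, order)
```

Ordering columns by leaf count reflects the path property above: on any path to the root, a column with a larger leaf count lies closer to the root than one with a smaller leaf count.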

25

Time Complexity

Constant number of simple operations on each edge per row.

Each traversal in the shadow tree goes through O(m) edges.

The algorithm does a constant number of traversals in the shadow tree for each row.

Total time: O(nm), where n and m are the number of rows and columns in the genotype matrix.

26

Results

Average Running Times (seconds)

Sites (m)  Individuals (n)  Datasets  DPPH O(nm^2)  Our Alg. O(nm)
300        150              30        1.07          0.05
500        250              30        5.72          0.13
1000       500              30        45.85         0.48
2000       1000             10        467.18        1.89
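The speedup implied by each row is the ratio of the two running times. The short calculation below uses only the numbers from the table; the variable names are illustrative.

```python
# (sites, individuals, DPPH seconds, linear-time seconds) from the table.
results = [
    (300, 150, 1.07, 0.05),
    (500, 250, 5.72, 0.13),
    (1000, 500, 45.85, 0.48),
    (2000, 1000, 467.18, 1.89),
]

speedups = [round(dpph / ours, 1) for _, _, dpph, ours in results]
for (sites, individuals, _, _), ratio in zip(results, speedups):
    print(f"m={sites}, n={individuals}: about {ratio}x faster")
```

The ratio grows with the problem size, and the largest row is in line with the "about 250 times faster" figure quoted earlier in the talk.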

27

Thank you!

Paper and program can be downloaded at:

http://wwwcsif.cs.ucdavis.edu/~gusfield/lpph/
