1
A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem
Zhihong Ding, Vladimir Filkov, Dan Gusfield
Department of Computer Science
University of California, Davis
RECOMB 2005
2
Haplotypes to Genotypes
Each individual has two “copies” of each chromosome.
At each site, each chromosome has one of two states, denoted by 0 and 1.
From haplotypes to genotypes: for each site of an individual, if both haplotypes have state 0, then the genotype has state 0. The same rule applies to state 1. If the two haplotypes have states 0 and 1, or 1 and 0, then the genotype has state 2.
3
Haplotypes to Genotypes
Sites: 1 2 3 4 5 6 7 8 9
Two haplotypes per individual:
0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0
Merge the haplotypes
Genotype for the individual:
2 1 2 1 0 0 1 2 0
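The merge rule can be sketched in a few lines of Python. This is only an illustration of the rule stated on the slides; the function name merge_haplotypes is an assumption, not something from the talk.

```python
def merge_haplotypes(h1, h2):
    """Merge two haplotypes (lists of 0/1) into a genotype (list of 0/1/2).

    Hypothetical helper: equal states are copied, unequal states become 2.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# The two haplotypes from the slide, sites 1..9.
hap1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
hap2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(merge_haplotypes(hap1, hap2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```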
4
Genotypes to Haplotypes
Genotype for the individual:
2 1 2 1 0 0 1 2 0
Two haplotypes per individual:
0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0
For each site, if the genotype has state 0 or 1, then the two haplotypes must have states 0, 0 or 1, 1. If the genotype has state 2, the two haplotypes can either have states 0, 1 or 1, 0.
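To see how many haplotype pairs can explain one genotype, the rule above can be turned into a small enumeration. This is a sketch under the assumption that a pair is unordered (swapping the two haplotypes gives the same pair); the name explaining_pairs is hypothetical. For a genotype with k >= 1 sites in state 2 it yields 2^(k-1) pairs.

```python
from itertools import product


def explaining_pairs(genotype):
    """Return all unordered haplotype pairs that merge into `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        # No site in state 2: both haplotypes equal the genotype.
        return [(tuple(genotype), tuple(genotype))]
    first, rest = het[0], het[1:]
    pairs = []
    for bits in product((0, 1), repeat=len(rest)):
        h1, h2 = list(genotype), list(genotype)
        # Fixing the first heterozygous site avoids counting (h1, h2) and (h2, h1) twice.
        h1[first], h2[first] = 0, 1
        for site, b in zip(rest, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


pairs = explaining_pairs([2, 1, 2, 1, 0, 0, 1, 2, 0])
print(len(pairs))  # 4, since the genotype has three sites in state 2
```

The pair shown on the slide is one of the four pairs produced for this genotype.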
5
Haplotype Inference Problem
For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is harder and more expensive to collect than genotype data.
Haplotype Inference Problem: Given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes.
NIH leads the HAPMAP project to find common haplotypes in the human population.
6
Haplotype Inference Problem
If the genotype has state 2 at k sites, there are 2^(k-1) possible explaining haplotype pairs.
How do we determine which haplotype pair is the original one that generated the genotype?
We need a model of haplotype evolution to help solve the haplotype inference problem.
7
The Perfect Phylogeny Model of Haplotype Evolution
[Figure: a perfect phylogeny over sites 1 2 3 4 5. The root is the ancestral haplotype 00000, the site mutations 1 to 5 label the edges, and the extant haplotypes at the leaves are 10100, 10000, 01011, 00010 and 01010.]
8
Assumptions of Perfect Phylogeny Model
No recombination, only mutation.
Infinite-site assumption: one mutation per site.
9
The Perfect Phylogeny Haplotyping (PPH) Problem
Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotype matrix
Site  1  2
a     2  2
b     0  2
c     1  0

Haplotype matrix
Site  1  2
a     1  0
a     0  1
b     0  0
b     0  1
c     1  0
c     1  0

Perfect phylogeny
[Figure: a tree rooted at 00 with one edge for site 1 and one edge for site 2. The leaves are c, c, a with haplotype 10, a, b with haplotype 01, and b with haplotype 00.]
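Whether a candidate haplotype matrix fits a perfect phylogeny can be checked with a standard pairwise test, which is not the linear-time algorithm of this talk. The sketch below assumes the ancestral haplotype is all zeros, as in the figures; under that assumption a 0/1 matrix fits a perfect phylogeny exactly when no pair of columns contains all three of the combinations 01, 10 and 11. The function name is hypothetical.

```python
from itertools import combinations


def fits_rooted_perfect_phylogeny(haplotypes):
    """Pairwise column test for a perfect phylogeny rooted at the all-zero haplotype."""
    if not haplotypes:
        return True
    m = len(haplotypes[0])
    forbidden = {(0, 1), (1, 0), (1, 1)}
    for i, j in combinations(range(m), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if forbidden <= seen:
            return False  # columns i and j conflict
    return True


# Haplotype matrix from the slide (rows a, a, b, b, c, c).
slide_solution = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
# The other way to explain genotype a = 2 2, namely 11 and 00.
other_choice = [(1, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0)]

print(fits_rooted_perfect_phylogeny(slide_solution))  # True
print(fits_rooted_perfect_phylogeny(other_choice))    # False
```

The second matrix also explains the three genotypes, but it is not a PPH solution because sites 1 and 2 conflict.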
10
Prior Work
Several existing algorithms solve the PPH problem, but none of them runs in linear time.
Our contribution: a linear-time algorithm.
Our implementation is about 250 times faster than the fastest previous algorithm on large data sets.
11
A P-Class of PPH Solutions
[Figure: a genotype matrix with genotypes a, b, c, d over sites 1 2 3 4 5, next to one PPH solution drawn as a tree with a root, edges labeled 1 to 5, and labels a,d; a,c; b,d; b,c.]
P-Class: maximum common subgraph in all PPH solutions.
Each P-Class consists of two subtrees.
12
P-Class Property of PPH Solutions
All PPH solutions can be obtained by choosing how to flip each P-Class.
[Figure: one PPH solution and a second PPH solution, each drawn as a tree with a root, edges labeled 1 to 5, and labels a,d; a,c; b,c; b,d. The switching points are marked in both trees.]
13
The Key Theorem
Every PPH solution can be obtained by choosing a flip for each P-Class.
Conversely, after fixing one P-Class, every distinct choice of flips of the P-Classes leads to a distinct PPH solution.
If there are k P-Classes, there are 2^(k-1) distinct PPH solutions.
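The count in the theorem can be illustrated by enumerating flip choices with one P-Class held fixed. This is only a counting sketch: the P-Class names and the function flip_choices are hypothetical, and no trees are built.

```python
from itertools import product


def flip_choices(p_classes):
    """Enumerate flip assignments with the first P-Class fixed (needs at least one class)."""
    fixed, free = p_classes[0], p_classes[1:]
    choices = []
    for flips in product((False, True), repeat=len(free)):
        choice = {fixed: False}          # the fixed P-Class is never flipped
        choice.update(zip(free, flips))  # every other P-Class is flipped or not
        choices.append(choice)
    return choices


choices = flip_choices(["P1", "P2", "P3"])
print(len(choices))  # 4, which is 2^(3-1)
```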
14
Shadow Tree
Contains classes.
Each class in the shadow tree is a subgraph of a P-Class.
Merging classes results in larger classes; classes are never split.
Contains tree edges and shadow edges.
15
The Algorithm
Process the genotype matrix one row at a time, starting at the first row, and modify the shadow tree.
The genotype matrix contains only entries of value 0 and 2.
16
Overview of the Algorithm for One Row
Procedure FirstPath
Procedure SecondPath
Procedure FixTree
Procedure NewEntries
17
OldEntryList
[Figure: the genotype matrix with row 3 highlighted.]
OldEntryList: the column indices that have an entry of value 2 in this row and also have an entry of value 2 in some previous row.
OldEntryList for row 3: 1, 2, 3, 5
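The definition of OldEntryList translates directly into a single pass over the rows. The sketch below is an assumption about how it could be computed, not the authors' implementation; the function name and the small example matrix are made up for illustration and are not the matrix from the slide.

```python
def old_entry_lists(matrix):
    """For each row, list the 1-based columns holding a 2 here and in some earlier row."""
    seen = set()   # columns that already had a 2 in a previous row
    result = []
    for row in matrix:
        twos = [j + 1 for j, value in enumerate(row) if value == 2]
        result.append([j for j in twos if j in seen])
        seen.update(twos)
    return result


# Hypothetical 0/2 matrix with three rows and five columns.
example = [
    [2, 0, 2, 0, 2],
    [0, 2, 2, 2, 0],
    [2, 2, 2, 0, 2],
]
print(old_entry_lists(example))  # [[], [3], [1, 2, 3, 5]]
```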
18
Procedures FirstPath and SecondPath
FirstPath: Construct a first path towards the root of the shadow tree that passes through the tree edges of as many columns in OldEntryList as possible.
SecondPath: Construct a second path towards the root of the shadow tree that passes through the tree edges of the columns in OldEntryList that are not on the first path.
19
Shadow Tree After Processing the First Two Rows
[Figure: the genotype matrix next to the shadow tree obtained after processing rows 1 and 2. The tree has a root and edges labeled by columns 1 to 5.]
OldEntryList for row 3: 1, 2, 3, 5
20
Algorithm – FirstPath
[Figure: the shadow tree, with a root and edges labeled by columns 1 to 5, during FirstPath.]
OldEntryList: 1, 2, 3, 5
CheckList: 3, 2
Edges 4 and 5 cannot be on the same path to the root in any PPH solution.
21
Algorithm – SecondPath
[Figure: the shadow tree, with a root and edges labeled by columns 1 to 5, during SecondPath.]
OldEntryList: 1, 2, 3, 5
CheckList: 3, 2
22
Shadow Tree to PPH Solutions
[Figure: the genotype matrix (genotypes a, b, c, d over sites 1 2 3 4 5), the final shadow tree, and one PPH solution derived from it. Both trees have edges labeled by columns 1 to 5.]
23
Shadow Tree to PPH Solutions
[Figure: the final shadow tree and a second PPH solution derived from it, with edges labeled by columns 1 to 5 and labels a,d; b,c; b,d; a,c.]
24
Implementation – Leaf Count
Leaf count of column i (L[i]): the number of 2's plus twice the number of 1's in column i.
L[i] is the number of leaves below mutation i in every perfect phylogeny for the genotype matrix.
Along any path to the root in any PPH solution, the successive edges are labeled by columns with strictly increasing leaf counts.

             1  2  3  4
a            1  1  0  0
b            0  2  2  0
c            2  0  2  0
d            2  0  0  2
Leaf Count:  4  3  2  1
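The leaf count can be computed with one pass over each column. The following is a minimal sketch of the definition above, using the 4 x 4 example matrix from the slide; the function name leaf_counts is an assumption.

```python
def leaf_counts(matrix):
    """Leaf count per column: number of 2's plus twice the number of 1's."""
    num_columns = len(matrix[0])
    counts = [0] * num_columns
    for row in matrix:
        for j, value in enumerate(row):
            if value == 2:
                counts[j] += 1   # one of the two haplotypes carries the mutation
            elif value == 1:
                counts[j] += 2   # both haplotypes carry the mutation
    return counts


# Rows a, b, c, d from the slide.
genotypes = [
    [1, 1, 0, 0],
    [0, 2, 2, 0],
    [2, 0, 2, 0],
    [2, 0, 0, 2],
]
print(leaf_counts(genotypes))  # [4, 3, 2, 1]
```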
25
Time Complexity
A constant number of simple operations on each edge per row.
Each traversal in the shadow tree goes through O(m) edges.
The algorithm does a constant number of traversals in the shadow tree for each row.
Total time: O(nm), where n and m are the numbers of rows and columns in the genotype matrix.
26
Results
Average Running Times (seconds)

Sites (m)  Individuals (n)  Datasets  DPPH O(nm^2)  Our Alg. O(nm)
300        150              30        1.07          0.05
500        250              30        5.72          0.13
1000       500              30        45.85         0.48
2000       1000             10        467.18        1.89
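The speedup implied by these timings can be worked out directly from the two running-time columns. The snippet below only divides the reported numbers; the variable and function names are made up for illustration.

```python
# (DPPH seconds, linear-time algorithm seconds), one pair per table row.
timings = [
    (1.07, 0.05),
    (5.72, 0.13),
    (45.85, 0.48),
    (467.18, 1.89),
]


def speedups(rows):
    """Ratio of DPPH time to the linear-time algorithm's time for each row."""
    return [dpph / ours for dpph, ours in rows]


for (dpph, ours), ratio in zip(timings, speedups(timings)):
    print(f"{dpph:8.2f} s vs {ours:5.2f} s: about {ratio:.0f}x")
```

On the largest data sets the ratio is about 247, in line with the "about 250 times faster" figure quoted earlier in the talk.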