Using PQ Trees For Comparative Genomics - CPM 2005 1
Using PQ Trees For Comparative Genomics
Gad M. Landau, Univ. of Haifa
Laxmi Parida, IBM T.J. Watson
Oren Weimann, Univ. of Haifa
Gene Clusters
Genes that appear together consistently across genomes are believed to be functionally related; however, their order does not have to be the same in every genome.
[Figure: Genome 1 through Genome 5]
What is a Pattern? [WABI04]
Given a string S = “s1 s2 s3 … sn” and an integer K, a set of characters P = {p1, p2, p3, …, pm} is a K-pattern if P occurs (possibly permuted) in at least K places in S.
Example: S = a b c d b a c d a b a c b, P = {a,b,c}, K = 4
P is a 4-pattern with location-list = {1,5,10,11}.
For the moment we will assume that every character appears once in the pattern.
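The definition above can be checked with a sliding window. The following is a minimal sketch, not the algorithm from the talk: the function names `pattern_locations` and `is_k_pattern` are made up for illustration, and positions are reported 1-based to match the location-list on the slide.

```python
from collections import Counter


def pattern_locations(s, pattern):
    """Return the 1-based start positions where a window of s
    is a permutation of the characters in pattern."""
    m = len(pattern)
    target = Counter(pattern)
    return [i + 1
            for i in range(len(s) - m + 1)
            if Counter(s[i:i + m]) == target]


def is_k_pattern(s, pattern, k):
    """True if pattern occurs (possibly permuted) in at least k places in s."""
    return len(pattern_locations(s, pattern)) >= k


S = "abcdbacdabacb"                    # the example string, spaces removed
print(pattern_locations(S, "abc"))     # [1, 5, 10, 11]
print(is_k_pattern(S, "abc", 4))       # True
```

Comparing a fresh `Counter` per window costs O(nm); it is kept here only because it mirrors the definition directly.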
Maximal Patterns
Maximal notation: a representation of a maximal pattern p that illustrates all the patterns that are non-maximal with respect to p.
Example: S = a b c d e b a d c e
{a,b} is non-maximal with respect to {a,b,c,d,e}.
The maximal notation of {a,b,c,d,e} is ((a,b)-(c,d)-e).
Our goal: find all patterns p and their maximal notation.
Our solution: a linear time algorithm based on PQ trees.
PQ trees: Definitions
PQ trees [Booth, Lueker, 1976]:
Character labeled leaves.
P-nodes: represent “truly permuted” components. Arbitrary permutations of the children are allowed.
Q-nodes: represent bi-connected components. Only “reversion” (reversing the order of the children) is allowed.
[Figure: an example PQ tree with leaves A through K]
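PQ trees: Definitions
Equivalent PQ trees (T ≡ T'): one tree can be obtained from the other by permuting the children of P-nodes and reversing the children of Q-nodes.
[Figure: two equivalent PQ trees T and T' on the leaves A through K]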
PQ trees: Definitions
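FRONTIER: the leaves of the tree read from left to right.
[Figure: two equivalent PQ trees T and T' on the leaves A through K]
FRONTIER(T) = “A B C D E F G H I J K”
FRONTIER(T') = “A B C G H I J K E F D”
C(T) = the set of frontiers of all trees equivalent to T:
C(T) = { FRONTIER(T') | T' ≡ T }
Theorem: If C(T1) = C(T2) then T1 ≡ T2.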
Our Use of the PQ tree
Suppose the pattern {a,b,c,d} appears in 4 locations as: Π = { abcd, acbd, dbca, dcba }.
Our goal: a PQ tree T with C(T) = { abcd, acbd, dbca, dcba }.
Write the P-nodes as ‘,’ and the Q-nodes as ‘-’ to get (a-(b,c)-d), which is exactly the maximal notation of the pattern {a,b,c,d}.
[Figure: the PQ tree (a-(b,c)-d)]
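To make C(T) and the ‘,’ / ‘-’ notation concrete, here is a small sketch that enumerates every frontier of a PQ tree. The nested-tuple representation and the names `frontiers` and `notation` are assumptions made for this example only; leaves are assumed to be single characters.

```python
from itertools import permutations, product

# A tree is either a leaf (a one-character string) or a pair
# (kind, children) where kind is "P" or "Q". This encoding is hypothetical.


def frontiers(tree):
    """Return C(T): the frontiers of all trees equivalent to `tree`."""
    if isinstance(tree, str):
        return {tree}
    kind, children = tree
    if kind == "P":
        orders = permutations(children)                 # any permutation
    else:
        orders = [tuple(children), tuple(reversed(children))]  # reversal only
    result = set()
    for order in orders:
        # Combine every frontier of every child, in this child order.
        for parts in product(*(frontiers(child) for child in order)):
            result.add("".join(parts))
    return result


def notation(tree):
    """Write P-nodes with ',' and Q-nodes with '-'."""
    if isinstance(tree, str):
        return tree
    kind, children = tree
    separator = "," if kind == "P" else "-"
    return "(" + separator.join(notation(child) for child in children) + ")"


T = ("Q", ["a", ("P", ["b", "c"]), "d"])
print(notation(T))             # (a-(b,c)-d)
print(sorted(frontiers(T)))    # ['abcd', 'acbd', 'dbca', 'dcba']
```

The enumeration is exponential in general; it is only meant for checking small examples such as the one on this slide.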
The minimal Consensus PQ tree
It is not always possible to find a tree T where Π = C(T). Consider a pattern {a,b,c,d} that appears as Π = { abcd, bdac }: here { abcd, bdac } ⊂ C(T).
Given k permutations Π = {π1, π2, …, πk}, a consensus PQ tree T of Π is such that Π ⊆ C(T), and the consensus is minimal when there exists no other T' such that Π ⊆ C(T') and |C(T')| < |C(T)|.
The problem of obtaining a maximal notation for a pattern is the same as obtaining a minimal consensus PQ tree of all the k occurrences.
Theorem: The minimal consensus PQ tree T is unique.
[Figure: PQ tree over the leaves a, b, c, d]
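For tiny patterns the size of the minimal consensus can be explored by brute force. The sketch below assumes that C(T) of the minimal consensus tree is exactly the set of orderings that keep every common contiguous block of the occurrences contiguous (this is how the talk later builds the tree from common intervals); the function names are hypothetical and nothing here is the linear-time construction.

```python
from itertools import permutations


def is_contiguous(order, subset):
    """True if the members of `subset` occupy consecutive positions in `order`."""
    positions = [i for i, ch in enumerate(order) if ch in subset]
    return positions[-1] - positions[0] + 1 == len(subset)


def common_sets(occurrences):
    """Character sets of size >= 2 that are contiguous in every occurrence."""
    first = occurrences[0]
    n = len(first)
    found = []
    for i in range(n):
        for j in range(i + 2, n + 1):
            subset = set(first[i:j])
            if all(is_contiguous(occ, subset) for occ in occurrences):
                found.append(subset)
    return found


def minimal_consensus_frontiers(occurrences):
    """Brute-force stand-in for C(T) of the minimal consensus tree (assumption)."""
    constraints = common_sets(occurrences)
    return {"".join(p)
            for p in permutations(occurrences[0])
            if all(is_contiguous(p, s) for s in constraints)}


print(len(minimal_consensus_frontiers(["abcd", "bdac"])))   # 24
print(sorted(minimal_consensus_frontiers(["abcd", "acbd", "dbca", "dcba"])))
# ['abcd', 'acbd', 'dbca', 'dcba']
```

For { abcd, bdac } the only common block is the whole pattern, so all 24 orderings are allowed and the two occurrences are a proper subset of C(T).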
The original use of the PQ Tree
The consecutive 1’s problem:

    a  b  c  d
    1  1  1  0
    0  1  1  0
    0  1  1  1
    0  1  0  0

The restriction sets (one per row):
F = { {a,b,c} , {b,c} , {b,c,d} , {b} }
The solution [Booth, Lueker, 1976]:
Reduce(F) = [Figure: PQ tree over the leaves a, b, c, d]
The result will be C(T), in our case C(T) = { abcd , acbd , dbca , dcba },
and the tree was constructed in O(n²) time (for an n x n matrix).
(Reduce(F) by [Booth, Lueker, 1976])
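The meaning of the restriction sets can be illustrated without a PQ tree: an ordering of the columns is valid when every set in F lands on consecutive columns. The following brute-force sketch is not Booth and Lueker's Reduce, only a check of what Reduce(F) has to represent; `consecutive_orders` is a name invented for this example.

```python
from itertools import permutations


def consecutive_orders(columns, restriction_sets):
    """Return every ordering of `columns` in which each restriction set
    occupies consecutive positions (brute force, exponential)."""
    valid = []
    for order in permutations(columns):
        position = {c: i for i, c in enumerate(order)}
        if all(max(position[c] for c in r) - min(position[c] for c in r)
               == len(r) - 1
               for r in restriction_sets):
            valid.append("".join(order))
    return valid


F = [{"a", "b", "c"}, {"b", "c"}, {"b", "c", "d"}, {"b"}]
print(consecutive_orders("abcd", F))   # ['abcd', 'acbd', 'dbca', 'dcba']
```

The four orderings printed are the same set given as C(T) on the slide.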
Obtaining the Minimal Consensus PQ tree
Some definitions [Heber, Stoye, 2001]:
Example permutations:
π1 = 1 2 3 4 5 6 7 8 9
π2 = 9 8 4 5 6 7 1 2 3
π3 = 1 2 3 8 7 4 5 6 9
Common interval: an interval that appears as a consecutive sequence in all the appearances, e.g. [4-8] in the example.
We denote C_Π = all common intervals = { [1-2],[2-3],[1-3],[1-9],[1-8],[4-5],[4-6],[4-7],[4-8],[4-9],[5-6] }
A list p of common intervals is a chain if every two successive intervals in p have a non-trivial overlap. For example, p = ([1-2],[2-3]).
A common interval is called reducible if there is a chain that generates it; otherwise it is called irreducible. [1-3] is a reducible interval since it can be generated by the irreducible intervals [1-2], [2-3].
We denote I_Π = all irreducible intervals of Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[4-9],[5-6] }
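The definitions of common and irreducible intervals can be tried out directly on the three example permutations. This is a quadratic brute-force sketch, not Heber and Stoye's algorithm; it assumes the first permutation is the identity, and it relies on the fact that the union of two overlapping common intervals is again a common interval, so a chain can be collapsed to a pair when testing reducibility. The function names are hypothetical.

```python
PERMS = [
    (1, 2, 3, 4, 5, 6, 7, 8, 9),
    (9, 8, 4, 5, 6, 7, 1, 2, 3),
    (1, 2, 3, 8, 7, 4, 5, 6, 9),
]


def common_intervals(perms):
    """All [lo, hi] with lo < hi whose values are consecutive in every permutation."""
    n = len(perms[0])
    positions = [{v: i for i, v in enumerate(p)} for p in perms]
    found = []
    for lo in range(1, n):
        for hi in range(lo + 1, n + 1):
            values = range(lo, hi + 1)
            if all(max(pos[v] for v in values) - min(pos[v] for v in values)
                   == hi - lo
                   for pos in positions):
                found.append((lo, hi))
    return found


def irreducible_intervals(common):
    """Common intervals that are not the union of two overlapping shorter ones."""
    known = set(common)

    def reducible(lo, hi):
        # Look for [lo, b] and [a, hi] with lo < a <= b < hi.
        return any((lo, b) in known and (a, hi) in known
                   for a in range(lo + 1, hi)
                   for b in range(a, hi))

    return [c for c in common if not reducible(*c)]


C = common_intervals(PERMS)
I = irreducible_intervals(C)
print((4, 8) in C)    # True
print((1, 3) in I)    # False, since [1-3] is generated by [1-2] and [2-3]
```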
Obtaining the Minimal Consensus PQ tree
Theorem: Reduce(C_Π) = Reduce(I_Π) = minimal consensus tree.
The Algorithm:
1. Compute I_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[4-9],[5-6] }.
2. Compute Reduce(I_Π) to get the minimal consensus tree of Π.
[Figure: the three example permutations and the resulting PQ tree]
The pattern notation is: ((1-2-3)-(((4-5-6),7),8)-9)
Time complexity: for a pattern of size n that appears in k places, it takes a total of O(kn + n²) to compute the maximal notation.
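The theorem can be sanity-checked on the running example by brute force: the orderings of 1..9 that keep every common interval consecutive should be the same as those that keep only the irreducible intervals consecutive. In this sketch the two interval lists are the ones obtained from the three example permutations, written out as literals; `orders_respecting` is a hypothetical helper and enumerating all 9! orderings is only feasible because the example is small.

```python
from itertools import permutations

# Common and irreducible intervals of the three example permutations.
C = [(1, 2), (1, 3), (1, 8), (1, 9), (2, 3), (4, 5),
     (4, 6), (4, 7), (4, 8), (4, 9), (5, 6)]
I = [(1, 2), (1, 8), (2, 3), (4, 5), (4, 7), (4, 8), (4, 9), (5, 6)]


def contiguous(order, lo, hi):
    """True if the values lo..hi occupy consecutive positions in `order`."""
    idx = [i for i, v in enumerate(order) if lo <= v <= hi]
    return idx[-1] - idx[0] == hi - lo


def orders_respecting(intervals, n):
    """All orderings of 1..n in which every given interval is consecutive."""
    return {p for p in permutations(range(1, n + 1))
            if all(contiguous(p, lo, hi) for lo, hi in intervals)}


from_common = orders_respecting(C, 9)
from_irreducible = orders_respecting(I, 9)
print(from_common == from_irreducible, len(from_irreducible))   # True 32
```

The count of 32 agrees with the notation ((1-2-3)-(((4-5-6),7),8)-9): three Q-nodes and two two-child P-nodes each contribute a factor of 2.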
Improving the Time Complexity to O(kn)
In Heber & Stoye’s algorithm for obtaining I_Π, a data structure S was maintained to hold the chains of the irreducible intervals:
I_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[4-9],[5-6] }
REPLACE(S):
Replace every chain by a Q node.
Replace every element that is not a leaf or a Q node and is pointed to by a vertical link with a P node.
[Figure: the data structure S for the three example permutations and the PQ tree obtained by REPLACE(S)]
Maximal Patterns and Sub-Trees
A sub-tree of the PQ tree T is obtained by picking a P-node in T with all its descendants, or by picking a Q-node in T with any number of consecutive descendants.
Suppose the pattern {a,b,c,d} appears in 4 locations as: Π = { abcd, acbd, dbca, dcba }.
Theorem: If p1 and p2 are patterns, and p1 is non-maximal with respect to p2, then the PQ tree T1 that represents p1 is a sub-tree of the PQ tree T2 that represents p2.
[Figure: the PQ tree (a-(b,c)-d)]
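The sub-tree definition can be turned into a short enumeration of the leaf sets that sub-trees cover. The sketch below uses a hypothetical nested-tuple encoding of the tree and assumes that "any number of consecutive descendants" of a Q-node means two or more consecutive children.

```python
# A tree is a leaf (a string) or a pair (kind, children), kind in {"P", "Q"}.


def leaves(tree):
    """Leaves of `tree` from left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [leaf for child in children for leaf in leaves(child)]


def sub_tree_leaf_sets(tree):
    """Leaf sets of all sub-trees: a P-node with all of its children, or a
    Q-node with a run of consecutive children (two or more, by assumption)."""
    if isinstance(tree, str):
        return []
    kind, children = tree
    found = []
    if kind == "P":
        found.append(frozenset(leaves(tree)))
    else:
        for i in range(len(children)):
            for j in range(i + 2, len(children) + 1):
                found.append(frozenset(
                    leaf for child in children[i:j] for leaf in leaves(child)))
    for child in children:
        found.extend(sub_tree_leaf_sets(child))
    return found


T = ("Q", ["a", ("P", ["b", "c"]), "d"])
print(sorted("".join(sorted(s)) for s in sub_tree_leaf_sets(T)))
# ['abc', 'abcd', 'bc', 'bcd']
```

For this tree the sub-trees cover {b,c}, {a,b,c}, {b,c,d} and {a,b,c,d}, which are the character sets that stay contiguous in all four occurrences.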
So what did we achieve?
A first algorithm (and optimal in time) that generates the maximal notation of a pattern.
A “bottom-up” construction of a PQ tree.
A visualization of the inner structure of a pattern.
Filtering of meaningful from apparently meaningless (non-maximal) clusters.
Experimental results that prove this tool can aid in predicting gene functions.
Clustering for the various genome models.
Using Our Tool for Various Genome Models
Genome model I (orthologs only):
A sequence is a permutation of the set {1,2,…,n}.
Only one maximal pattern: {1,2,…,n}.
In O(kn) time we get a PQ tree that describes all patterns of all sizes and their non-maximal relations.
Using Our Tool for Various Genome Models
Genome model II: A gene may appear once in a sequence or not appear at all in that sequence.
We can extend the algorithm to work on sequences that are not permutations of the same set in O(nk²) time.
Example: consider the 2 sequences
1 2 3 4 5 6 7 and 1 8 2 4 3 7 6
Add characters as needed:
8 1 2 3 4 5 5’ 6 7 8’ and 5 1 8 8’ 2 4 3 7 6 5’
Build the PQ tree on the new sequences:
[Figure: PQ tree over the leaves 8, 1, 2, 3, 4, 5, 5’, 6, 7, 8’, with some leaves drawn in red]
The sub-trees that have no red leaves are all the maximal patterns.
Using Our Tool for Various Genome Models
Genome model III (paralogs and orthologs):
A gene may appear any number of times in a sequence (including zero).
The minimal consensus PQ tree is not necessarily unique.
Solution:
Example: consider 2 appearances of the pattern {a,a,b} as Π = { aab , baa }:
1. Π = { a1a2b , ba2a1 } gives C(T) = { a1a2b , ba2a1 }
2. Π = { a1a2b , ba1a2 } gives C(T) = { a1a2b , ba2a1 , a2a1b , ba1a2 }
[Figure: the two PQ trees over the leaves a1, a2, b]
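The effect of the two labelings can be reproduced by brute force. As a sketch, and under the assumption that C(T) of the minimal consensus tree consists of the orderings that keep every common contiguous block contiguous, the code below counts the frontiers for each labeling of the copies of a; the helper names are invented for this example.

```python
from itertools import permutations


def contiguous(order, subset):
    """True if the members of `subset` occupy consecutive positions in `order`."""
    idx = [i for i, x in enumerate(order) if x in subset]
    return idx[-1] - idx[0] + 1 == len(subset)


def consensus_frontiers(occurrences):
    """Orderings that keep every common block of the occurrences contiguous
    (brute-force stand-in for C(T) of the minimal consensus tree)."""
    first = occurrences[0]
    n = len(first)
    common = [set(first[i:j])
              for i in range(n) for j in range(i + 2, n + 1)
              if all(contiguous(occ, set(first[i:j])) for occ in occurrences)]
    return {p for p in permutations(first)
            if all(contiguous(p, s) for s in common)}


# The two ways of labeling the copies of 'a' in the occurrence "baa".
labeling_1 = [("a1", "a2", "b"), ("b", "a2", "a1")]
labeling_2 = [("a1", "a2", "b"), ("b", "a1", "a2")]

print(len(consensus_frontiers(labeling_1)))   # 2
print(len(consensus_frontiers(labeling_2)))   # 4
```

The first labeling yields the smaller C(T), which is why the choice of labeling matters when a gene has several copies.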
It