
Using PQ Trees For Comparative Genomics - CPM 2005

Using PQ Trees For Comparative Genomics

Gad M. Landau, Univ. of Haifa
Laxmi Parida, IBM T.J. Watson
Oren Weimann, Univ. of Haifa


Gene Clusters

Genes that appear together consistently across genomes are believed to be functionally related. However, their ordering doesn’t have to be the same in every genome.

[Figure: gene order in Genome 1 through Genome 5]


What is a Pattern? [WABI04]

Given a string S = “s1 s2 s3 … sn” and an integer K, a set of characters P = {p1, p2, p3, …, pm} is a K-pattern if P occurs (possibly permuted) in at least K places in S.

Example:
S = a b c d b a c d a b a c b
P = {a,b,c}, K = 4

P is a 4-Pattern with location-list = {1,5,10,11}

For the moment we will assume that every character appears once in the pattern.
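The location-list in the example can be checked with a short sliding-window sketch. The function name `permuted_occurrences` is hypothetical, and the code assumes 1-based positions, as the slide does; it is a brute-force check, not the authors' algorithm.

```python
from collections import Counter

def permuted_occurrences(s, pattern):
    """Return the 1-based start positions of every window of s that is
    a permutation of the characters in pattern."""
    m = len(pattern)
    target = Counter(pattern)
    return [i + 1
            for i in range(len(s) - m + 1)
            if Counter(s[i:i + m]) == target]

# The example from the slide: S = a b c d b a c d a b a c b, P = {a,b,c}
S = "abcdbacdabacb"
locations = permuted_occurrences(S, "abc")
print(locations)
```

P is a K-pattern whenever the returned list has at least K entries.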


Maximal Patterns

Maximal notation: a representation of a maximal pattern p that illustrates all the non-maximal patterns with respect to p.

Our goal: find all patterns p and their maximal notation.
Our solution: a linear-time algorithm based on PQ trees.

Example: S = a b c d e b a d c e
{a,b} is non-maximal with respect to {a,b,c,d,e}.
The maximal notation of {a,b,c,d,e} is ((a,b)-(c,d)-e).


PQ trees: Definitions

PQ trees [Booth, Lueker, 1976]
Character-labeled leaves.
P-nodes:
  Represent “truly permuted” components.
  Arbitrary permutations of children.
Q-nodes:
  Represent bi-connected components.
  Only “reversion”.

[Figure: an example PQ tree T over the leaves A through K]


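PQ trees: Definitions

Equivalent PQ trees (T ≡ T'): T' can be obtained from T by permuting the children of P-nodes and reversing the children of Q-nodes.

[Figure: two equivalent PQ trees T and T' over the leaves A through K]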


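PQ trees: Definitions

FRONTIER: the leaves of the tree read from left to right.
FRONTIER(T) = “A B C D E F G H I J K”
After rearranging T into an equivalent tree T': FRONTIER(T') = “A B C G H I J K E F D”

[Figure: the PQ tree T over the leaves A through K]

C(T) = the set of frontiers of all trees equivalent to T:

C(T) = { FRONTIER(T') | T' ≡ T }

Theorem: If C(T1) = C(T2) then T1 ≡ T2.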


Our Use of the PQ tree

Suppose the Pattern {a,b,c,d} appears in 4 locations as: Π = { abcd , acbd , dbca , dcba }.

Our goal: a PQ tree T such that

C(T) = Π = { abcd , acbd , dbca , dcba }.

Write the P-nodes as ‘,’ and the Q-nodes as ‘-’ and get: (a-(b,c)-d), which is exactly the maximal notation of the Pattern {a,b,c,d}.

[Figure: the PQ tree (a-(b,c)-d): a Q-node whose children are a, a P-node over b and c, and d]
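To see why (a-(b,c)-d) yields exactly these four strings, the following sketch enumerates C(T) by brute force. The tuple encoding `("P", [...])` / `("Q", [...])` and the function name `frontiers` are illustrative assumptions, not part of the original slides; the enumeration is exponential and only meant for tiny trees.

```python
from itertools import permutations, product

def frontiers(node):
    """Return C(node): the frontiers of all trees equivalent to node.
    A leaf is a string; an inner node is ("P", children) or ("Q", children)."""
    if isinstance(node, str):
        return {node}
    kind, children = node
    child_sets = [frontiers(c) for c in children]
    idx = tuple(range(len(children)))
    if kind == "P":
        orders = permutations(idx)      # P-node: any order of the children
    else:
        orders = [idx, idx[::-1]]       # Q-node: left-to-right or reversed
    result = set()
    for order in orders:
        for combo in product(*(child_sets[i] for i in order)):
            result.add("".join(combo))
    return result

# (a-(b,c)-d): a Q-node over a, a P-node {b, c}, and d
T = ("Q", ["a", ("P", ["b", "c"]), "d"])
C_T = frontiers(T)
print(sorted(C_T))
```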


The minimal Consensus PQ tree

It is not always possible to find a tree T where Π = C(T). Consider a Pattern {a,b,c,d} that appears as Π = { abcd , bdac }. Here the containment is strict:

{ abcd , bdac } ⊂ C(T)

Given permutations Π = {π1, π2, …, πk}, a consensus PQ tree T of Π is such that Π ⊆ C(T), and the consensus is minimal when there exists no other T' such that Π ⊆ C(T') and |C(T')| < |C(T)|.

The problem of obtaining a maximal notation for a Pattern is the same as obtaining a minimal consensus PQ tree of all the k occurrences.

Theorem: The minimal consensus PQ tree T is unique.

[Figure: PQ tree over the leaves a, b, c, d]


The original use of the PQ Tree

The consecutive 1’s problem: order the rows of a 0/1 matrix so that the 1’s in every column are consecutive.

The restriction sets:

F = { {a,b,c} , {b,c} , {b,c,d} , {b} }

     {a,b,c}  {b,c}  {b,c,d}  {b}
a       1       0       0      0
b       1       1       1      1
c       1       1       1      0
d       0       0       1      0

The solution [Booth, Lueker, 1976]: Reduce(F) builds a PQ tree T.

[Figure: the PQ tree Reduce(F) over the leaves a, b, c, d]

The result will be C(T), in our case C(T) = { abcd , acbd , dbca , dcba }, and the tree is constructed in O(n²) time (for an n x n matrix).
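As a sanity check on the example, the orderings allowed by F can be listed by trying every permutation. This is a brute-force sketch, not Booth and Lueker's Reduce algorithm, and the function name `consecutive_orders` is hypothetical.

```python
from itertools import permutations

def consecutive_orders(symbols, restriction_sets):
    """Return every ordering of symbols in which each restriction set
    occupies a block of consecutive positions."""
    valid = []
    for order in permutations(symbols):
        pos = {x: i for i, x in enumerate(order)}
        if all(max(pos[x] for x in r) - min(pos[x] for x in r) == len(r) - 1
               for r in restriction_sets):
            valid.append("".join(order))
    return valid

F = [{"a", "b", "c"}, {"b", "c"}, {"b", "c", "d"}, {"b"}]
orders = consecutive_orders("abcd", F)
print(orders)
```

The brute force takes n! steps, which is what the PQ tree avoids: the tree stores all valid orderings implicitly.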


Obtaining the Minimal Consensus PQ tree

Example: Π = {π1, π2, π3}
π1 = 1 2 3 4 5 6 7 8 9
π2 = 9 8 4 5 6 7 1 2 3
π3 = 1 2 3 8 7 4 5 6 9

Some definitions [Heber, Stoye, 2001]:

Common interval: an interval that appears as a consecutive sequence in all the appearances, e.g. [4-8] in the example.
We denote C_Π = all common intervals of Π = { [1-2],[2-3],[1-3],[1-9],[1-8],[4-5],[4-6],[4-7],[4-8],[4-9],[5-6] }

Chain: a list p of common intervals is a chain if every two successive intervals in p have a non-trivial overlap. For example p = ([1-2],[2-3]).

Reducible: a common interval is called reducible if there is a chain that generates it; otherwise it is called irreducible. [1-3] is a reducible interval since it can be generated by the irreducible intervals [1-2],[2-3].
We denote I_Π = all irreducible intervals of Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[4-9],[5-6] }

[Figure: the common intervals C_Π and the irreducible intervals I_Π marked on π1, π2, π3]
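The definitions above can be tried out on the three example permutations with a brute-force sketch. It is not Heber and Stoye's algorithm; it assumes the first permutation is the identity 1..n, and it uses the fact that an interval is reducible exactly when two proper common sub-intervals overlap and together cover it. The function names are hypothetical.

```python
def common_intervals(perms):
    """All (i, j) with i < j such that the values i..j are contiguous in
    every permutation. Assumes perms[0] is the identity 1..n."""
    n = len(perms[0])
    positions = [{v: idx for idx, v in enumerate(p)} for p in perms]
    found = []
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if all(max(pos[v] for v in range(i, j + 1))
                   - min(pos[v] for v in range(i, j + 1)) == j - i
                   for pos in positions):
                found.append((i, j))
    return found

def irreducible_intervals(intervals):
    """Keep the intervals that no chain of other common intervals generates."""
    known = set(intervals)
    kept = []
    for (i, j) in intervals:
        # Reducible: (i, b) and (a, j) are both common, with i < a <= b < j.
        reducible = any((i, b) in known and (a, j) in known
                        for a in range(i + 1, j + 1)
                        for b in range(a, j))
        if not reducible:
            kept.append((i, j))
    return kept

perms = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    [9, 8, 4, 5, 6, 7, 1, 2, 3],
    [1, 2, 3, 8, 7, 4, 5, 6, 9],
]
C = common_intervals(perms)
I = irreducible_intervals(C)
print(C)
print(I)
```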


Obtaining the Minimal Consensus PQ tree

Theorem: Reduce(C_Π) = Reduce(I_Π) = the minimal consensus tree, where C_Π is the set of all common intervals of Π and I_Π is the set of its irreducible intervals.

The Algorithm:
Compute I_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[4-9],[5-6] }.
Compute Reduce(I_Π) to get the minimal consensus tree of Π.

[Figure: the permutations π1 = 1 2 3 4 5 6 7 8 9, π2 = 9 8 4 5 6 7 1 2 3, π3 = 1 2 3 8 7 4 5 6 9 and their minimal consensus PQ tree]

The Pattern notation is: ((1-2-3)-(((4-5-6),7),8)-9)

Time Complexity: For a pattern of size n that appears in k places, it takes a total of O(kn + n²) time to compute the maximal notation.


Improving the Time Complexity to O(kn)

In Heber & Stoye’s algorithm for obtaining the irreducible intervals I_Π, a data structure S is maintained to hold the chains of the irreducible intervals:

I_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[4-9],[5-6] }

REPLACE(S):
Replace every chain by a Q node.
Replace every element that is not a leaf or a Q node and is pointed to by a vertical link with a P node.

[Figure: the data structure S over π1 = 1 2 3 4 5 6 7 8 9, π2 = 9 8 4 5 6 7 1 2 3, π3 = 1 2 3 8 7 4 5 6 9, and the PQ tree obtained by REPLACE(S)]


Maximal Patterns and Sub-Trees

A sub-tree of the PQ tree T is obtained by picking a P-node in T with all its descendants, or by picking a Q-node in T with any number of consecutive descendants.

Suppose the Pattern {a,b,c,d} appears in 4 locations as:

Π = { abcd , acbd , dbca , dcba }.

Theorem: If p1 and p2 are patterns, and p1 is non-maximal with respect to p2, then the PQ tree T1 that represents p1 is a sub-tree of the PQ tree T2 that represents p2.

[Figure: the PQ tree of the Pattern {a,b,c,d}]
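The sub-tree definition can be illustrated on the tree (a-(b,c)-d) of this pattern. The sketch below lists the leaf set of every sub-tree. The tuple encoding and the function names are assumptions made for illustration, and "any number of consecutive descendants" is read here as two or more consecutive children of a Q-node.

```python
def leaves(node):
    """Leaves of a node, left to right. A leaf is a string; an inner node
    is ("P", children) or ("Q", children)."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [x for c in children for x in leaves(c)]

def subtree_leaf_sets(node):
    """Yield the leaf set of every sub-tree rooted at or below node."""
    if isinstance(node, str):
        return
    kind, children = node
    if kind == "P":
        # A P-node is taken with all of its descendants.
        yield frozenset(leaves(node))
    else:
        # A Q-node is taken with any run of 2 or more consecutive children.
        for i in range(len(children)):
            for j in range(i + 2, len(children) + 1):
                yield frozenset(x for c in children[i:j] for x in leaves(c))
    for c in children:
        yield from subtree_leaf_sets(c)

T = ("Q", ["a", ("P", ["b", "c"]), "d"])
sets = set(subtree_leaf_sets(T))
print(sorted("".join(sorted(s)) for s in sets))
```

Under this reading, the sub-trees of (a-(b,c)-d) correspond to the character sets {b,c}, {a,b,c}, {b,c,d} and {a,b,c,d}.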


So what did we achieve?

A first algorithm (optimal in time) that generates the maximal notation of a pattern.
A “bottom-up” construction of a PQ tree.
A visualization of the inner structure of a pattern.
Filtering of meaningful from apparently meaningless (non-maximal) clusters.
Experimental results showing that this tool can aid in predicting gene functions.
Clustering for the various genome models.


Using Our Tool for Various Genome Models

Genome model I (orthologs only):

A sequence is a permutation of the set {1,2,…,n}.
Only one maximal pattern: {1,2,…,n}.
In O(kn) time we get a PQ tree that describes all patterns of all sizes and their non-maximal relations.



Genome model II: A gene may appear once in a sequence or not appear at all in that sequence.

We can extend the algorithm to work on sequences that are not permutations of the same set in O(kn²) time.

Example: consider the 2 sequences

1 2 3 4 5 6 7 and 1 8 2 4 3 7 6

Add characters as needed:

8 1 2 3 4 5 5’ 6 7 8’ and 5 1 8 8’ 2 4 3 7 6 5’

Build the PQ tree on the new sequences:

[Figure: PQ tree over the leaves 8, 1, 2, 3, 4, 5, 5’, 6, 7, 8’, with some leaves colored red]

The sub-trees that have no red leaves are all the maximal patterns.
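The padding step can be sketched as follows. The rule is inferred from this single example and is an assumption: every character x that is missing from some sequence is doubled to x x' wherever it occurs, and each sequence that lacks x gets x at its start and x' at its end. The function name `pad_sequences` is hypothetical, the ASCII apostrophe stands in for the prime, and the order in which several missing characters are added is an arbitrary choice.

```python
def pad_sequences(seqs):
    """Turn sequences over different character sets into permutations of
    one common set, following the x / x' scheme of the example."""
    alphabet = sorted(set().union(*seqs))
    # Characters that are absent from at least one sequence.
    partial = [c for c in alphabet if any(c not in s for s in seqs)]
    padded = []
    for s in seqs:
        present = set(s)
        missing = [c for c in partial if c not in present]
        body = []
        for c in s:
            body.append(c)
            if c in partial:
                body.append(c + "'")   # x becomes x x' where it occurs
        # Missing characters: x at the start, x' at the end.
        padded.append(missing + body + [m + "'" for m in reversed(missing)])
    return padded

seqs = ["1 2 3 4 5 6 7".split(), "1 8 2 4 3 7 6".split()]
padded = pad_sequences(seqs)
for p in padded:
    print(" ".join(p))
```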



Genome model III (paralogs and orthologs):

A gene may appear any number of times in a sequence (including zero).

The minimal consensus PQ tree is not necessarily unique.

Solution:

Example: consider 2 appearances of the pattern {a,a,b} as Π = { aab , baa }:

1. Π = { a1a2b , ba2a1 }, C(T) = { a1a2b , ba2a1 }

2. Π = { a1a2b , ba1a2 }, C(T) = { a1a2b , ba2a1 , a2a1b , ba1a2 }

[Figure: the PQ trees over the leaves a1, a2, b for labelings 1 and 2]


