
"Quadratic time algorithms for finding common intervals in two and more sequences" by T. Schmidt and J. Stoye, Proc. 15th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science 3109, pp. 347-358 (2004). Presented by Gangman Yi

Post on 21-Dec-2015



Overview

Introduction
Formal Model
Algorithms
Assignment

Gene Order & Function in Bacteria:

Observations:
Gene order in bacterial genomes is weakly conserved
Some genes tend to cluster together even in unrelated species
Functional association of genes inside a cluster


Formalization of Gene Clusters:

Genomes: permutations π1, π2, …, πk
Genes: numbers 1, …, n
Gene cluster: common interval (subset of numbers occurring contiguously in all permutations)

π1: 1 2 3 4 5 6 7 8
π2: 8 7 6 4 5 2 1 3
π3: 3 1 2 5 8 7 6 4
π4: 6 7 4 2 1 3 8 5


Algorithms:

Uno & Yagiura, Algorithmica 2000: Find all common intervals of two permutations in O(n+|output|) time.

Heber & Stoye, CPM 2001: Find all common intervals of k ≥ 2 permutations in O(kn+|output|) time.
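The definition of a common interval can be checked directly with a small brute-force sketch. This is not the algorithm of Uno & Yagiura or of Heber & Stoye and it does not reach their running times; the function name `common_intervals` is made up for illustration, and the sketch only tests every interval of the first permutation against the others.

```python
def common_intervals(perms):
    """Brute force: gene sets of size >= 2 that are contiguous in every permutation."""
    ref = perms[0]
    n = len(ref)
    # Position of each gene in each of the remaining permutations.
    positions = [{gene: idx for idx, gene in enumerate(p)} for p in perms[1:]]
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            genes = ref[i:j + 1]
            # A set is contiguous in a permutation exactly when its positions
            # span as many slots as the set has elements.
            if all(
                max(pos[g] for g in genes) - min(pos[g] for g in genes) == j - i
                for pos in positions
            ):
                found.append(frozenset(genes))
    return found


perms = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [8, 7, 6, 4, 5, 2, 1, 3],
    [3, 1, 2, 5, 8, 7, 6, 4],
    [6, 7, 4, 2, 1, 3, 8, 5],
]
for interval in common_intervals(perms):
    print(sorted(interval))
```

For the four permutations shown on the slide, this reports {1,2}, {1,2,3}, {6,7} and the full gene set.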

Modeling multiple copies of a gene (paralogs):

Problem:
Gene duplication results in multiple copies of a gene inside a genome
Difficult to assign the correct gene pair

[Figure: genomes π1, π2, π3 over genes 1 to 8; "?" marks gene copies whose counterpart is ambiguous]


Modeling multiple copies of a gene (paralogs):

Solution:
Do not distinguish between paralogous gene copies
Each paralogous copy of a gene gets the same number

Consequence: Genomes are modeled as sequences instead of permutations

S1: 1 2 3 4 5 6 7 8
S2: 3 1 2 4 8 7 6 1 2
S3: 8 7 6 7 5 4 2 1 3

Formal Model:

Given: String S over a finite alphabet Σ

Notation:
S[i] = the i-th character of S
S[i,j] = substring of S starting at index i and ending at j

Definition: The character set CS(S[i,j]) := {S[k] | i ≤ k ≤ j} is the set of all characters occurring in the substring S[i,j].

Example:

index: 1 2 3 4 5 6 7 8
S    : 3 1 2 3 1 5 2 6

CS(S[2,5]) = {1,2,3}
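The character set definition translates almost literally into code. The following is a minimal sketch; the helper name `character_set` is an assumption for illustration, and indices are 1-based and inclusive to match the slide.

```python
def character_set(s, i, j):
    """CS(S[i,j]): the set of characters in S[i..j], 1-based and inclusive."""
    if not (1 <= i <= j <= len(s)):
        raise ValueError("need 1 <= i <= j <= |S|")
    return set(s[i - 1:j])


S = [3, 1, 2, 3, 1, 5, 2, 6]
print(character_set(S, 2, 5))
```

For the slide's example, the substring S[2,5] is 1 2 3 1, so the call returns the set {1, 2, 3}.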

Formal Model:

Given: Subset C ⊆ Σ

Definition: (i, j) is a CS-location of C in S if and only if CS(S[i,j]) = C
left-maximal: i = 1 or S[i-1] ∉ C
right-maximal: j = |S| or S[j+1] ∉ C
maximal: both left- and right-maximal

Example:

index: 1 2 3 4 5 6 7 8
S    : 3 1 2 3 1 5 2 6

The pair (3,5) is a CS-location of the set C = {1,2,3}, because CS(S[3,5]) = {1,2,3}, but it is not left-maximal, since S[2] = 1 ∈ C.
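Maximal CS-locations can be found with a single scan: a maximal CS-location of C is a maximal run of characters from C whose character set is exactly C. The sketch below follows that observation; the function names `is_cs_location` and `maximal_cs_locations` are hypothetical, and indices are 1-based.

```python
def is_cs_location(s, c, i, j):
    """True if CS(S[i,j]) equals C (1-based, inclusive indices)."""
    return set(s[i - 1:j]) == set(c)


def maximal_cs_locations(s, c):
    """All maximal CS-locations (i, j) of the character set C in S."""
    c = set(c)
    n = len(s)
    locations = []
    start = None  # start of the current run of characters from C, if any
    for p in range(1, n + 2):  # position n + 1 closes a run ending at n
        inside = p <= n and s[p - 1] in c
        if inside and start is None:
            start = p
        elif not inside and start is not None:
            # The run (start, p - 1) cannot be extended to either side.
            if set(s[start - 1:p - 1]) == c:
                locations.append((start, p - 1))
            start = None
    return locations


S = [3, 1, 2, 3, 1, 5, 2, 6]
print(is_cs_location(S, {1, 2, 3}, 3, 5))
print(maximal_cs_locations(S, {1, 2, 3}))
```

On the slide's string, (3,5) is a CS-location of {1,2,3}, while the only maximal one is (1,5).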

Formal Model:

Given: Collection of k strings S* = (S1,...,Sk) over alphabet Σ

Definition: C ⊆ Σ is a common CS-factor of S* if and only if C has a CS-location in each Sl, 1 ≤ l ≤ k.

Example:

index: 1 2 3 4 5 6 7 8 9
S1   : 3 2 1 3 1 5 1 6
S2   : 4 3 5 5 5 1 4 2 2
S3   : 7 5 1 5 3 6 5

common CS-factor: {1,3,5} => S1: (3,7), S2: (2,6), S3: (2,5)
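The example can be verified with a brute-force check that lists every CS-location of a candidate set in each string. This is only a sketch for small inputs; `cs_locations` and `is_common_cs_factor` are illustrative names, not part of the paper.

```python
def cs_locations(s, c):
    """All CS-locations (i, j) of the set C in S, 1-based, by brute force."""
    c = set(c)
    found = []
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            if s[j] not in c:
                break  # any longer substring would contain a character outside C
            seen.add(s[j])
            if seen == c:
                found.append((i + 1, j + 1))
    return found


def is_common_cs_factor(strings, c):
    """C is a common CS-factor if it has a CS-location in every string."""
    return all(cs_locations(s, c) for s in strings)


S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5]
print(is_common_cs_factor([S1, S2, S3], {1, 3, 5}))
```

The locations named on the slide, (3,7), (2,6) and (2,5), appear among the results for S1, S2 and S3 respectively; non-maximal locations such as (3,6) in S1 are listed as well.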


Problem Formulation:

A common CS-factor of k strings represents a gene cluster that occurs in each of the k genomes.

Given a collection of k strings S*:

Problem 1: Find all common CS-factors in S*.

Problem 2: For each common CS-factor find all its maximal CS-locations in each of the strings.

Algorithm "Connecting Intervals" (CI)

Algorithm CI solves Problem 1 and Problem 2 for two sequences

Input: Two sequences of length up to n with characters drawn from Σ = {1,...,m}, m ≤ 2n

Output: Pairs of CS-locations of all common CS-factors

Time & Space complexity: O(n²)

Preprocessing

Compute two tables for S1 = (3,1,2,3,1,5,2,6)

POS[c] holds all positions where character c occurs in S1.

POS[1] = 2,5
POS[2] = 3,7
POS[3] = 1,4
POS[4] = empty
POS[5] = 6
POS[6] = 8

NUM(i,j) counts the number of unique characters in S1[i,j].

NUM(i,j):
i\j  1 2 3 4 5 6 7 8
1    1 2 3 3 3 4 4 5
2      1 2 3 3 4 4 5
3        1 2 3 4 4 5
4          1 2 3 4 5
5            1 2 3 4
6              1 2 3
7                1 2
8                  1
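Both preprocessing tables can be built in a few lines. The sketch below assumes 1-based positions as on the slide; `build_pos` and `build_num` are illustrative names, and NUM is stored as a full (n+1) × (n+1) list with row and column 0 unused, which is simple rather than space-efficient.

```python
def build_pos(s1):
    """POS[c]: sorted 1-based positions at which character c occurs in S1."""
    pos = {}
    for p, c in enumerate(s1, start=1):
        pos.setdefault(c, []).append(p)
    return pos


def build_num(s1):
    """NUM[i][j]: number of distinct characters in S1[i..j] for 1 <= i <= j <= n."""
    n = len(s1)
    num = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        seen = set()
        for j in range(i, n + 1):
            seen.add(s1[j - 1])
            num[i][j] = len(seen)
    return num


S1 = [3, 1, 2, 3, 1, 5, 2, 6]
POS = build_pos(S1)
NUM = build_num(S1)
print(POS.get(4, []))  # character 4 does not occur in S1
print(NUM[1][1:])      # first row of the NUM table
```

Building NUM this way takes O(n²) time and space, which matches the quadratic bounds stated for Algorithm CI.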

Algorithm CI

Algorithm: While reading S2, mark in S1 the observed character and track maximal intervals of marked characters

index: 1 2 3 4 5 6 7 8 9
S2   : 4 3 5 5 5 1 4 2 2
S1   : 3 1 2 3 1 5 2 6

[Figure: pointers i and j on S2, shown together with the POS and NUM tables of S1]


Output: ((2,2)-(1,1)) ((2,2)-(4,4)) ((2,6)-(4,6))


Time Complexity

Algorithm CI finds all common CS-factors of S1 and S2 in O(n²) time.

for i = 1,...,|S2| do
    j = i
    while j ≤ |S2| and (i,j) is left-maximal do
        if (c = S2[j]) is seen the first time
            for each entry in POS[c] do
                mark and track
            end for
        end if
        j = j + 1
    end while
end for
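The loop structure above can be sketched in Python as follows. This is a simplified reading of the slides, not the authors' implementation: the name `connecting_intervals_sketch` is made up, the "mark and track" step is replaced by a plain rescan of S1 for maximal runs of marked positions, and the character count of each run is recomputed instead of being read from NUM. Because of the rescan, the sketch does not reach the O(n²) bound.

```python
def connecting_intervals_sketch(s1, s2):
    """Pairs ((i, j), (a, b)): a maximal CS-location (i, j) in S2 and a maximal
    CS-location (a, b) in S1 with the same character set (1-based indices)."""
    n1, n2 = len(s1), len(s2)
    pos = {}
    for p, c in enumerate(s1, start=1):
        pos.setdefault(c, []).append(p)

    pairs = []
    for i in range(1, n2 + 1):
        marked = [False] * (n1 + 2)  # slots 0 and n1 + 1 stay unmarked
        seen = set()                 # characters read in S2[i..j]
        for j in range(i, n2 + 1):
            c = s2[j - 1]
            if i > 1 and c == s2[i - 2]:
                break                # (i, j) is no longer left-maximal
            if c not in seen:        # c is seen for the first time
                seen.add(c)
                for p in pos.get(c, []):
                    marked[p] = True
            if j < n2 and s2[j] in seen:
                continue             # (i, j) is not right-maximal yet
            # Rescan S1 for maximal runs of marked positions.
            a = 1
            while a <= n1:
                if not marked[a]:
                    a += 1
                    continue
                b = a
                while marked[b + 1]:
                    b += 1
                # A run holds only characters from `seen`, so equal counts
                # mean equal character sets.
                if len(set(s1[a - 1:b])) == len(seen):
                    pairs.append(((i, j), (a, b)))
                a = b + 1
    return pairs


S1 = [3, 1, 2, 3, 1, 5, 2, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
for pair in connecting_intervals_sketch(S1, S2):
    print(pair)
```

With 1-based indices on both strings, the first three pairs printed for the slide example are ((2,2),(1,1)), ((2,2),(4,4)) and ((2,6),(4,6)); later start positions i add further pairs for smaller character sets.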

Multiple Genomes

Goal: Find all common CS-factors of a collection S* = (S1,S2,...,Sk)

Algorithm: Apply Algorithm CI to all pairs (S1,Sl), 2 ≤ l ≤ k
Output only the common CS-factors detected in all pairs

Time complexity: O(kn²)

Space complexity: O(kn²) with redundant output, O(n²) otherwise

Further extension: Find all common CS-factors appearing in at least k' of k strings of S*

Time complexity: O(k(1+k-k')n²)

Saving space: Due to the storage of the table NUM, Algorithm CI requires quadratic space.
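The filtering idea for multiple genomes, including the "at least k' of k strings" extension, can be illustrated with a small sketch. Note the substitution: instead of running Algorithm CI on the pairs (S1,Sl), it enumerates the character sets of all substrings of every string by brute force and counts in how many strings each set occurs. The names `char_sets`, `common_cs_factors` and the parameter `k_prime` are assumptions for illustration, and the stated time bounds do not apply to this sketch.

```python
from collections import Counter


def char_sets(s):
    """Character sets CS(S[i,j]) of all substrings of S (brute force)."""
    sets = set()
    for i in range(len(s)):
        current = set()
        for j in range(i, len(s)):
            current.add(s[j])
            sets.add(frozenset(current))
    return sets


def common_cs_factors(strings, k_prime=None):
    """Character sets that have a CS-location in at least k_prime strings.
    By default k_prime = k, so a set must occur in every string."""
    if k_prime is None:
        k_prime = len(strings)
    counts = Counter()
    for s in strings:
        counts.update(char_sets(s))  # each set counts once per string
    return {c for c, occurrences in counts.items() if occurrences >= k_prime}


S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5]
print(frozenset({1, 3, 5}) in common_cs_factors([S1, S2, S3]))
```

On the three example strings of the formal model, {1,3,5} is found in all of them, while {6} is only reported once the threshold is lowered to k' = 2.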

Assignment

Make a clustering algorithm.

Each sequence S has n unique genes, but the same gene can also appear in the other sequences. The number of sequences is k. The maximum output size for a cluster is m, so each cluster can have at most m genes. Do not consider the order of genes within a cluster.

[Figure: k sequences S1, S2, S3, ..., Sk, each with n genes]

ABDCBCDAADCBBCAD

Max. size for the cluster, m = 4

Output Example

EFFEEFFE


Gangman Yi

Email : [email protected]

THANK YOU