Upload: bioinformaticsinstitute

Post on 20-Feb-2017


A superglue for string comparison

Alexander Tiskin

Department of Computer Science, University of Warwick
http://go.warwick.ac.uk/alextiskin

Alexander Tiskin (Warwick) Semi-local LCS 1 / 164


1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 Sparse string comparison

7 Compressed string comparison

8 Parallel string comparison

9 The transposition network method

10 Beyond semi-locality


Introduction

String matching: finding an exact pattern in a string

String comparison: finding similar patterns in two strings

global: whole string a vs whole string b

semi-local: whole string a vs substrings in b (approximate matching); prefixes in a vs suffixes in b

local: substrings in a vs substrings in b

We focus on semi-local comparison:

fundamentally important for both global and local comparison

exciting mathematical properties


Introduction: Overview

Standard approach to string comparison: dynamic programming

Our approach: the algebra of seaweed matrices/braids

Can be used either for dynamic programming, or for divide-and-conquer

Divide-and-conquer is more efficient for:

comparing dynamic strings (truncation, concatenation)

comparing compressed strings (e.g. LZ-compression)

comparing strings in parallel

The “conquer” step involves a magic “superglue” (efficient multiplication of seaweed matrices/braids)


Introduction: Terminology and notation

The “squared paper” notation:

Integers: $\{\ldots, -2, -1, 0, 1, 2, \ldots\}$, with $x^- = x - \tfrac{1}{2}$ and $x^+ = x + \tfrac{1}{2}$

Half-integers: $\{\ldots, -\tfrac{3}{2}, -\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{3}{2}, \tfrac{5}{2}, \ldots\} = \{\ldots, (-2)^+, (-1)^+, 0^+, 1^+, 2^+, \ldots\}$

Planar dominance:

$(i, j) \ll (i', j')$ iff $i < i'$ and $j < j'$ (the “above-left” relation)

$(i, j) \gtrless (i', j')$ iff $i > i'$ and $j < j'$ (the “below-left” relation)

where “above/below” follow matrix convention (not the Cartesian one!)


Introduction: Terminology and notation

A permutation matrix is a 0/1 matrix with exactly one nonzero per row and per column

$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$


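Introduction: Terminology and notation

Given matrix $D$, its distribution matrix is made up of $\gtrless$-dominance sums: $D^\Sigma(i, j) = \sum_{\hat\imath > i,\ \hat\jmath < j} D(\hat\imath, \hat\jmath)$

Given matrix $E$, its density matrix is made up of quadrangle differences: $E^\square(\hat\imath, \hat\jmath) = E(\hat\imath^-, \hat\jmath^+) - E(\hat\imath^-, \hat\jmath^-) - E(\hat\imath^+, \hat\jmath^+) + E(\hat\imath^+, \hat\jmath^-)$

where $D^\Sigma$, $E$ are over integers; $D$, $E^\square$ are over half-integers

$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}^\Sigma = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}$

$\begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}^\square = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$

$(D^\Sigma)^\square = D$ for all $D$

Matrix $E$ is simple, if $(E^\square)^\Sigma = E$; equivalently, if it has all zeros in the left column and bottom row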
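The two definitions can be tried out on the 3 × 3 example from the slide. The following is only a sketch: the function names `distribution`, `density` and `is_simple` are hypothetical, and it assumes that array row `r` stands for the half-integer row $r + \tfrac{1}{2}$ (likewise for columns), so that "row > i" becomes `r >= i` and "column < j" becomes `c < j`.

```python
def distribution(D):
    """Distribution matrix: entry (i, j) sums D over array rows >= i and columns < j."""
    n_rows, n_cols = len(D), len(D[0])
    S = [[0] * (n_cols + 1) for _ in range(n_rows + 1)]
    # Fill from the bottom row upwards and from the left column rightwards,
    # using inclusion-exclusion on the three already computed neighbours.
    for i in range(n_rows - 1, -1, -1):
        for j in range(1, n_cols + 1):
            S[i][j] = S[i + 1][j] + S[i][j - 1] - S[i + 1][j - 1] + D[i][j - 1]
    return S


def density(E):
    """Density matrix: the quadrangle difference of E around every cell."""
    n_rows, n_cols = len(E) - 1, len(E[0]) - 1
    return [
        [E[r][c + 1] - E[r][c] - E[r + 1][c + 1] + E[r + 1][c] for c in range(n_cols)]
        for r in range(n_rows)
    ]


def is_simple(E):
    """A matrix is simple if its left column and bottom row are all zeros."""
    return all(row[0] == 0 for row in E) and all(v == 0 for v in E[-1])


P = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
P_sigma = distribution(P)
```

Here `P_sigma` reproduces the 4 × 4 matrix shown above, and `density(P_sigma)` gives back `P`, in line with $(D^\Sigma)^\square = D$.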


Introduction: Terminology and notation

Matrix $E$ is Monge, if $E^\square$ is nonnegative

Intuition: boundary-to-boundary distances in a (weighted) planar graph

G. Monge (1746–1818)

[Figure: planar graph G, with boundary vertices i and j each indexed 0 to 4]


blue + blue ≤ red + red
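In symbols, the colour inequality is the usual Monge condition. A short derivation sketch follows; it assumes that the "blue" pair is $E(i, j)$, $E(i', j')$ and the "red" pair is $E(i, j')$, $E(i', j)$ for integers $i \le i'$, $j \le j'$ (this colour assignment is not stated in the transcript). Summing the density over the half-integer points of the rectangle telescopes:

```latex
\sum_{i < \hat\imath < i'} \; \sum_{j < \hat\jmath < j'} E^\square(\hat\imath, \hat\jmath)
  = E(i, j') - E(i, j) - E(i', j') + E(i', j) \;\ge\; 0
\quad\Longrightarrow\quad
E(i, j) + E(i', j') \;\le\; E(i, j') + E(i', j)
```

So nonnegativity of every single quadrangle difference already gives the inequality for every rectangle.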


Introduction: Terminology and notation

Matrix $E$ is unit-Monge, if $E^\square$ is a permutation matrix

Intuition: boundary-to-boundary distances in a grid-like graph (in particular, an alignment graph for a pair of strings)

Simple unit-Monge matrix (sometimes called “rank function”): $P^\Sigma$, where $P$ is a permutation matrix

Seaweed matrix: $P$ used as an implicit representation of $P^\Sigma$

$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}^\Sigma = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}$


Introduction: Implicit unit-Monge matrices

Efficient $P^\Sigma$ queries: range tree on nonzeros of $P$ [Bentley: 1980]

binary search tree by i-coordinate

under every node, binary search tree by j-coordinate

[Figure: range tree over the nonzeros of P]


Introduction: Implicit unit-Monge matrices

Efficient $P^\Sigma$ queries: (contd.)

Every node of the range tree represents a canonical range (rectangular region), and stores its nonzero count

Overall, $\le n \log n$ canonical ranges are non-empty

A $P^\Sigma$ query is equivalent to $\gtrless$-dominance counting: how many nonzeros are $\gtrless$-dominated by query point?

Answer: sum up nonzero counts in $\le \log^2 n$ disjoint canonical ranges

Total size $O(n \log n)$, query time $O(\log^2 n)$

There are asymptotically more efficient (but less practical) data structures

Total size $O(n)$, query time $O\bigl(\frac{\log n}{\log\log n}\bigr)$ [JaJa+: 2004] [Chan, Patrascu: 2010]
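A dominance-counting structure with the same size and query bounds can be sketched as follows. This is an illustration, not the structure from the cited papers: it replaces the nested binary search trees by sorted lists (a "merge-sort tree" variant of the range tree), and the class name `DominanceCounter` and the input format (`perm[r]` is the column of the nonzero in row `r`) are assumptions.

```python
from bisect import bisect_left


class DominanceCounter:
    """Answers P^Sigma(i, j): the number of nonzeros in rows >= i with column < j."""

    def __init__(self, perm):
        # perm[r] = column of the nonzero in row r (hypothetical input format).
        self.n = len(perm)
        self.size = 1
        while self.size < max(self.n, 1):
            self.size *= 2
        # tree[v] holds the sorted columns of all rows covered by node v,
        # so the total size is O(n log n).
        self.tree = [[] for _ in range(2 * self.size)]
        for r, c in enumerate(perm):
            self.tree[self.size + r] = [c]
        for v in range(self.size - 1, 0, -1):
            self.tree[v] = sorted(self.tree[2 * v] + self.tree[2 * v + 1])

    def query(self, i, j):
        """Sum the counts over O(log n) canonical row ranges, each by binary search."""
        total = 0
        lo, hi = i + self.size, self.n + self.size  # half-open range of leaves
        while lo < hi:
            if lo & 1:
                total += bisect_left(self.tree[lo], j)
                lo += 1
            if hi & 1:
                hi -= 1
                total += bisect_left(self.tree[hi], j)
            lo //= 2
            hi //= 2
        return total


# The 3 x 3 example permutation matrix with rows (0 1 0), (1 0 0), (0 0 1).
counter = DominanceCounter([1, 0, 2])
```

Each query visits $O(\log n)$ nodes and does one binary search per node, which gives the $O(\log^2 n)$ query time quoted above.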


Matrix distance multiplication: Seaweed braids

Distance semiring (or (min,+)-semiring, special case of tropical algebra):

addition $\oplus$ given by $\min$, multiplication $\odot$ given by $+$

Matrix $\odot$-multiplication

$A \odot B = C$, where $C(i, k) = \bigoplus_j \bigl(A(i, j) \odot B(j, k)\bigr) = \min_j \bigl(A(i, j) + B(j, k)\bigr)$

Intuition: gluing distances in a composition of planar graphs

[Figure: planar graphs G′ and G″ glued along a common boundary j, between boundaries i and k]


Matrix distance multiplication: Seaweed braids

Matrix classes closed under $\odot$-multiplication (for given $n$):

general (integer, real) matrices ∼ general weighted graphs

Monge matrices ∼ planar weighted graphs

simple unit-Monge matrices ∼ grid-like graphs

Implicit $\odot$-multiplication: define $P \boxdot Q = R$ as $P^\Sigma \odot Q^\Sigma = R^\Sigma$

The seaweed monoid $T_n$:

simple unit-Monge matrices under $\odot$

equivalently, permutation (seaweed) matrices under $\boxdot$

Isomorphic to the 0-Hecke monoid of the symmetric group $H_0(S_n)$
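The implicit product can be computed directly from its definition, which is useful as a reference when testing faster methods. The sketch below is a naive cubic-time version, not an efficient algorithm; the function names are hypothetical, and array row `r` is assumed to stand for the half-integer row $r + \tfrac{1}{2}$.

```python
def distribution(P):
    """P^Sigma: entry (i, j) counts nonzeros in array rows >= i and columns < j."""
    n = len(P)
    S = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(1, n + 1):
            S[i][j] = S[i + 1][j] + S[i][j - 1] - S[i + 1][j - 1] + P[i][j - 1]
    return S


def density(E):
    """Quadrangle differences of E: recovers the permutation matrix from its distribution."""
    n = len(E) - 1
    return [
        [E[r][c + 1] - E[r][c] - E[r + 1][c + 1] + E[r + 1][c] for c in range(n)]
        for r in range(n)
    ]


def min_plus(A, B):
    """Distance (min,+) product of two square matrices of equal size."""
    m = len(A)
    return [[min(A[i][j] + B[j][k] for j in range(m)) for k in range(m)] for i in range(m)]


def seaweed_product(P, Q):
    """Implicit product: multiply the distribution matrices, then take the density."""
    return density(min_plus(distribution(P), distribution(Q)))


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
zero = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]   # anti-diagonal permutation matrix
g1 = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]     # one elementary crossing
```

On these inputs, `seaweed_product(g1, g1)` returns `g1` again (a crossing applied twice stays a single crossing), `identity` leaves its argument unchanged, and `zero` absorbs any factor.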


Matrix distance multiplication: Seaweed braids

$P \boxdot Q = R$ can be seen as combing of seaweed braids

[Figure: seaweed braids P and Q, composed and combed to give R]


Matrix distance multiplication: Seaweed braids

The seaweed monoid $T_n$:

$n!$ elements (permutations of size $n$)

$n - 1$ generators $g_1, g_2, \ldots, g_{n-1}$ (elementary crossings)

Idempotence: $g_i^2 = g_i$ for all $i$

Far commutativity: $g_i g_j = g_j g_i$ for $j - i > 1$

Braid relations: $g_i g_j g_i = g_j g_i g_j$ for $j - i = 1$
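The three relations can be checked mechanically on a concrete model of the 0-Hecke monoid. The sketch below assumes one standard model, not taken from the slides: a generator acts on a permutation in one-line notation by crossing two adjacent entries unless they are already crossed. Generators are 0-based here, so `g_1` of the slide is index `0`; all names are hypothetical.

```python
from itertools import permutations


def apply_generator(w, i):
    """Cross the entries at positions i and i + 1, unless they are already crossed."""
    if w[i] < w[i + 1]:
        w = list(w)
        w[i], w[i + 1] = w[i + 1], w[i]
        return tuple(w)
    return w


def apply_word(w, word):
    """Apply a sequence of generator indices to permutation w, left to right."""
    for i in word:
        w = apply_generator(w, i)
    return w


def reachable(n):
    """All permutations reachable from the identity by applying generators."""
    start = tuple(range(n))
    seen, frontier = {start}, [start]
    while frontier:
        w = frontier.pop()
        for i in range(n - 1):
            v = apply_generator(w, i)
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen


def relations_hold(n):
    """Check idempotence, far commutativity and braid relations on every permutation."""
    for w in permutations(range(n)):
        for i in range(n - 1):
            if apply_word(w, [i, i]) != apply_word(w, [i]):
                return False
            for j in range(i + 1, n - 1):
                if j - i > 1 and apply_word(w, [i, j]) != apply_word(w, [j, i]):
                    return False
                if j - i == 1 and apply_word(w, [i, j, i]) != apply_word(w, [j, i, j]):
                    return False
    return True
```

In this model `len(reachable(n))` equals $n!$, and the fully reversed permutation plays the role of the zero element: no generator changes it.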


Matrix distance multiplication: Seaweed braids

Special elements of the seaweed monoid $T_n$

Identity: $1 \boxdot x = x$

1 =
• · · ·
· • · ·
· · • ·
· · · •

Zero: $0 \boxdot x = 0$

0 =
· · · •
· · • ·
· • · ·
• · · ·


Matrix distance multiplication: Seaweed braids

Seaweed monoid: $g_i^2 = g_i$; far comm; braid

Related structures:

classical braid group: $g_i g_i^{-1} = 1$; far comm; braid

$S_n$ (Coxeter presentation): $g_i^2 = 1$; far comm; braid

positive braid monoid: far comm; braid

locally free idempotent monoid: $g_i^2 = g_i$; far comm

nil-Hecke monoid: $g_i^2 = 0$; far comm; braid

Generalisations:

0-Hecke monoid of a Coxeter group

Hecke–Kiselman monoid

$\mathcal{J}$-trivial monoid


Matrix distance multiplication: Seaweed braids

Computation in the seaweed monoid: a confluent rewriting system can be obtained by software (Semigroupe, GAP, Sage)

$T_3$: 1, a = $g_1$, b = $g_2$; ab, ba; aba = 0

aa → a    bb → b    bab → 0    aba → 0

$T_4$: 1, a = $g_1$, b = $g_2$, c = $g_3$; ab, ac, ba, bc, cb, aba, abc, acb, bac, bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba; abacba = 0

aa → a          bb → b
ca → ac         cc → c
bab → aba       cbc → bcb
cbac → bcba     abacba → 0

Easy to use as a “black box”, but not efficient


Matrix distance multiplication: Seaweed matrix multiplication

Seaweed matrix $\boxdot$-multiplication

Given permutation matrices $P$, $Q$, compute $R$, such that $P^\Sigma \odot Q^\Sigma = R^\Sigma$ (equivalently, $P \boxdot Q = R$)

Seaweed matrix $\boxdot$-multiplication: running time

type                   time                                                  reference
general                $O(n^3)$                                              standard
                       $O\bigl(\frac{n^3 (\log\log n)^3}{\log^2 n}\bigr)$    [Chan: 2007]
Monge                  $O(n^2)$                                              via [Aggarwal+: 1987]
implicit unit-Monge    $O(n^{1.5})$                                          [T: 2006]
                       $O(n \log n)$                                         [T: 2010]


Matrix distance multiplication: Seaweed matrix multiplication

[Figure sequence: computing R from P and Q. P and Q are split into $P_{lo}$, $P_{hi}$ and $Q_{lo}$, $Q_{hi}$; the partial results $R_{lo} + R_{hi}$ are then combined into R]


Matrix distance multiplication: Seaweed matrix multiplication

Seaweed matrix $\boxdot$-multiplication: Steady Ant algorithm

$R^\Sigma(i, k) = \min_j \bigl(P^\Sigma(i, j) + Q^\Sigma(j, k)\bigr)$

Divide-and-conquer on the range of $j$: two subproblems of size $n/2$

$P^\Sigma_{lo} \odot Q^\Sigma_{lo} = R^\Sigma_{lo}$ and $P^\Sigma_{hi} \odot Q^\Sigma_{hi} = R^\Sigma_{hi}$

Conquer: tracing a balanced trail from bottom-left to top-right of $R$

Trail invariant: equal number of $\ll$-dominated nonzeros in $P_{C,hi}$ and $\gg$-dominating nonzeros in $P_{C,lo}$

All such nonzeros are “errors” in the output, must be removed

Must compensate by placing nonzeros on the path, time $O(n)$

Overall time $O(n \log n)$
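The overall bound follows from the standard divide-and-conquer recurrence. As a sketch, assuming that splitting the input into the two half-size subproblems also takes linear time (the slide states only the $O(n)$ cost of the conquer step), and writing $T(n)$ for the running time and $c$ for a hypothetical constant:

```latex
T(n) = 2\,T(n/2) + c\,n, \qquad T(1) = O(1)
\quad\Longrightarrow\quad
T(n) = c\,n \log_2 n + O(n) = O(n \log n)
```

Each of the $\log_2 n$ recursion levels contributes $c\,n$ in total.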


Matrix distance multiplication: Bruhat order

Bruhat order

Permutation $A$ is lower (“more sorted”) than permutation $B$ in the Bruhat order ($A \preceq B$), if $B \leadsto A$ by successive pairwise sorting (equivalently, $A \leadsto B$ by anti-sorting) of arbitrary pairs

Permutation matrices: $P \preceq Q$, if $Q \leadsto P$ by successive 2 × 2 submatrix sorting: $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \to \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$

Plays an important role in group theory and algebraic geometry (inclusion order of Schubert varieties)

Describes pivoting order in Gaussian elimination (matrix Bruhat decomposition)


Matrix distance multiplication: Bruhat order

Bruhat comparability: running time

$O(n^2)$                                          [Ehresmann: 1934; Proctor: 1982; Grigoriev: 1982]
$O(n \log n)$                                     [T: 2013]
$O\bigl(\frac{n \log n}{\log\log n}\bigr)$        [Gawrychowski: NEW]


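Matrix distance multiplication: Bruhat order

Ehresmann’s criterion (dot criterion, related to tableau criterion)

$P \preceq Q$ iff $P^\Sigma \le Q^\Sigma$ elementwise

$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}^\Sigma = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix} \le \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 2 & 2 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}^\Sigma$

$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}^\Sigma = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix} \not\le \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}^\Sigma$

Time $O(n^2)$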
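The criterion translates almost literally into code. A minimal quadratic-time sketch, with hypothetical function names and the convention that array row `r` stands for the half-integer row $r + \tfrac{1}{2}$:

```python
def distribution(P):
    """P^Sigma: entry (i, j) counts nonzeros in array rows >= i and columns < j."""
    n = len(P)
    S = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(1, n + 1):
            S[i][j] = S[i + 1][j] + S[i][j - 1] - S[i + 1][j - 1] + P[i][j - 1]
    return S


def bruhat_leq(P, Q):
    """Ehresmann's criterion: compare the two distribution matrices elementwise."""
    SP, SQ = distribution(P), distribution(Q)
    return all(x <= y for row_p, row_q in zip(SP, SQ) for x, y in zip(row_p, row_q))


# The matrices from the two examples above.
P = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
Q_comparable = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
Q_incomparable = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

With these inputs, `bruhat_leq(P, Q_comparable)` holds and `bruhat_leq(P, Q_incomparable)` does not, matching the two displayed comparisons.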


Matrix distance multiplication: Bruhat order

Seaweed criterion

$P \preceq Q$ iff $P^R \boxdot Q = \mathrm{Id}^R$, where $P^R$ = counterclockwise rotation of $P$

Intuition: permutations represented by seaweed braids

$P \preceq Q$, iff

no pair of seaweeds is crossed in $P$, while the “corresponding” pair is uncrossed in $Q$

equivalently, no pair is uncrossed in $P^R$, while the “corresponding” pair is uncrossed in $Q$

equivalently, $P^R \boxdot Q = \mathrm{Id}^R$

Time $O(n \log n)$ by seaweed matrix multiplication
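The seaweed criterion can be checked on small inputs with a naive product. The sketch below is for illustration only: it uses a cubic-time product computed from the definition rather than the $O(n \log n)$ multiplication, so it does not achieve the stated running time, and all function names are hypothetical.

```python
def distribution(P):
    """P^Sigma: entry (i, j) counts nonzeros in array rows >= i and columns < j."""
    n = len(P)
    S = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(1, n + 1):
            S[i][j] = S[i + 1][j] + S[i][j - 1] - S[i + 1][j - 1] + P[i][j - 1]
    return S


def density(E):
    """Quadrangle differences: recovers a permutation matrix from its distribution."""
    n = len(E) - 1
    return [
        [E[r][c + 1] - E[r][c] - E[r + 1][c + 1] + E[r + 1][c] for c in range(n)]
        for r in range(n)
    ]


def seaweed_product(P, Q):
    """Naive implicit product: (min,+)-multiply the distributions, take the density."""
    A, B = distribution(P), distribution(Q)
    m = len(A)
    C = [[min(A[i][j] + B[j][k] for j in range(m)) for k in range(m)] for i in range(m)]
    return density(C)


def rotate_ccw(P):
    """Rotate a square matrix counterclockwise by 90 degrees."""
    n = len(P)
    return [[P[j][n - 1 - i] for j in range(n)] for i in range(n)]


def bruhat_leq_seaweed(P, Q):
    """Seaweed criterion: P is below Q iff rotated P times Q is the rotated identity."""
    n = len(P)
    identity = [[int(i == j) for j in range(n)] for i in range(n)]
    return seaweed_product(rotate_ccw(P), Q) == rotate_ccw(identity)


P = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
Q_comparable = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
Q_incomparable = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

For `P` and `Q_comparable` the product is the anti-diagonal matrix, so the criterion holds; for `Q_incomparable` the product is `rotate_ccw(P)` itself, so it fails.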


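Matrix distance multiplication: Bruhat order

$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \preceq \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$

$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}^R \boxdot \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \boxdot \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix} = \mathrm{Id}^R$

[Figure: seaweed braids of P and Q; the braid of $P^R$ composed with Q combs to $\mathrm{Id}^R$]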


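Matrix distance multiplication: Bruhat order

$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \not\preceq \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$

$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}^R \boxdot \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \boxdot \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \ne \mathrm{Id}^R$

[Figure: seaweed braids of P and Q; the braid of $P^R$ composed with Q does not comb to $\mathrm{Id}^R$]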


Matrix distance multiplication: Bruhat order

Alternative solution: clever implementation of Ehresmann’s criterion [Gawrychowski: 2013]

The online partial sums problem: maintain array $X[1 : n]$, subject to

update($k$, $\Delta$): $X[k] \leftarrow X[k] + \Delta$

prefixsum($k$): return $\sum_{1 \le i \le k} X[i]$

Query time:

$\Theta(\log n)$ in semigroup or group model

$\Theta\bigl(\frac{\log n}{\log\log n}\bigr)$ in RAM model on integers [Patrascu, Demaine: 2004]

Gives Bruhat comparability in time $O\bigl(\frac{n \log n}{\log\log n}\bigr)$ in RAM model

Open problem: seaweed multiplication in time $O\bigl(\frac{n \log n}{\log\log n}\bigr)$?
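For reference, the two operations of the online partial sums problem can be supported in $O(\log n)$ time each by a Fenwick tree (binary indexed tree). This is the classical structure matching the $\Theta(\log n)$ bound, not the faster RAM-model structure behind the $\frac{\log n}{\log\log n}$ bound; the class name `PartialSums` is hypothetical, while the method names follow the slide.

```python
class PartialSums:
    """Fenwick tree over X[1..n], all entries initially zero (1-based indices)."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)

    def update(self, k, delta):
        """X[k] <- X[k] + delta, touching O(log n) tree nodes."""
        while k <= self.n:
            self.tree[k] += delta
            k += k & -k  # move to the next node whose range covers index k

    def prefixsum(self, k):
        """Return X[1] + ... + X[k], summing O(log n) tree nodes."""
        total = 0
        while k > 0:
            total += self.tree[k]
            k -= k & -k  # strip the lowest set bit
        return total


sums = PartialSums(8)
sums.update(3, 5)
sums.update(5, 2)
sums.update(3, -1)
```

After these three updates, $X[3] = 4$ and $X[5] = 2$, so `sums.prefixsum(4)` is 4 and `sums.prefixsum(8)` is 6.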


Semi-local string comparison: Semi-local LCS and edit distance

Consider strings (= sequences) over an alphabet of size σ

Contiguous substrings vs not necessarily contiguous subsequences

Special cases of substring: prefix, suffix

Notation: strings a, b of length m, n respectively

Assume where necessary: m ≤ n; m, n reasonably close

The longest common subsequence (LCS) score:

length of longest string that is a subsequence of both a and b

equivalently, alignment score, where score(match) = 1 and score(mismatch) = 0

In biological terms, “loss-free alignment” (unlike efficient but “lossy” BLAST)


Semi-local string comparison: Semi-local LCS and edit distance

The LCS problem

Give the LCS score for a vs b

LCS: running time

$O(mn)$                                                   [Wagner, Fischer: 1974]
$O\bigl(\frac{mn}{\log^2 n}\bigr)$, $\sigma = O(1)$       [Masek, Paterson: 1980] [Crochemore+: 2003]
$O\bigl(\frac{mn (\log\log n)^2}{\log^2 n}\bigr)$         [Paterson, Dančík: 1994] [Bille, Farach-Colton: 2008]

Running time varies depending on the RAM model version

We assume word-RAM with word size $\log n$ (where it matters)


Semi-local string comparison: Semi-local LCS and edit distance

LCS computation by dynamic programming

$lcs(a, \emptyset) = 0$

$lcs(\emptyset, b) = 0$

$lcs(a\alpha, b\beta) = \begin{cases} \max(lcs(a\alpha, b), lcs(a, b\beta)) & \text{if } \alpha \ne \beta \\ lcs(a, b) + 1 & \text{if } \alpha = \beta \end{cases}$

    ∗  D  E  F  I  N  E
∗   0  0  0  0  0  0  0
D   0  1  1  1  1  1  1
E   0  1  2  2  2  2  2
S   0  1  2  2  2  2  2
I   0  1  2  2  3  3  3
G   0  1  2  2  3  3  3
N   0  1  2  2  3  4  4

lcs(“DEFINE”, “DESIGN”) = 4

LCS(a, b) can be traced back through the dynamic programming table at no extra asymptotic time cost
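The recurrence and the traceback can be sketched in a few lines. The function names `lcs_table` and `lcs_traceback` are hypothetical; the table is indexed as on the slide, with the first string along the rows and the second along the columns.

```python
def lcs_table(a, b):
    """T[i][j] is the LCS score of the prefixes a[:i] and b[:j]."""
    m, n = len(a), len(b)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                T[i][j] = T[i - 1][j - 1] + 1
            else:
                T[i][j] = max(T[i - 1][j], T[i][j - 1])
    return T


def lcs_traceback(a, b, T):
    """Walk back from the bottom-right cell to recover one longest common subsequence."""
    i, j = len(a), len(b)
    out = []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])  # matched character: step diagonally
            i -= 1
            j -= 1
        elif T[i - 1][j] >= T[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


table = lcs_table("DESIGN", "DEFINE")
score = table[-1][-1]
one_lcs = lcs_traceback("DESIGN", "DEFINE", table)
```

The table takes $O(mn)$ time to fill, and the traceback adds only $O(m + n)$ steps, which is why it comes at no extra asymptotic cost.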


Page 76: A superglue for string comparison

Semi-local string comparison: Semi-local LCS and edit distance

LCS computation by dynamic programming

lcs(a, ∅) = 0
lcs(∅, b) = 0
lcs(aα, bβ) = max(lcs(aα, b), lcs(a, bβ))   if α ≠ β
lcs(aα, bβ) = lcs(a, b) + 1                 if α = β

    ∗  D  E  F  I  N  E
∗   0  0  0  0  0  0  0
D   0  1  1  1  1  1  1
E   0  1  2  2  2  2  2
S   0  1  2  2  2  2  2
I   0  1  2  2  3  3  3
G   0  1  2  2  3  3  3
N   0  1  2  2  3  4  4

lcs(“DEFINE”, “DESIGN”) = 4

LCS(a, b) can be traced back through the dynamic programming table at no extra asymptotic time cost

Alexander Tiskin (Warwick) Semi-local LCS 38 / 164
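
The recurrence and the traceback remark above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `lcs_table` and `lcs_traceback` are made up for this example and do not come from the slides.

```python
def lcs_table(a, b):
    """table[i][j] = lcs(a[:i], b[:j]), filled row by row."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                # last characters match: extend the LCS of both prefixes
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # last characters differ: drop one of them
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def lcs_traceback(a, b):
    """Recover one LCS by walking back from the bottom-right corner."""
    table = lcs_table(a, b)
    i, j = len(a), len(b)
    out = []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs_table("DESIGN", "DEFINE")[-1][-1])   # 4
print(lcs_traceback("DESIGN", "DEFINE"))       # DEIN
```

The traceback visits at most m + n cells, so it adds nothing to the O(mn) cost of filling the table.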

Page 79: A superglue for string comparison

Semi-local string comparison: Semi-local LCS and edit distance

LCS on the alignment graph: directed grid of match and mismatch cells

[Figure: alignment graph with rows labelled by a = “BAABCBCA” and columns labelled by b = “BAABCABCABACA”; blue edges score 0, red edges score 1]

score(“BAABCBCA”, “BAABCABCABACA”) = len(“BAABCBCA”) = 8

LCS = highest-score path from top-left to bottom-right

Alexander Tiskin (Warwick) Semi-local LCS 39 / 164

Page 80: A superglue for string comparison

Semi-local string comparison: Semi-local LCS and edit distance

LCS: dynamic programming [WF: 1974]

Sweep cells in any ≪-compatible order

Cell update: time O(1)

Overall time O(mn)

Alexander Tiskin (Warwick) Semi-local LCS 40 / 164

Page 81: A superglue for string comparison

Semi-local string comparison: Semi-local LCS and edit distance

LCS: micro-block dynamic programming [MP: 1980; BF: 2008]

Sweep cells in micro-blocks, in any ≪-compatible order

Micro-block size:

t = O(log n) when σ = O(1)

t = O(log n / log log n) otherwise

Micro-block interface:

O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits

O(t) small integers, each O(1) bits

Micro-block update: time O(1), by precomputing all possible interfaces

Overall time O(mn / log² n) when σ = O(1), O(mn (log log n)² / log² n) otherwise

Alexander Tiskin (Warwick) Semi-local LCS 41 / 164

Page 82: A superglue for string comparison

Semi-local string comparison: Semi-local LCS and edit distance

‘Begin at the beginning,’ the King said gravely, ‘and go on till you come to the end: then stop.’

L. Carroll, Alice in Wonderland

Dynamic programming: begins at empty strings, proceeds by appending characters, then stops

What about:

prepending/deleting characters (dynamic LCS)

concatenating strings (LCS on compressed strings; parallel LCS)

taking substrings (= local alignment)

Alexander Tiskin (Warwick) Semi-local LCS 42 / 164

Page 84: A superglue for string comparison

Semi-local string comparison: Semi-local LCS and edit distance

Dynamic programming from both ends: better by ×2, but still not good enough

Is dynamic programming strictly necessary to solve sequence alignment problems?

Eppstein+, Efficient algorithms for sequence analysis, 1991

Alexander Tiskin (Warwick) Semi-local LCS 43 / 164

Page 85: A superglue for string comparison

Semi-local string comparison: Semi-local LCS and edit distance

The semi-local LCS problem

Give the (implicit) matrix of O((m + n)²) LCS scores:

string-substring LCS: string a vs every substring of b

prefix-suffix LCS: every prefix of a vs every suffix of b

suffix-prefix LCS: every suffix of a vs every prefix of b

substring-string LCS: every substring of a vs string b

Alexander Tiskin (Warwick) Semi-local LCS 44 / 164

Page 87: A superglue for string comparison

Semi-local string comparison: Semi-local LCS and edit distance

Semi-local LCS on the alignment graph

[Figure: alignment graph with rows labelled by a = “BAABCBCA” and columns labelled by b = “BAABCABCABACA”, with the substring “CABCABA” of b highlighted; blue edges score 0, red edges score 1]

score(“BAABCBCA”, “CABCABA”) = len(“ABCBA”) = 5

String-substring LCS: all highest-score top-to-bottom paths

Semi-local LCS: all highest-score boundary-to-boundary paths

Alexander Tiskin (Warwick) Semi-local LCS 45 / 164

Page 88: A superglue for string comparison

Semi-local string comparison: Score matrices and seaweed matrices

The score matrix H (rows i = 0, …, 13; columns j = 0, …, 13)

  0   1   2   3   4   5   6   6   7   8   8   8   8   8
 -1   0   1   2   3   4   5   5   6   7   7   7   7   7
 -2  -1   0   1   2   3   4   4   5   6   6   6   6   7
 -3  -2  -1   0   1   2   3   3   4   5   5   6   6   7
 -4  -3  -2  -1   0   1   2   2   3   4   4   5   5   6
 -5  -4  -3  -2  -1   0   1   2   3   4   4   5   5   6
 -6  -5  -4  -3  -2  -1   0   1   2   3   3   4   4   5
 -7  -6  -5  -4  -3  -2  -1   0   1   2   2   3   3   4
 -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   3   4
 -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4
-10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3
-11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2
-12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1
-13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0

a = “BAABCBCA”

b = “BAABCABCABACA”

H(i, j) = score(a, b〈i : j〉)

H(4, 11) = 5

H(i, j) = j − i if i > j

Alexander Tiskin (Warwick) Semi-local LCS 46 / 164
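
As a sanity check on the definition, the score matrix H can be filled by brute force, running an ordinary LCS computation for every substring of b. The sketch below does exactly that; it is far slower than the methods discussed in these slides, and the names `lcs` and `score_matrix` are illustrative only.

```python
def lcs(a, b):
    """Plain LCS length by dynamic programming, keeping one row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score_matrix(a, b):
    """H[i][j] = lcs(a, b[i:j]) for i <= j, and j - i for i > j."""
    n = len(b)
    return [[lcs(a, b[i:j]) if i <= j else j - i for j in range(n + 1)]
            for i in range(n + 1)]


H = score_matrix("BAABCBCA", "BAABCABCABACA")
print(H[4][11])   # 5, the LCS score of a against b<4:11> = "CABCABA"
print(H[0])       # [0, 1, 2, 3, 4, 5, 6, 6, 7, 8, 8, 8, 8, 8]
```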

Page 89: A superglue for string comparison

Semi-local string comparison: Score matrices and seaweed matrices

Semi-local LCS: output representation and running time

size            query time
O(n²)           O(1)          trivial
O(m^(1/2) n)    O(log n)      string-substring [Alves+: 2003]
O(n)            O(n)          string-substring [Alves+: 2005]
O(n log n)      O(log² n)     [T: 2006]

. . . or any 2D orthogonal range counting data structure

running time
O(mn²)                         naive
O(mn)                          string-substring [Schmidt: 1998; Alves+: 2005]
O(mn)                          [T: 2006]
O(mn / log^0.5 n)              [T: 2006]
O(mn (log log n)² / log² n)    [T: 2007]

Alexander Tiskin (Warwick) Semi-local LCS 47 / 164

Page 90: A superglue for string comparison

Semi-local string comparison: Score matrices and seaweed matrices

The score matrix H and the seaweed matrix P

H(i, j): the number of matched characters for a vs substring b〈i : j〉

j − i − H(i, j): the number of unmatched characters

Properties of matrix j − i − H(i, j):

simple unit-Monge

therefore, j − i − H(i, j) = PΣ(i, j), where P = −H□ is a permutation matrix

P is the seaweed matrix, giving an implicit representation of H

Range tree for P: memory O(n log n), query time O(log² n)

Alexander Tiskin (Warwick) Semi-local LCS 48 / 164
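
The relation between H and P can be checked numerically on the running example. The sketch below makes two assumptions that are not spelled out on the slide: the cross-difference at cell (i, j) is taken as H(i, j+1) − H(i, j) − H(i+1, j+1) + H(i+1, j), and PΣ(i, j) is taken as the number of nonzeros of P strictly below row boundary i and strictly left of column boundary j. Only the part of P with both indices inside 0..n is computed, so the result is a sub-permutation matrix rather than a full permutation. All function names are illustrative.

```python
def lcs(a, b):
    """Plain LCS length by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score_matrix(a, b):
    """Brute-force H[i][j] = lcs(a, b[i:j]) for i <= j, and j - i otherwise."""
    n = len(b)
    return [[lcs(a, b[i:j]) if i <= j else j - i for j in range(n + 1)]
            for i in range(n + 1)]


def seaweed_matrix(H):
    """P = minus the cross-difference of H; P[i][j] sits between rows i, i+1 and columns j, j+1."""
    n = len(H) - 1
    return [[-(H[i][j + 1] - H[i][j] - H[i + 1][j + 1] + H[i + 1][j])
             for j in range(n)] for i in range(n)]


def dominance_sum(P, i, j):
    """Number of nonzeros of P in rows i, i+1, ... and columns 0, ..., j-1."""
    return sum(P[r][c] for r in range(i, len(P)) for c in range(j))


H = score_matrix("BAABCBCA", "BAABCABCABACA")
P = seaweed_matrix(H)
print(dominance_sum(P, 4, 11))            # 2
print(11 - 4 - dominance_sum(P, 4, 11))   # 5 = H(4, 11)
```

A naive dominance count like this costs O(n²) per query; the range tree mentioned on the slide is what brings the query time down to O(log² n).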

Page 91: A superglue for string comparison

Semi-local string comparison: Score matrices and seaweed matrices

The score matrix H and the seaweed matrix P

[Figure: the score matrix H for a = “BAABCBCA”, b = “BAABCABCABACA” (the same 14 × 14 matrix as on the slide “The score matrix H”), with the boundaries between adjacent entries coloured and the nonzeros of P marked]

a = “BAABCBCA”

b = “BAABCABCABACA”

H(i, j) = score(a, b〈i : j〉)

H(4, 11) = 5

H(i, j) = j − i if i > j

blue: difference in H is 0

red: difference in H is 1

green: P(i, j) = 1

H(i, j) = j − i − PΣ(i, j)

Alexander Tiskin (Warwick) Semi-local LCS 49 / 164

Page 94: A superglue for string comparison

Semi-local string comparison: Score matrices and seaweed matrices

The score matrix H and the seaweed matrix P

[Figure: the score matrix H for a = “BAABCBCA”, b = “BAABCABCABACA” (the same 14 × 14 matrix as on the slide “The score matrix H”), with the nonzeros of P counted by PΣ(4, 11) highlighted]

a = “BAABCBCA”

b = “BAABCABCABACA”

H(4, 11) = 11 − 4 − PΣ(4, 11) = 11 − 4 − 2 = 5

Alexander Tiskin (Warwick) Semi-local LCS 50 / 164

Page 95: A superglue for string comparison

Semi-local string comparison: Score matrices and seaweed matrices

The (combed) seaweed braid in the alignment graph

[Figure: combed seaweed braid drawn over the alignment graph, with rows labelled by a and columns labelled by b, and the substring “CABCABA” of b highlighted]

a = “BAABCBCA”

b = “BAABCABCABACA”

H(4, 11) = 11 − 4 − PΣ(4, 11) = 11 − 4 − 2 = 5

P(i, j) = 1 corresponds to a seaweed running from top i to bottom j

Also define seaweeds top → right, left → right, left → bottom

Represent implicitly semi-local LCS for each prefix of a vs b

Alexander Tiskin (Warwick) Semi-local LCS 51 / 164

Page 97: A superglue for string comparison

Semi-local string comparison: Score matrices and seaweed matrices

Seaweed braid: a highly symmetric object (element of H₀(Sₙ))

Can be built by assembling subbraids: divide-and-conquer

Flexible approach to local alignment, compressed approximate matching, parallel computation, . . .

Alexander Tiskin (Warwick) Semi-local LCS 52 / 164

Page 98: A superglue for string comparison

Semi-local string comparison: Weighted alignment

The LCS problem is a special case of the weighted alignment problem

Scoring scheme: match score wM, mismatch score wX, gap score wG

LCS score: wM = 1, wX = wG = 0

Levenshtein score: wM = 2, wX = 1, wG = 0

A scoring scheme is rational if wM, wX, wG are rational numbers

Edit distance: minimum cost to transform a into b by weighted character edits (insertion, deletion, substitution)

Corresponds to a scoring scheme with wM = 0:

insertion/deletion cost −wG

substitution cost −wX

Alexander Tiskin (Warwick) Semi-local LCS 53 / 164
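
A minimal sketch of global weighted alignment under such a scoring scheme follows. It assumes that the gap score is charged once per unaligned character; the function name `alignment_score` and its parameter names are illustrative and not taken from the slides.

```python
def alignment_score(a, b, w_match, w_mismatch, w_gap):
    """Maximum alignment score of a vs b; w_gap is charged per unaligned character."""
    n = len(b)
    prev = [j * w_gap for j in range(n + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * w_gap]
        for j in range(1, n + 1):
            pair = w_match if a[i - 1] == b[j - 1] else w_mismatch
            cur.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                           prev[j] + w_gap,      # leave a[i-1] unaligned
                           cur[j - 1] + w_gap))  # leave b[j-1] unaligned
        prev = cur
    return prev[n]


# LCS score: wM = 1, wX = wG = 0
print(alignment_score("DEFINE", "DESIGN", 1, 0, 0))          # 4
# Levenshtein score: wM = 2, wX = 1, wG = 0
print(alignment_score("BAABCBCA", "CABCABA", 2, 1, 0))       # 11
# Unit-cost edit distance as a scoring scheme with wM = 0, wX = wG = -1
print(-alignment_score("kitten", "sitting", 0, -1, -1))      # 3
```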

Page 101: A superglue for string comparison

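Semi-local string comparison: Weighted alignment

Weighted alignment graph for a, b

[Figure: weighted alignment graph with rows labelled by a = “BAABCBCA” and columns labelled by b = “BAABCABCABACA”, with the substring “CABCABA” of b highlighted; blue edges = 0, solid red edges = 2, dotted red edges = 1]

Levenshtein(“BAABCBCA”, “CABCABA”) = 11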

Alexander Tiskin (Warwick) Semi-local LCS 54 / 164

Page 102: A superglue for string comparison

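Semi-local string comparison: Weighted alignment

Reduction: ordinary alignment graph for blown-up a, b

[Figure: ordinary alignment graph for the blown-up strings, with rows labelled $B $A $A $B $C $B $C $A and columns labelled $B $A $A $B $C $A $B $C $A $B $A $C $A, with the substring $C $A $B $C $A $B $A highlighted; blue = 0, red = 1 or 2]

Levenshtein(“BAABCBCA”, “CABCABA”) = lcs(“$B$A$A$B$C$B$C$A”, “$C$A$B$C$A$B$A”) = 11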

Alexander Tiskin (Warwick) Semi-local LCS 55 / 164
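
The blow-up reduction is easy to try out: prefix every character with a padding symbol and run a plain LCS on the result. In the sketch below, `blow_up` and `lcs` are illustrative names, and the padding symbol "$" is assumed not to occur in either input string.

```python
def lcs(a, b):
    """Plain LCS length by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def blow_up(s, pad="$"):
    """Replace every character x of s by pad + x (pad must not occur in s)."""
    return "".join(pad + ch for ch in s)


print(blow_up("BAABCBCA"))                              # $B$A$A$B$C$B$C$A
print(lcs(blow_up("BAABCBCA"), blow_up("CABCABA")))     # 11
```

Aligning two equal characters lets both the padding symbol and the character itself match (score 2), while aligning two different characters lets only the padding symbols match (score 1), which reproduces the Levenshtein score wM = 2, wX = 1, wG = 0.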

Page 103: A superglue for string comparison

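Semi-local string comparison: Weighted alignment

Reduction: rational-weighted semi-local alignment to semi-local LCS

[Figure: alignment graph for the blown-up strings, with rows labelled $B $A $A $B $C $B $C $A and columns labelled $B $A $A $B $C $A $B $C $A $B $A $C $A, with the substring $C $A $B $C $A $B $A highlighted]

Let wM = 1, wX = µ/ν, wG = 0

Complexity increases by a factor of ν² (can be reduced to ν)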

Alexander Tiskin (Warwick) Semi-local LCS 56 / 164

Page 104: A superglue for string comparison

1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 Sparse string comparison

7 Compressed string comparison

8 Parallel string comparison

9 The transposition networkmethod

10 Beyond semi-locality

Alexander Tiskin (Warwick) Semi-local LCS 57 / 164

Page 105: A superglue for string comparison

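The seaweed method: Seaweed combing

[Figure: alignment graph with rows labelled by a = “BAABCBCA” and columns labelled by b = “BAABCABCABACA”; the seaweed braid is combed cell by cell over a sequence of animation frames]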

Alexander Tiskin (Warwick) Semi-local LCS 58 / 164

Page 144: A superglue for string comparison

The seaweed method: Seaweed combing

Semi-local LCS: seaweed combing [T: 2006]

Initialise uncombed seaweed braid: mismatch cell = crossing

Sweep cells in any ≪-compatible order

match cell: skip (keep uncrossed)

mismatch cell: comb (uncross) iff the same seaweed pair already crossed before

Cell update: time O(1)

Overall time O(mn)

Correctness: by seaweed monoid relations

Alexander Tiskin (Warwick) Semi-local LCS 61 / 164
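
The combing rule can be sketched directly. The labelling below is one possible convention, chosen for this example and not taken from the slides: seaweeds are numbered 0, …, m + n − 1 along the left boundary from bottom to top and then along the top boundary from left to right, so two seaweeds meeting in a cell have already crossed exactly when the one arriving from the left carries the larger label. The function names are illustrative, and the brute-force `lcs` is included only to cross-check the result.

```python
def comb(a, b):
    """Sweep the grid row by row; return (right, bottom) exit labels.

    right[i]  = label of the seaweed leaving through the right edge of row i
    bottom[j] = label of the seaweed leaving through the bottom edge of column j
    """
    m, n = len(a), len(b)
    right = [m - 1 - i for i in range(m)]   # seaweeds entering from the left
    bottom = [m + j for j in range(n)]      # seaweeds entering from the top
    for i in range(m):
        for j in range(n):
            # match cell, or pair already crossed: keep the seaweeds uncrossed,
            # i.e. left goes to bottom and top goes to right (a swap in the arrays)
            if a[i] == b[j] or right[i] > bottom[j]:
                right[i], bottom[j] = bottom[j], right[i]
            # otherwise the two seaweeds cross and the arrays stay as they are
    return right, bottom


def string_substring_score(m, bottom, i, j):
    """H(i, j) = j - i - (number of top-to-bottom seaweeds lying within columns i..j-1)."""
    unmatched = sum(1 for c in range(j) if bottom[c] - m >= i)
    return (j - i) - unmatched


def lcs(a, b):
    """Plain LCS length, used here only as a reference for checking."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for k, y in enumerate(b, 1):
            cur.append(prev[k - 1] + 1 if x == y else max(prev[k], cur[k - 1]))
        prev = cur
    return prev[-1]


a, b = "BAABCBCA", "BAABCABCABACA"
right, bottom = comb(a, b)
print(string_substring_score(len(a), bottom, 4, 11))   # 5
```

One O(mn) sweep answers every string-substring query afterwards. The linear-time counting in `string_substring_score` is only for readability; a range-counting structure would replace it in practice.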

Page 145: A superglue for string comparison

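The seaweed method: Micro-block seaweed combing

[Figure: alignment graph with rows labelled by a = “BAABCBCA” and columns labelled by b = “BAABCABCABACA”; the seaweed braid is combed one micro-block at a time over a sequence of animation frames]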

Alexander Tiskin (Warwick) Semi-local LCS 62 / 164

Page 160: A superglue for string comparison

The seaweed method: Micro-block seaweed combing

Semi-local LCS: micro-block seaweed combing [T: 2007]

Initialise uncombed seaweed braid: mismatch cell = crossing

Sweep cells in micro-blocks, in any ≪-compatible order

Micro-block size: t = O(log n / log log n)

Micro-block interface:

O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits

O(t) integers, each O(log n) bits, can be reduced to O(log t) bits

Micro-block update: time O(1), by precomputing all possible interfaces

Overall time O(mn (log log n)² / log² n)

Alexander Tiskin (Warwick) Semi-local LCS 65 / 164

Page 161: A superglue for string comparison

The seaweed method: Cyclic LCS

The cyclic LCS problem

Give the maximum LCS score for a vs all cyclic rotations of b

Cyclic LCS: running time

O(mn² / log n)                 naive
O(mn log m)                    [Maes: 1990]
O(mn)                          [Bunke, Buhler: 1993; Landau+: 1998; Schmidt: 1998]
O(mn (log log n)² / log² n)    [T: 2007]

Alexander Tiskin (Warwick) Semi-local LCS 66 / 164

Page 163: A superglue for string comparison

The seaweed method: Cyclic LCS

Cyclic LCS: the algorithm

Micro-block seaweed combing on a vs bb, time O(mn (log log n)² / log² n)

Make n string-substring LCS queries, time negligible

Alexander Tiskin (Warwick) Semi-local LCS 67 / 164
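
The two steps above can be sketched with plain cell-by-cell combing standing in for the micro-block version, so this sketch runs in O(mn) plus the query time rather than the bound quoted on the slide. The seaweed labelling (left boundary bottom to top, then top boundary left to right) and all function names are assumptions of this example. `cyclic_lcs_naive` is a brute-force reference.

```python
def comb(a, b):
    """Cell-by-cell seaweed combing; returns (right, bottom) exit labels."""
    m, n = len(a), len(b)
    right = [m - 1 - i for i in range(m)]
    bottom = [m + j for j in range(n)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j] or right[i] > bottom[j]:
                right[i], bottom[j] = bottom[j], right[i]
    return right, bottom


def cyclic_lcs(a, b):
    """Maximum of lcs(a, r) over all cyclic rotations r of b."""
    m, n = len(a), len(b)
    _, bottom = comb(a, b + b)
    best = 0
    for i in range(n):
        # the rotation starting at position i is the substring (b + b)[i : i + n]
        unmatched = sum(1 for c in range(i + n) if bottom[c] - m >= i)
        best = max(best, n - unmatched)
    return best


def lcs(a, b):
    """Plain LCS length, used only by the brute-force reference below."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for k, y in enumerate(b, 1):
            cur.append(prev[k - 1] + 1 if x == y else max(prev[k], cur[k - 1]))
        prev = cur
    return prev[-1]


def cyclic_lcs_naive(a, b):
    """Reference: one full LCS computation per rotation."""
    return max((lcs(a, b[i:] + b[:i]) for i in range(len(b))), default=0)


print(cyclic_lcs("ABC", "CAB"))   # 3, since "ABC" is itself a rotation of "CAB"
```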

Page 164: A superglue for string comparison

The seaweed method: Longest repeating subsequence

The longest repeating subsequence problem

Find the longest subsequence of a that is a square (a repetition of two identical strings)

Longest repeating subsequence: running time

O(m³)                          naive
O(m²)                          [Kosowski: 2004]
O(m² (log log m)² / log² m)    [T: 2007]

Alexander Tiskin (Warwick) Semi-local LCS 68 / 164

Page 166: A superglue for string comparison

The seaweed method: Longest repeating subsequence

Longest repeating subsequence: the algorithm

Micro-block seaweed combing on a vs a, time O(m² (log log m)² / log² m)

Make m − 1 prefix-suffix LCS queries, time negligible

Open question: generalise to longest subsequence that is a cube, etc.

Alexander Tiskin (Warwick) Semi-local LCS 69 / 164
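
A square subsequence uu of a splits a at some position k, with u a common subsequence of a[:k] and a[k:], so its length is 2 · max over k of lcs(a[:k], a[k:]). The sketch below obtains these m − 1 prefix-suffix scores from a single combing of a against itself. It rests on several assumptions that are not stated on the slides: plain cell-by-cell combing replaces the micro-block version; a seaweed leaving through the right edge of row r is given end index n + m − 1 − r, as if b were padded with wildcard characters; and the prefix-suffix score is read off as H(k, 2m − k) − (m − k). All names are illustrative.

```python
def comb(a, b):
    """Cell-by-cell seaweed combing; returns (right, bottom) exit labels."""
    m, n = len(a), len(b)
    right = [m - 1 - i for i in range(m)]
    bottom = [m + j for j in range(n)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j] or right[i] > bottom[j]:
                right[i], bottom[j] = bottom[j], right[i]
    return right, bottom


def longest_square_subsequence_length(a):
    """Length of the longest subsequence of a of the form uu."""
    m = len(a)
    right, bottom = comb(a, a)
    # (start, end) index of every seaweed; start = label - m
    seaweeds = [(bottom[c] - m, c) for c in range(m)]
    seaweeds += [(right[r] - m, 2 * m - 1 - r) for r in range(m)]
    best = 0
    for k in range(1, m):
        i, j = k, 2 * m - k
        h = (j - i) - sum(1 for s, e in seaweeds if s >= i and e < j)
        best = max(best, h - (m - k))   # equals lcs(a[:k], a[k:])
    return 2 * best


def lcs(a, b):
    """Plain LCS length, used only by the brute-force reference below."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for t, y in enumerate(b, 1):
            cur.append(prev[t - 1] + 1 if x == y else max(prev[t], cur[t - 1]))
        prev = cur
    return prev[-1]


def longest_square_naive(a):
    """Reference: one full LCS computation per split point."""
    return 2 * max((lcs(a[:k], a[k:]) for k in range(1, len(a))), default=0)


print(longest_square_subsequence_length("ABAB"))   # 4
```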

Page 167: A superglue for string comparison

The seaweed method: Approximate matching

The approximate pattern matching problem

Give the substring closest to a by alignment score, starting at each position in b

Assume rational scoring scheme

Approximate pattern matching: running time

O(mn)                          [Sellers: 1980]
O(mn / log n)                  σ = O(1), via [Masek, Paterson: 1980]
O(mn (log log n)² / log² n)    via [Bille, Farach-Colton: 2008]

Alexander Tiskin (Warwick) Semi-local LCS 70 / 164
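
For orientation, the O(mn) baseline can be sketched as a Sellers-style dynamic program in which the empty pattern matches at zero cost at every position of the text. Two simplifications are assumed here and are not part of the slide: unit-cost edit distance is used instead of a general rational scoring scheme, and both strings are reversed so that the usual "best match ending at each position" becomes "best match starting at each position". The function name is illustrative.

```python
def best_match_from_each_start(a, b):
    """result[i] = min over j >= i of the unit-cost edit distance between a and b[i:j]."""
    ra, rb = a[::-1], b[::-1]
    n = len(rb)
    # Row 0: the empty pattern matches the empty substring at every end position.
    prev = [0] * (n + 1)
    for i in range(1, len(ra) + 1):
        cur = [i]   # i pattern characters against an empty text prefix
        for j in range(1, n + 1):
            cur.append(min(prev[j - 1] + (ra[i - 1] != rb[j - 1]),  # match or substitute
                           prev[j] + 1,                              # skip a pattern character
                           cur[j - 1] + 1))                          # skip a text character
        prev = cur
    # A match ending at position j of the reversed text starts at position n - j of b.
    return [prev[n - i] for i in range(n + 1)]


print(best_match_from_each_start("abc", "xxabcxx"))
```

In the example the pattern occurs exactly at position 2, so the entry at index 2 is 0.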

Page 168: A superglue for string comparison

The seaweed method: Approximate matching

Approximate pattern matching: the algorithm

Micro-block seaweed combing on a vs b (with blow-up), time O(mn (log log n)² / log² n)

The implicit semi-local edit score matrix:

an anti-Monge matrix

approximate pattern matching ∼ row minima

Row minima in O(n) element queries [Aggarwal+: 1987]

Each query in time O(log² n) using the range tree representation, combined query time negligible

Overall running time O(mn (log log n)² / log² n), same as [Bille, Farach-Colton: 2008]

Periodic string comparison: Wraparound seaweed combing

The periodic string-substring LCS problem

Give (implicit) LCS scores for a vs each substring of $b = \ldots uuu \ldots = u^{\pm\infty}$

Let u be of length p

May assume that every character of a occurs in u (otherwise delete it)

Only substrings of b of length ≤ mp (otherwise LCS score = m)

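Periodic string comparison: Wraparound seaweed combing

[Figure: grid with rows labelled B A A B C B C A and columns labelled B A A B C A B C A B A C A, shown over a sequence of animation frames]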

Periodic string comparison: Wraparound seaweed combing

Periodic string-substring LCS: Wraparound seaweed combing

Initialise uncombed seaweed braid: mismatch cell = crossing

Sweep cells row-by-row: each row starts at match cell, wraps at boundary

Sweep cells in any ≪-compatible order

match cell: skip (keep uncrossed)

mismatch cell: comb (uncross) iff the same seaweed pair already crossed before, possibly in a different period

Cell update: time O(1)

Overall time O(mn)

String-substring LCS score: count seaweeds with multiplicities
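
The match/mismatch rule above is easiest to see in the plain, non-periodic setting. The sketch below combs seaweeds over a finite string b with no wraparound, so it illustrates the cell rule only and is not the periodic algorithm of this slide. The seaweed numbering (left boundary bottom-to-top, then top boundary left-to-right) and the function names are assumptions made for the example.

```python
def comb_seaweeds(a, b):
    """Comb the seaweed braid of a vs b; return the seaweed id exiting below each column."""
    m, n = len(a), len(b)
    # Ids follow the start boundary: up the left side, then rightwards along the top.
    row = [m - 1 - i for i in range(m)]  # seaweed currently travelling along row i
    col = [m + j for j in range(n)]      # seaweed currently travelling down column j
    for i in range(m):
        for j in range(n):
            h, v = row[i], col[j]
            # h > v means this pair has already crossed once.
            if a[i] == b[j] or h > v:
                row[i], col[j] = v, h    # keep uncrossed: both seaweeds bend
            # otherwise: first crossing, both seaweeds pass straight through
    return col


def string_substring_lcs(a, b, i, j, exits=None):
    """LCS score of a vs b[i:j], read off the combed braid."""
    m = len(a)
    if exits is None:
        exits = comb_seaweeds(a, b)
    # Count seaweeds that start on top at column >= i and exit below column < j.
    through = sum(1 for e in range(j) if exits[e] - m >= i)
    return (j - i) - through
```

Once `comb_seaweeds` has run, every string-substring score is a counting query on its output; the counting here is linear per query for clarity, whereas the slides use a range tree.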

Periodic string comparison: Wraparound seaweed combing

The tandem LCS problem

Give LCS score for a vs $b = u^k$

We have n = kp; may assume k ≤ m

Tandem LCS: running time

O(mkp) naive

O(m(k + p)) [Landau, Ziv-Ukelson: 2001]

O(mp) [T: 2009]

Direct application of wraparound seaweed combing
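
For comparison, the naive O(mkp) bound corresponds to writing out $u^k$ explicitly and running the classical LCS recurrence. The sketch below does that, using the observation that k may be capped at m, since each matched character of a needs at most one copy of u. The function name is hypothetical.

```python
def tandem_lcs_naive(a, u, k):
    """LCS score of a vs u repeated k times, by explicit expansion."""
    k = min(k, len(a))  # more than m copies of u cannot increase the score
    b = u * k
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```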

Periodic string comparison: Wraparound seaweed combing

The tandem alignment problem

Give the substring closest to a in $b = u^{\pm\infty}$ by alignment score among

global: substrings $u^k$ of length kp across all k

cyclic: substrings of length kp across all k

local: substrings of any length

Tandem alignment: running time

$O(m^2 p)$ all naive

O(mp) global [Myers, Miller: 1989]

O(mp log p) cyclic [Benson: 2005]

O(mp) cyclic [T: 2009]

O(mp) local [Myers, Miller: 1989]

Periodic string comparison: Wraparound seaweed combing

Cyclic tandem alignment: the algorithm

Periodic seaweed combing for a vs b (with blow-up), time O(mp)

For each k ∈ [1 : m]:

solve tandem LCS (under given alignment score) for a vs $u^k$

obtain scores for a vs p successive substrings of b of length kp by LCS batch query: time O(1) per substring

Running time O(mp)

Sparse string comparison: Semi-local LCS between permutations

The LCS problem on permutation strings (LCSP)

Give LCS score for a vs b; in each of a, b all characters distinct

Equivalent to

longest increasing subsequence (LIS) in a string

maximum clique in a permutation graph

maximum planar matching in an embedded bipartite graph

LCSP: running time

O(n log n) implicit in [Erdős, Szekeres: 1935] [Robinson: 1938; Knuth: 1970; Dijkstra: 1980]

O(n log log n) unit-RAM [Chang, Wang: 1992] [Bespamyatnikh, Segal: 2000]
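
The equivalence between LCSP and LIS, and the O(n log n) bound, can be sketched as follows: relabel each character of b by its position in a, then run the standard patience-sorting LIS with binary search. The function names are hypothetical.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence, in O(n log n)."""
    tails = []  # tails[k] = smallest possible last element of an increasing run of length k + 1
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


def lcsp_length(a, b):
    """LCS score of two strings whose characters are all distinct within each string."""
    position_in_a = {c: i for i, c in enumerate(a)}
    # A common subsequence is exactly an increasing run of positions in a.
    return lis_length([position_in_a[c] for c in b if c in position_in_a])
```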

Sparse string comparison: Semi-local LCS between permutations

Semi-local LCSP

Give semi-local LCS scores for a vs b; in each of a, b all characters distinct

Equivalent to

longest increasing subsequence (LIS) in every substring of a string

Semi-local LCSP: running time

$O(n^2 \log n)$ naive

$O(n^2)$ restricted [Albert+: 2003; Chen+: 2005]

$O(n^{1.5} \log n)$ randomised, restricted [Albert+: 2007]

$O(n^{1.5})$ [T: 2006]

$O(n \log^2 n)$ [T: 2010]

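Sparse string comparison: Semi-local LCS between permutations

[Figure: grid with rows labelled C F A E D H G B and columns labelled D E H C B A F G, shown over a sequence of animation frames]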

Sparse string comparison: Semi-local LCS between permutations

Semi-local LCSP: the algorithm

Divide-and-conquer on the alignment graph

Divide graph (say) horizontally; two subproblems of effective size n/2

Conquer: seaweed matrix ⊡-multiplication, time O(n log n)

Overall time $O(n \log^2 n)$

Sparse string comparison: Longest piecewise monotone subsequences

A k-increasing sequence: a concatenation of k increasing sequences

A (k − 1)-modal sequence: a concatenation of k alternating increasing and decreasing sequences

The longest k-increasing (or (k − 1)-modal) subsequence problem

Give the longest k-increasing ((k − 1)-modal) subsequence of string b

Longest k-increasing (or (k − 1)-modal) subsequence: running time

O(nk log n) (k − 1)-modal [Demange+: 2007]

O(nk log n) via [Hunt, Szymanski: 1977]

$O(n \log^2 n)$ [T: 2010]

Main idea: $\mathrm{lcs}(\mathrm{id}^k, b)$ (respectively, $\mathrm{lcs}((\mathrm{id}\,\overline{\mathrm{id}})^{k/2}, b)$)
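
The main idea can be checked directly on small inputs: a k-increasing subsequence of b is a common subsequence of b and k concatenated copies of the sorted alphabet. The sketch below uses the plain quadratic LCS recurrence rather than sparse LCS or seaweed multiplication, so it only illustrates the reduction; it assumes all characters of b are distinct, and the function name is hypothetical.

```python
def longest_k_increasing(b, k):
    """Length of the longest subsequence of b that splits into k increasing runs."""
    identity = sorted(b)   # the string "id": characters of b in increasing order
    x = identity * k       # id repeated k times
    prev = [0] * (len(b) + 1)
    for cx in x:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            if cx == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

With k = 1 this is the ordinary longest increasing subsequence.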

Sparse string comparison: Longest piecewise monotone subsequences

Longest k-increasing subsequence: algorithm A

Sparse LCS for $\mathrm{id}^k$ vs b: time O(nk log n)

Longest k-increasing subsequence: algorithm B

Semi-local LCSP for id vs b: time $O(n \log^2 n)$

Extract string-substring LCSP for id vs b

String-substring LCS for $\mathrm{id}^k$ vs b by at most 2 log k instances of ⊡-square/multiply: time $\log k \cdot O(n \log n) = O(n \log^2 n)$

Query the LCS score for $\mathrm{id}^k$ vs b: time negligible

Overall time $O(n \log^2 n)$

Algorithm B faster than Algorithm A for k ≥ log n

Sparse string comparison: Maximum clique in a circle graph

The maximum clique problem in a circle graph

Given a circle with n chords, find the maximum-size subset of pairwise intersecting chords

Standard reduction to an interval model: cut the circle and lay it out on the line; chords become intervals (here drawn as square diagonals)

Chords intersect iff intervals overlap, i.e. intersect without containment

Sparse string comparison: Maximum clique in a circle graph

Maximum clique in a circle graph: running time

exp(n) naive

$O(n^3)$ [Gavril: 1973]

$O(n^2)$ [Rotem, Urrutia: 1981; Hsu: 1985] [Masuda+: 1990; Apostolico+: 1992]

$O(n^{1.5})$ [T: 2006]

$O(n \log^2 n)$ [T: 2010]

Sparse string comparison: Maximum clique in a circle graph

Maximum clique in a circle graph: the algorithm

Helly property: if any set of intervals intersect pairwise, then they all intersect at a common point

Compute seaweed matrix, build range tree: time $O(n \log^2 n)$

Run through all 2n + 1 possible common intersection points

For each point, find a maximum subset of covering overlapping segments by a prefix-suffix LCS query: time $(2n + 1) \cdot O(\log^2 n) = O(n \log^2 n)$

Overall time $O(n \log^2 n) + O(n \log^2 n) = O(n \log^2 n)$
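
The structure of this algorithm can be sketched without the seaweed machinery: for each candidate common point, take the intervals covering it and find the largest subset whose left and right endpoints increase together, which is an LIS computation. The version below replaces each prefix-suffix LCS query by a direct LIS, so it runs in $O(n^2 \log n)$ and is only an illustration. It assumes the interval model is given as pairs (left, right) with all 2n endpoints distinct integers; the function names are hypothetical.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence."""
    tails = []
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


def max_clique_circle_graph(intervals):
    """Maximum number of pairwise overlapping (non-nested) intervals."""
    by_left = sorted(intervals)
    best = 0
    for cut, _ in by_left:
        # Intervals covering a point just to the right of `cut`, in order of left endpoint.
        rights = [r for (l, r) in by_left if l <= cut < r]
        # Pairwise overlap without containment: right endpoints increase with left endpoints.
        best = max(best, lis_length(rights))
    return best
```

By the Helly property it suffices to try a point just after each left endpoint, since the latest-starting interval of any clique starts inside all the others.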

Sparse string comparison: Maximum clique in a circle graph

Parameterised maximum clique in a circle graph

The maximum clique problem in a circle graph, sensitive e.g. to

number e of edges; $e \le n^2$

size l of maximum clique; l ≤ n

cutwidth d of interval model (max number of intervals covering a point); l ≤ d ≤ n

Parameterised maximum clique in a circle graph: running time

O(n log n + e) [Apostolico+: 1992]

O(n log n + nl log(n/l)) [Apostolico+: 1992]

$O(n \log n + n \log^2 d)$ NEW

Sparse string comparison: Maximum clique in a circle graph

Parameterised maximum clique in a circle graph: the algorithm

For each diagonal block of size d, compute seaweed matrix, build range tree: time $n/d \cdot O(d \log^2 d) = O(n \log^2 d)$

Extend each diagonal block to a quadrant: time $O(n \log^2 d)$

Run through all 2n + 1 possible common intersection points

For each point, find a maximum subset of covering overlapping segments by a prefix-suffix LCS query: time $O(n \log^2 d)$

Overall time $O(n \log^2 d)$

Compressed string comparison: Grammar compression

Notation: pattern p of length m; text t of length n

A GC-string (grammar-compressed string) t is a straight-line program (context-free grammar) generating $t = t_{\bar n}$ by $\bar n$ assignments of the form

$t_k = \alpha$, where α is an alphabet character

$t_k = uv$, where each of u, v is an alphabet character, or $t_i$ for i < k

In general, $n = O(2^{\bar n})$

Example: Fibonacci string “ABAABABAABAAB”

$t_1 = A$, $t_2 = t_1 B$, $t_3 = t_2 t_1$, $t_4 = t_3 t_2$, $t_5 = t_4 t_3$, $t_6 = t_5 t_4$
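
The Fibonacci example can be written down as a small straight-line program and expanded to confirm the text it generates. The encoding below is an assumption made for illustration: a rule is either a single character or a pair whose entries are characters or 0-based indices of earlier rules.

```python
# t1 = A, t2 = t1 B, t3 = t2 t1, t4 = t3 t2, t5 = t4 t3, t6 = t5 t4 (indices shifted to start at 0)
FIBONACCI_SLP = ["A", (0, "B"), (1, 0), (2, 1), (3, 2), (4, 3)]


def slp_lengths(rules):
    """Length of the string generated by each rule, without expanding anything."""
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            lengths.append(sum(lengths[s] if isinstance(s, int) else 1 for s in rule))
    return lengths


def slp_expand(rules):
    """Fully expand the last rule; the output can be exponentially longer than the grammar."""
    strings = []
    for rule in rules:
        if isinstance(rule, str):
            strings.append(rule)
        else:
            strings.append("".join(strings[s] if isinstance(s, int) else s for s in rule))
    return strings[-1]
```

Here six assignments generate a text of length 13; the lengths follow the Fibonacci numbers, which is the kind of growth behind the exponential gap between compressed and plain size.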

Compressed string comparison: Grammar compression

Grammar-compression covers various compression types, e.g. LZ78, LZW (not LZ77 directly)

Simplifying assumption: arithmetic up to n runs in O(1)

This assumption can be removed by careful index remapping

Compressed string comparison: Extended substring-string LCS on GC-strings

LCS: running time ($r = m + n$, $\bar r = \bar m + \bar n$)

p      t      running time
plain  plain  O(mn) [Wagner, Fischer: 1974]
              $O(mn / \log^2 m)$ [Masek, Paterson: 1980] [Crochemore+: 2003]
plain  GC     $O(m^3 \bar n + \ldots)$ gen. CFG [Myers: 1995]
              $O(m^{1.5} \bar n)$ ext subs-s [T: 2008]
              $O(m \log m \cdot \bar n)$ ext subs-s [T: 2010]
GC     GC     NP-hard [Lifshits: 2005]
              $O(r^{1.2} \bar r^{1.4})$ R weights [Hermelin+: 2009]
              $O(r \log r \cdot \bar r)$ [T: 2010]
              $O(r \log(r/\bar r) \cdot \bar r)$ [Hermelin+: 2010]
              $O(r \log^{1/2}(r/\bar r) \cdot \bar r)$ [Gawrychowski: 2012]

Compressed string comparison: Extended substring-string LCS on GC-strings

Substring-string LCS (plain pattern, GC text): the algorithm

For every k, compute by recursion the appropriate part of semi-local LCS for p vs $t_k$, using seaweed matrix ⊡-multiplication: time $O(m \log m \cdot \bar n)$

Overall time $O(m \log m \cdot \bar n)$

Compressed string comparison: Subsequence recognition on GC-strings

The global subsequence recognition problem

Does pattern p appear in text t as a subsequence?

Global subsequence recognition: running time

p      t      running time
plain  plain  O(n) greedy
plain  GC     $O(m \bar n)$ greedy
GC     GC     NP-hard [Lifshits: 2005]
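
One way to realise the greedy bound for a plain pattern and a GC text is to tabulate, for every grammar rule and every pattern position i, how far the greedy scan advances through the pattern while reading the rule's expansion from position i. Tables for a concatenation compose, so each rule costs O(m) and the text is never expanded. This is a sketch under an assumed grammar encoding (a rule is a character, or a pair of characters or 0-based earlier rule indices); the function names are hypothetical.

```python
def greedy_advance_tables(pattern, rules):
    """tables[k][i] = pattern position reached after greedily scanning t_k from position i."""
    m = len(pattern)
    tables = []

    def table_of(symbol):
        if isinstance(symbol, int):
            return tables[symbol]  # an earlier rule
        # A single text character advances the pattern pointer iff it matches.
        return [i + 1 if i < m and pattern[i] == symbol else i for i in range(m + 1)]

    for rule in rules:
        if isinstance(rule, str):
            tables.append(table_of(rule))
        else:
            left, right = table_of(rule[0]), table_of(rule[1])
            tables.append([right[left[i]] for i in range(m + 1)])
    return tables


def is_subsequence_of_gc_text(pattern, rules):
    """True iff pattern is a subsequence of the text generated by the last rule."""
    return greedy_advance_tables(pattern, rules)[-1][0] == len(pattern)
```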

Compressed string comparison: Subsequence recognition on GC-strings

The local subsequence recognition problem

Find all minimally matching substrings of t with respect to p

A substring of t is matching if p is a subsequence of that substring

A matching substring of t is minimally matching if none of its proper substrings are matching
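
On plain strings the definition can be turned into a direct check: for each start i, find the earliest end at which p has been read greedily; the window is minimal exactly when starting one position later pushes that end further right or loses the match. The sketch below is a naive quadratic-time illustration for a non-empty pattern, not one of the cited algorithms; the function name is hypothetical.

```python
def minimally_matching_substrings(p, t):
    """All (i, j) such that t[i:j] is a minimally matching substring for non-empty p."""
    n = len(t)

    def earliest_end(i):
        """Smallest j with p a subsequence of t[i:j], or None if there is none."""
        k = 0
        for j in range(i, n):
            if t[j] == p[k]:
                k += 1
                if k == len(p):
                    return j + 1
        return None

    ends = [earliest_end(i) for i in range(n + 1)]
    # t[i:ends[i]] has no shorter matching prefix by construction;
    # it has no matching suffix iff the earliest end from i + 1 differs.
    return [(i, ends[i]) for i in range(n)
            if ends[i] is not None and ends[i + 1] != ends[i]]
```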

Compressed string comparison: Subsequence recognition on GC-strings

Local subsequence recognition: running time (+ output)

p      t      running time
plain  plain  O(mn) [Mannila+: 1995]
              $O(mn / \log m)$ [Das+: 1997]
              O(cm + n) [Boasson+: 2001]
              O(m + nσ) [Tronicek: 2001]
plain  GC     $O(m^2 \log m \cdot \bar n)$ [Cegielski+: 2006]
              $O(m^{1.5} \bar n)$ [T: 2008]
              $O(m \log m \cdot \bar n)$ [T: 2010]
              $O(m \cdot \bar n)$ [Yamamoto+: 2011]
GC     GC     NP-hard [Lifshits: 2005]

Compressed string comparison: Subsequence recognition on GC-strings

[Figure: seaweed diagram with marked positions 0, n′, n]

b〈i : j〉 matching iff box [i : j] not pierced left-to-right

Determined by ≪-chain of ≷-maximal seaweeds

b〈i : j〉 minimally matching iff (i, j) is in the interleaved skeleton ≪-chain

Compressed string comparison: Subsequence recognition on GC-strings

[Figure: matrix with axes marked −m, 0, n′ and 0+, 1+, 2+, n′, n, m+n]

Compressed string comparison: Subsequence recognition on GC-strings

Local subsequence recognition (plain pattern, GC text): the algorithm

For every k, compute by recursion the appropriate part of semi-local LCS for p vs $t_k$, using seaweed matrix ⊡-multiplication: time $O(m \log m \cdot \bar n)$

Given an assignment t = t′t″, find by recursion

minimally matching substrings in t′

minimally matching substrings in t″

Then, find ≪-chain of ≷-maximal seaweeds in time $\bar n \cdot O(m) = O(m \bar n)$

Its skeleton ≪-chain: minimally matching substrings in t overlapping t′, t″

Overall time $O(m \log m \cdot \bar n) + O(m \bar n) = O(m \log m \cdot \bar n)$

Compressed string comparison: Threshold approximate matching

The threshold approximate matching problem

Find all matching substrings of t with respect to p, according to a threshold k

A substring of t is matching if the edit distance for p vs that substring is at most k

Compressed string comparison: Threshold approximate matching

Threshold approximate matching: running time (+ output)

p      t      running time
plain  plain  O(mn) [Sellers: 1980]
              O(nk) [Landau, Vishkin: 1989]
              $O(m + n + nk^4/m)$ [Cole, Hariharan: 2002]
plain  GC     $O(m \bar n k^2)$ [Karkkainen+: 2003]
              $O(m \bar n k + \bar n \log n)$ [LV: 1989] via [Bille+: 2010]
              $O(m \bar n + \bar n k^4 + \bar n \log n)$ [CH: 2002] via [Bille+: 2010]
              $O(m \log m \cdot \bar n)$ [T: NEW]
GC     GC     NP-hard [Lifshits: 2005]

(Also many specialised variants for LZ compression)

Compressed string comparison: Threshold approximate matching

[Figure: seaweed diagram with marked positions 0, n′, n]

Blow up: weighted alignment on strings p, t of size m, n equivalent to LCS on strings $\tilde p$, $\tilde t$ of size $\tilde m = \nu m$, $\tilde n = \nu n$

Compressed string comparison: Threshold approximate matching

[Figure: matrix with axes marked −m, 0, n′ and 0+, 1+, 2+, 3+, 4+, n′, n, m+n]

Compressed string comparison: Threshold approximate matching

Threshold approx matching (plain pattern, GC text): the algorithm

Algorithm structure similar to local subsequence recognition

≪-chains replaced by m × m submatrices

Extra ingredients:

the blow-up technique: reduction of edit distances to LCS scores

implicit matrix searching, replaces ≪-chain interleaving

Monge row minima: “SMAWK” O(m) [Aggarwal+: 1987]

Implicit unit-Monge row minima:

O(m log log m) [T: 2012]

O(m) [Gawrychowski: 2012]

Overall time $O(m \log m \cdot \bar n) + O(m \bar n) = O(m \log m \cdot \bar n)$
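
To make the row-minima ingredient concrete, the sketch below finds the leftmost minimum of every row of a Monge matrix given only an entry oracle. It uses the simpler divide-and-conquer scheme, which needs O((rows + cols) log rows) entry queries, rather than the linear-time SMAWK algorithm cited above; it relies only on the fact that the leftmost row minima of a Monge matrix move weakly rightwards as the row index grows. The function and parameter names are hypothetical.

```python
def monge_row_minima(rows, cols, entry):
    """Column index of the leftmost minimum in each row of a rows x cols Monge matrix.

    `entry(i, j)` returns the matrix element; the matrix is never stored explicitly.
    """
    argmin = [0] * rows

    def solve(r_lo, r_hi, c_lo, c_hi):
        if r_lo > r_hi:
            return
        mid = (r_lo + r_hi) // 2
        # min() returns the first minimiser, i.e. the leftmost one in the range.
        best = min(range(c_lo, c_hi + 1), key=lambda j: entry(mid, j))
        argmin[mid] = best
        solve(r_lo, mid - 1, c_lo, best)   # rows above: minima at or left of `best`
        solve(mid + 1, r_hi, best, c_hi)   # rows below: minima at or right of `best`

    solve(0, rows - 1, 0, cols - 1)
    return argmin
```

For example, the matrix with entries $(i - j)^2$ is Monge, and its row minima lie on the diagonal as long as the diagonal exists.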

Parallel string comparison: Parallel LCS

Bulk-Synchronous Parallel (BSP) computer [Valiant: 1990]

Simple, realistic general-purpose parallel model

[Figure: processors 0, 1, ..., p − 1, each with local memory (P, M), connected by a communication environment with parameters (g, l)]

Contains

p processors, each with local memory (1 time unit/operation)

communication environment, including a network and an external memory (g time units/data unit communicated)

barrier synchronisation mechanism (l time units/synchronisation)
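
The three unit costs above combine into the standard BSP cost accounting, in which a superstep costs w + h · g + l, with w the maximum local work and h the maximum number of data units sent or received by any processor. The sketch below applies that formula; the function names and the input layout (one list of per-processor figures per superstep) are assumptions made for the example.

```python
def superstep_cost(work, comm, g, l):
    """Cost of one superstep: slowest processor's work, largest communication load, one barrier."""
    return max(work) + max(comm) * g + l


def bsp_cost(supersteps, g, l):
    """Total cost of a BSP computation given (work, comm) per-processor lists for each superstep."""
    return sum(superstep_cost(work, comm, g, l) for work, comm in supersteps)
```

For instance, two processors doing 10 and 7 operations and communicating 3 and 5 data units, with g = 2 and l = 20, give a superstep cost of 10 + 5 · 2 + 20 = 40.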

Parallel string comparison: Parallel LCS

BSP computation: sequence of parallel supersteps

[Figure: timeline of supersteps across processors 0, 1, ..., p − 1]

Asynchronous computation/communication within supersteps (includes data exchange with external memory)

Synchronisation before/after each superstep

Cf. CSP: parallel collection of sequential processes

Page 273: A superglue for string comparison

Parallel string comparison: Parallel LCS

Compositional cost model

For individual processor proc in superstep sstep:

comp(sstep, proc): the amount of local computation and local memory operations by processor proc in superstep sstep

comm(sstep, proc): the amount of data sent and received by processor proc in superstep sstep

For the whole BSP computer in one superstep sstep:

$\mathit{comp}(\mathit{sstep}) = \max_{0 \le \mathit{proc} < p} \mathit{comp}(\mathit{sstep}, \mathit{proc})$

$\mathit{comm}(\mathit{sstep}) = \max_{0 \le \mathit{proc} < p} \mathit{comm}(\mathit{sstep}, \mathit{proc})$

$\mathit{cost}(\mathit{sstep}) = \mathit{comp}(\mathit{sstep}) + \mathit{comm}(\mathit{sstep}) \cdot g + l$

Page 275: A superglue for string comparison

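Parallel string comparison: Parallel LCS

For the whole BSP computation with sync supersteps:

$\mathit{comp} = \sum_{0 \le \mathit{sstep} < \mathit{sync}} \mathit{comp}(\mathit{sstep})$

$\mathit{comm} = \sum_{0 \le \mathit{sstep} < \mathit{sync}} \mathit{comm}(\mathit{sstep})$

$\mathit{cost} = \sum_{0 \le \mathit{sstep} < \mathit{sync}} \mathit{cost}(\mathit{sstep}) = \mathit{comp} + \mathit{comm} \cdot g + \mathit{sync} \cdot l$

The input/output data are stored in the external memory; the cost of input/output is included in comm

E.g. for a particular linear system solver with an n × n matrix:

$\mathit{comp} = O(n^3/p)$, $\mathit{comm} = O(n^2/p^{1/2})$, $\mathit{sync} = O(p^{1/2})$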
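The cost model can be evaluated mechanically. The following is a minimal sketch, assuming each superstep is given as a pair of per-processor lists (local operation counts, data units communicated); the function names and the input layout are illustrative assumptions, not part of any BSP library.

```python
def superstep_cost(comp_per_proc, comm_per_proc, g, l):
    """Cost of one superstep: the slowest processor determines comp and comm."""
    return max(comp_per_proc) + max(comm_per_proc) * g + l


def bsp_cost(supersteps, g, l):
    """Return (comp, comm, sync, cost) for a whole BSP computation.

    supersteps: list of (comp_per_proc, comm_per_proc) pairs (hypothetical layout).
    """
    comp = sum(max(c) for c, _ in supersteps)
    comm = sum(max(m) for _, m in supersteps)
    sync = len(supersteps)
    return comp, comm, sync, comp + comm * g + sync * l


# Two supersteps on p = 2 processors, with g = 4 and l = 100.
steps = [([10, 12], [3, 1]), ([5, 5], [0, 2])]
print(bsp_cost(steps, g=4, l=100))
```

Summing `superstep_cost` over the supersteps gives the same total as the closed form comp + comm · g + sync · l, which is what makes the model compositional.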

Page 278: A superglue for string comparison

Parallel string comparison: Parallel LCS

BSP software: industrial projects

Google’s Pregel [2010]

Apache Hama (hama.apache.org) [2010]

Apache Giraph (giraph.apache.org) [2011]

BSP software: research projects

Oxford BSP (www.bsp-worldwide.org/implmnts/oxtool) [1998]

Paderborn PUB (www2.cs.uni-paderborn.de/~pub) [1998]

BSML (traclifo.univ-orleans.fr/BSML) [1998]

BSPonMPI (bsponmpi.sourceforge.net) [2006]

Multicore BSP (www.multicorebsp.com) [2011]

Epiphany BSP (www.codu.in/ebsp) [2015]

Petuum (petuum.org) [2015]

Page 279: A superglue for string comparison

Parallel string comparison: Parallel LCS

The ordered 2D grid dag

$\mathit{grid}_2(n)$

nodes arranged in an n × n grid

edges directed top-to-bottom, left-to-right

≤ 2n inputs (to left/top borders)

≤ 2n outputs (from right/bottom borders)

size $n^2$, depth $2n - 1$

Applications: triangular linear system; discretised PDE via Gauss–Seidel iteration (single step); 1D cellular automata; dynamic programming

Sequential work $O(n^2)$

Page 281: A superglue for string comparison

Parallel string comparison: Parallel LCS

Parallel ordered 2D grid computation

$\mathit{grid}_2(n)$

Consists of a p × p grid of blocks, each isomorphic to $\mathit{grid}_2(n/p)$

The blocks can be arranged into 2p − 1 anti-diagonal layers, with ≤ p independent blocks in each layer
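The layer structure can be enumerated directly. This is a small sketch, assuming blocks are indexed by (row, column) pairs starting from 0; the function name is an illustrative choice.

```python
def antidiagonal_layers(p):
    """List the 2p - 1 anti-diagonal layers of a p x p grid of blocks.

    Block (i, j) lies in layer i + j. Its predecessors (i - 1, j) and
    (i, j - 1) lie in layer i + j - 1, so blocks within one layer are
    independent of each other.
    """
    return [
        [(i, d - i) for i in range(max(0, d - p + 1), min(d, p - 1) + 1)]
        for d in range(2 * p - 1)
    ]


for layer in antidiagonal_layers(3):
    print(layer)
```

For p = 3 the layers have sizes 1, 2, 3, 2, 1, so at most p processors are busy in any layer.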

Page 288: A superglue for string comparison

Parallel string comparison: Parallel LCS

Parallel ordered 2D grid computation (contd.)

The computation proceeds in 2p − 1 stages, each computing a layer of blocks. In a stage:

every block assigned to a different processor (some processors idle)

the processor reads the 2n/p block inputs, computes the block, and writes back the 2n/p block outputs

comp: $(2p - 1) \cdot O((n/p)^2) = O(p \cdot n^2/p^2) = O(n^2/p)$

comm: $(2p - 1) \cdot O(n/p) = O(n)$

For n ≥ p:

comp $O(n^2/p)$, comm $O(n)$, sync $O(p)$
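The staged schedule can be illustrated on the LCS dynamic programming table, which is one instance of the 2D grid dag. The sketch below is a sequential simulation only: blocks of one layer are visited in a loop, where a BSP program would assign them to different processors. The function names and the even block partition are assumptions made for the example.

```python
def lcs_plain(a, b):
    """Reference row-by-row LCS dynamic programming."""
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                h[i][j] = h[i - 1][j - 1] + 1
            else:
                h[i][j] = max(h[i - 1][j], h[i][j - 1])
    return h[len(a)][len(b)]


def lcs_wavefront(a, b, p):
    """Fill the same table block by block, one anti-diagonal layer per stage."""
    m, n = len(a), len(b)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    rows = [m * k // p for k in range(p + 1)]  # block boundaries along a
    cols = [n * k // p for k in range(p + 1)]  # block boundaries along b
    stages = 0
    for d in range(2 * p - 1):  # one stage (superstep) per layer
        for bi in range(max(0, d - p + 1), min(d, p - 1) + 1):
            bj = d - bi  # blocks (bi, bj) of this layer are independent
            for i in range(rows[bi] + 1, rows[bi + 1] + 1):
                for j in range(cols[bj] + 1, cols[bj + 1] + 1):
                    if a[i - 1] == b[j - 1]:
                        h[i][j] = h[i - 1][j - 1] + 1
                    else:
                        h[i][j] = max(h[i - 1][j], h[i][j - 1])
        stages += 1
    return h[m][n], stages


print(lcs_wavefront("abcbdab", "bdcaba", p=3))
```

The block order respects all dependencies, so the score agrees with the plain row-by-row computation while using exactly 2p − 1 stages.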

Page 291: A superglue for string comparison

Parallel string comparison: Parallel LCS

Parallel LCS computation

The 2D grid algorithm solves the LCS problem (and many others) by dynamic programming

$\mathit{comp} = O(n^2/p)$, $\mathit{comm} = O(n)$, $\mathit{sync} = O(p)$

comm is not scalable (i.e. does not decrease with increasing p) :-(

Can scalable comm be achieved for the LCS problem?

Page 293: A superglue for string comparison

Parallel string comparison: Parallel LCS

Parallel LCS computation

Solve the more general semi-local LCS problem:

each string vs all substrings of the other string

all prefixes of each string against all suffixes of the other string

Divide-and-conquer on substrings of a, b: log p recursion levels

Conquer phase: assembles substring semi-local LCS from smaller ones by parallel seaweed multiplication

Base level: p semi-local LCS subproblems on $n/p^{1/2}$-strings

Sequential time still $O(n^2)$

Page 294: A superglue for string comparison

Parallel string comparison: Parallel LCS

Parallel LCS computation (cont.)

Parallel semi-local LCS: running time, comm, sync

comp          comm                       sync             source
$O(n^2/p)$    $O(n)$                     $O(p)$           2D grid
$O(n^2/p)$    $O(n/p^{1/2})$             $O(\log^2 p)$    [Krusche, T: 2007]
$O(n^2/p)$    $O(n \log p / p^{1/2})$    $O(\log p)$      [Krusche, T: 2010]
$O(n^2/p)$    $O(n/p^{1/2})$             $O(\log p)$      [T: NEW]

Based on parallel Steady Ant algorithm

Page 295: A superglue for string comparison

Parallel string comparison: Parallel LCS

Seaweed matrix ⊙-multiplication: parallel Steady Ant

$R^\Sigma(i, k) = \min_j \bigl(P^\Sigma(i, j) + Q^\Sigma(j, k)\bigr)$

Divide-and-conquer on the range of j: two subproblems on n/2-matrices

$P^\Sigma_{lo} \odot Q^\Sigma_{lo} = R^\Sigma_{lo}$, $P^\Sigma_{hi} \odot Q^\Sigma_{hi} = R^\Sigma_{hi}$

Conquer: tracing a balanced trail from bottom-left to top-right of R

Partition n × n index space into a regular p × p grid of blocks

Sample R in block corners: total $p^2$ samples

Steady Ant’s trail passes through ≤ 2p blocks

Partition blocks across processors: comp, comm = $O(n/p)$

$\mathit{comp} = O\bigl(\frac{n \log n}{p}\bigr)$, $\mathit{comm} = O\bigl(\frac{n \log p}{p}\bigr)$, $\mathit{sync} = O(\log p)$
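The defining formula for the product can be evaluated naively, which is useful as a correctness reference. The sketch below is not the Steady Ant algorithm: it takes cubic time and builds full distribution matrices. It assumes a permutation matrix is stored as a list `perm` with a nonzero in row r, column perm[r], and that the distribution matrix entry (i, j) counts the nonzeros in rows ≥ i and columns < j; the function names are illustrative.

```python
def distribution(perm):
    """(n+1) x (n+1) distribution matrix of a permutation matrix.

    Assumed convention: entry (i, j) counts nonzeros (r, perm[r])
    with r >= i and perm[r] < j.
    """
    n = len(perm)
    return [
        [sum(1 for r in range(i, n) if perm[r] < j) for j in range(n + 1)]
        for i in range(n + 1)
    ]


def seaweed_product_naive(p, q):
    """Min-plus product of distribution matrices, converted back to a permutation."""
    n = len(p)
    ps, qs = distribution(p), distribution(q)
    rs = [
        [min(ps[i][j] + qs[j][k] for j in range(n + 1)) for k in range(n + 1)]
        for i in range(n + 1)
    ]
    result = [None] * n
    for i in range(n):
        for k in range(n):
            # Cross-difference recovers the single nonzero in row i.
            if rs[i][k + 1] - rs[i][k] - rs[i + 1][k + 1] + rs[i + 1][k] == 1:
                result[i] = k
    return result


print(seaweed_product_naive([0, 1, 2], [2, 0, 1]))  # identity acts as a unit
print(seaweed_product_naive([1, 0], [1, 0]))
```

Under this convention the identity permutation acts as the unit of the product, and the result of multiplying two permutations is again a permutation.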

Page 296: A superglue for string comparison

Parallel string comparison: Parallel LCS

Seaweed matrix ⊙-multiplication: generalised parallel Steady Ant

$R^\Sigma(i, k) = \min_j \bigl(P^\Sigma(i, j) + Q^\Sigma(j, k)\bigr)$

Divide-and-conquer on the range of j: $p^\varepsilon$ subproblems on $n/p^\varepsilon$-matrices

Conquer: tracing $p^\varepsilon$ balanced trails from bottom-left to top-right of R

Partition n × n index space into a regular p × p grid of blocks

Sample R in block corners: total $p^2$ samples

Steady Ant’s trails pass through ≤ $2p^{1+\varepsilon}$ blocks

Partition blocks across processors: comp, comm = $O\bigl(\frac{n}{p^{1-\varepsilon}}\bigr)$

$\mathit{comp} = O\bigl(\frac{n \log n}{p} + \frac{n}{\varepsilon p^{1-\varepsilon}}\bigr)$, $\mathit{comm} = O\bigl(\frac{n}{p^{1-\varepsilon}}\bigr)$, $\mathit{sync} = O(1/\varepsilon)$

Page 297: A superglue for string comparison

Parallel string comparison: Parallel LCS

Parallel LCS computation (cont.)

Divide-and-conquer on substrings of a, b: log p recursion levels

In every level, semi-local LCS subproblems on input substrings

Conquer phase: generalised parallel Steady Ant algorithm, ε = 1/3

base level: $p = p^{1/2} \times p^{1/2}$ subproblems on $n/p^{1/2}$-substrings; one proc/subproblem; $\mathit{comm} = O(n/p^{1/2})$

middle level: $p^{1/2} = p^{1/4} \times p^{1/4}$ subproblems on $n/p^{1/4}$-substrings; $p^{1/2}$ procs/subproblem; $\mathit{comm} = O\bigl(\frac{n/p^{1/4}}{(p^{1/2})^{1-1/3}}\bigr) = O\bigl(\frac{n}{p^{7/12}}\bigr)$

top level: one subproblem on n-strings; p procs; $\mathit{comm} = O\bigl(\frac{n}{p^{1-1/3}}\bigr) = O\bigl(\frac{n}{p^{2/3}}\bigr)$

$\mathit{sync} = O\bigl(\frac{1}{1/3}\bigr) = O(1)$ in every level

Page 298: A superglue for string comparison

Parallel string comparison: Parallel LCS

Parallel LCS computation (cont.)

comm dominated by base level: $O\bigl(\frac{n}{p^{1/2}}\bigr) + \ldots + O\bigl(\frac{n}{p^{7/12}}\bigr) + \ldots + O\bigl(\frac{n}{p^{2/3}}\bigr) = O\bigl(\frac{n}{p^{1/2}}\bigr)$

$\mathit{comp} = O\bigl(\frac{n^2}{p}\bigr)$, $\mathit{comm} = O\bigl(\frac{n}{p^{1/2}}\bigr)$, $\mathit{sync} = O(\log p)$

Page 299: A superglue for string comparison

Parallel string comparison: Parallel LCS

Parallel LCS on permutation strings (LCSP = LIS)

Sequential time O(n log n), no good way known to parallelise directly

Solution via the more general semi-local LCSP is no longer efficient: sequential time $O(n \log^2 n)$

Divide-and-conquer on substrings/subsequences of a, b

Parallelising normal O(n log n) seaweed multiplication: [Krusche, T: 2010]

comp $O\bigl(\frac{n \log^2 n}{p}\bigr)$, comm $O\bigl(\frac{n \log p}{p}\bigr)$, sync $O(\log^2 p)$

Open problem: can we achieve

comp $O\bigl(\frac{n \log n}{p}\bigr)$ for LCSP?

comp $O\bigl(\frac{n \log^2 n}{p}\bigr)$, comm $O\bigl(\frac{n}{p}\bigr)$, sync $O(\log^2 p)$ for semi-local LCSP?
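The identity LCSP = LIS and the sequential O(n log n) bound can be made concrete with the classical patience-sorting approach. This is a sketch of that standard sequential method, not of the parallel algorithms discussed above; the function names are illustrative, and the inputs are assumed to have no repeated characters.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence, O(n log n)."""
    tails = []  # tails[k]: smallest possible tail of an increasing run of length k + 1
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


def lcs_permutation_strings(a, b):
    """LCS of two strings without repeated characters, via reduction to LIS."""
    pos = {ch: i for i, ch in enumerate(a)}
    # A common subsequence corresponds to an increasing run of positions in a.
    return lis_length([pos[ch] for ch in b if ch in pos])


print(lcs_permutation_strings("abcde", "badce"))
```

The reduction replaces each character of b by its position in a, so that common subsequences become increasing subsequences.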

Page 300: A superglue for string comparison

1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 Sparse string comparison

7 Compressed string comparison

8 Parallel string comparison

9 The transposition network method

10 Beyond semi-locality

Page 301: A superglue for string comparison

The transposition network method: Transposition networks

Comparison network: a circuit of comparators

A comparator sorts two inputs and outputs them in prescribed order

Comparison networks traditionally used for non-branching merging/sorting

Classical comparison networks (# comparators)

merging    $O(n \log n)$      [Batcher: 1968]
sorting    $O(n \log^2 n)$    [Batcher: 1968]
           $O(n \log n)$      [Ajtai+: 1983]

Comparison networks are visualised by wire diagrams

Transposition network: all comparisons are between adjacent wires
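A familiar example of a transposition network is odd-even transposition sort, which only ever compares adjacent wires. The sketch below is that classical network, chosen here for illustration (it is not one of the networks in the table above and uses n(n − 1)/2 comparators); the function names are assumptions.

```python
def comparator(x, y):
    """Sort two inputs and output them in a prescribed order (smaller first)."""
    return (x, y) if x <= y else (y, x)


def odd_even_transposition_network(n):
    """n layers of comparators, each acting on adjacent wires (i, i + 1)."""
    return [[(i, i + 1) for i in range(r % 2, n - 1, 2)] for r in range(n)]


def run_network(network, values):
    """Push the values through the network; no data-dependent branching on structure."""
    wires = list(values)
    for layer in network:
        for i, j in layer:
            wires[i], wires[j] = comparator(wires[i], wires[j])
    return wires


net = odd_even_transposition_network(5)
print(run_network(net, [5, 2, 4, 1, 3]))
```

The network itself is fixed in advance and independent of the input; only the values carried on the wires change.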

Page 303: A superglue for string comparison

The transposition network method: Transposition networks

Seaweed combing as a transposition network

[Figure: wire diagram for the strings ACBC and ABCA; wires carry the values +7, +5, +3, +1, −1, −3, −5, −7]

Character mismatches correspond to comparators

Inputs anti-sorted (sorted in reverse); each value traces a seaweed

Page 304: A superglue for string comparison

The transposition network method: Transposition networks

Global LCS: transposition network with binary input

[Figure: wire diagram for the strings ACBC and ABCA; wires carry the values 1 and 0]

Inputs still anti-sorted, but may not be distinct

Comparison between equal values is indeterminate

Page 305: A superglue for string comparison

The transposition network method: Parameterised string comparison

Parameterised string comparison

String comparison sensitive e.g. to

low similarity: small $\lambda = \mathit{LCS}(a, b)$

high similarity: small $\kappa = \mathit{dist}_{LCS}(a, b) = m + n - 2\lambda$

Can also use weighted alignment score or edit distance

Assume m = n, therefore $\kappa = 2(n - \lambda)$
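The two parameters can be computed together by plain dynamic programming. This is a minimal quadratic-time sketch for illustration only (it is not one of the parameterised algorithms); the function names are assumptions.

```python
def lcs_score(a, b):
    """lambda = LCS(a, b), using a rolling row of the standard DP table."""
    row = [0] * (len(b) + 1)
    for x in a:
        prev_diag = 0  # value of the cell above-left of the current one
        for j, y in enumerate(b, 1):
            above = row[j]
            row[j] = prev_diag + 1 if x == y else max(row[j], row[j - 1])
            prev_diag = above
    return row[len(b)]


def lcs_distance(a, b):
    """kappa = m + n - 2 * lambda."""
    return len(a) + len(b) - 2 * lcs_score(a, b)


a, b = "ACBC", "ABCA"
print(lcs_score(a, b), lcs_distance(a, b))  # lambda = 3, kappa = 2
```

For equal-length strings the distance is always even, in line with κ = 2(n − λ).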

Page 306: A superglue for string comparison

The transposition network method: Parameterised string comparison

Low-similarity comparison: small λ

sparse set of matches, may need to look at them all

preprocess matches for fast searching, time O(n log σ)

High-similarity comparison: small κ

set of matches may be dense, but only need to look at small subset

no need to preprocess, linear search is OK

Flexible comparison: sensitive to both high and low similarity, e.g. by both comparison types running alongside each other

Page 307: A superglue for string comparison

The transposition network method: Parameterised string comparison

Parameterised string comparison: running time

Low-similarity, after preprocessing in O(n log σ)

O(n · λ)    [Hirschberg: 1977] [Apostolico, Guerra: 1985] [Apostolico+: 1992]

High-similarity, no preprocessing

O(n · κ)    [Ukkonen: 1985] [Myers: 1986]

Flexible

O(λ · κ · log n)    no preproc       [Myers: 1986; Wu+: 1990]
O(λ · κ)            after preproc    [Rick: 1995]

Page 308: A superglue for string comparison

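The transposition network method: Parameterised string comparison

Parameterised string comparison: the waterfall algorithm

Low-similarity: O(n · λ)

[Figure: binary transposition network; boundary bit values 0 0 0 0 0 0 0 0 and 1 1 1 1 1 1 1 1 on one pair of sides, 0 0 1 1 0 0 1 0 and 0 1 0 1 1 0 1 1 on the other]

High-similarity: O(n · κ)

[Figure: binary transposition network; boundary bit values 0 0 0 0 0 0 0 0 and 1 1 1 1 1 1 1 1 on one pair of sides, 1 1 1 0 1 1 0 0 and 0 0 0 0 1 1 1 0 on the other]

Trace 0s through network in contiguous blocks and gaps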

Page 331: A superglue for string comparison

The transposition network method: Dynamic string comparison

The dynamic LCS problem

Maintain current LCS score under updates to one or both input strings

Both input strings are streams, updated on-line:

prepending/appending characters at left or right

deleting characters at left or right

Assume for simplicity m ≈ n, i.e. m = Θ(n)

Goal: linear time per update

O(n) per update of a (n = |b|)

O(m) per update of b (m = |a|)

Page 332: A superglue for string comparison

The transposition network method: Dynamic string comparison

Dynamic LCS in linear time: update models

left        right
–           app+del     standard DP [Wagner, Fischer: 1974]
prep        app         a fixed [Landau+: 1998], [Kim, Park: 2004]
prep        app         [Ishida+: 2005]
prep+del    app+del     [T: NEW]

Dynamic semi-local LCS in linear time: update models

left        right
prep+del    app+del     amortised [T: NEW]
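The first row of the table (standard DP, appending and deleting at the right) is simple enough to sketch. The version below is an illustration under simplifying assumptions: string a is kept fixed, only b is updated, and every DP column is stored, so memory is O(mn). The class and method names are hypothetical.

```python
class RightDynamicLCS:
    """Maintain LCS(a, b) while characters are appended to or deleted from the right of b."""

    def __init__(self, a):
        self.a = a
        # cols[j][i] = LCS(a[:i], b[:j]); column 0 corresponds to empty b.
        self.cols = [[0] * (len(a) + 1)]

    def append(self, ch):
        """Append ch to b: compute one new DP column in O(m) time."""
        prev, cur = self.cols[-1], [0]
        for i, x in enumerate(self.a, 1):
            cur.append(prev[i - 1] + 1 if x == ch else max(prev[i], cur[i - 1]))
        self.cols.append(cur)

    def delete(self):
        """Delete the last character of b: discard the last DP column."""
        if len(self.cols) > 1:
            self.cols.pop()

    def score(self):
        return self.cols[-1][-1]


d = RightDynamicLCS("ACBC")
for ch in "ABCA":
    d.append(ch)
print(d.score())
```

Updates at the left end are what make the problem hard: prepending a character changes every stored column, which is where the other rows of the table come in.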

Page 333: A superglue for string comparison

The transposition network method: Bit-parallel string comparison

Bit-parallel string comparison

String comparison using standard instructions on words of size w

Bit-parallel string comparison: running time

O(mn/w) [Allison, Dix: 1986; Myers: 1999; Crochemore+: 2001]

Alexander Tiskin (Warwick) Semi-local LCS 140 / 164

Page 334: A superglue for string comparison

The transposition network method: Bit-parallel string comparison

Bit-parallel string comparison: binary transposition network

In every cell: input bits s, c; output bits s′, c′; match/mismatch flag µ

[Figure: network cell with flag µ, input bits s, c and output bits s′, c′]

s   0 1 0 1 0 1 0 1
c   0 0 1 1 0 0 1 1
µ   0 0 0 0 1 1 1 1
s′  0 1 1 1 0 0 1 1
c′  0 0 0 1 0 1 0 1

[Figure: cell with adder (+) and ∧µ gate; input bits s, c and output bits s′, c′]

s   0 1 0 1 0 1 0 1
c   0 0 1 1 0 0 1 1
µ   0 0 0 0 1 1 1 1
s′  0 1 1 0 0 0 1 1
c′  0 0 0 1 0 1 0 1

2c + s ← (s + (s ∧ µ) + c) ∨ (s ∧ ¬µ)

S ← (S + (S ∧ M)) ∨ (S ∧ ¬M), where S, M are words of bits s, µ
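The word-level update above can be tried out directly. The sketch below is a minimal illustration, assuming the usual convention from the bit-parallel LCS literature: S starts as all ones, M is the match mask of the current character of b against a, and the LCS score is the number of zero bits in S at the end. The function name is hypothetical. A single Python integer plays the role of the word, so this shows the update rule rather than the O(mn/w) word-by-word bookkeeping.

```python
def lcs_bit_parallel(a, b):
    """LCS score of a and b via S <- (S + (S & M)) | (S & ~M)."""
    m = len(a)
    mask = (1 << m) - 1  # keep only the m low bits (drops the final carry)

    # match[ch] has bit i set exactly when a[i] == ch.
    match = {}
    for i, ch in enumerate(a):
        match[ch] = match.get(ch, 0) | (1 << i)

    S = mask  # all ones initially
    for ch in b:
        M = match.get(ch, 0)
        S = ((S + (S & M)) | (S & ~M)) & mask

    # Each zero bit in S accounts for one unit of LCS score.
    return m - bin(S).count("1")
```

For example, `lcs_bit_parallel("abcbdab", "bdcaba")` gives the same score as the standard dynamic programming table for these two strings.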

Alexander Tiskin (Warwick) Semi-local LCS 141 / 164

Page 336: A superglue for string comparison

The transposition network method: Bit-parallel string comparison

High-similarity bit-parallel string comparison

κ = dist_LCS(a, b)

Assume κ odd, m = n

Waterfall algorithm within diagonal band of width κ + 1: time O(nκ/w)

Band waterfall supported from below by separator matches

Alexander Tiskin (Warwick) Semi-local LCS 142 / 164

Page 337: A superglue for string comparison

The transposition network method: Bit-parallel string comparison

High-similarity bit-parallel multi-string comparison: a vs b_0, ..., b_{r−1}

κ_i = dist_LCS(a, b_i) ≤ κ, 0 ≤ i < r

Waterfalls within r diagonal bands of width κ + 1: time O(nrκ/w)

Each band’s waterfall supported from below by separator matches

Alexander Tiskin (Warwick) Semi-local LCS 143 / 164

Page 339: A superglue for string comparison

Beyond semi-locality: Fixed-length substring LCS

Given window length w, let w-window be any substring of length w

The window-substring LCS problem

Give LCS score for every w-window of a vs every substring of b

Window-substring LCS: running time

O(mn^3 w)   naive
O(mnw)      multiple seaweed
O(mn)       [Krusche, T: 2010]

Alexander Tiskin (Warwick) Semi-local LCS 145 / 164

Page 341: A superglue for string comparison

Beyond semi-locality: Fixed-length substring LCS

Window-substring LCS: the algorithm

Compute seaweed matrices for canonical substrings of a vs b recursively

Compute seaweed matrices for w-windows of a vs b, using the decomposition of each w-window into canonical substrings

Both stages use matrix distance multiplication, overall time O(mn)
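The decomposition step can be illustrated on index ranges alone. The sketch below assumes that canonical substrings are the dyadic ones (length 2^k, starting at a multiple of 2^k); that definition, and the function name, are assumptions made for illustration. Under it, every window splits into O(log w) canonical pieces, whose seaweed matrices would then be combined by multiplication.

```python
def canonical_decomposition(lo, hi):
    """Split the half-open index range [lo, hi) into maximal dyadic ranges.

    Each returned range (l, r) has a power-of-two length r - l,
    and l is a multiple of that length.
    """
    parts = []
    while lo < hi:
        # Largest power of two dividing lo; for lo == 0 any power of two fits.
        size = lo & -lo if lo else 1 << hi.bit_length()
        # Shrink until the block fits inside the remaining range.
        while size > hi - lo:
            size >>= 1
        parts.append((lo, lo + size))
        lo += size
    return parts
```

For the window covering positions 3 to 10, `canonical_decomposition(3, 11)` returns `[(3, 4), (4, 8), (8, 10), (10, 11)]`.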

Alexander Tiskin (Warwick) Semi-local LCS 146 / 164

Page 342: A superglue for string comparison

Beyond semi-locality: Fixed-length substring LCS

The window-window LCS problem

Give LCS score for every w-window of a vs every w-window of b

Provides an LCS-scored alignment plot (by analogy with Hamming-scored dot plot), a useful tool for studying weak conservation in the genome

Window-window LCS: running time

O(mnw^2)   naive
O(mnw)     repeated seaweed combing
O(mn)      [Krusche, T: 2010]
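For orientation, the naive O(mnw^2) bound corresponds to running a standard LCS dynamic program on every pair of windows. A minimal sketch of that baseline follows; the function names are hypothetical, and this is the naive method only, not the seaweed-based algorithms.

```python
def lcs(x, y):
    """Standard O(|x| * |y|) LCS score, keeping one DP row at a time."""
    prev = [0] * (len(y) + 1)
    for xc in x:
        cur = [0]
        for j, yc in enumerate(y, start=1):
            if xc == yc:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def window_window_lcs(a, b, w):
    """plot[i][j] = LCS score of a[i:i+w] vs b[j:j+w] (the alignment plot)."""
    return [
        [lcs(a[i:i + w], b[j:j + w]) for j in range(len(b) - w + 1)]
        for i in range(len(a) - w + 1)
    ]
```

Each of the roughly mn window pairs costs O(w^2), which is where the naive bound comes from.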

Alexander Tiskin (Warwick) Semi-local LCS 147 / 164

Page 344: A superglue for string comparison

Beyond semi-locality: Fixed-length substring LCS

An implementation of alignment plots [Krusche: 2010]

Based on O(mnw) repeated seaweed combing approach, using some ideas from the optimal O(mn) algorithm: resulting time O(mnw^0.5)

Weighted alignment (match 1, mismatch 0, gap −0.5) via alignment dag blow-up: ×4 slowdown

C++, Intel assembly (x86, x86_64, MMX/SSE2 data parallelism)

SMP parallelism (currently two processors)

Alexander Tiskin (Warwick) Semi-local LCS 148 / 164

Page 345: A superglue for string comparison

Beyond semi-locality: Fixed-length substring LCS

Krusche’s implementation used at Warwick Systems Biology Centre to study conservation in plant non-coding DNA

More sensitive than BLAST and other heuristic (seed-and-extend) methods

Found hundreds of conserved non-coding regions between papaya (Carica papaya), poplar (Populus trichocarpa), thale cress (Arabidopsis thaliana) and grape (Vitis vinifera)

Single processor:

×10 speedup over heavily optimised, bit-parallel naive algorithm

×7 speedup over biologists’ heuristics

Two processors: extra ×2 speedup, near-perfect parallelism

Alexander Tiskin (Warwick) Semi-local LCS 149 / 164

Page 346: A superglue for string comparison

Beyond semi-locality: Fixed-length substring LCS

[Figure: first page of the paper. TECHNICAL ADVANCE: Evolutionary analysis of regulatory sequences (EARS) in plants. Emma Picot, Peter Krusche, Alexander Tiskin, Isabelle Carré and Sascha Ott. The Plant Journal (2010) 64, 165–176, doi: 10.1111/j.1365-313X.2010.04314.x]

Alexander Tiskin (Warwick) Semi-local LCS 150 / 164

Page 347: A superglue for string comparison

Beyond semi-locality: Fixed-length substring LCS

[Figure: first page of the paper. LARGE-SCALE BIOLOGY ARTICLE: Conserved Noncoding Sequences Highlight Shared Components of Regulatory Networks in Dicotyledonous Plants. Laura Baxter, Aleksey Jironkin, Richard Hickman, Jay Moore, Christopher Barrington, Peter Krusche, Nigel P. Dyer, Vicky Buchanan-Wollaston, Alexander Tiskin, Jim Beynon, Katherine Denby and Sascha Ott. The Plant Cell, Vol. 24: 3949–3965, October 2012, www.plantcell.org/cgi/doi/10.1105/tpc.112.103010]

Conserved noncoding sequences (CNSs) in DNA are reliable pointers to regulatory elements controlling gene expression. Using a comparative genomics approach with four dicotyledonous plant species (Arabidopsis thaliana, papaya [Carica papaya], poplar [Populus trichocarpa], and grape [Vitis vinifera]), we detected hundreds of CNSs upstream of Arabidopsis genes. Distinct positioning, length, and enrichment for transcription factor binding sites suggest these CNSs play a functional role in transcriptional regulation. The enrichment of transcription factors within the set of genes associated with CNS is consistent with the hypothesis that together they form part of a conserved transcriptional network whose function is to regulate other transcription factors and control development. We identified a set of promoters where regulatory mechanisms are likely to be shared between the model organism Arabidopsis and other dicots, providing areas of focus for further research.

Alexander Tiskin (Warwick) Semi-local LCS 151 / 164

Page 348: A superglue for string comparison

Beyond semi-locality: Fixed-length substring LCS

Conserved DNA could prove crop science shortcut

by Katy Edgington, 06 December 2012

"We can study selected conserved regions experimentally to understand some of the most ancient regulatory mechanisms underlying plant development." Dr Sascha Ott

Researchers from the University of Warwick have found that hundreds of regions of non-coding DNA are conserved among various plant species that have been evolving separately for around 100 million years. The findings, which have been detailed in the journal The Plant Cell, have led the team to believe that these sequences, although they are not genes, could provide vital information about plant development and functioning.

Computational analysis of the genomes of papaya (Carica papaya), poplar (Populus trichocarpa), thale cress (Arabidopsis thaliana) and grape (Vitis vinifera) showed that all four species contained conserved non-coding sequences (CNSs) which may be important for determining whether neighbouring genes are expressed in the organism.

Dr Sascha Ott, Associate Professor at Warwick Systems Biology Centre and one of the report’s co-authors, answered ScienceOmega.com’s questions about the study, explaining the rationale behind the researchers’ choice of species, as well as the implications and potential applications of this research as a short cut for biologists and crop scientists wishing to address such issues as food security…

Why were papaya, poplar, Arabidopsis and grape the species chosen for analysis?

We initially analysed DNA sequences for only a handful of genes in order to establish which species would be most amenable to our comparative genomics approach. On the one hand, we needed species that were sufficiently distant from our model organism, Arabidopsis, for non-functional DNA sequences to have lost sequence similarity by accumulating many mutations.

On the other hand, we needed species to be sufficiently close such that non-coding functional sequences (which accumulate mutations at a slower rate) can still be detected. The choice of species was further restricted by the availability of whole-genome sequences at the time we started our project. Some relevant species such as potato, tomato, soybean and cotton are closely related to some of the species that we have already analysed and we know that these species share at least some of the conserved non-coding regions – so these species make great targets for expanding our work.

Did you expect to find that the non-coding sequences were so often conserved?

No, we did not. In fact after our initial analyses and after reviewing previously existing literature we were concerned that we may not find very much at all. Luckily, it turned out that a large set of genes have conserved non-coding sequences neighbouring the genes.

How much do we know about the function(s) of these regions of plant DNA?

There are examples of non-coding DNA regions that have previously been studied intensively. For example, Alabadi et al found a short region of DNA that is necessary and sufficient to drive the time-of-day dependent expression of a key gene called TOC1. We previously found that the region they identified is closely matching with one of the conserved regions that we can identify with our method.

The general picture we have from studies on other organisms is that conserved non-coding regions often control the spatial, temporal, and condition-specific expression of genes that they are close to. The function of each of the regions we

Conserved DNA could prove crop science shortcut - Science Omega http://www.scienceomega.com/article/732/conserved-dna-could-prove-...

Alexander Tiskin (Warwick) Semi-local LCS 152 / 164

Page 349: A superglue for string comparison

Beyond semi-locality: Fixed-length substring LCS

Publications:

[Picot+: 2010] Evolutionary Analysis of Regulatory Sequences (EARS) in Plants. Plant Journal.

[Baxter+: 2012] Conserved Noncoding Sequences Highlight Shared Components of Regulatory Networks in Dicotyledonous Plants. Plant Cell.

Web service: http://wsbc.warwick.ac.uk/ears

Warwick press release (with video): http://www2.warwick.ac.uk/newsandevents/pressreleases/discovery_of_100

Future work: genomic repeats; genome-scale comparison

Alexander Tiskin (Warwick) Semi-local LCS 153 / 164

Page 350: A superglue for string comparison

Beyond semi-locality: Quasi-local LCS and spliced alignment

Given m (possibly overlapping) prescribed substrings in a

The quasi-local LCS problem

Give LCS score for every prescribed substring of a vs every substring of b

Quasi-local LCS: running time

O(m^2 n)         repeated seaweed combing
O(mn log^2 m)    [T: unpublished]

Alexander Tiskin (Warwick) Semi-local LCS 154 / 164

Page 352: A superglue for string comparison

Beyond semi-locality: Quasi-local LCS and spliced alignment

Quasi-local LCS: the algorithm

Compute seaweed matrices for canonical substrings of a vs b recursively

Compute seaweed matrices for prescribed substrings of a vs b, using the decomposition of each prescribed substring into canonical substrings

Both stages use matrix distance multiplication, time O(mn log^2 m)

Alexander Tiskin (Warwick) Semi-local LCS 155 / 164

Page 353: A superglue for string comparison

Beyond semi-locality: Quasi-local LCS and spliced alignment

The spliced alignment problem

Give the chain of non-overlapping prescribed substrings in a, closest to b by alignment score

Describes gene assembly from candidate exons, given a known similar gene

Assume m = n; let s = total size of prescribed substrings

Spliced alignment: running time

O(ns) = O(n^3)    [Gelfand+: 1996]
O(n^2.5)          [Kent+: 2006]
O(n^2 log^2 n)    [T: unpublished]
O(n^2 log n)      [Sakai: 2009]

Alexander Tiskin (Warwick) Semi-local LCS 156 / 164

Page 355: A superglue for string comparison

Beyond semi-locality: Local alignment

Assume a fixed set of character alignment scores

The Smith–Waterman local alignment problem

For every prefix of a and every prefix of b, give a pair of respective suffixes of these prefixes that have the highest alignment score against each other

Smith–Waterman local alignment: running time

O(m^2 n^2)   [naive]
O(mn)        [Smith, Waterman: 1981]

Alexander Tiskin (Warwick) Semi-local LCS 157 / 164

Page 356: A superglue for string comparison

Beyond semi-locality: Local alignment

Smith–Waterman local alignment [SW: 1981]

Sweep cells in any ≪-compatible order

Cell update: time O(1)

Overall time O(mn)
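The sweep can be sketched as a short table-filling routine. The scores used here (match 1, mismatch −1, gap −1) are placeholder assumptions standing in for the fixed set of character alignment scores, and the function name is hypothetical; cells are visited row by row, which is one valid order since each cell depends only on its top, left and top-left neighbours.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return table H, where H[i][j] is the highest alignment score of
    a suffix of a[:i] against a suffix of b[:j] (empty suffixes allowed)."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # O(1) cell update: restart at 0, or extend one of three neighbours.
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + s,
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
    return H
```

The best local alignment score overall is the maximum entry of H, and the table is filled in O(mn) time in total.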

Smith–Waterman local alignment maximises alignment score irrespective of length

May not always be the most relevant alignment, e.g. may be

too short (would prefer a longer alignment even with a lower score)

too long (would prefer a significantly shorter alignment with slightly lower score)

Alexander Tiskin (Warwick) Semi-local LCS 158 / 164

Page 357: A superglue for string comparison

Beyond semi-locality: Local alignment

The bounded substring alignment problem

For every prefix of string a and every prefix of string b, give a pair of respective suffixes of these prefixes of length at least/at most t that have the highest alignment score against each other

Solved efficiently using window-substring alignment as a subroutine, e.g. for alignments of length at least t:

solve the window-substring alignment problem, obtaining implicit alignment score for every t-window in a against every substring in b;

obtain explicit optimal alignment score for every t-window in a against all substrings in b terminating in a given position;

extend locally optimal window-substring alignment scores to optimal substring-substring scores as in Smith–Waterman algorithm

Alexander Tiskin (Warwick) Semi-local LCS 159 / 164

Page 359: A superglue for string comparison

Conclusions and open problems (1)

Implicit unit-Monge matrices:

equivalent to seaweed braids

distance multiplication in time O(n log n)

Semi-local LCS problem:

representation by implicit unit-Monge matrices

generalisation to rational alignment scores

Seaweed combing and micro-block speedup:

a simple algorithm for semi-local LCS

semi-local LCS in time O(mn), and even slightly o(mn)

Sparse string comparison:

fast LCS on permutation strings

fast max-clique in a circle graph

Alexander Tiskin (Warwick) Semi-local LCS 161 / 164

Page 360: A superglue for string comparison

Conclusions and open problems (2)

Compressed string comparison:

LCS and approximate matching on plain pattern against GC-text

Parallel string comparison:

comm- and sync-efficient parallel LCS

Transposition networks:

simple interpretation of parameterised and bit-parallel LCS

dynamic LCS

Beyond semi-locality:

window alignment and sparse spliced alignment

Alexander Tiskin (Warwick) Semi-local LCS 162 / 164

Page 361: A superglue for string comparison

Conclusions and open problems (3)

Open problems (theory):

real-weighted alignment?

max-clique in circle graph: O(n log^2 n) → O(n log n)?

compressed LCS: other compression models?

parallel LCS: comm per processor O(n log p / p^{1/2}) → O(n / p^{1/2})?

lower bounds?

Open problems (biological applications):

real-weighted alignment?

good pre-filtering (e.g. q-grams)?

application for multiple alignment? (NB: NP-hard)

Alexander Tiskin (Warwick) Semi-local LCS 163 / 164

Page 362: A superglue for string comparison

Thank you

Alexander Tiskin (Warwick) Semi-local LCS 164 / 164