computing longest common substring and all palindromes from compressed strings
DESCRIPTION
Computing longest common substring and all palindromes from compressed strings. Wataru Matsubara 1 , Shunsuke Inenaga 2 , Akira Ishino 1 , Ayumi Shinohara 1 , Tomoyuki Nakamura 1 , Kazuo Hashimoto 1 1 Graduate School of Information Sciences Tohoku University, Japan - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/1.jpg)
Computing longest common substring and all palindromes
from compressed stringsWataru Matsubara1, Shunsuke Inenaga2,
Akira Ishino1, Ayumi Shinohara1, Tomoyuki Nakamura1, Kazuo Hashimoto1
1Graduate School of Information SciencesTohoku University, Japan
2Department of Computer Science and Communication Engineering,
Kyushu University, Japan
![Page 2: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/2.jpg)
Background and motivations
![Page 3: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/3.jpg)
What is compressed string algorithm?
A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :
A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :
input text
![Page 4: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/4.jpg)
What is compressed string algorithm?
A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :
A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :
input text
find palindromes
output
mmisizziprefrepiborroworrobwasitabarorabatisowoo :
![Page 5: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/5.jpg)
What is compressed string algorithm?
A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :
A palindrome is a symmetric string. It is interesting on their own as word puzzles.For example, “I prefer pi“, ”Borrow or rob?“, and “Was it a bar or a bat I saw?“ and so on. :
find palindromes
output
mmisizziprefrepiborroworrobwasitabarorabatisowoo :
decompress
e)%eARY)(ReJD)OIHOIFEnkkdiwe02kfo)J”LPEPJ9wEOW*# eO…
e)%eARY)(ReJD)OIHOIFEnkkdiwe02kfo)J”LPEPJ9wEOW*# eO…
compressed text One solution would be to decompress the compressed text.
The decompressed size can be exponentially large with respect to the compressed size.
decompressed text
![Page 6: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/6.jpg)
Goal of algorithms for Compressed strings
• Process the compressed text without decompression.
• Processing time should be polynomial in n.– Decompressed size can be exponentially large
with respect to n.
n : the size of compressed text
![Page 7: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/7.jpg)
Compressed schemes
• run-length encoding• Lempel-Ziv• grammar based compression :
Straight Line Program
[Rytter2003] Resulting achieve of most practical
compression methods can be transformed into SLP generating the same original text.
[Rytter2003] Resulting achieve of most practical
compression methods can be transformed into SLP generating the same original text.
![Page 8: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/8.jpg)
SLP TSLP T
T : sequence of assignments X1 = expr1 ; X2 = expr2; … ; Xn = exprn;
Xk : variable,a ( a Xi Xj ( i, j < k ).
exprk :
Definition of Straight Line Program (SLP)
SLP T for string w is a CFG in Chomsky normal form s.t. L(T) = {w}.
![Page 9: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/9.jpg)
Straight Line Program (SLP)Example
X1 = a
X2 = b
X3 = X1X2
X4 = X3X1
X5 = X3X4
X6 = X5X5
X7 = X4X6
X8 = X7X5
n
NN = O(2n)
T =
SLP
![Page 10: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/10.jpg)
Straight Line Program (SLP)Example
X1 = a
X2 = b
X3 = X1X2
X4 = X3X1
X5 = X3X4
X6 = X5X5
X7 = X4X6
X8 = X7X5
n
NN = O(2n)
T =
SLP
X8
X7 X5
![Page 11: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/11.jpg)
Efficient algorithms for compressed strings
• substring matching– Karpinski et al (1996) O(n4logn) time
– Miyazaki et al (1997) O(n4) time
– Lifshits (2006) O(n3) time
• minimum period– Karpinski et al (1996) O(n4logn) time
– Lifshits (2006) O(n3logN) time
• all squares– Gasieniec et al (1994) O(n6log5N) time
![Page 12: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/12.jpg)
Hardness results
• Subsequence pattern matching– Lifshits and Lohrey (2006) NP-hard
• Longest common subsequence– Lifshits and Lohrey (2006) NP-hard
• Hamming distance
– Lifshits (2007) #P-complete
Is there any reasonable comparison measurement for compressed strings?
![Page 13: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/13.jpg)
a b a a b a
a a b b a a
String comparison measures
a b a a b a
a a b b a a
Hamming distance
Longest common subsequence
Longest common substring
#P-comprete[Lifshits 07]
NP-hard[Lifshits and Lohrey06]
??
O(N)uncompressed text
compressed text
O(N2 / logN) O(N)
we solve this problemwe solve this problem
a b a a b a
a a b b a a
![Page 14: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/14.jpg)
Our results
![Page 15: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/15.jpg)
ProblemGiven two SLP T and S that are descriptions of text T and S respectively, compute LCStr(T, S).
LCStr(T, S) : the length of longest common substring of T and Sn : the total size of the input SLP
Our Result1: Longest Common Substring
TheoremLCStr(T, S) can be computed in OO((nn44loglognn)) timeusing OO((nn33)) space.
TheoremLCStr(T, S) can be computed in OO((nn44loglognn)) timeusing OO((nn33)) space.
![Page 16: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/16.jpg)
ProblemGiven SLP T, compute (compressed representations) the set of all palindromes of T.
n : the size of SLP TN : the length of original text T (note that N = O(2n)
Previous best result: O(n5log4N) time
Our Result2: palindromes
[Gasienec et al 1996]
TheoremThe problem can be solved in OO((nn44)) time using OO((nn22)) space.
TheoremThe problem can be solved in OO((nn44)) time using OO((nn22)) space.
![Page 17: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/17.jpg)
Details of our algorithm
Computing longest common substringComputing palindromes (omitted in this talk)
![Page 18: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/18.jpg)
Property of common substrings (1/3)
• For each common substring Z of string S and T, there always exists a variable Xi = XlXr and Yj = YLYR
such that:– Z is a common substring of Xi and Yj
– Z contains an overlap between Xl and YR
common substringcommon substring
ZZ
ZZ
XiXl Xr
Yj
YL YR
ww
OverlapOverlap
![Page 19: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/19.jpg)
Property of common substrings (2/3)
• For each common substring Z of string S and T, there always exists a string w such that:– w is a substring of Z– w is an overlap of variables of S and T
ww
XiXl Xr
Yj
YL YR
OverlapOverlap
![Page 20: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/20.jpg)
Property of common substrings (3/3)
• For each common substring Z of string S and T, there always exists a string w such that:– Z can be calculate by expanding w
common substringcommon substring
wwZZ
ZZ
XiXl Xr
Yj
YL YR Extend ProcessExtend Process
OverlapOverlap
![Page 21: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/21.jpg)
For any strings X, Y,
Overlaps (OL)
the set of the lengths of overlaps of X and Y.
the set of the lengths of overlaps of X and Y.
X
Y
![Page 22: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/22.jpg)
a a b a a b a
OverlapsExample
OL(“aabaaba”, “abaababb”) = {1, 3, 6} Xl
a b a a b a a b a bYR
a b a a b a a b a bYR
a b a a b a a b a bYR
![Page 23: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/23.jpg)
Computing Overlaps[Karpinski et al 1996]
LemmaFor any variables Xi and Xj of SLP T, OL(Xi, Xj) can be represented by O(n) arithmetic progressions.
Xi
Yj
Theorem For any SLP T, OL(Xi, Xj) can be computed in total of O(n4logn) time and O(n3) space.
![Page 24: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/24.jpg)
a b a ∈ OL(Xl, YR)
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YRYL
![Page 25: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/25.jpg)
a b a ∈ OL(Xl, YR)
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YRYL
matchmatch
![Page 26: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/26.jpg)
a b a ∈ OL(Xl, YR)
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YRYL
matchmatch
![Page 27: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/27.jpg)
a b a ∈ OL(Xl, YR)
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YRYL
matchmatch
![Page 28: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/28.jpg)
a b a ∈ OL(Xl, YR)
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YRYL
mismatchmismatch
![Page 29: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/29.jpg)
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YrYl
a b a ∈ OL(Xl, YR)
mismatchmismatch
![Page 30: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/30.jpg)
How to extend overlaps
a a a b a b a b a a b a b a b b
a a b a a b a a b a b a b a a b a
Xl Xr
Xi
Yj
YrYl
a b a ∈ OL(Xl, YR)
We are not allowed to process character by character.
We are not allowed to process character by character.
![Page 31: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/31.jpg)
First-mismatch function[Karpinski et al 1996]
input : SLP variables Xi and Yj , integer k
output : position of first mismatch
Mismatch
Mismatch
k Yj
a b a b a a b a b a a b
Xi
a b a b a b a a b a
p p [p]}-1
![Page 32: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/32.jpg)
First-mismatch function[Karpinski et al 1996]
Lemma Provided that the sets of overlaps are already computed, FM(Xi, Yj, k) can be computed in O(nlogn) time.
![Page 33: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/33.jpg)
Extending overlaps using FM function
Lemma Extending overlaps can be done by O(n) calls of FM function.
![Page 34: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/34.jpg)
O(n2) items O(n2) items
pseudo-code Computing longest common substring
O(n) calls of FM function.O(n) calls of FM function.
O(nlogn) timesO(nlogn) times
Totally, LCStr (S, T) can be computed inO(n2×n×nlogn ) = O ( n4logn ) time.
![Page 35: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/35.jpg)
Conclusions
• Computing longest common substring from compressed string– O(n4logn) time and O(n3) space
• Computing all palindromes from compressed string– O(n4) time and O(n2) space
![Page 36: Computing longest common substring and all palindromes from compressed strings](https://reader035.vdocuments.us/reader035/viewer/2022070405/56813ff4550346895dab1006/html5/thumbnails/36.jpg)
Thank you for your attention.