Reducing the Space Requirement of LZ-index
Diego Arroyuelo¹, Gonzalo Navarro¹, and Kunihiko Sadakane²
¹ Dept. of Computer Science, Univ. of Chile
² Dept. of Computer Science and Communication Engineering, Kyushu Univ.
Barcelona – July 7, 2006
Outline
Introduction
The LZ-index (A Review)
LZ-index as a Navigation Scheme
Suffix-Links in the Reverse Trie
xbw LZ-index
Displaying Text Substrings
Conclusions
Problem definition
The full-text search problem: find the occ occurrences of a pattern P[1…m] in a text T[1…u] over an alphabet of size σ
To provide fast access to T while requiring little space, we use compressed full-text self-indexes, which:
replace T and in addition give indexed access to it, and
take space proportional to the compressed text size (O(uH_k(T)) bits)
Main motivation: to store the indexes of very large texts entirely in main memory
H_k(T) is the k-th order empirical entropy of T:
H_k(T) ≤ H_{k-1}(T) ≤ … ≤ H_0(T) ≤ log σ
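The entropy chain above can be checked numerically. The following is a minimal sketch, assuming the usual definition of H_k as the weighted H_0 of the symbols that follow each length-k context; the function names `h0` and `hk` are illustrative and do not come from the talk.

```python
from collections import Counter, defaultdict
from math import log2


def h0(symbols):
    """Zero-order empirical entropy of a sequence, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in Counter(symbols).values())


def hk(text, k):
    """k-th order empirical entropy of text, in bits per symbol."""
    if not text:
        return 0.0
    if k == 0:
        return h0(text)
    followers = defaultdict(list)      # length-k context -> symbols that follow it
    for i in range(len(text) - k):
        followers[text[i:i + k]].append(text[i + k])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)


text = "alabar a la alabarda$"
sigma = len(set(text))
for k in (2, 1, 0):
    print(f"H_{k} = {hk(text, k):.3f}")
print(f"log sigma = {log2(sigma):.3f}")
```

Larger k can only lower the value, which is why a space bound stated in terms of H_k is stronger than one stated in terms of H_0.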
Our results
LZ-index [Navarro, 2004]:
Space: 4uH_k(T) + o(u log σ) bits, k = o(log_σ u)
Reporting: O(m³ log σ + (m + occ) log u)
Displaying: O(ℓ log σ), for a substring of length ℓ
Our results:
Space: (2+ε)uH_k(T) + o(u log σ) bits, for any constant 0 < ε < 1
Reporting: O(m² log m + (m + occ) log u)
Displaying: O(ℓ / log_σ u) (optimal)
But also:
Space: (1+ε)uH_k(T) + o(u log σ) bits
Reporting: O(m²) (average case), for m ≥ 2 log_σ u
The main drawback of LZ-index is the factor 4 in the space complexity
LZ-index is faster to report and to display (very important for a self-index!)
Our results in context
Our data structures:
Size O(uH_k(T)) bits
O(log u) time per occurrence reported, if σ = Θ(polylog(u))
There are competing schemes requiring the same or better complexity for reporting
The case σ = Θ(polylog(u)) represents moderate-size alphabets and is very common in practice, but does not fit in competing schemes
The LZ-index (a review)
The index is built on the LZ78 parsing of T and has four components: LZTrie, RevTrie, Node, and Range
We don’t need to store the text!
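As a reminder of how the LZ78 parsing behind LZTrie works, here is a minimal sketch. It assumes the text ends with a unique terminator (here `$`) so that the last phrase is complete; the names `lz78_parse` and `phrase_text` are illustrative only.

```python
def lz78_parse(text):
    """Parse text into LZ78 phrases.

    Returns a list of (parent_id, symbol) pairs: entry k says that phrase k
    is phrase parent_id followed by symbol. Phrase 0 is the empty string.
    """
    child = {}                 # (parent_id, symbol) -> phrase id: the LZTrie arcs
    phrases = [(None, "")]     # phrase 0 is the trie root
    node = 0
    for symbol in text:
        nxt = child.get((node, symbol))
        if nxt is not None:
            node = nxt         # keep extending the current phrase
        else:
            child[(node, symbol)] = len(phrases)
            phrases.append((node, symbol))
            node = 0           # the next phrase starts at the root
    if node != 0:
        raise ValueError("text must end with a unique terminator such as '$'")
    return phrases


def phrase_text(phrases, k):
    """Spell out phrase k by walking parent pointers up to the root."""
    out = []
    while k != 0:
        parent, symbol = phrases[k]
        out.append(symbol)
        k = parent
    return "".join(reversed(out))


text = "alabar a la alabarda$"
parse = lz78_parse(text)
print([phrase_text(parse, k) for k in range(1, len(parse))])
# ['a', 'l', 'ab', 'ar', ' ', 'a ', 'la', ' a', 'lab', 'ard', 'a$']
```

Concatenating the phrases in order gives back the text, which is why the trie of phrases can stand in for T itself.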
Succinct representation of the data structures
Assume n is the number of phrases in the LZ78 parsing of T
LZTrie:
par: the balanced parentheses representation of LZTrie (2n + o(n) bits)
lets: the symbols labelling the arcs of LZTrie, in preorder (n log σ bits)
ids: the phrase identifiers in preorder (n log n bits)
RevTrie:
rpar: the balanced parentheses representation of RevTrie (4n + o(n) bits)
rids: the phrase identifiers in preorder (n log n bits)
Node: an array requiring n log(2n) = n log n + n bits
Range: implemented using [Chazelle, 1988], requiring n log n (1 + o(1)) bits
Succinct representation of the data structures
We have four n log n-bit terms
As n log n = uH_k(T) + o(u log σ), for k = o(log_σ u), the LZ-index requires 4n log n (1 + o(1)) = 4uH_k(T) + o(u log σ) bits
To reduce the space requirement we must reduce the number of n log n-bit terms in the index
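To see how strongly the n log n terms dominate, the component sizes listed for the original LZ-index can be added up for concrete values. This is only a back-of-the-envelope sketch: it keeps the leading terms, drops every o(·) term, and the values of n and σ are arbitrary assumptions.

```python
from math import log2


def lz_index_bits(n, sigma):
    """Leading terms, in bits, of each LZ-index component for n phrases."""
    return {
        "par": 2 * n,
        "lets": n * log2(sigma),
        "ids": n * log2(n),
        "rpar": 4 * n,
        "rids": n * log2(n),
        "Node": n * log2(2 * n),
        "Range": n * log2(n),
    }


n, sigma = 2**20, 256
sizes = lz_index_bits(n, sigma)
total = sum(sizes.values())
dominant = 4 * n * log2(n)      # ids, rids, Node and Range each contribute n log n
print(f"total: {total / n:.0f} bits per phrase")
print(f"share of the four n log n terms: {dominant / total:.0%}")
```

Removing one n log n term therefore saves far more than any tuning of the parentheses or symbol arrays could.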
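Search Algorithm
Occurrences of Type 1: P lies within a single phrase
Occurrences of Type 2: P spans two consecutive phrases
Occurrences of Type 3: P spans three or more consecutive phrases
[Figure: an occurrence of P over the consecutive phrases B_{k-1}, B_k, …, B_l, B_{l+1}]
Reporting time: O(m³ log σ + (m + occ) log n)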
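The three occurrence types depend only on how many LZ78 phrases an occurrence touches. The sketch below classifies occurrences by brute force on the plain text, purely to illustrate the definitions; it is not how the index searches, and the helper names are assumptions.

```python
from bisect import bisect_right


def lz78_phrases(text):
    """LZ78 phrase strings of text (text must end with a unique terminator)."""
    seen, phrases, cur = {""}, [], ""
    for c in text:
        if cur + c in seen:
            cur += c
        else:
            seen.add(cur + c)
            phrases.append(cur + c)
            cur = ""
    if cur:
        raise ValueError("text must end with a unique terminator such as '$'")
    return phrases


def classify_occurrences(text, pattern):
    """Return (position, type) for every occurrence of pattern in text."""
    starts, pos = [], 0
    for p in lz78_phrases(text):
        starts.append(pos)             # text position where each phrase begins
        pos += len(p)
    result = []
    i = text.find(pattern)
    while i != -1:
        first = bisect_right(starts, i) - 1                     # phrase holding the first symbol
        last = bisect_right(starts, i + len(pattern) - 1) - 1   # phrase holding the last symbol
        result.append((i, min(last - first + 1, 3)))
        i = text.find(pattern, i + 1)
    return result


text = "alabar a la alabarda$"
print(classify_occurrences(text, "alab"))   # [(0, 3), (12, 2)]
print(classify_occurrences(text, "la"))     # [(1, 2), (9, 1), (13, 1)]
```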
Solving Occurrences of Type 1
Find the shortest possible LZ78 phrases containing P
By LZ78, P is a suffix of such phrases
[Figure: LZTrie, with the subtrees containing occurrences of type 1 rooted at nodes whose phrases end with P]
Solving Occurrences of Type 1
As P is a suffix of such phrases, P^r is a prefix of the corresponding reverse phrases
We need the Reverse Trie (RevTrie) to solve this problem
[Figure: searching for P^r in RevTrie leads to the LZTrie nodes whose phrases end with P]
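The type 1 reasoning can be sketched with sorted lists standing in for the two tries. This is an illustration under simplifying assumptions: reversed phrases sorted lexicographically play the role of RevTrie, children lists play the role of LZTrie, and none of the succinct representations are modelled. All names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def lz78_phrases(text):
    """LZ78 phrase strings, with phrase 0 = empty string."""
    ids, phrases, cur = {""}, [""], ""
    for c in text:
        if cur + c in ids:
            cur += c
        else:
            ids.add(cur + c)
            phrases.append(cur + c)
            cur = ""
    if cur:
        raise ValueError("text must end with a unique terminator such as '$'")
    return phrases


def type1_phrases(phrases, pattern):
    """Ids of all phrases that contain pattern."""
    ids = range(1, len(phrases))
    # RevTrie stand-in: reversed phrases in lexicographic order.
    rev = sorted((phrases[i][::-1], i) for i in ids)
    keys = [r for r, _ in rev]
    pr = pattern[::-1]
    # Phrases ending with P are the reversed phrases starting with P^r;
    # they form one contiguous interval, like a RevTrie subtree.
    lo = bisect_left(keys, pr)
    hi = lo
    while hi < len(keys) and keys[hi].startswith(pr):
        hi += 1
    roots = [pid for _, pid in rev[lo:hi]]
    # LZTrie stand-in: the parent of a phrase is the phrase minus its last symbol.
    id_of = {p: i for i, p in enumerate(phrases)}
    children = defaultdict(list)
    for i in ids:
        children[id_of[phrases[i][:-1]]].append(i)
    # Every phrase in the LZTrie subtree of a root contains P.
    found, stack = set(), roots
    while stack:
        v = stack.pop()
        if v not in found:
            found.add(v)
            stack.extend(children[v])
    return sorted(found)


phrases = lz78_phrases("alabar a la alabarda$")
print(type1_phrases(phrases, "la"))   # [7, 9]: the phrases 'la' and 'lab'
```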
Solving Occurrences of Type 2
[Figure: P_1^r selects the preorder interval [x, y] in RevTrie; P_2 selects the preorder interval [x’, y’] in LZTrie]
Search for [x, y] × [x’, y’] in Range
For every pair (k, k+1) found, report k
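A small sketch of the type 2 search follows. It assumes each pair of consecutive phrases (k, k+1) is stored as a point whose first coordinate is the rank of phrase k among reversed phrases and whose second is the rank of phrase k+1 among phrases; a naive scan replaces the range search structure of [Chazelle, 1988], and all names are illustrative.

```python
from bisect import bisect_left


def lz78_phrases(text):
    """LZ78 phrase strings, with phrase 0 = empty string."""
    ids, phrases, cur = {""}, [""], ""
    for c in text:
        if cur + c in ids:
            cur += c
        else:
            ids.add(cur + c)
            phrases.append(cur + c)
            cur = ""
    if cur:
        raise ValueError("text must end with a unique terminator such as '$'")
    return phrases


def prefix_interval(sorted_keys, prefix):
    """Half-open interval [lo, hi) of the keys that start with prefix."""
    lo = bisect_left(sorted_keys, prefix)
    hi = lo
    while hi < len(sorted_keys) and sorted_keys[hi].startswith(prefix):
        hi += 1
    return lo, hi


def type2_occurrences(phrases, pattern):
    """Return (k, cut): phrase k ends with pattern[:cut], phrase k+1 starts with pattern[cut:]."""
    ids = list(range(1, len(phrases)))
    rev_order = sorted(ids, key=lambda i: phrases[i][::-1])   # RevTrie order
    lz_order = sorted(ids, key=lambda i: phrases[i])          # LZTrie preorder
    rev_rank = {pid: r for r, pid in enumerate(rev_order)}
    lz_rank = {pid: r for r, pid in enumerate(lz_order)}
    rev_keys = [phrases[i][::-1] for i in rev_order]
    lz_keys = [phrases[i] for i in lz_order]
    # One point per pair of consecutive phrases (k, k+1).
    points = [(rev_rank[k], lz_rank[k + 1], k) for k in ids[:-1]]
    hits = []
    for cut in range(1, len(pattern)):
        x, y = prefix_interval(rev_keys, pattern[:cut][::-1])
        x2, y2 = prefix_interval(lz_keys, pattern[cut:])
        # Naive scan standing in for the two-dimensional range search.
        hits += [(k, cut) for rx, ry, k in points if x <= rx < y and x2 <= ry < y2]
    return sorted(hits)


phrases = lz78_phrases("alabar a la alabarda$")
print(type2_occurrences(phrases, "alab"))   # [(8, 1)]: ' a' followed by 'lab'
```

Every way of cutting P into a nonempty P_1 and P_2 has to be tried, which is where part of the dependence on m in the search time comes from.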
LZ-index as a Navigation Scheme
In practice Range is replaced by RNode (phrase id → RevTrie node)
Occurrences of type 2:
[Figure: the RevTrie subtree for P_1^r and the LZTrie subtree for P_2, connected through the Node and RNode mappings]
We have no worst-case guarantees at search time
Average time for type 2 occurrences: O(n / σ^{m/2}) (O(1) for m ≥ 2 log_σ n)
Original Navigation Scheme
When we replace Range by RNode, we get a “navigation” scheme
But the scheme is redundant…
We study how to reduce the redundancy in the LZ-index
Alternative Navigation Scheme
Search algorithm remains the same…
Inverse permutations are represented with the technique of Munro et al.
Space requirement: (2+ε)uH_k + o(u log σ) bits
Search time: O(m²) (average case), for m ≥ 2 log_σ n
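The idea behind representing an inverse permutation without storing it explicitly can be sketched as follows: keep the permutation, add a back pointer every t steps along each cycle, and answer an inverse query by walking forward to the nearest back pointer and then forward again. This is a simplified illustration of that cycle-shortcut idea, not the exact structure of Munro et al.; the class name and the dictionary of back pointers are assumptions.

```python
class SampledInverse:
    """Permutation with a back pointer every t steps along each cycle."""

    def __init__(self, perm, t):
        self.perm = list(perm)
        self.back = {}                      # sampled element -> element t steps earlier
        seen = [False] * len(self.perm)
        for start in range(len(self.perm)):
            cycle, v = [], start
            while not seen[v]:
                seen[v] = True
                cycle.append(v)
                v = self.perm[v]
            if len(cycle) > t:              # short cycles need no shortcuts
                for idx in range(0, len(cycle), t):
                    # idx - t is negative for idx = 0 and wraps around the cycle
                    self.back[cycle[idx]] = cycle[idx - t]

    def inverse(self, j):
        """Return i such that perm[i] == j, in O(t) steps."""
        cur, jumped = j, False
        while self.perm[cur] != j:
            if not jumped and cur in self.back:
                cur, jumped = self.back[cur], True   # jump back at most once
            else:
                cur = self.perm[cur]
        return cur


perm = [2, 0, 3, 1, 5, 4, 6]
index = SampledInverse(perm, t=2)
print([index.inverse(j) for j in range(len(perm))])   # [1, 3, 0, 2, 5, 4, 6]
```

With t = 1/ε there are about εn back pointers, so the inverse costs roughly εn log n extra bits instead of a second n log n-bit array.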
Suffix Links in RevTrie
Can we reduce the space requirement of LZ-index to (1+ε)uH_k + o(u log σ) bits?
Can we reduce the space requirement while retaining worst-case guarantees in the search process?
We are going to compress the R mapping
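Suffix Links in RevTrie
Definition 1: We define function φ as a suffix link in RevTrie:
φ(i) = R^{-1}(parent_LZ(R[i]))
[Figure: RevTrie node i represents a·x^r and node φ(i) represents x^r; in LZTrie, node R[i] is the child by symbol a of the node representing x]
If we follow a suffix link in RevTrie, we are “going to the parent” in LZTrie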
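The definition can be tried out on a small parse. The sketch below makes two simplifying assumptions: trie positions are lexicographic ranks of the phrases (LZTrie) and of the reversed phrases (RevTrie), and RevTrie nodes that do not correspond to a phrase are ignored. Function and variable names are illustrative.

```python
def lz78_phrases(text):
    """LZ78 phrase strings, with phrase 0 = empty string."""
    ids, phrases, cur = {""}, [""], ""
    for c in text:
        if cur + c in ids:
            cur += c
        else:
            ids.add(cur + c)
            phrases.append(cur + c)
            cur = ""
    if cur:
        raise ValueError("text must end with a unique terminator such as '$'")
    return phrases


def suffix_links(phrases):
    """Return (R, phi, rev_order) indexed by RevTrie position."""
    ids = range(len(phrases))
    lz_order = sorted(ids, key=lambda i: phrases[i])          # LZTrie preorder
    rev_order = sorted(ids, key=lambda i: phrases[i][::-1])   # RevTrie preorder
    lz_pos = {pid: pos for pos, pid in enumerate(lz_order)}
    rev_pos = {pid: pos for pos, pid in enumerate(rev_order)}
    id_of = {p: i for i, p in enumerate(phrases)}
    R = [lz_pos[pid] for pid in rev_order]       # RevTrie position -> LZTrie position
    R_inv = [rev_pos[pid] for pid in lz_order]   # LZTrie position -> RevTrie position
    parent_lz = [None] * len(phrases)            # LZTrie parent, by LZTrie position
    for pid in ids:
        if pid != 0:
            parent_lz[lz_pos[pid]] = lz_pos[id_of[phrases[pid][:-1]]]
    # The root (position 0) has no parent, hence no suffix link.
    phi = [None] + [R_inv[parent_lz[R[i]]] for i in range(1, len(phrases))]
    return R, phi, rev_order


phrases = lz78_phrases("alabar a la alabarda$")
R, phi, rev_order = suffix_links(phrases)
for i in range(1, len(phrases)):
    here = phrases[rev_order[i]][::-1]
    there = phrases[rev_order[phi[i]]][::-1]
    print(f"{here!r} -> {there!r}")       # the link drops the first symbol
```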
Suffix Links in RevTrie
[Figure: worked example of computing R[11] by following suffix links, using the array L]
Suffix Links in RevTrie
We can compute R using φ
But what is the difference in space requirement? (Both R and φ require, in principle, n log n bits)
We can prove the following lemma for function φ
Suffix Links in RevTrie
We replace the n log n-bit representation of R by a representation of φ requiring nH_0(lets) + O(n log log σ) + O(σ log n) + n + o(n) bits
To compute R in O(1/ε) time we store εn values of R, requiring εn log n extra bits
R^{-1} can be computed in O(1/ε²) time
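How R can be recovered from φ plus a sample of R values can be sketched as follows. The sketch rests on several assumptions that are not taken from the talk: positions are lexicographic ranks as stand-ins for preorder, a node is sampled when its phrase length is a multiple of t (simpler than a sampling that guarantees εn samples), φ is stored as a plain list rather than compressed, and the LZTrie child operation is a dictionary.

```python
from bisect import bisect_right


def lz78_phrases(text):
    """LZ78 phrase strings, with phrase 0 = empty string."""
    ids, phrases, cur = {""}, [""], ""
    for c in text:
        if cur + c in ids:
            cur += c
        else:
            ids.add(cur + c)
            phrases.append(cur + c)
            cur = ""
    if cur:
        raise ValueError("text must end with a unique terminator such as '$'")
    return phrases


class SampledR:
    """Recover R[i] from suffix links, first symbols, and sampled R values."""

    def __init__(self, phrases, t):
        ids = range(len(phrases))
        lz_order = sorted(ids, key=lambda i: phrases[i])
        rev_order = sorted(ids, key=lambda i: phrases[i][::-1])
        lz_pos = {pid: pos for pos, pid in enumerate(lz_order)}
        rev_pos = {pid: pos for pos, pid in enumerate(rev_order)}
        id_of = {p: i for i, p in enumerate(phrases)}
        self.phi = [None] * len(phrases)    # suffix links, by RevTrie position
        self.child = {}                     # (LZTrie position, symbol) -> LZTrie position
        for pid in ids:
            if pid != 0:
                par = id_of[phrases[pid][:-1]]
                self.phi[rev_pos[pid]] = rev_pos[par]
                self.child[(lz_pos[par], phrases[pid][-1])] = lz_pos[pid]
        # Reversed phrases are sorted, so equal first symbols are contiguous:
        # one start position per symbol is enough to recover the first symbol.
        self.group_start, self.group_symbol = [], []
        for pos in range(1, len(phrases)):
            first = phrases[rev_order[pos]][-1]
            if not self.group_symbol or self.group_symbol[-1] != first:
                self.group_start.append(pos)
                self.group_symbol.append(first)
        # Sampled values of R; the root (empty phrase) is always sampled.
        self.sample = {rev_pos[pid]: lz_pos[pid]
                       for pid in ids if len(phrases[pid]) % t == 0}

    def R(self, i):
        path = []
        while i not in self.sample:         # climb suffix links, fewer than t steps
            group = bisect_right(self.group_start, i) - 1
            path.append(self.group_symbol[group])
            i = self.phi[i]
        pos = self.sample[i]
        for symbol in reversed(path):       # descend in LZTrie by the dropped symbols
            pos = self.child[(pos, symbol)]
        return pos


phrases = lz78_phrases("alabar a la alabarda$")
index = SampledR(phrases, t=2)
print([index.R(i) for i in range(len(phrases))])
```

Each suffix link step costs one child step in LZTrie on the way back, so the number of steps between samples controls the time to compute R.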
Suffix Links in RevTrie
Yes, we can reduce the space requirement of LZ-index to (1+ε)uH_k + o(u log σ) bits
Suffix Links in RevTrie
We can add Range to get worst-case guarantees in the search process, requiring n log n extra bits
Yes, we can reduce the space requirement of LZ-index to (2+ε)uH_k + o(u log σ) bits, retaining worst-case guarantees at search time
xbw LZ-index
The xbw transform [Ferragina et al., 2005] is a succinct tree representation requiring 2n log σ + O(n) bits and allowing the operations:
parent (O(1) time)
child(x, i) (O(1) time)
child(x, a) (O(1) time)
Subpath queries (O(m) time)
As we can perform prefix and suffix searching, we can do the work of both LZTrie and RevTrie with the xbw alone!
[Figure: subpath search with string P]
xbw LZ-index
[Figure: the balanced parentheses LZTrie (()()())()(()())(()) with its ids array, and the xbw LZTrie (S_last, S_α), connected through Pos and Pos^{-1}; Range relates xbw positions to preorder positions]
In principle: (3+ε)uH_k(T) + o(u log σ) bits
xbw LZ-index
[Figure: the balanced parentheses LZTrie (()()())()(()())(()) with its ids array, and the xbw LZTrie (S_last, S_α); Pos[i] is obtained from the sampled array Pos’]
We store one out of O(1/ε) values of Pos
(2+ε)uH_k(T) + o(u log σ) bits
xbw LZ-index
Occurrences of Type 1: using the xbw (subpath search with P^r), and then mapping to the parentheses LZTrie
Occurrences of Type 2: subpath search for P_1^r and search (using child from the root) for P_2. Then use the corresponding xbw and preorder ranges to search in Range
Occurrences of Type 3: mostly as with the original LZ-index
Occurrences of Type 2 can be solved as Occurrences of Type 3 (we don’t need Range!)
We have achieved Theorems 1 and 2 with radically different means!
Displaying text substrings
The approach of [Sadakane and Grossi, 2006] to display any text substring of length Θ(log_σ u) in constant time can be adapted to our indexes
Conclusions
We have studied the reduction of the space requirement of LZ-index
Two different approaches:
Navigational scheme
xbw + bp LZTrie
In either case we achieve (2+ε)uH_k(T) + o(u log σ) bits to index T[1…u], k = o(log_σ u)
The search time is improved to O(m² log m + (m + occ) log n) (worst case)
Conclusions
We also define indexes requiring (1+ε)uH_k(T) + o(u log σ) bits to index T[1…u], k = o(log_σ u)
O(m²) average-case time if m ≥ 2 log_σ n
The time to display a context of length ℓ around any text position is also improved to the optimal O(ℓ / log_σ u)
We also remove some restrictions of the original LZ-index (see the paper)