Reducing the Space Requirement of LZ-index

Diego Arroyuelo (1), Gonzalo Navarro (1), and Kunihiko Sadakane (2)

(1) Dept. of Computer Science, Univ. of Chile
(2) Dept. of Computer Science and Communication Engineering, Kyushu Univ.

Barcelona – July 7, 2006


Outline

Introduction
The LZ-index (A Review)
LZ-index as a Navigation Scheme
Suffix-Links in the Reverse Trie
xbw LZ-index
Displaying Text Substrings
Conclusions

Problem definition

The full-text search problem: find the occ occurrences of a pattern P[1…m] in a text T[1…u].

To provide fast access to T while requiring little space, we use compressed full-text self-indexes, which:
  replace T and, in addition, give indexed access to it, and
  take space proportional to the compressed text size (O(uH_k(T)) bits)

Main motivation: to store the indexes of very large texts entirely in main memory.

H_k(T) is the k-th order empirical entropy of T, and σ is the alphabet size:

  H_k(T) ≤ H_{k-1}(T) ≤ … ≤ H_0(T) ≤ log σ
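The entropy chain can be checked numerically. The following is a minimal sketch, assuming the usual definition of H_k as the weighted zero-order entropy of the symbols that follow each length-k context. The example text and the function names are illustrative choices, not taken from the slides.

```python
import math
from collections import Counter, defaultdict

def h0(seq):
    """Zero-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum(c / n * math.log2(n / c) for c in Counter(seq).values())

def hk(text, k):
    """k-th order empirical entropy: weighted H0 of the symbols following each length-k context."""
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)

# Assumed example text (hypothetical input chosen for illustration).
text = "alabar a la alabarda para apalabrarla$"
sigma = len(set(text))

# H_3, H_2, H_1, H_0, log(sigma): each value should not exceed the next one.
chain = [hk(text, k) for k in (3, 2, 1, 0)] + [math.log2(sigma)]
print(chain)
```

Longer contexts predict the next symbol better, which is why the values can only shrink as k grows.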

Our results

LZ-index [Navarro, 2004]:
  Space: 4uH_k(T) + o(u log σ) bits, k = o(log_σ u)
  Reporting: O(m^3 log σ + (m+occ) log u)
  Displaying: O(ℓ log σ)

Our results:
  Space: (2+ε)uH_k(T) + o(u log σ) bits, for any constant 0 < ε < 1
  Reporting: O(m^2 log m + (m+occ) log u)
  Displaying: O(ℓ / log_σ u) (optimal)

But also:
  Space: (1+ε)uH_k(T) + o(u log σ) bits
  Reporting: O(m^2) (average case), for m ≥ 2 log_σ u

Here ℓ is the length of the displayed text substring.

The main drawback of LZ-index is the factor 4 in the space complexity.

LZ-index is faster to report and to display (very important for a self-index!)

Our results in context

Our data structures:
  Size: O(uH_k(T)) bits
  O(log u) time per occurrence reported, if σ = Θ(polylog(u))

There are competing schemes requiring the same or better complexity for reporting.

The case σ = Θ(polylog(u)) represents moderate-size alphabets and is very common in practice, but does not fit in competing schemes.


The LZ-index (a review)

[Figure: the LZ-index, built on the LZ78 parsing of T, with its four components: LZTrie, RevTrie, Node, and Range.]

We don’t need to store the text!
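To make the role of the LZ78 parsing concrete, here is a minimal sketch of the parser. It represents each phrase as a (parent phrase, new symbol) pair, which is exactly a trie of phrases, and shows that the text can be rebuilt from that trie alone. The example text, the `$` terminator convention, and all function names are assumptions made for illustration.

```python
def lz78_parse(text):
    """LZ78 parsing: each phrase is the longest previous phrase plus one new symbol.

    Returns a list where entry k-1 holds (parent phrase id, symbol) for phrase k.
    Phrase id 0 is the empty phrase, i.e. the trie root.
    """
    trie = {}       # (parent phrase id, symbol) -> phrase id
    phrases = []
    node = 0
    for ch in text:
        child = trie.get((node, ch))
        if child is None:
            phrases.append((node, ch))
            trie[(node, ch)] = len(phrases)
            node = 0
        else:
            node = child
    if node != 0:
        raise ValueError("text should end with a unique terminator symbol")
    return phrases

def phrase_string(phrases, k):
    """Spell out phrase k by walking parent pointers up to the root."""
    out = []
    while k != 0:
        parent, ch = phrases[k - 1]
        out.append(ch)
        k = parent
    return "".join(reversed(out))

# Assumed example text (hypothetical input chosen for illustration).
text = "alabar a la alabarda para apalabrarla$"
phrases = lz78_parse(text)
strings = [phrase_string(phrases, k) for k in range(1, len(phrases) + 1)]
print(len(phrases), strings)
```

Concatenating the phrase strings in phrase order gives back the original text, which is the sense in which the index replaces T.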

Succinct representation of the data structures

Assume n is the number of phrases in the LZ78 parsing of T.

LZTrie:
  par: the balanced parentheses representation of LZTrie (2n + o(n) bits)
  lets: the symbols labelling the arcs of LZTrie, in preorder (n log σ bits)
  ids: the phrase identifiers in preorder (n log n bits)

RevTrie:
  rpar: the balanced parentheses representation of RevTrie (4n + o(n) bits)
  rids: the phrase identifiers in preorder (n log n bits)

Node: an array requiring n log(2n) = n log n + n bits

Range: implemented using [Chazelle, 1988], requiring n log n (1 + o(1)) bits
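The three LZTrie sequences can be produced by one preorder traversal. The sketch below is only illustrative: it uses plain Python lists rather than succinct bit vectors, the hard-coded phrase list is the LZ78 parsing of an assumed example text ("alabar a la alabarda para apalabrarla$"), and the names are hypothetical.

```python
# (parent phrase id, symbol) for phrases 1..17 of the assumed example text.
PHRASES = [(0, "a"), (0, "l"), (1, "b"), (1, "r"), (0, " "), (1, " "),
           (2, "a"), (5, "a"), (7, "b"), (4, "d"), (6, "p"), (4, "a"),
           (8, "p"), (1, "l"), (3, "r"), (4, "l"), (1, "$")]

def lztrie_sequences(phrases):
    """Return (par, lets, ids) for the LZTrie, visiting children in symbol order."""
    children = {}
    for k, (parent, ch) in enumerate(phrases, start=1):
        children.setdefault(parent, []).append((ch, k))
    par, lets, ids = [], [], []

    def visit(node, letter):
        par.append("(")            # opening parenthesis when the node is entered
        lets.append(letter)        # symbol on the arc leading to this node
        ids.append(node)           # phrase identifier of this node
        for ch, child in sorted(children.get(node, [])):
            visit(child, ch)
        par.append(")")            # closing parenthesis when the subtree is done

    visit(0, "")                   # the root is the empty phrase, id 0
    return "".join(par), lets, ids

par, lets, ids = lztrie_sequences(PHRASES)
print(par)
print(ids)
```

Each node contributes two parentheses, one symbol, and one identifier, which matches the 2n, n log σ, and n log n terms listed above up to the extra root node.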

Succinct representation of the data structures

We have four n log n-bit terms.

As n log n = uH_k(T) + o(u log σ), for k = o(log_σ u), the LZ-index requires

  4n log n (1 + o(1)) = 4uH_k(T) + o(u log σ) bits, for k = o(log_σ u)

To reduce the space requirement we must reduce the number of n log n-bit terms in the index.
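As a worked version of this count, the four dominant terms (ids, rids, Node, Range) can be added up explicitly. This is a sketch that only uses the sizes stated for each component and the relation n log n = uH_k(T) + o(u log σ). The remaining components (par, rpar, lets) are treated as lower-order terms, which assumes n log σ = o(u log σ).

```latex
\begin{align*}
\underbrace{n\log n}_{ids}
 + \underbrace{n\log n}_{rids}
 + \underbrace{n\log n + n}_{Node}
 + \underbrace{n\log n\,(1+o(1))}_{Range}
 &= 4n\log n\,(1+o(1)) \\
 &= 4uH_k(T) + o(u\log\sigma), \qquad k = o(\log_\sigma u).
\end{align*}
```

Removing one n log n-bit term therefore lowers the leading constant from 4 to 3, and so on.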

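Search Algorithm

Occurrences of P are classified according to how they overlap the LZ78 phrases B_{k-1}, B_k, …, B_l, B_{l+1}:
  Occurrences of Type 1: P lies entirely inside a single phrase
  Occurrences of Type 2: P spans two consecutive phrases
  Occurrences of Type 3: P spans three or more phrases

Reporting time: O(m^3 log σ + (m+occ) log n)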
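The three occurrence types can be illustrated by brute force: find every occurrence in the plain text and count how many phrases it touches. This sketch is not the index's search algorithm. It only shows the classification, and the example text and names are assumptions.

```python
from bisect import bisect_right
from itertools import accumulate

def lz78_phrases(text):
    """LZ78 phrases of text as strings (each is a previous phrase plus one symbol)."""
    seen, phrases, cur = {""}, [], ""
    for ch in text:
        cur += ch
        if cur not in seen:
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    if cur:
        phrases.append(cur)      # possibly incomplete last phrase
    return phrases

def classify_occurrences(text, pattern):
    """Map type (1, 2 or 3) to the text positions of occurrences of that type."""
    phrases = lz78_phrases(text)
    # starts[j] is the text position where phrase j + 1 begins.
    starts = [0] + list(accumulate(len(p) for p in phrases))[:-1]
    result = {1: [], 2: [], 3: []}
    pos = text.find(pattern)
    while pos != -1:
        first = bisect_right(starts, pos) - 1                     # phrase of first symbol
        last = bisect_right(starts, pos + len(pattern) - 1) - 1   # phrase of last symbol
        result[min(last - first + 1, 3)].append(pos)
        pos = text.find(pattern, pos + 1)
    return result

# Assumed example text (hypothetical input chosen for illustration).
text = "alabar a la alabarda para apalabrarla$"
print(classify_occurrences(text, "ala"))
print(classify_occurrences(text, "ab"))
```

For this text, "ala" has one occurrence spanning three phrases and two spanning two phrases, while every occurrence of "ab" falls inside a single phrase.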

Solving Occurrences of Type 1

Consider the shortest possible LZ78 phrases containing P. By LZ78, P is a suffix of such phrases.

[Figure: LZTrie, highlighting the subtrees containing occurrences of type 1.]

Solving Occurrences of Type 1

As P is a suffix of such phrases, P^r (the reverse of P) is a prefix of the corresponding reverse phrases.

We need the Reverse Trie (RevTrie) to solve this problem.

[Figure: P^r is searched in RevTrie, and the nodes found lead to the LZTrie subtrees whose phrases contain P.]
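The two steps can be sketched as follows. A sorted list of reversed phrases stands in for RevTrie (phrases ending in P form a contiguous range there, as a trie subtree would), and a children map stands in for LZTrie. The hard-coded phrases come from an assumed example text, and all names are hypothetical.

```python
from bisect import bisect_left

# (parent phrase id, symbol) for the LZ78 phrases of the assumed example text
# "alabar a la alabarda para apalabrarla$".
PHRASES = [(0, "a"), (0, "l"), (1, "b"), (1, "r"), (0, " "), (1, " "),
           (2, "a"), (5, "a"), (7, "b"), (4, "d"), (6, "p"), (4, "a"),
           (8, "p"), (1, "l"), (3, "r"), (4, "l"), (1, "$")]

def phrase_strings(phrases):
    """strings[k] is the text of phrase k; strings[0] is the empty root phrase."""
    strings = [""]
    for parent, ch in phrases:
        strings.append(strings[parent] + ch)
    return strings

def type1_occurrences(phrases, pattern):
    """Return sorted (phrase id, offset) pairs where pattern occurs inside a phrase."""
    strings = phrase_strings(phrases)
    children = {}
    for k, (parent, _) in enumerate(phrases, start=1):
        children.setdefault(parent, []).append(k)
    # RevTrie stand-in: reversed phrases in lexicographic order.
    rev = sorted((strings[k][::-1], k) for k in range(1, len(strings)))
    rp = pattern[::-1]
    occs = []
    i = bisect_left(rev, (rp,))
    while i < len(rev) and rev[i][0].startswith(rp):
        top = rev[i][1]                       # a phrase that has pattern as a suffix
        offset = len(strings[top]) - len(pattern)
        stack = [top]                         # every LZTrie descendant contains it too
        while stack:
            k = stack.pop()
            occs.append((k, offset))
            stack.extend(children.get(k, []))
        i += 1
    return sorted(occs)

print(type1_occurrences(PHRASES, "ar"))
```

Each phrase that ends in the pattern contributes its whole LZTrie subtree, so the work is proportional to the number of occurrences reported.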

Solving Occurrences of Type 2

P is split as P = P_1 P_2. Searching for P_1^r in RevTrie gives the preorder range [x, y], and searching for P_2 in LZTrie gives the preorder range [x’, y’].

Search for [x, y] × [x’, y’] in Range.

For every pair (k, k+1) found, report k.
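A small sketch of this two-dimensional search follows. Ranks in sorted lists stand in for trie preorder positions, and a linear scan over the points stands in for the Range structure of [Chazelle, 1988], so only the logic is illustrated, not the time bound. Ranges are half-open here, and the phrase list (from an assumed example text) and names are hypothetical.

```python
from bisect import bisect_left

# LZ78 phrases 1..17 of the assumed example text "alabar a la alabarda para apalabrarla$".
PHRASE_STRS = ["a", "l", "ab", "ar", " ", "a ", "la", " a", "lab", "ard",
               "a p", "ara", " ap", "al", "abr", "arl", "a$"]

def prefix_range(sorted_strings, prefix):
    """Half-open rank interval of the sorted strings that start with prefix."""
    lo = bisect_left(sorted_strings, prefix)
    hi = lo
    while hi < len(sorted_strings) and sorted_strings[hi].startswith(prefix):
        hi += 1
    return lo, hi

def type2_occurrences(phrase_strs, pattern):
    """Phrase ids k such that pattern starts in phrase k and ends in phrase k + 1."""
    fwd = sorted(phrase_strs)                       # LZTrie order stand-in
    rev = sorted(s[::-1] for s in phrase_strs)      # RevTrie order stand-in
    fwd_rank = {s: i for i, s in enumerate(fwd)}
    rev_rank = {s: i for i, s in enumerate(rev)}
    # One point per consecutive pair (k, k + 1).
    points = [(rev_rank[phrase_strs[k - 1][::-1]], fwd_rank[phrase_strs[k]], k)
              for k in range(1, len(phrase_strs))]
    found = []
    for cut in range(1, len(pattern)):              # every split P = P1 P2
        x, y = prefix_range(rev, pattern[:cut][::-1])
        x2, y2 = prefix_range(fwd, pattern[cut:])
        found += [k for rx, ry, k in points if x <= rx < y and x2 <= ry < y2]
    return sorted(found)

print(type2_occurrences(PHRASE_STRS, "ala"))
```

The m - 1 splits of the pattern are tried one by one, which is where a factor of m in the search time comes from.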


LZ-index as a Navigation Scheme

In practice, Range is replaced by RNode (phrase id → RevTrie node).

Occurrences of type 2:
  We have no worst-case guarantees at search time
  Average time for type 2 occurrences: O(n/σ^{m/2}) (O(1) for m ≥ 2 log_σ n)

[Figure: P_1^r in RevTrie and P_2 in LZTrie, connected through the RNode and Node mappings.]

Original Navigation Scheme

When we replace Range by RNode, we get a “navigation” scheme.

But the scheme is redundant…

We study how to reduce the redundancy in the LZ-index.

Alternative Navigation Scheme

Search algorithm remains the same…

Inverse permutations are represented with the technique of Munro et al.

Space requirement: (2+ε)uH_k + o(u log σ) bits

Search time: O(m^2) (average case), for m ≥ 2 log_σ n
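The idea behind storing a permutation together with cheap access to its inverse can be sketched with backward shortcuts placed every t steps along each cycle. This is a simplified illustration of the general technique, using a Python dict where the real structure uses succinct bitmaps. The parameter t plays the role of 1/ε, and the names are hypothetical.

```python
def build_shortcuts(perm, t):
    """Store, for every t-th element of each long cycle, the element t steps behind it."""
    back = {}
    seen = [False] * len(perm)
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycle = []
        i = start
        while not seen[i]:
            seen[i] = True
            cycle.append(i)
            i = perm[i]
        if len(cycle) > t:
            for pos in range(0, len(cycle), t):
                back[cycle[pos]] = cycle[pos - t]   # negative index wraps around the cycle
    return back

def inverse(perm, back, i):
    """Return j with perm[j] == i using O(t) steps of perm plus one shortcut."""
    j = i
    # Walk forward until the predecessor of i or a shortcut is found.
    while perm[j] != i and j not in back:
        j = perm[j]
    if perm[j] != i:
        j = back[j]                 # jump t steps back, landing behind i
        while perm[j] != i:         # walk forward to the predecessor of i
            j = perm[j]
    return j

# Hypothetical example permutation of 0..19.
perm = [(7 * i + 3) % 20 for i in range(20)]
back = build_shortcuts(perm, t=3)
print([inverse(perm, back, i) for i in range(20)])
```

Only about n/t shortcuts are stored, so the extra space is a fraction of the n log n bits that an explicit inverse array would need.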


Suffix Links in RevTrie

Can we reduce the space requirement of LZ-index to (1+ε)uH_k + o(u log σ) bits?

Can we reduce the space requirement while retaining worst-case guarantees in the search process?

We are going to compress the R mapping.

Suffix Links in RevTrie

Definition 1: We define function φ as a suffix link in RevTrie:

  φ(i) = R^{-1}(parent_LZ(R[i]))

[Figure: RevTrie node i represents the string a·x^r. The LZTrie node R[i] represents x·a, and its parent represents x. The RevTrie node φ(i) represents x^r.]

If we follow a suffix link in RevTrie, we are “going to the parent” in LZTrie.
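The definition can be checked on a small example. In this sketch, ranks in sorted lists stand in for preorder positions in LZTrie and RevTrie, RevTrie nodes that do not correspond to phrases are ignored, and the suffix link function is called `phi`. The phrase list comes from an assumed example text and the names are hypothetical.

```python
# LZ78 phrases 1..17 of the assumed example text "alabar a la alabarda para apalabrarla$".
PHRASE_STRS = ["a", "l", "ab", "ar", " ", "a ", "la", " a", "lab", "ard",
               "a p", "ara", " ap", "al", "abr", "arl", "a$"]

def suffix_links(phrase_strs):
    """Return (rev, R, phi): RevTrie strings in preorder, the mapping R, and the suffix links."""
    fwd = sorted([""] + phrase_strs)                       # LZTrie nodes in preorder
    rev = sorted([""] + [s[::-1] for s in phrase_strs])    # RevTrie phrase nodes in preorder
    lz_pos = {s: i for i, s in enumerate(fwd)}
    R = [lz_pos[r[::-1]] for r in rev]                     # RevTrie position -> LZTrie position
    R_inv = [0] * len(R)
    for i, p in enumerate(R):
        R_inv[p] = i
    parent_lz = [lz_pos[s[:-1]] for s in fwd]              # the root is its own parent here
    phi = [R_inv[parent_lz[R[i]]] for i in range(len(R))]  # Definition 1
    return rev, R, phi

rev, R, phi = suffix_links(PHRASE_STRS)
for i in range(1, 5):
    print(repr(rev[i]), "->", repr(rev[phi[i]]))
```

Following phi drops the first symbol of the RevTrie string, which is the usual behaviour of a suffix link.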

Suffix Links in RevTrie

[Figure: worked example of computing R[11] with suffix links, showing arrays over positions 0 to 17 and the symbol array L = __rrppllldbbaaaa$.]

Suffix Links in RevTrie

We can compute R using φ.

But what is the difference in space requirement? (Both R and φ require, in principle, n log n bits.)

We can prove the following lemma for function φ.

Suffix Links in RevTrie

We replace the n log n-bit representation of R by a representation of φ requiring

  nH_0(lets) + O(n log log σ) + O(log) + n + o(n) bits

To compute R in O(1/ε) time we store εn values of R, requiring εn log n extra bits.

R^{-1} can be computed in O(1/ε^2) time.
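Computing R from suffix links plus a few sampled values can be sketched as follows: follow the links until a sampled node is reached, then descend in LZTrie by the symbols that were dropped. The sampling rule used here (every RevTrie string whose length is a multiple of t) is a simplification assumed for illustration and does not by itself bound the number of samples. The suffix link function is called `phi`, and all names and the example phrases are hypothetical.

```python
# LZ78 phrases 1..17 of the assumed example text "alabar a la alabarda para apalabrarla$".
PHRASE_STRS = ["a", "l", "ab", "ar", " ", "a ", "la", " a", "lab", "ard",
               "a p", "ara", " ap", "al", "abr", "arl", "a$"]

def build_index(phrase_strs, t):
    """Suffix links, first symbols, LZTrie child map, and sampled values of R."""
    fwd = sorted([""] + phrase_strs)                       # LZTrie preorder stand-in
    rev = sorted([""] + [s[::-1] for s in phrase_strs])    # RevTrie preorder stand-in
    lz_pos = {s: i for i, s in enumerate(fwd)}
    rev_pos = {s: i for i, s in enumerate(rev)}
    phi = [rev_pos[r[1:]] for r in rev]                    # suffix link: drop first symbol
    first = [r[:1] for r in rev]                           # symbol dropped by the link
    child = {(lz_pos[s[:-1]], s[-1]): lz_pos[s] for s in fwd if s}
    samples = {i: lz_pos[r[::-1]] for i, r in enumerate(rev) if len(r) % t == 0}
    return phi, first, child, samples

def compute_R(i, phi, first, child, samples):
    """LZTrie position of RevTrie node i, using only phi and the sampled values."""
    letters = []
    while i not in samples:          # fewer than t suffix links are followed
        letters.append(first[i])
        i = phi[i]
    node = samples[i]
    for a in reversed(letters):      # descend by the dropped symbols, last dropped first
        node = child[(node, a)]
    return node

phi, first, child, samples = build_index(PHRASE_STRS, t=2)
R = [compute_R(i, phi, first, child, samples) for i in range(len(phi))]
print(R)
```

A larger t means fewer stored values of R and more link-following per query, which mirrors the trade-off between the εn log n extra bits and the O(1/ε) time.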

Suffix Links in RevTrie

Yes, we can reduce the space requirement of LZ-index to (1+ε)uH_k + o(u log σ) bits.

Suffix Links in RevTrie

We can add Range to get worst-case guarantees in the search process, requiring n log n extra bits.

Yes, we can reduce the space requirement of LZ-index to (2+ε)uH_k + o(u log σ) bits, retaining worst-case guarantees at search time.


xbw LZ-index

The xbw transform [Ferragina et al., 2005] is a succinct tree representation requiring 2n log σ + O(n) bits and allowing the operations:
  parent (O(1) time)
  child(x, i) (O(1) time)
  child(x, a) (O(1) time)
  Subpath queries (O(m) time)

As we can perform prefix and suffix searching, we can do the work of both LZTrie and RevTrie with the xbw alone!

[Figure: subpath search with string P.]

xbw LZ-index

[Figure: the balanced parentheses LZTrie, for example (()()())()(()())(()), with ids, next to the xbw LZTrie (S_α and S_last). Pos and Pos^{-1} map between xbw positions and preorder positions, and Range relates xbw positions to preorder positions.]

In principle: (3+ε)uH_k(T) + o(u log σ) bits

xbw LZ-index

[Figure: the balanced parentheses LZTrie, for example (()()())()(()())(()), with ids, next to the xbw LZTrie (S_α and S_last). Pos[i] is obtained for positions i and j from the sampled array Pos’.]

We store one out of O(1/ε) values of Pos.

Space: (2+ε)uH_k(T) + o(u log σ) bits

xbw LZ-index

Occurrences of Type 1: using the xbw (subpath search with P^r), and then mapping to the parentheses LZTrie.

Occurrences of Type 2: subpath search for P_1^r and search (using child from the root) for P_2. Then use the corresponding xbw and preorder ranges to search in Range.

Occurrences of Type 3: mostly as with the original LZ-index.

Occurrences of Type 2 can be solved as Occurrences of Type 3 (we don’t need Range!).

We have achieved Theorems 1 and 2 with radically different means!


Displaying text substrings

The approach of [Sadakane and Grossi, 2006] to display any text substring of length Θ(log_σ u) in constant time can be adapted to our indexes.


Conclusions

We have studied the reduction of the space requirement of LZ-index.

Two different approaches:
  Navigational scheme
  xbw + bp LZTrie

In either case we achieve (2+ε)uH_k(T) + o(u log σ) bits to index T[1…u], k = o(log_σ u).

The search time is improved to O(m^2 log m + (m+occ) log n) (worst case).

Conclusions

We also define indexes requiring (1+ε)uH_k(T) + o(u log σ) bits to index T[1…u], k = o(log_σ u), with O(m^2) average-case search time if m ≥ 2 log_σ n.

The time to display a context of length ℓ around any text position is also improved to the optimal O(ℓ / log_σ u).

We also remove some restrictions of the original LZ-index (see the paper).

Questions?

Thanks!