Statistical Encoding of Succinct Data Structures

Rodrigo González, Gonzalo Navarro
Department of Computer Science, Universidad de Chile

Combinatorial Pattern Matching, 2006


Outline

1 Background
  Motivation
  The k-th order empirical entropy
  Statistical encoding

2 Entropy-bound succinct data structure
  Idea
  Data structures
  Decoding Algorithm
  Space requirement
  Supporting appends

3 Application to full-text indexing
  Succinct full-text self-indexes
  The Burrows-Wheeler Transform
  The wavelet tree


Motivation

Previous work

In recent work, Sadakane and Grossi [SODA’06] introduced a scheme to represent any sequence $S$ using $nH_k(S) + O\left(\frac{n}{\log_\sigma n}\left((k + 1)\log\sigma + \log\log n\right)\right)$ bits of space.

The representation permits us to extract any substring of size $\Theta(\log_\sigma n)$ in constant time, and thus it completely replaces $S$ under the RAM model.

This permits converting any succinct structure using $o(n \log\sigma)$ bits of space on top of $S$ into a compressed structure using $nH_k(S) + o(n \log\sigma)$ bits overall, for any $k = o(\log_\sigma n)$.


Our work

We extend this previous work, obtaining slightly better space complexity and the same time complexity with a simpler scheme based on statistical encoding.

We show that the scheme supports appending symbols in constant amortized time.

We prove some results on the applicability of the scheme to full-text self-indexing.


Example: a simple rank structure

Definition

$\mathrm{rank}_1(S, i)$ = number of ones in $S[1 \ldots i]$.


$\mathrm{rank}_1(S, 14) = 5 + 1 + 1 = 7$.
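The three-term sum above has the shape of a standard two-level rank directory: a count stored per superblock, a count stored per block relative to its superblock, and a count of the ones inside the final block. The slide's figure is not preserved here, so the following Python sketch is an assumption about that layout rather than the authors' construction; the class name `RankBitVector` and the block sizes are made up for illustration.

```python
class RankBitVector:
    """Two-level rank directory over a list of 0/1 bits (illustrative only)."""

    def __init__(self, bits, block=4, blocks_per_super=2):
        self.bits = list(bits)
        self.block = block
        self.super = block * blocks_per_super
        self.super_counts = []  # ones before the start of each superblock
        self.block_counts = []  # ones from superblock start to block start
        total = 0
        in_super = 0
        for start in range(0, len(self.bits), block):
            if start % self.super == 0:
                self.super_counts.append(total)
                in_super = 0
            self.block_counts.append(in_super)
            ones = sum(self.bits[start:start + block])
            total += ones
            in_super += ones

    def rank1(self, i):
        """Number of ones in S[1..i] (1-based, inclusive); rank1(0) is 0."""
        if i <= 0:
            return 0
        i = min(i, len(self.bits))
        blk = (i - 1) // self.block       # block holding position i
        sup = (i - 1) // self.super       # superblock holding position i
        inside = sum(self.bits[blk * self.block:i])  # scan within one block
        return self.super_counts[sup] + self.block_counts[blk] + inside
```

In a real succinct structure the in-block count would come from a precomputed table or a popcount instruction; the `sum` here just keeps the sketch short.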


The k-th order empirical entropy

Definition

The empirical entropy is defined for any string $S$ and can be used to measure the performance of compression algorithms without any assumption on the input.

The k-th order empirical entropy captures the dependence of symbols upon their context. For $k \ge 0$, $nH_k(S)$ provides a lower bound to the output of any compressor that considers a context of size $k$ to encode every symbol of $S$.

$$H_k(S) = \frac{1}{n} \sum_{w \in \Sigma^k} |w_S| \, H_0(w_S). \qquad (1)$$
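Equation (1) can be evaluated directly on small strings. The sketch below assumes the usual reading of $w_S$ (the slide does not spell it out): the string formed by the symbols that follow each occurrence of context $w$ in $S$. The function names `h0` and `hk` are illustrative.

```python
from collections import Counter, defaultdict
from math import log2


def h0(seq):
    """Zero-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in Counter(seq).values())


def hk(s, k):
    """k-th order empirical entropy of s, following Eq. (1)."""
    if k == 0:
        return h0(s)
    followers = defaultdict(list)  # context w -> symbols that follow w in s
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])
    # Contexts that never occur contribute nothing to the sum.
    return sum(len(ws) * h0(ws) for ws in followers.values()) / len(s)
```

For instance, `hk("abababab", 0)` is 1 bit per symbol, while `hk("abababab", 1)` is 0 because each symbol is fully determined by the one before it.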


Semi-static Statistical encoding

Descriptions

Given a k-th order modeler, which yields the probabilities $p_1, p_2, \ldots, p_n$ for the symbols, we encode the successive symbols of $S$, trying to use $-\log p_i$ bits for $s_i$. If we reach exactly $-\log p_i$ bits, the overall number of bits produced will be $nH_k(S) + O(k \log n)$.

Different encoders provide different approximations to the ideal $-\log p_i$ bits (Huffman coding, Arithmetic coding).

Given a statistical encoder $E$ and a semi-static modeler over sequence $S[1, n]$, we call $E(S)$ the bitwise output of $E$. We call $f_k(E, S)$ the extra space in bits needed to encode $S$ using $E$, on top of $nH_k(S)$.
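To see where the $nH_k(S)$ term comes from, one can add up the ideal $-\log p_i$ costs under a semi-static model, where the probabilities are the empirical frequencies gathered from $S$ itself in a first pass. This is a sketch under that assumption; `ideal_bits` is a hypothetical helper, and it simply skips the first $k$ symbols, which have no full context.

```python
from collections import Counter
from math import log2


def ideal_bits(s, k):
    """Sum of -log2(p_i) over the positions of s that have a full length-k context."""
    pair_counts = Counter()     # (context, symbol) -> occurrences
    context_counts = Counter()  # context -> occurrences
    for i in range(k, len(s)):
        ctx = s[i - k:i]
        pair_counts[(ctx, s[i])] += 1
        context_counts[ctx] += 1
    total = 0.0
    for (ctx, _sym), c in pair_counts.items():
        p = c / context_counts[ctx]  # semi-static estimate of p_i
        total += c * -log2(p)        # this pair occurs c times
    return total
```

With four symbols that are equally split between two values and no context, `ideal_bits("aabb", 0)` gives 4 bits, one per symbol.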


Encoders

Arithmetic coding essentially expresses $S$ using a number in $[0, 1)$ which lies within a range of size $P = p_1 \cdot p_2 \cdots p_n$. We need $-\log P = -\sum \log p_i$ bits to distinguish a number within that range (plus two extra bits for technical reasons).

There are usually some limitations to the near-optimality achieved by Arithmetic coding in practice: scaling, very low probabilities, and adaptive encoding. None of them is a problem in our scheme.
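The interval view can be made concrete with exact rational arithmetic. The toy coder below is only a sketch: it uses a fixed order-0 model (an assumption; the scheme in the slides uses a k-th order one), needs the message length for decoding, and ignores the two extra bits mentioned above. `encode` and `decode` are illustrative names.

```python
from fractions import Fraction
from math import ceil


def _starts(probs):
    """Left endpoint of each symbol's slice of [0, 1)."""
    starts, acc = {}, Fraction(0)
    for sym in sorted(probs):
        starts[sym] = acc
        acc += probs[sym]
    return starts


def encode(message, probs):
    """Return (code, nbits) such that code / 2**nbits lies in the message's interval."""
    starts = _starts(probs)
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        low += width * starts[sym]
        width *= probs[sym]  # the final width is P = p_1 * ... * p_n
    nbits = 0
    while Fraction(1, 2 ** nbits) > width:  # smallest nbits with 2**-nbits <= P
        nbits += 1
    return ceil(low * 2 ** nbits), nbits


def decode(code, nbits, length, probs):
    """Recover `length` symbols by narrowing the interval around code / 2**nbits."""
    starts = _starts(probs)
    x = Fraction(code, 2 ** nbits)
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        for sym, p in probs.items():
            lo = low + width * starts[sym]
            if lo <= x < lo + width * p:
                out.append(sym)
                low, width = lo, width * p
                break
    return "".join(out)
```

Because the grid of multiples of $2^{-\text{nbits}}$ is at least as fine as the interval is wide, some grid point always falls inside it, which is why about $-\log P$ bits are enough to name the message.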


Entropy-bound succinct data structure

Idea

Given a sequence $S[1, n]$ over an alphabet $A$ of size $\sigma$, we encode $S$ into a compressed data structure $S'$ within entropy bounds. To perform all the original operations over $S$ under the RAM model, it is enough to allow extracting any $b = \frac{1}{2}\log_\sigma n$ consecutive symbols of $S$, using $S'$, in constant time.
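One way to see why fixed-length blocks suffice: if $S$ is cut into blocks of $b$ symbols, any window of $b$ consecutive symbols overlaps at most two adjacent blocks, so extraction reduces to decoding two blocks. The sketch below illustrates only this access pattern; the block codec is a placeholder identity function (an assumption standing in for the statistical encoder), and `BlockedSequence` and `block_length` are hypothetical names.

```python
def block_length(n, sigma):
    """floor(0.5 * log_sigma(n)) computed with integers, at least 1 (needs sigma >= 2)."""
    b = 0
    while sigma ** (2 * (b + 1)) <= n:
        b += 1
    return max(1, b)


class BlockedSequence:
    def __init__(self, s, sigma, encode=lambda blk: blk, decode=lambda enc: enc):
        self.n = len(s)
        self.b = block_length(self.n, sigma)
        self.decode = decode
        # Each block is encoded on its own; the identity codec is a stand-in.
        self.blocks = [encode(s[i:i + self.b]) for i in range(0, self.n, self.b)]

    def extract(self, i):
        """Return s[i : i + b] (0-based), decoding at most two blocks."""
        first = i // self.b
        decoded = "".join(self.decode(blk) for blk in self.blocks[first:first + 2])
        offset = i - first * self.b
        return decoded[offset:offset + self.b]
```

With a binary string of length 1024, `block_length(1024, 2)` is 5, so every call to `extract` touches two blocks of five symbols at most.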


Data structures


Decoding Algorithm


Space requirement

Size of U

$|U| \le \sum_{i=0}^{\lfloor n/b \rfloor} |E_i| = nH_k(S) + O(k \log n) + \sum_{i=0}^{\lfloor n/b \rfloor} f_k(E, S_i)$, which depends on the statistical encoder $E$ used.

Huffman: $f_k(\mathrm{Huffman}, S_i) < b$, thus we achieve $nH_k(S) + O(k \log n) + n$ bits.

Arithmetic: $f_k(\mathrm{Arithmetic}, S_i) \le 2$, thus we achieve $nH_k(S) + O(k \log n) + \frac{4n}{\log_\sigma n}$ bits.

Other structures

Contexts: $(n/b)\, k \log\sigma = O(nk \log\sigma / \log_\sigma n)$

Positions: $O(n \log\log n / \log_\sigma n)$

Table: $\sigma^k n^{1/2} \log n / 2$
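To get a feel for the sizes involved, the exact expressions above can be evaluated for concrete parameters. This is plain arithmetic on the slide's formulas, with $b = \frac{1}{2}\log_\sigma n$ treated as a real number (rounding is ignored) and logarithms without a base read as base 2, which is an assumption. The Positions term is omitted because only its order of growth is given. `space_terms` is an illustrative name.

```python
from math import log2, sqrt


def space_terms(n, sigma, k):
    """Evaluate the overhead terms (in bits) for given n, sigma and k."""
    log_sigma_n = log2(n) / log2(sigma)
    b = 0.5 * log_sigma_n   # block length in symbols
    blocks = n / b          # number of blocks, about n / b
    return {
        "huffman_extra_bound": blocks * b,     # under b bits per block, n in total
        "arithmetic_extra_bound": blocks * 2,  # 2 bits per block, 4n / log_sigma n
        "contexts": blocks * k * log2(sigma),  # k symbols of context per block
        "table": sigma ** k * sqrt(n) * log2(n) / 2,
    }
```

For $n = 2^{20}$, $\sigma = 4$ and $k = 2$, the block length is 5 symbols, and the arithmetic-coding overhead bound is about 0.4 bits per symbol against 1 bit per symbol for Huffman.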

Page 44: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

BackgroundEntropy-bound succinct data structure

Application to full-text indexingSummary

IdeaData structuresDecoding AlgorithmSpace requirementSupporting appends

Space requirement

Size of U

|U| =≤∑bn/bc

i=0 |Ei | = nHk (S)+O(k log n)+∑bn/bc

i=0 fk (E , Si),which depends on the statistical encoder E used.

Huffman: fk (Huffman, Si) < b, thus we achivenHk (S) + O(k log n) + n bits.

Arithmetic: fk (Arithmetic, Si) ≤ 2, thus we achivenHk (S) + O(k log n) + 4n

logσ n bits.

Other structures

Contexts: (n/b)k log σ = O(nk log σ/ logσ n)

Positions: O(n log log n/ logσ n)

Table: σk n1/2 log n/2Gonzalez, Navarro Statistical Encoding of Succinct Data Structures

Page 45: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

BackgroundEntropy-bound succinct data structure

Application to full-text indexingSummary

IdeaData structuresDecoding AlgorithmSpace requirementSupporting appends

Space requirement

Size of U

|U| =≤∑bn/bc

i=0 |Ei | = nHk (S)+O(k log n)+∑bn/bc

i=0 fk (E , Si),which depends on the statistical encoder E used.

Huffman: fk (Huffman, Si) < b, thus we achivenHk (S) + O(k log n) + n bits.

Arithmetic: fk (Arithmetic, Si) ≤ 2, thus we achivenHk (S) + O(k log n) + 4n

logσ n bits.

Other structures

Contexts: (n/b)k log σ = O(nk log σ/ logσ n)

Positions: O(n log log n/ logσ n)

Table: σk n1/2 log n/2Gonzalez, Navarro Statistical Encoding of Succinct Data Structures

Page 49: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

Space requirement

Theorem

Let $S[1, n]$ be a sequence over an alphabet $A$ of size $\sigma$. Our data structure uses $nH_k(S) + O\left(\frac{n}{\log_\sigma n}(k \log \sigma + \log \log n)\right)$ bits of space for any $k < (1 - \varepsilon) \log_\sigma n$ and any constant $0 < \varepsilon < 1$, and it supports access to any substring of $S$ of size $\Theta(\log_\sigma n)$ symbols in $O(1)$ time.

Corollary

Our structure takes space $nH_k(S) + o(n \log \sigma)$ if $k = o(\log_\sigma n)$.
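The corollary can be checked by bounding the two redundancy terms of the theorem separately. A short derivation, using only the identity $\log_\sigma n = \log n / \log \sigma$:

```latex
\begin{align*}
\frac{n}{\log_\sigma n}\, k \log\sigma
  &= n \log\sigma \cdot \frac{k}{\log_\sigma n} = o(n \log\sigma)
  && \text{since } k = o(\log_\sigma n), \\
\frac{n}{\log_\sigma n}\, \log\log n
  &= n \log\sigma \cdot \frac{\log\log n}{\log n} = o(n \log\sigma)
  && \text{since } \log\log n = o(\log n).
\end{align*}
```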

Page 51: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

Supporting appends

Theorem

The structure supports appending symbols in constant amortized time and retains the same space and query time complexities.

Append scheme

[Figure: append scheme; image not available in this transcript]

Page 53: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

Succinct full-text self-indexes

Definition

A succinct full-text index is an index that uses space proportional to the size of the compressed text. Those indexes that contain sufficient information to recreate the original text are known as self-indexes. Some examples are the FM-index family and the LZ-index.

Page 54: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

Burrows-Wheeler Transform (BWT)

BWT

The FM-index family is based on the Burrows-Wheeler Transform (BWT). The BWT of a text $T$, $T^{bwt} = bwt(T)$, is a reversible transformation from strings to strings whose output is easier to compress by local optimization methods.

An important property of the transformation is: if $T[k] = T^{bwt}[i]$, then $T[k-1] = T^{bwt}[LF(i)]$, where

$LF(i) = C[T^{bwt}[i]] + Occ(T^{bwt}[i], i)$.

$C[c]$ is the total number of text characters that are alphabetically smaller than $c$.

$Occ(c, i)$ is the number of occurrences of character $c$ in the prefix $T^{bwt}[1, i]$.

This property permits navigating the text $T$ backwards.
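The LF property can be tried out on a small text. The sketch below is illustrative only: it builds the BWT by sorting all rotations (quadratic, not how an index would do it), assumes a terminator `$` that is smaller than every text symbol, and answers $Occ$ by scanning. The function names `bwt`, `make_lf` and `invert` are hypothetical.

```python
def bwt(text: str, end: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (for illustration only)."""
    t = text + end                        # assumed: `end` is smaller than all symbols
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)

def make_lf(tbwt: str):
    """Return LF(i) = C[c] + Occ(c, i) with c = tbwt[i], for 1-based i."""
    counts = {}
    for c in tbwt:
        counts[c] = counts.get(c, 0) + 1
    C, total = {}, 0
    for c in sorted(counts):              # C[c]: number of symbols smaller than c
        C[c] = total
        total += counts[c]

    def occ(c: str, i: int) -> int:       # occurrences of c in tbwt[1..i], by scanning
        return tbwt[:i].count(c)

    def lf(i: int) -> int:
        c = tbwt[i - 1]
        return C[c] + occ(c, i)

    return lf

def invert(tbwt: str) -> str:
    """Recover the text by following LF, i.e. by reading T backwards."""
    lf = make_lf(tbwt)
    i, out = 1, []                        # row 1 is the rotation starting with the terminator
    for _ in range(len(tbwt) - 1):
        out.append(tbwt[i - 1])           # symbol that precedes the current position in T
        i = lf(i)
    return "".join(reversed(out))

if __name__ == "__main__":
    s = bwt("banana")
    print(s, invert(s))
```

Each step of `invert` emits $T^{bwt}[i]$ and moves to $LF(i)$, so the text is produced from its last symbol to its first.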

Page 60: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

Succinct full-text self-indexes

Wavelet tree

The original FM-index solves $Occ$ by storing some directories over $S$ and compressing $S$. To give constant-time access to $S$, this approach requires space exponential in $\sigma$.

The wavelet tree $wt(S)$ built on $S$ is a binary tree, built on the alphabet symbols, such that the root represents the whole alphabet and each node stores the information telling which of its characters belong to the left or right child.
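A minimal pointer-based sketch of this structure follows. It is an illustration under simplifying assumptions, not the representation used in the talk: bitmaps are plain Python lists, the alphabet is split in half at each node, and rank on a bitmap is computed by scanning (a real implementation would attach a constant-time rank directory to each bitmap). The class and attribute names are hypothetical.

```python
class WaveletTree:
    """Wavelet tree over a sequence; each internal node stores one bitmap."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.n = len(seq)
        self.left = self.right = None
        self.bits = []
        if len(self.alphabet) <= 1:           # leaf: a single symbol, no bitmap needed
            return
        mid = len(self.alphabet) // 2
        self.left_syms = set(self.alphabet[:mid])
        # bit 0: symbol belongs to the left child, bit 1: to the right child
        self.bits = [0 if c in self.left_syms else 1 for c in seq]
        self.left = WaveletTree([c for c in seq if c in self.left_syms],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in self.left_syms],
                                 self.alphabet[mid:])

    def occ(self, c, i):
        """Occ(c, i): occurrences of c among the first i symbols."""
        node = self
        i = max(0, min(i, self.n))
        while node.left is not None:          # one bitmap rank per level
            bit = 0 if c in node.left_syms else 1
            i = node.bits[:i].count(bit)      # rank by scanning (assumed simplification)
            node = node.left if bit == 0 else node.right
        return i if node.alphabet == [c] else 0

if __name__ == "__main__":
    wt = WaveletTree("abracadabra")
    print(wt.occ("a", 11), wt.occ("b", 3))
```

A query descends one level per halving of the alphabet, which is where the $O(\log \sigma)$ time for $Occ$ comes from once each bitmap rank takes constant time.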

Page 62: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

Wavelet tree

[Figure: wavelet tree; image not available in this transcript]

Page 67: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

Relationship between $T^{bwt}$ and $T$

We could encode $S = bwt(T)$ within $nH_k(S) + o(n \log \sigma)$ bits, but how does this relate to $nH_k(T)$?

Lemma

Let $S = bwt(T)$, where $T[1, n]$ is a text over an alphabet of size $\sigma$. Then $H_1(S) \le 1 + H_k(T) \log \sigma + o(1)$ for any $k < (1 - \varepsilon) \log_\sigma n$ and any constant $0 < \varepsilon < 1$.
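To experiment with the two sides of the lemma on concrete texts, one needs to evaluate empirical entropies. The sketch below assumes the standard definition $H_k(S) = \frac{1}{n} \sum_{w} |w_S| H_0(w_S)$, where $w_S$ collects the symbols that follow each occurrence of the length-$k$ context $w$ in $S$. The function names `h0` and `hk` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def h0(seq) -> float:
    """Zero-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum(c / n * math.log2(n / c) for c in Counter(seq).values())

def hk(seq, k: int) -> float:
    """k-th order empirical entropy, using the k preceding symbols as context."""
    n = len(seq)
    if n == 0:
        return 0.0
    if k == 0:
        return h0(seq)
    followers = defaultdict(list)             # context w -> symbols following w
    for i in range(k, n):
        followers[tuple(seq[i - k:i])].append(seq[i])
    return sum(len(f) * h0(f) for f in followers.values()) / n

if __name__ == "__main__":
    print(h0("abab"), hk("abababab", 1), hk("aabb", 1))
```

With a BWT routine at hand, `hk(S, 1)` for $S = bwt(T)$ can be compared against `1 + hk(T, k) * math.log2(sigma)`. Because of the $o(1)$ term, very short inputs need not satisfy the inequality exactly.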

Page 68: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

Relationship between $T^{bwt}$ and $T$

Application

We can get at least the same results as the Run-Length FM-Index by compressing $bwt(T)$ using our structure.

We can implement the original FM-index ($5nH_k(T) + O(n\sigma \log \log n / \log_\sigma n + (\sigma/e)^{\sigma+3/2}\, n^\gamma \log_\sigma n \log \log n)$ bits) using $nH_k(T) \log \sigma + n + o(n)$ bits.

Page 70: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

Relationship between $wt(S)$ and $S$

$wt(S)$ takes $nH_0 + o(n \log \sigma)$ bits of space and permits answering $Occ$ queries in time $O(\log \sigma)$.

Many FM-index variants build on the wavelet tree:

SSA takes $nH_0 + o(n \log \sigma)$ bits of space.

RLFM-index takes $nH_k \log \sigma + o(n \log \sigma)$ bits.

AF-FM-index takes $nH_k + o(n \log \sigma)$ bits.

In all cases the bitmaps of $wt(S)$ are compressed to their $H_0$, but we can now compress them to $H_k$.

Is $k$-th order entropy preserved across a wavelet tree? (It is for $k = 0$.)
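The $k = 0$ case can be checked numerically: summing $|B| H_0(B)$ over the bitmaps $B$ of all wavelet tree nodes gives $nH_0(S)$. The sketch below assumes a wavelet tree that halves the sorted alphabet at each node. The helper names `n_h0` and `wavelet_bitmaps` are hypothetical.

```python
import math
from collections import Counter

def n_h0(seq) -> float:
    """|seq| * H_0(seq), i.e. the zero-order entropy of seq in total bits."""
    n = len(seq)
    return sum(c * math.log2(n / c) for c in Counter(seq).values())

def wavelet_bitmaps(seq, alphabet=None):
    """Yield the bitmap of every internal node of the wavelet tree of seq."""
    alphabet = sorted(set(seq)) if alphabet is None else alphabet
    if len(alphabet) <= 1:                    # leaf: no bitmap
        return
    mid = len(alphabet) // 2
    left_syms = set(alphabet[:mid])
    yield [0 if c in left_syms else 1 for c in seq]
    yield from wavelet_bitmaps([c for c in seq if c in left_syms], alphabet[:mid])
    yield from wavelet_bitmaps([c for c in seq if c not in left_syms], alphabet[mid:])

if __name__ == "__main__":
    s = "abracadabra"
    total = sum(n_h0(b) for b in wavelet_bitmaps(s))
    print(total, n_h0(s))
```

The two printed values agree up to floating-point error, since at each node the entropy of the sequence splits into the entropy of its bitmap plus the entropies of the two child sequences.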

Page 71: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

Lemma

The ratio between $H_k(wt(S))$ and $H_k(S)$ can be at least $\Omega(\log k)$. More precisely, $H_k(wt(S))/H_k(S)$ can be $\Omega(\log k)$ and $H_k(S)/H_k(wt(S))$ can be $\Omega(n/(k \log n))$.

Consequence

Applying our structure over the bitmaps of the wavelet tree does not perfectly translate into $nH_k(S)$ overall space, as there is a penalty factor of at least $\log k$ in the worst case. But in the best case, it can be much better than $nH_k(S)$.

Page 72: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

Summary

We presented a scheme based on $k$-th order modeling plus statistical encoding to convert any succinct data structure on sequences into a compressed data structure.

This simplifies and slightly improves previous work.

We presented a scheme to append symbols to the original sequence within the same space complexity and with constant amortized cost per appended symbol.

We found relationships between the entropies of two fundamental structures used for compressed text indexing.

Page 76: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing

Summary

Future work

Making our structure fully dynamic.

Better understanding how the entropies evolve upon transformations such as $bwt$ or $wt$.

Testing our structure in practice.

Currently working on another way to solve the same problem. That would permit full dynamism using recent work (see next talk).

Page 80: Statistical Encoding of Succinct Data Structuresstelo/cpm/cpm06/24-gonzalez.pdfData structures Decoding Algorithm Space requirement Supporting appends 3 Application to full-text indexing


Summary

Thank you!!
