Index construction: Compression of documents
Paolo Ferragina, Dipartimento di Informatica
Università di Pisa
Reading: Managing Gigabytes, pp. 21-36, 52-56, 74-79
Raw docs are needed
Various Approaches
Statistical coding: Huffman codes, Arithmetic codes
Dictionary coding: LZ77, LZ78, LZSS, …; gzip, zippy, snappy, …
Text transforms: Burrows-Wheeler Transform; bzip
Basics of Data Compression
Paolo Ferragina, Dipartimento di Informatica
Università di Pisa
Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol,
e.g. a = 1, b = 01, c = 101, d = 011.
What if you get the sequence 1011? It can be parsed as 1·011 = "ad" or as 101·1 = "ca", so this code is ambiguous.
A code is uniquely decodable if every encoded sequence can be decomposed into codewords in exactly one way.
Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one,
e.g. a = 0, b = 100, c = 101, d = 11.
It can be viewed as a binary trie whose leaves are the symbols: each codeword spells the root-to-leaf path (0 = left branch, 1 = right branch).
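Because no codeword is a prefix of another, a prefix code can be decoded greedily in one left-to-right pass. A minimal sketch (mine, not from the slides), using the example code above:

```python
# Greedy decoder for a prefix code: the first codeword matching
# the current buffer is necessarily the right one.
CODE = {'a': '0', 'b': '100', 'c': '101', 'd': '11'}  # example from the slide
DECODE = {w: s for s, w in CODE.items()}

def decode(bits: str) -> str:
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in DECODE:          # a full codeword has been read
            out.append(DECODE[buf])
            buf = ''
    assert buf == '', "dangling bits: not a sequence of codewords"
    return ''.join(out)

print(decode('0' + '100' + '11'))  # -> 'abd'
```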
Average Length
For a code C with codeword lengths L[s], the average length is defined as
$L_a(C) = \sum_{s \in S} p(s) \cdot L[s]$
Example: p(A) = .7 with codeword 0, p(B) = p(C) = p(D) = .1 with 3-bit codewords 1xx:
$L_a = .7 \cdot 1 + .3 \cdot 3 = 1.6$ bits (Huffman achieves 1.5 bits).
We say that a prefix code C is optimal if $L_a(C) \le L_a(C')$ for all prefix codes C′.
Entropy (Shannon, 1948)
For a source S emitting symbols with probability p(s), the self-information of s is
$i(s) = \log_2 \frac{1}{p(s)}$ bits.
Lower probability ⇒ higher information.
Entropy is the weighted average of i(s):
$H(S) = \sum_{s \in S} p(s) \cdot \log_2 \frac{1}{p(s)}$
The 0-th order empirical entropy of a string T replaces p(s) with the empirical frequency occ(s)/|T|:
$H_0(T) = \sum_{s \in \Sigma} \frac{occ(s)}{|T|} \cdot \log_2 \frac{|T|}{occ(s)}$
Note that $0 \le H \le \log_2 |\Sigma|$: H → 0 for a skewed distribution, H is maximal for the uniform distribution.
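As a quick check of these formulas, a few lines of Python (a sketch, not from the slides) compute $H_0(T)$ from symbol counts:

```python
import math
from collections import Counter

def empirical_entropy(T: str) -> float:
    """0-th order empirical entropy H0(T), in bits per symbol."""
    n = len(T)
    return sum(occ / n * math.log2(n / occ) for occ in Counter(T).values())

# Skewed vs. uniform distribution: H0 -> 0 vs. H0 = log2 |alphabet|
print(empirical_entropy('aaaaaaaaab'))    # ~0.469
print(empirical_entropy('abcdabcdabcd'))  # = 2.0
```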
Performance: Compression ratio
Compression ratio = #bits in output / #bits in input.
Compression performance: we relate entropy to the compression ratio.
Shannon (in theory): $H(S) = \sum_s p(s) \cdot i(s)$; in practice, the average codeword length is $L_a = \sum_s p(s) \cdot |c(s)|$.
Empirically: compare $|T| \cdot H_0(T)$ against $|C(T)|$, the size in bits of the compressed text.
Example: p(A) = .7, p(B) = p(C) = p(D) = .1 gives H ≈ 1.36 bits, while Huffman achieves ≈ 1.5 bits per symbol.
An optimal code is surely one that…
Document Compression
Huffman coding
Huffman Codes
Invented by Huffman as a class assignment in the early '50s.
Used in most compression algorithms: gzip, bzip, jpeg (as an option), fax compression, …
Properties:
- Generates optimal prefix codes
- Cheap to encode and decode
- $L_a(\mathrm{Huff}) = H$ if probabilities are powers of 2
- Otherwise, $L_a(\mathrm{Huff}) < H + 1$: less than 1 extra bit per symbol on average!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Huffman repeatedly merges the two least-probable nodes: a(.1) + b(.2) → (.3); (.3) + c(.2) → (.5); (.5) + d(.5) → (1).
Resulting codes: a = 000, b = 001, c = 01, d = 1.
There are $2^{n-1}$ "equivalent" Huffman trees, since the 0/1 labels of each internal node can be swapped.
What about ties (and thus, tree depth)?
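A compact way to build the tree of the running example (my sketch, not the slides' code; ties are broken by insertion order via the counter):

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Optimal prefix codes by repeatedly merging the two
    least-probable subtrees (Huffman's algorithm)."""
    tiebreak = count()  # avoids comparing dicts when probabilities tie
    heap = [(p, next(tiebreak), {s: ''}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # lightest subtree  -> prepend bit 0
        p1, _, c1 = heapq.heappop(heap)   # second lightest   -> prepend bit 1
        merged = {s: '0' + w for s, w in c0.items()}
        merged.update({s: '1' + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

print(huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
# {'d': '0', 'c': '10', 'a': '110', 'b': '111'} -- an equivalent tree:
# codeword lengths match the slide (a, b -> 3 bits, c -> 2, d -> 1)
```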
Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch indicated by each received bit; when a leaf is reached, output its symbol and return to the root.
Example, with the tree of the running example: "abc" encodes to 000·001·01 = 00000101, and the stream 00000101101001 parses as 000·001·01·1·01·001, whose last three codewords decode to "dcb".
Huffman in practice
The compressed file of n symbols consists of:
- Preamble: tree encoding + symbols in the leaves, taking $\Theta(|\Sigma| \log |\Sigma|)$ bits
- Body: the compressed text of n symbols, taking at least $nH$ bits and at most $nH + n$ bits
The extra +n is bad for very skewed distributions, namely ones for which H → 0.
Example: p(a) = 1/n, p(b) = (n-1)/n.
There are better choices. For T = aaaaaaaaab:
Huffman = {a,b}-encoding + 10 bits, whereas
RLE = ⟨a,9⟩⟨b,1⟩ = γ(9) + γ(1) + {a,b}-encoding = 0001001 1 + {a,b}-encoding, i.e. 8 bits + {a,b}-encoding (see the sketch after this slide).
So RLE saves 2 bits over Huffman, because it is not a prefix code: it does not map each symbol to a fixed bit string as Huffman does; the mapping may change along the text and, moreover, it can use fractions of a bit per symbol.
Fax, bzip, … use RLE.
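The 7-bit pattern 0001001 matches the Elias γ-code of 9 ($\lfloor \log_2 9 \rfloor$ zeros, then 9 in binary), so a sketch of the run-length idea, assuming γ-codes for the run lengths (the function names are mine):

```python
def gamma(n: int) -> str:
    """Elias gamma code: (|bin(n)| - 1) zeros, then n in binary."""
    b = bin(n)[2:]
    return '0' * (len(b) - 1) + b

def rle_gamma(T: str) -> str:
    """Encode each maximal run by the gamma code of its length
    (the run symbols would go in the preamble)."""
    out, i = [], 0
    while i < len(T):
        j = i
        while j < len(T) and T[j] == T[i]:
            j += 1
        out.append(gamma(j - i))
        i = j
    return ' '.join(out)

print(gamma(9), gamma(1))       # 0001001 1
print(rle_gamma('aaaaaaaaab'))  # 0001001 1 -> 8 bits vs. Huffman's 10
```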
Idea on Huffman?
Goal: reduce the impact of the +1 bit.
Solution: divide the text into blocks of k symbols. The +1 is then spread over each block, so the loss is 1/k per symbol.
Caution: the alphabet becomes the set of k-grams, of size $|\Sigma|^k$, so the tree, and hence the preamble, gets larger. At the limit, the preamble contains a single k-gram equal to the whole input text, and the compressed text is 1 bit only.
This means no compression at all!
Document Compression
Arithmetic coding
Introduction
It uses "fractional" parts of bits!!
Achieves < nH(T) + 2 bits, vs. < nH(T) + n for Huffman.
Used in JPEG/MPEG (as an option) and in bzip.
More time-costly than Huffman, but integer implementations are not too bad.
In practice its output is within about 0.02·n bits of this ideal performance.
Symbol interval
Assign each symbol an interval within [0, 1), of length equal to its probability. E.g. with p(a) = .2, p(b) = .5, p(c) = .3:
cum[a] = 0, so a ↦ [0, .2)
cum[b] = p(a) = .2, so b ↦ [.2, .7)
cum[c] = p(a) + p(b) = .7, so c ↦ [.7, 1)
The interval for a particular symbol is called the symbol interval (e.g. for b it is [.2, .7)).
Sequence interval
Coding the message sequence bac:
- Start from [0, 1). Symbol b narrows it to its symbol interval [.2, .7), of size (1-0)·p(b) = .5. Its sub-intervals are a: [.2, .3), b: [.3, .55), c: [.55, .7), of sizes (0.7-0.2)·0.2 = 0.1, (0.7-0.2)·0.5 = 0.25, (0.7-0.2)·0.3 = 0.15.
- Symbol a narrows it to [.2, .3), of size .1. Its sub-intervals are a: [.2, .22), b: [.22, .27), c: [.27, .3), of sizes (0.3-0.2)·0.2 = 0.02, (0.3-0.2)·0.5 = 0.05, (0.3-0.2)·0.3 = 0.03.
- Symbol c narrows it to [.27, .3), of size .03.
The final sequence interval is [.27, .3).
The algorithm
To code a sequence of symbols $T = T_1 \ldots T_n$ with probabilities p(s), compute the interval $[l_i, l_i + s_i)$ incrementally:
$l_0 = 0, \quad s_0 = 1$
$l_i = l_{i-1} + s_{i-1} \cdot cum[T_i]$
$s_i = s_{i-1} \cdot p(T_i)$
Example (last step of coding bac, with p(a) = .2, p(b) = .5, p(c) = .3): the first two steps give $l_2 = 0.2$ and $s_2 = 0.1$, so for the final symbol c:
$s_3 = 0.1 \cdot 0.3 = 0.03$
$l_3 = 0.2 + 0.1 \cdot (0.2 + 0.5) = 0.27$
and the final interval is [0.27, 0.3).
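The update rule is easy to check in code; this sketch (mine, not from the slides) reproduces the interval [.27, .3) for bac, using Fraction to avoid floating-point rounding:

```python
from fractions import Fraction as F

P = {'a': F(2, 10), 'b': F(5, 10), 'c': F(3, 10)}
CUM = {'a': F(0), 'b': F(2, 10), 'c': F(7, 10)}  # cum[s] = sum of p over preceding symbols

def interval(T: str):
    """Return (l_n, s_n): the sequence interval [l_n, l_n + s_n) of T."""
    l, s = F(0), F(1)
    for ch in T:
        l = l + s * CUM[ch]  # l_i = l_{i-1} + s_{i-1} * cum[T_i]
        s = s * P[ch]        # s_i = s_{i-1} * p(T_i)
    return l, s

l, s = interval('bac')
print(float(l), float(l + s))  # 0.27 0.3
```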
The algorithm
Each symbol narrows the interval by a factor $p(T_i)$:
$l_0 = 0, \quad s_0 = 1$
$l_i = l_{i-1} + s_{i-1} \cdot cum[T_i], \quad s_i = s_{i-1} \cdot p(T_i)$
The final interval size is therefore
$s_n = \prod_{i=1}^{n} p(T_i)$
Sequence interval: $[l_n, l_n + s_n)$. To encode, take a number inside it.
Decoding Example
Decoding the number .49, knowing the message has length 3:
- .49 falls in b's symbol interval [.2, .7) → output b. Sub-intervals of [.2, .7): a: [.2, .3), b: [.3, .55), c: [.55, .7).
- .49 falls in [.3, .55) → output b. Sub-intervals of [.3, .55): a: [.3, .35), b: [.35, .475), c: [.475, .55).
- .49 falls in [.475, .55) → output c.
The message is bbc.
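Equivalently, the decoder can rescale the number back to [0, 1) after each symbol instead of subdividing intervals. A sketch of this variant (mine; it assumes the message length n is known, as in the slide):

```python
from fractions import Fraction as F

P = {'a': F(2, 10), 'b': F(5, 10), 'c': F(3, 10)}
CUM = {'a': F(0), 'b': F(2, 10), 'c': F(7, 10)}

def decode(x: F, n: int) -> str:
    """Decode n symbols from a number x inside the sequence interval."""
    out = []
    for _ in range(n):
        for s in P:  # find the symbol interval [cum[s], cum[s]+p(s)) containing x
            if CUM[s] <= x < CUM[s] + P[s]:
                break
        out.append(s)
        x = (x - CUM[s]) / P[s]  # rescale that interval back to [0, 1)
    return ''.join(out)

print(decode(F(49, 100), 3))  # bbc
```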
How do we encode that number?
If $x = v/2^k$ (a dyadic fraction) then the encoding is bin(v) over k digits (possibly padded with 0s in front):
3/4 = .11, 7/16 = .0111,
whereas 1/3 = .0101… is not dyadic and has an infinite expansion.
How do we encode that number?
Binary fractional representation:
$x = .b_1 b_2 b_3 b_4 b_5 \ldots = \frac{b_1}{2} + \frac{b_2}{4} + \frac{b_3}{8} + \frac{b_4}{16} + \cdots$
FractionalEncode(x):
1. x = 2 · x
2. If x < 1, output 0
3. else { output 1; x = x - 1 }
Example, x = 1/3:
2 · (1/3) = 2/3 < 1, output 0
2 · (2/3) = 4/3 > 1, output 1; 4/3 - 1 = 1/3
and the pattern repeats: 1/3 = .010101… The bits are generated incrementally.
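A direct transcription of FractionalEncode (a sketch; a real coder would interleave this bit emission with the interval updates):

```python
from fractions import Fraction as F

def fractional_encode(x: F, nbits: int) -> str:
    """Emit the first nbits of the binary expansion of x in [0, 1)."""
    bits = []
    for _ in range(nbits):
        x = 2 * x            # shift the next binary digit before the point
        if x < 1:
            bits.append('0')
        else:
            bits.append('1')
            x = x - 1
    return ''.join(bits)

print(fractional_encode(F(1, 3), 8))   # 01010101
print(fractional_encode(F(7, 16), 4))  # 0111
```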
Which number do we encode?
Take the midpoint $x = l_n + s_n/2$ of the sequence interval, and truncate its binary expansion $x = .b_1 b_2 b_3 \ldots$ to the first $d = \lceil \log_2 (2/s_n) \rceil$ bits.
Truncation gets a smaller number… how much smaller? Dropping all bits after position d decreases x by at most
$\sum_{i=d+1}^{\infty} b_i \, 2^{-i} \;\le\; \sum_{i=d+1}^{\infty} 2^{-i} \;=\; 2^{-d} \;=\; 2^{-\lceil \log_2 (2/s_n) \rceil} \;\le\; s_n/2$
so the truncated number is still at least $l_n$, i.e. it stays inside $[l_n, l_n + s_n)$.
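Continuing the running example (my arithmetic, not on the slides): for bac we have $s_n = .03$, so $d = \lceil \log_2(2/.03) \rceil = \lceil 6.06 \rceil = 7$ bits; the midpoint is $x = .27 + .015 = .285 = .0100100\ldots_2$, and truncating to 7 bits gives 0100100, i.e. $.28125$, which indeed lies in $[.27, .3)$.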
Bound on code length
Theorem: For a text T of length n, the Arithmetic encoder generates at most
$\lceil \log_2 (2/s_n) \rceil < 1 + \log_2 (2/s_n) = 1 + (1 - \log_2 s_n)$
$= 2 - \log_2 \prod_{i=1}^{n} p(T_i) = 2 - \log_2 \prod_{\sigma} p(\sigma)^{occ(\sigma)} = 2 - \sum_{\sigma} occ(\sigma) \cdot \log_2 p(\sigma)$
$\approx 2 + \sum_{\sigma} (n \cdot p(\sigma)) \cdot \log_2 (1/p(\sigma)) = 2 + n\,H(T)$ bits
(using $occ(\sigma) \approx n \cdot p(\sigma)$).
Example: T = acabc gives $s_n = p(a) \cdot p(c) \cdot p(a) \cdot p(b) \cdot p(c) = p(a)^2 \, p(b) \, p(c)^2$.
Document Compression
Dictionary-based compressors
LZ77
Algorithm's step:
- Output ⟨dist, len, next-char⟩
- Advance by len + 1
A buffer "window" of fixed length slides over the text; the dictionary consists of all substrings starting inside the window.
Example from the slide, over the text a a c a a c a b c a a a a a a: the encoder emits triples such as ⟨6,3,a⟩ and ⟨3,4,c⟩, each meaning "copy len characters starting dist positions back, then append the next char".
LZ77 Decoding
The decoder keeps the same dictionary window as the encoder: for each triple it finds the referenced substring and appends a copy of it.
What if len > dist (the copy overlaps the text still to be written)? E.g. seen = abcd, next codeword is ⟨2,9,e⟩. Simply copy left to right starting at the cursor:
for (i = 0; i < len; i++) out[cursor + i] = out[cursor - dist + i];
The output is correct: abcdcdcdcdcdce
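The same left-to-right copy in Python, run on the slide's overlapping example (a sketch; encoding literals as ⟨0, 0, char⟩ is my convention, not gzip's format):

```python
def lz77_decode(triples) -> str:
    """Decode LZ77 triples; copying left to right handles len > dist."""
    out = []
    for dist, length, nxt in triples:
        start = len(out) - dist
        for i in range(length):
            out.append(out[start + i])  # may read chars written in this very step
        out.append(nxt)
    return ''.join(out)

# "abcd" emitted as literals, then the overlapping triple (2, 9, 'e')
print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 0, 'd'), (2, 9, 'e')]))
# -> abcdcdcdcdcdce
```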
LZ77 optimizations used by gzip
LZSS: output one of the following formats: ⟨0, position, length⟩ or ⟨1, char⟩; the second format is typically used when length < 3.
Special greedy parsing: possibly use a shorter match so that the next match is better.
A hash table speeds up the search for matching triplets.
Triples are then coded with Huffman codes.
You find this at: www.gzip.org/zlib/
Google’s solution