Index construction: Compression of documents
Paolo Ferragina, Dipartimento di Informatica
Università di Pisa
Reading: Managing Gigabytes, pp. 21-36, 52-56, 74-79
Raw docs are needed
Various Approaches
Statistical coding: Huffman codes, Arithmetic codes
Dictionary coding: LZ77, LZ78, LZSS, …; gzip, zippy, snappy, …
Text transforms: Burrows-Wheeler Transform; bzip
Basics of Data Compression
Paolo Ferragina, Dipartimento di Informatica
Università di Pisa
Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol,
e.g. a = 1, b = 01, c = 101, d = 011.
What if you get the sequence 1011? It can be parsed as 1·011 = "ad" or as 101·1 = "ca", so this code is ambiguous.
A code is uniquely decodable if every encoded sequence can be decomposed into codewords in exactly one way.
Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another one,
e.g. a = 0, b = 100, c = 101, d = 11.
It can be viewed as a binary trie whose leaves are the symbols: each codeword spells the root-to-leaf path (0 = left branch, 1 = right branch).
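Because no codeword is a prefix of another, a prefix code can be decoded greedily in one left-to-right pass. A minimal sketch (mine, not from the slides), using the example code above:

```python
# Greedy decoder for a prefix code: the first codeword matching
# the current buffer is necessarily the right one.
CODE = {'a': '0', 'b': '100', 'c': '101', 'd': '11'}  # example from the slide
DECODE = {w: s for s, w in CODE.items()}

def decode(bits: str) -> str:
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in DECODE:          # a full codeword has been read
            out.append(DECODE[buf])
            buf = ''
    assert buf == '', "dangling bits: not a sequence of codewords"
    return ''.join(out)

print(decode('0' + '100' + '11'))  # -> 'abd'
```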
Average Length
For a code C with codeword lengths L[s], the average length is defined as
$L_a(C) = \sum_{s \in S} p(s) \cdot L[s]$
Example: p(A) = .7 with codeword 0, p(B) = p(C) = p(D) = .1 with 3-bit codewords 1xx:
$L_a = .7 \cdot 1 + .3 \cdot 3 = 1.6$ bits (Huffman achieves 1.5 bits).
We say that a prefix code C is optimal if $L_a(C) \le L_a(C')$ for all prefix codes C′.
Entropy (Shannon, 1948)
For a source S emitting symbols with probability p(s), the self-information of s is
$i(s) = \log_2 \frac{1}{p(s)}$ bits.
Lower probability ⇒ higher information.
Entropy is the weighted average of i(s):
$H(S) = \sum_{s \in S} p(s) \cdot \log_2 \frac{1}{p(s)}$
The 0-th order empirical entropy of a string T replaces p(s) with the empirical frequency occ(s)/|T|:
$H_0(T) = \sum_{s \in \Sigma} \frac{occ(s)}{|T|} \cdot \log_2 \frac{|T|}{occ(s)}$
Note that $0 \le H \le \log_2 |\Sigma|$: H → 0 for a skewed distribution, H is maximal for the uniform distribution.
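As a quick check of these formulas, a few lines of Python (a sketch, not from the slides) compute $H_0(T)$ from symbol counts:

```python
import math
from collections import Counter

def empirical_entropy(T: str) -> float:
    """0-th order empirical entropy H0(T), in bits per symbol."""
    n = len(T)
    return sum(occ / n * math.log2(n / occ) for occ in Counter(T).values())

# Skewed vs. uniform distribution: H0 -> 0 vs. H0 = log2 |alphabet|
print(empirical_entropy('aaaaaaaaab'))    # ~0.469
print(empirical_entropy('abcdabcdabcd'))  # = 2.0
```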
Performance: Compression ratio
Compression ratio = #bits in output / #bits in input.
Compression performance: we relate entropy to the compression ratio.
Shannon (in theory): $H(S) = \sum_s p(s) \cdot i(s)$; in practice, the average codeword length is $L_a = \sum_s p(s) \cdot |c(s)|$.
Empirically: compare $|T| \cdot H_0(T)$ against $|C(T)|$, the size in bits of the compressed text.
Example: p(A) = .7, p(B) = p(C) = p(D) = .1 gives H ≈ 1.36 bits, while Huffman achieves ≈ 1.5 bits per symbol.
An optimal code is surely one that…
Document Compression
Huffman coding
Huffman Codes
Invented by Huffman as a class assignment in the early '50s.
Used in most compression algorithms: gzip, bzip, jpeg (as an option), fax compression, …
Properties:
- Generates optimal prefix codes
- Cheap to encode and decode
- $L_a(\mathrm{Huff}) = H$ if probabilities are powers of 2
- Otherwise, $L_a(\mathrm{Huff}) < H + 1$: less than 1 extra bit per symbol on average!
Running Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Huffman repeatedly merges the two least-probable nodes: a(.1) + b(.2) → (.3); (.3) + c(.2) → (.5); (.5) + d(.5) → (1).
Resulting codes: a = 000, b = 001, c = 01, d = 1.
There are $2^{n-1}$ "equivalent" Huffman trees, since the 0/1 labels of each internal node can be swapped.
What about ties (and thus, tree depth)?
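A compact way to build the tree of the running example (my sketch, not the slides' code; ties are broken by insertion order via the counter):

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Optimal prefix codes by repeatedly merging the two
    least-probable subtrees (Huffman's algorithm)."""
    tiebreak = count()  # avoids comparing dicts when probabilities tie
    heap = [(p, next(tiebreak), {s: ''}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # lightest subtree  -> prepend bit 0
        p1, _, c1 = heapq.heappop(heap)   # second lightest   -> prepend bit 1
        merged = {s: '0' + w for s, w in c0.items()}
        merged.update({s: '1' + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

print(huffman_codes({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
# {'d': '0', 'c': '10', 'a': '110', 'b': '111'} -- an equivalent tree:
# codeword lengths match the slide (a, b -> 3 bits, c -> 2, d -> 1)
```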
Encoding and Decoding
Encoding: emit the root-to-leaf path leading to the symbol to be encoded.
Decoding: start at the root and take the branch indicated by each received bit; when a leaf is reached, output its symbol and return to the root.
Example, with the tree of the running example: "abc" encodes to 000·001·01 = 00000101, and the stream 00000101101001 parses as 000·001·01·1·01·001, whose last three codewords decode to "dcb".
Huffman in practice
The compressed file of n symbols consists of:
- Preamble: tree encoding + symbols in the leaves, taking $\Theta(|\Sigma| \log |\Sigma|)$ bits
- Body: the compressed text of n symbols, taking at least $nH$ bits and at most $nH + n$ bits
The extra +n is bad for very skewed distributions, namely ones for which H → 0.
Example: p(a) = 1/n, p(b) = (n-1)/n.
There are better choices. For T = aaaaaaaaab:
Huffman = {a,b}-encoding + 10 bits, whereas
RLE = ⟨a,9⟩⟨b,1⟩ = γ(9) + γ(1) + {a,b}-encoding = 0001001 1 + {a,b}-encoding, i.e. 8 bits + {a,b}-encoding (see the sketch after this slide).
So RLE saves 2 bits over Huffman, because it is not a prefix code: it does not map each symbol to a fixed bit string as Huffman does; the mapping may change along the text and, moreover, it can use fractions of a bit per symbol.
Fax, bzip, … use RLE.
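The 7-bit pattern 0001001 matches the Elias γ-code of 9 ($\lfloor \log_2 9 \rfloor$ zeros, then 9 in binary), so a sketch of the run-length idea, assuming γ-codes for the run lengths (the function names are mine):

```python
def gamma(n: int) -> str:
    """Elias gamma code: (|bin(n)| - 1) zeros, then n in binary."""
    b = bin(n)[2:]
    return '0' * (len(b) - 1) + b

def rle_gamma(T: str) -> str:
    """Encode each maximal run by the gamma code of its length
    (the run symbols would go in the preamble)."""
    out, i = [], 0
    while i < len(T):
        j = i
        while j < len(T) and T[j] == T[i]:
            j += 1
        out.append(gamma(j - i))
        i = j
    return ' '.join(out)

print(gamma(9), gamma(1))       # 0001001 1
print(rle_gamma('aaaaaaaaab'))  # 0001001 1 -> 8 bits vs. Huffman's 10
```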
Idea on Huffman?
Goal: reduce the impact of the +1 bit.
Solution: divide the text into blocks of k symbols. The +1 is then spread over each block, so the loss is 1/k per symbol.
Caution: the alphabet becomes the set of k-grams, of size $|\Sigma|^k$, so the tree, and hence the preamble, gets larger. At the limit, the preamble contains a single k-gram equal to the whole input text, and the compressed text is 1 bit only.
This means no compression at all!
Document Compression
Arithmetic coding
Introduction
It uses "fractional" parts of bits!!
Achieves < nH(T) + 2 bits, vs. < nH(T) + n for Huffman.
Used in JPEG/MPEG (as an option) and in bzip.
More time-costly than Huffman, but integer implementations are not too bad.
In practice its output is within about 0.02·n bits of this ideal performance.
Symbol interval
Assign each symbol an interval within [0, 1), of length equal to its probability. E.g. with p(a) = .2, p(b) = .5, p(c) = .3:
cum[a] = 0, so a ↦ [0, .2)
cum[b] = p(a) = .2, so b ↦ [.2, .7)
cum[c] = p(a) + p(b) = .7, so c ↦ [.7, 1)
The interval for a particular symbol is called the symbol interval (e.g. for b it is [.2, .7)).
Sequence interval
Coding the message sequence bac:
- Start from [0, 1). Symbol b narrows it to its symbol interval [.2, .7), of size (1-0)·p(b) = .5. Its sub-intervals are a: [.2, .3), b: [.3, .55), c: [.55, .7), of sizes (0.7-0.2)·0.2 = 0.1, (0.7-0.2)·0.5 = 0.25, (0.7-0.2)·0.3 = 0.15.
- Symbol a narrows it to [.2, .3), of size .1. Its sub-intervals are a: [.2, .22), b: [.22, .27), c: [.27, .3), of sizes (0.3-0.2)·0.2 = 0.02, (0.3-0.2)·0.5 = 0.05, (0.3-0.2)·0.3 = 0.03.
- Symbol c narrows it to [.27, .3), of size .03.
The final sequence interval is [.27, .3).
The algorithm
To code a sequence of symbols $T = T_1 \ldots T_n$ with probabilities p(s), compute the interval $[l_i, l_i + s_i)$ incrementally:
$l_0 = 0, \quad s_0 = 1$
$l_i = l_{i-1} + s_{i-1} \cdot cum[T_i]$
$s_i = s_{i-1} \cdot p(T_i)$
Example (last step of coding bac, with p(a) = .2, p(b) = .5, p(c) = .3): the first two steps give $l_2 = 0.2$ and $s_2 = 0.1$, so for the final symbol c:
$s_3 = 0.1 \cdot 0.3 = 0.03$
$l_3 = 0.2 + 0.1 \cdot (0.2 + 0.5) = 0.27$
and the final interval is [0.27, 0.3).
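The update rule is easy to check in code; this sketch (mine, not from the slides) reproduces the interval [.27, .3) for bac, using Fraction to avoid floating-point rounding:

```python
from fractions import Fraction as F

P = {'a': F(2, 10), 'b': F(5, 10), 'c': F(3, 10)}
CUM = {'a': F(0), 'b': F(2, 10), 'c': F(7, 10)}  # cum[s] = sum of p over preceding symbols

def interval(T: str):
    """Return (l_n, s_n): the sequence interval [l_n, l_n + s_n) of T."""
    l, s = F(0), F(1)
    for ch in T:
        l = l + s * CUM[ch]  # l_i = l_{i-1} + s_{i-1} * cum[T_i]
        s = s * P[ch]        # s_i = s_{i-1} * p(T_i)
    return l, s

l, s = interval('bac')
print(float(l), float(l + s))  # 0.27 0.3
```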
The algorithm
Each symbol narrows the interval by a factor $p(T_i)$:
$l_0 = 0, \quad s_0 = 1$
$l_i = l_{i-1} + s_{i-1} \cdot cum[T_i], \quad s_i = s_{i-1} \cdot p(T_i)$
The final interval size is therefore
$s_n = \prod_{i=1}^{n} p(T_i)$
Sequence interval: $[l_n, l_n + s_n)$. To encode, take a number inside it.
Decoding Example
Decoding the number .49, knowing the message has length 3:
- .49 falls in b's symbol interval [.2, .7) → output b. Sub-intervals of [.2, .7): a: [.2, .3), b: [.3, .55), c: [.55, .7).
- .49 falls in [.3, .55) → output b. Sub-intervals of [.3, .55): a: [.3, .35), b: [.35, .475), c: [.475, .55).
- .49 falls in [.475, .55) → output c.
The message is bbc.
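Equivalently, the decoder can rescale the number back to [0, 1) after each symbol instead of subdividing intervals. A sketch of this variant (mine; it assumes the message length n is known, as in the slide):

```python
from fractions import Fraction as F

P = {'a': F(2, 10), 'b': F(5, 10), 'c': F(3, 10)}
CUM = {'a': F(0), 'b': F(2, 10), 'c': F(7, 10)}

def decode(x: F, n: int) -> str:
    """Decode n symbols from a number x inside the sequence interval."""
    out = []
    for _ in range(n):
        for s in P:  # find the symbol interval [cum[s], cum[s]+p(s)) containing x
            if CUM[s] <= x < CUM[s] + P[s]:
                break
        out.append(s)
        x = (x - CUM[s]) / P[s]  # rescale that interval back to [0, 1)
    return ''.join(out)

print(decode(F(49, 100), 3))  # bbc
```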
How do we encode that number?
If $x = v/2^k$ (a dyadic fraction) then the encoding is bin(v) over k digits (possibly padded with 0s in front):
3/4 = .11, 7/16 = .0111,
whereas 1/3 = .0101… is not dyadic and has an infinite expansion.
How do we encode that number?
Binary fractional representation:
$x = .b_1 b_2 b_3 b_4 b_5 \ldots = \frac{b_1}{2} + \frac{b_2}{4} + \frac{b_3}{8} + \frac{b_4}{16} + \cdots$
FractionalEncode(x):
1. x = 2 · x
2. If x < 1, output 0
3. else { output 1; x = x - 1 }
Example, x = 1/3:
2 · (1/3) = 2/3 < 1, output 0
2 · (2/3) = 4/3 > 1, output 1; 4/3 - 1 = 1/3
and the pattern repeats: 1/3 = .010101… The bits are generated incrementally.
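A direct transcription of FractionalEncode (a sketch; a real coder would interleave this bit emission with the interval updates):

```python
from fractions import Fraction as F

def fractional_encode(x: F, nbits: int) -> str:
    """Emit the first nbits of the binary expansion of x in [0, 1)."""
    bits = []
    for _ in range(nbits):
        x = 2 * x            # shift the next binary digit before the point
        if x < 1:
            bits.append('0')
        else:
            bits.append('1')
            x = x - 1
    return ''.join(bits)

print(fractional_encode(F(1, 3), 8))   # 01010101
print(fractional_encode(F(7, 16), 4))  # 0111
```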
Which number do we encode?
Take the midpoint $x = l_n + s_n/2$ of the sequence interval, and truncate its binary expansion $x = .b_1 b_2 b_3 \ldots$ to the first $d = \lceil \log_2 (2/s_n) \rceil$ bits.
Truncation gets a smaller number… how much smaller? Dropping all bits after position d decreases x by at most
$\sum_{i=d+1}^{\infty} b_i \, 2^{-i} \;\le\; \sum_{i=d+1}^{\infty} 2^{-i} \;=\; 2^{-d} \;=\; 2^{-\lceil \log_2 (2/s_n) \rceil} \;\le\; s_n/2$
so the truncated number is still at least $l_n$, i.e. it stays inside $[l_n, l_n + s_n)$.
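Continuing the running example (my arithmetic, not on the slides): for bac we have $s_n = .03$, so $d = \lceil \log_2(2/.03) \rceil = \lceil 6.06 \rceil = 7$ bits; the midpoint is $x = .27 + .015 = .285 = .0100100\ldots_2$, and truncating to 7 bits gives 0100100, i.e. $.28125$, which indeed lies in $[.27, .3)$.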
Bound on code length
Theorem: For a text T of length n, the Arithmetic encoder generates at most
$\lceil \log_2 (2/s_n) \rceil < 1 + \log_2 (2/s_n) = 1 + (1 - \log_2 s_n)$
$= 2 - \log_2 \prod_{i=1}^{n} p(T_i) = 2 - \log_2 \prod_{\sigma} p(\sigma)^{occ(\sigma)} = 2 - \sum_{\sigma} occ(\sigma) \cdot \log_2 p(\sigma)$
$\approx 2 + \sum_{\sigma} (n \cdot p(\sigma)) \cdot \log_2 (1/p(\sigma)) = 2 + n\,H(T)$ bits
(using $occ(\sigma) \approx n \cdot p(\sigma)$).
Example: T = acabc gives $s_n = p(a) \cdot p(c) \cdot p(a) \cdot p(b) \cdot p(c) = p(a)^2 \, p(b) \, p(c)^2$.
Document Compression
Dictionary-based compressors
LZ77
Algorithm's step:
- Output ⟨dist, len, next-char⟩
- Advance by len + 1
A buffer "window" of fixed length slides over the text; the dictionary consists of all substrings starting inside the window.
Example from the slide, over the text a a c a a c a b c a a a a a a: the encoder emits triples such as ⟨6,3,a⟩ and ⟨3,4,c⟩, each meaning "copy len characters starting dist positions back, then append the next char".
LZ77 Decoding
The decoder keeps the same dictionary window as the encoder: for each triple it finds the referenced substring and appends a copy of it.
What if len > dist (the copy overlaps the text still to be written)? E.g. seen = abcd, next codeword is ⟨2,9,e⟩. Simply copy left to right starting at the cursor:
for (i = 0; i < len; i++) out[cursor + i] = out[cursor - dist + i];
The output is correct: abcdcdcdcdcdce
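The same left-to-right copy in Python, run on the slide's overlapping example (a sketch; encoding literals as ⟨0, 0, char⟩ is my convention, not gzip's format):

```python
def lz77_decode(triples) -> str:
    """Decode LZ77 triples; copying left to right handles len > dist."""
    out = []
    for dist, length, nxt in triples:
        start = len(out) - dist
        for i in range(length):
            out.append(out[start + i])  # may read chars written in this very step
        out.append(nxt)
    return ''.join(out)

# "abcd" emitted as literals, then the overlapping triple (2, 9, 'e')
print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 0, 'd'), (2, 9, 'e')]))
# -> abcdcdcdcdcdce
```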
LZ77 optimizations used by gzip
LZSS: output one of the following formats: ⟨0, position, length⟩ or ⟨1, char⟩; the second format is typically used when length < 3.
Special greedy parsing: possibly use a shorter match so that the next match is better.
A hash table speeds up the search for matching triplets.
Triples are then coded with Huffman codes.
You find this at: www.gzip.org/zlib/
Google’s solution