Lecture #1: From 0-th order entropy compression to k-th order entropy compression
Posted on 01-Apr-2015
TRANSCRIPT
Lecture #1
From 0-th order entropy compression to k-th order entropy compression
Entropy (Shannon, 1948)
For a source S emitting symbols with probability p(s), the self-information of s is:
i(s) = log2 (1/p(s)) bits
Lower probability means higher information.
Entropy is the weighted average of i(s):
H(S) = ∑_{s ∈ S} p(s) log2 (1/p(s))
H0 = 0-th order empirical entropy (of a string, where p(s) = freq(s))
Performance
Compression ratio = #bits in output / #bits in input
Compression performance: we relate entropy to the compression ratio:
H0(T) vs |C(T)| / |T|    or    |T| H0(T) vs |C(T)|
Huffman Code
Invented by Huffman as a class assignment in the '50s.
Used in most compression algorithms: gzip, bzip, jpeg (as option), fax compression, …
Properties: generates optimal prefix codes; fast to encode and decode.
We can prove that (n = |T|): n H0(T) ≤ |Huff(T)| < n H0(T) + n
This means that it loses < 1 bit per symbol on average!
Good or bad?
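The bound above can be checked on a small example. The following is a minimal sketch (function and variable names are illustrative, not a canonical Huffman encoder): it merges the two lightest trees repeatedly and tracks only the code *lengths*, which is all the bound needs.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    # Build a Huffman tree over symbol frequencies and return the
    # code length (in bits) assigned to each symbol.
    freq = Counter(text)
    # Heap items: (weight, unique tiebreak id, {symbol: depth})
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)   # two lightest subtrees
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in d1.items()}   # one level deeper
        merged.update({s: d + 1 for s, d in d2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

lengths = huffman_code_lengths("mississippi")
total_bits = sum(lengths[s] for s in "mississippi")
```

For "mississippi" (n = 11, H0 ≈ 1.82), this gives 21 total bits, which indeed lies in [n H0, n H0 + n) ≈ [20.05, 31.05).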
Arithmetic coding
Given a text of n symbols, it takes nH0 + 2 bits vs. the (nH0 + n) bits of Huffman.
Used in PPM, JPEG/MPEG (as option), …
More time-costly than Huffman, but integer implementations are "not bad".
Symbol interval
Assign each symbol a subinterval of [0, 1), of length equal to its probability:
p(a) = .2 → a covers [0.0, 0.2)
p(b) = .5 → b covers [0.2, 0.7)
p(c) = .3 → c covers [0.7, 1.0)
The cumulative values are f(a) = .0, f(b) = .2, f(c) = .7;
e.g. the symbol interval for b is [.2, .7).
Encoding a sequence of symbols
Coding the sequence: bac
Start from [0, 1):
- code b: new interval [0.2, 0.7), of width (1.0 - 0.0) * 0.5 = 0.5; its subintervals are a → [0.2, 0.3) of width (0.7 - 0.2) * 0.2 = 0.1, b → [0.3, 0.55) of width (0.7 - 0.2) * 0.5 = 0.25, c → [0.55, 0.7) of width (0.7 - 0.2) * 0.3 = 0.15
- code a: new interval [0.2, 0.3), of width 0.1; its subintervals are a → [0.2, 0.22) of width (0.3 - 0.2) * 0.2 = 0.02, b → [0.22, 0.27) of width (0.3 - 0.2) * 0.5 = 0.05, c → [0.27, 0.3) of width (0.3 - 0.2) * 0.3 = 0.03
- code c: new interval [0.27, 0.3), of width 0.03
The final sequence interval is [.27, .3).
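The interval narrowing traced above can be sketched directly. This is a toy encoder over the slides' model (p(a) = .2, p(b) = .5, p(c) = .3), using exact rationals to avoid floating-point drift; names are illustrative.

```python
from fractions import Fraction

# Model from the slides: probabilities p and cumulative values f
p = {'a': Fraction(2, 10), 'b': Fraction(5, 10), 'c': Fraction(3, 10)}
f = {'a': Fraction(0), 'b': Fraction(2, 10), 'c': Fraction(7, 10)}

def encode_interval(text):
    # Narrow [l, l + s) one symbol at a time:
    #   l_i = l_{i-1} + s_{i-1} * f(T[i]);  s_i = s_{i-1} * p(T[i])
    l, s = Fraction(0), Fraction(1)
    for c in text:
        l, s = l + s * f[c], s * p[c]
    return l, s

l, s = encode_interval("bac")
# Final sequence interval is [l, l + s) = [0.27, 0.3)
```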
The algorithm
To code a sequence of symbols T[1..n], maintain an interval [l_i, l_i + s_i):
l_0 = 0, s_0 = 1
l_i = l_{i-1} + s_{i-1} * f(T[i])
s_i = s_{i-1} * p(T[i])
Example (P(a) = .2, P(b) = .5, P(c) = .3): coding the last symbol c of "bac", starting from l_{i-1} = 0.2 and s_{i-1} = 0.1:
s_i = 0.1 * 0.3 = 0.03
l_i = 0.2 + 0.1 * (0.2 + 0.5) = 0.27
At the end, s_n = ∏_{i=1..n} p(T[i]).
Pick a number inside the final interval [l_n, l_n + s_n).
Decoding Example
Decoding the number .49, knowing the input text to be decoded is of length 3:
- 0.49 ∈ [0.2, 0.7), the interval of b → output b
- within [0.2, 0.7): a → [0.2, 0.3), b → [0.3, 0.55), c → [0.55, 0.7); 0.49 ∈ [0.3, 0.55) → output b
- within [0.3, 0.55): a → [0.3, 0.35), b → [0.35, 0.475), c → [0.475, 0.55); 0.49 ∈ [0.475, 0.55) → output c
The message is bbc.
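The same decoding steps can be sketched by rescaling the number into [0, 1) after each symbol instead of subdividing intervals; this is mathematically equivalent. A toy sketch over the same model (names illustrative):

```python
from fractions import Fraction

p = {'a': Fraction(2, 10), 'b': Fraction(5, 10), 'c': Fraction(3, 10)}
f = {'a': Fraction(0), 'b': Fraction(2, 10), 'c': Fraction(7, 10)}

def decode(x, n):
    # At each step, find the symbol whose interval [f(c), f(c)+p(c))
    # contains x, output it, then rescale x back into [0, 1).
    out = []
    for _ in range(n):
        for c in sorted(p, key=lambda c: f[c], reverse=True):
            if x >= f[c]:          # largest cumulative value <= x
                out.append(c)
                x = (x - f[c]) / p[c]
                break
    return ''.join(out)

msg = decode(Fraction(49, 100), 3)
```

Here x = 0.49 → b, rescales to 0.58 → b, rescales to 0.76 → c, recovering "bbc".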
How do we encode that number?
Binary fractional representation:
x = .b1 b2 b3 b4 b5 … = b1 2^-1 + b2 2^-2 + b3 2^-3 + b4 2^-4 + …
FractionalEncode(x):
1. x = 2 * x
2. if x < 1, output 0, goto 1
3. x = x - 1; output 1, goto 1
Example: 1/3 = .0101…
2 * (1/3) = 2/3 < 1, output 0
2 * (2/3) = 4/3 > 1, output 1; 4/3 - 1 = 1/3, and the pattern repeats.
Incremental generation: the bits can be emitted one at a time, as they are produced.
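FractionalEncode translates directly into code. A minimal sketch, again with exact rationals so the 1/3 example repeats forever as expected (the bit count is a parameter since the expansion may not terminate):

```python
from fractions import Fraction

def fractional_encode(x, nbits):
    # Emit the first nbits of the binary expansion of x in [0, 1):
    # double x; emit 0 if it stays below 1, else emit 1 and subtract 1.
    bits = []
    for _ in range(nbits):
        x *= 2
        if x < 1:
            bits.append(0)
        else:
            bits.append(1)
            x -= 1
    return bits

bits = fractional_encode(Fraction(1, 3), 6)  # the .0101... pattern
```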
Which number do we encode?
Encode the midpoint x = l_n + s_n/2 of the final interval.
Truncate its encoding to the first d = ⌈log2 (2/s_n)⌉ bits.
Truncation gets a smaller number… how much smaller?
x = .b1 b2 b3 … bd bd+1 bd+2 … → .b1 b2 b3 … bd 0 0 0 …
Zeroing the bits after position d decreases x by less than 2^-d ≤ s_n/2, so the truncated number is still ≥ l_n, i.e. still inside [l_n, l_n + s_n).
Compression = truncation.
Bound on code length
Theorem: For a text of length n, the Arithmetic encoder generates at most ⌈log2 (2/s_n)⌉ bits, and
⌈log2 (2/s_n)⌉ < 1 + log2 (2/s_n) = 1 + (1 - log2 s_n)
= 2 - log2 (∏_{i=1..n} p(T[i]))
= 2 - ∑_{i=1..n} log2 p(T[i])
= 2 - ∑_{s ∈ Σ} occ(s) log2 p(s)
= 2 + n * ∑_{s ∈ Σ} p(s) log2 (1/p(s))
= 2 + n H0(T) bits
Example: for T = aaba, - ∑ log2 p(T[i]) = -(3 * log2 p(a) + 1 * log2 p(b)).
In practice: nH0 + 0.02 n bits, because of rounding.
Where is the problem?
Take the text T = a^n b^n; then H0 = (1/2) log2 2 + (1/2) log2 2 = 1 bit,
so the compression ratio would be 1/8 (ASCII), or no compression at all if a, b are already encoded in 1 bit.
We would like to exploit repetitions:
• wherever they occur
• whatever length they have
Any permutation of T, even a random one, gets the same bound.
Data Compression
Can we use simpler repetition-detectors?
Simple compressors: too simple?
Move-to-Front (MTF): as a frequency-sorting approximator, as a caching strategy, as a compressor.
Run-Length Encoding (RLE): FAX compression.
γ-code for integer encoding
For x > 0, let Length = ⌊log2 x⌋ + 1 be the number of bits of x in binary.
γ-code(x) = (Length - 1) zeroes, followed by x in binary.
e.g., 9 is represented as <000, 1001>.
The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal).
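The γ-code is a one-liner once x is in binary. A minimal sketch (name illustrative):

```python
def gamma_encode(x):
    # gamma-code of x > 0: (Length - 1) zeroes, then x in binary,
    # where Length = floor(log2 x) + 1 = number of bits of x.
    assert x > 0
    b = bin(x)[2:]                    # x in binary, no '0b' prefix
    return '0' * (len(b) - 1) + b

code = gamma_encode(9)  # '000' + '1001' = '0001001'
```

The leading zeroes tell the decoder how many bits the binary part has, which is what makes the code self-delimiting.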
Move-to-Front Coding
Transforms a char sequence into an integer sequence, which can then be var-length coded.
Start with the list of symbols L = [a, b, c, d, …]. For each input symbol s:
1) output the position of s in L
2) move s to the front of L
Properties: it is a dynamic code, with memory (unlike Arithmetic).
Example: for X = 1^n 2^n 3^n … n^n, Huff = O(n^2 log n) bits, MTF = O(n log n) + n^2 bits.
In fact Huff takes log n bits per symbol, the symbols being equi-probable, whereas MTF uses O(1) bits per symbol occurrence, but O(log n) for the first one.
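The two MTF steps above can be sketched in a few lines (names illustrative; positions are 0-based here, so a repeated symbol costs position 0):

```python
def mtf_encode(text, alphabet):
    # For each symbol: output its current position in the list L,
    # then move that symbol to the front of L.
    L = list(alphabet)
    out = []
    for c in text:
        i = L.index(c)
        out.append(i)
        L.insert(0, L.pop(i))   # move-to-front
    return out

codes = mtf_encode("aabbbb", ['a', 'b', 'c'])  # runs become runs of small numbers
```

Note how after the first 'b' is paid for (position 1), every further 'b' costs position 0 — the "O(1) bits per occurrence after the first" behavior.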
Run-Length Encoding (RLE)
If spatial locality is very high, then
abbbaacccca ⇒ (a,1), (b,3), (a,2), (c,4), (a,1)
In case of binary strings, just the run lengths and one starting bit suffice.
Properties: it is a dynamic code, with memory (unlike Arithmetic).
Example: for X = 1^n 2^n 3^n … n^n, Huff(X) = O(n^2 log n) > Rle(X) = O(n (1 + log n)).
RLE uses log n bits per symbol-block, using the γ-code for its length.
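The run-collapsing step is equally short. A minimal sketch reproducing the example above (name illustrative):

```python
def rle_encode(text):
    # Collapse each maximal run of equal symbols into a (symbol, length) pair.
    out = []
    for c in text:
        if out and out[-1][0] == c:
            out[-1] = (c, out[-1][1] + 1)   # extend the current run
        else:
            out.append((c, 1))              # start a new run
    return out

runs = rle_encode("abbbaacccca")
```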
Burrows-Wheeler Transform
The big (unconscious) step…
The Burrows-Wheeler Transform (1994)
A famous example: given the text T = mississippi#, form all its cyclic rotations:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows; the first column is F, the last column is L:
F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i
(Real texts are much longer.) Compressing L seems promising…
Key observation: L is locally homogeneous, hence L is highly compressible.
Algorithm Bzip:
1. Move-to-Front coding of L
2. Run-Length coding
3. Statistical coder
Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression!
How to compute the BWT?
We said that L[i] precedes F[i] in T.
Given the suffix array SA and T, we have L[i] = T[SA[i] - 1].
Example: for T = mississippi#, SA = 12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3 and L = ipssm#pissii; e.g. L[3] = T[SA[3] - 1] = T[8 - 1] = s.
This is one of the main reasons for the number of publications spurred in '94-'10 on Suffix Array construction.
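The formula L[i] = T[SA[i] - 1] can be sketched directly. This toy version builds the suffix array by naively sorting suffixes (real SA construction is the hard part the slide alludes to); it assumes, as in the example, that T ends with a unique smallest terminator '#', so sorting suffixes gives the same order as sorting cyclic rotations:

```python
def bwt(T):
    # 0-based suffix array via naive suffix sorting (O(n^2 log n) worst case;
    # fine for a demo, not for real inputs).
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    # L[i] = T[SA[i] - 1]; when SA[i] = 0 this wraps to the terminator T[-1]
    return ''.join(T[i - 1] for i in SA)

L = bwt("mississippi#")
```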
A useful tool: LF mapping
Can we map L's chars onto F's chars? Take two equal chars of L… we need to distinguish them.
Rotate their rows rightward by one position: those rows now start with the L chars in question, followed by their (already sorted) right contexts, so in the sorted matrix they keep the same relative order they had in L!
Rank(char, pos) and Select(char, pos) are the key operations nowadays.
The BWT is invertible
Two key properties:
1. The LF mapping maps L's chars to F's chars
2. L[i] precedes F[i] in T
Reconstruct T backward: start from the row whose L char is # (that row is T itself) and repeatedly emit L[i], then jump to the row given by the LF mapping. On the example this emits #, i, p, p, i, … i.e. T read backward from …ippi#.
There are several issues about efficiency in time and space.
You find this in your Linux distribution.
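The backward reconstruction can be sketched as follows. It relies on the same-relative-order property: with a *stable* sort, the j-th occurrence of a char in L corresponds to the j-th occurrence of that char in F = sorted(L). Names are illustrative, and '#' is assumed to be the unique terminator:

```python
def ibwt(L):
    n = len(L)
    # LF mapping: Python's sort is stable, so equal chars of L keep
    # their relative order, matching their order in F.
    order = sorted(range(n), key=lambda i: L[i])
    LF = [0] * n
    for rank, i in enumerate(order):
        LF[i] = rank              # char L[i] sits at row `rank` of F
    r = L.index('#')              # this row of the sorted matrix is T itself
    out = []
    for _ in range(n):
        out.append(L[r])          # L[r] precedes F[r] in T
        r = LF[r]
    return ''.join(reversed(out)) # we emitted T backward

T = ibwt("ipssm#pissii")
```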
Suffix Array construction
What about achieving high-order entropy?
Recall that
Compression ratio = #bits in output / #bits in input
Compression performance: we relate entropy to the compression ratio:
H0(T) vs |C(T)| / |T|    or    |T| H0(T) vs |C(T)|
The empirical entropy Hk
Hk(T) = (1/|T|) ∑_{|w|=k} |T[w]| H0(T[w])
where T[w] = string of symbols that precede the substring w in T.
Example: given T = "mississippi", we have T["is"] = ms.
To compress T up to Hk(T): compress each T[w] up to its H0 (use Huffman or Arithmetic).
How much is this "operational"?
The distinct substrings w for H2(T), each with its pair (|T[w]|, T[w]), are
{i_ (1,p), ip (1,s), is (2,ms), pi (1,p), pp (1,i), mi (1,_), si (2,ss), ss (2,ii)}
H2(T) = (1/11) * [1 * H0(p) + 1 * H0(s) + 2 * H0(ms) + 1 * H0(p) + …]
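The context strings T[w] can be sketched with a simple scan. This toy version (names illustrative) only collects occurrences of w that have a preceding symbol, so the boundary contexts of the slide (like "mi", preceded by nothing) are not included:

```python
from collections import defaultdict

def contexts(T, k):
    # T[w] = string of symbols that precede each occurrence of w in T
    # (occurrences starting at position 0 have no predecessor and are skipped)
    ctx = defaultdict(str)
    for i in range(1, len(T) - k + 1):
        ctx[T[i:i + k]] += T[i - 1]
    return dict(ctx)

C = contexts("mississippi", 2)
```

This reproduces the pairs listed above: C["is"] = "ms", C["ss"] = "ii", C["si"] = "ss", and so on.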
BWT versus Hk
In the sorted BWT matrix of T = mississippi# (positions 1..12), the rows whose first k chars equal w are contiguous; the corresponding piece of Bwt(T) consists of the symbols preceding w in T, i.e. it is a permutation of T[w]. For example, the piece for w = is is a permutation of T[w=is] = "ms".
H0 does not change under permutation! So, compressing each piece of the BWT up to its H0, we achieve H2(T):
|T| H2(T) = ∑_{|w|=2} |T[w]| * H0(T[w])
We have a workable way to approximate Hk via bwt-partitions.
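The "piece of the BWT is a permutation of T[w]" claim can be sketched on the example (naive rotation sort, names illustrative):

```python
def bwt_rows(T):
    # The sorted cyclic rotations of T (the BWT matrix)
    return sorted(T[i:] + T[:i] for i in range(len(T)))

rows = bwt_rows("mississippi#")
# Rows starting with context w form a contiguous block; the last chars
# of that block are the BWT piece associated with w.
piece = ''.join(r[-1] for r in rows if r.startswith("is"))
```

Here the piece for w = "is" comes out as "sm", a permutation of T[w=is] = "ms" — so its H0 is the same.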
Compression booster [J. ACM '05]
Let C be a compressor achieving H0, e.g. Arithmetic(a) ≤ |a| H0(a) + 2 bits.
An interesting approach: compute bwt(T), get the partition P induced by the length-k contexts, and apply C on each piece of P.
The space is
∑_{|w|=k} |C(T[w])| ≤ ∑_{|w|=k} ( |T[w]| H0(T[w]) + 2 ) ≤ |T| Hk(T) + 2 g_k
where g_k is the number of pieces of P. The partition depends on k; the approximation of Hk(T) depends on C and g_k.
Operationally: the optimal partition P (shortest |C(P)|) can be found in O(n) time, and the Hk-bound holds simultaneously for all k ≥ 0.