
Page 1: T-61.182 Information Theory and Machine Learning Data

T-61.182 Information Theory and Machine Learning

Data Compression (Chapters 4-6)

presented by Tapani Raiko

Feb 26, 2004

Page 2: T-61.182 Information Theory and Machine Learning Data

Contents (Data Compression)

          Chap. 4                          Chap. 5                     Chap. 6
Data      Block                            Symbol                      Stream
Lossy?    Lossy                            Lossless                    Lossless
Result    Shannon's source coding theorem  Huffman coding algorithm    Arithmetic coding algorithm

Page 3: T-61.182 Information Theory and Machine Learning Data

Weighing Problem (What is information?)

• 12 balls, all equal in weight except for one

• Two-pan balance to use

• Determine which is the odd ball and whether it is heavier or lighter

• As few uses of the balance as possible!

• The outcome of a random experiment is guaranteed to be most informative if the probability distribution over outcomes is uniform

Page 4: T-61.182 Information Theory and Machine Learning Data

[Figure: an optimal solution to the weighing problem. The first weighing compares balls 1 2 3 4 with 5 6 7 8; the second compares 1 2 6 with 3 4 5 (or 9 10 11 with 1 2 3 if the first balanced); the third weighing then leaves a single hypothesis among 1+, 1−, ..., 12+, 12−.]

Page 5: T-61.182 Information Theory and Machine Learning Data

Definitions

• Shannon information content: h(x = ai) ≡ log2(1/pi)

• Entropy: H(X) = ∑i pi log2(1/pi)

• Both are additive for independent variables

p      h(p)   H2(p)
0.001  10.0   0.011
0.01    6.6   0.081
0.1     3.3   0.47
0.2     2.3   0.72
0.5     1.0   1.0

Figure 4.1 shows the Shannon information content h(p) = log2(1/p) of an outcome with probability p, together with the binary entropy function H2(p), plotted against p.
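These two definitions translate directly into code. A minimal sketch in Python (the function names are mine, not from the slides), which also reproduces the h(p) and H2(p) values tabulated above up to rounding:

```python
import math

def information_content(p):
    """Shannon information content h = log2(1/p) of an outcome with probability p."""
    return math.log2(1.0 / p)

def entropy(probs):
    """Entropy H(X) = sum_i p_i log2(1/p_i) in bits; outcomes with p_i = 0 contribute 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Reproduce (up to rounding) the h(p) and H2(p) values tabulated above.
for p in (0.001, 0.01, 0.1, 0.2, 0.5):
    print(f"p={p:<5}  h(p)={information_content(p):5.2f}  H2(p)={entropy([p, 1 - p]):5.3f}")
```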

Page 6: T-61.182 Information Theory and Machine Learning Data

Game of Submarine

• Player hides a submarine in one square of an 8 by 8 grid

• Another player tries to hit it

[Figure: an 8 × 8 grid (columns A-H, rows 1-8) showing the squares eliminated by each miss and the final hit.]

move #       1        2        32       48       49
question     G3       B1       E5       F3       H3
outcome      x = n    x = n    x = n    x = n    x = y
P(x)         63/64    62/63    32/33    16/17    1/16
h(x)         0.0227   0.0230   0.0443   0.0874   4.0
Total info.  0.0227   0.0458   1.0      2.0      6.0

• Compare to asking 6 yes/no questions about the location
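The "Total info." row follows from additivity: after k misses the total information received is log2(64/(64 − k)), and the final hit among the 16 remaining squares adds log2 16 = 4 bits. A small sketch of that calculation (the helper name is mine):

```python
import math

# After k misses, P(all k answers so far) = (64 - k)/64, so the total
# information received is log2(64 / (64 - k)).
def info_after_misses(k, squares=64):
    return math.log2(squares / (squares - k))

for k in (1, 2, 32, 48):
    print(f"after {k:2d} misses: {info_after_misses(k):.4f} bits")

# Move 49 is a hit among the 16 remaining squares: it adds log2(16) = 4 bits.
print("after the hit on move 49:", info_after_misses(48) + math.log2(16), "bits")
```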

Page 7: T-61.182 Information Theory and Machine Learning Data

Raw Bit Content

• A binary name is given to each outcome of a random variable X

• The length of the names would be log2 |AX |

(assuming |AX | happens to be a power of 2)

• Define: The raw bit content of X is

H0(X) = log2 |AX |

• Simply counts the possible outcomes - no compression yet

• Additive: H0(X, Y ) = H0(X) + H0(Y )

Page 8: T-61.182 Information Theory and Machine Learning Data

Lossy Compression

• Let
  AX = { a, b, c, d, e, f, g, h }
  PX = { 1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64 }

• The raw bit content is 3 bits (8 binary names)

• If we are willing to run a risk of δ = 1/16 of not having a name for x, then we can get by with 2 bits (4 names)

δ = 0            δ = 1/16
x  c(x)          x  c(x)
a  000           a  00
b  001           b  01
c  010           c  10
d  011           d  11
e  100           e  −
f  101           f  −
g  110           g  −
h  111           h  −

Page 9: T-61.182 Information Theory and Machine Learning Data

[Figure: the outcomes of X ranked by their probability, placed on a −log2 P(x) axis (a, b, c at −2; d at −2.4; e, f, g, h at −6), with the subsets S0 and S1/16 marked.]

Page 10: T-61.182 Information Theory and Machine Learning Data

Essential Bit Content

• Allow an error with probability δ

• Choose the smallest sufficient subset Sδ such that

P (x ∈ Sδ) ≥ 1 − δ

(arrange the elements of AX in order of decreasing probability and take enough elements from the beginning)

• Define: The essential bit content of X is

Hδ(X) = log2 |Sδ|

• Note that the raw bit content H0 is the special case of Hδ with δ = 0
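Hδ(X) can be computed exactly as the definition says: sort the probabilities, count how many outcomes are needed to cover probability 1 − δ, and take the log. A minimal sketch (function name is mine), checked against the eight-outcome ensemble of the previous slide:

```python
import math

def essential_bit_content(probs, delta):
    """H_delta(X) = log2 |S_delta|, where S_delta is the smallest subset of
    outcomes whose total probability is at least 1 - delta."""
    total, size = 0.0, 0
    for p in sorted(probs, reverse=True):   # most probable outcomes first
        if total >= 1.0 - delta:
            break
        total += p
        size += 1
    return math.log2(size)

# The eight-outcome ensemble from the lossy-compression slide:
P = [1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64]
print(essential_bit_content(P, 0))      # 3.0 -- the raw bit content H_0(X)
print(essential_bit_content(P, 1/16))   # 2.0 -- outcomes e..h are dropped
```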

Page 11: T-61.182 Information Theory and Machine Learning Data

[Figure: Hδ(X) as a function of the allowed probability of error δ; as δ grows, the sufficient subset shrinks in steps from {a,b,c,d,e,f,g,h} down to {a}, so Hδ(X) drops from 3 bits towards 0.]

Page 12: T-61.182 Information Theory and Machine Learning Data

Extended Ensembles (Blocks)

• Consider a tuple of N i.i.d. random variables

• Denote by X^N the ensemble (X1, X2, . . . , XN)

• Entropy is additive: H(X^N) = N H(X)

• Example: N flips of a bent coin: p0 = 0.9, p1 = 0.1
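For small N, Hδ(X^N) can be computed by brute-force enumeration, and the per-symbol value (1/N) Hδ(X^N) can be watched approaching H(X) = H2(0.1) ≈ 0.47 bits, which is what the figures on the following slides illustrate. A sketch (function name is mine; enumeration is only feasible for small N):

```python
import math
from itertools import product

def essential_bits_per_symbol(p1, N, delta):
    """(1/N) H_delta(X^N) for N i.i.d. flips of a coin with P(1) = p1,
    computed by enumerating all 2^N outcomes."""
    probs = sorted((p1 ** sum(x) * (1 - p1) ** (N - sum(x))
                    for x in product((0, 1), repeat=N)), reverse=True)
    total, size = 0.0, 0
    for p in probs:
        if total >= 1.0 - delta:
            break
        total += p
        size += 1
    return math.log2(size) / N

# H(X) = H2(0.1) is about 0.469 bits; the per-symbol essential bit content
# creeps towards it as N grows (only small N are feasible by enumeration).
for N in (4, 10, 14):
    print(N, round(essential_bits_per_symbol(0.1, N, delta=0.1), 3))
```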

Page 13: T-61.182 Information Theory and Machine Learning Data

[Figure: outcomes of the bent coin ensemble X^4 ranked by their probability on a −log2 P(x) axis, from 0000 down to 1111, with the sets S0.01 and S0.1 marked.]

Page 14: T-61.182 Information Theory and Machine Learning Data

[Figure: the essential bit content Hδ(X^4) of the bent coin ensemble as a function of δ (N = 4).]

Page 15: T-61.182 Information Theory and Machine Learning Data

[Figure: the essential bit content Hδ(X^10) of the bent coin ensemble as a function of δ (N = 10).]

Page 16: T-61.182 Information Theory and Machine Learning Data

[Figure: the essential bit content per toss, (1/N) Hδ(X^N), as a function of δ for N = 10, 210, 410, 610, 810, 1010; as N increases the curves flatten out towards H(X).]

Page 17: T-61.182 Information Theory and Machine Learning Data

Shannon’s Source Coding Theorem

Given ε > 0 and 0 < δ < 1, there exists a positive integer N0 such that for N > N0,

  | (1/N) Hδ(X^N) − H(X) | < ε.

[Figure: schematic of (1/N) Hδ(X^N) as a function of δ, starting near H0(X) at δ = 0 and lying within the band between H − ε and H + ε for 0 < δ < 1.]

• Proof involves

– Law of large numbers

– Chebyshev’s inequality

Page 18: T-61.182 Information Theory and Machine Learning Data

[Table: fifteen random samples x from X^100 (strings of 100 symbols, mostly 0s with a handful of 1s) together with log2 P(x), with values between about −37 and −66; for comparison, the all-0 string has log2 P(x) = −15.2 and the all-1 string −332.1.]

Some samples from X^100. Compare to H(X^100) = 46.9 bits.

Page 19: T-61.182 Information Theory and Machine Learning Data

Typicality

• A string contains r 1s and N − r 0s

• Consider r as a random variable (binomial distribution)

• Mean and standard deviation: r ∼ N p1 ± sqrt(N p1 (1 − p1))

• A typical string is one with r ≃ N p1

• In general, the information content of a typical string is within N [H(X) ± β]:

  log2(1/P(x)) ≃ N ∑i pi log2(1/pi) = N H(X)

Page 20: T-61.182 Information Theory and Machine Learning Data

[Figure: anatomy of the typical set T for N = 100 (left column) and N = 1000 (right column): the number of strings n(r) = (N choose r) as a function of r, log2 P(x) with the typical-set band T marked, and the total probability n(r) P(x) = (N choose r) p1^r (1 − p1)^(N−r), which is concentrated around r ≈ N p1.]

Page 21: T-61.182 Information Theory and Machine Learning Data

[Figure: outcomes of X^N ranked by their probability on a −log2 P(x) axis; the typical set T_Nβ is a band of outcomes whose log2 P(x) lies close to −N H(X).]

Page 22: T-61.182 Information Theory and Machine Learning Data

Shannon's source coding theorem (verbal statement)

N i.i.d. random variables each with entropy H(X) can be compressed

into more than NH(X) bits with negligible risk of information loss, as

N → ∞;

conversely if they are compressed into fewer than NH(X) bits it is

virtually certain that information will be lost.

Page 23: T-61.182 Information Theory and Machine Learning Data

End of Chapter 4

          Chap. 4                          Chap. 5                     Chap. 6
Data      Block                            Symbol                      Stream
Lossy?    Lossy                            Lossless                    Lossless
Result    Shannon's source coding theorem  Huffman coding algorithm    Arithmetic coding algorithm

Page 24: T-61.182 Information Theory and Machine Learning Data

Contents, Chap. 5: Symbol Codes

• Lossless coding: shorter encodings to the more probable outcomes

and longer encodings to the less probable

• Practical to decode?

• Best achievable compression?

• Source coding theorem (symbol codes):

The expected length L(C, X) ∈ [H(X), H(X) + 1).

• Huffman coding algorithm

Page 25: T-61.182 Information Theory and Machine Learning Data

Definitions

• A (binary) symbol code is a mapping from AX to {0, 1}+

• c(x) is the codeword of x and l(x) its length

• Extended code c+(x1x2 . . . xN ) = c(x1)c(x2) . . . c(xN )

(no punctuation)

• A code C(X) is uniquely decodable if no two distinct strings have

the same encoding

• A symbol code is called a prefix code if no codeword is a prefix of

any other codeword (constraining to prefix codes doesn’t lose any

performance)
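The prefix property is easy to test mechanically; a minimal sketch (function name is mine), applied to the codes C1 and C2 of the next slide:

```python
def is_prefix_code(codewords):
    """True if no codeword is a prefix of any other codeword."""
    for c in codewords:
        for d in codewords:
            if c != d and d.startswith(c):
                return False
    return True

print(is_prefix_code(["0", "101"]))   # True  -- code C1 on the next slide
print(is_prefix_code(["1", "101"]))   # False -- code C2: 1 is a prefix of 101
```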

Page 26: T-61.182 Information Theory and Machine Learning Data

Examples

AX = {a, b, c, d},  PX = { 1/2, 1/4, 1/8, 1/8 }

• Using C0: c+(acdba) = 10000010000101001000

• Code C1 = {0, 101} is a prefix code, so it can be represented as a tree

• Code C2 = {1, 101} is not a prefix code, because 1 is a prefix of 101

C0:
ai  c(ai)  li
a   1000   4
b   0100   4
c   0010   4
d   0001   4

[Figure: the binary tree for C1, with leaves at the codewords 0 and 101.]

Page 27: T-61.182 Information Theory and Machine Learning Data

Expected length

• Expected length L(C, X) of a symbol code C for ensemble X is

L(C, X) =∑

x∈AX

P (x) l(x).

• Bounded below by H(X) (uniquely decodeable code)

• Equal to H(X) only if the codelengths are equal to the Shannon

information contents:

li = log2(1/pi)

• Codelengths implicitly define a probability distribution {qi}

qi ≡ 2−li
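Expected length and the implicit distribution are one-liners; a small sketch (function names are mine), applied to the lengths of code C3 from the next slide, for which L(C, X) = H(X) = 1.75 bits:

```python
import math

def expected_length(p, lengths):
    """L(C, X) = sum_x P(x) l(x)."""
    return sum(pi * li for pi, li in zip(p, lengths))

def entropy(p):
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

p = [1/2, 1/4, 1/8, 1/8]
lengths = [1, 2, 3, 3]                    # codelengths of C3 on the next slide
q = [2.0 ** -l for l in lengths]          # implicit distribution q_i = 2^(-l_i)
print(expected_length(p, lengths), entropy(p))   # both 1.75: l_i = log2(1/p_i)
print(q)                                         # [0.5, 0.25, 0.125, 0.125]
```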

Page 28: T-61.182 Information Theory and Machine Learning Data

Examples

[Figure: the binary trees for codes C3 and C4.]

• L(C3, X) = 1.75 = H(X)

• L(C4, X) = 2 > H(X)

• L(C5, X) = 1.25 < H(X) (possible only because C5 is not uniquely decodable)

C3:
ai  c(ai)  pi   h(pi)  li
a   0      1/2  1.0    1
b   10     1/4  2.0    2
c   110    1/8  3.0    3
d   111    1/8  3.0    3

C4 and C5:
ai  C4  C5
a   00  0
b   01  1
c   10  00
d   11  11

Page 29: T-61.182 Information Theory and Machine Learning Data

Example

C6:

ai c(ai) pi h(pi) li

a   0    1/2  1.0  1
b   01   1/4  2.0  2
c   011  1/8  3.0  3
d   111  1/8  3.0  3

• L(C6, X) = 1.75 = H(X)

• C6 is not a prefix code but is in fact uniquely decodable

Page 30: T-61.182 Information Theory and Machine Learning Data

Kraft Inequality

• If a code is uniquely decodable, its lengths must satisfy ∑i 2^(−li) ≤ 1

• For any lengths satisfying the Kraft inequality, there exists a prefix

code with those lengths

[Figure: the total symbol code budget, drawn as the binary strings of length 1 to 4 (0, 1, 00, ..., 1111); each codeword of length l uses up a fraction 2^(−l) of the budget.]
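Both directions of the inequality are easy to demonstrate in code: the sum ∑i 2^(−li) can be checked directly, and for lengths that satisfy it a prefix code can be built by assigning codewords in order of increasing length (the standard canonical construction; function names are mine):

```python
def kraft_sum(lengths):
    """Sum_i 2^(-l_i); at most 1 for any uniquely decodable code."""
    return sum(2.0 ** -l for l in lengths)

def prefix_code_from_lengths(lengths):
    """Assign codewords of the given lengths so that none is a prefix of another
    (assumes the lengths satisfy the Kraft inequality)."""
    code, value, prev = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - prev)              # move to the next free node at depth l
        code.append(format(value, f"0{l}b"))
        value += 1
        prev = l
    return code

lengths = [1, 2, 3, 3]
print(kraft_sum(lengths))                  # 1.0
print(prefix_code_from_lengths(lengths))   # ['0', '10', '110', '111']
```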

Page 31: T-61.182 Information Theory and Machine Learning Data

Source coding theorem for symbol codes

• By setting li = ⌈log2(1/pi)⌉, where ⌈l*⌉ denotes the smallest integer greater than or equal to l*, we get (with the Kraft inequality):

• There exists a prefix code C with

  H(X) ≤ L(C, X) < H(X) + 1

• The relative entropy DKL(p||q) measures how many bits per symbol are wasted:

  L(C, X) = ∑i pi log2(1/qi) = H(X) + DKL(p||q)

Page 32: T-61.182 Information Theory and Machine Learning Data

Huffman Coding Algorithm

1. Take the two least probable symbols in the alphabet

2. Give them the longest codewords, differing only in the last digit

3. Combine them into a single symbol and repeat

[Figure: the merging process for P = {0.25, 0.25, 0.2, 0.15, 0.15}: the two smallest probabilities are combined at each step (0.15 + 0.15 = 0.3, 0.2 + 0.25 = 0.45, 0.25 + 0.3 = 0.55, 0.45 + 0.55 = 1.0), and the 0/1 branch labels read off the codewords.]

ai  pi    h(pi)  li  c(ai)
a   0.25  2.0    2   00
b   0.25  2.0    2   10
c   0.2   2.3    2   11
d   0.15  2.7    3   010
e   0.15  2.7    3   011
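A compact implementation of these three steps keeps the intermediate symbols in a priority queue; a sketch (function name and tie-breaking are mine) that reproduces the codelengths in the table above, although the individual 0/1 choices are arbitrary and may differ from the slide:

```python
import heapq
import itertools

def huffman_code(probs):
    """Binary Huffman code for symbols 0..n-1 with the given probabilities:
    repeatedly merge the two least probable nodes, prepending 0/1 to the
    codewords of the symbols they contain."""
    counter = itertools.count()      # tie-breaker so the heap never compares lists
    heap = [(p, next(counter), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    codes = [""] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1:
            codes[i] = "0" + codes[i]
        for i in s2:
            codes[i] = "1" + codes[i]
        heapq.heappush(heap, (p1 + p2, next(counter), s1 + s2))
    return codes

# The five-symbol ensemble from the slide; the codelengths (2, 2, 2, 3, 3) match
# the table, and the expected length is 2.3 bits.
print(huffman_code([0.25, 0.25, 0.2, 0.15, 0.15]))
```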

Page 33: T-61.182 Information Theory and Machine Learning Data

Optimality

• Huffman coding is optimal in two senses:

– Smallest expected codelength among all uniquely decodable symbol codes

– Prefix code → easy to decode

• But:

– The overhead of between 0 and 1 bits per symbol is important if

H(X) is small → compress blocks of symbols to make H(X)

larger

– Does not take context into account (symbol code vs. stream

code)

Page 34: T-61.182 Information Theory and Machine Learning Data

End of Chapter 5

          Chap. 4                          Chap. 5                     Chap. 6
Data      Block                            Symbol                      Stream
Lossy?    Lossy                            Lossless                    Lossless
Result    Shannon's source coding theorem  Huffman coding algorithm    Arithmetic coding algorithm

Page 35: T-61.182 Information Theory and Machine Learning Data

Guessing Game

• A human subject was asked to guess a sentence character by character

• The number of guesses is listed below each character

T H E R E - I S - N O - R E V E R S E - O N - A - M O T O R C Y C L E -

1 1 1 5 1 1 2 1 1 2 1 1 15 1 17 1 1 1 2 1 3 2 1 2 2 7 1 1 1 1 4 1 1 1 1 1

• One could encode only the string 1, 1, 1, 5, 1, . . .

• Decoding requires an identical twin who also plays the guessing

game

Page 36: T-61.182 Information Theory and Machine Learning Data

Arithmetic Coding (1/2)

• Human predictor is replaced by a probabilistic model of the source

• The model supplies a predictive distribution over the next symbol

• It can handle complex adaptive models (context-dependent)

• Binary strings define real intervals

within the real line [0, 1)

• The string 01 corresponds to

[0.01, 0.10) in binary or [0.25, 0.50)

in base ten

[Figure: the real line [0, 1) marked at 0.00, 0.25, 0.50, 0.75, 1.00, showing the intervals corresponding to the binary strings 0, 1, 01, and 01101.]

Page 37: T-61.182 Information Theory and Machine Learning Data

Arithmetic Coding (2/2)

• Divide the real line [0, 1) into I intervals of lengths equal to the probabilities P(x1 = ai), as shown in Figure 6.2

[Figure 6.2: the real line [0, 1) divided at 0, P(x1 = a1), P(x1 = a1) + P(x1 = a2), ..., P(x1 = a1) + · · · + P(x1 = aI−1), 1.0 into intervals labelled a1, a2, ..., aI, with sub-intervals such as a2a1, ..., a2a5 of the a2 interval also shown.]

• Pick an interval and subdivide it (and iterate)

• Send a binary string whose interval lies within that interval
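The encoder's state is just the current interval [lo, hi): each symbol narrows it in proportion to the model's predictive distribution, and any binary string whose interval lies inside the final interval identifies the message. A minimal sketch of the interval arithmetic only (the fixed two-symbol model is mine, and a real coder would also renormalise to avoid running out of precision):

```python
def narrow(lo, hi, cdf_lo, cdf_hi):
    """Shrink [lo, hi) to the sub-interval belonging to a symbol whose
    cumulative probability range is [cdf_lo, cdf_hi)."""
    width = hi - lo
    return lo + width * cdf_lo, lo + width * cdf_hi

# A fixed (non-adaptive) predictive distribution over two symbols.
cum = {"a": (0.0, 0.6), "b": (0.6, 1.0)}   # P(a) = 0.6, P(b) = 0.4

lo, hi = 0.0, 1.0
for symbol in "abba":
    lo, hi = narrow(lo, hi, *cum[symbol])
print(lo, hi)   # any binary string whose interval fits inside [lo, hi) encodes "abba"
```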

Page 38: T-61.182 Information Theory and Machine Learning Data

Example: Bent Coin (1/3)

• The coin sides are a and b, and the 'end of file' symbol is □

• Use a Bayesian model with a uniform prior over the probabilities of the outcomes

Context (sequence thus far)   Probability of next symbol
                              P(a) = 0.425        P(b) = 0.425        P(□) = 0.15
b                             P(a | b) = 0.28     P(b | b) = 0.57     P(□ | b) = 0.15
bb                            P(a | bb) = 0.21    P(b | bb) = 0.64    P(□ | bb) = 0.15
bbb                           P(a | bbb) = 0.17   P(b | bbb) = 0.68   P(□ | bbb) = 0.15
bbba                          P(a | bbba) = 0.28  P(b | bbba) = 0.57  P(□ | bbba) = 0.15
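The numbers in this table are consistent with Laplace's rule (count + 1)/(n + 2) applied to the a/b counts, rescaled so that the end-of-file symbol always gets probability 0.15. A sketch under that assumption (it may not be exactly the model used on the slide):

```python
def predictive(context, p_eof=0.15):
    """P(a), P(b), P(eof) given the symbols seen so far: Laplace's rule
    (count + 1)/(n + 2) for a and b, rescaled so P(eof) stays at p_eof."""
    na, nb = context.count("a"), context.count("b")
    n = na + nb
    pa = (1 - p_eof) * (na + 1) / (n + 2)
    pb = (1 - p_eof) * (nb + 1) / (n + 2)
    return pa, pb, p_eof

for ctx in ("", "b", "bb", "bbb", "bbba"):
    pa, pb, pe = predictive(ctx)
    print(f"{ctx!r:7} P(a)={pa:.3f}  P(b)={pb:.3f}  P(eof)={pe:.2f}")
```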

Page 39: T-61.182 Information Theory and Machine Learning Data

[Figure 6.4: illustration of the arithmetic coding process as the sequence bbba□ is transmitted. The source intervals a, b, □, ba, bb, ..., bbba, bbbab, bbba□ are shown next to the binary intervals 0, 1, 00, 01, ..., and the transmitted string 100111101 picks out an interval lying inside the interval of bbba□.]

Page 40: T-61.182 Information Theory and Machine Learning Data

[Figure: the full correspondence between source strings (a, b, □, aa, ab, a□, ..., bbbb) and their intervals, alongside the binary strings 0, 1, 00, ..., 11111 and their intervals.]

Page 41: T-61.182 Information Theory and Machine Learning Data

On Arithmetic Coding

• Computationally efficient

• Length of a string closely matches the Shannon information content

• Overhead required to terminate a message is never more than 2 bits

⇒ Finding a good code is equivalent to finding a good probabilistic model!

• Flexible:

– any source alphabet and any encoded alphabet

– alphabets can change with time

– probabilities are context-dependent

• Can be used to generate random samples from random bits

economically

Page 42: T-61.182 Information Theory and Machine Learning Data

Lempel-Ziv Coding

• Used in gzip etc.

source substrings   λ    1      0       11       01       010       00        10
s(n)                0    1      2       3        4        5         6         7
s(n) in binary      000  001    010     011      100      101       110       111
(pointer, bit)           (, 1)  (0, 0)  (01, 1)  (10, 1)  (100, 0)  (010, 0)  (001, 0)

• Asymptotically compresses down to the entropy of the source (though not in practice)
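The parsing in the table can be reproduced mechanically: scan the source, grow the shortest substring not yet in the dictionary, and emit a pointer to its longest known prefix (using ⌈log2 n⌉ bits when the dictionary holds n substrings) plus the one new bit. A sketch (function name is mine) that reproduces the table above from the concatenation of the listed substrings:

```python
from math import ceil, log2

def lz_parse(source):
    """Parse the source into substrings never seen before and emit (pointer, bit)
    pairs; the pointer uses ceil(log2 n) bits when the dictionary holds n entries."""
    dictionary = [""]        # s(0) is the empty string lambda
    output, w = [], ""
    for bit in source:
        if w + bit in dictionary:
            w += bit         # keep extending until the substring is new
            continue
        n = len(dictionary)
        bits_needed = ceil(log2(n)) if n > 1 else 0
        pointer = format(dictionary.index(w), f"0{bits_needed}b") if bits_needed else ""
        output.append((pointer, bit))
        dictionary.append(w + bit)
        w = ""
    return output

# The slide's example: substrings lambda, 1, 0, 11, 01, 010, 00, 10.
print(lz_parse("1011010100010"))
# [('', '1'), ('0', '0'), ('01', '1'), ('10', '1'), ('100', '0'), ('010', '0'), ('001', '0')]
```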

Page 43: T-61.182 Information Theory and Machine Learning Data

Summary (1/2)

• Fixed-length block codes (Chapter 4)

– Only a tiny fraction of source strings are given an encoding

– Identify entropy as the measure of compressibility

– No practical use

• Symbol codes (Chapter 5)

– Variable code lengths allow lossless compression

– Expected code length is H + DKL (between the source

distribution and the code’s implicit distribution)

– DKL can be made smaller than 1 bit per symbol

– Huffman code is the optimal symbol code

Page 44: T-61.182 Information Theory and Machine Learning Data

Summary (2/2)

• Stream codes (Chapter 6)

– Arithmetic coding combines a probabilistic model with an

encoding algorithm

– Lempel-Ziv memorises strings that have already occurred

– If any of the bits is altered by noise, the rest of the encoding fails

Page 45: T-61.182 Information Theory and Machine Learning Data

Exercises

• 6.19 (entropy and information)

• 4.16 (Shannon source coding theorem)

• 6.16 (Huffman coding)

• 6.7 (Arithmetic coding)