
Page 1: T-61.182 Information Theory and Machine Learning Data

T-61.182 Information Theory and Machine Learning

Data Compression (Chapters 4-6)

presented by Tapani Raiko

Feb 26, 2004

Page 2: T-61.182 Information Theory and Machine Learning Data

Contents (Data Compression)

          Chap. 4                          Chap. 5                     Chap. 6
Data      Block                            Symbol                      Stream
Lossy?    Lossy                            Lossless                    Lossless
Result    Shannon's source coding theorem  Huffman coding algorithm    Arithmetic coding algorithm

Page 3: T-61.182 Information Theory and Machine Learning Data

Weighing Problem (What is information?)

• 12 balls, all equal in weight except for one

• Two-pan balance to use

• Determine which is the odd ball and whether it is heavier or lighter

• As few uses of the balance as possible!

• The outcome of a random experiment is guaranteed to be most informative if the probability distribution over outcomes is uniform

Page 4: T-61.182 Information Theory and Machine Learning Data

[Figure: an optimal solution to the weighing problem. The first weighing compares balls 1 2 3 4 with 5 6 7 8; the second compares 1 2 6 with 3 4 5 (or 9 10 11 with 1 2 3 if the first balanced); the third weighing then leaves a single hypothesis among 1+, 1−, ..., 12+, 12−.]

Page 5: T-61.182 Information Theory and Machine Learning Data

Definitions

• Shannon information content: h(x = ai) ≡ log2(1/pi)

• Entropy: H(X) = ∑i pi log2(1/pi)

• Both are additive for independent variables

p      h(p)   H2(p)
0.001  10.0   0.011
0.01    6.6   0.081
0.1     3.3   0.47
0.2     2.3   0.72
0.5     1.0   1.0

Figure 4.1 shows the Shannon information content h(p) = log2(1/p) of an outcome with probability p, together with the binary entropy function H2(p), plotted against p.
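These two definitions translate directly into code. A minimal sketch in Python (the function names are mine, not from the slides), which also reproduces the h(p) and H2(p) values tabulated above up to rounding:

```python
import math

def information_content(p):
    """Shannon information content h = log2(1/p) of an outcome with probability p."""
    return math.log2(1.0 / p)

def entropy(probs):
    """Entropy H(X) = sum_i p_i log2(1/p_i) in bits; outcomes with p_i = 0 contribute 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Reproduce (up to rounding) the h(p) and H2(p) values tabulated above.
for p in (0.001, 0.01, 0.1, 0.2, 0.5):
    print(f"p={p:<5}  h(p)={information_content(p):5.2f}  H2(p)={entropy([p, 1 - p]):5.3f}")
```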

Page 6: T-61.182 Information Theory and Machine Learning Data

Game of Submarine

• Player hides a submarine in one square of an 8 by 8 grid

• Another player tries to hit it

[Figure: an 8 × 8 grid (columns A-H, rows 1-8) showing the squares eliminated by each miss and the final hit.]

move #       1        2        32       48       49
question     G3       B1       E5       F3       H3
outcome      x = n    x = n    x = n    x = n    x = y
P(x)         63/64    62/63    32/33    16/17    1/16
h(x)         0.0227   0.0230   0.0443   0.0874   4.0
Total info.  0.0227   0.0458   1.0      2.0      6.0

• Compare to asking 6 yes/no questions about the location
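The "Total info." row follows from additivity: after k misses the total information received is log2(64/(64 − k)), and the final hit among the 16 remaining squares adds log2 16 = 4 bits. A small sketch of that calculation (the helper name is mine):

```python
import math

# After k misses, P(all k answers so far) = (64 - k)/64, so the total
# information received is log2(64 / (64 - k)).
def info_after_misses(k, squares=64):
    return math.log2(squares / (squares - k))

for k in (1, 2, 32, 48):
    print(f"after {k:2d} misses: {info_after_misses(k):.4f} bits")

# Move 49 is a hit among the 16 remaining squares: it adds log2(16) = 4 bits.
print("after the hit on move 49:", info_after_misses(48) + math.log2(16), "bits")
```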

Page 7: T-61.182 Information Theory and Machine Learning Data

Raw Bit Content

• A binary name is given to each outcome of a random variable X

• The length of the names would be log2 |AX |

(assuming |AX | happens to be a power of 2)

• Define: The raw bit content of X is

H0(X) = log2 |AX |

• Simply counts the possible outcomes - no compression yet

• Additive: H0(X, Y ) = H0(X) + H0(Y )

Page 8: T-61.182 Information Theory and Machine Learning Data

Lossy Compression

• Let
  AX = { a, b, c, d, e, f, g, h }
  PX = { 1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64 }

• The raw bit content is 3 bits (8 binary names)

• If we are willing to run a risk of δ = 1/16 of not having a name for x, then we can get by with 2 bits (4 names)

δ = 0            δ = 1/16
x  c(x)          x  c(x)
a  000           a  00
b  001           b  01
c  010           c  10
d  011           d  11
e  100           e  −
f  101           f  −
g  110           g  −
h  111           h  −

Page 9: T-61.182 Information Theory and Machine Learning Data

[Figure: the outcomes of X ranked by their probability, placed on a −log2 P(x) axis (a, b, c at −2; d at −2.4; e, f, g, h at −6), with the subsets S0 and S1/16 marked.]

Page 10: T-61.182 Information Theory and Machine Learning Data

Essential Bit Content

• Allow an error with probability δ

• Choose the smallest sufficient subset Sδ such that

P (x ∈ Sδ) ≥ 1 − δ

(arrange the elements of AX in order of decreasing probability and take enough elements from the beginning)

• Define: The essential bit content of X is

Hδ(X) = log2 |Sδ|

• Note that the raw bit content H0 is the special case of Hδ with δ = 0
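Hδ(X) can be computed exactly as the definition says: sort the probabilities, count how many outcomes are needed to cover probability 1 − δ, and take the log. A minimal sketch (function name is mine), checked against the eight-outcome ensemble of the previous slide:

```python
import math

def essential_bit_content(probs, delta):
    """H_delta(X) = log2 |S_delta|, where S_delta is the smallest subset of
    outcomes whose total probability is at least 1 - delta."""
    total, size = 0.0, 0
    for p in sorted(probs, reverse=True):   # most probable outcomes first
        if total >= 1.0 - delta:
            break
        total += p
        size += 1
    return math.log2(size)

# The eight-outcome ensemble from the lossy-compression slide:
P = [1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64]
print(essential_bit_content(P, 0))      # 3.0 -- the raw bit content H_0(X)
print(essential_bit_content(P, 1/16))   # 2.0 -- outcomes e..h are dropped
```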

Page 11: T-61.182 Information Theory and Machine Learning Data

[Figure: Hδ(X) as a function of the allowed probability of error δ; as δ grows, the sufficient subset shrinks in steps from {a,b,c,d,e,f,g,h} down to {a}, so Hδ(X) drops from 3 bits towards 0.]

Page 12: T-61.182 Information Theory and Machine Learning Data

Extended Ensembles (Blocks)

• Consider a tuple of N i.i.d. random variables

• Denote by X^N the ensemble (X1, X2, . . . , XN)

• Entropy is additive: H(X^N) = N H(X)

• Example: N flips of a bent coin: p0 = 0.9, p1 = 0.1
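For small N, Hδ(X^N) can be computed by brute-force enumeration, and the per-symbol value (1/N) Hδ(X^N) can be watched approaching H(X) = H2(0.1) ≈ 0.47 bits, which is what the figures on the following slides illustrate. A sketch (function name is mine; enumeration is only feasible for small N):

```python
import math
from itertools import product

def essential_bits_per_symbol(p1, N, delta):
    """(1/N) H_delta(X^N) for N i.i.d. flips of a coin with P(1) = p1,
    computed by enumerating all 2^N outcomes."""
    probs = sorted((p1 ** sum(x) * (1 - p1) ** (N - sum(x))
                    for x in product((0, 1), repeat=N)), reverse=True)
    total, size = 0.0, 0
    for p in probs:
        if total >= 1.0 - delta:
            break
        total += p
        size += 1
    return math.log2(size) / N

# H(X) = H2(0.1) is about 0.469 bits; the per-symbol essential bit content
# creeps towards it as N grows (only small N are feasible by enumeration).
for N in (4, 10, 14):
    print(N, round(essential_bits_per_symbol(0.1, N, delta=0.1), 3))
```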

Page 13: T-61.182 Information Theory and Machine Learning Data

[Figure: outcomes of the bent coin ensemble X^4 ranked by their probability on a −log2 P(x) axis, from 0000 down to 1111, with the sets S0.01 and S0.1 marked.]

Page 14: T-61.182 Information Theory and Machine Learning Data

[Figure: the essential bit content Hδ(X^4) of the bent coin ensemble as a function of δ (N = 4).]

Page 15: T-61.182 Information Theory and Machine Learning Data

[Figure: the essential bit content Hδ(X^10) of the bent coin ensemble as a function of δ (N = 10).]

Page 16: T-61.182 Information Theory and Machine Learning Data

[Figure: the essential bit content per toss, (1/N) Hδ(X^N), as a function of δ for N = 10, 210, 410, 610, 810, 1010; as N increases the curves flatten out towards H(X).]

Page 17: T-61.182 Information Theory and Machine Learning Data

Shannon’s Source Coding Theorem

Given ε > 0 and 0 < δ < 1, there exists a positive integer N0 such that for N > N0,

  | (1/N) Hδ(X^N) − H(X) | < ε.

[Figure: schematic of (1/N) Hδ(X^N) as a function of δ, starting near H0(X) at δ = 0 and lying within the band between H − ε and H + ε for 0 < δ < 1.]

• Proof involves

– Law of large numbers

– Chebyshev’s inequality

Page 18: T-61.182 Information Theory and Machine Learning Data

[Table: fifteen random samples x from X^100 (strings of 100 symbols, mostly 0s with a handful of 1s) together with log2 P(x), with values between about −37 and −66; for comparison, the all-0 string has log2 P(x) = −15.2 and the all-1 string −332.1.]

Some samples from X^100. Compare to H(X^100) = 46.9 bits.

Page 19: T-61.182 Information Theory and Machine Learning Data

Typicality

• A string contains r 1s and N − r 0s

• Consider r as a random variable (binomial distribution)

• Mean and standard deviation: r ∼ N p1 ± sqrt(N p1 (1 − p1))

• A typical string is one with r ≃ N p1

• In general, the information content of a typical string is within N [H(X) ± β]:

  log2(1/P(x)) ≃ N ∑i pi log2(1/pi) = N H(X)

Page 20: T-61.182 Information Theory and Machine Learning Data

[Figure: anatomy of the typical set T for N = 100 (left column) and N = 1000 (right column): the number of strings n(r) = (N choose r) as a function of r, log2 P(x) with the typical-set band T marked, and the total probability n(r) P(x) = (N choose r) p1^r (1 − p1)^(N−r), which is concentrated around r ≈ N p1.]

Page 21: T-61.182 Information Theory and Machine Learning Data

[Figure: outcomes of X^N ranked by their probability on a −log2 P(x) axis; the typical set T_Nβ is a band of outcomes whose log2 P(x) lies close to −N H(X).]

Page 22: T-61.182 Information Theory and Machine Learning Data

Shannon's source coding theorem (verbal statement)

N i.i.d. random variables each with entropy H(X) can be compressed

into more than NH(X) bits with negligible risk of information loss, as

N → ∞;

conversely if they are compressed into fewer than NH(X) bits it is

virtually certain that information will be lost.

Page 23: T-61.182 Information Theory and Machine Learning Data

End of Chapter 4

          Chap. 4                          Chap. 5                     Chap. 6
Data      Block                            Symbol                      Stream
Lossy?    Lossy                            Lossless                    Lossless
Result    Shannon's source coding theorem  Huffman coding algorithm    Arithmetic coding algorithm

Page 24: T-61.182 Information Theory and Machine Learning Data

Contents, Chap. 5: Symbol Codes

• Lossless coding: shorter encodings to the more probable outcomes

and longer encodings to the less probable

• Practical to decode?

• Best achievable compression?

• Source coding theorem (symbol codes):

The expected length L(C, X) ∈ [H(X), H(X) + 1).

• Huffman coding algorithm

Page 25: T-61.182 Information Theory and Machine Learning Data

Definitions

• A (binary) symbol code is a mapping from AX to {0, 1}+

• c(x) is the codeword of x and l(x) its length

• Extended code c+(x1x2 . . . xN ) = c(x1)c(x2) . . . c(xN )

(no punctuation)

• A code C(X) is uniquely decodable if no two distinct strings have

the same encoding

• A symbol code is called a prefix code if no codeword is a prefix of

any other codeword (constraining to prefix codes doesn’t lose any

performance)
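The prefix property is easy to test mechanically; a minimal sketch (function name is mine), applied to the codes C1 and C2 of the next slide:

```python
def is_prefix_code(codewords):
    """True if no codeword is a prefix of any other codeword."""
    for c in codewords:
        for d in codewords:
            if c != d and d.startswith(c):
                return False
    return True

print(is_prefix_code(["0", "101"]))   # True  -- code C1 on the next slide
print(is_prefix_code(["1", "101"]))   # False -- code C2: 1 is a prefix of 101
```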

Page 26: T-61.182 Information Theory and Machine Learning Data

Examples

AX = {a, b, c, d},  PX = { 1/2, 1/4, 1/8, 1/8 }

• Using C0: c+(acdba) = 10000010000101001000

• Code C1 = {0, 101} is a prefix code, so it can be represented as a tree

• Code C2 = {1, 101} is not a prefix code, because 1 is a prefix of 101

C0:
ai  c(ai)  li
a   1000   4
b   0100   4
c   0010   4
d   0001   4

[Figure: the binary tree for C1, with leaves at the codewords 0 and 101.]

Page 27: T-61.182 Information Theory and Machine Learning Data

Expected length

• Expected length L(C, X) of a symbol code C for ensemble X is

L(C, X) =∑

x∈AX

P (x) l(x).

• Bounded below by H(X) (uniquely decodeable code)

• Equal to H(X) only if the codelengths are equal to the Shannon

information contents:

li = log2(1/pi)

• Codelengths implicitly define a probability distribution {qi}

qi ≡ 2−li
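Expected length and the implicit distribution are one-liners; a small sketch (function names are mine), applied to the lengths of code C3 from the next slide, for which L(C, X) = H(X) = 1.75 bits:

```python
import math

def expected_length(p, lengths):
    """L(C, X) = sum_x P(x) l(x)."""
    return sum(pi * li for pi, li in zip(p, lengths))

def entropy(p):
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

p = [1/2, 1/4, 1/8, 1/8]
lengths = [1, 2, 3, 3]                    # codelengths of C3 on the next slide
q = [2.0 ** -l for l in lengths]          # implicit distribution q_i = 2^(-l_i)
print(expected_length(p, lengths), entropy(p))   # both 1.75: l_i = log2(1/p_i)
print(q)                                         # [0.5, 0.25, 0.125, 0.125]
```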

Page 28: T-61.182 Information Theory and Machine Learning Data

Examples

[Figure: the binary trees for codes C3 and C4.]

• L(C3, X) = 1.75 = H(X)

• L(C4, X) = 2 > H(X)

• L(C5, X) = 1.25 < H(X) (possible only because C5 is not uniquely decodable)

C3:
ai  c(ai)  pi   h(pi)  li
a   0      1/2  1.0    1
b   10     1/4  2.0    2
c   110    1/8  3.0    3
d   111    1/8  3.0    3

C4 and C5:
ai  C4  C5
a   00  0
b   01  1
c   10  00
d   11  11

Page 29: T-61.182 Information Theory and Machine Learning Data

Example

C6:

ai c(ai) pi h(pi) li

a   0    1/2  1.0  1
b   01   1/4  2.0  2
c   011  1/8  3.0  3
d   111  1/8  3.0  3

• L(C6, X) = 1.75 = H(X)

• C6 is not a prefix code but is in fact uniquely decodable

Page 30: T-61.182 Information Theory and Machine Learning Data

Kraft Inequality

• If a code is uniquely decodable, its lengths must satisfy ∑i 2^(−li) ≤ 1

• For any lengths satisfying the Kraft inequality, there exists a prefix

code with those lengths

[Figure: the total symbol code budget, drawn as the binary strings of length 1 to 4 (0, 1, 00, ..., 1111); each codeword of length l uses up a fraction 2^(−l) of the budget.]
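Both directions of the inequality are easy to demonstrate in code: the sum ∑i 2^(−li) can be checked directly, and for lengths that satisfy it a prefix code can be built by assigning codewords in order of increasing length (the standard canonical construction; function names are mine):

```python
def kraft_sum(lengths):
    """Sum_i 2^(-l_i); at most 1 for any uniquely decodable code."""
    return sum(2.0 ** -l for l in lengths)

def prefix_code_from_lengths(lengths):
    """Assign codewords of the given lengths so that none is a prefix of another
    (assumes the lengths satisfy the Kraft inequality)."""
    code, value, prev = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - prev)              # move to the next free node at depth l
        code.append(format(value, f"0{l}b"))
        value += 1
        prev = l
    return code

lengths = [1, 2, 3, 3]
print(kraft_sum(lengths))                  # 1.0
print(prefix_code_from_lengths(lengths))   # ['0', '10', '110', '111']
```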

Page 31: T-61.182 Information Theory and Machine Learning Data

Source coding theorem for symbol codes

• By setting li = ⌈log2(1/pi)⌉, where ⌈l*⌉ denotes the smallest integer greater than or equal to l*, we get (with the Kraft inequality):

• There exists a prefix code C with

  H(X) ≤ L(C, X) < H(X) + 1

• The relative entropy DKL(p||q) measures how many bits per symbol are wasted:

  L(C, X) = ∑i pi log2(1/qi) = H(X) + DKL(p||q)

Page 32: T-61.182 Information Theory and Machine Learning Data

Huffman Coding Algorithm

1. Take the two least probable symbols in the alphabet

2. Give them the longest codewords, differing only in the last digit

3. Combine them into a single symbol and repeat

[Figure: the merging process for P = {0.25, 0.25, 0.2, 0.15, 0.15}: the two smallest probabilities are combined at each step (0.15 + 0.15 = 0.3, 0.2 + 0.25 = 0.45, 0.25 + 0.3 = 0.55, 0.45 + 0.55 = 1.0), and the 0/1 branch labels read off the codewords.]

ai  pi    h(pi)  li  c(ai)
a   0.25  2.0    2   00
b   0.25  2.0    2   10
c   0.2   2.3    2   11
d   0.15  2.7    3   010
e   0.15  2.7    3   011
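A compact implementation of these three steps keeps the intermediate symbols in a priority queue; a sketch (function name and tie-breaking are mine) that reproduces the codelengths in the table above, although the individual 0/1 choices are arbitrary and may differ from the slide:

```python
import heapq
import itertools

def huffman_code(probs):
    """Binary Huffman code for symbols 0..n-1 with the given probabilities:
    repeatedly merge the two least probable nodes, prepending 0/1 to the
    codewords of the symbols they contain."""
    counter = itertools.count()      # tie-breaker so the heap never compares lists
    heap = [(p, next(counter), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    codes = [""] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1:
            codes[i] = "0" + codes[i]
        for i in s2:
            codes[i] = "1" + codes[i]
        heapq.heappush(heap, (p1 + p2, next(counter), s1 + s2))
    return codes

# The five-symbol ensemble from the slide; the codelengths (2, 2, 2, 3, 3) match
# the table, and the expected length is 2.3 bits.
print(huffman_code([0.25, 0.25, 0.2, 0.15, 0.15]))
```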

Page 33: T-61.182 Information Theory and Machine Learning Data

Optimality

• Huffman coding is optimal in two senses:

– Smallest expected codelength among all uniquely decodable symbol codes

– Prefix code → easy to decode

• But:

– The overhead of between 0 and 1 bits per symbol is important if

H(X) is small → compress blocks of symbols to make H(X)

larger

– Does not take context into account (symbol code vs. stream

code)

Page 34: T-61.182 Information Theory and Machine Learning Data

End of Chapter 5

          Chap. 4                          Chap. 5                     Chap. 6
Data      Block                            Symbol                      Stream
Lossy?    Lossy                            Lossless                    Lossless
Result    Shannon's source coding theorem  Huffman coding algorithm    Arithmetic coding algorithm

Page 35: T-61.182 Information Theory and Machine Learning Data

Guessing Game

• A human subject was asked to guess a sentence character by character

• The number of guesses is listed below each character

T H E R E - I S - N O - R E V E R S E - O N - A - M O T O R C Y C L E -

1 1 1 5 1 1 2 1 1 2 1 1 15 1 17 1 1 1 2 1 3 2 1 2 2 7 1 1 1 1 4 1 1 1 1 1

• One could encode only the string 1, 1, 1, 5, 1, . . .

• Decoding requires an identical twin who also plays the guessing

game

Page 36: T-61.182 Information Theory and Machine Learning Data

Arithmetic Coding (1/2)

• Human predictor is replaced by a probabilistic model of the source

• The model supplies a predictive distribution over the next symbol

• It can handle complex adaptive models (context-dependent)

• Binary strings define real intervals

within the real line [0, 1)

• The string 01 corresponds to

[0.01, 0.10) in binary or [0.25, 0.50)

in base ten

[Figure: the real line [0, 1) marked at 0.00, 0.25, 0.50, 0.75, 1.00, showing the intervals corresponding to the binary strings 0, 1, 01, and 01101.]

Page 37: T-61.182 Information Theory and Machine Learning Data

Arithmetic Coding (2/2)

• Divide the real line [0, 1) into I intervals of lengths equal to the probabilities P(x1 = ai), as shown in Figure 6.2

[Figure 6.2: the real line [0, 1) divided at 0, P(x1 = a1), P(x1 = a1) + P(x1 = a2), ..., P(x1 = a1) + · · · + P(x1 = aI−1), 1.0 into intervals labelled a1, a2, ..., aI, with sub-intervals such as a2a1, ..., a2a5 of the a2 interval also shown.]

• Pick an interval and subdivide it (and iterate)

• Send a binary string whose interval lies within that interval
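The encoder's state is just the current interval [lo, hi): each symbol narrows it in proportion to the model's predictive distribution, and any binary string whose interval lies inside the final interval identifies the message. A minimal sketch of the interval arithmetic only (the fixed two-symbol model is mine, and a real coder would also renormalise to avoid running out of precision):

```python
def narrow(lo, hi, cdf_lo, cdf_hi):
    """Shrink [lo, hi) to the sub-interval belonging to a symbol whose
    cumulative probability range is [cdf_lo, cdf_hi)."""
    width = hi - lo
    return lo + width * cdf_lo, lo + width * cdf_hi

# A fixed (non-adaptive) predictive distribution over two symbols.
cum = {"a": (0.0, 0.6), "b": (0.6, 1.0)}   # P(a) = 0.6, P(b) = 0.4

lo, hi = 0.0, 1.0
for symbol in "abba":
    lo, hi = narrow(lo, hi, *cum[symbol])
print(lo, hi)   # any binary string whose interval fits inside [lo, hi) encodes "abba"
```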

Page 38: T-61.182 Information Theory and Machine Learning Data

Example: Bent Coin (1/3)

• The coin sides are a and b, and the 'end of file' symbol is □

• Use a Bayesian model with a uniform prior over the probabilities of the outcomes

Context (sequence thus far)   Probability of next symbol
                              P(a) = 0.425        P(b) = 0.425        P(□) = 0.15
b                             P(a | b) = 0.28     P(b | b) = 0.57     P(□ | b) = 0.15
bb                            P(a | bb) = 0.21    P(b | bb) = 0.64    P(□ | bb) = 0.15
bbb                           P(a | bbb) = 0.17   P(b | bbb) = 0.68   P(□ | bbb) = 0.15
bbba                          P(a | bbba) = 0.28  P(b | bbba) = 0.57  P(□ | bbba) = 0.15
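The numbers in this table are consistent with Laplace's rule (count + 1)/(n + 2) applied to the a/b counts, rescaled so that the end-of-file symbol always gets probability 0.15. A sketch under that assumption (it may not be exactly the model used on the slide):

```python
def predictive(context, p_eof=0.15):
    """P(a), P(b), P(eof) given the symbols seen so far: Laplace's rule
    (count + 1)/(n + 2) for a and b, rescaled so P(eof) stays at p_eof."""
    na, nb = context.count("a"), context.count("b")
    n = na + nb
    pa = (1 - p_eof) * (na + 1) / (n + 2)
    pb = (1 - p_eof) * (nb + 1) / (n + 2)
    return pa, pb, p_eof

for ctx in ("", "b", "bb", "bbb", "bbba"):
    pa, pb, pe = predictive(ctx)
    print(f"{ctx!r:7} P(a)={pa:.3f}  P(b)={pb:.3f}  P(eof)={pe:.2f}")
```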

Page 39: T-61.182 Information Theory and Machine Learning Data

[Figure 6.4: illustration of the arithmetic coding process as the sequence bbba□ is transmitted. The source intervals a, b, □, ba, bb, ..., bbba, bbbab, bbba□ are shown next to the binary intervals 0, 1, 00, 01, ..., and the transmitted string 100111101 picks out an interval lying inside the interval of bbba□.]

Page 40: T-61.182 Information Theory and Machine Learning Data

[Figure: the full correspondence between source strings (a, b, □, aa, ab, a□, ..., bbbb) and their intervals, alongside the binary strings 0, 1, 00, ..., 11111 and their intervals.]

Page 41: T-61.182 Information Theory and Machine Learning Data

On Arithmetic Coding

• Computationally efficient

• Length of a string closely matches the Shannon information content

• Overhead required to terminate a message is never more than 2 bits

⇒ Finding a good code is equivalent to finding a good probabilistic model!

• Flexible:

– any source alphabet and any encoded alphabet

– alphabets can change with time

– probabilities are context-dependent

• Can be used to generate random samples from random bits

economically

Page 42: T-61.182 Information Theory and Machine Learning Data

Lempel-Ziv Coding

• Used in gzip etc.

source substrings   λ    1      0       11       01       010       00        10
s(n)                0    1      2       3        4        5         6         7
s(n) in binary      000  001    010     011      100      101       110       111
(pointer, bit)           (, 1)  (0, 0)  (01, 1)  (10, 1)  (100, 0)  (010, 0)  (001, 0)

• Asymptotically compresses down to the entropy of the source (though not in practice)
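The parsing in the table can be reproduced mechanically: scan the source, grow the shortest substring not yet in the dictionary, and emit a pointer to its longest known prefix (using ⌈log2 n⌉ bits when the dictionary holds n substrings) plus the one new bit. A sketch (function name is mine) that reproduces the table above from the concatenation of the listed substrings:

```python
from math import ceil, log2

def lz_parse(source):
    """Parse the source into substrings never seen before and emit (pointer, bit)
    pairs; the pointer uses ceil(log2 n) bits when the dictionary holds n entries."""
    dictionary = [""]        # s(0) is the empty string lambda
    output, w = [], ""
    for bit in source:
        if w + bit in dictionary:
            w += bit         # keep extending until the substring is new
            continue
        n = len(dictionary)
        bits_needed = ceil(log2(n)) if n > 1 else 0
        pointer = format(dictionary.index(w), f"0{bits_needed}b") if bits_needed else ""
        output.append((pointer, bit))
        dictionary.append(w + bit)
        w = ""
    return output

# The slide's example: substrings lambda, 1, 0, 11, 01, 010, 00, 10.
print(lz_parse("1011010100010"))
# [('', '1'), ('0', '0'), ('01', '1'), ('10', '1'), ('100', '0'), ('010', '0'), ('001', '0')]
```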

Page 43: T-61.182 Information Theory and Machine Learning Data

Summary (1/2)

• Fixed-length block codes (Chapter 4)

– Only a tiny fraction of source strings are given an encoding

– Identify entropy as the measure of compressibility

– No practical use

• Symbol codes (Chapter 5)

– Variable code lengths allow lossless compression

– Expected code length is H + DKL (between the source

distribution and the code’s implicit distribution)

– DKL can be made smaller than 1 bit per symbol

– Huffman code is the optimal symbol code

Page 44: T-61.182 Information Theory and Machine Learning Data

Summary (2/2)

• Stream codes (Chapter 6)

– Arithmetic coding combines a probabilistic model with an

encoding algorithm

– Lempel-Ziv memorises strings that have already occurred

– If any of the bits is altered by noise, the rest of the encoding fails

Page 45: T-61.182 Information Theory and Machine Learning Data

Exercises

• 6.19 (entropy and information)

• 4.16 (Shannon source coding theorem)

• 6.16 (Huffman coding)

• 6.7 (Arithmetic coding)