T-61.182 Information Theory and Machine Learning: Data Compression
TRANSCRIPT
![Page 1: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/1.jpg)
T-61.182 Information Theory and Machine Learning
Data Compression (Chapters 4-6)
presented by Tapani Raiko
Feb 26, 2004
![Page 2: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/2.jpg)
Contents (Data Compression)
|        | Chap. 4 | Chap. 5 | Chap. 6 |
|--------|---------|---------|---------|
| Data   | Block   | Symbol  | Stream  |
| Lossy? | Lossy   | Lossless | Lossless |
| Result | Shannon’s source coding theorem | Huffman coding algorithm | Arithmetic coding algorithm |
![Page 3: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/3.jpg)
Weighing Problem (What is information?)
• 12 balls, all equal in weight except for one
• A two-pan balance to use
• Determine which is the odd ball and whether it is heavier or lighter
• Use as few weighings of the balance as possible!
• The outcome of a random experiment is guaranteed to be most informative if the probability distribution over outcomes is uniform
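A quick counting argument (not spelled out on the slide) gives the answer: there are 24 hypotheses (12 balls, each possibly heavier or lighter), and each weighing has three outcomes, so W weighings can distinguish at most $3^W$ hypotheses:

$$3^2 = 9 < 24 \le 27 = 3^3,$$

so two weighings cannot suffice, while three can; the strategy in the figure below achieves three.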
![Page 4: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/4.jpg)
Figure: an optimal solution to the weighing problem. Start with the 24 hypotheses 1+, 1−, ..., 12+, 12−. First weigh balls 1 2 3 4 against 5 6 7 8; depending on the outcome, next weigh 1 2 6 against 3 4 5 (if one pan was heavier) or 9 10 11 against 1 2 3 (if they balanced). Each weighing splits the remaining hypotheses into three near-equal groups, so three weighings always identify the odd ball and whether it is heavy or light.
![Page 5: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/5.jpg)
Definitions
• Shannon information content:
$$h(x = a_i) \equiv \log_2 \frac{1}{p_i}$$
• Entropy:
$$H(X) = \sum_i p_i \log_2 \frac{1}{p_i}$$
• Both are additive for independent variables
Figure 4.1: the Shannon information content h(p) = log2(1/p) and the binary entropy H2(p), plotted as functions of p, with selected values:

| p | h(p) | H2(p) |
|-------|------|-------|
| 0.001 | 10.0 | 0.011 |
| 0.01 | 6.6 | 0.081 |
| 0.1 | 3.3 | 0.47 |
| 0.2 | 2.3 | 0.72 |
| 0.5 | 1.0 | 1.0 |
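A minimal sketch (not from the slides; plain Python, helper names are mine) of the two definitions:

```python
import math

def information_content(p_i: float) -> float:
    """Shannon information content h = log2(1/p_i), in bits."""
    return math.log2(1.0 / p_i)

def entropy(p: list[float]) -> float:
    """H(X) = sum_i p_i log2(1/p_i); terms with p_i = 0 contribute 0."""
    return sum(p_i * math.log2(1.0 / p_i) for p_i in p if p_i > 0)

# Reproduces the table above: h(0.1) = 3.3 bits, H2(0.1) = 0.47 bits.
print(information_content(0.1))   # ~3.32
print(entropy([0.1, 0.9]))        # ~0.469
```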
![Page 6: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/6.jpg)
Game of Submarine
• One player hides a submarine in one square of an 8-by-8 grid
• The other player tries to hit it
Figure: the 8-by-8 grid (columns A-H, rows 1-8) after moves 1, 2, 32, 48, and 49; × marks the squares already tried, and S marks the submarine, hit on move 49.
| move # | 1 | 2 | 32 | 48 | 49 |
|--------|---|---|----|----|----|
| question | G3 | B1 | E5 | F3 | H3 |
| outcome | x = n | x = n | x = n | x = n | x = y |
| P(x) | 63/64 | 62/63 | 32/33 | 16/17 | 1/16 |
| h(x) | 0.0227 | 0.0230 | 0.0443 | 0.0874 | 4.0 |
| Total info. | 0.0227 | 0.0458 | 1.0 | 2.0 | 6.0 |
• Compare to asking 6 yes/no questions about the location
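The totals telescope (a step worth making explicit): the product of the outcome probabilities is exactly the probability of the submarine's actual square, e.g. for the 49-move game above

$$\frac{63}{64}\cdot\frac{62}{63}\cdots\frac{16}{17}\cdot\frac{1}{16} = \frac{1}{64}, \qquad \sum_t h(x_t) = \log_2 64 = 6 \text{ bits},$$

so however the game unfolds, a hit always delivers 6 bits in total, just like the 6 yes/no questions.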
![Page 7: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/7.jpg)
Raw Bit Content
• A binary name is given to each outcome of a random variable X
• The length of each name would be $\log_2 |A_X|$ (assuming $|A_X|$ happens to be a power of 2)
• Define: the raw bit content of X is
$$H_0(X) = \log_2 |A_X|$$
• Simply counts the possible outcomes - no compression yet
• Additive: H0(X, Y ) = H0(X) + H0(Y )
![Page 8: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/8.jpg)
Lossy Compression
• Let
$$A_X = \{ a, b, c, d, e, f, g, h \}, \qquad P_X = \left\{ \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{3}{16}, \tfrac{1}{64}, \tfrac{1}{64}, \tfrac{1}{64}, \tfrac{1}{64} \right\}$$
• The raw bit content is 3 bits (8 binary names)
• If we are willing to run a risk of δ = 1/16 of not having a name for x, then we can get by with 2 bits (4 names)
δ = 0:

| x | c(x) |
|---|------|
| a | 000 |
| b | 001 |
| c | 010 |
| d | 011 |
| e | 100 |
| f | 101 |
| g | 110 |
| h | 111 |

δ = 1/16:

| x | c(x) |
|---|------|
| a | 00 |
| b | 01 |
| c | 10 |
| d | 11 |
| e, f, g, h | - |
![Page 9: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/9.jpg)
Figure: the outcomes of X ranked by their probability on a −log2 P(x) axis (a, b, c at 2 bits; d at 2.4 bits; e, f, g, h at 6 bits); the smallest sufficient subsets S_0 = {a, ..., h} and S_1/16 = {a, b, c, d} are marked.
![Page 10: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/10.jpg)
Essential Bit Content
• Allow an error with probability δ
• Choose the smallest sufficient subset $S_\delta$ such that
$$P(x \in S_\delta) \ge 1 - \delta$$
(arrange the elements of $A_X$ in order of decreasing probability and take enough from the beginning)
• Define: the essential bit content of X is
$$H_\delta(X) = \log_2 |S_\delta|$$
• Note that the raw bit content $H_0$ is a special case of $H_\delta$
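A small sketch (helper name is mine) computing $H_\delta$ for the eight-outcome ensemble of the lossy-compression slide:

```python
import math

def essential_bit_content(p: list[float], delta: float) -> float:
    """H_delta(X) = log2 |S_delta|, where S_delta is the smallest set
    of outcomes with total probability >= 1 - delta."""
    total, size = 0.0, 0
    # Take outcomes in order of decreasing probability until enough mass is covered.
    for p_i in sorted(p, reverse=True):
        if total >= 1.0 - delta:
            break
        total += p_i
        size += 1
    return math.log2(size)

p = [1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64]
print(essential_bit_content(p, 0.0))    # 3.0  (= H_0, the raw bit content)
print(essential_bit_content(p, 1/16))   # 2.0  (drop e..h, keep {a, b, c, d})
```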
![Page 11: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/11.jpg)
Figure: the essential bit content Hδ(X) as a function of the allowed probability of error δ: a staircase dropping from 3 bits (Sδ = {a, b, c, d, e, f, g, h}) at δ = 0, through {a, b, c, d, e, f, g}, ..., {a, b}, down to {a} as δ grows.
![Page 12: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/12.jpg)
Extended Ensembles (Blocks)
• Consider a tuple of N i.i.d. random variables
• Denote by $X^N$ the ensemble $(X_1, X_2, \ldots, X_N)$
• Entropy is additive: $H(X^N) = NH(X)$
• Example: N flips of a bent coin: $p_0 = 0.9$, $p_1 = 0.1$
![Page 13: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/13.jpg)
Figure: outcomes of the bent coin ensemble X^4 ranked by their probability on a log2 P(x) axis (0000 at the top, 1111 near −13 bits at the bottom), with the subsets S_0.01 and S_0.1 marked.
![Page 14: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/14.jpg)
Figure: the essential bit content Hδ(X^4) of the bent coin ensemble (N = 4) as a function of δ, a staircase from 4 bits down toward 0.
![Page 15: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/15.jpg)
Figure: the essential bit content Hδ(X^10) of the bent coin ensemble (N = 10) as a function of δ.
![Page 16: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/16.jpg)
Figure: essential bit content per toss, (1/N) Hδ(X^N), as a function of δ for N = 10, 210, 410, 610, 810, 1010; the curves flatten toward H(X) as N grows.
![Page 17: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/17.jpg)
Shannon’s Source Coding Theorem
Given ε > 0 and 0 < δ < 1, there exists a positive integer N₀ such that for N > N₀,

$$\left| \frac{1}{N} H_\delta(X^N) - H(X) \right| < \varepsilon .$$

Figure: (1/N) Hδ(X^N) as a function of δ: it equals H₀(X) at δ = 0 and lies between H − ε and H + ε for all 0 < δ < 1 once N is large enough.
• Proof involves
– Law of large numbers
– Chebyshev’s inequality
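A numerical sketch of the theorem (assumptions: the bent coin with p₁ = 0.1 and δ = 0.1; strings with the same count r of 1s are equiprobable, so Sδ can be accumulated one binomial class at a time):

```python
import math

def H_delta_per_symbol(N: int, p1: float, delta: float) -> float:
    """(1/N) * H_delta(X^N) for N i.i.d. flips of a bent coin."""
    total, size = 0.0, 0
    # Strings with r 1s all have probability p1**r * (1-p1)**(N-r),
    # which decreases with r when p1 < 0.5: accumulate r = 0, 1, 2, ...
    for r in range(N + 1):
        if total >= 1.0 - delta:
            break
        total += math.comb(N, r) * p1**r * (1.0 - p1)**(N - r)
        size += math.comb(N, r)
    return math.log2(size) / N

H = 0.1 * math.log2(1 / 0.1) + 0.9 * math.log2(1 / 0.9)   # H(X) ~ 0.469 bits
for N in (10, 100, 1000):
    print(N, round(H_delta_per_symbol(N, 0.1, 0.1), 3))   # approaches H(X) as N grows
```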
![Page 18: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/18.jpg)
Table: fifteen random samples x from X^100 (p₁ = 0.1) with their log2 P(x) values, ranging from −37.3 to −65.9; for contrast, the all-zeros string has log2 P(x) = −15.2 and the all-ones string −332.1. Compare to H(X^100) = 46.9 bits.
![Page 19: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/19.jpg)
Typicality
• A string contains r 1s and N − r 0s
• Consider r as a random variable (binomial distribution)
• Mean and standard deviation: $r \sim Np_1 \pm \sqrt{Np_1(1-p_1)}$
• A typical string is one with $r \simeq Np_1$
• In general, its information content lies within $N[H(X) \pm \beta]$:
$$\log_2 \frac{1}{P(\mathbf{x})} \simeq N \sum_i p_i \log_2 \frac{1}{p_i} = NH(X)$$
![Page 20: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/20.jpg)
Figure: anatomy of the typical set T for N = 100 (left column) and N = 1000 (right column). Top: the number of strings with r 1s, $n(r) = \binom{N}{r}$. Middle: $\log_2 P(x)$ as a function of r, with T marked. Bottom: the total probability $n(r)P(x) = \binom{N}{r} p_1^r (1-p_1)^{N-r}$, which concentrates around $r = Np_1$.
![Page 21: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/21.jpg)
Figure: outcomes of X^N ranked by their probability on a log2 P(x) axis, from 000...0 at the top down to 111...1 at the bottom; the typical set T_Nβ sits around −NH(X).
![Page 22: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/22.jpg)
Shannon’s source coding theorem (verbal statement)
N i.i.d. random variables each with entropy H(X) can be compressed
into more than NH(X) bits with negligible risk of information loss, as
N → ∞;
conversely if they are compressed into fewer than NH(X) bits it is
virtually certain that information will be lost.
![Page 23: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/23.jpg)
End of Chapter 4
|        | Chap. 4 | Chap. 5 | Chap. 6 |
|--------|---------|---------|---------|
| Data   | Block   | Symbol  | Stream  |
| Lossy? | Lossy   | Lossless | Lossless |
| Result | Shannon’s source coding theorem | Huffman coding algorithm | Arithmetic coding algorithm |
![Page 24: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/24.jpg)
Contents, Chap. 5: Symbol Codes
• Lossless coding: give shorter encodings to the more probable outcomes and longer encodings to the less probable ones
• Practical to decode?
• Best achievable compression?
• Source coding theorem (symbol codes): the expected length $L(C, X) \in [H(X), H(X) + 1)$
• Huffman coding algorithm
![Page 25: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/25.jpg)
Definitions
• A (binary) symbol code is a mapping from $A_X$ to $\{0, 1\}^+$
• c(x) is the codeword of x and l(x) its length
• Extended code: $c^+(x_1 x_2 \ldots x_N) = c(x_1)c(x_2)\ldots c(x_N)$ (no punctuation)
• A code C(X) is uniquely decodable if no two distinct strings have the same encoding
• A symbol code is called a prefix code if no codeword is a prefix of any other codeword (restricting to prefix codes loses no performance)
![Page 26: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/26.jpg)
Examples
$$A_X = \{a, b, c, d\}, \qquad P_X = \left\{ \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8} \right\}$$
• Using C0: $c^+$(acdba) = 10000010000101001000
• Code C1 = {0, 101} is a prefix code, so it can be represented as a tree
• Code C2 = {1, 101} is not a prefix code because 1 is a prefix of 101
C0:

| ai | c(ai) | li |
|----|-------|----|
| a | 1000 | 4 |
| b | 0100 | 4 |
| c | 0010 | 4 |
| d | 0001 | 4 |

Figure: the code tree for C1 = {0, 101}.
![Page 27: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/27.jpg)
Expected length
• The expected length L(C, X) of a symbol code C for ensemble X is
$$L(C, X) = \sum_{x \in A_X} P(x)\, l(x).$$
• Bounded below by H(X) for any uniquely decodable code
• Equal to H(X) only if the codelengths equal the Shannon information contents: $l_i = \log_2(1/p_i)$
• Codelengths implicitly define a probability distribution $\{q_i\}$:
$$q_i \equiv 2^{-l_i}$$
![Page 28: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/28.jpg)
Examples
Figure: code trees for C3 (codewords 0, 10, 110, 111) and C4 (codewords 00, 01, 10, 11).
• L(C3, X) = 1.75 = H(X)
• L(C4, X) = 2 > H(X)
• L(C5, X) = 1.25 < H(X) (possible only because C5 is not uniquely decodable)
C3:

| ai | c(ai) | pi | h(pi) | li |
|----|-------|----|-------|----|
| a | 0 | 1/2 | 1.0 | 1 |
| b | 10 | 1/4 | 2.0 | 2 |
| c | 110 | 1/8 | 3.0 | 3 |
| d | 111 | 1/8 | 3.0 | 3 |

C4 and C5:

| ai | C4 | C5 |
|----|----|----|
| a | 00 | 0 |
| b | 01 | 1 |
| c | 10 | 00 |
| d | 11 | 11 |
![Page 29: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/29.jpg)
Example
C6:

| ai | c(ai) | pi | h(pi) | li |
|----|-------|----|-------|----|
| a | 0 | 1/2 | 1.0 | 1 |
| b | 01 | 1/4 | 2.0 | 2 |
| c | 011 | 1/8 | 3.0 | 3 |
| d | 111 | 1/8 | 3.0 | 3 |
• L(C6, X) = 1.75 = H(X)
• C6 is not a prefix code but is in fact uniquely decodable
![Page 30: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/30.jpg)
Kraft Inequality
• If a code is uniquely decodable, its lengths must satisfy
$$\sum_i 2^{-l_i} \le 1$$
• For any lengths satisfying the Kraft inequality, there exists a prefix code with those lengths
Figure: the total symbol code budget, drawn as the binary tree of all codewords of length up to 4; each codeword of length l uses up a fraction $2^{-l}$ of the budget.
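The second bullet is constructive; a sketch (helper name is mine; note it returns the codewords in sorted-length order) that builds such a prefix code by counting upward in binary:

```python
def prefix_code_from_lengths(lengths: list[int]) -> list[str]:
    """Build a prefix code with the given codeword lengths,
    assuming they satisfy the Kraft inequality."""
    assert sum(2.0**-l for l in lengths) <= 1.0, "Kraft inequality violated"
    codewords, code, prev_len = [], 0, 0
    for l in sorted(lengths):
        code <<= (l - prev_len)               # extend to the new length
        codewords.append(format(code, f'0{l}b'))
        code += 1                             # the increment guarantees no earlier
        prev_len = l                          # codeword is a prefix of a later one
    return codewords

print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```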
![Page 31: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/31.jpg)
Source coding theorem for symbol codes
• By setting
$$l_i = \lceil \log_2(1/p_i) \rceil,$$
where $\lceil l^* \rceil$ denotes the smallest integer greater than or equal to $l^*$, we get (with the Kraft inequality):
• There exists a prefix code C with
$$H(X) \le L(C, X) < H(X) + 1$$
• The relative entropy $D_{\mathrm{KL}}(p\|q)$ measures how many bits per symbol are wasted:
$$L(C, X) = \sum_i p_i \log_2(1/q_i) = H(X) + D_{\mathrm{KL}}(p\|q)$$
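A quick numeric check of the bound (a sketch; the five-symbol distribution is the one used for Huffman coding on the next slide):

```python
import math

p = [0.25, 0.25, 0.2, 0.15, 0.15]             # an example non-dyadic distribution
lengths = [math.ceil(math.log2(1 / pi)) for pi in p]
assert sum(2.0**-l for l in lengths) <= 1.0    # Kraft holds, so a prefix code exists
L = sum(pi * li for pi, li in zip(p, lengths))
H = sum(pi * math.log2(1 / pi) for pi in p)
print(lengths, L, round(H, 3))                 # [2, 2, 3, 3, 3], L = 2.5, H ~ 2.286
assert H <= L < H + 1                          # the theorem's bound
```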
![Page 32: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/32.jpg)
Huffman Coding Algorithm
1. Take two least probable symbols in the alphabet
2. Give them the longest codewords differing only in the last digit
3. Combine them into a single symbol and repeat

Figure: the algorithm run on P = {0.25, 0.25, 0.2, 0.15, 0.15}; successive merges give {0.25, 0.25, 0.2, 0.3}, {0.25, 0.45, 0.3}, {0.55, 0.45}, and finally {1.0}.

| ai | pi | h(pi) | li | c(ai) |
|----|------|-------|----|-------|
| a | 0.25 | 2.0 | 2 | 00 |
| b | 0.25 | 2.0 | 2 | 10 |
| c | 0.2 | 2.3 | 2 | 11 |
| d | 0.15 | 2.7 | 3 | 010 |
| e | 0.15 | 2.7 | 3 | 011 |
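A compact sketch of the algorithm in Python (helper names are mine), using a heap of (probability, counter, partial code) entries; on the example above it reproduces the codeword lengths {2, 2, 2, 3, 3} and L(C, X) = 2.30 bits, though the codewords themselves may differ from the slide's:

```python
import heapq
from itertools import count

def huffman_code(probs: dict[str, float]) -> dict[str, str]:
    """Repeatedly merge the two least probable nodes (steps 1-3 above)."""
    tie = count()    # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(tie), {sym: ''}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)     # two least probable symbols,
        p1, _, code1 = heapq.heappop(heap)     # given codes differing in one digit
        merged = {s: '0' + c for s, c in code0.items()}
        merged.update({s: '1' + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))   # combined symbol
    return heap[0][2]

code = huffman_code({'a': 0.25, 'b': 0.25, 'c': 0.2, 'd': 0.15, 'e': 0.15})
print(code)   # lengths: a, b, c -> 2 bits; d, e -> 3 bits
```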
![Page 33: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/33.jpg)
Optimality
• Huffman coding is optimal in two senses:
– Smallest expected codelength of uniquely decodable symbol
codes
– Prefix code → easy to decode
• But:
– The overhead of between 0 and 1 bits per symbol is important if
H(X) is small → compress blocks of symbols to make H(X)
larger
– Does not take context into account (symbol code vs. stream
code)
![Page 34: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/34.jpg)
End of Chapter 5
|        | Chap. 4 | Chap. 5 | Chap. 6 |
|--------|---------|---------|---------|
| Data   | Block   | Symbol  | Stream  |
| Lossy? | Lossy   | Lossless | Lossless |
| Result | Shannon’s source coding theorem | Huffman coding algorithm | Arithmetic coding algorithm |
![Page 35: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/35.jpg)
Guessing Game
• A human was asked to guess a sentence character by character
• The numbers of guesses are listed below each character
T H E R E - I S - N O - R E V E R S E - O N - A - M O T O R C Y C L E -
1 1 1 5 1 1 2 1 1 2 1 1 15 1 17 1 1 1 2 1 3 2 1 2 2 7 1 1 1 1 4 1 1 1 1 1
• One could encode only the string 1, 1, 1, 5, 1, . . .
• Decoding requires an identical twin who also plays the guessing
game
![Page 36: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/36.jpg)
Arithmetic Coding (1/2)
• Human predictor is replaced by a probabilistic model of the source
• The model supplies a predictive distribution over the next symbol
• It can handle complex adaptive models (context-dependent)
• Binary strings define real intervals within the real line [0, 1)
• The string 01 corresponds to [0.01, 0.10) in binary, i.e. [0.25, 0.50) in base ten
Figure: the unit interval [0, 1) partitioned into [0, 0.5) for strings starting with 0 and [0.5, 1) for strings starting with 1; the string 01 and its refinement 01101 are shown as nested subintervals.
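A tiny sketch of this correspondence (helper name is mine):

```python
def binary_string_to_interval(s: str) -> tuple[float, float]:
    """The binary string s names the interval [0.s, 0.s + 2**-len(s))."""
    lo = sum(int(b) * 2.0**-(i + 1) for i, b in enumerate(s))
    return lo, lo + 2.0**-len(s)

print(binary_string_to_interval('01'))      # (0.25, 0.5)
print(binary_string_to_interval('01101'))   # (0.40625, 0.4375), inside [0.25, 0.5)
```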
![Page 37: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/37.jpg)
Arithmetic Coding (2/2)
• Divide the real line [0, 1) into I intervals whose lengths equal the probabilities $P(x_1 = a_i)$, as shown in figure 6.2 below
Figure 6.2: the intervals for a₁, a₂, ..., a_I, with boundaries at the cumulative probabilities $P(x_1{=}a_1)$, $P(x_1{=}a_1) + P(x_1{=}a_2)$, ..., up to 1.0; the nested subintervals for the strings a₂a₁ and a₂a₅ are shown inside the a₂ interval.
• Pick an interval and subdivide it (and iterate)
• Send a binary string whose interval lies within that interval
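A minimal encoder sketch under simplifying assumptions (floating-point intervals, whole-message encoding, a toy fixed model; practical coders use integer arithmetic and emit bits incrementally):

```python
def arithmetic_encode(symbols, predict):
    """Sketch: narrow [lo, hi) once per symbol, then emit a binary
    string whose dyadic interval lies inside the final [lo, hi)."""
    lo, hi = 0.0, 1.0
    for t, x in enumerate(symbols):
        width = hi - lo
        for sym, p in predict(symbols[:t]):   # predictive distribution, fixed order
            if sym == x:                      # keep symbol x's sub-interval
                hi = lo + width * p
                break
            lo += width * p
    bits, blo, w = '', 0.0, 1.0
    while not (lo <= blo and blo + w <= hi):  # halve toward the midpoint until
        w /= 2                                # the interval [blo, blo + w) fits
        if (lo + hi) / 2 >= blo + w:
            blo, bits = blo + w, bits + '1'
        else:
            bits += '0'
    return bits

# Toy model (an assumption, not the adaptive model of the next slides):
# fixed next-symbol probabilities, independent of context.
predict = lambda context: [('a', 0.425), ('b', 0.425), ('2', 0.15)]
print(arithmetic_encode('bbba2', predict))    # roughly log2 1/P(bbba2) bits
```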
![Page 38: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/38.jpg)
Example: Bent Coin (1/3)
• Coin sides are a and b, and the ’end of file’ symbol is 2
• Use a Bayesian model with a uniform prior over probabilities of
outcomes
| Context (sequence thus far) | P(a \| context) | P(b \| context) | P(2 \| context) |
|---|---|---|---|
| (empty) | 0.425 | 0.425 | 0.15 |
| b | 0.28 | 0.57 | 0.15 |
| bb | 0.21 | 0.64 | 0.15 |
| bbb | 0.17 | 0.68 | 0.15 |
| bbba | 0.28 | 0.57 | 0.15 |
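The table's entries are consistent with Laplace's rule scaled by a fixed end-of-file probability (an inference from the numbers, not stated on the slide): with $F_a$ and $F_b$ the counts observed so far,

$$P(\texttt{2} \mid s) = 0.15, \qquad P(\texttt{a} \mid s) = 0.85\,\frac{F_a + 1}{F_a + F_b + 2},$$

e.g. $P(\texttt{b} \mid \texttt{bb}) = 0.85 \cdot \tfrac{3}{4} = 0.6375 \approx 0.64$.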
![Page 39: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/39.jpg)
Figure 6.4: illustration of the arithmetic coding process as the sequence bbba2 is transmitted. The interval for bbba2 is located by successive subdivision of [0, 1), and the binary string 100111101, whose interval lies inside it, is sent.
![Page 40: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/40.jpg)
Figure: the static correspondence between symbol strings (a, b, 2, aa, ab, ..., bbbb) and binary strings of up to five bits, each owning an interval of [0, 1).
![Page 41: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/41.jpg)
On Arithmetic Coding
• Computationally efficient
• Length of a string closely matches the Shannon information content
• Overhead required to terminate a message is never more than 2 bits
⇒ Finding a good code is equivalent to finding a good probabilistic model!
• Flexible:
– any source alphabet and any encoded alphabet
– alphabets can change with time
– probabilities are context-dependent
• Can be used to generate random samples from random bits
economically
![Page 42: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/42.jpg)
Lempel-Ziv Coding
• Used in gzip etc.
| source substrings | λ | 1 | 0 | 11 | 01 | 010 | 00 | 10 |
|---|---|---|---|---|---|---|---|---|
| s(n) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| s(n) in binary | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
| (pointer, bit) | | (, 1) | (0, 0) | (01, 1) | (10, 1) | (100, 0) | (010, 0) | (001, 0) |
• Asymptotically compresses down to the entropy of the source (though not in practice)
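A sketch of the parsing rule behind the table (the source string 1011010100010 is recovered by concatenating the substrings in order; pointers are printed in decimal rather than fixed-width binary):

```python
def lz_parse(source: str):
    """Parse the source into substrings, each one a previously seen
    substring plus one extra bit, and emit (pointer, bit) pairs."""
    seen = {'': 0}                 # substring -> index n; '' is lambda (n = 0)
    pairs, current = [], ''
    for bit in source:
        current += bit
        if current not in seen:    # shortest substring not seen before
            pairs.append((seen[current[:-1]], bit))
            seen[current] = len(seen)
            current = ''
    return pairs

print(lz_parse('1011010100010'))
# [(0, '1'), (0, '0'), (1, '1'), (2, '1'), (4, '0'), (2, '0'), (1, '0')]
```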
![Page 43: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/43.jpg)
Summary (1/2)
• Fixed-length block codes (Chapter 4)
– Only a tiny fraction of source strings are given an encoding
– Identify entropy as the measure of compressibility
– No practical use
• Symbol codes (Chapter 5)
– Variable code lengths allow lossless compression
– Expected code length is H + DKL (between the source
distribution and the code’s implicit distribution)
– DKL can be made smaller than 1 bit per symbol
– Huffman code is the optimal symbol code
![Page 44: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/44.jpg)
Summary (2/2)
• Stream codes (Chapter 6)
– Arithmetic coding combines a probabilistic model with an
encoding algorithm
– Lempel-Ziv memorises strings that have already occurred
– If any of the bits is altered by noise, the rest of the encoding fails
![Page 45: T-61.182 Information Theory and Machine Learning Data](https://reader031.vdocuments.us/reader031/viewer/2022020622/61edf9f5a29d7f48a751881a/html5/thumbnails/45.jpg)
Exercises
• 6.19 (entropy and information)
• 4.16 (Shannon source coding theorem)
• 6.16 (Huffman coding)
• 6.7 (Arithmetic coding)