Chapter 5: Data compression · 2018. 8. 21.

University of Illinois at Chicago ECE 534, Natasha Devroye Chapter 5: Data compression


Page 1: Chapter 5: Data compression

Chapter 5: Data compression

Page 2: Chapter 5: Data compression

Chapter 5 outline

• 12 balls weighing problem

• Examples of codes

• Kraft inequality

• Optimal codes + bounds

• Kraft inequality for uniquely decodable codes

• Huffman codes

• Shannon-Fano-Elias coding

Page 3: Chapter 5: Data compression

University of Illinois at Chicago ECE 534, Fall 2009, Natasha Devroye

Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981

You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

About Chapter 4

In this chapter we discuss how to measure the information content of the outcome of a random experiment.

This chapter has some tough bits. If you find the mathematical details hard, skim through them and keep going – you'll be able to enjoy Chapters 5 and 6 without this chapter's tools.

Notation

x ∈ A : x is a member of the set A
S ⊂ A : S is a subset of the set A
S ⊆ A : S is a subset of, or equal to, the set A
V = B ∪ A : V is the union of the sets B and A
V = B ∩ A : V is the intersection of the sets B and A
|A| : number of elements in set A

Before reading Chapter 4, you should have read Chapter 2 and worked on exercises 2.21–2.25 and 2.16 (pp. 36–37), and exercise 4.1 below.

The following exercise is intended to help you think about how to measure information content.

Exercise 4.1.[2, p.69] – Please work on this problem before reading Chapter 4.

You are given 12 balls, all equal in weight except for one that is either heavier or lighter. You are also given a two-pan balance to use. In each use of the balance you may put any number of the 12 balls on the left pan, and the same number on the right pan, and push a button to initiate the weighing; there are three possible outcomes: either the weights are equal, or the balls on the left are heavier, or the balls on the left are lighter. Your task is to design a strategy to determine which is the odd ball and whether it is heavier or lighter than the others in as few uses of the balance as possible.

While thinking about this problem, you may find it helpful to consider the following questions:

(a) How can one measure information?

(b) When you have identified the odd ball and whether it is heavy or light, how much information have you gained?

(c) Once you have designed a strategy, draw a tree showing, for each of the possible outcomes of a weighing, what weighing you perform next. At each node in the tree, how much information have the outcomes so far given you, and how much information remains to be gained?

(d) How much information is gained when you learn (i) the state of a flipped coin; (ii) the states of two flipped coins; (iii) the outcome when a four-sided die is rolled?

(e) How much information is gained on the first step of the weighing problem if 6 balls are weighed against the other 6? How much is gained if 4 are weighed against 4 on the first step, leaving out 4 balls?


Page 4: Chapter 5: Data compression

12 balls weighing: 1 lighter or heavier

• Total information contained?

• Each weighing gives you how much information (ideally)?

• Number of weighings needed?

• Strategy?
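These questions can be answered by counting: there are 24 equally likely hypotheses (12 balls, each possibly heavier or lighter) and each weighing has 3 outcomes. A quick sanity check of the arithmetic:

```python
from math import log2

# 12 balls x {heavier, lighter} = 24 equally likely hypotheses.
total_info = log2(24)         # ~4.58 bits of information to be gained in total
per_weighing = log2(3)        # ~1.58 bits per weighing, if the outcomes are near-equiprobable
weighings = total_info / per_weighing

print(total_info, per_weighing, weighings)  # ~2.89 -> at least 3 weighings are needed
```

This is why the optimal strategy of Figure 4.2 uses exactly three weighings, and why a first weighing of 6 vs 6 is wasteful: it can never produce the "balanced" outcome, so it yields less than log2(3) bits.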

Page 5: Chapter 5: Data compression


4.1: How to measure the information content of a random variable?

Figure 4.2. An optimal solution to the weighing problem. At each step there are two boxes: the left box shows which hypotheses are still possible; the right box shows the balls involved in the next weighing. The 24 hypotheses are written 1+, . . . , 12−, with, e.g., 1+ denoting that 1 is the odd ball and it is heavy. Weighings are written by listing the names of the balls on the two pans, separated by a line; for example, in the first weighing, balls 1, 2, 3, and 4 are put on the left-hand side and 5, 6, 7, and 8 on the right. In each triplet of arrows the upper arrow leads to the situation when the left side is heavier, the middle arrow to the situation when the right side is heavier, and the lower arrow to the situation when the outcome is balanced. The three points labelled ⋆ correspond to impossible outcomes.

[Figure 4.2 tree diagram: the extracted arrow and label fragments are omitted. The recoverable structure: first weighing 1 2 3 4 vs 5 6 7 8; second-round weighings 1 2 6 vs 3 4 5 and 9 10 11 vs 1 2 3; third-round weighings of one or two balls resolve each remaining hypothesis.]

[Mackay textbook pg. 69]

Page 6: Chapter 5: Data compression

Examples of codes

What is X?

What is D*?

What is D?

What is H(X)?

What is L(C)?

Decode 0110111100110

Page 7: Chapter 5: Data compression

Examples of codes

Page 8: Chapter 5: Data compression

Examples of codes

Meaning in lay terms?


Definition A code is called a prefix code or an instantaneous code if no codeword is a prefix of any other codeword.

An instantaneous code can be decoded without reference to future codewords since the end of a codeword is immediately recognizable. Hence, for an instantaneous code, the symbol xi can be decoded as soon as we come to the end of the codeword corresponding to it. We need not wait to see the codewords that come later. An instantaneous code is a self-punctuating code; we can look down the sequence of code symbols and add the commas to separate the codewords without looking at later symbols. For example, the binary string 01011111010 produced by the code of Example 5.1.1 is parsed as 0,10,111,110,10.
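The self-punctuating property is exactly what makes a one-pass decoder trivial. A minimal sketch (the codeword-to-symbol dict mirrors the code of Example 5.1.1):

```python
# Decode a prefix code in a single left-to-right pass: emit a symbol the
# moment the buffered bits match a codeword; no lookahead is ever needed.
CODE = {"0": "1", "10": "2", "110": "3", "111": "4"}

def decode_prefix(bits, code=CODE):
    symbols, buf = [], ""
    for b in bits:
        buf += b
        if buf in code:          # end of codeword is immediately recognizable
            symbols.append(code[buf])
            buf = ""
    if buf:
        raise ValueError("leftover bits are not a codeword: " + buf)
    return symbols

print(decode_prefix("01011111010"))  # ['1', '2', '4', '3', '2'], i.e. 0,10,111,110,10
```

Greedy matching is safe here precisely because the code is prefix-free: a match can never be the prefix of a longer codeword.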

The nesting of these definitions is shown in Figure 5.1. To illustrate the differences between the various kinds of codes, consider the examples of codeword assignments C(x) to x ∈ X in Table 5.1. For the nonsingular code, the code string 010 has three possible source sequences: 2 or 14 or 31, and hence the code is not uniquely decodable. The uniquely decodable code is not prefix-free and hence is not instantaneous. To see that it is uniquely decodable, take any code string and start from the beginning. If the first two bits are 00 or 10, they can be decoded immediately. If

[Figure 5.1: nested classes of codes: all codes ⊃ nonsingular codes ⊃ uniquely decodable codes ⊃ instantaneous codes.]

FIGURE 5.1. Classes of codes.


TABLE 5.1 Classes of Codes

X | Singular | Nonsingular, But Not Uniquely Decodable | Uniquely Decodable, But Not Instantaneous | Instantaneous
1 | 0 | 0   | 10  | 0
2 | 0 | 010 | 00  | 10
3 | 0 | 01  | 11  | 110
4 | 0 | 10  | 110 | 111

the first two bits are 11, we must look at the following bits. If the next bit is a 1, the first source symbol is a 3. If the length of the string of 0's immediately following the 11 is odd, the first codeword must be 110 and the first source symbol must be 4; if the length of the string of 0's is even, the first source symbol is a 3. By repeating this argument, we can see that this code is uniquely decodable. Sardinas and Patterson [455] have devised a finite test for unique decodability, which involves forming sets of possible suffixes to the codewords and eliminating them systematically. The test is described more fully in Problem 5.5.27. The fact that the last code in Table 5.1 is instantaneous is obvious since no codeword is a prefix of any other.
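The Sardinas–Patterson test mentioned above is short enough to sketch. This is an illustrative implementation, not the book's (the book defers the details to the problem): repeatedly form "dangling suffixes", and reject the code if any of them is itself a codeword.

```python
def is_uniquely_decodable(codewords):
    """Sardinas-Patterson test: a code is uniquely decodable iff no
    dangling suffix generated from the codewords is itself a codeword."""
    C = set(codewords)

    def dangling(A, B):
        # suffixes left over when a word of A is a proper prefix of a word
        # of B, or vice versa
        out = set()
        for a in A:
            for b in B:
                if a != b and b.startswith(a):
                    out.add(b[len(a):])
                if a != b and a.startswith(b):
                    out.add(a[len(b):])
        return out

    S, seen = dangling(C, C), set()
    while S:
        if S & C:                      # a dangling suffix equals a codeword
            return False
        seen |= S
        S = dangling(S, C) - seen      # next round of suffixes, minus old ones
    return True

print(is_uniquely_decodable({"0", "010", "01", "10"}))   # nonsingular column: False
print(is_uniquely_decodable({"10", "00", "11", "110"}))  # UD-but-not-instantaneous: True
```

The two calls check the middle columns of Table 5.1: the nonsingular code fails, while the uniquely decodable (but not instantaneous) code passes.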

5.2 KRAFT INEQUALITY

We wish to construct instantaneous codes of minimum expected length to describe a given source. It is clear that we cannot assign short codewords to all source symbols and still be prefix-free. The set of codeword lengths possible for instantaneous codes is limited by the following inequality.

Theorem 5.2.1 (Kraft inequality) For any instantaneous code (prefix code) over an alphabet of size D, the codeword lengths l1, l2, . . . , lm must satisfy the inequality

∑_i D^(−l_i) ≤ 1.   (5.6)

Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.

Proof: Consider a D-ary tree in which each node has D children. Let the branches of the tree represent the symbols of the codeword. For example, the D branches arising from the root node represent the D possible values of the first symbol of the codeword. Then each codeword is represented

Page 9: Chapter 5: Data compression


Code Tree

Prefix code: C(A, B, C, D) = (00, 11, 100, 101)

Form a D-ary Tree

D branches at each node. Each node along the path to a leaf is a prefix of the leaf ⇒ can't be a leaf itself. Some leaves may be unused.

[Code tree: the root splits on 0/1 and the leaves carry A = 00, B = 11, C = 100, D = 101.]

B. Smida (ES250) Data Compression Fall 2008-09 9 / 22

Code trees

Page 10: Chapter 5: Data compression

Kraft inequality

• Want short prefix codes; the Kraft inequality quantifies the tradeoff.
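The tradeoff fits in one line of code: a set of lengths is feasible for a prefix code exactly when its Kraft sum stays within the unit budget. A sketch of the condition of Theorem 5.2.1:

```python
def kraft_sum(lengths, D=2):
    """Kraft sum of a set of codeword lengths over a D-ary code alphabet.
    A prefix code with these lengths exists iff the sum is at most 1."""
    return sum(D ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> lengths of {0, 10, 110, 111}: a complete code
print(kraft_sum([1, 1, 2]))     # 1.25 -> infeasible: no binary prefix code exists
```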

Page 11: Chapter 5: Data compression

Code tree for Kraft inequality

[Figure 5.2: a binary code tree with codewords 0, 10, 110, 111 at its leaves.]

FIGURE 5.2. Code tree for the Kraft inequality.

by a leaf on the tree. The path from the root traces out the symbols of the codeword. A binary example of such a tree is shown in Figure 5.2. The prefix condition on the codewords implies that no codeword is an ancestor of any other codeword on the tree. Hence, each codeword eliminates its descendants as possible codewords.

Let lmax be the length of the longest codeword of the set of codewords. Consider all nodes of the tree at level lmax. Some of them are codewords, some are descendants of codewords, and some are neither. A codeword at level l_i has D^(lmax−l_i) descendants at level lmax. Each of these descendant sets must be disjoint. Also, the total number of nodes in these sets must be less than or equal to D^lmax. Hence, summing over all the codewords, we have

∑_i D^(lmax−l_i) ≤ D^lmax   (5.7)

or

∑_i D^(−l_i) ≤ 1,   (5.8)

which is the Kraft inequality.

Conversely, given any set of codeword lengths l1, l2, . . . , lm that satisfy the Kraft inequality, we can always construct a tree like the one in
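The converse direction is constructive. One standard way to realize it, sketched here for the binary case as an illustration (a canonical assignment, not the book's exact tree procedure), hands each length, shortest first, the next unused node at its depth:

```python
def prefix_code_from_lengths(lengths):
    """Build binary codewords with the given lengths, which are assumed to
    satisfy the Kraft inequality; shorter codewords are assigned first."""
    assert sum(2 ** -l for l in lengths) <= 1, "lengths violate the Kraft inequality"
    codes, val, prev = [], 0, None
    for l in sorted(lengths):
        if prev is not None:
            val = (val + 1) << (l - prev)  # next free node, pushed down to depth l
        codes.append(format(val, "0%db" % l))
        prev = l
    return codes

print(prefix_code_from_lengths([3, 1, 2, 3]))  # ['0', '10', '110', '111'], as in Figure 5.2
```

Because each codeword takes the next free node at its depth, no output word can be a prefix of another, so the result is always an instantaneous code.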

Page 12: Chapter 5: Data compression

Kraft inequality and code budgets


[Figure 5.1: the binary 'codeword supermarket': all codewords of lengths 1–4, each drawn in a box of width 2^(−l).]

Figure 5.1. The symbol coding budget. The 'cost' 2^(−l) of each codeword (with length l) is indicated by the size of the box it is written in. The total budget available when making a uniquely decodeable code is 1. You can think of this diagram as showing a codeword supermarket, with the codewords arranged in aisles by their length, and the cost of each codeword indicated by the size of its box on the shelf. If the cost of the codewords that you take exceeds the budget then your code will not be uniquely decodeable.

[Four copies of the codeword supermarket, shaded with the codewords chosen by C0, C3, C4 and C6.]

Figure 5.2. Selections of codewords made by codes C0, C3, C4 and C6 from section 5.1.

Page 13: Chapter 5: Data compression

Kraft inequality and code budgets


5.1 Symbol codes

A (binary) symbol code C for an ensemble X is a mapping from the range of x, A_X = {a1, . . . , aI}, to {0, 1}+. c(x) will denote the codeword corresponding to x, and l(x) will denote its length, with li = l(ai).

The extended code C+ is a mapping from A_X^+ to {0, 1}+ obtained by concatenation, without punctuation, of the corresponding codewords:

c+(x1x2 . . . xN ) = c(x1)c(x2) . . . c(xN ). (5.1)

[The term ‘mapping’ here is a synonym for ‘function’.]

Example 5.3. A symbol code for the ensemble X defined by

A_X = { a, b, c, d },  P_X = { 1/2, 1/4, 1/8, 1/8 },   (5.2)

is C0, shown in the margin. C0:

ai   c(ai)   li
a    1000    4
b    0100    4
c    0010    4
d    0001    4

Using the extended code, we may encode acdbac as

c+(acdbac) = 100000100001010010000010. (5.3)
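Equation (5.3) is easy to verify mechanically. A quick sketch of the extended code c+ applied to C0:

```python
# C0 from Example 5.3: one 4-bit codeword per symbol.
C0 = {"a": "1000", "b": "0100", "c": "0010", "d": "0001"}

def c_plus(x, c=C0):
    """Extended code: concatenate codewords with no punctuation (eq. 5.1)."""
    return "".join(c[s] for s in x)

print(c_plus("acdbac"))  # 100000100001010010000010, matching eq. (5.3)
```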

There are basic requirements for a useful symbol code. First, any encoded string must have a unique decoding. Second, the symbol code must be easy to decode. And third, the code should achieve as much compression as possible.

Any encoded string must have a unique decoding

A code C(X) is uniquely decodeable if, under the extended code C+, no two distinct strings have the same encoding, i.e.,

∀ x, y ∈ A_X^+,  x ≠ y ⇒ c+(x) ≠ c+(y).   (5.4)

The code C0 defined above is an example of a uniquely decodeable code.

The symbol code must be easy to decode

A symbol code is easiest to decode if it is possible to identify the end of a codeword as soon as it arrives, which means that no codeword can be a prefix of another codeword. [A word c is a prefix of another word d if there exists a tail string t such that the concatenation ct is identical to d. For example, 1 is a prefix of 101, and so is 10.]

We will show later that we don't lose any performance if we constrain our symbol code to be a prefix code.

A symbol code is called a prefix code if no codeword is a prefix of any other codeword.

A prefix code is also known as an instantaneous or self-punctuating code, because an encoded string can be decoded from left to right without looking ahead to subsequent codewords. The end of a codeword is immediately recognizable. A prefix code is uniquely decodeable.

Prefix codes are also known as ‘prefix-free codes’ or ‘prefix condition codes’.

Prefix codes correspond to trees.


[C1 code tree: codewords 0 and 101.]

Example 5.4. The code C1 = {0, 101} is a prefix code because 0 is not a prefix of 101, nor is 101 a prefix of 0.

Example 5.5. Let C2 = {1, 101}. This code is not a prefix code because 1 is a prefix of 101.

Example 5.6. The code C3 = {0, 10, 110, 111} is a prefix code.

[Code trees for C3 = {0, 10, 110, 111} and C4 = {00, 01, 10, 11}.]

Prefix codes can be represented on binary trees. Complete prefix codes correspond to binary trees with no unused branches. C1 is an incomplete code.

Example 5.7. The code C4 = {00, 01, 10, 11} is a prefix code.

Exercise 5.8.[1, p.104] Is C2 uniquely decodeable?

Example 5.9. Consider exercise 4.1 (p.66) and figure 4.2 (p.69). Any weighing strategy that identifies the odd ball and whether it is heavy or light can be viewed as assigning a ternary code to each of the 24 possible states. This code is a prefix code.

The code should achieve as much compression as possible

The expected length L(C,X) of a symbol code C for ensemble X is

L(C, X) = ∑_{x ∈ A_X} P(x) l(x).   (5.5)

We may also write this quantity as

L(C, X) = ∑_{i=1}^{I} p_i l_i,   (5.6)

where I = |A_X|.

C3:

ai   c(ai)   pi    h(pi)   li
a    0       1/2   1.0     1
b    10      1/4   2.0     2
c    110     1/8   3.0     3
d    111     1/8   3.0     3

Example 5.10. Let A_X = { a, b, c, d } and P_X = { 1/2, 1/4, 1/8, 1/8 },   (5.7)

and consider the code C3. The entropy of X is 1.75 bits, and the expected length L(C3, X) of this code is also 1.75 bits. The sequence of symbols x = (acdbac) is encoded as c+(x) = 0110111100110. C3 is a prefix code and is therefore uniquely decodeable. Notice that the codeword lengths satisfy li = log2(1/pi), or equivalently, pi = 2^(−li).
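The claim that L(C3, X) equals H(X) = 1.75 bits can be checked directly:

```python
from math import log2

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # P_X of eq. (5.7)
l = {"a": 1, "b": 2, "c": 3, "d": 3}                # codeword lengths of C3

H = sum(pi * log2(1 / pi) for pi in p.values())     # entropy of X
L = sum(p[s] * l[s] for s in p)                     # expected length of C3

print(H, L)  # 1.75 1.75 -- equal because li = log2(1/pi) for every symbol
```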

Example 5.11. Consider the fixed length code for the same ensemble X, C4. The expected length L(C4, X) is 2 bits.

     C4   C5
a    00   0
b    01   1
c    10   00
d    11   11

Example 5.12. Consider C5. The expected length L(C5, X) is 1.25 bits, which is less than H(X). But the code is not uniquely decodeable. The sequence x = (acdbac) encodes as 000111000, which can also be decoded as (cabdca).

Example 5.13. Consider the code C6. The expected length L(C6, X) of this code is 1.75 bits. The sequence of symbols x = (acdbac) is encoded as c+(x) = 0011111010011.

C6:

ai   c(ai)   pi    h(pi)   li
a    0       1/2   1.0     1
b    01      1/4   2.0     2
c    011     1/8   3.0     3
d    111     1/8   3.0     3

Is C6 a prefix code? It is not, because c(a) = 0 is a prefix of both c(b) and c(c).



[Mackay textbook, Ch.5]

Page 14: Chapter 5: Data compression

Kraft inequality example

What about L = {2, 2, 3, 3, 3, 3}?

What about L = {2, 2, 2, 3, 3, 3}?
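Plugging both length sets into the binary Kraft sum answers the slide's questions; the second set overspends the unit budget:

```python
for L in ([2, 2, 3, 3, 3, 3], [2, 2, 2, 3, 3, 3]):
    s = sum(2 ** -l for l in L)
    print(L, s, "feasible" if s <= 1 else "infeasible")
# [2, 2, 3, 3, 3, 3] 1.0 feasible     -> a prefix code with these lengths exists
# [2, 2, 2, 3, 3, 3] 1.125 infeasible -> no prefix (or uniquely decodable) code
```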

Page 15: Chapter 5: Data compression

Extended Kraft inequality

Page 16: Chapter 5: Data compression

Kraft inequality for uniquely decodable codes

Page 17: Chapter 5: Data compression

Optimal codes

• Optimal code = prefix code that minimizes the expected codeword length; i.e., the solution to: minimize L = ∑_i p_i l_i over integer codeword lengths l_i, subject to the Kraft inequality ∑_i D^(−l_i) ≤ 1.

Page 18: Chapter 5: Data compression

Bounds on optimal code length

Page 19: Chapter 5: Data compression

Block coding

Page 20: Chapter 5: Data compression

Entropy rate and code length

Page 21: Chapter 5: Data compression

The wrong distribution

• Design code for source distribution q(x) but true distribution is p(x).

• Can we quantify the loss in the expected length of the 'wrong' code?
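Yes: the standard answer (Cover & Thomas, Theorem 5.4.3) is that designing for q when the truth is p costs roughly the relative entropy D(p||q) extra bits per symbol: H(p) + D(p||q) ≤ L < H(p) + D(p||q) + 1. A small sketch of the penalty term, on a made-up pair of distributions:

```python
from math import log2

def kl_divergence(p, q):
    """Relative entropy D(p||q) in bits: the per-symbol penalty for
    designing the code for q when the true distribution is p."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # true distribution (illustrative)
q = [0.25, 0.25, 0.5]   # distribution the code was designed for
print(kl_divergence(p, q))  # 0.25 extra bits per symbol
print(kl_divergence(p, p))  # 0.0 -- no penalty when the design matches the source
```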

Page 22: Chapter 5: Data compression

Shannon code

Page 23: Chapter 5: Data compression

Shannon code example

Page 24: Chapter 5: Data compression

Shannon code competitive optimality

Page 25: Chapter 5: Data compression

Huffman codes

• Huffman discovered a simple algorithm for constructing optimal (shortest expected length) codes for any given distribution.


Codeword Length   Codeword   X   Probability (merge stages)
2                 01         1   0.25  0.3   0.45  0.55  1
2                 10         2   0.25  0.25  0.3   0.45
2                 11         3   0.2   0.25  0.25
3                 000        4   0.15  0.2
3                 001        5   0.15

This code has average length 2.3 bits.
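The merging procedure of Example 5.6.1 is a few lines with a heap. A sketch that tracks only codeword lengths, which is all the average needs; tie-breaking may differ from the book's table, but the optimal average length is the same:

```python
import heapq
import itertools

def huffman_lengths(probs):
    """Binary Huffman codeword lengths: repeatedly merge the two least
    likely nodes; every merge adds one bit to each symbol underneath."""
    tie = itertools.count()   # tie-breaker so equal probabilities never compare lists
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        for i in a + b:
            lengths[i] += 1   # everything under the merged pair gets one more bit
        heapq.heappush(heap, (p1 + p2, next(tie), a + b))
    return lengths

probs = [0.25, 0.25, 0.2, 0.15, 0.15]
lengths = huffman_lengths(probs)
print(round(sum(p * l for p, l in zip(probs, lengths)), 3))  # 2.3, as in the table
```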

Example 5.6.2 Consider a ternary code for the same random variable. Now we combine the three least likely symbols into one supersymbol and obtain the following table:

Codeword   X   Probability (merge stages)
1          1   0.25  0.5   1
2          2   0.25  0.25
00         3   0.2   0.25
01         4   0.15
02         5   0.15

This code has an average length of 1.5 ternary digits.

Example 5.6.3 If D ≥ 3, we may not have a sufficient number of symbols so that we can combine them D at a time. In such a case, we add dummy symbols to the end of the set of symbols. The dummy symbols have probability 0 and are inserted to fill the tree. Since at each stage of the reduction, the number of symbols is reduced by D − 1, we want the total number of symbols to be 1 + k(D − 1), where k is the number of merges. Hence, we add enough dummy symbols so that the total number of symbols is of this form. For example:

This code has an average length of 1.7 ternary digits.


simple fix to the proof. Any subset of a uniquely decodable code is also uniquely decodable; thus, any finite subset of the infinite set of codewords satisfies the Kraft inequality. Hence,

∞!

i=1

D−li = limN→∞

N!

i=1

D−li ≤ 1. (5.60)

Given a set of word lengths l1, l2, . . . that satisfy the Kraft inequality, wecan construct an instantaneous code as in Section 5.4. Since instantaneouscodes are uniquely decodable, we have constructed a uniquely decodablecode with an infinite number of codewords. So the McMillan theoremalso applies to infinite alphabets. !

The theorem implies a rather surprising result: the class of uniquely decodable codes does not offer any further choices for the set of codeword lengths than the class of prefix codes. The set of achievable codeword lengths is the same for uniquely decodable and instantaneous codes. Hence, the bounds derived on the optimal codeword lengths continue to hold even when we expand the class of allowed codes to the class of all uniquely decodable codes.
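A quick numerical check of the Kraft inequality (a sketch, not from the slides; the function name is my own):

```python
def kraft_sum(lengths, D=2):
    """Sum of D^(-l_i); it is at most 1 iff a D-ary prefix code with
    these codeword lengths exists (Kraft inequality)."""
    return sum(D ** -l for l in lengths)

print(kraft_sum([2, 2, 2, 3, 3]))        # 1.0: the binary Huffman lengths of Example 5.6.1
print(kraft_sum([1, 1, 2, 2, 2], D=3))   # ≈ 1.0: the ternary lengths of Example 5.6.2
```

Both Huffman codes fill their trees completely, so their Kraft sums equal 1; a length set such as (1, 1, 1) gives a sum greater than 1 and hence admits no binary prefix code.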

5.6 HUFFMAN CODES

An optimal (shortest expected length) prefix code for a given distribution can be constructed by a simple algorithm discovered by Huffman [283]. We will prove that any other code for the same alphabet cannot have a lower expected length than the code constructed by the algorithm. Before we give any formal proofs, let us introduce Huffman codes with some examples.

Example 5.6.1 Consider a random variable X taking values in the set X = {1, 2, 3, 4, 5} with probabilities 0.25, 0.25, 0.2, 0.15, 0.15, respectively. We expect the optimal binary code for X to have the longest codewords assigned to the symbols 4 and 5. These two lengths must be equal, since otherwise we can delete a bit from the longer codeword and still have a prefix code, but with a shorter expected length. In general, we can construct a code in which the two longest codewords differ only in the last bit. For this code, we can combine the symbols 4 and 5 into a single source symbol, with a probability assignment 0.30. Proceeding this way, combining the two least likely symbols into one symbol until we are finally left with only one symbol, and then assigning codewords to the symbols, we obtain the following table:


H(X), L(C)?
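A numerical answer to this question (a sketch using the probabilities of Example 5.6.1 and the Huffman lengths 2, 2, 2, 3, 3 from its table):

```python
from math import log2

probs = [0.25, 0.25, 0.2, 0.15, 0.15]
lengths = [2, 2, 2, 3, 3]                 # binary Huffman lengths from the table
H = -sum(p * log2(p) for p in probs)      # entropy of X
L = sum(p * l for p, l in zip(probs, lengths))
print(round(H, 3), round(L, 3))           # 2.285 2.3
assert H <= L < H + 1                     # optimal codes satisfy H <= L < H + 1
```

So H(X) ≈ 2.285 bits and L(C) = 2.3 bits, consistent with the bound H(X) ≤ L < H(X) + 1.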

Page 26

Huffman codes


Purpose of dummy symbols?

Number of dummy symbols?

Page 27

Huffman code of the English language

100 5 — Symbol Codes

ai   pi       log2(1/pi)   li   c(ai)
a    0.0575    4.1          4   0000
b    0.0128    6.3          6   001000
c    0.0263    5.2          5   00101
d    0.0285    5.1          5   10000
e    0.0913    3.5          4   1100
f    0.0173    5.9          6   111000
g    0.0133    6.2          6   001001
h    0.0313    5.0          5   10001
i    0.0599    4.1          4   1001
j    0.0006   10.7         10   1101000000
k    0.0084    6.9          7   1010000
l    0.0335    4.9          5   11101
m    0.0235    5.4          6   110101
n    0.0596    4.1          4   0001
o    0.0689    3.9          4   1011
p    0.0192    5.7          6   111001
q    0.0008   10.3          9   110100001
r    0.0508    4.3          5   11011
s    0.0567    4.1          4   0011
t    0.0706    3.8          4   1111
u    0.0334    4.9          5   10101
v    0.0069    7.2          8   11010001
w    0.0119    6.4          7   1101001
x    0.0073    7.1          7   1010001
y    0.0164    5.9          6   101001
z    0.0007   10.4         10   1101000001
–    0.1928    2.4          2   01

[Code tree figure omitted.]

Figure 5.6. Huffman code for the English language ensemble (monogram statistics).

It is not the case, however, that optimal codes can always be constructed by a greedy top-down method in which the alphabet is successively divided into subsets that are as near as possible to equiprobable.

Example 5.18. Find the optimal binary symbol code for the ensemble:

AX = { a, b, c, d, e, f, g }
PX = { 0.01, 0.24, 0.05, 0.20, 0.47, 0.01, 0.02 }        (5.24)

Notice that a greedy top-down method can split this set into two subsets {a, b, c, d} and {e, f, g} which both have probability 1/2, and that {a, b, c, d} can be divided into subsets {a, b} and {c, d}, which have probability 1/4; so a greedy top-down method gives the code shown in the third column of table 5.7, which has expected length 2.53. The Huffman

ai   pi    Greedy   Huffman
a    .01   000      000000
b    .24   001      01
c    .05   010      0001
d    .20   011      001
e    .47   10       1
f    .01   110      000001
g    .02   111      00001

Table 5.7. A greedily-constructed code compared with the Huffman code.

coding algorithm yields the code shown in the fourth column, which has expected length 1.97. ✷
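The two expected lengths in Table 5.7 can be checked directly (a sketch; the codes are copied from the table):

```python
probs = {'a': 0.01, 'b': 0.24, 'c': 0.05, 'd': 0.20,
         'e': 0.47, 'f': 0.01, 'g': 0.02}
greedy  = {'a': '000', 'b': '001', 'c': '010', 'd': '011',
           'e': '10',  'f': '110', 'g': '111'}
huffman = {'a': '000000', 'b': '01', 'c': '0001', 'd': '001',
           'e': '1', 'f': '000001', 'g': '00001'}

# Expected length = sum over symbols of p(s) * |codeword(s)|
for name, code in (('greedy', greedy), ('huffman', huffman)):
    avg = sum(probs[s] * len(code[s]) for s in probs)
    print(name, round(avg, 2))   # greedy 2.53, huffman 1.97
```

The greedy split looks balanced at every level, yet its expected length exceeds the Huffman code's by more than half a bit.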

5.6 Disadvantages of the Huffman code

The Huffman algorithm produces an optimal symbol code for an ensemble, but this is not the end of the story. Both the word ‘ensemble’ and the phrase ‘symbol code’ need careful attention.

Changing ensemble

If we wish to communicate a sequence of outcomes from one unchanging ensemble, then a Huffman code may be convenient. But often the appropriate

[Mackay textbook, Ch.5]

Page 28

Huffman codes

C is a “canonical” code!
Is C a Huffman code?

Page 29

Huffman codes

124 DATA COMPRESSION

[Tree diagrams (a)–(d) omitted; leaves labeled p1, . . . , p5 with branch bits 0/1.]

FIGURE 5.3. Properties of optimal codes. We assume that p1 ≥ p2 ≥ · · · ≥ pm. A possible instantaneous code is given in (a). By trimming branches without siblings, we improve the code to (b). We now rearrange the tree as shown in (c), so that the word lengths are ordered by increasing length from top to bottom. Finally, we swap probability assignments to improve the expected depth of the tree, as shown in (d). Every optimal code can be rearranged and swapped into canonical form as in (d), where l1 ≤ l2 ≤ · · · ≤ lm and lm−1 = lm, and the last two codewords differ only in the last bit.

But pj − pk > 0, and since Cm is optimal, L(C′m) − L(Cm) ≥ 0. Hence, we must have lk ≥ lj. Thus, Cm itself satisfies property 1.

• The two longest codewords are of the same length. Here we trim the codewords. If the two longest codewords are not of the same length, one can delete the last bit of the longer one, preserving the prefix property and achieving lower expected codeword length. Hence, the two longest codewords must have the same length. By property 1, the longest codewords must belong to the least probable source symbols.

• The two longest codewords differ only in the last bit and correspond to the two least likely symbols. Not all optimal codes satisfy this property, but by rearranging, we can find an optimal code that does. If there is a maximal-length codeword without a sibling, we can delete the last bit of the codeword and still satisfy the prefix property. This reduces the average codeword length and contradicts the optimality

[Cover+Thomas, pg.124]

Page 30

Constructing Huffman codes

• A Huffman code is obtained by repeatedly “merging” the last two symbols, assigning to them the “last codeword minus the last bit,” and reordering the symbols so that the probabilities or weights remain non-increasing.

5.7 SOME COMMENTS ON HUFFMAN CODES 121

2. Huffman coding for weighted codewords. Huffman's algorithm for minimizing ∑ pi li can be applied to any set of numbers pi ≥ 0, regardless of ∑ pi. In this case, the Huffman code minimizes the sum of weighted code lengths ∑ wi li rather than the average code length.

Example 5.7.1 We perform the weighted minimization using the same algorithm.

In this case the code minimizes the weighted sum of the codeword lengths, and the minimum weighted sum is 36.
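The weight table of Example 5.7.1 did not survive extraction, so here is a sketch of the same idea with hypothetical weights (the weights 8, 6, 4, 3, 2 are my own, not the book's): the ordinary Huffman merging step is applied unchanged to weights that need not sum to 1.

```python
import heapq
from itertools import count

weights = [8, 6, 4, 3, 2]        # hypothetical weights; need not sum to 1
uid = count()                    # tie-breaker so the heap never compares lists
lengths = [0] * len(weights)
heap = [(w, next(uid), [i]) for i, w in enumerate(weights)]
heapq.heapify(heap)
while len(heap) > 1:             # identical merging step as ordinary Huffman
    w1, _, s1 = heapq.heappop(heap)
    w2, _, s2 = heapq.heappop(heap)
    for i in s1 + s2:
        lengths[i] += 1
    heapq.heappush(heap, (w1 + w2, next(uid), s1 + s2))

total = sum(w * l for w, l in zip(weights, lengths))
print(lengths, total)            # [2, 2, 2, 3, 3] 51
```

For these weights the algorithm minimizes ∑ wi li, giving a weighted sum of 51.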

3. Huffman coding and “slice” questions (Alphabetic codes). We have described the equivalence of source coding with the game of 20 questions. The optimal sequence of questions corresponds to an optimal source code for the random variable. However, Huffman codes ask arbitrary questions of the form “Is X ∈ A?” for any set A ⊆ {1, 2, . . . , m}.

Now we consider the game “20 questions” with a restricted set of questions. Specifically, we assume that the elements of X = {1, 2, . . . , m} are ordered so that p1 ≥ p2 ≥ · · · ≥ pm and that the only questions allowed are of the form “Is X > a?” for some a. The Huffman code constructed by the Huffman algorithm may not correspond to slices (sets of the form {x : x < a}). If we take the codeword lengths (l1 ≤ l2 ≤ · · · ≤ lm, by Lemma 5.8.1) derived from the Huffman code and use them to assign the symbols to the code tree by taking the first available node at the corresponding level, we will construct another optimal code. However, unlike the Huffman code itself, this code is a slice code, since each question (each bit of the code) splits the tree into sets of the form {x : x > a} and {x : x < a}.

We illustrate this with an example.

Example 5.7.2 Consider the first example of Section 5.6. The code that was constructed by the Huffman coding procedure is not a
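Although the example above is cut off, the “first available node” assignment it refers to can be sketched (the code below is my own reconstruction, applied to the Huffman codeword lengths 2, 2, 2, 3, 3 of Example 5.6.1):

```python
def slice_code(lengths):
    """Assign each codeword the first available node at its level,
    scanning the lengths in non-decreasing order (canonical assignment)."""
    code, prev = 0, 0
    words = []
    for l in sorted(lengths):
        code <<= (l - prev)            # descend to the new depth
        words.append(format(code, '0%db' % l))
        code += 1
        prev = l
    return words

print(slice_code([2, 2, 2, 3, 3]))     # ['00', '01', '10', '110', '111']
```

The resulting codewords are in increasing binary order, so every bit answers a question of the form “Is X > a?”, i.e. the code is a slice code with the same optimal lengths.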

Page 31

Comments on Huffman codes

• Equivalence of source coding and 20 questions?

• Huffman coding versus Shannon coding?

• Strengths?

• Weaknesses?

Page 32

Rigorous proof of Huffman optimality