Chapter 3
HUFFMAN CODING
Yeuan-Kuen Lee [ MCU, CSIE ]
Ch 3 Huffman Coding 2
Outline
3.1 Overview
3.2 The Huffman Coding Algorithm
3.2.1 Minimum Variance Huffman Codes
3.2.2 Optimality of Huffman Codes (*)
3.2.3 Length of Huffman Codes (*)
3.2.4 Extended Huffman Codes (*)
3.3 Nonbinary Huffman Codes (*)
3.4 Adaptive Huffman Coding
3.4.1 Update Procedure
3.4.2 Encoding Procedure
3.4.3 Decoding Procedure
Ch 3 Huffman Coding 3
Outline
3.5 Golomb Codes
3.6 Rice Codes
3.6.1 CCSDS Recommendation for Lossless Compression
3.7 Tunstall Codes
3.8 Applications of Huffman Coding
3.8.1 Lossless Image Compression
3.8.2 Text Compression
3.8.3 Audio Compression
3.9 Summary
3.10 Projects and Problems
Ch 3 Huffman Coding 4
3.1 Overview
In this chapter, we describe a very popular coding algorithm
called the Huffman coding algorithm:
• Present a procedure for building Huffman codes when the probability model for the source is known.
• Present a procedure for building codes when the source statistics are unknown.
• Describe a new technique for code design that is in some sense similar to the Huffman coding approach.
• Some applications.
Ch 3 Huffman Coding 5
3.2 The Huffman Coding Algorithm
• This technique was developed by David Huffman as part of a class assignment;
the class was the first ever in the area of information theory and was taught by Robert Fano at MIT.
• The codes generated using this technique are called Huffman codes.
• These codes are
  • prefix codes
  • optimum for a given model ( set of probabilities )
• Based on two observations regarding optimum prefix codes:
1. In an optimum code, symbols that occur more frequently ( have a higher
probability of occurrence ) will have shorter codewords than symbols that
occur less frequently.
2. In an optimum code, the two symbols that occur least frequently will have
codewords of the same length.
Ch 3 Huffman Coding 6
In an optimum code,
the two symbols that occur least frequently will have codewords of the same length.
• Suppose an optimum code C exists in which the two codewords corresponding
to the two least probable symbols do not have the same length.
• Suppose the longer codeword is k bits longer than the shorter codeword.
• As these codewords correspond to the least probable symbols in the alphabet,
no other codeword can be longer than these codewords;
therefore there is no danger that the shortened codeword would become the
prefix of some other codeword.
3.2 The Huffman Coding Algorithm
Ch 3 Huffman Coding 7
3.2 The Huffman Coding Algorithm
• Furthermore, by dropping these k bits we obtain a new code that has a
shorter average length than C.
• But this violates our initial contention that C is an optimal code.
• Therefore, for an optimal code the second observation also holds true.
A simple requirement
The codewords corresponding to the two lowest-probability symbols
differ only in the last bit.
That is, if the codeword for one of the two least probable symbols in an alphabet
is m ∗ 0, then the codeword for the other will be m ∗ 1.
Here, m is a string of 1s and 0s, and ∗ denotes concatenation.
Ch 3 Huffman Coding 8
3.2 The Huffman Coding Algorithm
Example 3.2.1 Design of a Huffman Code
An alphabet A = { a1 , a2 , a3 , a4 , a5 } with
P( a1 ) = P( a3 ) = 0.2, P( a2 ) = 0.4, P( a4 ) = P( a5 ) = 0.1
The entropy = -2 * 0.2 log2 (0.2) - 0.4 log2 (0.4) - 2 * 0.1 log2 (0.1)
= 2.122 bits/symbol
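This value is easy to verify numerically; a minimal Python sketch (mine, not from the text):

```python
import math

# Probabilities of a1..a5 from Example 3.2.1
p = [0.2, 0.4, 0.2, 0.1, 0.1]

# H = -sum p_i * log2(p_i)
H = -sum(pi * math.log2(pi) for pi in p)
print(round(H, 3))  # 2.122 bits/symbol
```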
Table 3.1 The initial five-letter alphabet

Letter   Probability   Codeword
a2       0.4           c(a2)
a1       0.2           c(a1)
a3       0.2           c(a3)
a4       0.1           c(a4)
a5       0.1           c(a5)

The two symbols with the lowest probability are a4 and a5, so we assign
c(a4) = α1 ∗ 0
c(a5) = α1 ∗ 1
where α1 is a binary string to be determined ( ∗ denotes concatenation ).
Ch 3 Huffman Coding 9
3.2 The Huffman Coding Algorithm
Table 3.2 The reduced four-letter alphabet

Letter   Probability   Codeword
a2       0.4           c(a2)
a1       0.2           c(a1)
a3       0.2           c(a3)
a4'      0.2           α1

Define a new alphabet A' = { a1 , a2 , a3 , a4' } where a4' is composed of a4 and a5, and
P( a4' ) = P( a4 ) + P( a5 ) = 0.2
In this alphabet A', a3 and a4' are the two letters
at the bottom of the sorted list.
We assign their codewords as
c(a3) = α2 ∗ 0
c(a4') = α2 ∗ 1
but c(a4') = α1. Therefore, α1 = α2 ∗ 1,
which means that
c(a4) = α1 ∗ 0 = α2 ∗ 10
c(a5) = α1 ∗ 1 = α2 ∗ 11
Ch 3 Huffman Coding 10
3.2 The Huffman Coding Algorithm
We again define a new alphabet A'' = { a1 , a2 , a3' } where a3' is composed of a3 and a4', and
P( a3' ) = P( a3 ) + P( a4' ) = 0.4

Table 3.3 The reduced three-letter alphabet

Letter   Probability   Codeword
a2       0.4           c(a2)
a3'      0.4           α2
a1       0.2           c(a1)

In this case, the least probable symbols are a3' and a1. Therefore,
c(a3') = α3 ∗ 0
c(a1) = α3 ∗ 1
but c(a3') = α2. Therefore, α2 = α3 ∗ 0,
which means that
c(a3) = α2 ∗ 0 = α3 ∗ 00
c(a4) = α2 ∗ 10 = α3 ∗ 010
c(a5) = α2 ∗ 11 = α3 ∗ 011
Ch 3 Huffman Coding 11
3.2 The Huffman Coding Algorithm
We again define a new alphabet A''' = { a3'' , a2 } where a3'' is composed of a3' and a1, and
P( a3'' ) = P( a3' ) + P( a1 ) = 0.6

Table 3.4 The reduced two-letter alphabet

Letter   Probability   Codeword
a3''     0.6           α3
a2       0.4           c(a2)

We have only two letters, so the codeword assignment is straightforward:
c(a3'') = 0
c(a2) = 1
but c(a3'') = α3. Therefore, α3 = 0,
which means that
c(a1) = α3 ∗ 1 = 01
c(a3) = α3 ∗ 00 = 000
c(a4) = α3 ∗ 010 = 0010
c(a5) = α3 ∗ 011 = 0011
Ch 3 Huffman Coding 12
3.2 The Huffman Coding Algorithm
Table 3.5 Huffman code for the original five-letter alphabet

Letter   Probability   Codeword
a2       0.4           1
a1       0.2           01
a3       0.2           000
a4       0.1           0010
a5       0.1           0011

The average length for this code is
l = 0.4*1 + 0.2*2 + 0.2*3 + 0.1*4 + 0.1*4 = 2.2 bits/symbol.
A measure of the efficiency of this code is its redundancy:
the difference between the average length and the entropy.
In this case, the redundancy = 2.2 - 2.122 = 0.078 bits/symbol.
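The whole construction can be carried out mechanically with a priority queue. The following Python sketch is mine, not from the text: with prefer_merged=True the tie-breaking reproduces the Table 3.5 codewords exactly, while prefer_merged=False keeps merged letters high among equal probabilities and yields the minimum variance codeword lengths of Section 3.2.1 (the 0/1 labels inside a merged pair are an arbitrary choice).

```python
import heapq

def huffman(probs, prefer_merged=True):
    # probs: list of (symbol, probability) pairs.
    # Ties on probability are broken by the second tuple element:
    # sign = -1 prefers recently merged letters, +1 prefers original ones.
    sign = -1 if prefer_merged else 1
    heap = [(p, sign * i, s) for i, (s, p) in enumerate(probs)]
    heapq.heapify(heap)
    nxt = len(probs)                      # id for the next merged letter
    parent = {}                           # child -> (parent id, bit)
    while len(heap) > 1:
        p0, _, n0 = heapq.heappop(heap)   # the two least probable letters
        p1, _, n1 = heapq.heappop(heap)
        parent[n0], parent[n1] = (nxt, '1'), (nxt, '0')
        heapq.heappush(heap, (p0 + p1, sign * nxt, nxt))
        nxt += 1
    codes = {}
    for s, _ in probs:                    # walk each leaf up to the root
        node, bits = s, ''
        while node in parent:
            node, b = parent[node]
            bits = b + bits
        codes[s] = bits
    return codes

probs = [('a1', 0.2), ('a2', 0.4), ('a3', 0.2), ('a4', 0.1), ('a5', 0.1)]
codes = huffman(probs)
print(codes)  # {'a1': '01', 'a2': '1', 'a3': '000', 'a4': '0010', 'a5': '0011'}
print(round(sum(p * len(codes[s]) for s, p in probs), 2))  # 2.2 bits/symbol
```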
Ch 3 Huffman Coding 13
3.2 The Huffman Coding Algorithm
[Figure 3.1 The Huffman encoding procedure. The symbol probabilities are listed in parentheses. Sorted by probability, the successive reductions are ( a2 0.4, a1 0.2, a3 0.2, a4 0.1, a5 0.1 ) → ( a2 0.4, a1 0.2, a3 0.2, a4' 0.2 ) → ( a2 0.4, a3' 0.4, a1 0.2 ) → ( a3'' 0.6, a2 0.4 ), with 0 and 1 assigned to each combined pair.]
Ch 3 Huffman Coding 14
3.2 The Huffman Coding Algorithm
[Figure 3.2 Building the binary Huffman tree. We build the binary tree starting at the leaf nodes: a4 (0.1) and a5 (0.1) combine into a node of weight 0.2; this combines with a3 (0.2) into 0.4, which combines with a1 (0.2) into 0.6, which finally combines with a2 (0.4) at the root (1.0). Each combined pair is labeled 0 and 1.]
Notice the similarity between Figures 3.1 and 3.2. This is not surprising, as they are the result of viewing the same procedure in two different ways.
Ch 3 Huffman Coding 15
3.2.1 Minimum Variance Huffman Codes
Table 3.2 Reduced four-letter alphabet

Letter   Probability   Codeword
a2       0.4           c(a2)
a1       0.2           c(a1)
a3       0.2           c(a3)
a4'      0.2           α1

Table 3.6 Reduced four-letter alphabet

Letter   Probability   Codeword
a2       0.4           c(a2)
a4'      0.2           α1
a1       0.2           c(a1)
a3       0.2           c(a3)

In Table 3.6, the combined letter a4' is placed as high in the sorted list as its probability allows, instead of at the bottom as in Table 3.2.
Ch 3 Huffman Coding 16
3.2.1 Minimum Variance Huffman Codes
Table 3.7 Reduced three-letter alphabet

Letter   Probability   Codeword
a1'      0.4           α2
a2       0.4           c(a2)
a4'      0.2           α1

Table 3.8 Reduced two-letter alphabet

Letter   Probability   Codeword
a2'      0.6           α3
a1'      0.4           α2
Ch 3 Huffman Coding 17
3.2.1 Minimum Variance Huffman Codes
Table 3.9 Minimum variance Huffman code

Letter   Probability   Codeword
a1       0.2           10
a2       0.4           00
a3       0.2           11
a4       0.1           010
a5       0.1           011
The average length for this code is
l = 0.4*2 + 0.2*2 + 0.2*2 + 0.1*3 + 0.1*3 = 2.2 bits/symbol.
These two codes are identical in terms of their redundancy.
However, the variance of the codeword lengths is significantly different.
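The difference is easy to quantify with the huffman() sketch from the end of Section 3.2 (a hypothetical helper, not from the text):

```python
probs = [('a1', 0.2), ('a2', 0.4), ('a3', 0.2), ('a4', 0.1), ('a5', 0.1)]

for variant in (True, False):
    codes = huffman(probs, prefer_merged=variant)
    avg = sum(p * len(codes[s]) for s, p in probs)
    var = sum(p * (len(codes[s]) - avg) ** 2 for s, p in probs)
    print(round(avg, 2), round(var, 2))
# 2.2 1.36   <- lengths 1, 2, 3, 4, 4 (Table 3.5)
# 2.2 0.16   <- lengths 2, 2, 2, 3, 3 (Table 3.9)
```

A smaller variance means a steadier output bit rate, which matters when the coder feeds a fixed-rate channel through a finite buffer.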
Ch 3 Huffman Coding 18
3.2.1 Minimum Variance Huffman Codes
[Figure 3.3 The minimum variance Huffman encoding procedure. Sorted by probability, the reductions are ( a2 0.4, a1 0.2, a3 0.2, a4 0.1, a5 0.1 ) → ( a2 0.4, a4' 0.2, a1 0.2, a3 0.2 ) → ( a1' 0.4, a2 0.4, a4' 0.2 ) → ( a2' 0.6, a1' 0.4 ), with 0 and 1 assigned to each combined pair.]
Ch 3 Huffman Coding 19
3.2.1 Minimum Variance Huffman Codes
[Figure 3.4 Two Huffman trees corresponding to the same probabilities. In the left tree (Figure 3.2), the codeword lengths run from 1 to 4. In the right, minimum variance, tree, a4 and a5 combine into a node of weight 0.2 that pairs with a2 under a node of weight 0.6, while a1 and a3 pair under a node of weight 0.4; no leaf is deeper than three levels.]
Ch 3 Huffman Coding 20
3.4 Adaptive Huffman Coding
[Figure: an adaptive Huffman tree fragment distinguishing an external node (leaf) from an internal node.]
Two parameters are added to the binary tree:
1. Weight: for an external node, the number of times the symbol has been encountered; for an internal node, the sum of the weights of its offspring.
2. Node number: a unique identifier.
For an alphabet of size n, the tree has 2n-1 nodes (internal + external), with
node numbers y1, y2, y3, ..., y(2n-1) and weights x1 ≤ x2 ≤ x3 ≤ ... ≤ x(2n-1).
Sibling property:
• nodes y(2j-1) and y(2j) are siblings for 1 ≤ j < n
• the node number of their parent is greater than y(2j-1) and y(2j)
Ch 3 Huffman Coding 21
3.4 Adaptive Huffman Coding
[Figure: transmitter and receiver each start with the same initial tree, a single NYT node of weight 0, together with the agreed table of symbols and fixed codes.]
As transmission progresses, nodes corresponding to the symbols transmitted are added to the tree, and the tree is reconfigured using an update procedure.
Ch 3 Huffman Coding 22
3.4 Adaptive Huffman Coding
Before the beginning of transmission,
a fixed code for each symbol is agreed upon between transmitter and receiver.
If the source has an alphabet ( a1, a2, ..., am ) of size m,
then pick e and r such that
m = 2^e + r and 0 ≤ r < 2^e.
ex: m = 26, 26 = 2^4 + 10, e = 4, r = 10
The letter ak is encoded as
• the (e+1)-bit binary representation of k-1, if 1 ≤ k ≤ 2r
• the e-bit binary representation of k-r-1, otherwise
ex: a1 [ 1 ≤ 2*10 ] → 1-1 = 0 → 00000 (5 bits)
a2 [ 2 ≤ 2*10 ] → 2-1 = 1 → 00001 (5 bits)
a22 [ 22 > 2*10 ] → 22-10-1 = 11 → 1011 (4 bits)
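This fixed code is a one-liner to implement; a minimal sketch (the function name is mine):

```python
def fixed_code(k, m):
    # Fixed code for symbol a_k in an alphabet of size m = 2**e + r,
    # with 0 <= r < 2**e.
    e = m.bit_length() - 1
    r = m - (1 << e)
    if 1 <= k <= 2 * r:
        return format(k - 1, f'0{e + 1}b')   # (e+1)-bit code for k-1
    return format(k - r - 1, f'0{e}b')       # e-bit code for k-r-1

# m = 26 lowercase letters: e = 4, r = 10
print(fixed_code(1, 26))   # a1 ('a')  -> 00000
print(fixed_code(18, 26))  # a18 ('r') -> 10001
print(fixed_code(22, 26))  # a22 ('v') -> 1011
```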
Ch 3 Huffman Coding 23
3.4 Adaptive Huffman Coding
When a symbol is encountered for the first time:
1. The code for the NYT node is transmitted,
2. followed by the fixed code for the symbol.
3. A node for the symbol is created, and
4. the symbol is taken out of the NYT list.
Both transmitter and receiver
• start with the same tree structure
• use an identical update procedure
Therefore, the encoding and decoding processes remain synchronized.
Ch 3 Huffman Coding 24
3.4.1 Update Procedure
• The update procedure requires that the nodes be in a fixed order.
• This ordering is preserved by numbering the nodes.
• The largest node number is given to the root of the tree, and
the smallest number is assigned to the NYT node.
• The numbers from the NYT node to the root are assigned
in increasing order from left to right, and from lower to upper levels.
• The set of nodes with the same weight makes up a block.
• The function of the update procedure is to preserve the sibling property; a code sketch follows.
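The flowcharts of Figure 3.6 (next two slides) translate fairly directly into code. The Python sketch below is mine, a close paraphrase of the flowchart rather than a verified implementation; the Node class, swap(), and the nodes list are assumptions.

```python
class Node:
    # A node of the adaptive Huffman tree (minimal sketch).
    def __init__(self, weight=0, number=0, symbol=None, parent=None):
        self.weight, self.number, self.symbol = weight, number, symbol
        self.parent, self.left, self.right = parent, None, None

def swap(a, b):
    # Exchange tree positions and node numbers (weights travel with the nodes).
    a.number, b.number = b.number, a.number
    pa, pb = a.parent, b.parent
    if pa.left is a: pa.left = b
    else:            pa.right = b
    if pb.left is b: pb.left = a
    else:            pb.right = a
    a.parent, b.parent = pb, pa

def give_birth(nyt, symbol):
    # First appearance: NYT spawns a new NYT node and an external node.
    nyt.left = Node(number=nyt.number - 2, parent=nyt)                 # new NYT
    nyt.right = Node(number=nyt.number - 1, symbol=symbol, parent=nyt)
    return nyt.right                                  # start the update here

def update(nodes, node):
    # Walk from 'node' to the root, restoring the sibling property.
    # 'nodes' is the list of all nodes currently in the tree.
    while node is not None:
        # Highest numbered node in this node's block (same weight).
        top = max((n for n in nodes if n.weight == node.weight),
                  key=lambda n: n.number)
        # Swap unless the candidate is the node itself or its own parent.
        if top is not node and top is not node.parent:
            swap(node, top)
        node.weight += 1
        node = node.parent
```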
Ch 3 Huffman Coding 25
3.4.1 Update Procedure
[Figure 3.6 (a) Update procedure (flowchart). START: if this is the symbol's first appearance, the NYT node gives birth to a new NYT node and an external node, the weights of the external node and the old NYT node are incremented, and we go to the old NYT node. Otherwise we go to the symbol's external node. If the node's number is not the maximum in its block, the node is switched with the highest numbered node in the block.]
Ch 3 Huffman Coding 26
3.4.1 Update Procedure
[Figure 3.6 (b) Update procedure (flowchart, continued). The node weight is incremented; if the node is the root, STOP, otherwise go to the parent node and repeat the block check.]
Ch 3 Huffman Coding 27
3.4.1 Update Procedure
Example 3.4.1 Update Procedure
Message [ a a r d v a r k ],
where the alphabet consists of the 26 lowercase letters of the English alphabet.
Total number of nodes = 2 * 26 - 1 = 51.
[Initial tree: a single NYT node, node number 51, weight 0.]
( a ): Send the binary code 00000 for a, since the index of a is 1. The NYT node gives birth: the root (node 51, weight 1) gets a new NYT node (node 49, weight 0) and the external node a (node 50, weight 1) as children.
( aa ): Send 1 for the second a; the weights of node a and the root become 2.
Ch 3 Huffman Coding 28
3.4.1 Update Procedure
( aar ): Send 0 for the NYT node, then send the fixed code 10001 for r (the index of r is 18, so the fixed code is the 5-bit representation of 17). Updating the tree for r, the old NYT node (49) gives birth to a new NYT node (47, weight 0) and the external node r (48, weight 1); node 49 gets weight 1 and the root weight becomes 3.
Ch 3 Huffman Coding 29
3.4.1 Update Procedure
( aard ): Send 00 for the NYT node, then send the fixed code 00011 for d (the index of d is 4, so the fixed code is the 5-bit representation of 3). Updating the tree for d, the old NYT node (47) gives birth to a new NYT node (45, weight 0) and the external node d (46, weight 1); node 47 gets weight 1, node 49 weight 2, and the root weight 4.
Ch 3 Huffman Coding 30
3.4.1 Update Procedure
( aardv ): Send 000 for the NYT node, then send the fixed code 1011 for v (the index of v is 22 > 2r = 20, so the fixed code is the 4-bit representation of 22 - 10 - 1 = 11). Updating the tree for v, the old NYT node (45) gives birth to a new NYT node (43, weight 0) and the external node v (44, weight 1). A node is now no longer the highest numbered in its weight block, so it is swapped with the highest numbered node in the block.
Ch 3 Huffman Coding 31
3.4.1 Update Procedure
( aardv, continued ): After the swap, the weights along the path to the root are incremented; a second swap is required closer to the root, and the root weight becomes 5.
Ch 3 Huffman Coding 32
3.4.2 Encoding Procedure
[Figure 3.8 (a) Flowchart of the encoding procedure. START: read in a symbol. If this is its first appearance, send the code for the NYT node followed by the symbol's index in the NYT list; otherwise the code is the path from the root node to the corresponding external node. Then call the update procedure.]
Ch 3 Huffman Coding 33
3.4.2 Encoding Procedure
[Figure 3.8 (b) Flowchart of the encoding procedure (continued). If this is the last symbol, stop; otherwise loop back and read the next symbol.]
Ch 3 Huffman Coding 34
3.4.2 Encoding Procedure
Example 3.4.2 Encoding procedure
Message [ a a r d v a r k ]
00000 1 0 10001 00 00011 000 1011 0
(the groups 0, 00, and 000 are the codes of the NYT node before r, d, and v)
Ch 3 Huffman Coding 35
3.4.3 Decoding Procedure
[Figure 3.9 (a) Flowchart of the decoding procedure. START: go to the root of the tree, then repeatedly read a bit and go to the corresponding child node until an external node is reached.]
Ch 3 Huffman Coding 36
3.4.3 Decoding Procedure
[Figure 3.9 (b) Flowchart of the decoding procedure (continued). If the external node is not the NYT node, decode the element corresponding to the node. If it is the NYT node, read e bits as the number p; if p is less than r, read one more bit (p becomes the (e+1)-bit number), otherwise add r to p.]
Ch 3 Huffman Coding 37
3.4.3 Decoding Procedure
[Figure 3.9 (c) Flowchart of the decoding procedure (continued). Decode the (p+1)th element in the NYT list, call the update procedure, and stop if this was the last bit; otherwise return to the root and continue.]
Ch 3 Huffman Coding 38
3.4.3 Decoding Procedure
Example 3.4.3 Decoding procedure
Message [ a a r d v a r k ]
00000 1 0 10001 00 00011 000 1011 0
(the groups 0, 00, and 000 are the codes of the NYT node before r, d, and v)
Ch 3 Huffman Coding 39
3.8 Applications of Huffman Coding
3.8.1 Lossless Image Compression
Figure 3.10 Test Images.
Sena Sensin Earth Omaha
256*256 gray-scale raw images.
ftp://ftp.mkp.com/pub/Sayood/uncompressed_software/datasets/images/
Ch 3 Huffman Coding 40
3.8.1 Lossless Image Compression
Table 3.23 Compression using Huffman codes on pixel values.

Image Name   Bits/Pixel   Total Size (bytes)   Compression Ratio
Sena         7.01         57,504               1.14
Sensin       7.49         61,430               1.07
Earth        4.94         40,534               1.62
Omaha        7.12         58,374               1.12

Table 3.24 Compression using Huffman codes on pixel difference values.

Image Name   Bits/Pixel   Total Size (bytes)   Compression Ratio
Sena         4.02         32,968               1.99
Sensin       4.70         38,541               1.70
Earth        4.13         33,880               1.93
Omaha        6.42         52,643               1.24
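The gain in Table 3.24 comes from coding the difference xn - xn-1 instead of the pixel xn: neighboring pixels are similar, so the differences cluster around zero and have lower first-order entropy. A small sketch of how such numbers can be estimated (names are mine; image loading is left out):

```python
import math
from collections import Counter

def entropy_bits(values):
    # First-order (empirical) entropy in bits/symbol.
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def compare(pixels):
    # pixels: a gray-scale image flattened to a list of 0..255 values.
    diffs = [pixels[0]] + [b - a for a, b in zip(pixels, pixels[1:])]
    print('pixel entropy:     ', round(entropy_bits(pixels), 2), 'bits/pixel')
    print('difference entropy:', round(entropy_bits(diffs), 2), 'bits/pixel')
```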
Ch 3 Huffman Coding 41
3.8.1 Lossless Image Compression
Table 3.25 Compression using adaptive Huffman codes on pixel difference values.

Image Name   Bits/Pixel   Total Size (bytes)   Compression Ratio
Sena         3.93         32,261               2.03
Sensin       4.63         37,896               1.73
Earth        4.82         39,504               1.66
Omaha        6.39         52,321               1.25

Adaptive Huffman coder
• Advantage: can be used as an on-line or real-time coder.
• Disadvantages: more vulnerable to errors; more difficult to implement.
Ch 3 Huffman Coding 42
3.8.2 Text Compression
Table 3.26 Probabilities of occurrence of the letters in the English alphabet in the U.S. Constitution.
Letter Probability Letter Probability Letter Probability
A 0.057305 J 0.002031 S 0.060289
B 0.014876 K 0.001016 T 0.078085
C 0.025775 L 0.031403 U 0.018474
D 0.026811 M 0.015892 V 0.009882
E 0.112578 N 0.056035 W 0.007576
F 0.022875 O 0.058215 X 0.002264
G 0.009523 P 0.021034 Y 0.011702
H 0.042915 Q 0.000973 Z 0.001502
I 0.053475 R 0.048819
Ch 3 Huffman Coding 43
3.8.2 Text Compression
Table 3.27 Probabilities of occurrence of the letters in the English alphabet in this chapter.
Letter Probability Letter Probability Letter Probability
A 0.049885 J 0.000394 S 0.042657
B 0.016110 K 0.002450 T 0.061142
C 0.025835 L 0.025835 U 0.015794
D 0.030232 M 0.016494 V 0.004988
E 0.097434 N 0.048039 W 0.012207
F 0.019745 O 0.050642 X 0.003413
G 0.012053 P 0.015007 Y 0.008466
H 0.035723 Q 0.001509 Z 0.001050
I 0.048783 R 0.040492
Ch 3 Huffman Coding 44
3.8.2 Text Compression
[Bar chart: letter probabilities for A–Z in the U.S. Constitution and in Chapter 1, plotted on a 0 to 0.12 scale; the two distributions track each other closely.]
Ch 3 Huffman Coding 45
3.8.3 Audio Compression
CD-quality audio data
• Each stereo channel is sampled at 44.1 kHz.
• Each sample is represented by 16 bits.
( the amount of data stored on one CD is enormous )
16 bits means 65,536 distinct values, so a Huffman coder would require 65,536 distinct (variable-length) codewords. In most applications, a codebook of this size would not be practical.
Approaches for large alphabets:
• Recursive indexing (Chapter 8)
• Others [ reference: #180 ]
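Tables 3.28 and 3.29 below estimate the achievable compression from the first-order entropy of the samples and of their successive differences. A sketch of that estimate, reusing the entropy_bits() helper from the Section 3.8.1 sketch (names are mine):

```python
def estimate(samples):
    # samples: one channel of 16-bit audio as a list of integers.
    diffs = [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]
    for name, seq in (('raw', samples), ('difference', diffs)):
        h = entropy_bits(seq)
        est = int(len(seq) * h / 8)           # estimated compressed bytes
        print(name, round(h, 1), 'bits,', est, 'bytes,',
              'ratio', round(len(seq) * 2 / est, 2))
```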
Ch 3 Huffman Coding 46
3.8.3 Audio Compression
Table 3.28 Huffman codes of 16-bit CD-quality audio.

File Name   Original File Size (bytes)   Entropy (bits)   Estimated Compressed File Size (bytes)   Compression Ratio
Mozart      939,862                      12.8             725,420                                   1.30
Cohn        402,442                      13.8             349,300                                   1.15
Mir         884,020                      13.7             759,540                                   1.16

Table 3.29 Huffman codes of differences of 16-bit CD-quality audio.

File Name   Original File Size (bytes)   Entropy (bits)   Estimated Compressed File Size (bytes)   Compression Ratio
Mozart      939,862                      9.7              569,792                                   1.65
Cohn        402,442                      10.4             261,590                                   1.54
Mir         884,020                      10.9             602,240                                   1.47