
Huffman coding

Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006

Optimal codes - I

A code is optimal if it has the shortest average codeword length $L$.

This can be seen as an optimization problem:

$$\min_{l_1,\dots,l_m} L = \sum_{i=1}^{m} p_i l_i \qquad \text{subject to} \qquad \sum_{i=1}^{m} D^{-l_i} \le 1$$
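As a small illustration of the two quantities in this problem, the sketch below evaluates the Kraft sum $\sum_i D^{-l_i}$ and the average length $\sum_i p_i l_i$ for a given set of lengths. The function names and the four-symbol source are assumptions made for the example, not part of the slides.

```python
from fractions import Fraction

def kraft_sum(lengths, D=2):
    """Return sum_i D^(-l_i) exactly, as a Fraction."""
    return sum(Fraction(1, D ** l) for l in lengths)

def average_length(probs, lengths):
    """Return L = sum_i p_i * l_i."""
    return sum(p * l for p, l in zip(probs, lengths))

# Hypothetical four-symbol source (not taken from the slides).
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]

print(kraft_sum(lengths))               # 1  (constraint met with equality)
print(average_length(probs, lengths))   # 1.75
print(kraft_sum([1, 1, 2]) <= 1)        # False: no instantaneous code has these lengths
```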

Optimal codes - II

Let's make two simplifying assumptions:
- no integer constraint on the codelengths
- the Kraft inequality holds with equality

This is a Lagrange-multiplier problem:

$$J = \sum_{i=1}^{m} p_i l_i + \lambda \left( \sum_{i=1}^{m} D^{-l_i} - 1 \right)$$

$$\frac{\partial J}{\partial l_j} = 0 \;\Rightarrow\; p_j - \lambda D^{-l_j} \ln D = 0 \;\Rightarrow\; D^{-l_j} = \frac{p_j}{\lambda \ln D}$$

Optimal codes - III

Substitute into the Kraft inequality

$$D^{-l_j} = \frac{p_j}{\lambda \ln D} \;\Rightarrow\; \sum_{i=1}^{m} \frac{p_i}{\lambda \ln D} = 1 \;\Rightarrow\; \lambda = \frac{1}{\ln D}$$

that is

$$D^{-l_i} = p_i \;\Rightarrow\; l_i^* = -\log_D p_i$$

Note that

$$L^* = \sum_{i=1}^{m} p_i l_i^* = -\sum_{i=1}^{m} p_i \log_D p_i = H_D(X) \;!!$$

the entropy, when we use base D for logarithms.
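A minimal numerical sketch of this result, assuming the dyadic distribution 1/2, 1/4, 1/8, 1/16, 1/32, 1/32 used later in the slides: for such a source the ideal lengths $l_i^* = -\log_D p_i$ are already integers, so the average length equals the entropy. The helper names are illustrative only.

```python
import math

def entropy(probs, D=2):
    """H_D(X) = -sum_i p_i log_D p_i."""
    return -sum(p * math.log2(p) for p in probs if p > 0) / math.log2(D)

def ideal_lengths(probs, D=2):
    """Real-valued optimal lengths l_i* = -log_D p_i (integer constraint dropped)."""
    return [-math.log2(p) / math.log2(D) for p in probs]

probs = [1/2, 1/4, 1/8, 1/16, 1/32, 1/32]
lengths = ideal_lengths(probs)
L_star = sum(p * l for p, l in zip(probs, lengths))

print(lengths)                  # [1.0, 2.0, 3.0, 4.0, 5.0, 5.0]
print(L_star, entropy(probs))   # 1.9375 1.9375
```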

Optimal codes - IV

In practice the codeword lengths must be integer values, so the result obtained is a lower bound.

Theorem. The expected length $L$ of any instantaneous D-ary code for a r.v. $X$ satisfies

$$L \ge H_D(X)$$

This fundamental result derives from the work of Shannon.

Optimal codes - V

What about the upper bound?

Theorem. Given a source alphabet (i.e. a r.v.) of entropy $H(X)$, it is possible to find an instantaneous binary code whose length $L$ satisfies

$$H(X) \le L < H(X) + 1$$

A similar theorem could be stated if we use the wrong probabilities $\{q_i\}$ instead of the true ones $\{p_i\}$; the only difference is a term which accounts for the relative entropy.
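One standard way to see that such a code exists is to round the ideal lengths up, $l_i = \lceil -\log_2 p_i \rceil$ (the classical Shannon construction, which is not necessarily how the slides prove the theorem). The sketch below checks numerically that these lengths satisfy the Kraft inequality and the bound; the distribution is the one used later in the slides for the Shannon-Fano optimality example, and the function names are assumptions.

```python
import math

def entropy(probs):
    """Entropy H(X) in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def shannon_lengths(probs):
    """Integer lengths l_i = ceil(-log2 p_i)."""
    return [math.ceil(-math.log2(p)) for p in probs]

probs = [0.35, 0.17, 0.17, 0.16, 0.15]
lengths = shannon_lengths(probs)
H = entropy(probs)
L = sum(p * l for p, l in zip(probs, lengths))
kraft = sum(2.0 ** -l for l in lengths)

print(lengths)            # [2, 3, 3, 3, 3]
print(kraft <= 1)         # True: an instantaneous code with these lengths exists
print(H <= L < H + 1)     # True
```

These lengths meet the bound but are not optimal in general; the later slides show shorter codes for the same source.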

Redundancy

It is defined as the average codeword length minus the entropy:

$$\text{Redundancy} = L - H(X) = L + \sum_i p_i \log p_i$$

Note that

$$0 \le \text{redundancy} < 1$$

(why?)

Compression ratio

It is the ratio between the average number of bits per symbol in the original message and the same quantity for the coded message, i.e.

$$C = \frac{\text{average original symbol length}}{\text{average compressed symbol length}}$$

The denominator is the average codeword length $L(X)$!!

Uniquely decodable codes

The set of instantaneous codes is a small subset of the uniquely decodable codes.

Is it possible to obtain a lower average code length $L$ using a uniquely decodable code that is not instantaneous? NO.

So we use instantaneous codes, which are easier to decode.

Summary

Average codeword length $L$ for uniquely decodable codes (and for instantaneous codes):

$$L \ge H(X)$$

In practice, for each r.v. $X$ with entropy $H(X)$ we can build a code with average codeword length that satisfies

$$H(X) \le L < H(X) + 1$$

Shannon-Fano coding

The main advantage of the Shannon-Fano technique is its simplicity:
- Source symbols are listed in order of nonincreasing probability.
- The list is divided so as to form two groups with probabilities as nearly equal as possible.
- Each symbol in the first group receives a 0 as the first digit of its codeword, while the others receive a 1.
- Each of these groups is then divided according to the same criterion and additional code digits are appended.
- The process continues until each group contains only one message.
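The steps above can be sketched as a short recursive routine. This is a minimal illustration, not the course's reference implementation: the function name `shannon_fano`, the dictionary input, and the rule for picking the split point (the cut that minimizes the difference between the two group probabilities, first such cut on ties) are assumptions.

```python
def shannon_fano(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword string."""
    # List the symbols in order of nonincreasing probability.
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    codes = {sym: "" for sym, _ in items}

    def split(group):
        if len(group) < 2:
            return
        total = sum(p for _, p in group)
        # Find the cut that makes the two groups as nearly equiprobable as possible.
        best_cut, best_diff, running = 1, float("inf"), 0.0
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(total - 2 * running)
            if diff < best_diff:
                best_cut, best_diff = i, diff
        left, right = group[:best_cut], group[best_cut:]
        for sym, _ in left:
            codes[sym] += "0"      # first group receives a 0
        for sym, _ in right:
            codes[sym] += "1"      # the others receive a 1
        split(left)
        split(right)

    split(items)
    return codes

example = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/16, "e": 1/32, "f": 1/32}
print(shannon_fano(example))
# {'a': '0', 'b': '10', 'c': '110', 'd': '1110', 'e': '11110', 'f': '11111'}
```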

Example

Symb.  Prob.  Codeword
a      1/2    0
b      1/4    10
c      1/8    110
d      1/16   1110
e      1/32   11110
f      1/32   11111

H = 1.9375 bits
L = 1.9375 bits

Shannon-Fano coding - exercise

Encode using the Shannon-Fano algorithm:

Symb.  Prob.
*      12%
?      5%
!      13%
&      2%
$      29%
€      13%
§      10%
°      6%
@      10%

Is Shannon-Fano coding optimal?

Symb.  Prob.  Shannon-Fano  Alternative code
a      0.35   00            0
b      0.17   01            100
c      0.17   10            101
d      0.16   110           110
e      0.15   111           111

H = 2.2328 bits
L = 2.31 bits (Shannon-Fano)
L1 = 2.3 bits (alternative code)

Since L1 < L, the Shannon-Fano code is not optimal for this source.

Huffman coding - I

There is another algorithm whose performance is slightly better than Shannon-Fano: the famous Huffman coding.

It works by constructing a tree bottom-up, with the symbols in the leaves.

The two leaves with the smallest probabilities become siblings under a parent node whose probability is equal to the sum of the two children's probabilities.

Huffman coding - II

The operation is then repeated, considering also the new parent node and ignoring its children.

The process continues until there is only one parent node with probability 1, which is the root of the tree.

Then the two branches of every non-leaf node are labeled 0 and 1 (typically 0 on the left branch, but the order is not important).
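The bottom-up construction can be sketched with a priority queue. This is a minimal illustration under stated assumptions: the function name `huffman_code` is hypothetical, symbols are assumed not to be tuples (tuples mark internal nodes), and ties are broken by insertion order, so the individual codewords may differ from the ones on the slides while the average length stays the same.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword string."""
    tiebreak = count()  # unique counter so the heap never has to compare nodes
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # The two nodes with the smallest probabilities become siblings.
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p0 + p1, next(tiebreak), (left, right)))
    _, _, root = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):        # internal node: label branches 0 and 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                              # leaf: record the codeword
            codes[node] = prefix or "0"    # single-symbol source gets "0"
    walk(root, "")
    return codes

probs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1}
code = huffman_code(probs)
avg = sum(probs[s] * len(cw) for s, cw in code.items())
print(code)
print(round(avg, 4))   # 2.6
```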

Huffman coding - example

Symbol  Prob.
a       0.05
b       0.05
c       0.1
d       0.2
e       0.3
f       0.2
g       0.1

[Figure: Huffman tree built bottom-up over the leaves a-g. The internal nodes have probabilities 0.1, 0.2, 0.3, 0.4, 0.6 and 1.0 (root); the two branches of each internal node are labeled 0 and 1.]

Huffman coding - example

Exercise: evaluate H(X) and L(X)

Symbol  Prob.  Codeword
a       0.05   0000
b       0.05   0001
c       0.1    001
d       0.2    01
e       0.3    10
f       0.2    110
g       0.1    111

H(X) = 2.5464 bits
L(X) = 2.6 bits !!
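The two values in this example can be checked with a few lines of Python. The probabilities and codewords are taken from the table of this example; the variable names are illustrative.

```python
import math

# symbol -> (probability, codeword), as in the example table
TABLE = {
    "a": (0.05, "0000"),
    "b": (0.05, "0001"),
    "c": (0.1, "001"),
    "d": (0.2, "01"),
    "e": (0.3, "10"),
    "f": (0.2, "110"),
    "g": (0.1, "111"),
}

H = -sum(p * math.log2(p) for p, _ in TABLE.values())   # entropy in bits
L = sum(p * len(cw) for p, cw in TABLE.values())        # average codeword length

print(f"H(X) = {H:.4f} bits")            # H(X) = 2.5464 bits
print(f"L(X) = {L:.4f} bits")            # L(X) = 2.6000 bits
print(f"redundancy = {L - H:.4f} bits")
```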

Huffman coding - exercise

Code the sequence aeebcddegfced and calculate the compression ratio (use the code table of the previous example).

Sol: 0000 10 10 0001 001 01 01 10 111 110 001 10 01

Aver. orig. symb. length = 3 bits
Aver. compr. symb. length = 34/13 bits
C = .....

Huffman coding - exercise

Decode the sequence 0111001001000001111110 (use the code table of the previous example).

Sol: dfdcadgf
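The two exercises (coding and decoding with the example table) can be checked with a table-driven sketch like the following. The function names are assumptions; the decoder relies on the code being instantaneous, so a symbol can be emitted as soon as the accumulated bits match a codeword.

```python
CODE = {"a": "0000", "b": "0001", "c": "001", "d": "01",
        "e": "10", "f": "110", "g": "111"}

def encode(message, code=CODE):
    """Concatenate the codewords of the symbols in message."""
    return "".join(code[s] for s in message)

def decode(bits, code=CODE):
    """Decode a bit string produced with an instantaneous (prefix-free) code."""
    inverse = {cw: s for s, cw in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:         # a complete codeword has been read
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

message = "aeebcddegfced"
bits = encode(message)
print(len(bits), len(message))                 # 34 13
print(decode("0111001001000001111110"))        # dfdcadgf

# Compression ratio: 3 bits per original symbol (7 symbols need 3 bits each)
# divided by the average number of coded bits per symbol.
ratio = 3 / (len(bits) / len(message))
```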

Huffman coding - exercise

Encode with Huffman the sequence 01$cc0a02ba10 and evaluate entropy, average codeword length and compression ratio.

Symb.  Prob.
a      0.10
b      0.03
c      0.14
0      0.4
1      0.22
2      0.04
$      0.07

Huffman coding - exercise

Symb.  Prob.
0      0.16
1      0.02
2      0.15
3      0.29
4      0.17
5      0.04
%      0.17

Decode (if possible) the Huffman coded bit stream 01001011010011110101...

Huffman coding - notes

In Huffman coding, if at any time there is more than one way to choose a smallest pair of probabilities, any such pair may be chosen.

Sometimes the list of probabilities is initialized to be non-increasing and reordered after each node creation. This detail doesn't affect the correctness of the algorithm, but it provides a more efficient implementation.

Huffman coding - notes

There are cases in which Huffman coding does not uniquely determine the codeword lengths, due to the arbitrary choice among equal minimum probabilities.

For example, for a source with probabilities 0.4, 0.2, 0.2, 0.1, 0.1 it is possible to obtain codeword lengths of 1, 2, 3, 4, 4 and of 2, 2, 2, 3, 3.

It would be better to have a code whose codeword lengths have the minimum variance, as this solution will need the minimum buffer space in the transmitter and in the receiver.
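A quick check of this example, as a sketch: both sets of lengths give the same average codeword length, but their variances differ considerably. The function name `length_stats` is an assumption.

```python
def length_stats(probs, lengths):
    """Return (mean, variance) of the codeword length under the source distribution."""
    mean = sum(p * l for p, l in zip(probs, lengths))
    var = sum(p * (l - mean) ** 2 for p, l in zip(probs, lengths))
    return mean, var

probs = [0.4, 0.2, 0.2, 0.1, 0.1]
mean_a, var_a = length_stats(probs, [1, 2, 3, 4, 4])
mean_b, var_b = length_stats(probs, [2, 2, 2, 3, 3])

print(round(mean_a, 4), round(var_a, 4))   # 2.2 1.36
print(round(mean_b, 4), round(var_b, 4))   # 2.2 0.16
```

Both codes are optimal in average length; the second one is preferable when buffer space matters.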

Huffman coding - notes

Schwarz defines a variant of the Huffman algorithm that builds the code with minimum $l_{max}$.

There are several other variants; we will explain the most important ones in a while.

Optimality of Huffman coding - I

It is possible to prove that, in the case of character coding (one symbol, one codeword), Huffman coding is optimal.

In other terms, the Huffman code has minimum redundancy.

An upper bound for the redundancy has been found:

$$\text{redundancy} \le p_1 + 1 - \log_2 e + \log_2(\log_2 e) \approx p_1 + 0.086$$

where $p_1$ is the probability of the most likely symbol.

Optimality of Huffman coding - II

Why does the Huffman code "suffer" when there is one symbol with very high probability?

Remember the notion of uncertainty...

$$p(x) \to 1 \;\Rightarrow\; -\log(p(x)) \to 0$$

The main problem is given by the integer constraint on codelengths!!

This consideration opens the way to a more powerful coding... we will see it later.

Huffman coding - implementation

A Huffman code can be generated in O(n) time, where n is the number of source symbols, provided that the probabilities have been presorted (however this sort costs O(n log n)...)

Nevertheless, encoding is very fast.

Huffman coding - implementation

However, the space and time complexity of the decoding phase are far more important because, on average, decoding will happen more frequently.

Consider a Huffman tree with n symbols:
- n leaves, each with a pointer to a symbol and the information that it is a leaf
- n-1 internal nodes, each with two pointers

$$2n + 2(n-1) \approx 4n \text{ words (32 bits)}$$

Huffman coding - implementation

1 million symbols: 16 MB of memory!

Moreover, traversing a tree from root to leaf involves following a lot of pointers, with little locality of reference. This causes several page faults or cache misses.

To solve this problem a variant of Huffman coding has been proposed: canonical Huffman coding.

canonical Huffman coding - I

Symb.  Prob.  Code 1  Code 2  Code 3
a      0.11   000     111     000
b      0.12   001     110     001
c      0.13   100     011     010
d      0.14   101     010     011
e      0.24   01      10      10
f      0.26   11      00      11

[Figure: Huffman tree over the leaves a-f, with internal nodes 0.23 (a, b), 0.27 (c, d), 0.47 (0.23, e), 0.53 (0.27, f) and root 1.0. Code 1 follows the 0/1 branch labels and Code 2 the complementary labels shown in parentheses; Code 3 is marked with a "?".]

canonical Huffman coding - II

Symb.  Code 3
a      000
b      001
c      010
d      011
e      10
f      11

This code (Code 3) cannot be obtained through a Huffman tree!

We do call it a Huffman code because it is instantaneous and the codeword lengths are the same as those of a valid Huffman code.

Numerical sequence property:
- codewords with the same length are ordered lexicographically
- when the codewords are sorted in lexical order they are also in order from the longest to the shortest codeword

canonical Huffman coding - III

The main advantage is that it is not necessary to store a tree in order to decode.

We need:
- a list of the symbols ordered according to the lexical order of the codewords
- an array with the first codeword of each distinct length

canonical Huffman coding - IV

Encoding. Suppose there are n distinct symbols, that for symbol i we have calculated the Huffman codelength $l_i$, and that $\forall i \;\; l_i \le maxlength$.

    for k ← 1 to maxlength { numl[k] ← 0; }
    for i ← 1 to n { numl[l_i] ← numl[l_i] + 1; }
    firstcode[maxlength] ← 0;
    for k ← maxlength - 1 downto 1 {
        firstcode[k] ← ⌈(firstcode[k+1] + numl[k+1]) / 2⌉; }
    for k ← 1 to maxlength { nextcode[k] ← firstcode[k]; }
    for i ← 1 to n {
        codeword[i] ← nextcode[l_i];
        symbol[l_i, nextcode[l_i] - firstcode[l_i]] ← i;
        nextcode[l_i] ← nextcode[l_i] + 1; }

numl[k] = number of codewords with length k
firstcode[k] = integer for first code of length k
nextcode[k] = integer for the next codeword of length k to be assigned
symbol[-,-] used for decoding
codeword[i]: the rightmost $l_i$ bits of this integer are the code for symbol i

canonical Huffman - example

Symb.  length l_i  code word  bits
a      2           1          01
b      5           0          00000
c      5           1          00001
d      3           1          001
e      2           2          10
f      5           2          00010
g      5           3          00011
h      2           3          11

1. Evaluate array numl

numl: [0 3 1 0 4]

2. Evaluate array firstcode

    for k ← maxlength - 1 downto 1 {
        firstcode[k] ← ⌈(firstcode[k+1] + numl[k+1]) / 2⌉; }

firstcode: [2 1 1 2 0]

3. Construct array codeword and symbol

    for k ← 1 to maxlength { nextcode[k] ← firstcode[k]; }
    for i ← 1 to n {
        codeword[i] ← nextcode[l_i];
        symbol[l_i, nextcode[l_i] - firstcode[l_i]] ← i;
        nextcode[l_i] ← nextcode[l_i] + 1; }

symbol  0  1  2  3
1       -  -  -  -
2       a  e  h  -
3       d  -  -  -
4       -  -  -  -
5       b  c  f  g
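The encoding procedure can be sketched in Python and checked against this example. This is an illustration under assumptions: the function name `canonical_code` is hypothetical, symbols are numbered from 0 instead of 1, the returned arrays are re-indexed so that position 0 corresponds to length 1, and the `symbol` array is stored as a dictionary keyed by (length, offset).

```python
def canonical_code(lengths):
    """lengths[i] is the Huffman codelength l_i of symbol i (0-based).

    Returns (numl, firstcode, codeword, symbol); numl and firstcode are
    listed for lengths 1..maxlength.
    """
    maxlength = max(lengths)
    numl = [0] * (maxlength + 2)            # numl[k]: codewords of length k
    for l in lengths:
        numl[l] += 1

    firstcode = [0] * (maxlength + 2)       # firstcode[maxlength] = 0
    for k in range(maxlength - 1, 0, -1):
        # ceiling of (firstcode[k+1] + numl[k+1]) / 2
        firstcode[k] = (firstcode[k + 1] + numl[k + 1] + 1) // 2

    nextcode = firstcode[:]
    codeword, symbol = [], {}
    for i, l in enumerate(lengths):
        # the rightmost l bits of nextcode[l] are the code for symbol i
        codeword.append(format(nextcode[l], "0{}b".format(l)))
        symbol[(l, nextcode[l] - firstcode[l])] = i
        nextcode[l] += 1
    return numl[1:maxlength + 1], firstcode[1:maxlength + 1], codeword, symbol

names = "abcdefgh"
numl, firstcode, codeword, symbol = canonical_code([2, 5, 5, 3, 2, 5, 5, 2])
print(numl)        # [0, 3, 1, 0, 4]
print(firstcode)   # [2, 1, 1, 2, 0]
print(dict(zip(names, codeword)))
```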

canonical Huffman coding - V

Decoding. We have the arrays firstcode and symbol.

    v ← nextinputbit();
    k ← 1;
    while v < firstcode[k] {
        v ← 2 * v + nextinputbit();
        k ← k + 1; }
    return symbol[k, v - firstcode[k]];

nextinputbit(): function that returns the next input bit
firstcode[k] = integer for first code of length k
symbol[k,n] returns the symbol number n with codelength k

canonical Huffman - example

Decoding with firstcode: [2 1 1 2 0] and the symbol array of the encoding example.

Input bits (grouped by codeword): 001 11 10 00000 01 001

symbol[3,0] = d
symbol[2,2] = h
symbol[2,1] = e
symbol[5,0] = b
symbol[2,0] = a
symbol[3,0] = d

Decoded: dhebad