4.8 Huffman Codes - Princeton University Computer Science


Page 1

4.8 Huffman Codes

These lecture slides are supplied by Mathijs de Weerd

Page 2

Data Compression

Q. Given a text that uses 32 symbols (26 different letters, space, and some punctuation characters), how can we encode this text in bits?

Q. Some symbols (e, t, a, o, i, n) are used far more often than others. How can we use this to reduce our encoding?

Q. How do we know when the next symbol begins?

Ex. c(a) = 01, c(b) = 010, c(e) = 1. What is 0101?

Page 3

Data Compression

Q. Given a text that uses 32 symbols (26 different letters, space, and some punctuation characters), how can we encode this text in bits?
A. We can encode 2^5 = 32 different symbols using a fixed length of 5 bits per symbol. This is called fixed-length encoding.

Q. Some symbols (e, t, a, o, i, n) are used far more often than others. How can we use this to reduce our encoding?
A. Encode these characters with fewer bits, and the others with more bits.

Q. How do we know when the next symbol begins?
A. Use a separation symbol (like the pause in Morse), or make sure that there is no ambiguity by ensuring that no code is a prefix of another one.

Ex. c(a) = 01, c(b) = 010, c(e) = 1. What is 0101?
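To see the ambiguity concretely, here is a small Python sketch (my own, not part of the slides) that enumerates every parse of 0101 under this code:

```python
# The example code from the slide: c(a) = 01 is a prefix of c(b) = 010,
# so the code is not prefix-free and decoding can be ambiguous.
code = {"a": "01", "b": "010", "e": "1"}

def parses(bits, code):
    """Return every way of splitting `bits` into code words."""
    if not bits:
        return [[]]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            results += [[symbol] + rest for rest in parses(bits[len(word):], code)]
    return results

print(parses("0101", code))  # [['a', 'a'], ['b', 'e']]: 0101 reads as "aa" or as "be"
```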

Page 4

Prefix Codes

Definition. A prefix code for a set S is a function c that maps each x ∈ S to 1s and 0s in such a way that for x, y ∈ S, x ≠ y, c(x) is not a prefix of c(y).

Ex. c(a) = 11, c(e) = 01, c(k) = 001, c(l) = 10, c(u) = 000
Q. What is the meaning of 1001000001?

Suppose frequencies are known in a text of 1G: f_a = 0.4, f_e = 0.2, f_k = 0.2, f_l = 0.1, f_u = 0.1.
Q. What is the size of the encoded text?

Page 5

Prefix Codes

Definition. A prefix code for a set S is a function c that maps each x ∈ S to 1s and 0s in such a way that for x, y ∈ S, x ≠ y, c(x) is not a prefix of c(y).

Ex. c(a) = 11, c(e) = 01, c(k) = 001, c(l) = 10, c(u) = 000
Q. What is the meaning of 1001000001?
A. “leuk”

Suppose frequencies are known in a text of 1G: f_a = 0.4, f_e = 0.2, f_k = 0.2, f_l = 0.1, f_u = 0.1.
Q. What is the size of the encoded text?
A. 2·f_a + 2·f_e + 3·f_k + 2·f_l + 3·f_u = 2.3G
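A small decoding sketch (my own, not from the slides) that scans the bit string left to right; it works precisely because no code word is a prefix of another:

```python
# The prefix code from the slide.
code = {"a": "11", "e": "01", "k": "001", "l": "10", "u": "000"}
inverse = {word: symbol for symbol, word in code.items()}

def decode(bits):
    """Greedy left-to-right decoding of a prefix code."""
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:   # a complete code word has been read
            out.append(inverse[buffer])
            buffer = ""
    return "".join(out)

print(decode("1001000001"))  # -> "leuk"
```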

Page 6

Optimal Prefix Codes

Definition. The average bits per letter of a prefix code c is the sum over all symbols of its frequency times the number of bits of its encoding:

We would like to find a prefix code that has the lowest possible average bits per letter.

Suppose we model a code in a binary tree…

!"

#=Sx

x xcfcABL )()(
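Spelled out in code, the definition is a single weighted sum. A minimal sketch of my own, using the frequencies and code from the previous slide:

```python
def abl(freq, code):
    """Average bits per letter: sum over all symbols of frequency times code-word length."""
    return sum(freq[x] * len(code[x]) for x in freq)

freq = {"a": 0.4, "e": 0.2, "k": 0.2, "l": 0.1, "u": 0.1}
code = {"a": "11", "e": "01", "k": "001", "l": "10", "u": "000"}
print(abl(freq, code))  # about 2.3 bits per letter, i.e. 2.3G bits for a 1G-letter text
```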

Page 7

Representing Prefix Codes using Binary Trees

Ex. c(a) = 11, c(e) = 01, c(k) = 001, c(l) = 10, c(u) = 000

Q. How does the tree of a prefix code look?

[Figure: binary tree of the code; leaves a, e, k, l, u; edges labeled 0 and 1]

Page 8

Representing Prefix Codes using Binary Trees

Ex. c(a) = 11, c(e) = 01, c(k) = 001, c(l) = 10, c(u) = 000

Q. How does the tree of a prefix code look?
A. Only the leaves have a label.
Pf. An encoding of x is a prefix of an encoding of y if and only if the path of x is a prefix of the path of y.

[Figure: binary tree of the code; leaves a, e, k, l, u; edges labeled 0 and 1]
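The leaves-only property is easy to see by building the tree programmatically. A sketch of my own (nested dicts keyed by bit, not the slides' notation): a symbol is stored only where its code word ends, so for a prefix-free code it can only sit at a leaf:

```python
def code_tree(code):
    """Build the binary tree of a prefix code as nested dicts keyed by '0'/'1'."""
    root = {}
    for symbol, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})  # walk/create internal nodes
        node[word[-1]] = symbol              # the symbol ends up at a leaf
    return root

print(code_tree({"a": "11", "e": "01", "k": "001", "l": "10", "u": "000"}))
# {'1': {'1': 'a', '0': 'l'}, '0': {'1': 'e', '0': {'1': 'k', '0': 'u'}}}
```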

Page 9

Representing Prefix Codes using Binary Trees

Q. What is the meaning of 111010001111101000?

[Figure: binary code tree with leaves s, i, m, p, e, l]

$ABL(T) = \sum_{x \in S} f_x \cdot \text{depth}_T(x)$

Page 10

Representing Prefix Codes using Binary Trees

Q. What is the meaning of 111010001111101000?

A. “simpel”

Q. How can this prefix code be made more efficient?

[Figure: binary code tree with leaves s, i, m, p, e, l]

$ABL(T) = \sum_{x \in S} f_x \cdot \text{depth}_T(x)$

Page 11

Representing Prefix Codes using Binary Trees

Q. What is the meaning of 111010001111101000?

A. “simpel”

Q. How can this prefix code be made more efficient?
A. Change the encodings of p and s to shorter ones. This tree is now full.

[Figure: the same code tree with s and p moved to shorter code words; the tree is now full]

$ABL(T) = \sum_{x \in S} f_x \cdot \text{depth}_T(x)$

Page 12

Representing Prefix Codes using Binary Trees

Definition. A tree is full if every node that is not a leaf has two children.

Claim. The binary tree corresponding to the optimal prefix code is full.
Pf. (see next slide)

[Figure: an internal node u with a single child v; w is u's parent]

Page 13

Representing Prefix Codes using Binary Trees

Definition. A tree is full if every node that is not a leaf has two children.

Claim. The binary tree corresponding to the optimal prefix code is full.
Pf. (by contradiction)

Suppose T is the binary tree of an optimal prefix code and is not full. This means there is a node u with only one child v.
Case 1: u is the root; delete u and use v as the root.
Case 2: u is not the root:
- let w be the parent of u
- delete u and make v a child of w in place of u

In both cases the number of bits needed to encode any leaf in the subtree of v decreases, and the rest of the tree is not affected.

Clearly this new tree T' has a smaller ABL than T. Contradiction.

[Figure: node u with a single child v and parent w; deleting u makes v a child of w]
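The transformation used in this proof can be written down directly on a dict-based tree like the one sketched earlier (again my own representation, not the slides'): splice out any internal node with a single child, which shortens every code word beneath it:

```python
def contract(node):
    """Splice out internal nodes with exactly one child (node u in the proof);
    the lone child v takes u's place, so every leaf below v loses one bit."""
    if not isinstance(node, dict):            # a leaf: nothing to change
        return node
    children = {bit: contract(child) for bit, child in node.items()}
    if len(children) == 1:                    # u has only one child v
        return next(iter(children.values()))
    return children

t = {"0": "a", "1": {"0": {"0": "b", "1": "c"}}}  # a=0, b=100, c=101; node "1" is not full
print(contract(t))  # {'0': 'a', '1': {'0': 'b', '1': 'c'}}: now a=0, b=10, c=11
```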

Page 14

Optimal Prefix Codes: False Start

Q. Where in the tree of an optimal prefix code should letters with a high frequency be placed?

Page 15

Optimal Prefix Codes: False Start

Q. Where in the tree of an optimal prefix code should letters with a high frequency be placed?
A. Near the top.

Greedy template. [Shannon-Fano, 1949] Create tree top-down: split S into two sets S1 and S2 with (almost) equal frequencies, and recursively build trees for S1 and S2.

Example frequencies: f_a = 0.32, f_e = 0.25, f_k = 0.20, f_l = 0.18, f_u = 0.05

[Figure: two candidate trees produced by top-down splitting, with the frequencies attached to the leaves]
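A minimal sketch of this top-down splitting template (my own code; the slides give no implementation), which recursively splits the symbol list into two parts of roughly equal total frequency:

```python
def shannon_fano(symbols):
    """Top-down splitting template. `symbols` is a non-empty list of
    (symbol, frequency) pairs sorted by decreasing frequency."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(f for _, f in symbols)
    # pick the split whose left part is closest to half of the total frequency
    best_split, best_diff, prefix = 1, float("inf"), 0.0
    for i in range(1, len(symbols)):
        prefix += symbols[i - 1][1]
        if abs(2 * prefix - total) < best_diff:
            best_diff, best_split = abs(2 * prefix - total), i
    codes = {s: "0" + w for s, w in shannon_fano(symbols[:best_split]).items()}
    codes.update({s: "1" + w for s, w in shannon_fano(symbols[best_split:]).items()})
    return codes

print(shannon_fano([("a", 0.32), ("e", 0.25), ("k", 0.20), ("l", 0.18), ("u", 0.05)]))
```

Unlike Huffman's bottom-up construction on the next slides, this top-down greedy is not guaranteed to produce an optimal prefix code.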

Page 16

Optimal Prefix Codes: Huffman Encoding

Observation. The lowest-frequency items should be at the lowest level in the tree of an optimal prefix code.

Observation. For n > 1, the lowest level always contains at least two leaves.

Observation. The order in which items appear in a level does not matter.

Claim. There is an optimal prefix code with tree T* where the two lowest-frequency letters are assigned to leaves that are siblings in T*.

Greedy template. [Huffman, 1952] Create tree bottom-up.
Make two leaves for the two lowest-frequency letters y and z.
Recursively build the tree for the rest, using a meta-letter for yz.

Page 17

Optimal Prefix Codes: Huffman Encoding

Q. What is the time complexity?

Huffman(S) {
    if |S| = 2 {
        return tree with root and 2 leaves
    } else {
        let y and z be the lowest-frequency letters in S
        S' = S
        remove y and z from S'
        insert new letter ω in S' with f_ω = f_y + f_z
        T' = Huffman(S')
        T = add two children y and z to leaf ω of T'
        return T
    }
}

Page 18

Optimal Prefix Codes: Huffman Encoding

Q. What is the time complexity?
A. T(n) = T(n-1) + O(n), so O(n^2).
Q. How to implement finding the lowest-frequency letters efficiently?
A. Use a priority queue for S: T(n) = T(n-1) + O(log n), so O(n log n).

Huffman(S) {
    if |S| = 2 {
        return tree with root and 2 leaves
    } else {
        let y and z be the lowest-frequency letters in S
        S' = S
        remove y and z from S'
        insert new letter ω in S' with f_ω = f_y + f_z
        T' = Huffman(S')
        T = add two children y and z to leaf ω of T'
        return T
    }
}
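For reference, a runnable Python version of this recursion (my own sketch, not the slides' code): it uses heapq as the priority queue, merges the two lowest-frequency items bottom-up exactly as described, and then reads the code words off the finished tree:

```python
import heapq
from itertools import count

def huffman(freq):
    """Huffman code for a frequency map {symbol: frequency}; assumes >= 2 symbols."""
    ties = count()  # tie-breaker so heapq never has to compare trees
    heap = [(f, next(ties), symbol) for symbol, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fy, _, y = heapq.heappop(heap)  # the two lowest-frequency letters y and z ...
        fz, _, z = heapq.heappop(heap)
        heapq.heappush(heap, (fy + fz, next(ties), (y, z)))  # ... become the meta-letter
    _, _, root = heap[0]

    codes = {}
    def walk(node, word):
        if isinstance(node, tuple):  # internal node: 0 to the left child, 1 to the right
            walk(node[0], word + "0")
            walk(node[1], word + "1")
        else:                        # leaf: an original symbol
            codes[node] = word
    walk(root, "")
    return codes

print(huffman({"a": 0.32, "e": 0.25, "k": 0.20, "l": 0.18, "u": 0.05}))
# e.g. {'k': '00', 'u': '010', 'l': '011', 'e': '10', 'a': '11'}
```

There are n-1 merges and each heap operation costs O(log n), matching the O(n log n) bound above.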

Page 19

Huffman Encoding: Greedy Analysis

Claim. Huffman code for S achieves the minimum ABL of any prefix code.
Pf. by induction, based on optimality of T' (y and z removed, ω added) (see next page)

Claim. ABL(T') = ABL(T) - f_ω
Pf.

Page 20

Huffman Encoding: Greedy Analysis

Claim. Huffman code for S achieves the minimum ABL of any prefix code.
Pf. by induction, based on optimality of T' (y and z removed, ω added) (see next page)

Claim. ABL(T') = ABL(T) - f_ω
Pf.

\begin{align*}
ABL(T) &= \sum_{x \in S} f_x \cdot \text{depth}_T(x) \\
&= f_y \cdot \text{depth}_T(y) + f_z \cdot \text{depth}_T(z) + \sum_{x \in S,\, x \neq y,z} f_x \cdot \text{depth}_T(x) \\
&= (f_y + f_z) \cdot \bigl(1 + \text{depth}_{T'}(\omega)\bigr) + \sum_{x \in S,\, x \neq y,z} f_x \cdot \text{depth}_T(x) \\
&= f_\omega \cdot \bigl(1 + \text{depth}_{T'}(\omega)\bigr) + \sum_{x \in S,\, x \neq y,z} f_x \cdot \text{depth}_T(x) \\
&= f_\omega + \sum_{x \in S'} f_x \cdot \text{depth}_{T'}(x) \\
&= f_\omega + ABL(T')
\end{align*}

Page 21

Huffman Encoding: Greedy Analysis

Claim. Huffman code for S achieves the minimum ABL of any prefix code.
Pf. (by induction over n = |S|)

Page 22

Huffman Encoding: Greedy Analysis

Claim. Huffman code for S achieves the minimum ABL of any prefix code.
Pf. (by induction over n = |S|)
Base: For n = 2 there is no shorter code than root and two leaves.
Hypothesis: Suppose the Huffman tree T' for S' of size n-1 with ω instead of y and z is optimal.
Step: (by contradiction)

Page 23

Huffman Encoding: Greedy Analysis

Claim. Huffman code for S achieves the minimum ABL of any prefix code.
Pf. (by induction)
Base: For n = 2 there is no shorter code than root and two leaves.
Hypothesis: Suppose the Huffman tree T' for S' of size n-1 with ω instead of y and z is optimal. (IH)
Step: (by contradiction)

Idea of proof:
- Suppose some other tree Z of size n is better.
- Delete the lowest-frequency items y and z from Z, creating Z'.
- Z' cannot be better than T' by the IH.

Page 24

Huffman Encoding: Greedy Analysis

Claim. Huffman code for S achieves the minimum ABL of any prefix code.
Pf. (by induction)
Base: For n = 2 there is no shorter code than root and two leaves.
Hypothesis: Suppose the Huffman tree T' for S' with ω instead of y and z is optimal. (IH)
Step: (by contradiction)

- Suppose the Huffman tree T for S is not optimal, so there is some tree Z such that ABL(Z) < ABL(T).
- Then there is also such a tree Z in which the leaves y and z are siblings and have the lowest frequencies (see the observation).
- Let Z' be Z with y and z deleted and their former parent labeled ω; similarly, T' is derived from S' in our algorithm.
- We know that ABL(Z') = ABL(Z) - f_ω, as well as ABL(T') = ABL(T) - f_ω.
- But ABL(Z) < ABL(T), so ABL(Z') < ABL(T'). Contradiction with the IH.