Huffman Edited
8/3/2019
Representation of Strings

ENCODING

How much space do we need?
Assume we represent every character.
How many bits to represent each character? Depends on the size of the alphabet.
Bits to encode a character

Two-character alphabet {A, B}: one bit per character
0 = A, 1 = B
Four-character alphabet {A, B, C, D}: two bits per character
00 = A, 01 = B, 10 = C, 11 = D
Six-character alphabet {A, B, C, D, E, F}: three bits per character
000 = A, 001 = B, 010 = C, 011 = D, 100 = E, 101 = F, 110 = unused, 111 = unused
Generally

The bit sequence representing a character is called the encoding of the character.
There are 2^n different bit sequences of length n.
If we use the same number of bits for each character, then the length of the encoding of a word equals the number of bits per character times the number of characters in the word.
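A minimal sketch of this rule in Python (the function name is my own, for illustration): a fixed-length encoding needs the smallest n with 2^n at least the alphabet size.

```python
import math

def bits_per_character(alphabet_size: int) -> int:
    """Fixed-length encoding: smallest n with 2**n >= alphabet_size."""
    return math.ceil(math.log2(alphabet_size))

# The examples from the previous slide:
print(bits_per_character(2))   # 1 bit for {A, B}
print(bits_per_character(4))   # 2 bits for {A, B, C, D}
print(bits_per_character(6))   # 3 bits for {A, B, C, D, E, F}
```

Note the waste in the six-character case: 3 bits give 8 patterns, so two codewords (110 and 111) go unused.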
Can we do better?

Better solution: if the alphabet is very small, we might use run-length encoding.

Taking a step back: why do we need compression?
the volume of image and video data being created is huge and needs to be reduced
image data from a digital camera today: 1k by 1.5k is common = 1.5 Mbytes
need 2k by 3k to equal a 35mm slide = 6 Mbytes
video at even the low resolution of 512 by 512, with 3 bytes per pixel,
30 frames/second
Compression basics: video data rate
23.6 Mbytes/second
2 hours of video = 169 gigabytes
mpeg-1 compresses 23.6 Mbytes/second down to 187 kbytes per second,
and 169 gigabytes down to 1.3 gigabytes
compression is essential for both storage and transmission of data
compression is very widely used:
jpeg, gif for single images
mpeg-1, -2, -4 for video sequences
zip for computer data
mp3 for sound
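The data rates above follow directly from the resolution figures on the previous slide:

```python
# Raw (uncompressed) video data rate from the slide's numbers:
# 512 x 512 pixels, 3 bytes per pixel, 30 frames per second.
bytes_per_second = 512 * 512 * 3 * 30
print(round(bytes_per_second / 1e6, 1))   # 23.6 Mbytes/second

# Two hours of raw video.
two_hours_bytes = bytes_per_second * 2 * 3600
print(round(two_hours_bytes / 1e9, 1))    # ~169.9 gigabytes

# mpeg-1's 187 kbytes/second implies a compression ratio of roughly:
print(round(bytes_per_second / 187e3))    # ~126x
```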
Basics of compression

character = basic data unit in the input stream -- may represent a byte, a bit, etc.
strings = sequences of characters
encoding = compression; decoding = decompression
codeword = data element used to represent input characters or character strings
codetable = list of codewords
Codewords

The encoder/compressor takes characters/strings as input and uses the codetable to decide which codewords to produce.
The decoder/decompressor takes codewords as input and uses the same codetable to decide which characters/strings to produce.

[Diagram: Input Data Stream -> Encoder -> Data Storage or Transmission -> Decoder -> Output Data Stream]
Basic definitions

compression ratio = size of original data / size of compressed data
generally, the higher the compression ratio the better
lossless compression: output data is exactly the same as the input data
essential for encoding computer-processed data
lossy compression: output data is not the same as the input data
acceptable for data that is only viewed or heard
Lossless versus lossy

the human visual system is less sensitive to high-frequency losses and to losses in color
lossy compression is acceptable for visual data
the degree of loss is usually a parameter of the compression algorithm
tradeoff: loss versus compression
higher compression => more loss
lower compression => less loss

Symmetric versus asymmetric

symmetric: encoding time == decoding time
essential for real-time applications (e.g. video or audio on demand)
asymmetric: encoding time >> decoding time
ok for write-once, read-many situations
Entropy encoding

compression that does not take into account what is being compressed
normally also lossless
most common types of entropy encoding:
run-length encoding
Huffman encoding
modified Huffman (fax)
Lempel-Ziv

Source encoding

takes into account the type of data (e.g. visual)
normally lossy, but can also be lossless
most common types in use:
JPEG, GIF = single images
MPEG = sequence of images (video)
MP3 = sound sequence
Run-length encoding

one of the simplest and earliest types of compression
takes advantage of repeating data (called runs)
runs are represented by a count along with the original data
e.g. AAAABB => 4A2B
do you run-length encode a single character?
no; use a special prefix character to mark the start of runs
runs are represented as prefix + count + character
a literal occurrence of the prefix char is itself encoded as a run of length 1
want a prefix char that is not too common in the data
run-length encoding is lossless and has fixed-length codewords
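A minimal sketch of the prefix scheme described above (the backslash prefix and the minimum run length of 3 are my assumptions for illustration, not part of the slide):

```python
PREFIX = "\\"  # assumed escape character; want one that is rare in the data

def rle_encode(s: str, min_run: int = 3) -> str:
    """Replace each run of min_run or more repeats with PREFIX+count+char.
    A literal occurrence of PREFIX is itself emitted as a run of length 1."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                       # scan to the end of the run
        run = j - i
        if run >= min_run or s[i] == PREFIX:
            out.append(f"{PREFIX}{run}{s[i]}")
        else:
            out.append(s[i] * run)       # short runs stay as-is
        i = j
    return "".join(out)

print(rle_encode("AAAABB"))   # \4ABB  (the 2-char B run is below min_run)
```

A matching decoder would need a convention for where the count ends (e.g. fixed-width counts); that detail is omitted here.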
Run-length encoding

works best for images with a solid background
a good example of such an image is a cartoon
does not work as well for natural images
does not work well for English text
however, it is almost always part of a larger compression system
What if
the string we encode doesn't use all the letters in the alphabet?
Then fewer bits per character may suffice,
but we also need to store / transmit the mapping from encodings to characters,
which is typically close to the size of the alphabet.
Huffman Encoding

Assumes encoding on a per-character basis.
Observation: assigning shorter codes to frequently used characters can result in overall shorter encodings of strings; this requires assigning longer codes to rarely used characters.
Problem: when decoding, we need to know how many bits to read off for each character.
Solution: choose an encoding that ensures that no character's encoding is a prefix of any other character's encoding. An encoding tree has this property.
assume we know the frequency of each character in the input stream
then encode each character as a variable-length bit string, with the length inversely proportional to the character's frequency
variable-length codewords are used; an early example is Morse code
Huffman produced an algorithm for assigning codewords optimally
input = probabilities of occurrence of each input character (frequencies of occurrence)
output = a binary tree
each leaf node is an input character
each branch carries a zero or a one bit
the codeword for a leaf is the concatenation of the bits on the path from the root to the leaf
a codeword is a variable-length bit string
gives a very good (optimal) compression ratio
Huffman encoding

Basic algorithm:
Mark all characters as free tree nodes.
While there is more than one free node:
  Take the two nodes with the lowest frequency of occurrence.
  Create a new tree node with these nodes as children and with frequency equal to the sum of their frequencies.
  Remove the two children from the free node list.
  Add the new parent to the free node list.
The last remaining free node is the root of the binary tree used for encoding/decoding.
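The steps above can be sketched in Python, using a heap as the free-node list. The counter used for tie-breaking is an implementation choice, so the exact bit patterns can differ from a given figure when frequencies tie, but the code lengths are still optimal.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Repeatedly merge the two free nodes of lowest frequency
    until only the root remains, then read codes off the tree."""
    tick = count()  # tie-breaker so heapq never compares tree nodes
    heap = [(f, next(tick), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    codes = {}
    def walk(node, path):
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], path + "0")    # left branch = 0
            walk(node[1], path + "1")    # right branch = 1
        else:                            # leaf: a character
            codes[node] = path or "0"    # single-symbol edge case
    walk(heap[0][2], "")
    return codes

# Frequencies from the tree-building example later in the slides:
print(huffman_codes({"A": 3, "T": 4, "R": 4, "E": 5}))
```

With these frequencies every character ends up with a 2-bit code, matching the balanced tree the slides construct.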
A Huffman Encoding Tree

[Figure: binary tree with root weight 21. The left child is an internal node of weight 12, whose children are a node of weight 5 (leaves A:3 and T:2) and a node of weight 7 (leaves R:3 and N:4). The right child is the leaf E:9. Left branches are labeled 0, right branches 1.]
[Figure: the same tree, with the root-to-leaf paths read off as codewords:]
A 000
T 001
R 010
N 011
E 1
Weighted path length

A 000
T 001
R 010
N 011
E 1

Weighted path length = Len(code(A)) * f(A) + Len(code(T)) * f(T) + Len(code(R)) * f(R) + Len(code(N)) * f(N) + Len(code(E)) * f(E)
= (3 * 3) + (3 * 2) + (3 * 3) + (3 * 4) + (1 * 9)
= 9 + 6 + 9 + 12 + 9 = 45
(using the frequencies A=3, T=2, R=3, N=4, E=9)

Claim (proof in text): no other encoding can result in a shorter weighted path length.
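The arithmetic above can be checked in a few lines:

```python
# Frequencies and codewords from the weighted path length slide.
freqs = {"A": 3, "T": 2, "R": 3, "N": 4, "E": 9}
codes = {"A": "000", "T": "001", "R": "010", "N": "011", "E": "1"}

# Weighted path length = sum over characters of (code length * frequency).
weighted_path = sum(len(codes[ch]) * freqs[ch] for ch in freqs)
print(weighted_path)   # 45
```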
Building the Huffman Tree

Free nodes at the start: A:3, T:4, R:4, E:5

Step 1: combine the two lowest-frequency nodes, A (3) and T (4), into a new node of weight 7.
Step 2: combine the next two lowest, R (4) and E (5), into a new node of weight 9.
Step 3: combine the remaining free nodes, 7 and 9, into the root of weight 16.
Step 4: label left branches 0 and right branches 1; reading the root-to-leaf paths gives the codewords
A = 00, T = 01, R = 10, E = 11
Huffman example

a series of colors on an 8 by 8 screen
colors are red, green, cyan, blue, magenta, yellow, and black
sequence is
rkkkkkkk gggmcbrr
kkkrrkkk bbbmybbr
kkrrrrgg gggggggr
kkbcccrr grrrrgrr
Another Huffman example

Color        Frequency
Black (K)    19
Red (R)      17
Green (G)    16
Blue (B)      5
Cyan (C)      4
Magenta (M)   2
Yellow (Y)    1
Another Huffman Example
Red = 00 Blue = 111 Magenta = 11010
Black = 01 Cyan = 1100 Yellow = 11011
Green = 10
Fixed versus variable length codewords

run-length codewords are fixed length
Huffman codewords are variable length
length inversely proportional to frequency
all variable-length compression schemes must have the prefix property
one code cannot be the prefix of another
the binary tree structure guarantees that this is the case (a leaf node is a leaf node!)
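The prefix property can be checked mechanically. Here is a sketch run against the color codes from the example above (the pairwise scan is quadratic, which is fine for small code tables):

```python
from itertools import permutations

# Codewords from the color example.
codes = {"R": "00", "K": "01", "G": "10", "B": "111",
         "C": "1100", "M": "11010", "Y": "11011"}

def prefix_free(codewords):
    """True if no codeword is a proper prefix of another."""
    return not any(a != b and b.startswith(a)
                   for a, b in permutations(codewords, 2))

print(prefix_free(codes.values()))        # True
print(prefix_free(["0", "01", "11"]))     # False: "0" is a prefix of "01"
```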
Huffman encoding

advantages
maximum compression ratio, assuming correct probabilities of occurrence
easy to implement and fast
disadvantages
needs two passes for both encoder and decoder:
one to create the frequency distribution
one to encode/decode the data
can avoid this by sending the tree (takes time) or by having unchanging frequencies

Modified Huffman encoding

if we know the frequencies of occurrence, then Huffman works very well
consider the case of a fax: mostly long white spaces with short bursts of black
do the following:
run-length encode each string of bits on a line
Huffman encode these run-length codewords
use a predefined frequency distribution
combination: run length, then Huffman
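Stage one of this pipeline, extracting the runs from a scan line, can be sketched as follows (the Huffman stage would then code these run lengths against a predefined table, as in the fax standards; that table is not reproduced here):

```python
from itertools import groupby

def run_lengths(bits):
    """Collapse a scan line into (pixel value, run length) pairs.
    Stage two would Huffman-code these run lengths using a
    predefined frequency distribution."""
    return [(value, len(list(group))) for value, group in groupby(bits)]

# A fax-like line: long white runs (0) broken by short black bursts (1).
line = "0" * 20 + "111" + "0" * 40 + "1"
print(run_lengths(line))   # [('0', 20), ('1', 3), ('0', 40), ('1', 1)]
```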