15-211 fundamental data structures and algorithms aleks nanevski february 10, 2004 based on a...

15-211 Fundamental Data Structures and Algorithms

Aleks Nanevski
February 10, 2004

based on a lecture by Peter Lee

LZW Compression

Last Time…

Problem: data compression

Convert a string into a shorter string.
Lossless – represents exactly the same information.
Lossy – approximates the original information.

Uses of compression:
Images over the web: JPEG
Music: MP3
General-purpose: ZIP, GZIP, JAR, …

Huffman trees

Huffman’s algorithm

Huffman’s algorithm gives the optimal prefix code.

For a nice online demo, see http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/huffman.html


Huffman compression

Huffman trees provide a straightforward method for file compression.
1. Scan the file and compute frequencies
2. Build the code tree
3. Write code tree to the output file as a header
4. Scan input, encode, and write into the output file
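Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the course's implementation: the function name huffman_codes and the dictionary-of-codes representation are invented here, and ties between equal frequencies are broken arbitrarily, so other equally short codes are possible.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each symbol of text to its Huffman bit string."""
    freq = Counter(text)                    # step 1: compute frequencies
    if not freq:
        return {}
    if len(freq) == 1:                      # degenerate one-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:                    # step 2: build the code tree
        w1, _, left = heapq.heappop(heap)   # the two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
encoded = "".join(codes[s] for s in "aaaabbc")   # 10 bits in total
```

Instead of an explicit tree, each heap entry carries the codes of the symbols below it; merging two entries prepends one bit, which is the same as hanging both subtrees under a new parent.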

Huffman decompression

Read the header in the compressed file, and build the code tree

Read the rest of the file, decode using the tree

Write to output

Beating Huffman

How about doing better than Huffman!

Impossible!
Huffman’s algorithm gives the optimal prefix code!

Right.
But who says we have to use a prefix code?

Example

Suppose we have a file containing
abcdabcdabcdabcdabcdabcd…abcdabcd

This could be expressed very compactly as
abcd^1000

Dictionary-Based Compression

Dictionary-based methods

Here is a simple idea:
Keep track of “words” that we have seen, and replace them with a code number when we see them again.

The code is typically shorter than the word

We can maintain dictionary entries (word, code) and make additions to the dictionary as we read the input file.

Lempel & Ziv (1977/78)

Fred Hacker’s algorithm…

Fred now knows what to do…

Create the dictionary:

( <the-whole-file>, 1 )

Transmit 1, done.

Right?

Fred’s algorithm provides excellent compression, but…


…the receiver does not know what is in the dictionary!
And sending the dictionary is the same as sending the entire uncompressed file

Thus, we can’t decompress the “1”.

Hence…

…we need to build our dictionary in such a way that the receiver can rebuild the dictionary easily.

LZW Compression: The Binary Version

LZW = a variant of Lempel-Ziv compression, by Terry Welch (1984)

Maintaining a dictionary

We need a way of incrementally building up a dictionary during compression in such a way that…

…someone who wants to uncompress can “rediscover” the very same dictionary

And we already know that a convenient way to build a dictionary incrementally is to use a trie

Binary LZW

In this method, we build up binary tries
In a binary trie, each node has two children
In addition, we will add the following:
each left edge is marked 0
each right edge is marked 1
each leaf has a label from the set {0,…,n}

A binary trie

[Figure: a binary trie whose leaves are labeled 0 through 5; each left edge is marked 0 and each right edge is marked 1.]

Binary LZW: Compression

1. We start with a binary trie consisting of a root node and two children

left child labeled 0, and right labeled 1

2. We read the bits of the input file, and follow the trie

3. When a leaf is reached, we emit the label at the leaf

4. Then, add two new children to that leaf (converting it into an internal node)

Binary LZW: Compression, pt.2

5. The new left child takes the old label

6. The new right child takes a new label value that is one greater than the current maximum label value
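Steps 1 to 6 can be sketched in Python as follows. The names Node and binary_lzw_compress are made up for this illustration, and the input is assumed to be a string of the characters 0 and 1. The sketch raises an error when the input ends at an internal node, since the steps above do not say what to emit in that case.

```python
class Node:
    """Binary trie node: a leaf carries a label, an internal node has two children."""
    def __init__(self, label=None):
        self.label = label
        self.children = None        # None for a leaf, else [left, right]

def binary_lzw_compress(bits):
    """Compress a string of '0'/'1' characters into a list of leaf labels."""
    root = Node()
    root.children = [Node(0), Node(1)]      # step 1: root with leaves 0 and 1
    next_label = 2
    out = []
    node = root
    for b in bits:
        node = node.children[int(b)]        # step 2: follow the trie
        if node.children is None:           # step 3: reached a leaf
            out.append(node.label)
            # steps 4-6: left child keeps the old label, right child gets a new one
            node.children = [Node(node.label), Node(next_label)]
            node.label = None
            next_label += 1
            node = root                     # restart at the root
    if node is not root:
        raise ValueError("input ended in the middle of the trie")
    return out

print(binary_lzw_compress("10010110011"))   # [1, 0, 3, 4, 0, 2]
```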

Binary LZW: Compression example

Input: 10010110011

Initial dictionary: a root with two leaves, the left labeled 0 and the right labeled 1.

Step  Bits read  Code emitted  Output so far
1     1          1             1
2     0          0             10
3     01         3             103
4     011        4             1034
5     00         0             10340
6     11         2             103402

[Figures: the trie after each step. After every emitted code, the leaf just reached gets two children: the left child keeps the old label and the right child takes the next unused label (2, 3, 4, 5, 6, 7 in turn).]

Binary LZW output

So from the input
10010110011

we get output
103402

To represent this output, we can keep track of the number of labels n each time we emit a code, and use log2(n) bits (rounded up) for that code
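One possible reading of this remark, sketched in Python: before the i-th code is emitted the trie has i+1 labels, so that code is written with just enough bits to distinguish them. The function name pack_codes and its parameter initial_labels are assumptions made for this illustration. A decompressor can track the same count, so it knows how many bits to read each time.

```python
def pack_codes(codes, initial_labels=2):
    """Write each code with ceil(log2(n)) bits, n = number of labels at that moment."""
    pieces = []
    n = initial_labels                          # the trie starts with labels 0 and 1
    for code in codes:
        width = max(1, (n - 1).bit_length())    # equals ceil(log2(n)) for n >= 2
        pieces.append(format(code, "0{}b".format(width)))
        n += 1                                  # every emitted code adds one new label
    return pieces

print(pack_codes([1, 0, 3, 4, 0, 2]))
# ['1', '00', '11', '100', '000', '010']
```

For the running example this gives 14 bits in total, which is still longer than the 11 input bits.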

Binary LZW output

We started with input

10010110011

Encoded it as 103402, for which we get the bit sequence 001 000 011 100 000 010

This looks like an expansion instead of a compression

But what if we have a larger input, with more repeating sequences?

Try it!

Binary LZW output

One can also use Huffman compression on the output…

Binary LZW termination

Note that binary LZW has a serious problem, in that the input might end while we are in the middle of the trie (instead of at a leaf node)

This is a nasty problem
which is why we won’t use this binary method
But this is still good for illustration purposes…

Binary LZW: Uncompress

To uncompress, we need to read the compressed file and rebuild the same trie as we go along

To do this, we need to maintain the trie and also the maximum label value
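A minimal Python sketch of this idea follows. As a simplification chosen for this illustration, the trie is not stored as linked nodes: it is represented by a table from each leaf label to the bit path leading to that leaf. The name binary_lzw_decompress is likewise made up.

```python
def binary_lzw_decompress(codes):
    """Rebuild the bit string from a list of leaf labels."""
    paths = {0: "0", 1: "1"}    # leaf label -> bit path from the root
    next_label = 2              # one more than the maximum label so far
    out = []
    for code in codes:
        if code not in paths:
            raise ValueError("unknown label: {}".format(code))
        path = paths[code]
        out.append(path)                    # the bits the compressor consumed
        paths[code] = path + "0"            # left child keeps the old label
        paths[next_label] = path + "1"      # right child takes the new label
        next_label += 1
    return "".join(out)

print(binary_lzw_decompress([1, 0, 3, 4, 0, 2]))   # 10010110011
```

Because every split replaces one leaf by two, updating the table entry for the old label and adding one entry for the new label mirrors exactly what the compressor does to its trie.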

Binary LZW: Uncompress example

Input: 103402

Initial dictionary: a root with two leaves, the left labeled 0 and the right labeled 1.

Step  Code read  Bits output  Output so far
1     1          1            1
2     0          0            10
3     3          01           1001
4     4          011          1001011
5     0          00           100101100
6     2          11           10010110011

[Figures: the trie after each step, rebuilt exactly as during compression; it ends with leaves labeled 0 through 7.]

LZW Compression: The Byte Version

Byte method

The binary LZW method doesn’t really work
we show it for illustrative purposes

Instead, we use a slightly more complicated version that works on bytes or characters
We can think of each byte as a “character” in the range {0…255}

Byte method trie

Instead of a binary trie, we use a more general trie in which
each node can have up to n children (where n is the size of the alphabet), one for each byte/character
every node (not just the leaves) has an integer label from the set {0…m}, for some m
• except the root node, which has no label

Byte method LZW

We start with a trie that contains a root and n children
one child for each possible character
each child labeled 0…n−1

Then we compress as before, by walking down the trie
but, after emitting a code and growing the trie, we must start from the root’s child for c, where c is the character that caused us to grow the trie

LZW: Byte method example

Suppose our entire character set consists only of the four letters:
{a, b, c, d}

Let’s consider the compression of the string
baddad

Byte LZW: Compress example

Input: baddad

Initial dictionary: a:0 b:1 c:2 d:3

Step  Word matched  Code emitted  Output so far  Added to dictionary
1     b             1             1              ba:4
2     a             0             10             ad:5
3     d             3             103            dd:6
4     d             3             1033           da:7
5     ad            5             10335

[Figures: the trie after each step. The root has children a, b, c, d labeled 0, 1, 2, 3; the new nodes 4, 5, 6, 7 are added below b, a, d, d along edges a, d, d, a.]

Byte LZW output

So, the input
baddad

compresses to
10335

which again can be given in bit form, just like in the binary method…

…or compressed again using Huffman

Byte LZW: Uncompress example

The uncompress step for byte LZW is the most complicated part of the entire process, but is largely similar to the binary method


Input: 10335

Initial dictionary: a:0 b:1 c:2 d:3

Step  Code read  Word output  Output so far  Added to dictionary
1     1          b            b
2     0          a            ba             ba:4
3     3          d            bad            ad:5
4     3          d            badd           dd:6
5     5          ad           baddad         da:7

[Figures: the trie after each step, growing exactly as it did during compression.]

LZW Byte method: An alternative presentation

Getting off the ground

Suppose we want to compress a file containing only letters a, b, c and d.

It seems reasonable to start with a dictionary

a:0 b:1 c:2 d:3

At least we can then deal with the first letter.

And the receiver knows how to start.

Growing pains

Now suppose the file starts like so:

a b b a b b …

We scan the a, look it up and output a 0.

After scanning the b, we have seen the word ab. So, we add it to the dictionary

a:0 b:1 c:2 d:3 ab:4

Growing pains

We already scanned the first b.

a b b a b b …

Then we get another b.

We output a 1 for the first b, and add bb to the dictionary

a:0 b:1 c:2 d:3 ab:4 bb:5

So?

Right, so far zero compression.

We already scanned the second b.

a b b a b b …

After scanning a, we output 1 for the b, and put ba in the dictionary

… d:3 ab:4 bb:5 ba:6

Still zero compression.

But now…

We already scanned a.

a b b a b b …

We scan the next b, and ab:4 is in the dictionary.

We scan the next b, output 4, and put abb into the dictionary.

… d:3 ab:4 bb:5 ba:6 abb:7

We got compression, because 4 is shorter than ab.

We already scanned the last b

a b b a b b …

Suppose the input continues

a b b a b b b b a …

We scan the next b, and bb:5 is in the dictionary

We scan the next b, output 5, and put bbb into the dictionary

… ab:4 bb:5 ba:6 abb:7 bbb:8

And so on

More Hits

As our dictionary grows, we are able to replace longer and longer blocks by short code numbers.

a b b a b b b b a …

0 1 1 4 5 6

And we increase the dictionary at each step by adding another word.

Summary

We scan a sequence of symbols

a1 a2 a3 … ak

where each prefix is in the dictionary.

We stop when we fall out of the dictionary:

a1 a2 a3 … ak b

Summary (cont’d)

We output the code for a1 a2 a3 …. ak and

put a1 a2 a3 …. ak b into the dictionary.

Then we set

a1 = b

And start all over.

More importantly

Since we extend our dictionary in such a simple way, it can be easily reconstructed on the other end.

Start with the same initialization, then

Read one code number after the other, look up each one in the dictionary, and extend the dictionary as you go along.

Sort of

Let's take a closer look at an example.

Assume alphabet {a,b,c}.

The code for aabbaabb is 0 0 1 1 3 5.

The decoding starts with dictionary D:

0:a, 1:b, 2:c

Moving along

The first 4 code words are already in D.

0 0 1 1 3 5

and produce output a a b b.

As we go along, we extend D:

0:a, 1:b, 2:c, 3:aa, 4:ab, 5:bb

For the code numbers 3 5, get

a a b b a a b b

Done

We have also added to D:

6:ba, 7:aab

But these entries are never used.

Everything is easy, since there is already an entry in D for each code number when we encounter it.

Is this it?

Unfortunately, no.

It may happen that we run into a code word without having an appropriate entry in D.

But, it can only happen in very special circumstances, and we can manufacture the missing entry.

A Bad Run

Consider input

a a b b b a a ==> 0 0 1 5 3

After reading 0 0 1, we output

a a b

and extend D with codes for aa and ab

0:a, 1:b, 2:c, 3:aa, 4:ab

Disaster

We have read 0 0 1 from the input

0 0 1 5 3

The dictionary is

0:a, 1:b, 2:c, 3:aa, 4:ab

The next code number to read is 5, but it’s not in D.

How could this have happened?

Can we recover?

… narrowly averted

This problem only arises when, on the compressor end:

• the input contains a substring

… sωsωs …

• the compressor read sω, output code c for sω, and added a new entry n: sωs to the dictionary, where n is the next unused code number.

• Here s is a single symbol, but ω is a (possibly empty) word.

… narrowly averted (pt. 2)

On the decompressor end, D contains

c: sω

• but does not contain n: sωs

• the decompressor has already output

x = sω

and is now looking at the unknown code number n.

… narrowly averted (pt. 3)

But then the fix is to output

x + first(x)

where x is the last decompressed word, and first(x) is the first symbol of x.

Because x = sω was already output, we get the required

sωsωs

We also update the dictionary to contain the new entry x + first(x) = sωs.

Example

In our example we have read 0 0 1 from the input

0 0 1 5 3

The last decompressed word is b, and the next code number to read is 5. Thus

• s = b

• ω = empty

• The next word to output and add to D is

sωs = bb

Summary

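Let x be the last decompressed word.

Ordinarily, D contains a word y matching the input code number.

We output y and extend D with

x + first(y)

But sometimes the compressor used a new entry immediately after adding it, so the code number is not in D yet.

Then it must be that x = sω (a symbol s followed by a possibly empty word ω), and we output and add to D

x + first(x) = sωs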

Example (extended)

Codes 0 0 1 5 3 6 7 9 5 decode to aabbbaabbaaabaababb

(+ marks a code already in D, - marks a code not yet in D)

Input  In D?  Output  Add to D
0             a
0      +      a       3:aa
1      +      b       4:ab
5      -      bb      5:bb
3      +      aa      6:bba
6      +      bba     7:aab
7      +      aab     8:bbaa
9      -      aaba    9:aaba
5      +      bb      10:aabab

Pseudo Code: Compression

Initialize dictionary D to all words of length 1.

Read all input characters:

output code numbers from D,

extend D whenever a new word appears.

New code words: just an integer counter.

Less Pseudo

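initialize D;
c = nextchar;               // first input character
W = c;                      // current word, a string
while( c = nextchar ) {     // until the input is exhausted
    if( W+c is in D )       // still inside the dictionary: keep extending W
        W = W + c;
    else {                  // fell out of the dictionary
        output code(W); add W+c to D; W = c;
    }
}
output code(W)              // emit the code for the last word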
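The pseudo code above translates almost line by line into Python. In this sketch the name lzw_compress and the alphabet parameter (the characters that form the initial dictionary, in code order) are assumptions made for illustration, and D is a plain Python dict rather than a trie.

```python
def lzw_compress(text, alphabet):
    """Return the list of LZW code numbers for text over the given alphabet."""
    D = {ch: i for i, ch in enumerate(alphabet)}    # all words of length 1
    if not text:
        return []
    out = []
    W = text[0]
    for c in text[1:]:
        if W + c in D:              # still inside the dictionary
            W = W + c
        else:                       # fell out: emit W, remember W+c, restart at c
            out.append(D[W])
            D[W + c] = len(D)       # new code word: just an integer counter
            W = c
    out.append(D[W])                # code for the final word
    return out

print(lzw_compress("baddad", "abcd"))     # [1, 0, 3, 3, 5]
print(lzw_compress("aabbbaa", "abc"))     # [0, 0, 1, 5, 3]
```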

Pseudo Code: Decompression

Initialize dictionary D with all words of length 1.

Read all code numbers and

- output corresponding words from D,

- extend D at each step.

This time the dictionary is of the form

( integer, word )

Keys are integers, values words.

Less Pseudo

initialize D;

pc = nextcode; // first code number

x = word(pc); // corresponding word

output x;

First code number is easy: codes only a single symbol.

Remember as pc (previous code) and x (previous word).

More Less Pseudo

while ( c = nextcode ) {        // c is the next code number
    if ( c is in D ) {          // ordinary case
        y = word(c);
        ww = x + first(y);      // previous word plus the first symbol of y
        insert ww in D;
        output y;
    }
    else {

The hard case

    else {                      // c is not in D yet
        y = x + first(x);       // manufacture the missing entry
        insert y in D;
        output y;
    }
    pc = c;
    x = y;
}
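Put together, the decompression pseudo code can be sketched in Python like this. The name lzw_decompress and the alphabet parameter are assumptions for illustration. The sketch also adds one check that the pseudo code leaves implicit: a code number missing from D is accepted only if it is exactly the next code to be assigned.

```python
def lzw_decompress(codes, alphabet):
    """Rebuild the text from a list of LZW code numbers."""
    D = {i: ch for i, ch in enumerate(alphabet)}    # (integer, word) entries
    if not codes:
        return ""
    x = D[codes[0]]                 # first code: always a single symbol
    out = [x]
    for c in codes[1:]:
        if c in D:                  # ordinary case
            y = D[c]
            D[len(D)] = x + y[0]    # extend D with x + first(y)
        elif c == len(D):           # the hard case: entry not yet known
            y = x + x[0]            # manufacture it as x + first(x)
            D[c] = y
        else:
            raise ValueError("invalid code number: {}".format(c))
        out.append(y)
        x = y
    return "".join(out)

print(lzw_decompress([0, 0, 1, 5, 3, 6, 7, 9, 5], "abc"))
# aabbbaabbaaabaababb
```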

One more detail…

One detail remains: how to build the dictionary for compression (decompression is easy).

We need to be able to scan through a sequence of symbols and check if they form a prefix of a word already in the dictionary.

We use tries for dictionaries.

Tries!

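[Figure: a trie whose root has children a, b, c, d labeled 0, 1, 2, 3; b has a child a labeled 4, a has a child d labeled 5, and d has a child d labeled 6.]

Corresponds to dictionary

a:0 b:1 c:2 d:3 ba:4 ad:5 dd:6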

Tries

In the LZW situation, we can add the new word to the trie dictionary in O(1) steps after discovering that the string is no longer a prefix of a dictionary word.

Just add a new leaf to the last node touched.
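A minimal Python sketch of a trie-backed compressor, to make the constant-time insertion concrete. The names TrieNode and lzw_compress_trie are made up for this illustration, and children are kept in a Python dict, so each step costs one hash lookup.

```python
class TrieNode:
    """Trie node: code is the dictionary code of the word spelled by the path."""
    def __init__(self, code):
        self.code = code
        self.children = {}          # character -> TrieNode

def lzw_compress_trie(text, alphabet):
    root = TrieNode(None)           # the root has no label
    for i, ch in enumerate(alphabet):
        root.children[ch] = TrieNode(i)
    next_code = len(alphabet)
    out = []
    node = root                     # node spells the current word W
    for c in text:
        if c not in root.children:
            raise ValueError("character not in alphabet: {!r}".format(c))
        child = node.children.get(c)
        if child is not None:       # W+c is in the dictionary: walk down
            node = child
        else:                       # fell out of the dictionary
            out.append(node.code)
            node.children[c] = TrieNode(next_code)   # new leaf at the last node touched
            next_code += 1
            node = root.children[c]                  # restart at the root's child for c
    if node is not root:
        out.append(node.code)
    return out

print(lzw_compress_trie("baddad", "abcd"))   # [1, 0, 3, 3, 5]
```

Compared with keeping whole words as dictionary keys, the trie never rescans or copies the current word: the position in the trie is the word.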

LZW details

• In reality, one usually restricts the code words to be 12 or 16 bit integers.

• Hence, one may have to flush the dictionary every so often (i.e. proceed to compress the rest of the input with a freshly initialized dictionary).

• But we won’t bother with this.

LZW details

Lastly, LZW generates as output a stream of integers.

It makes perfect sense to try to compress these further, e.g., by Huffman.

Summary of LZW

LZW is an adaptive, dictionary-based compression method.

Encoding is easy in LZW, but uses a special data structure (trie).

Decoding is slightly more complicated, but requires no special data structures.
