Page 1

Algorithms in the Real World

Lempel-Ziv
Burrows-Wheeler
ACB

Page 2

Compression Outline
Introduction: Lossy vs. Lossless, Benchmarks, …
Information Theory: Entropy, etc.
Probability Coding: Huffman + Arithmetic Coding
Applications of Probability Coding: PPM + others
Lempel-Ziv Algorithms:
– LZ77, gzip
– LZ78, compress (not covered in class)
Other Lossless Algorithms: Burrows-Wheeler
Lossy algorithms for images: JPEG, MPEG, ...
Compressing graphs and meshes: BBK

Page 3

Lempel-Ziv Algorithms
LZ77 (Sliding Window)
– Variants: LZSS (Lempel-Ziv-Storer-Szymanski)
– Applications: gzip, Squeeze, LHA, PKZIP, ZOO
LZ78 (Dictionary Based)
– Variants: LZW (Lempel-Ziv-Welch), LZC
– Applications: compress, GIF, CCITT (modems), ARC, PAK

Traditionally LZ77 was better but slower, but the gzip version is almost as fast as any LZ78.

Page 4

LZ77: Sliding Window Lempel-Ziv

Dictionary and buffer “windows” are fixed length and slide with the cursor.

Repeat:
Output (o, l, c) where
– o = position (offset) of the longest match that starts in the dictionary (relative to the cursor)
– l = length of the longest match
– c = next char in buffer beyond the longest match
Advance window by l + 1

[Figure: the string a a c a a c a b c a b a b a c, with the cursor separating the dictionary (previously coded) on the left from the lookahead buffer on the right]

Page 5

LZ77: Example
Dictionary (size = 6), buffer (size = 4). The original slide highlighted the dictionary, the longest match, and the next character; here "|" marks the cursor at each step.

| a a c a a c a b c a b a a a c   (_,0,a)

a | a c a a c a b c a b a a a c   (1,1,c)

a a c | a a c a b c a b a a a c   (3,4,b)

a a c a a c a b | c a b a a a c   (3,3,a)

a a c a a c a b c a b a | a a c   (1,2,c)
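The encoding loop above can be sketched in Python. This is a minimal, brute-force illustration and not gzip's implementation: the function name `lz77_encode` and its parameters are made up for this example, the offset of a missing match is written as 0 where the slide writes `_`, and the match is capped so that a next character always exists.

```python
def lz77_encode(text, dict_size=6, buf_size=4):
    """Brute-force LZ77: emit (offset, length, next_char) triples."""
    out = []
    cursor, n = 0, len(text)
    while cursor < n:
        best_off, best_len = 0, 0
        # Leave room for the explicit next character after the match.
        max_len = min(buf_size, n - cursor - 1)
        # Try every start position inside the dictionary window.
        for start in range(max(0, cursor - dict_size), cursor):
            length = 0
            # The match may run past the cursor (overlapping copy).
            while length < max_len and text[start + length] == text[cursor + length]:
                length += 1
            if length > best_len:
                best_off, best_len = cursor - start, length
        out.append((best_off, best_len, text[cursor + best_len]))
        cursor += best_len + 1
    return out


print(lz77_encode("aacaacabcabaaac"))
```

With the slide's window sizes (dictionary 6, buffer 4) this reproduces the five triples of the example.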

Page 6

LZ77 Decoding
Decoder keeps the same dictionary window as the encoder.
For each message it looks it up in the dictionary and inserts a copy at the end of the string.
What if l > o? (Only part of the message is in the dictionary.)
E.g. dict = abcd, codeword = (2,9,e)

• Simply copy from left to right:
    for (i = 0; i < length; i++)
        out[cursor+i] = out[cursor-offset+i]
• Out = abcdcdcdcdcdce
• First character is in the dictionary, and each time a character is read, another is written, so we never run out of characters to read.
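A minimal Python sketch of the decoder, assuming the (offset, length, char) triples described earlier. The name `lz77_decode` and the `initial` parameter (used only to seed the dictionary for the slide's `abcd` example) are illustrative assumptions.

```python
def lz77_decode(triples, initial=""):
    """Rebuild the text from (offset, length, next_char) triples."""
    out = list(initial)
    for offset, length, ch in triples:
        start = len(out) - offset
        # Copy one character at a time, left to right, so that a match
        # longer than its offset can read characters it has just written.
        for i in range(length):
            out.append(out[start + i])
        out.append(ch)
    return "".join(out)


print(lz77_decode([(2, 9, "e")], initial="abcd"))  # abcdcdcdcdcdce
```

The character-by-character copy is what makes the l > o case work; a bulk slice copy would read past the end of the data written so far.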

Page 7

LZ77 Optimizations used by gzip
LZSS: Output one of the following two formats:
(0, position, length) or (1, char)
Uses the second format if length < 3.

a a c a a c a b c a b a a a c (1,a)

a a c a a c a b c a b a a a c (1,a)

a a c a a c a b c a b a a a c (1,c)

a a c a a c a b c a b a a a c (0,3,4)
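The two output formats can be sketched as follows. This is a brute-force illustration under assumed window sizes (dictionary 6, buffer 4, as in the earlier LZ77 example), not gzip's actual code; `lzss_encode` and `min_match` are names chosen for this sketch.

```python
def lzss_encode(text, dict_size=6, buf_size=4, min_match=3):
    """Emit (0, offset, length) for matches of at least min_match
    characters, and (1, char) for everything else."""
    out = []
    cursor, n = 0, len(text)
    while cursor < n:
        best_off, best_len = 0, 0
        max_len = min(buf_size, n - cursor)
        for start in range(max(0, cursor - dict_size), cursor):
            length = 0
            while length < max_len and text[start + length] == text[cursor + length]:
                length += 1
            if length > best_len:
                best_off, best_len = cursor - start, length
        if best_len >= min_match:
            out.append((0, best_off, best_len))  # match format
            cursor += best_len
        else:
            out.append((1, text[cursor]))        # literal format
            cursor += 1
    return out


print(lzss_encode("aacaacabcabaaac")[:4])
```

Unlike the plain LZ77 triple, a match carries no trailing character, so short matches cost only a literal.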

Page 8

Optimizations used by gzip (cont.)
1. Huffman code the positions, lengths, and chars
2. Non greedy: possibly use a shorter match so that the next match is better
3. Use a hash table to store the dictionary.
– Hash keys are all strings of length 3 in the dictionary window.
– Find the longest match within the correct hash bucket.
– Puts a limit on the length of the search within a bucket.
– Within each bucket, store in order of position to make deleting easier when the window moves.

Page 9

The Hash Table

Text (positions 7 to 21):
a a c a a c a b c a b a a a c

Hash buckets (key: positions):
…
a a c: 19, 10, 7
a c a: 11, 8
c a a: 9
c a b: 15, 12
…
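A small Python sketch of such a table, keyed on every string of length 3. It uses a plain dictionary of position lists rather than the hash chains a real compressor would keep; `build_trigram_table` and `first_pos` are names introduced here for illustration.

```python
from collections import defaultdict


def build_trigram_table(text, first_pos=0):
    """Map each 3-character string to the positions where it starts.

    Positions are appended in increasing order, so the oldest entries
    sit at the front of each bucket and are easy to drop when the
    window slides.
    """
    table = defaultdict(list)
    for i in range(len(text) - 2):
        table[text[i:i + 3]].append(first_pos + i)
    return table


table = build_trigram_table("aacaacabcabaaac", first_pos=7)
print(table["aac"], table["cab"])
```

Numbering the text from position 7, as on the slide, gives the same buckets as the figure (listed there with the most recent position first).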

Page 10

Theory behind LZ77
The Sliding Window Lempel-Ziv Algorithm is Asymptotically Optimal, A. D. Wyner and J. Ziv, Proceedings of the IEEE, Vol. 82, No. 6, June 1994.

Will compress long enough strings to the source entropy as the window size goes to infinity.

Special case of proof: Assume infinite dictionary window, characters generated one-at-a-time independently, look for matches of length one only.

If the i’th character in the alphabet occurs with probability $p_i$, the expected distance $E[o_i]$ to the nearest match is given by (like tossing a coin until heads appears)

$$E[o_i] = p_i + (1 - p_i)(1 + E[o_i])$$

Solving, we find that $E[o_i] = 1/p_i$, which is intuitive. If we can encode $o_i$ using ≈ $\log o_i$ bits, then the expected codeword length is

$$\approx \sum_i p_i \log \frac{1}{p_i}$$

which is the entropy of the single character distribution.
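The solving step can be written out explicitly. The short derivation below uses only the coin-tossing recurrence $E[o_i] = p_i + (1 - p_i)(1 + E[o_i])$: with probability $p_i$ the match is at distance 1, otherwise one step is spent and the search starts over.

```latex
\begin{align*}
E[o_i] &= p_i + (1 - p_i)\,(1 + E[o_i]) \\
       &= 1 + (1 - p_i)\,E[o_i] \\
p_i\,E[o_i] &= 1 \\
E[o_i] &= \frac{1}{p_i}
\end{align*}
```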

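Page 11

Logarithmic Length Encoding
How can we encode $o_i$ using ≈ $\log o_i$ bits?

Gamma code uses two parts: First $\lceil \log o_i \rceil$ is encoded in unary using $\lceil \log o_i \rceil$ bits. Then $o_i$ is encoded in binary notation using $\lceil \log o_i \rceil$ bits. Total length of codeword: $2 \lceil \log o_i \rceil$. Off by a factor of 2!

Improvement: Code $\lceil \log o_i \rceil$ in binary using $\lceil \log \lceil \log o_i \rceil \rceil$ bits instead, and first code $\lceil \log \lceil \log o_i \rceil \rceil$ in unary using $\lceil \log \lceil \log o_i \rceil \rceil$ bits. Total length of codeword: $2 \lceil \log \lceil \log o_i \rceil \rceil + \lceil \log o_i \rceil$.

Etc.: $2 \lceil \log \lceil \log \lceil \log o_i \rceil \rceil \rceil + \lceil \log \lceil \log o_i \rceil \rceil + \lceil \log o_i \rceil$ …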
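As an illustration of the two-part idea, here is a sketch of the standard Elias gamma code, assuming that is the gamma code the slide refers to. In this common formulation the unary part is a run of ⌊log2 n⌋ zeros and the binary part has ⌊log2 n⌋ + 1 bits, so the total is 2⌊log2 n⌋ + 1 bits, in line with the slide's approximate count. The function names are chosen for this example.

```python
def elias_gamma(n):
    """Encode an integer n >= 1 as a bit string: length prefix in
    unary (zeros), then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = bin(n)[2:]                    # binary digits, no leading zeros
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode a single gamma codeword from the front of a bit string."""
    zeros = 0
    while bits[zeros] == "0":              # count the unary prefix
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)


for n in (1, 5, 9):
    print(n, elias_gamma(n))
```

Because the binary part always starts with a 1, the decoder can tell where the unary prefix ends without any separator.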

Page 12

Theory behind LZ77
General case: for any n, there is a window size w (typically exponential in n) such that on average a match of size n is found, and the number of bits needed to encode the position and length of the match approaches the source entropy for substrings of length n:

$$H_n = \sum_{X \in A^n} p(X) \log \frac{1}{p(X)}$$

Optimal in the situation in which each character might depend on the previous n-1, even though n is unknown.

Uses logarithmic code for the position and match length. (Note that typically match length is short compared to position.)

Problem: a “long enough” window is really, really long.

Page 13

Comparison to Lempel-Ziv 78
Both LZ77 and LZ78 and their variants keep a “dictionary” of recent strings that have been seen.
The differences are:
– How the dictionary is stored (LZ78 is a trie)
– How it is extended (LZ78 only extends an existing entry by one character)
– How it is indexed (LZ78 indexes the nodes of the trie)
– How elements are removed

Page 14

Lempel-Ziv Algorithms Summary
Adapts well to changes in the file (e.g. a tar file with many file types within it).

Initial algorithms did not use probability coding and performed poorly in terms of compression. More modern versions (e.g. gzip) do use probability coding as a “second pass” and compress much better.

The algorithms are becoming outdated, but ideas are used in many of the newer algorithms.

Page 15

Compression Outline
Introduction: Lossy vs. Lossless, Benchmarks, …
Information Theory: Entropy, etc.
Probability Coding: Huffman + Arithmetic Coding
Applications of Probability Coding: PPM + others
Lempel-Ziv Algorithms: LZ77, gzip, compress, …
Other Lossless Algorithms:
– Burrows-Wheeler
– ACB
Lossy algorithms for images: JPEG, MPEG, ...
Compressing graphs and meshes: BBK

Page 16

Burrows-Wheeler
Currently near best “balanced” algorithm for text.
Breaks file into fixed-size blocks and encodes each block separately.
For each block:
– First sort each character by its full context. This is called the block sorting transform.
– Then use the move-to-front transform to encode the sorted characters.

The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.
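The slides do not spell out the move-to-front transform, so the following is a sketch of its usual definition: each character is replaced by its current index in a list of symbols, and that symbol is then moved to the front. The function names and the choice of initial symbol order (passed in as `alphabet`) are assumptions of this example.

```python
def move_to_front_encode(text, alphabet):
    """Replace each character by its index in a self-adjusting list."""
    symbols = list(alphabet)
    codes = []
    for ch in text:
        idx = symbols.index(ch)
        codes.append(idx)
        symbols.insert(0, symbols.pop(idx))  # move the symbol to the front
    return codes


def move_to_front_decode(codes, alphabet):
    """Invert the transform by replaying the same list updates."""
    symbols = list(alphabet)
    out = []
    for idx in codes:
        ch = symbols.pop(idx)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)


print(move_to_front_encode("oeecdd", "cdeo"))  # [3, 3, 0, 2, 3, 0]
```

Runs of the same character become runs of zeros, which is why the transform pairs well with block sorting: characters with similar contexts end up adjacent, and a later probability coder can exploit the many small indices.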

Page 17

Burrows Wheeler: Example
Let’s encode: decode
Context “wraps” around. Last char is most significant.
In the output, characters with similar contexts are near each other.

All rotations of input:

Context  Char
ecode    d
coded    e
odede    c
dedec    o
edeco    d
decod    e

Sort using context as key:

Context  Output
dedec    o
coded    e
decod    e
odede    c
ecode    d   ⇐ start
edeco    d
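The block sorting step can be sketched directly from this description: sort the positions of the input by their wrapped context, comparing the character nearest to the position first. This is a simple quadratic-memory illustration, not how production implementations sort; `bw_encode` is a name chosen here, and the start pointer is returned 0-indexed.

```python
def bw_encode(text):
    """Return (output column, index of the row holding the first char)."""
    n = len(text)

    def context_key(i):
        # Preceding n-1 characters, wrapping around, with the
        # immediately preceding character as the most significant.
        return [text[(i - k) % n] for k in range(1, n)]

    order = sorted(range(n), key=context_key)
    output = "".join(text[i] for i in order)
    start = order.index(0)
    return output, start


print(bw_encode("decode"))  # ('oeecdd', 4)
```

The result matches the sorted table for "decode": the output column reads o e e c d d, and the row marked as the start is the fifth one (index 4 when counting from 0).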

Page 18

Burrows Wheeler Decoding
Key Idea: Can construct entire sorted table from sorted column alone!
First: sorting the output gives last column of context:

Context  Output
c        o
d        e
d        e
e        c
e        d
o        d

Page 19

Burrows Wheeler Decoding
Now sort pairs in last column of context and output column to form last two columns of context:

Before:

Context  Output
c        o
d        e
d        e
e        c
e        d
o        d

After:

Context  Output
ec       o
ed       e
od       e
de       c
de       d
co       d

Page 20

Burrows Wheeler Decoding
Repeat until entire table is complete. Pointer to first character provides unique decoding.

Context  Output
dedec    o
coded    e
decod    e
odede    c
ecode    d   ⇐
edeco    d

Message was d in first position, preceded in wrapped fashion by ecode: decode.
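The table reconstruction can be sketched as a loop: append each output character to its row's known context suffix, then sort the resulting strings with the last character as the most significant. This is a deliberately naive illustration of the idea on these slides; `bw_decode_table` is a name chosen here and `start` is 0-indexed.

```python
def bw_decode_table(output, start):
    """Rebuild every row's context from the output column alone."""
    n = len(output)
    contexts = [""] * n
    for _ in range(n - 1):
        # context suffix + output char is a substring of the message; the
        # sorted collection gives the context suffixes one character longer.
        contexts = sorted(
            (ctx + ch for ctx, ch in zip(contexts, output)),
            key=lambda s: s[::-1],  # last character is most significant
        )
    # The start row's output is the first character; its context is the
    # rest of the message (which precedes it in wrapped fashion).
    return output[start] + contexts[start]


print(bw_decode_table("oeecdd", 4))  # decode
```

The first two passes of the loop produce the one-character and two-character context columns shown on the earlier decoding slides.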

Page 21

Burrows Wheeler Decoding
Optimization: Don’t really have to rebuild the whole context table.

Context  Output
dedec    o
coded1   e1
decod2   e2
odede1   c
ecode2   d1   ⇐
edeco    d2

What character comes after the first character, d1?

Just have to find d1 in last column of context and see what follows it: e1.

Observation: instances of the same character of output appear in the same order in the last column of context. (Proof is an exercise.)

Page 22

Burrows-Wheeler: Decoding

Output  Context  Rank
o       c        6
e       d        4
e       d        5
c       e        1
d       e        2
d       o        3

The “rank” is the position of a character if it were sorted using a stable sort.

Page 23

Burrows-Wheeler Decode
Function BW_Decode(In, Start, n)
    S = MoveToFrontDecode(In, n)
    R = Rank(S)
    j = Start
    for i = 1 to n do
        Out[i] = S[j]
        j = R[j]

Rank gives position of each char in sorted order.

Page 24

Decode Example

S    Rank(S)
o4   6
e2   4
e6   5
c3   1
d1   2   ⇐ Start
d5   3

Out: d1 e2 c3 o4 d5 e6
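A Python sketch of the rank-based decoder, using 0-based indices instead of the slide's 1-based ones. It assumes the input has already been move-to-front decoded, so the `MoveToFrontDecode` step is left out; the function names are chosen for this example.

```python
def rank(s):
    """Position each character would take under a stable sort of s."""
    order = sorted(range(len(s)), key=lambda i: s[i])  # sorted() is stable
    r = [0] * len(s)
    for pos, i in enumerate(order):
        r[i] = pos
    return r


def bw_decode(s, start):
    """Follow the rank pointers from the start row, n times."""
    r = rank(s)
    out = []
    j = start
    for _ in range(len(s)):
        out.append(s[j])
        j = r[j]
    return "".join(out)


print(rank("oeecdd"))          # [5, 3, 4, 0, 1, 2]
print(bw_decode("oeecdd", 4))  # decode
```

The ranks are the slide's values minus one, and the start index 4 corresponds to the fifth row, the one holding d1.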

Page 25

Overview of Text Compression
PPM and Burrows-Wheeler both encode a single character based on the immediately preceding context.

LZ77 and LZ78 encode multiple characters based on matches found in a block of preceding text.

Can you mix these ideas, i.e., code multiple characters based on immediately preceding context?
– BZ does this, but they don’t give details on how it works
– ACB also does this – close to BZ

Page 26

ACB (Associate Coder of Buyanovsky)
Keep dictionary sorted by context (the last character is the most significant)
• Find longest match of current context in context part of dictionary
• Find longest match of look-ahead buffer in contents part of dictionary
• Code
  • Distance between matches in the sorted order
  • Length of contents match
  • Next character that doesn’t match
• Shift look-ahead window, update context, contents

Has aspects of Burrows-Wheeler and LZ77

Page 27

ACB Example

Have seen “decode” so far. Sort (by context) all places in dictionary (contents) where a match for text in the look-ahead buffer could occur.

Context  Contents
(empty)  decode
dec      ode
d        ecode
decod    e
de       code
decode   (empty)
deco     de

Suppose current position is decode|odcoe

Best Matches:
– “decode” in context
– “ode” in contents (2 chars)

Output -4,2,c

Decoder can find best match in context, go from there.
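One step of this scheme can be sketched as follows. This is a simplified reading of the slide, not Buyanovsky's actual coder: the dictionary holds every split of the text seen so far, ties are broken by taking the first best row, and the look-ahead buffer is assumed to contain at least one character beyond the match. The names `acb_step` and `common_prefix_len` are made up for this example.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def acb_step(seen, lookahead):
    """Return (distance, length, next_char) for one coding step."""
    # Every split of the seen text into (context, contents), sorted by
    # context with the last character as the most significant.
    rows = sorted(
        ((seen[:i], seen[i:]) for i in range(len(seen) + 1)),
        key=lambda row: row[0][::-1],
    )
    # Row whose context shares the longest suffix with the current context.
    ctx_row = max(range(len(rows)),
                  key=lambda k: common_prefix_len(rows[k][0][::-1], seen[::-1]))
    # Row whose contents share the longest prefix with the look-ahead buffer.
    cnt_row = max(range(len(rows)),
                  key=lambda k: common_prefix_len(rows[k][1], lookahead))
    length = common_prefix_len(rows[cnt_row][1], lookahead)
    return cnt_row - ctx_row, length, lookahead[length]


print(acb_step("decode", "odcoe"))  # (-4, 2, 'c')
```

For the slide's position decode|odcoe the context match is the row "decode" and the contents match is the row "dec | ode", four rows earlier in sorted order, which gives the output -4,2,c.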