Lecture 29. Data Compression Algorithms

Upload: ivi
Posted on 17-Jan-2016

Page 1: Lecture 29.                          Data Compression Algorithms

Lecture 29. Data Compression Algorithms

Page 2

Recap

Commonly, algorithms are analyzed on the basis of a probability factor, such as the average case in linear search.

Amortized analysis is based neither on probability nor on a single operation.

There are three methods to find the amortized cost (the cost of a sequence of operations): the aggregate, accounting and potential methods.

In the aggregate method, the average cost per operation is the total cost of n operations divided by n.

In the accounting method, an overcharge assigned to an operation in a sequence is known as credit; that credit can be used later, when the amortized cost of an operation is less than its actual cost.

In the potential method, the idea is the same as in the accounting method, except that the stored work is associated with the data structure as a whole rather than with individual operations.

Page 3

What is Compression?

Compression basically exploits redundancy in the data:

Temporal - in 1D data, 1D signals, audio, etc.
Spatial - correlation between neighbouring pixels or data items.
Spectral - correlation between colour or luminance components. This uses the frequency domain to exploit relationships between frequencies of change in the data.
Psycho-visual - exploits perceptual properties of the human visual system.

Page 4

Compression can be categorised in two broad ways:

Lossless Compression: where data is compressed and can be reconstituted (uncompressed) without loss of detail or information. These are also referred to as bit-preserving or reversible compression systems.

Lossy Compression: where the aim is to obtain the best possible fidelity for a given bit-rate, or to minimize the bit-rate needed to achieve a given fidelity measure. Video and audio compression techniques are most suited to this form of compression.

Page 5

If an image is compressed, it clearly needs to be uncompressed (decoded) before it can be viewed or listened to. Some processing of the data may, however, be possible in encoded form.

Lossless compression frequently involves some form of entropy encoding and is based on information-theoretic techniques.

Lossy compression uses source encoding techniques that may involve transform encoding, differential encoding or vector quantisation.

Cont !!!

Page 6

Cont !!!

Page 7

Lossless Compression Algorithms (Repetitive Sequence Suppression)

Simple Repetition Suppression

If a series of n successive identical tokens appears in a sequence, we can replace these with a single token and a count of the number of occurrences. We usually need a special flag to denote when the repeated token appears.

For example:
89400000000000000000000000000000000
can be replaced with 894f32, where f is the flag for zero.

Compression savings depend on the content of the data.
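The scheme above can be sketched in Python. This is a minimal illustration, not a production codec; the function name is made up here, and 'f' as the zero flag follows the slide's example:

```python
def zero_suppress(s, flag="f"):
    """Replace each run of two or more '0' characters with flag + run length."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == "0":
            # measure the length of this run of zeros
            j = i
            while j < len(s) and s[j] == "0":
                j += 1
            run = j - i
            if run > 1:
                out.append(flag + str(run))
            else:
                out.append("0")  # a lone zero is cheaper left as-is
            i = j
        else:
            out.append(s[i])
            i += 1
    return "".join(out)
```

On the slide's example, `zero_suppress("894" + "0" * 32)` produces `"894f32"`.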

Page 8

Applications of this simple compression technique include:

– Suppression of zeros in a file (Zero Length Suppression)
– Silence in audio data, pauses in conversation, etc.
– Bitmaps
– Blanks in text or program source files
– Backgrounds in images
– Other regular image or data tokens

Cont !!!

Page 9

Run-length Encoding

This encoding method is frequently applied to images (or pixels in a scan line). It is a small compression component used in JPEG compression.

In this instance, sequences of image elements X1, X2, …, Xn are mapped to pairs (c1, l1), (c2, l2), …, (cn, ln), where ci represents the image intensity or colour and li the length of the ith run of pixels (not dissimilar to zero length suppression above).

Page 10

For example, the original sequence:
111122233333311112222
can be encoded as:
(1,4),(2,3),(3,6),(1,4),(2,4)

The savings are dependent on the data. In the worst case (random noise), the encoding is larger than the original file: 2 integers per run rather than 1 integer per element, if the data is represented as integers.
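A minimal Python sketch of run-length encoding and decoding (the function names are illustrative):

```python
def rle_encode(seq):
    """Map a sequence to (value, run_length) pairs."""
    runs = []
    for x in seq:
        if runs and runs[-1][0] == x:
            # extend the current run
            runs[-1] = (x, runs[-1][1] + 1)
        else:
            # start a new run
            runs.append((x, 1))
    return runs

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]
```

In the worst case every run has length 1, so the encoder emits two values per input element, which is exactly the 2-integers-versus-1 expansion noted above.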

Cont !!!

Page 11

Lossless Compression Algorithms (Pattern Substitution)

This is a simple form of statistical encoding.

Here we substitute a frequently repeating pattern (or patterns) with a code. The code is shorter than the pattern, giving us compression.

A simple pattern substitution scheme could employ a predefined code (for example, replace all occurrences of 'The' with the code '&').

Page 12

More typically, tokens are assigned according to the frequency of occurrence of patterns:

Count occurrences of tokens
Sort in descending order
Assign codes to the highest-count tokens

A predefined symbol table may be used, i.e. assign code i to token i.

However, it is more usual to dynamically assign codes to tokens. The entropy encoding schemes below basically attempt to decide the optimum assignment of codes to achieve the best compression.
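The count-sort-assign steps above can be sketched as follows; the function name and the pool of single-character codes are hypothetical choices for illustration:

```python
from collections import Counter

def assign_codes(tokens, symbols="!@#$%"):
    """Assign short single-character codes to the most frequent tokens,
    in descending order of occurrence count."""
    counts = Counter(tokens)
    # most_common sorts by count, descending
    ranked = [t for t, _ in counts.most_common(len(symbols))]
    return dict(zip(ranked, symbols))
```

For example, over the token stream "the cat and the dog and the bird", 'the' (3 occurrences) gets the first code and 'and' (2 occurrences) the second.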

Cont !!!

Page 13

Lossless Compression Algorithms (Entropy Encoding)

Lossless compression frequently involves some form of entropy encoding and is based on information-theoretic techniques; Shannon is the father of information theory.

Page 14

The Shannon-Fano Algorithm

This is a basic information theoretic algorithm. A simple example will be used to illustrate the algorithm:

Symbol:   A   B   C   D   E
Count:   15   7   6   6   5

Page 15

Encoding for the Shannon-Fano Algorithm:

A top-down approach:
1. Sort symbols according to their frequencies/probabilities, e.g., ABCDE.
2. Recursively divide into two parts, each with approximately the same total count.
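The two steps above can be sketched in Python. The function name and the (symbol, count) input format are illustrative assumptions:

```python
def shannon_fano(symbols):
    """Shannon-Fano coding. symbols: list of (symbol, count) pairs,
    pre-sorted in descending order of count. Returns a dict of bit strings."""
    codes = {s: "" for s, _ in symbols}

    def split(group):
        # recursively divide into two parts with approx. equal total counts
        if len(group) <= 1:
            return
        total = sum(c for _, c in group)
        best_i, best_diff = 1, None
        for i in range(1, len(group)):
            top = sum(c for _, c in group[:i])
            diff = abs(total - 2 * top)
            if best_diff is None or diff < best_diff:
                best_i, best_diff = i, diff
        for s, _ in group[:best_i]:
            codes[s] += "0"   # upper part gets a 0 bit
        for s, _ in group[best_i:]:
            codes[s] += "1"   # lower part gets a 1 bit
        split(group[:best_i])
        split(group[best_i:])

    split(symbols)
    return codes
```

On the table from the previous slide (A 15, B 7, C 6, D 6, E 5) this yields A 00, B 01, C 10, D 110, E 111.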

Cont !!!

Page 16

Introduction to LZW

As mentioned earlier, static coding schemes require some knowledge about the data before encoding takes place.

Universal coding schemes, like LZW, do not require advance knowledge and can build such knowledge on-the-fly.

LZW is the foremost technique for general purpose data compression due to its simplicity and versatility.

It is the basis of many PC utilities that claim to “double the capacity of your hard drive”

LZW compression uses a code table, with 4096 as a common choice for the number of table entries.

Page 17

Introduction to LZW (Cont !!!)

Codes 0-255 in the code table are always assigned to represent single bytes from the input file.

When encoding begins the code table contains only the first 256 entries, with the remainder of the table being blanks.

Compression is achieved by using codes 256 through 4095 to represent sequences of bytes.

As the encoding continues, LZW identifies repeated sequences in the data, and adds them to the code table.

Decoding is achieved by taking each code from the compressed file, and translating it through the code table to find what character or characters it represents.

Page 18

LZW Encoding Algorithm

Initialize table with single character strings
P = first input character
WHILE not end of input stream
    C = next input character
    IF P + C is in the string table
        P = P + C
    ELSE
        output the code for P
        add P + C to the string table
        P = C
END WHILE
output code for P
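The pseudocode can be turned into a small Python sketch (the function name is illustrative; the table maps strings to integer codes, with codes 0-255 for single characters as described earlier; assumes non-empty input):

```python
def lzw_encode(data):
    """LZW-encode a string, returning a list of integer codes."""
    table = {chr(i): i for i in range(256)}  # single-character strings
    next_code = 256
    result = []
    p = data[0]
    for c in data[1:]:
        if p + c in table:
            p = p + c                 # extend the current match
        else:
            result.append(table[p])   # output the code for P
            table[p + c] = next_code  # add P + C to the string table
            next_code += 1
            p = c
    result.append(table[p])           # output the final code
    return result
```

Encoding the string BABAABAAA yields the codes 66, 65, 256, 257, 65, 260, matching the worked example on the following slides.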

Page 19

Example 1: Compression using LZW

Example 1: Use the LZW algorithm to compress the string

BABAABAAA

Page 20

Example 1: LZW Compression Step 1

BABAABAAA    P = A    C = B

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA

Page 21

Example 1: LZW Compression Step 2

BABAABAAA    P = B    C = A

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB

Page 22

Example 1: LZW Compression Step 3

BABAABAAA    P = A    C = A

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA

Page 23

Example 1: LZW Compression Step 4

BABAABAAA    P = A    C = B

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA
257           AB             259        ABA

Page 24

Example 1: LZW Compression Step 5

BABAABAAA    P = A    C = A

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA
257           AB             259        ABA
65            A              260        AA

Page 25

Example 1: LZW Compression Step 6

BABAABAAA    P = AA    C = empty

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA
257           AB             259        ABA
65            A              260        AA
260           AA

Page 26

LZW Decompression

The LZW decompressor creates the same string table during decompression.

It starts with the first 256 table entries initialized to single characters.

The string table is updated for each code in the input stream, except the first one.

Decoding is achieved by reading codes and translating them through the code table as it is being built.

Page 27

LZW Decompression Algorithm

Initialize table with single character strings
OLD = first input code
output translation of OLD
WHILE not end of input stream
    NEW = next input code
    IF NEW is not in the string table
        S = translation of OLD
        S = S + C
    ELSE
        S = translation of NEW
    output S
    C = first character of S
    add OLD + C to the string table
    OLD = NEW
END WHILE
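The decompression pseudocode can likewise be sketched in Python (illustrative names; the decoder rebuilds the same table the encoder constructed, and the not-in-table branch handles codes the encoder has defined but the decoder has not yet seen):

```python
def lzw_decode(codes):
    """LZW-decode a list of integer codes back into a string."""
    table = {i: chr(i) for i in range(256)}  # single-character entries
    next_code = 256
    old = codes[0]
    s = table[old]
    out = [s]                 # output translation of OLD
    c = s[0]
    for new in codes[1:]:
        if new not in table:
            # NEW not yet in table: it must be translation of OLD plus
            # the first character of that translation
            s = table[old] + c
        else:
            s = table[new]
        out.append(s)         # output S
        c = s[0]              # C = first character of S
        table[next_code] = table[old] + c  # add OLD + C to the table
        next_code += 1
        old = new
    return "".join(out)
```

Decoding the codes 66, 65, 256, 257, 65, 260 from Example 1 recovers BABAABAAA, as the following slides trace step by step.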

Page 28

Example 2: LZW Decompression 1

Example 2: Use LZW to decompress the output sequence of Example 1:

<66><65><256><257><65><260>

Page 29

Example 2: LZW Decompression Step 1

<66><65><256><257><65><260>    Old = 65    S = A    New = 66    C = A

DECODER OUTPUT        STRING TABLE
string                codeword   string
B
A                     256        BA

Page 30

Example 2: LZW Decompression Step 2

<66><65><256><257><65><260>    Old = 256    S = BA    New = 256    C = B

DECODER OUTPUT        STRING TABLE
string                codeword   string
B
A                     256        BA
BA                    257        AB

Page 31

Example 2: LZW Decompression Step 3

<66><65><256><257><65><260>    Old = 257    S = AB    New = 257    C = A

DECODER OUTPUT        STRING TABLE
string                codeword   string
B
A                     256        BA
BA                    257        AB
AB                    258        BAA

Page 32

Example 2: LZW Decompression Step 4

<66><65><256><257><65><260>    Old = 65    S = A    New = 65    C = A

DECODER OUTPUT        STRING TABLE
string                codeword   string
B
A                     256        BA
BA                    257        AB
AB                    258        BAA
A                     259        ABA

Page 33

Example 2: LZW Decompression Step 5

<66><65><256><257><65><260>    Old = 260    S = AA    New = 260    C = A

DECODER OUTPUT        STRING TABLE
string                codeword   string
B
A                     256        BA
BA                    257        AB
AB                    258        BAA
A                     259        ABA
AA                    260        AA

Page 34

LZW: Some Notes

This algorithm compresses repetitive sequences of data well.

Since the codewords are 12 bits, any single encoded character will expand the data size rather than reduce it.

In this example, 72 bits of input are represented with 72 bits of output. After a reasonable string table is built, compression improves dramatically.

Advantages of LZW over Huffman:
– LZW requires no prior information about the input data stream.
– LZW can compress the input stream in one single pass.
– Another advantage of LZW is its simplicity, allowing fast execution.

Page 35

LZW: Limitations

What happens when the dictionary gets too large (i.e., when all the 4096 locations have been used)?

Here are some options usually implemented:

– Simply forget about adding any more entries and use the table as is.

– Throw the dictionary away when it reaches a certain size.

– Throw the dictionary away when it is no longer effective at compression.

– Clear entries 256-4095 and start building the dictionary again.

Some clever schemes rebuild a string table from the last N input characters.
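The last option (clear entries 256-4095 and start rebuilding) can be sketched as a small change to the encoder. This is a simplified illustration: a real implementation would also emit a clear code so the decoder knows when to reset its own table.

```python
def lzw_encode_reset(data, max_entries=4096):
    """LZW encoding that clears the dictionary (entries 256 and up)
    and starts rebuilding once all max_entries codes are used."""
    def fresh_table():
        return {chr(i): i for i in range(256)}

    table = fresh_table()
    next_code = 256
    result = []
    p = data[0]
    for c in data[1:]:
        if p + c in table:
            p = p + c
        else:
            result.append(table[p])
            if next_code < max_entries:
                table[p + c] = next_code
                next_code += 1
            else:
                # dictionary full: throw away the learned entries and rebuild
                table = fresh_table()
                next_code = 256
            p = c
    result.append(table[p])
    return result
```

With the default 4096-entry limit this behaves exactly like plain LZW on short inputs, since the table never fills.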

Page 36

Home Work

Use LZW to trace encoding the string ABRACADABRA.

Write a Java program that encodes a given string using LZW.

Page 37

Summary

Data compression is a technique to compress data represented in text, audio or image form.

Two important compress techniques are lossy and lossless compression.

LZW is the foremost technique for general purpose data compression due to its simplicity and versatility.

LZW compression uses a code table, with 4096 as a common choice for the number of table entries.

Page 38

In Next Lecture

In the next lecture, we will discuss P, NP-complete and NP-hard problems.
