Lecture 29. Data Compression Algorithms

Upload: ivi
Posted on 17-Jan-2016

Page 1: Lecture 29.                          Data Compression Algorithms

Lecture 29. Data Compression Algorithms

Page 2

Recap

Commonly, algorithms are analyzed on the basis of a probability factor, such as the average case in linear search.

Amortized analysis is based neither on probability nor on a single operation.

There are three methods to find the amortized cost (the cost of a sequence of operations): the aggregate, accounting and potential methods.

In the aggregate method, the average cost per operation is the total cost of n operations divided by n.

In the accounting method, an overcharge assigned to an operation in a sequence is known as credit; that credit can be used later, when the amortized cost of an operation is less than its actual cost.

In the potential method, the idea is the same as in the accounting method, except that the stored work is associated with the data structure as a whole rather than with individual operations.

Page 3

What is Compression?

Compression basically exploits redundancy in the data:

Temporal - in 1D data, 1D signals, audio, etc.
Spatial - correlation between neighbouring pixels or data items.
Spectral - correlation between colour or luminance components. This uses the frequency domain to exploit relationships between frequencies of change in the data.
Psycho-visual - exploits perceptual properties of the human visual system.

Page 4

Compression can be categorised in two broad ways:

Lossless Compression: where data is compressed and can be reconstituted (uncompressed) without loss of detail or information. These are also referred to as bit-preserving or reversible compression systems.

Lossy Compression: where the aim is to obtain the best possible fidelity for a given bit-rate, or to minimize the bit-rate needed to achieve a given fidelity measure. Video and audio compression techniques are most suited to this form of compression.

Page 5

If an image is compressed, it clearly needs to be uncompressed (decoded) before it can be viewed or listened to. Some processing of the data may, however, be possible in encoded form.

Lossless compression frequently involves some form of entropy encoding and is based on information-theoretic techniques.

Lossy compression uses source encoding techniques that may involve transform encoding, differential encoding or vector quantisation.

Cont !!!

Page 6

Cont !!!

Page 7

Lossless Compression Algorithms (Repetitive Sequence Suppression)

Simple Repetition Suppression

If a series of n successive identical tokens appears in a sequence, we can replace these with a single token and a count of the number of occurrences. We usually need a special flag to denote when the repeated token appears.

For example:
89400000000000000000000000000000000
can be replaced with 894f32, where f is the flag for zero.

Compression savings depend on the content of the data.
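The scheme above can be sketched in Python. This is a minimal illustration, not a production codec; the function name is made up here, and 'f' as the zero flag follows the slide's example:

```python
def zero_suppress(s, flag="f"):
    """Replace each run of two or more '0' characters with flag + run length."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == "0":
            # measure the length of this run of zeros
            j = i
            while j < len(s) and s[j] == "0":
                j += 1
            run = j - i
            if run > 1:
                out.append(flag + str(run))
            else:
                out.append("0")  # a lone zero is cheaper left as-is
            i = j
        else:
            out.append(s[i])
            i += 1
    return "".join(out)
```

On the slide's example, `zero_suppress("894" + "0" * 32)` produces `"894f32"`.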

Page 8

Applications of this simple compression technique include:

– Suppression of zeros in a file (Zero Length Suppression)
– Silence in audio data, pauses in conversation, etc.
– Bitmaps
– Blanks in text or program source files
– Backgrounds in images
– Other regular image or data tokens

Cont !!!

Page 9

Run-length Encoding

This encoding method is frequently applied to images (or pixels in a scan line). It is a small compression component used in JPEG compression.

In this instance, sequences of image elements X1, X2, …, Xn are mapped to pairs (c1, l1), (c2, l2), …, (cn, ln), where ci represents the image intensity or colour and li the length of the ith run of pixels (not dissimilar to zero length suppression above).

Page 10

For example, the original sequence:
111122233333311112222
can be encoded as:
(1,4),(2,3),(3,6),(1,4),(2,4)

The savings are dependent on the data. In the worst case (random noise), the encoding is larger than the original file: 2 integers per run rather than 1 integer per element, if the data is represented as integers.
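A minimal Python sketch of run-length encoding and decoding (the function names are illustrative):

```python
def rle_encode(seq):
    """Map a sequence to (value, run_length) pairs."""
    runs = []
    for x in seq:
        if runs and runs[-1][0] == x:
            # extend the current run
            runs[-1] = (x, runs[-1][1] + 1)
        else:
            # start a new run
            runs.append((x, 1))
    return runs

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]
```

In the worst case every run has length 1, so the encoder emits two values per input element, which is exactly the 2-integers-versus-1 expansion noted above.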

Cont !!!

Page 11

Lossless Compression Algorithms (Pattern Substitution)

This is a simple form of statistical encoding.

Here we substitute a frequently repeating pattern (or patterns) with a code. The code is shorter than the pattern, giving us compression.

A simple pattern substitution scheme could employ a predefined code (for example, replace all occurrences of 'The' with the code '&').

Page 12

More typically, tokens are assigned according to the frequency of occurrence of patterns:

Count occurrences of tokens
Sort in descending order
Assign codes to the highest-count tokens

A predefined symbol table may be used, i.e. assign code i to token i.

However, it is more usual to dynamically assign codes to tokens. The entropy encoding schemes below basically attempt to decide the optimum assignment of codes to achieve the best compression.
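The count-sort-assign steps above can be sketched as follows; the function name and the pool of single-character codes are hypothetical choices for illustration:

```python
from collections import Counter

def assign_codes(tokens, symbols="!@#$%"):
    """Assign short single-character codes to the most frequent tokens,
    in descending order of occurrence count."""
    counts = Counter(tokens)
    # most_common sorts by count, descending
    ranked = [t for t, _ in counts.most_common(len(symbols))]
    return dict(zip(ranked, symbols))
```

For example, over the token stream "the cat and the dog and the bird", 'the' (3 occurrences) gets the first code and 'and' (2 occurrences) the second.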

Cont !!!

Page 13

Lossless Compression Algorithms (Entropy Encoding)

Lossless compression frequently involves some form of entropy encoding and is based on information-theoretic techniques; Shannon is the father of information theory.

Page 14

The Shannon-Fano Algorithm

This is a basic information theoretic algorithm. A simple example will be used to illustrate the algorithm:

Symbol:   A   B   C   D   E
Count:   15   7   6   6   5

Page 15

Encoding for the Shannon-Fano Algorithm:

A top-down approach:
1. Sort symbols according to their frequencies/probabilities, e.g., ABCDE.
2. Recursively divide into two parts, each with approximately the same total count.
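The two steps above can be sketched in Python. The function name and the (symbol, count) input format are illustrative assumptions:

```python
def shannon_fano(symbols):
    """Shannon-Fano coding. symbols: list of (symbol, count) pairs,
    pre-sorted in descending order of count. Returns a dict of bit strings."""
    codes = {s: "" for s, _ in symbols}

    def split(group):
        # recursively divide into two parts with approx. equal total counts
        if len(group) <= 1:
            return
        total = sum(c for _, c in group)
        best_i, best_diff = 1, None
        for i in range(1, len(group)):
            top = sum(c for _, c in group[:i])
            diff = abs(total - 2 * top)
            if best_diff is None or diff < best_diff:
                best_i, best_diff = i, diff
        for s, _ in group[:best_i]:
            codes[s] += "0"   # upper part gets a 0 bit
        for s, _ in group[best_i:]:
            codes[s] += "1"   # lower part gets a 1 bit
        split(group[:best_i])
        split(group[best_i:])

    split(symbols)
    return codes
```

On the table from the previous slide (A 15, B 7, C 6, D 6, E 5) this yields A 00, B 01, C 10, D 110, E 111.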

Cont !!!

Page 16

Introduction to LZW

As mentioned earlier, static coding schemes require some knowledge about the data before encoding takes place.

Universal coding schemes, like LZW, do not require advance knowledge and can build such knowledge on-the-fly.

LZW is the foremost technique for general purpose data compression due to its simplicity and versatility.

It is the basis of many PC utilities that claim to “double the capacity of your hard drive”

LZW compression uses a code table, with 4096 as a common choice for the number of table entries.

Page 17

Introduction to LZW (Cont !!!)

Codes 0-255 in the code table are always assigned to represent single bytes from the input file.

When encoding begins the code table contains only the first 256 entries, with the remainder of the table being blanks.

Compression is achieved by using codes 256 through 4095 to represent sequences of bytes.

As the encoding continues, LZW identifies repeated sequences in the data, and adds them to the code table.

Decoding is achieved by taking each code from the compressed file, and translating it through the code table to find what character or characters it represents.

Page 18

LZW Encoding Algorithm

Initialize table with single character strings
P = first input character
WHILE not end of input stream
    C = next input character
    IF P + C is in the string table
        P = P + C
    ELSE
        output the code for P
        add P + C to the string table
        P = C
END WHILE
output code for P
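The pseudocode can be turned into a small Python sketch (the function name is illustrative; the table maps strings to integer codes, with codes 0-255 for single characters as described earlier; assumes non-empty input):

```python
def lzw_encode(data):
    """LZW-encode a string, returning a list of integer codes."""
    table = {chr(i): i for i in range(256)}  # single-character strings
    next_code = 256
    result = []
    p = data[0]
    for c in data[1:]:
        if p + c in table:
            p = p + c                 # extend the current match
        else:
            result.append(table[p])   # output the code for P
            table[p + c] = next_code  # add P + C to the string table
            next_code += 1
            p = c
    result.append(table[p])           # output the final code
    return result
```

Encoding the string BABAABAAA yields the codes 66, 65, 256, 257, 65, 260, matching the worked example on the following slides.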

Page 19

Example 1: Compression using LZW

Example 1: Use the LZW algorithm to compress the string

BABAABAAA

Page 20

Example 1: LZW Compression Step 1

BABAABAAA    P = A    C = B

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA

Page 21

Example 1: LZW Compression Step 2

BABAABAAA    P = B    C = A

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB

Page 22

Example 1: LZW Compression Step 3

BABAABAAA    P = A    C = A

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA

Page 23

Example 1: LZW Compression Step 4

BABAABAAA    P = A    C = B

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA
257           AB             259        ABA

Page 24

Example 1: LZW Compression Step 5

BABAABAAA    P = A    C = A

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA
257           AB             259        ABA
65            A              260        AA

Page 25

Example 1: LZW Compression Step 6

BABAABAAA    P = AA    C = empty

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA
257           AB             259        ABA
65            A              260        AA
260           AA

Page 26

LZW Decompression

The LZW decompressor creates the same string table during decompression.

It starts with the first 256 table entries initialized to single characters.

The string table is updated for each code in the input stream, except the first one.

Decoding is achieved by reading codes and translating them through the code table as it is being built.

Page 27

LZW Decompression Algorithm

Initialize table with single character strings
OLD = first input code
output translation of OLD
WHILE not end of input stream
    NEW = next input code
    IF NEW is not in the string table
        S = translation of OLD
        S = S + C
    ELSE
        S = translation of NEW
    output S
    C = first character of S
    add OLD + C to the string table
    OLD = NEW
END WHILE
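The decompression pseudocode can likewise be sketched in Python (illustrative names; the decoder rebuilds the same table the encoder constructed, and the not-in-table branch handles codes the encoder has defined but the decoder has not yet seen):

```python
def lzw_decode(codes):
    """LZW-decode a list of integer codes back into a string."""
    table = {i: chr(i) for i in range(256)}  # single-character entries
    next_code = 256
    old = codes[0]
    s = table[old]
    out = [s]                 # output translation of OLD
    c = s[0]
    for new in codes[1:]:
        if new not in table:
            # NEW not yet in table: it must be translation of OLD plus
            # the first character of that translation
            s = table[old] + c
        else:
            s = table[new]
        out.append(s)         # output S
        c = s[0]              # C = first character of S
        table[next_code] = table[old] + c  # add OLD + C to the table
        next_code += 1
        old = new
    return "".join(out)
```

Decoding the codes 66, 65, 256, 257, 65, 260 from Example 1 recovers BABAABAAA, as the following slides trace step by step.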

Page 28

Example 2: LZW Decompression 1

Example 2: Use LZW to decompress the output sequence of Example 1:

<66><65><256><257><65><260>

Page 29

Example 2: LZW Decompression Step 1

<66><65><256><257><65><260>    Old = 65    S = A    New = 66    C = A

DECODER OUTPUT        STRING TABLE
string                codeword   string
B
A                     256        BA

Page 30

Example 2: LZW Decompression Step 2

<66><65><256><257><65><260>    Old = 256    S = BA    New = 256    C = B

DECODER OUTPUT        STRING TABLE
string                codeword   string
B
A                     256        BA
BA                    257        AB

Page 31

Example 2: LZW Decompression Step 3

<66><65><256><257><65><260>    Old = 257    S = AB    New = 257    C = A

DECODER OUTPUT        STRING TABLE
string                codeword   string
B
A                     256        BA
BA                    257        AB
AB                    258        BAA

Page 32

Example 2: LZW Decompression Step 4

<66><65><256><257><65><260>    Old = 65    S = A    New = 65    C = A

DECODER OUTPUT        STRING TABLE
string                codeword   string
B
A                     256        BA
BA                    257        AB
AB                    258        BAA
A                     259        ABA

Page 33

Example 2: LZW Decompression Step 5

<66><65><256><257><65><260>    Old = 260    S = AA    New = 260    C = A

DECODER OUTPUT        STRING TABLE
string                codeword   string
B
A                     256        BA
BA                    257        AB
AB                    258        BAA
A                     259        ABA
AA                    260        AA

Page 34

LZW: Some Notes

This algorithm compresses repetitive sequences of data well.

Since the codewords are 12 bits, any single encoded character will expand the data size rather than reduce it.

In this example, 72 bits of input are represented with 72 bits of output. After a reasonable string table is built, compression improves dramatically.

Advantages of LZW over Huffman:
– LZW requires no prior information about the input data stream.
– LZW can compress the input stream in one single pass.
– Another advantage of LZW is its simplicity, allowing fast execution.

Page 35

LZW: Limitations

What happens when the dictionary gets too large (i.e., when all the 4096 locations have been used)?

Here are some options usually implemented:

– Simply forget about adding any more entries and use the table as is.

– Throw the dictionary away when it reaches a certain size.

– Throw the dictionary away when it is no longer effective at compression.

– Clear entries 256-4095 and start building the dictionary again.

Some clever schemes rebuild a string table from the last N input characters.
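The last option (clear entries 256-4095 and start rebuilding) can be sketched as a small change to the encoder. This is a simplified illustration: a real implementation would also emit a clear code so the decoder knows when to reset its own table.

```python
def lzw_encode_reset(data, max_entries=4096):
    """LZW encoding that clears the dictionary (entries 256 and up)
    and starts rebuilding once all max_entries codes are used."""
    def fresh_table():
        return {chr(i): i for i in range(256)}

    table = fresh_table()
    next_code = 256
    result = []
    p = data[0]
    for c in data[1:]:
        if p + c in table:
            p = p + c
        else:
            result.append(table[p])
            if next_code < max_entries:
                table[p + c] = next_code
                next_code += 1
            else:
                # dictionary full: throw away the learned entries and rebuild
                table = fresh_table()
                next_code = 256
            p = c
    result.append(table[p])
    return result
```

With the default 4096-entry limit this behaves exactly like plain LZW on short inputs, since the table never fills.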

Page 36

Home Work

Use LZW to trace encoding the string ABRACADABRA.

Write a Java program that encodes a given string using LZW.

Page 37

Summary

Data compression is a technique to compress data represented in text, audio or image form.

Two important compress techniques are lossy and lossless compression.

LZW is the foremost technique for general purpose data compression due to its simplicity and versatility.

LZW compression uses a code table, with 4096 as a common choice for the number of table entries.

Page 38

In Next Lecture

In the next lecture, we will discuss P, NP-complete and NP-hard problems.
