Dictionary Based Compression


Page 1: Dictionary Based Compression

CHAPTER 6

Compression Techniques

Objectives:
Able to perform data compression
Able to use different compression techniques

Page 2: Dictionary Based Compression

Introduction: What is Compression?

Data compression requires the identification and extraction of source redundancy. In other words, data compression seeks to reduce the number of bits used to store or transmit information. There is a wide range of compression methods, which can be so unlike one another that they have little in common except that they compress data.

The Need for Compression
In terms of storage, the capacity of a storage device can be effectively increased with methods that compress a body of data on its way to a storage device and decompress it when it is retrieved. In terms of communications, the bandwidth of a digital communication link can be effectively increased by compressing data at the sending end and decompressing data at the receiving end.

Page 3: Dictionary Based Compression

A Brief History of Data Compression

The late 1940s were the early years of Information Theory, when the idea of developing efficient new coding methods was just starting to be fleshed out. Ideas of entropy, information content and redundancy were explored. One popular notion held that if the probability of symbols in a message were known, there ought to be a way to code the symbols so that the message would take up less space.

The first well-known method for compressing digital signals is now known as Shannon-Fano coding. Shannon and Fano [~1948] simultaneously developed this algorithm, which assigns binary codewords to unique symbols that appear within a given data file.

While Shannon-Fano coding was a great leap forward, it had the unfortunate luck to be quickly superseded by an even more efficient coding system: Huffman coding.

Page 4: Dictionary Based Compression

A Brief History of Data Compression

Huffman coding [1952] shares most characteristics of Shannon-Fano coding. Huffman coding could perform effective data compression by reducing the amount of redundancy in the coding of symbols. It has been proven to be optimal among coding methods that assign a whole number of bits to each symbol.

In the last fifteen years, Huffman coding has largely been replaced by arithmetic coding. Arithmetic coding bypasses the idea of replacing an input symbol with a specific code. It replaces a stream of input symbols with a single floating-point output number. More bits are needed in the output number for longer, complex messages.

Page 5: Dictionary Based Compression

A Brief History of Data Compression

Terminology
• Compressor – software (or hardware) device that compresses data
• Decompressor – software (or hardware) device that decompresses data
• Codec – software (or hardware) device that compresses and decompresses data
• Algorithm – the logic that governs the compression/decompression process

Page 6: Dictionary Based Compression


Compression can be categorized in two broad ways:

Lossless compression recovers the exact original data after compression. It is mainly used for compressing database records, spreadsheets or word processing files, where exact replication of the original is essential.

Lossy compression will result in a certain loss of accuracy in exchange for a substantial increase in compression. It is more effective when used to compress graphic images and digitised voice, where losses outside visual or aural perception can be tolerated. Most lossy compression techniques can be adjusted to different quality levels, gaining higher accuracy in exchange for less effective compression.


Page 7: Dictionary Based Compression

Lossless Compression Algorithms:

Dictionary-based compression algorithms
Repetitive Sequence Suppression
Run-length Encoding*
Pattern Substitution
Entropy Encoding*
  The Shannon-Fano Algorithm
  Huffman Coding*
  Arithmetic Coding*

Page 8: Dictionary Based Compression

Dictionary-based compression algorithms

Dictionary-based compression algorithms use a completely different method to compress data. They encode variable-length strings of symbols as single tokens. The tokens form indices into a phrase dictionary. If the tokens are smaller than the phrases, they replace the phrases and compression occurs.

Suppose we want to encode the Oxford Concise English dictionary, which contains about 159,000 entries. Why not just transmit each word as an 18-bit number (18 bits suffice, since 2^17 = 131,072 < 159,000 ≤ 262,144 = 2^18)? Problems:

Too many bits per word, everyone needs the same dictionary, and it only works for English text.

Solution: find a way to build the dictionary adaptively.

Page 9: Dictionary Based Compression

Dictionary-based compression algorithms

Two dictionary-based compression techniques called LZ77 and LZ78 have been developed. LZ77 is a "sliding window" technique in which the dictionary consists of phrases found in a "window" into the previously seen text. LZ78 takes a completely different approach to building a dictionary: instead of taking phrases from a window into the text, LZ78 builds phrases up one symbol at a time, adding a new symbol to an existing phrase when a match occurs.
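To make the sliding-window idea concrete, here is a minimal LZ77-style sketch in Python (not the slides' own listing; the window and match limits are arbitrary illustration values). It emits (offset, length, next symbol) triples, searching the window of previously seen text for the longest match:

def lz77_compress(data, window=255, max_len=15):
    """Encode data as (offset, length, next_char) triples over a sliding window."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):   # candidate match starts inside the window
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        nxt = data[i + best_len] if i + best_len < len(data) else ""
        out.append((best_off, best_len, nxt))    # (how far back, how long, literal symbol)
        i += best_len + 1
    return out

print(lz77_compress("abcabcabcx"))
# [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (3, 6, 'x')]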

The LZW compression algorithm can be summarised as follows:
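The slide's own summary is not reproduced in this transcript, so the following is a minimal LZW encoder sketch in Python, assuming a dictionary initialised with all 256 single-character symbols and codes emitted as plain integers (the function name and example string are illustrative):

def lzw_compress(text):
    """LZW: grow the phrase dictionary adaptively, emitting one code per phrase."""
    dictionary = {chr(c): c for c in range(256)}   # start with all single characters
    next_code = 256
    phrase, output = "", []
    for ch in text:
        if phrase + ch in dictionary:              # keep extending the current phrase
            phrase += ch
        else:
            output.append(dictionary[phrase])      # emit code for the longest known phrase
            dictionary[phrase + ch] = next_code    # add the new, one-symbol-longer phrase
            next_code += 1
            phrase = ch
    if phrase:
        output.append(dictionary[phrase])
    return output

print(lzw_compress("ABABABA"))   # -> [65, 66, 256, 258]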

Page 10: Dictionary Based Compression

Example

Dictionary-based compression algorithms

The LZW decompression algorithm can be summarised as follows:
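Again only a sketch, assuming the same initial single-character dictionary as the encoder above: the decoder rebuilds each phrase from the incoming codes and handles the one special case where a code refers to the phrase still being defined.

def lzw_decompress(codes):
    """Rebuild the text, reconstructing the same dictionary the encoder built."""
    dictionary = {c: chr(c) for c in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    result = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                  # code not defined yet: it must be
            entry = previous + previous[0]     # the previous phrase plus its first symbol
        result.append(entry)
        dictionary[next_code] = previous + entry[0]   # mirror the encoder's new entry
        next_code += 1
        previous = entry
    return "".join(result)

print(lzw_decompress([65, 66, 256, 258]))   # -> "ABABABA"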

Page 11: Dictionary Based Compression

Example:

Dictionary-based compression algorithms: Problem

What if we run out of dictionary space?
Solution 1: keep track of unused entries and reuse them with an LRU (least recently used) policy.
Solution 2: monitor compression performance and flush the dictionary when performance is poor (see the sketch below).
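As a sketch of Solution 2 only (the sample size, threshold and bookkeeping here are assumptions, not the slides' scheme): track the recent output-to-input ratio and flush the adaptive dictionary when compression stops paying off.

class FlushPolicy:
    """Flush the dictionary when the recent compression ratio becomes poor."""
    def __init__(self, sample_bytes=1000, poor_ratio=0.95):
        self.sample_bits = sample_bytes * 8
        self.poor_ratio = poor_ratio
        self.in_bits = self.out_bits = 0

    def record(self, input_bits, output_bits):
        self.in_bits += input_bits
        self.out_bits += output_bits

    def should_flush(self):
        if self.in_bits < self.sample_bits:          # not enough evidence yet
            return False
        poor = self.out_bits / self.in_bits > self.poor_ratio
        self.in_bits = self.out_bits = 0             # start a fresh measurement window
        return poor

A real codec would also have to signal the flush to the decoder, for example with a reserved "clear" code.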

Page 12: Dictionary Based Compression

Repetitive Sequence Suppression

Fairly straightforward to understand and implement. Simplicity is their downfall: NOT the best compression ratios. Some methods have their applications, e.g. as a component of JPEG, or silence suppression.


If a series of n successive identical tokens appears, replace the series with one token and a count of the number of occurrences. Usually a special flag is needed to denote when the repeated token appears.

Example: 89400000000000000000000000000000000 can be replaced with 894f32, where f is the flag for zero.
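A minimal sketch of this scheme in Python, treating the data as a digit string and using 'f' as the zero flag exactly as in the example (the function name and the minimum run length are assumptions):

import re

def suppress_zero_runs(digits, flag="f", min_run=4):
    """Replace each run of at least min_run zeros with flag + run length."""
    return re.sub("0{%d,}" % min_run,
                  lambda m: flag + str(len(m.group())), digits)

print(suppress_zero_runs("894" + "0" * 32))   # -> 894f32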

Page 13: Dictionary Based Compression

How Much Compression?
Compression savings depend on the content of the data.

Applications of this simple compression technique include:

Suppression of zeros in a file (zero length suppression)
Silence in audio data, pauses in conversation, etc.
Bitmaps
Blanks in text or program source files
Backgrounds in images
Other regular image or data tokens

Repetitive Sequence Suppression

Run-length Encoding

Page 14: Dictionary Based Compression

Run-length Encoding

Page 15: Dictionary Based Compression

Run-length Encoding

Uncompressed:
Blue White White White White White White Blue White Blue White White White White White Blue etc.

Compressed:
1XBlue 6XWhite 1XBlue 1XWhite 1XBlue 4XWhite 1XBlue 1XWhite etc.
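A small run-length encoder sketch in Python, taking the uncompressed row above as input and mimicking the slide's countXtoken notation (the function name is illustrative):

from itertools import groupby

def run_length_encode(tokens):
    """Collapse each run of identical tokens into 'countXtoken'."""
    return " ".join(f"{len(list(run))}X{token}" for token, run in groupby(tokens))

row = ["Blue"] + ["White"] * 6 + ["Blue", "White", "Blue"] + ["White"] * 5 + ["Blue"]
print(run_length_encode(row))
# 1XBlue 6XWhite 1XBlue 1XWhite 1XBlue 5XWhite 1XBlue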


Page 16: Dictionary Based Compression


Run-length Encoding

Pattern Substitution

Page 17: Dictionary Based Compression

Entropy Encoding

Shannon-Fano Coding

To create a code tree according to Shannon and Fano, an ordered table is required giving the frequency of each symbol. The table is divided into two segments such that the upper and the lower segment have nearly the same sum of frequencies. This procedure is repeated on each segment until only single symbols are left.
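A minimal sketch of that procedure in Python (the function name and the frequency table in the usage line are assumptions, not the slides' example): the frequency-sorted symbols are split where the two halves' totals are most nearly equal, one half getting a 0 and the other a 1, recursively.

def shannon_fano(freqs):
    """Return {symbol: code} by recursively splitting a frequency-sorted list."""
    def build(items, prefix=""):
        if len(items) == 1:
            return {items[0][0]: prefix or "0"}
        total, running, split, best_diff = sum(f for _, f in items), 0, 1, None
        for i in range(1, len(items)):           # find the most balanced split point
            running += items[i - 1][1]
            diff = abs((total - running) - running)
            if best_diff is None or diff < best_diff:
                best_diff, split = diff, i
        codes = build(items[:split], prefix + "0")
        codes.update(build(items[split:], prefix + "1"))
        return codes
    ordered = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return build(ordered)

print(shannon_fano({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}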

Page 18: Dictionary Based Compression

The Shannon-Fano Algorithm

Page 19: Dictionary Based Compression

The Shannon-Fano Algorithm

Page 20: Dictionary Based Compression

The Shannon-Fano Algorithm

Example: Shannon-Fano Coding

Page 21: Dictionary Based Compression

Example: Shannon-Fano Coding

[Table: SYMBOL, FREQ, and the SUM and CODE values assigned at Steps 1-3 of the split.]

Page 22: Dictionary Based Compression

Example: Shannon-Fano Coding

Huffman Coding

Page 23: Dictionary Based Compression

Huffman Coding
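A minimal Huffman construction sketch in Python using the standard-library heapq (the frequency table is illustrative, not the slides' example): repeatedly merge the two least-frequent subtrees, prefixing 0 to the codes on one side and 1 on the other.

import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code by repeatedly merging the two least-frequent nodes."""
    tiebreak = count()                         # keeps tuple comparisons well-defined
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)      # the two smallest subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

print(huffman_codes({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
# 'A' gets a 1-bit code, the four rarer symbols get 3-bit codes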

Page 24: Dictionary Based Compression

Huffman Coding

Page 25: Dictionary Based Compression

Huffman Coding

Page 26: Dictionary Based Compression

Huffman Coding

Page 27: Dictionary Based Compression

Huffman Code: Example

Page 28: Dictionary Based Compression


Huffman Code: Example

Huffman Coding

Page 29: Dictionary Based Compression

Huffman Coding

Page 30: Dictionary Based Compression

Arithmetic Coding

Page 31: Dictionary Based Compression

Arithmetic Coding

Page 32: Dictionary Based Compression

Arithmetic Coding

Page 33: Dictionary Based Compression

Arithmetic Coding

Page 34: Dictionary Based Compression

Arithmetic Coding

Page 35: Dictionary Based Compression

Arithmetic Coding

Page 36: Dictionary Based Compression

Arithmetic Coding

How to translate a range into bits

Example:
1. BACA: low = 0.59375, high = 0.60937.
2. CAEE$: low = 0.33184, high = 0.3322.
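The symbol model behind those numbers is not reproduced in this transcript, so the sketch below only shows the standard interval-narrowing step with a made-up cumulative-probability table; with the slides' own model it would produce intervals like the ones listed above.

# Hypothetical model: each symbol owns a sub-interval of [0, 1).
MODEL = {"A": (0.0, 0.2), "B": (0.2, 0.5), "C": (0.5, 0.7),
         "E": (0.7, 0.9), "$": (0.9, 1.0)}

def arithmetic_range(message, model=MODEL):
    """Narrow [low, high) symbol by symbol; the final interval encodes the message."""
    low, high = 0.0, 1.0
    for symbol in message:
        span = high - low
        sym_low, sym_high = model[symbol]
        low, high = low + span * sym_low, low + span * sym_high
    return low, high

print(arithmetic_range("BACA"))   # roughly (0.23, 0.2324) under this made-up model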

Page 37: Dictionary Based Compression

Decimal

0.12345₁₀ = 1×10⁻¹ + 2×10⁻² + 3×10⁻³ + 4×10⁻⁴ + 5×10⁻⁵
          = 0.1 + 0.02 + 0.003 + 0.0004 + 0.00005

Binary

0.01010₂ = 0×2⁻¹ + 1×2⁻² + 0×2⁻³ + 1×2⁻⁴ + 0×2⁻⁵
         = 0.25 + 0.0625
         = 0.3125₁₀

Page 38: Dictionary Based Compression

Binary to decimal

What is the value of 0.01010101₂ in decimal?

0.1₂ = 0.5₁₀    0.01₂ = 0.25₁₀    0.001₂ = 0.125₁₀    0.0001₂ = 0.0625₁₀    0.00001₂ = 0.03125₁₀    ...

0.01010101₂ = 2⁻² + 2⁻⁴ + 2⁻⁶ + 2⁻⁸ = 0.25 + 0.0625 + 0.015625 + 0.00390625 = 0.33203125₁₀

Generating the codeword for the encoder, range [0.33184, 0.33220]:

BEGIN
  code = 0;
  k = 1;
  while ( value(code) < low )
  {
    assign 1 to the k-th binary fraction bit;
    if ( value(code) > high )
      replace the k-th bit by 0;
    k = k + 1;
  }
END
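A direct Python rendering of that loop (a sketch: value() becomes a running sum of the kept bits, and a bit limit guards against pathological ranges):

def generate_codeword(low, high, max_bits=32):
    """Shortest binary fraction 0.b1b2... whose value lands inside [low, high]."""
    bits, value, k = [], 0.0, 1
    while value < low and k <= max_bits:
        bit_value = 2.0 ** -k            # weight of the k-th fraction bit
        if value + bit_value > high:     # a 1 here would overshoot the range
            bits.append("0")
        else:                            # keep the 1 and move the value up
            bits.append("1")
            value += bit_value
        k += 1
    return "0." + "".join(bits)

print(generate_codeword(0.33184, 0.33220))   # -> 0.01010101

Each iteration mirrors the pseudocode above: tentatively set the k-th bit to 1, drop it back to 0 if the value would exceed high, and stop as soon as the value reaches low.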

Page 39: Dictionary Based Compression

Example 1: Range (0.33184, 0.33220)

BEGIN
  code = 0;
  k = 1;
  while ( value(code) < 0.33184 )
  {
    assign 1 to the k-th binary fraction bit;
    if ( value(code) > 0.33220 )
      replace the k-th bit by 0;
    k = k + 1;
  }
END

1. Assign 1 to the first fraction bit (codeword = 0.1₂) and compare with low (0.33184₁₀): value(0.1₂) = 0.5₁₀ > 0.33184₁₀, which is out of range, hence we assign 0 to the first bit. value(0.0₂) < 0.33184₁₀, so the while loop continues.

2. Assign 1 to the second fraction bit: value(0.01₂) = 0.25₁₀, which is less than high (0.33220), so the bit stays 1.

Page 40: Dictionary Based Compression

Example 1: Range (0.33184, 0.33220), continued

3. Assign 1 to the third fraction bit: value(0.011₂) = 0.25₁₀ + 0.125₁₀ = 0.375₁₀, which is bigger than high (0.33220), so replace the k-th bit by 0. Now the codeword = 0.010₂.

4. Assign 1 to the fourth fraction bit: value(0.0101₂) = 0.25₁₀ + 0.0625₁₀ = 0.3125₁₀, which is less than high (0.33220). Now the codeword = 0.0101₂.

5. Continue…

Eventually, the binary codeword generated is 0.01010101₂, which is 0.33203125₁₀; this 8-bit binary fraction represents CAEE$.