Data Compression

1 SILICON MAIN TOPIC NAME: Rashmi Kanta Mohapatra ROLL No.: 052

Upload: sajan-sahu

Posted on 28-Jul-2015

46 views

Category: Technology


1 download

TRANSCRIPT

1

SILICON

MAIN TOPIC

NAME: Rashmi Kanta Mohapatra

ROLL No.: 052

2


CONTENTS
Introduction
What, when
Some questions
Uses
Major steps
Types of data compression
Disadvantages
Conclusion

3


INTRODUCTION
Data Compression, what: as the name implies, it makes your data smaller, saving space. It looks for repetitive sequences or patterns in data, e.g. the word "the" in text such as "the quick brown fox". We are more repetitive than we think: text often compresses by over 50%. Lossless vs. lossy.

4


Data Compression - WHY
Most data from nature has redundancy: there is more data than the actual information contained in the data. Squeezing out the excess data amounts to compression. However, unsqueezing is necessary to be able to figure out what the data means. Is it always possible to compress? Consider a two-bit sequence: can you always compress it to one bit? Such questions reveal the limits of compression and give clues on how to compress well.

5


Question: Why do we want to make files smaller?

Answer:
To use less storage, i.e., saving costs.
To transmit these files faster, decreasing access time, or keeping the same access time but with a lower and cheaper bandwidth.
To process the file sequentially faster.

6


MAJOR STEPS
[Diagram: uncompressed data passes through Preparation, Quantization, and Entropy Encoding to produce the compressed data.]

7


Preparation: it includes analog-to-digital conversion and generating an appropriate digital representation of the information. An image is divided into blocks of 8x8 pixels and represented by a fixed number of bits per pixel.

Processing: it is the first stage of the compression process, which makes use of sophisticated algorithms.

Quantization: it works on the result of the previous step. It specifies the granularity of the mapping of real numbers into integer numbers. This process results in a reduction of precision.

Entropy encoding: it is the last step. It compresses a sequential digital data stream without loss, for example by compressing a sequence of zeroes by specifying the number of occurrences.

8


USES OF DATA COMPRESSION
More and more data is being stored electronically. Digital video libraries, for example, contain vast amounts of data, and compression allows cost-effective storage of the data.

New technology has allowed the possibility of interactive digital television, and the demand is for high-quality transmissions, a wide selection of programs to choose from, and inexpensive hardware. But for digital television to be a success, it must use data compression [Saxton, 1996]. Data compression reduces the number of bits required to represent or transmit information.

9


TYPES OF DATA COMPRESSION
Entropy encoding: lossless. Data is considered a simple digital sequence and the semantics of the data are ignored.

Source encoding: lossy. Takes the semantics of the data into account. The amount of compression depends on the data contents.

Hybrid encoding: a combination of entropy and source encoding. Most multimedia systems use these.

10


TYPES OF DATA COMPRESSION
Entropy encoding: lossless.

Data in the data stream is considered a simple digital sequence and the semantics of the data are ignored.

Short codewords are used for frequently occurring symbols, longer codewords for infrequently occurring symbols. For example: E occurs frequently in English, so we should give it a shorter code than Q.

Examples of entropy encoding: lossless data compression, Huffman coding, arithmetic coding.

11


LOSSLESS DATA COMPRESSION
Run-Length Coding

Runs (sequences) of data are stored as a single value and count, rather than as the individual run.

Example: this:

WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

becomes: 12WB12W3B24WB14W
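The idea above is easy to sketch in code. The following is a minimal illustration; the function name rle_encode is hypothetical, and the omit-the-count-for-single-characters convention is taken from the slide's 12WB... example rather than from any standard:

```python
from itertools import groupby

def rle_encode(s):
    # Store each run as "<count><value>"; omit the count for runs of
    # length 1, matching the slide's "12WB12W3B24WB14W" convention.
    out = []
    for ch, run in groupby(s):
        n = len(list(run))
        out.append((str(n) if n > 1 else "") + ch)
    return "".join(out)
```

Running it on the 67-character W/B string above yields the 16-character encoding 12WB12W3B24WB14W.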

12


Data is not lost; the original is really needed:
text compression;
compression of computer binaries to fit on a floppy.
Compression ratio is typically 2:1 to 8:1.

Lossless compression applies to many kinds of files.
Statistical techniques: Huffman coding, arithmetic coding.
Dictionary techniques: LZW, LZ77.
Standards: Morse code, Braille, Unix compress, gzip, zip, bzip, GIF, PNG, JBIG, Lossless JPEG.

13


SHANNON-FANO CODING
Shannon's lossless source coding theorem is based on the concept of block coding. To illustrate this concept, we introduce a special information source in which the alphabet consists of only two letters:

1. First-Order Block Code

A = {a, b}

14


B1    P(B1)   Codeword
a     0.5     0
b     0.5     1

R = 1 bit/character

15


An example: 24 bits are used to represent 24 characters, an average of 1 bit/character.

16


Second-Order Block Code: pairs of characters are mapped to either one, two, or three bits.

17


B2    P(B2)   Codeword
aa    0.45    0
bb    0.45    10
ab    0.05    110
ba    0.05    111

R = 0.825 bits/character
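The quoted rate follows directly from the table: multiply each block's probability by its codeword length, sum, and divide by the block length. A small sketch (rate_per_char is a hypothetical helper name, not from the slides):

```python
def rate_per_char(table, block_len):
    # table maps each block to (probability, codeword); the rate is the
    # expected codeword length divided by the characters per block.
    return sum(p * len(code) for p, code in table.values()) / block_len

# The second-order block code from the table above.
second_order = {
    "aa": (0.45, "0"),
    "bb": (0.45, "10"),
    "ab": (0.05, "110"),
    "ba": (0.05, "111"),
}
```

For the second-order table this gives (0.45*1 + 0.45*2 + 0.05*3 + 0.05*3) / 2 = 0.825 bits/character, matching the slide.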

18


An example: 20 bits are used to represent 24 characters, an average of 0.83 bits/character.

19


Third-Order Block Code: triplets of characters are mapped to bit sequences of lengths one through six.

20


B3     P(B3)   Codeword
aaa    0.405   0
bbb    0.405   10
aab    0.045   1100
abb    0.045   1101
bba    0.045   1110
baa    0.045   11110
aba    0.005   111110
bab    0.005   111111

R = 0.68 bits/character

21


An example: 17 bits are used to represent 24 characters, an average of 0.71 bits/character.

22


HUFFMAN CODING
Suppose messages are made of the letters a, b, c, d, and e, which appear with probabilities 0.12, 0.4, 0.15, 0.08, and 0.25, respectively.

We wish to encode each character into a sequence of 0's and 1's so that no code for a character is a prefix of the code for another.

Answer (using Huffman's algorithm, given on the next slides): a = 1111, b = 0, c = 110, d = 1110, e = 10.
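The construction can be sketched with a binary heap: repeatedly merge the two least-probable subtrees, prefixing 0 to one side's codes and 1 to the other's. This is a minimal illustration rather than the slides' exact procedure, and huffman_codes is a hypothetical helper name:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    # freqs: {symbol: probability}. Returns {symbol: bit string}.
    tick = count()  # tie-breaker so heapq never compares the dicts
    heap = [(p, next(tick), {s: ""}) for s, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # two least-probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]
```

For the probabilities above this reproduces the slide's codes (b = 0, e = 10, c = 110, d = 1110, a = 1111), though the 0/1 labelling of each merge is arbitrary, so only the code lengths are canonical.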

23


HUFFMAN CODING example
n = 5, w[0:4] = [2, 5, 4, 7, 9].
[Figure: the five leaf weights 2, 5, 4, 7, 9 before any merging.]

24


HUFFMAN CODING example
n = 5, w[0:4] = [2, 5, 4, 7, 9].
[Figure: the two smallest weights, 2 and 4, are merged into a node of weight 6, leaving 5, 6, 7, 9.]

25


HUFFMAN CODING example
n = 5, w[0:4] = [2, 5, 4, 7, 9].
[Figure: 5 and 6 are merged into a node of weight 11, and 7 and 9 into a node of weight 16.]

26


HUFFMAN CODING example
n = 5, w[0:4] = [2, 5, 4, 7, 9].
[Figure: the nodes of weight 11 and 16 are merged into the root of weight 27. Labelling left edges 0 and right edges 1 gives the codes 2 = 010, 5 = 00, 4 = 011, 7 = 10, 9 = 11.]

27


LZ-77 ENCODING

Good as they are, Huffman and arithmetic coding are not perfect for encoding text, because they don't capture the higher-order relationships between words and phrases. There is a simple, clever, and effective approach to compressing text known as "LZ-77", which uses the redundant nature of text to provide compression.

28


For an example, consider the phrase:
the_rain_in_Spain_falls_mainly_in_the_plain

where the underscores ("_") indicate spaces. This uncompressed message is 43 bytes, or 344 bits, long.

29


the_rain_in_Spain_falls_mainly_in_the_plain

At first, LZ-77 simply outputs uncompressed characters, since there are no previous occurrences of any strings to refer back to. In our example, these characters will not be compressed:

1- the_rain_

The next chunk of the message, in_, has occurred earlier in the message, and can be represented as a pointer back to that earlier text, along with a length field. This gives:

2- the_rain_<3,3>

30


the_rain_in_Spain_falls_mainly_in_the_plain

The next characters, "Sp", have not occurred before and have to be output uncompressed:

3- the_rain_<3,3>Sp

However, the characters "ain_" have already been sent, so they are encoded with a pointer:

4- the_rain_<3,3>Sp<9,4>

The characters "falls_m" are output uncompressed, but "ain" has been used before in "rain" and "Spain", so once again it is encoded with a pointer:

5- the_rain_<3,3>Sp<9,4>falls_m<11,3>

31


the_rain_in_Spain_falls_mainly_in_the_plain

6- the_rain_<3,3>Sp<9,4>falls_m<11,3>ly_<16,3><34,4>

7- the_rain_in_Spain_falls_mainly_in_the_plain

FINAL STEP:
the_rain_<3,3>Sp<9,4>falls_m<11,3>ly_<16,3><34,4>pl<15,3>

So the total for the text above comes to 23 bytes; the actual text is 43 bytes.
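The walk-through above can be sketched as a toy encoder/decoder over <distance,length> tokens. This is only an illustration under the slide's conventions, not a real LZ77 implementation (real ones use sliding windows and binary token encodings), and its greedy match choices can differ slightly from the slide's hand-worked pointers while still decoding to the same text:

```python
import re

def lz77_encode(text, min_len=3):
    # Greedy scan: emit a <distance,length> pointer to the nearest earlier
    # occurrence when the match is at least min_len long, else a literal.
    out, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for dist in range(1, i + 1):  # nearest candidate first
            length = 0
            while i + length < len(text) and text[i - dist + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, dist
        if best_len >= min_len:
            out.append(f"<{best_dist},{best_len}>")
            i += best_len
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

def lz77_decode(encoded):
    # Replay literals; for a pointer, copy `length` characters starting
    # `distance` characters back in the output built so far.
    out = []
    for dist, length, lit in re.findall(r"<(\d+),(\d+)>|(.)", encoded):
        if lit:
            out.append(lit)
        else:
            for _ in range(int(length)):
                out.append(out[-int(dist)])
    return "".join(out)
```

Decoding the slide's final string reproduces the original 43-character phrase exactly.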

32


ARITHMETIC CODING

Huffman coding looks pretty slick, and it is, but there's a way to improve on it, known as "arithmetic coding". The idea is subtle and best explained by example.

Suppose we have a message that only contains the characters A, B, and C, with the following frequencies, expressed as fractions:

A: 0.5   B: 0.2   C: 0.3

33


letter   probability   interval    binary fraction
C        0.3           0.0 : 0.3   0
B        0.2           0.3 : 0.5   0.011 = 3/8 = 0.375
A        0.5           0.5 : 1.0   0.1 = 1/2 = 0.5
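The table's intervals drive the encoder: each successive character narrows the current interval to that character's sub-range, and the message is then represented by a binary fraction inside the final interval. A minimal sketch of the narrowing step (narrow_interval is a hypothetical helper name):

```python
def narrow_interval(message, ranges):
    # ranges maps each letter to its (low, high) sub-interval of [0, 1).
    low, high = 0.0, 1.0
    for ch in message:
        span = high - low
        lo_frac, hi_frac = ranges[ch]
        # Zoom into the current character's slice of the current interval.
        low, high = low + span * lo_frac, low + span * hi_frac
    return low, high
```

With the table above, the message "AB" narrows [0, 1) to [0.5, 1) for A and then to [0.65, 0.75) for B; any binary fraction in that interval, e.g. 0.1011 in binary (0.6875), identifies the message.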

34


Irreversible Compression

Irreversible Compression is based on the assumption that some information can be sacrificed. [Irreversible compression is also called Entropy Reduction.]

Example: shrinking a raster image from 400-by-400 pixels to 100-by-100 pixels. The new image contains 1 pixel for every 16 pixels in the original image.

There is usually no way to determine what the original pixels were from the one new pixel.

In data files, irreversible compression is seldom used. However, it is used in image and speech processing.

35


LOSSY COMPRESSION
Data is lost, but not too much:
audio;
video;
still images, medical images, photographs.
Compression ratios of 10:1 often yield quite acceptable results.
Major techniques include:
Vector Quantization;
block transforms.
Standards: JPEG, JPEG 2000, MPEG (1, 2, 4, 7).

36


IMAGE COMPRESSION

a) 24-bit true colour bitmap (253,014 bytes)

b) 60% image quality (5,599 bytes)

37


DISADVANTAGES

Some techniques exist by which data can be compressed efficiently, but there is a chance of losing data.

38


CONCLUSION

From the above description, no algorithm has been devised that is applicable to every kind of data file. But this difficulty can be handled by using hybrid data compression. In this IT era, data compression is essential, even though some data may be lost.
