Introduction to Information Technology LECTURE 5 COMPRESSION

Upload: virgil-hill

Post on 02-Jan-2016


Introduction to Information Technology

LECTURE 5 COMPRESSION


Why Do We Need Compression?


How Large is a Digitized Image File?

In-Class Example


Downloading an Image File

In-Class Example


Reducing the Size of an Image File

In-Class Example


How Big is a Digital Video File?

Color Screen 512 x 512 pixels

3 bits per color per pixel = 9 bits/pixel

Scene changes 60 frames/second

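512 x 512 x 9 x 60 x 3600 = 509,607,936,000 bits/hour, or about 500 billion bits/hour (roughly 64 GB/hour)

At that rate, a film of roughly three hours such as “The Godfather” requires about 191 GB of storage.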

COMPRESSION is a Critical Requirement
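The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the constant names are made up for illustration, and both the three-hour running time and the convention 1 GB = 10^9 bytes are assumptions chosen because they reproduce the 191 GB figure.

```python
# Uncompressed video size, using the figures from the slide.
WIDTH, HEIGHT = 512, 512          # pixels per frame
BITS_PER_PIXEL = 3 * 3            # 3 bits for each of 3 colors
FRAMES_PER_SECOND = 60
SECONDS_PER_HOUR = 3600

bits_per_hour = WIDTH * HEIGHT * BITS_PER_PIXEL * FRAMES_PER_SECOND * SECONDS_PER_HOUR
gigabytes_per_hour = bits_per_hour / 8 / 1e9   # assumes 1 GB = 10**9 bytes

print(bits_per_hour)                    # 509607936000
print(round(gigabytes_per_hour * 3))    # 191, for an assumed 3-hour film
```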


COMPRESSION

Compression techniques can significantly reduce the bandwidth and memory required for sending, receiving, and storing data.

Most computers are equipped with modems that compress or decompress all information leaving or entering via the phone line.

With a mutually recognized system (e.g. WinZip) the amount of data can be significantly diminished.

Examples of compression techniques we’ll discuss:

Compressing BINARY DATA STREAMS
• Variable length coding (e.g. Huffman coding)
• Universal coding (e.g. WinZip)

IMAGE-SPECIFIC COMPRESSION
• GIF and JPEG

VIDEO COMPRESSION AND MUSIC COMPRESSION
• MPEG and MP3


WHY CAN WE COMPRESS INFORMATION?

Compression is possible because information usually contains redundancies, or information that is often repeated.

For example, two still images from a video sequence of images are often similar. This fact can be exploited by transmitting only the changes from one image to the next.

For example, a line of data often contains redundancies: “Ask not what your country can do for you - ask what you can do for your country.”

File compression programs remove this redundancy.


Redundancy Enables Compression

• This quote from John F. Kennedy’s inaugural address contains 79 units.
• 61 letters + 16 spaces + 1 dash + 1 period.
• Each requires one unit of memory (1 byte).
• To reduce memory space, we look for redundancies.
• “ask” appears two times
• “what” appears two times
• “your” appears two times
• “country” appears two times
• “can” appears two times
• “do” appears two times
• “for” appears two times
• “you” appears two times

“Ask not what your country can do for you - ask what you can do for your country.”

Nearly half of the sentence is redundant.
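The repeated words can be found mechanically. The following is a minimal sketch, not a compression program: it only counts words, and lowercasing plus stripping punctuation is an assumption made here so that “Ask” and “ask” count as the same word.

```python
from collections import Counter
import string

quote = ("Ask not what your country can do for you - "
         "ask what you can do for your country.")

# Lowercase the text and strip punctuation so "Ask" and "ask" match.
words = [w.strip(string.punctuation).lower() for w in quote.split()]
words = [w for w in words if w]           # drops the lone dash

counts = Counter(words)
repeated = sorted(w for w, n in counts.items() if n > 1)
print(repeated)
# ['ask', 'can', 'country', 'do', 'for', 'what', 'you', 'your']
```

Only “not” appears once; every other word could be stored once and referenced the second time.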


Text Files Contain High Redundancy

In English and other languages, some words appear again and again, e.g. “the”, “and”, “from”, “of”, “because”.

Consequently, a large text file can often be reduced by 50% through compression algorithms.

Similarly, programming languages contain a high degree of redundancy.

A small number of commands are used over and over again.


WHY ELSE CAN WE COMPRESS INFORMATION?

We can only hear certain frequencies.

Our eyesight can only resolve so much detail.

We can only process so much information at one time.


WHY ELSE CAN WE COMPRESS INFORMATION?

Some characters occur more frequently than others.

It’s possible to represent frequently occurring characters with a smaller number of bits during transmission.

This may be accomplished by a variable length code, as opposed to a fixed length code like ASCII.

An example of a simple variable length code is Morse Code. “E” occurs more frequently than “Z”, so we represent “E” with a shorter length code:

. = E
- = T
- - . . = Z
- - . - = Q


SOME BACKGROUND: INFORMATION THEORY

Variable length coding exploits the fact that some information occurs more frequently than others.

The mathematical theory behind this concept is known as: INFORMATION THEORY

Claude E. Shannon developed modern Information Theory at Bell Labs in 1948.

He saw the relationship between the probability of appearance of a transmitted signal and its information content.

This realization enabled the development of compression techniques.


A LITTLE PROBABILITY

Shannon (and others) found that information can be related to probability.

An event has a probability of 1 (or 100%) if we are certain this event will occur.

An event has a probability of 0 (or 0%) if we are certain this event will not occur.

The probability that an event will occur takes on values anywhere from 0 to 1.

Consider a coin toss: heads or tails each has a probability of .50

In two tosses, the probability of tossing two heads is: 1/2 x 1/2 = 1/4 or .25

In three tosses, the probability of tossing all tails is: 1/2 x 1/2 x 1/2 = 1/8 or .125

We compute probability this way because the result of each toss is independent of the results of other tosses.
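The multiplication rule for independent tosses can be written out as a small sketch. The function name `prob_all_same` is hypothetical, and exact fractions are used here only so the results print as 1/4 and 1/8.

```python
from fractions import Fraction

def prob_all_same(p_single, tosses):
    """Probability that each of `tosses` independent tosses shows one
    chosen face, when a single toss shows it with probability p_single."""
    return p_single ** tosses

half = Fraction(1, 2)
print(prob_all_same(half, 2))   # 1/4
print(prob_all_same(half, 3))   # 1/8
```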


ENTROPY CONCEPT

As part of information theory, Shannon developed the concept of ENTROPY.

If the probability of a binary event is .5 (like a coin), then on average, you need one bit to represent the result of this event.

As the probability of a binary event moves above or below .5, the number of bits you need, on average, to represent the result decreases.

[Figure: bits needed, plotted against the probability of an event.]

The figure expresses that unless an event is totally random, you can convey the information of the event in fewer bits, on average, than it might first appear.

Let’s do an example...
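As a numerical check of the entropy curve described on this slide, the sketch below evaluates Shannon's binary entropy formula, H(p) = -p log2(p) - (1-p) log2(1-p). The formula itself is not written out on the slide; it is included here as the standard definition, and the function name is chosen for illustration.

```python
import math

def binary_entropy(p):
    """Average bits per outcome for a two-outcome event with probability p."""
    if p in (0.0, 1.0):
        return 0.0            # a certain outcome carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(binary_entropy(0.5))            # 1.0
print(round(binary_entropy(0.8), 3))  # 0.722
```

A fair coin needs a full bit per toss, while an 80/20 event needs only about 0.72 bits on average.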


EXAMPLE FROM TEXT: A MEN’S SPECIALTY STORE

The probability of male patrons is .8
The probability of female patrons is .2

Assume for this example, groups of two enter the store. Calculate the probabilities of different pairings:

Event A, Male-Male. P(MM) = .8 x .8 = .64
Event B, Male-Female. P(MF) = .8 x .2 = .16
Event C, Female-Male. P(FM) = .2 x .8 = .16
Event D, Female-Female. P(FF) = .2 x .2 = .04

We could assign the longest codes to the most infrequent events while maintaining unique decodability.


EXAMPLE CONTINUED

Let’s assign a unique string of bits to each event based on the probability of that event occurring.

Event   Name            Code
A       Male-Male       0
B       Male-Female     10
C       Female-Male     110
D       Female-Female   111

Given a received code of: 01010110100, determine the events:

0 = A (MM), 10 = B (MF), 10 = B (MF), 110 = C (FM), 10 = B (MF), 0 = A (MM)

The above example has used a variable length code.
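The decoding of the store example can be sketched in Python. The code table and probabilities are taken from the slides; the function name `decode` and the buffer-matching approach are illustrative choices that work here because no codeword is a prefix of another.

```python
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}
PROB = {"A": 0.64, "B": 0.16, "C": 0.16, "D": 0.04}

def decode(bits, code):
    """Decode a bit string with a prefix-free code by growing a buffer
    until it matches a codeword."""
    lookup = {word: event for event, word in code.items()}
    events, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in lookup:
            events.append(lookup[buffer])
            buffer = ""
    if buffer:
        raise ValueError("leftover bits: " + buffer)
    return events

print(decode("01010110100", CODE))
# ['A', 'B', 'B', 'C', 'B', 'A']

# Expected codeword length, weighted by how often each pairing occurs.
avg_bits = sum(PROB[e] * len(CODE[e]) for e in CODE)
print(round(avg_bits, 2))   # 1.56
```

On average this code spends 1.56 bits per pair of patrons, compared with 2 bits for a fixed length code over four events.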


VARIABLE LENGTH CODING

Unlike fixed length codes like ASCII, variable length codes:

Assign the longest codes to the most infrequent events.

Assign the shortest codes to the most frequent events.

Each code word must be uniquely identifiable regardless of length.

Examples of Variable Length Coding: Morse Code, Huffman Coding

Variable length coding takes advantage of the probabilistic nature of information.

If we have total uncertainty about the information we are conveying, fixed length codes are preferred.


MORSE CODE

Characters represented by patterns of dots and dashes.

More frequently used letters use short code symbols.

Short pauses are used to separate the letters.

Represent “Hello” using Morse Code:

H = . . . .
E = .
L = . - . .
L = . - . .
O = - - -

Hello = . . . .   .   . - . .   . - . .   - - -
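A lookup table is enough to reproduce this encoding. The sketch below covers only the letters that appear in this lecture, and it uses a space to stand in for the short pause between letters; both are simplifications made for illustration.

```python
# Partial Morse table: only the letters used in this lecture.
MORSE = {"H": "....", "E": ".", "L": ".-..", "O": "---",
         "T": "-", "Z": "--..", "Q": "--.-"}

def to_morse(text):
    """Encode text letter by letter; a space marks the pause between letters."""
    return " ".join(MORSE[ch] for ch in text.upper())

print(to_morse("Hello"))   # .... . .-.. .-.. ---
```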


HUFFMAN CODE

Creates a Binary Code Tree

Nodes connected by branches with leaves

Top node – root

Two branches from each node

[Figure: binary code tree. Labels: Start, Root, Branches, Node, Leaves. Leaves are A, B, C and D; each branch is labeled 0 or 1.]

The Huffman coding procedure finds the optimum, uniquely decodable, variable length code associated with a set of events, given their probabilities of occurrence.


HUFFMAN CODING

A = 0
B = 10
C = 110
D = 111

Given the adjacent Huffman code tree, decode the following sequence: 11010001110

[Figure: binary code tree with leaves A, B, C and D; each branch is labeled 0 or 1.]

110 = C, 10 = B, 0 = A, 0 = A, 111 = D, 0 = A
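Decoding by walking the tree, rather than by table lookup, can be sketched as follows. The nested-tuple layout of `TREE` is an assumption reconstructed from the code table (A = 0, B = 10, C = 110, D = 111), since the original figure is not available.

```python
# Tree as nested pairs: (branch_0, branch_1); a leaf is a letter.
TREE = ("A", ("B", ("C", "D")))

def decode_with_tree(bits, tree):
    """Start at the root, follow one branch per bit, and emit a letter
    (then return to the root) each time a leaf is reached."""
    events, node = [], tree
    for bit in bits:
        node = node[int(bit)]
        if isinstance(node, str):
            events.append(node)
            node = tree
    return "".join(events)

print(decode_with_tree("11010001110", TREE))  # CBAADA
```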


HUFFMAN CODE CONSTRUCTION

First list all events in descending order of probability.

Event A   .3
Event B   .3
Event C   .13
Event D   .12
Event E   .1
Event F   .05

Pair the two events with lowest probabilities and add their probabilities.

Event E (.1) + Event F (.05) = 0.15

Repeat for the pair with the next lowest probabilities.

Event C (.13) + Event D (.12) = 0.25

Repeat for the pair with the next lowest probabilities.

0.15 + 0.25 = 0.4

Repeat for the pair with the next lowest probabilities.

Event A (.3) + Event B (.3) = 0.6

Repeat for the last pair (0.6 and 0.4) and add 0s to the left branches and 1s to the right branches.

Event A   00
Event B   01
Event C   100
Event D   101
Event E   110
Event F   111
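The construction steps above can be sketched with a priority queue. This is one common way to implement the procedure, not necessarily how the textbook presents it; the function name `huffman_code` is hypothetical. Which branch gets 0 and which gets 1 is arbitrary, so the bit patterns may differ from the slide, but the codeword lengths come out the same.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a prefix code by repeatedly merging the two least
    probable entries."""
    tiebreak = count()   # unique counter so equal probabilities never compare dicts
    heap = [(p, next(tiebreak), {event: ""}) for event, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)   # lowest probability
        p1, _, codes1 = heapq.heappop(heap)   # next lowest
        merged = {e: "0" + c for e, c in codes0.items()}
        merged.update({e: "1" + c for e, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {"A": 0.3, "B": 0.3, "C": 0.13, "D": 0.12, "E": 0.1, "F": 0.05}
code = huffman_code(probs)
lengths = {event: len(word) for event, word in sorted(code.items())}
print(lengths)  # {'A': 2, 'B': 2, 'C': 3, 'D': 3, 'E': 3, 'F': 3}
```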


QUESTION

Given the code we just constructed:

Event A: 00
Event B: 01
Event C: 100
Event D: 101
Event E: 110
Event F: 111

How can you decode the string: 0000111010110001000000111?

Starting from the leftmost bit, find the shortest bit pattern that matches one of the codes in the list. The first bit is 0, but we don’t have an event represented by 0. We do have one represented by 00, which is event A. Continue applying this procedure:

00 = A, 00 = A, 111 = F, 01 = B, 01 = B, 100 = C, 01 = B, 00 = A, 00 = A, 00 = A, 111 = F


In-Class Problem

Construct a Huffman code tree for the following events:

Probability (Event A) = .5
Probability (Event B) = .3
Probability (Event C) = .1
Probability (Event D) = .1


In Class Problem

Using the Huffman Code tree below, decode the following sequence: 01110100

[Figure: Huffman code tree with leaves Event A, Event B and Event C; two branching nodes, each with branches labeled 0 and 1.]


UNIVERSAL CODING

Huffman has its limits.

You must know a priori the probability of the characters or symbols you are encoding.

What if a document is “one of a kind?”

Universal Coding schemes do not require a knowledge of the statistics of the events to be coded.

Universal Coding is based on the realization that any stream of data consists of some repetition.

Lempel-Ziv coding is one form of Universal Coding presented in the text.

Compression results from reusing frequently occurring strings.

Works better for long data streams. Inefficient for short strings.

Used by WinZip to compress information.
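To make the idea of reusing strings concrete, here is a minimal sketch of one member of the Lempel-Ziv family, the LZ78 scheme. The slides do not say which variant the text presents, so the choice of LZ78 is an assumption, and this is not the format WinZip actually writes. The dictionary is built while reading, so no symbol statistics are needed in advance.

```python
def lz78_encode(text):
    """Return a list of (dictionary_index, next_char) pairs.
    Index 0 means "no previous phrase"."""
    dictionary = {}          # phrase -> index, grown while reading
    output, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch     # keep extending a phrase we have seen
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:               # flush a trailing phrase already in the dictionary
        output.append((dictionary[phrase], ""))
    return output

def lz78_decode(pairs):
    """Rebuild the text by replaying the same dictionary growth."""
    phrases, out = [""], []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

pairs = lz78_encode("abababab")
print(pairs)  # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, '')]
```

Eight characters become five pairs here; on a very short input the pairs can cost more than the original, which matches the remark that the method is inefficient for short strings.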


Lossless Versus Lossy Compression

LOSSLESS CODING: Every detail of the original data is restored upon decoding.

Examples of compression we’ve discussed thus far are “Lossless”.

Lossless approach absolutely essential for information like financial or engineering data.

LOSSY CODING: Some information is lost.

Lossy coding can be applied to data in which we can tolerate some loss of information.

Human vision can tolerate some loss of image sharpness: fax images, photographs, video clips.

Human hearing can tolerate some loss of fidelity in sound.

Fidelity = faithfulness of our reproduction of an image or sound after compression and decompression.


IMAGE COMPRESSION

Near Photographic Quality Image

1,280 rows of 800 pixels each, with 24 bits of color information per pixel

Total = 1,280 x 800 x 24 = 24,576,000 bits

56 Kbps modem = 56,000 bits/sec

How long does it take to download?

24,576,000 / 56,000 = 439 seconds

439 / 60 = 7.31 minutes

Obviously image compression is essential.
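The download-time estimate can be reproduced directly. The variable names below are illustrative, and the calculation assumes the modem sustains its full nominal rate.

```python
rows, pixels_per_row, bits_per_pixel = 1280, 800, 24
modem_bits_per_second = 56_000     # nominal rate, assumed to be sustained

image_bits = rows * pixels_per_row * bits_per_pixel
seconds = image_bits / modem_bits_per_second

print(image_bits)                 # 24576000
print(round(seconds))             # 439
print(round(seconds / 60, 2))     # 7.31
```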


IMAGES ARE WELL-SUITED FOR COMPRESSION

Images have more redundancy than other types of data.

Images contain a large amount of structure.

Human eye is very tolerant of approximation error.

2 types of image compression:

Lossless coding
Every detail of original data is restored upon decoding.
Examples – Run Length Encoding, lossless JPEG, GIF

Lossy coding
Portion of original data is lost but undetectable to human eye.
Good for images and audio.
Examples – JPEG
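Run Length Encoding, named above as a lossless method, is simple enough to sketch in full. The sample row of pixel values is made up for illustration; each run of identical values is stored as a (value, run length) pair.

```python
from itertools import groupby

def rle_encode(pixels):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]

def rle_decode(runs):
    """Expand the pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]

row = [0, 0, 0, 0, 1, 1, 0, 0, 0]
runs = rle_encode(row)
print(runs)  # [(0, 4), (1, 2), (0, 3)]
```

Decoding the pairs returns exactly the original row, which is what makes the method lossless.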


IMAGE COMPRESSION

JPEG – Joint Photographic Experts Group

29 distinct coding systems for compression, 2 for Lossless compression.

Lossless JPEG uses a technique called predictive coding to attempt to identify pixels later in the image in terms of previous pixels in that same image.

Lossy JPEG consists of image simplification, removing image complexity at some loss of fidelity.

GIF – Graphics Interchange Format

Developed by CompuServe.

Lossless image compression system.

Application of Lempel-Ziv-Welch (LZW)

The two compressed image formats most often encountered on the Web are JPEG and GIF.
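The idea behind predictive coding can be illustrated with the simplest possible predictor: guess that each pixel equals the one before it and store only the error. This predictor and the sample row are assumptions for illustration, not necessarily what lossless JPEG uses.

```python
def residuals(row):
    """Predict each pixel as equal to the previous one; keep only the error."""
    previous, out = 0, []
    for pixel in row:
        out.append(pixel - previous)
        previous = pixel
    return out

def reconstruct(errors):
    """Undo the prediction by accumulating the errors."""
    previous, out = 0, []
    for error in errors:
        previous += error
        out.append(previous)
    return out

row = [100, 101, 101, 103, 102]
print(residuals(row))  # [100, 1, 0, 2, -1]
```

After the first pixel the errors are small numbers clustered near zero, which a variable length code can represent in few bits.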


DIGITAL VIDEO COMPRESSION - MPEG

MPEG is a series of techniques for compressing streaming digital information.

DVDs use MPEG coding.

MPEG achieves compression results on the order of 1/35 of original.

If we examine two still images from a video sequence of images, we will almost always find that they are similar.

This fact can be exploited by transmitting only the changes from one image to the next.

Many pixels will not change from one image to the next.

Called IMAGE DIFFERENCE CODING
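Image difference coding can be sketched on two tiny made-up frames, each a flat list of pixel values. Only the positions that changed are transmitted; the function names are hypothetical, and real MPEG coding is considerably more elaborate than this.

```python
def frame_delta(previous, current):
    """List (position, new_value) only for pixels that changed."""
    return [(i, new)
            for i, (old, new) in enumerate(zip(previous, current))
            if old != new]

def apply_delta(previous, delta):
    """Rebuild the current frame from the previous frame plus the changes."""
    frame = list(previous)
    for i, value in delta:
        frame[i] = value
    return frame

frame1 = [5, 5, 5, 9, 9, 5, 5, 5]
frame2 = [5, 5, 5, 5, 9, 9, 5, 5]
delta = frame_delta(frame1, frame2)
print(delta)  # [(3, 5), (5, 9)]
```

Two changes are sent instead of eight pixel values, and the receiver recovers the second frame exactly.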

Moving Picture Experts Group (MPEG) standard for video compression.


MPEG Audio Layer-3 (MP3)

The MPEG Compression Standard includes a Specification for Compressing Sound.

Technical Name is MPEG Audio Layer-3. Acronym is MP3.

CDs Store Music in Uncompressed Formats.

Assume Music is sampled 44,100 times per second.

44,100 samples/second * 16 bits/sample * 2 channels = 1,411,200 bits per second

It would take a prohibitively long time to download a song over a 56 Kbps modem.
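To put a number on “prohibitively long”, the sketch below works out the download time for an uncompressed song. The three-minute song length is an assumption for illustration, and the modem is assumed to sustain its nominal rate.

```python
SAMPLES_PER_SECOND = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2
MODEM_BITS_PER_SECOND = 56_000

cd_bits_per_second = SAMPLES_PER_SECOND * BITS_PER_SAMPLE * CHANNELS
song_seconds = 3 * 60                      # assumed 3-minute song
download_seconds = cd_bits_per_second * song_seconds / MODEM_BITS_PER_SECOND

print(cd_bits_per_second)                  # 1411200
print(round(download_seconds / 60, 1))     # 75.6
```

Under these assumptions a three-minute song takes about 75.6 minutes to download, roughly 25 times longer than it takes to play.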

Compression is essential. MP3 reduces the file size by more than ten times.

HOW? PERCEPTUAL NOISE SHAPING. Example of Lossy Compression.

NEW TOPIC: AUDIO AS INFORMATION