Data Compression Intro


DATA COMPRESSION
Lecture by Kiran Kumar KV, PESSE

Block Diagram of Data Compression

INTRODUCTION TO LOSSLESS COMPRESSION
Unit 1, Chapter 1

Preface

Introduction
Data
Need for Compression
Compression Techniques
Lossless and Lossy Compression
Performance Measures
Modeling and Coding
Problems

Introduction

The word "data" derives from the Latin for "to give", hence "something given".
In geometry, mathematics, engineering, and so on, the terms "given" and "data" are used interchangeably.
Data is also a representation of a fact, figure, or idea.
In computer science, data are numbers, words, images, etc., accepted as they stand.

Data (Analog)

The first images were sent over the Atlantic using a submarine (telegraph) cable in the 1920s.
1964: lunar probe images.

Data (Analog)

To
The Principal,
College

Respected Sir,
Subject: Need for a heater in class
(Students, please fill in the rest.)

Yours sincerely,
Faculty

Data (Digital World)

Raw data => digital data (011.0110101...)

Difference between Data, Information and Knowledge

Data is the lowest level of abstraction, information is the next level, and knowledge is the highest level of the three.
Data on its own carries no meaning. For data to become information, it must be interpreted and take on a meaning.
For example, the height of Mt. Everest is generally considered "data"; a book on Mt. Everest's geological characteristics may be considered "information"; and a report containing practical advice on the best way to reach Mt. Everest's peak may be considered "knowledge".

Compression

What is the need for compression?
What are the different kinds of compression?
Which one is better?
Which technique is used more often?
What is the use of combining both techniques?

Need for Compression

Weather forecasting
Internet data
Broadband
Planning cities

COMPRESSION TECHNIQUES (Introduction to Lossless Compression)

Different kinds of compression:

Lossless compression: the compressed data can be reconstructed back into the exact original data.
Lossy compression: the compressed data cannot be reconstructed back into the exact original data.

Lossless Compression

Involves no loss of information.
Main area: text compression, where the reconstructed text must be identical to the original. For example, reconstructing "Do not send money" as "Do now send money" changes the meaning entirely, so even a one-character error is unacceptable.
Other areas: radiology, satellite imagery.
Main advantage: zero distortion.
Main disadvantage: the amount of compression is smaller than with lossy compression.

Lossy Compression

Disadvantage: data compressed with lossy techniques generally cannot be recovered or reconstructed exactly (some information is lost).
Advantage: much higher compression ratios.
Areas: audio and video compression (MP3, MPEG, JPEG).

MEASURES OF PERFORMANCE (Introduction to Lossless Compression)

How do we measure or quantify compression performance?

Secondary measures:
1. The relative complexity of the algorithm.
2. The memory required to implement the algorithm.
3. How fast the algorithm runs on a given machine.

Primary measures:
1. The amount of compression.
2. How closely the reconstruction resembles the original.

Compression Ratio

The most widely used measure of how much the data has been compressed is the compression ratio: the ratio of the number of bits required to represent the data before compression to the number of bits required to represent the data after compression.

Example: suppose storing an image made up of a square array of 256 x 256 pixels requires 65,536 bytes, and the compressed version requires 16,384 bytes. The compression ratio is 4:1.

Another Measure: Rate

Rate: the average number of bits required to represent a single sample.

Consider the last example. The 256 x 256 original image occupies 65,536 bytes, so each pixel takes 1 byte, i.e. 8 bits per pixel (sample). The compressed image occupies 16,384 bytes. How many bits does each pixel now take? The rate is 2 bits/pixel.

Are these two measures sufficient for lossy compression?
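Both measures are simple ratios; the short Python sketch below (the helper names are mine, not from the slides) reproduces the 256 x 256 image example.

```python
def compression_ratio(original_bits: int, compressed_bits: int) -> float:
    """Ratio of bits before compression to bits after compression."""
    return original_bits / compressed_bits

def rate(compressed_bits: int, num_samples: int) -> float:
    """Average number of bits per sample after compression."""
    return compressed_bits / num_samples

# The 256 x 256 image example from the slides.
original_bytes, compressed_bytes = 65_536, 16_384
pixels = 256 * 256

print(compression_ratio(original_bytes * 8, compressed_bytes * 8))  # 4.0 -> 4:1
print(rate(compressed_bytes * 8, pixels))                           # 2.0 bits/pixel
```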

Distortion

In lossy compression, the reconstruction differs from the original data. To determine the efficiency of a compression algorithm, we have to quantify that difference. The difference between the original and the reconstruction is called the distortion.

Lossy techniques are generally used for the compression of data that originate as analog signals, such as speech and video. For speech and video, the final arbiter of quality is the human observer (a behavioural judgement). Because human responses are difficult to model mathematically, many approximate measures of distortion are used to determine the fidelity (quality) of the reconstructed waveforms.

MODELING AND CODING (Introduction to Lossless Compression)

Modeling and Coding

1) A compression scheme can be either lossless or lossy, depending on the requirements of the application.
2) The exact compression scheme depends on several factors, but the main one is the characteristics of the data.
3) For example, a technique that works well for compressing text may not work well for compressing images. The best approach for a given application largely depends on the redundancies inherent in the data.

Modeling and Coding: Redundancies

Redundant means not needed, i.e. something that can be omitted without any loss of significance.
Example: in a portrait image, most of the background is the same, so it need not all be encoded explicitly.
An approach that works for one kind of data may not work for another kind (e.g. a landscape or a group photo).

Modeling and Coding

The development of data compression algorithms for a variety of data can be divided into two phases.
Modeling: extract information about any redundancy present in the data and describe it in the form of a model.
Coding: a description of the model and a description of how the data differ from the model are encoded, usually with a binary alphabet. The difference between the data and the model is often referred to as the residual.

DATA MODELING EXAMPLES (Introduction to Lossless Compression)

Example 1

Q. Consider the following sequence of numbers X = {x1, x2, x3, ...}: 9 11 11 11 14 13 15 17 16 17 20 21. How many bits are required to store or transmit each sample?

Ans. 5 bits/sample if the numbers are sent as they are; fewer if we exploit the structure of the data.

1) Model the data. The values roughly follow a straight line, x'n = n + 8 (of the form y = mx + c), for n = 1, 2, ...
2) Residue: the difference between the data and the model, en = xn - x'n:
0 1 0 -1 1 -1 0 1 -1 -1 1 1

Example 1 (continued)

The residual sequence consists of only three values {-1, 0, 1}. Assigning the codes 00 to -1, 01 to 0 and 10 to 1, we need only 2 bits to represent each element of the residual sequence. Therefore, we can obtain compression by transmitting the parameters of the model and the residual sequence.

The scheme is lossy if only the model is transmitted, and lossless if both the model parameters and the residual are transmitted.

Q. Model the following data for compression:
{6 8 10 10 12 11 12 15 16}
{5 6 9 10 11 13 17 19 20}
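Example 1 expressed as a short Python sketch (the 2-bit codebook matches the slide; the rest is illustrative):

```python
# Model the data as the line x'_n = n + 8 and encode only the residual.
data = [9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21]

model = [n + 8 for n in range(1, len(data) + 1)]        # x'_n = n + 8
residual = [x - m for x, m in zip(data, model)]          # e_n = x_n - x'_n
print(residual)  # [0, 1, 0, -1, 1, -1, 0, 1, -1, -1, 1, 1]

codebook = {-1: "00", 0: "01", 1: "10"}                  # 2 bits per residual value
bitstream = "".join(codebook[e] for e in residual)

# 12 samples * 5 bits = 60 bits raw, versus 24 bits for the residual
# (plus the cost of describing the model parameters m = 1, c = 8).
print(len(data) * 5, "bits raw ->", len(bitstream), "bits of residual")
```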

Example 2

Q. Find the structure present in this data sequence: 27 28 29 28 26 27 29 28 30 32 34 36 38

I. No obvious functional structure is found, so
II. check for closeness of consecutive values.
III. Send the first value, then the rest of the residue (the differences between neighbouring values): 27, followed by 1 1 -1 -2 1 2 -1 2 2 2 2 2.
IV. Are the bits/sample reduced?
V. The decoder adds each received difference to the previously decoded value to reconstruct the original sequence.

Note

Techniques that use the past values of a sequence to predict the current value, and then encode the error in prediction (the residual), are called predictive coding schemes.

Note: even assuming both the encoder and the decoder know the model being used, we still have to send the value of the first element of the sequence.

Example 3

Suppose we have the following sequence:
aba ray a ranba rrayb ranbfa rbfaa rbfaaa rbaway

To represent the 8 distinct symbols appearing above, 3 bits/symbol are required with a fixed-length code. Suppose instead we assign a 1-bit codeword to the symbol that occurs most often and longer codewords to the rarer symbols. As there are 41 symbols in the sequence, this works out to approximately 2.58 bits per symbol, i.e. a compression ratio of 1.16:1 (Huffman coding).

Dictionary compression schemes exploit the fact that letters and words repeat.

Note

There will be situations in which it is easier to take advantage of the structure if we decompose the data into a number of components. We can then study each component separately and use a model appropriate to that component.

There are a number of different ways to characterize data, and different characterizations lead to different compression schemes.

We can compress something with products from one vendor and reconstruct it using the products of a different vendor; international standards organizations maintain standards for various compression applications.

MATHEMATICAL PRELIMINARIES FOR LOSSLESS COMPRESSION
Unit 1, Chapter 2

Overview

This chapter builds the mathematical framework for lossless schemes:
starting with information theory,
then basic probability concepts,
and, based on these, the modeling of data.

Introduction to Information Theory

Information theory gives a quantitative measure of information. Its father is Claude Elwood Shannon, an electrical engineer at Bell Labs, who defined a quantity called self-information.

Given a random experiment, if A is an event in the set of outcomes, the self-information associated with A is

    i(A) = log_b (1 / P(A)) = -log_b P(A)

where i(A) is the self-information and P(A) is the probability of the event A (with base b = 2, the unit is bits).

Introduction to Information Theory (continued)

Since log(1) = 0 and -log(x) increases as x decreases, if the probability of an event is low, the information associated with it is high, and vice versa.

Another property: the information obtained from the occurrence of two independent events is the sum of the information obtained from the occurrence of the individual events. If A and B are two independent events, the self-information associated with the occurrence of both A and B is

    i(AB) = -log_b P(A)P(B) = i(A) + i(B)
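As a quick numerical check of these two properties, here is a small illustrative Python sketch that evaluates self-information in bits:

```python
from math import log2

def self_information(p: float) -> float:
    """i(A) = -log2 P(A), measured in bits."""
    return -log2(p)

print(self_information(0.5))            # 1.0 bit: a fair coin toss
print(self_information(0.125))          # 3.0 bits: a rarer event carries more information
# Independent events: i(AB) = i(A) + i(B)
print(self_information(0.5 * 0.125))    # 4.0 bits = 1.0 + 3.0
```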


Entropy

Suppose we have a set of independent events Ai, which are the outcomes of some experiment S. The average self-information associated with the random experiment is

    H(S) = Σ P(Ai) i(Ai) = -Σ P(Ai) log_b P(Ai)

This quantity is called the entropy of the experiment (Shannon).
Note: entropy is also the measure of the average number of binary symbols needed to code the output of the source.

Note

Most of the sources considered in this subject are independent and identically distributed (iid); the entropy equation above holds only when the experiment is iid.

Theorem: Shannon showed that the best a lossless compression scheme can do is to encode the output of a source with an average number of bits equal to the entropy of the source.

The estimate of the entropy depends on our assumptions about the structure of the source sequence.

Example 4

Q. Consider the sequence: 1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10

The probability of occurrence of each element is
P(1) = P(6) = P(7) = P(10) = 1/16
P(2) = P(3) = P(4) = P(5) = P(8) = P(9) = 2/16

Assuming the sequence is iid, the first-order entropy of this source is 3.25 bits. Hence, by Shannon's result, the best we can do is about 3.25 bits/sample.
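The 3.25 bits/sample figure is easy to verify numerically; the sketch below (my own helper, not from the slides) also previews the differenced sequence analysed in Step 2 on the next slide.

```python
from collections import Counter
from math import log2

def first_order_entropy(seq):
    """H = -sum p * log2(p), treating the samples as iid."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

seq = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]
print(first_order_entropy(seq))   # 3.25 bits/sample

# Entropy of the differenced sequence (P(1) = 13/16, P(-1) = 3/16):
diff = [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]
print(first_order_entropy(diff))  # about 0.70 bits/sample
```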

Example 4, Step 2: model the data to remove redundancy.

Solution: there is sample-to-sample correlation between the samples, which we remove by taking differences of neighbouring sample values:
1 1 1 -1 1 1 1 -1 1 1 1 1 1 -1 1 1

This residual sequence uses only two values, 1 and -1, with P(1) = 13/16 and P(-1) = 3/16, so its entropy is about 0.70 bits per symbol.

Knowing only this residual sequence is not enough to reconstruct the original sequence; we must also know the process by which it was generated from the original. That process depends on our assumption about the structure of the input data: assumption = model.

Note

If the parameters of the model do not change with n, the model is called a static model. A model whose parameters change, or adapt, with n to the changing characteristics of the data is called an adaptive model.

Basically, we see that knowing something about the structure of the data can help to "reduce the entropy" of what we have to encode.

Structure

Consider the following sequence: 1 2 1 2 3 3 3 3 1 2 3 3 3 3 1 2 3 3 1 2

Obviously there is some structure to this data, but if we look at it one symbol at a time the structure is difficult to extract. The probabilities are P(1) = P(2) = 1/4 and P(3) = 1/2, so the entropy is 1.5 bits/symbol. The sequence consists of 20 symbols, so the total number of bits required to represent it is 30.

Now take the same sequence and look at it in blocks of two. There are only two block symbols, (1 2) and (3 3), with probabilities P(1 2) = P(3 3) = 1/2, so the entropy is 1 bit per block. As there are 10 such blocks in the sequence, we need a total of 10 bits to represent the entire sequence, a reduction by a factor of three.
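A short sketch confirming the 30-bit versus 10-bit comparison (assuming non-overlapping blocks of two and treating each view as iid):

```python
from collections import Counter
from math import log2

def entropy(symbols):
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())

seq = [1, 2, 1, 2, 3, 3, 3, 3, 1, 2, 3, 3, 3, 3, 1, 2, 3, 3, 1, 2]

h1 = entropy(seq)                                   # 1.5 bits/symbol
pairs = list(zip(seq[0::2], seq[1::2]))             # blocks of two symbols
h2 = entropy(pairs)                                 # 1.0 bit/block

print(len(seq) * h1, "bits vs", len(pairs) * h2, "bits")  # 30.0 vs 10.0
```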

Derivation of Average Information: not in the syllabus.

Models

Good models for sources lead to more efficient compression algorithms. In general, in order to develop techniques that manipulate data using mathematical operations, we need a mathematical model for the data. There are several approaches to building such a model:

Physical model
Probability model
Markov model
Composite source model

Physical Model

If we know something about the physics of the data generation process, we can use that information to construct a model. For example:

In speech-related applications, knowledge about the physics of speech production can be used to construct a mathematical model for the sampled speech process; sampled speech can then be encoded using this model.

If residential electrical meter readings at hourly intervals were to be coded, knowledge about the living habits of the populace could be used to predict when electricity usage would be high and when it would be low. Then, instead of the actual readings, the difference (residual) between the actual readings and those predicted by the model could be coded.

Physical Model: Disadvantages

In general, however, the physics of data generation is simply too complicated to understand, let alone use to develop a model. Since the physics of the problem is too complicated, in practice we build models based on empirical observation of the statistics of the data.

Probability Model

The simplest mathematical model for a source is to assume that the source emits its letters independently, each with the same probability; hence the name ignorance model. It is used when we don't know anything about the source.

Next, let us assume the events are independent but not equally likely. For a source that generates letters from an alphabet A = {a1, a2, ..., aM}, this is represented by a probability model P = {P(a1), P(a2), ..., P(aM)}, and the entropy equation then gives us the entropy of the source.

Probability Model (continued)

If we also discard the assumption of independence, we can build better data compression schemes, but we then have to describe how the elements of the data sequence depend on each other. One of the most popular ways of representing dependence in the data is through the use of Markov models, named after the Russian mathematician Andrei Andreyevich Markov (1856-1922).

Markov Models

For models used in lossless compression, we use a specific type of Markov process called a discrete-time Markov chain. A sequence {Xn} is said to follow a kth-order Markov model if

    P(Xn | Xn-1, ..., Xn-k) = P(Xn | Xn-1, ..., Xn-k, ...)

That is, knowledge of the past k symbols is equivalent to knowledge of the entire past history of the process. The values taken on by the set {Xn-1, ..., Xn-k} are called the states of the process.

Markov Models (continued)

The most commonly used Markov model is the first-order Markov model, for which

    P(Xn | Xn-1) = P(Xn | Xn-1, Xn-2, Xn-3, ...)

Markov chain property: the probability of each subsequent state depends only on the previous state. These equations indicate the existence of dependence between samples, but they do not describe the form of the dependence; we can develop different first-order Markov models depending on our assumption about that form.

To define a Markov model, the following probabilities have to be specified: the transition probabilities P(X2 | X1) and the initial probabilities P(X1).

Markov Models: Linear Dependence

If we assume that the dependence is introduced in a linear manner, we can view the data sequence as the output of a linear filter driven by white noise. The output of such a filter can be given by a difference equation of the form

    Xn = Σ ai Xn-i + En

where En is the white noise. This model is often used when developing coding algorithms for speech and images. The Markov model itself, however, does not require the assumption of linearity.

Markov Model Example

Consider a binary image. The image has only two types of pixels: white pixels and black pixels.

Q. Based on the current pixel, can we predict the appearance of the next one?

Ans. Yes, we can model the pixel process as a discrete-time Markov chain. Define two states, Sw and Sb, where Sw corresponds to the case where the current pixel is a white pixel and Sb to the case where the current pixel is a black pixel. We define the transition probabilities P(w|b) and P(b|w), and the probabilities of being in each state, P(Sw) and P(Sb). The Markov model can then be represented by the state diagram shown in the figure on the slide.

Markov Model: Entropy

The entropy of a finite-state process with states Si is simply the average value of the entropy at each state:

    H = Σ P(Si) H(Si)

Example of a Markov Model

Two states: Rain and Dry.
Transition probabilities: P(Rain|Rain) = 0.3, P(Dry|Rain) = 0.7, P(Rain|Dry) = 0.2, P(Dry|Dry) = 0.8.
Initial probabilities: say P(Rain) = 0.4, P(Dry) = 0.6.
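Plugging the Rain/Dry numbers into the finite-state entropy formula from the previous slide gives roughly 0.79 bits per symbol. The sketch below weights the per-state entropies by the state probabilities given above; in steady state one would use the stationary distribution instead.

```python
from math import log2

def state_entropy(probs):
    """Entropy of the next-symbol distribution in one state."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Rain/Dry chain from the slide.
transitions = {"Rain": {"Rain": 0.3, "Dry": 0.7},
               "Dry":  {"Rain": 0.2, "Dry": 0.8}}
state_probs = {"Rain": 0.4, "Dry": 0.6}   # the slide's (initial) state probabilities

# Average of the per-state entropies, weighted by the state probabilities,
# as in H = sum_i P(Si) H(Si).
H = sum(state_probs[s] * state_entropy(transitions[s].values())
        for s in transitions)
print(round(H, 3))  # about 0.786 bits per symbol
```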


Markov Models in Text Compression

In English, the probability of the next letter is heavily influenced by the preceding letters. In the current text compression literature, kth-order Markov models are more widely known as finite context models, with the word "context" being used for what we have called the state.

Example: consider the word "preceding". Suppose we have already processed "precedin" and are about to encode the next letter. If we take no account of the context and treat the letter as a surprise, the probability of the letter "g" occurring is relatively low.

If we use a first-order Markov model, i.e. the probability model conditioned on the single preceding letter "n", we can see that the probability of "g" increases substantially. As we increase the context size (going from "n" to "in" to "din" and so on), the probability distribution of the next letter becomes more skewed and the entropy decreases.

Shannon used a second-order model for English text consisting of the 26 letters and one space to obtain an entropy of 3.1 bits/letter. Using a model in which the output symbols were words rather than letters brought the entropy down to 2.4 bits/letter.

Note: the longer the context, the better its predictive value.

Markov Models in Text Compression (continued)

Disadvantage: to store the probability model with respect to all contexts of a given length, the number of contexts grows exponentially with the length of the context. Moreover, since the source imposes some structure on its output, many of these contexts correspond to strings that would never occur in practice, and different sources have different repeating patterns.

Solution: the PPM (Prediction with Partial Match) algorithm. While encoding, the longest context in which the symbol has a non-zero probability is used; a symbol with zero probability in the current context is handled by sending an escape symbol and dropping to a shorter context.

Composite Source Model

In many applications it is not easy to describe the source with a single model. In such cases we can define a composite source, which can be viewed as a combination (composition) of several sources, with only one source being active at any given time. Each source Si has its own model Mi and is selected with probability Pi.

Coding

Coding is the assignment of binary sequences (of 0s and 1s) to elements or symbols. The set of binary sequences is called a code, and the individual members of the set are called codewords. Example: codewords a -> 001, b -> 010; an encoded sequence might look like 100101100110010101.

An alphabet is a collection of symbols called letters. For example, the alphabet used in writing most books consists of the 26 lowercase letters, 26 uppercase letters, and a variety of punctuation marks. In the terminology used here, a comma is a letter.

The ASCII code for the letter a is 1000011, the letter A is coded as 1000001, and the letter "," is coded as 0011010. Notice that the ASCII code uses the same number of bits to represent each symbol; such a code is called a fixed-length code.

Coding (continued)

To reduce the number of bits required to represent different messages, we use a different number of bits for different symbols. If we use fewer bits for the symbols that occur more often, then on average we use fewer bits per symbol. The average number of bits per symbol is often called the rate of the code.

Examples: Morse code and Huffman codes, where the codewords for letters that occur more frequently are shorter than those for letters that occur less frequently. The codeword for E is 1 bit, while the codeword for Z is 7 bits.

Uniquely Decodable Codes

The average length of the code is not the only criterion for a good code.

Example: suppose our source alphabet consists of four letters a1, a2, a3, a4 with probabilities P(a1) = 1/2, P(a2) = 1/4, P(a3) = P(a4) = 1/8. The entropy of this source is 1.75 bits/symbol. The average length of a code for it is

    l = Σ P(ai) n(ai)

where n(ai) is the number of bits in the codeword for letter ai, and the average length is given in bits/symbol.

Uniquely Decodable Codes (continued)

Comparing the average lengths of the candidate codes, Code 1 appears to be the best. However, a code must also be able to transfer information in an unambiguous way.

Code 1: both a1 and a2 have been assigned the codeword 0. When a 0 is received, there is no way to know whether an a1 or an a2 was transmitted. Hence we would like each symbol to be assigned a unique codeword.

Code 2 seems to have no problem with ambiguity at the codeword level. However, if we encode {a2 a1 a1}, the binary string is 100, and 100 can be decoded either as {a2 a1 a1} or as {a2 a3}. The original sequence cannot be recovered with certainty: there is no unique decodability, which is not desirable.

What about Code 3? Its first three codewords all end in 0, and a 0 denotes the termination of a codeword. The codeword for a4 is three 1s, which is easily recognized.

Code 3: notice that the first three codewords all end in a 0; in fact, a 0 always denotes the termination of a codeword. The final codeword contains no 0s and is 3 bits long. Because all other codewords have fewer than three 1s and terminate in a 0, the only way we can get three 1s in a row is as the code for a4.

The decoding rule is simple: accumulate bits until you get a 0 or until you have three 1s. There is no ambiguity in this rule, and it is reasonably easy to see that this code is uniquely decodable.

Code 4: each codeword starts with a 0, and the only time we see a 0 is at the beginning of a codeword. The decoding rule is to accumulate bits until you see a 0; the bit before that 0 is the last bit of the previous codeword.

The difference between Code 3 and Code 4 is that with Code 3 the decoder knows the moment a codeword is complete, whereas with Code 4 we have to wait until the beginning of the next codeword before we know that the current codeword is complete. Because of this property, Code 3 is called an instantaneous code and Code 4 a near-instantaneous code.

Q. Is such a code still uniquely decodable? Try decoding the string 011111111111111111 using Code 5.

Instantaneous and Near-Instantaneous Codes

Decode 011111111111111111 using Code 5. The first codeword is either 0 (a1) or 01 (a2). If we assume the first codeword is a1, then after decoding the next eight codewords as a3s we are left with a single dangling 1. If we instead assume the first codeword is a2, we can decode the remaining 16 bits as eight a3s.

So the string can be decoded, and in only one valid way. In fact Code 5, while certainly not instantaneous, is uniquely decodable; the decoder may simply have to wait until the end of the string to decide.

Decode 01010101010101010 using Code 6.
Decoding 1: a1 followed by eight a3s.
Decoding 2: eight a2s followed by one a1.
Two valid decodings exist, so Code 6 is not uniquely decodable.

Even for these small codes it is not immediately evident whether a code is uniquely decodable or not, and the situation is worse for larger codes. Hence a systematic procedure is needed to test for unique decodability.

A Test for Unique Decodability

The formal statement of the test is not in the syllabus; the following examples illustrate it.

Test for Unique Decodability: Example 1

Consider Code 5. First list the codewords: {0, 01, 11}.

The codeword 0 is a prefix of the codeword 01; the dangling suffix is 1. There are no other pairs for which one element is a prefix of the other.

Augment the codeword list with the dangling suffix: {0, 01, 11, 1}. Comparing the elements of this list: 0 is a prefix of 01 with dangling suffix 1, which is already in the list; 1 is a prefix of 11, which again gives the dangling suffix 1, already in the list.

There are no other pairs that would generate a dangling suffix, so we cannot augment the list any further, and no dangling suffix is itself a codeword. Therefore, Code 5 is uniquely decodable.

Test for Unique Decodability: Example 2

Consider Code 6. First list the codewords: {0, 01, 10}.

The codeword 0 is a prefix of the codeword 01; the dangling suffix is 1. There are no other pairs for which one element is a prefix of the other. Augmenting the codeword list with 1, we obtain {0, 01, 10, 1}.

In this list, 1 is a prefix of 10, and the dangling suffix for this pair is 0, which is itself the codeword for a1. Therefore, Code 6 is not uniquely decodable.
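The dangling-suffix procedure used in Examples 1 and 2 can be automated. The sketch below is one straightforward implementation (a Sardinas-Patterson style test, not code from the slides), checked against Codes 5 and 6 as listed above.

```python
def is_uniquely_decodable(codewords):
    """The code is uniquely decodable iff no dangling suffix generated from
    the codewords is itself a codeword."""
    C = set(codewords)

    def suffixes(A, B):
        # Dangling suffixes obtained when a word of A is a proper prefix of a word of B.
        return {b[len(a):] for a in A for b in B if a != b and b.startswith(a)}

    S = suffixes(C, C)                      # first round: compare codeword pairs
    seen = set()
    while S:
        if S & C:                           # a dangling suffix equals a codeword
            return False
        seen |= S
        S = (suffixes(C, S) | suffixes(S, C)) - seen
    return True

print(is_uniquely_decodable(["0", "01", "11"]))  # Code 5 -> True
print(is_uniquely_decodable(["0", "01", "10"]))  # Code 6 -> False
```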

Prefix Codes

The test for unique decodability requires examining the dangling suffixes: if a dangling suffix is itself a codeword, the code is not uniquely decodable. One type of code in which we never face the possibility of a dangling suffix being a codeword is a code in which no codeword is a prefix of another. A code in which no codeword is a prefix of another codeword is called a prefix code.

A simple way to check whether a code is a prefix code is to draw the rooted binary tree corresponding to the code.

Prefix Codes (continued)

Draw a tree that starts from a single node (the root) and has up to two branches at each node, one branch corresponding to 0 and the other to 1. The convention followed here is that the root is at the top, the left branch is 0 and the right branch is 1. In a prefix code, codewords are attached only to the leaves of the tree.

Exercise: using this convention, draw the binary trees for Codes 2, 3 and 4; see also the sketch below.
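Instead of drawing the tree, the prefix property can also be checked directly. In the sketch below the codeword sets for Codes 3 and 4 are inferred from the properties described on the earlier slides (not copied from a table), so treat them as illustrative.

```python
def is_prefix_code(codewords):
    """True if no codeword is a prefix of another (equivalently, in the code tree
    every codeword sits at a leaf)."""
    words = sorted(codewords)              # a prefix, if any, ends up adjacent
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_code(["0", "10", "110", "111"]))   # Code 3 -> True
print(is_prefix_code(["0", "01", "011", "0111"]))  # Code 4 -> False (0 prefixes 01)
```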


The Kraft-McMillan Inequality: not in the syllabus.

Algorithmic Information Theory

Information theory, as above, deals with the data produced by a source; algorithmic information theory deals with the program you would write to generate (and hence compress) the data.

At the heart of algorithmic information theory is a measure called Kolmogorov complexity. The Kolmogorov complexity K(x) of a sequence x is the size of the program needed to generate x, where "size" includes all the inputs the program needs.

If x were a sequence of all ones, a highly compressible sequence, the program would simply be a print statement in a loop. At the other extreme, if x were a random sequence with no structure, the only program that could generate it would have to contain the sequence itself, and the size of the program would be slightly larger than the sequence.

Thus there is a clear correspondence between the size of the smallest program that can generate a sequence and the amount of compression that can be obtained. However, this lower bound cannot be determined with certainty, so it is not used in practice.

HUFFMAN CODING

Overview: the Huffman coding algorithm.

The Huffman Coding Algorithm

Developed by David Huffman as a class assignment in an information theory course taught by Robert Fano at MIT. Huffman codes are prefix codes and are optimum for a given model (set of probabilities).

The Huffman procedure is based on two observations regarding optimum prefix codes:
1. In an optimum code, symbols that occur more frequently (have a higher probability of occurrence) have shorter codewords than symbols that occur less frequently.
2. In an optimum code, the two symbols that occur least frequently have codewords of the same length.

Design of a Huffman Code

Let us design a Huffman code for a source that puts out letters from an alphabet A = {a1, a2, a3, a4, a5} with P(a1) = P(a3) = 0.2, P(a2) = 0.4 and P(a4) = P(a5) = 0.1.

First, find the first-order entropy of the source.
Step 1: sort the letters in descending order of probability.
Step 2: repeatedly combine the two least probable letters/nodes into a single node, assigning 0 and 1 to the two branches, until only one node remains; read the codewords off the tree.

Step 3: find the average length.
l = 0.4 x 1 + 0.2 x 2 + 0.2 x 3 + 0.1 x 4 + 0.1 x 4 = 2.2 bits/symbol.

Step 4: calculate the redundancy (the difference between the average length and the entropy).

Step 5: draw the binary Huffman tree.
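As a check on Steps 3-5, here is a small Python sketch of the Huffman construction for this alphabet (the function and symbol names are mine, not from the slides). Any optimal code it returns has average length 2.2 bits/symbol, although the individual codeword lengths depend on how ties are broken, which is exactly the freedom the minimum-variance discussion below exploits.

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code (one possible optimal code) for {symbol: probability}."""
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least probable nodes
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(w) for s, w in code.items())
print(code)
print(round(avg_len, 2))   # 2.2 bits/symbol
```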

Example 2

Transmit the following 28 data samples using a Huffman code:
1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 4 4 4 4 5 5 5 6 6 7

Minimum Variance Huffman Coding

For the same alphabet, an alternative Huffman code assigns the codeword lengths {2, 2, 2, 3, 3}:
l = 0.4 x 2 + 0.2 x 2 + 0.2 x 2 + 0.1 x 3 + 0.1 x 3 = 2.2 bits/symbol.

The two codes are identical in terms of average length (and hence redundancy), but the variance of the codeword lengths is significantly different.

Remember that in many applications, although you might be using a variable-length code, the available transmission rate is generally fixed. For example, if we were transmitting symbols from the alphabet above at 10,000 symbols per second, we might ask for a transmission capacity of 22,000 bits per second. This means that during each second the channel expects to receive 22,000 bits, no more and no less. As the bit generation rate varies around 22,000 bits per second, the output of the source coder is generally fed into a buffer whose purpose is to smooth out the variations in the bit generation rate.

However, the buffer has to be of finite size, and the greater the variance of the codeword lengths, the more difficult the buffer design problem becomes.

Suppose the source generates a string of a4s and a5s for several seconds. With the first code we generate bits at 40,000 bits per second, so each second the buffer has to store 18,000 bits. With the second code we generate 30,000 bits per second, and the buffer has to store only 8,000 bits per second.

If instead we have a string of a2s, the first code produces 10,000 bits per second, a deficit of 12,000 bits per second, while the second code leads to a deficit of only 2,000 bits per second.

So which code do we select? The one with the smaller variance of codeword lengths makes the buffering problem easier.

Applications of Huffman Coding

Huffman coding is often used in conjunction with other coding techniques in:
Lossless image compression
Text compression
Audio compression

Lossless Image Compression

Monochrome images: pixel values in the range 0-255.

Compression of test images using Huffman coding.

The original (uncompressed) test images are represented using 8 bits/pixel. Each image consists of 256 rows of 256 pixels, so the uncompressed representation uses 65,536 bytes.

From a visual inspection of the test images, we can clearly see that the pixels in an image are heavily correlated with their neighbours. We could represent this structure with the crude model Xn = Xn-1; the residual would then be the difference between neighbouring pixels, and it is this residual that is Huffman coded.
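A toy sketch of the crude model Xn = Xn-1 applied to one row of pixel values (the numbers are made up for illustration); only the first pixel and the neighbour differences would be handed to the Huffman coder.

```python
row = [120, 121, 121, 123, 126, 126, 125, 124]   # hypothetical pixel values

residual = [row[0]] + [b - a for a, b in zip(row, row[1:])]
print(residual)                     # [120, 1, 0, 2, 3, 0, -1, -1]

# Reconstruction at the decoder: running sum of the residual.
reconstructed = []
for r in residual:
    reconstructed.append(r if not reconstructed else reconstructed[-1] + r)
assert reconstructed == row
```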

Huffman Coding in Text Compression

An earlier version of this chapter was encoded using Huffman codes built from the probabilities of occurrence obtained from the chapter itself. The file size dropped from about 70,000 bytes to about 43,000 bytes with Huffman coding.

Audio Compression

The End of Unit 1. Any thoughts, doubts or ideas?