3 mathematical preliminaries data compression

Mathematical Preliminaries


The development of data compression algorithms for a variety of data can be divided into two phases:

Modeling

Coding

In the modeling phase, we try to extract information about any redundancy that exists in the data and describe the redundancy in the form of a model.


In the coding phase, a description of the model and a “description” of how the data differ from the model are coded, generally using a binary alphabet.

The difference between the data and the model is often referred to as the residual.

In the following three examples we will look at three different ways that data can be modeled. We will then use the model to obtain compression.


If binary representations of these numbers are to be transmitted or stored, we would need to use 5 bits per sample.

By exploiting the structure in the data, we can represent the sequence using fewer bits.

If we plot these data, as shown in the following figure, we see that they seem to fall on a straight line.

A model for the data could therefore be a straight line given by the equation:


To examine the difference between the data and the model, we compute the difference (or residual):

En = An − Ān = 0 1 0 −1 1 −1 0 1 −1 −1 1 1

The residual sequence consists of only three values: −1, 0, and 1.

If we assign a code of 00 to −1, a code of 01 to 0, and a code of 10 to 1, we need to use 2 bits to represent each element of the residual sequence.

Therefore, we can obtain compression by transmitting or storing the parameters of the model and the residual sequence.
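As a rough sketch of this example, the Python below packs the residual sequence listed above with the 2-bit code from the text and compares the cost with 5 bits per sample. The straight-line model `n + 8` is an assumption made here for illustration only (the equation itself is not reproduced in this transcript), and all function and variable names are hypothetical.

```python
# Residual sequence as listed in the text.
residuals = [0, 1, 0, -1, 1, -1, 0, 1, -1, -1, 1, 1]

# 2-bit code from the text: -1 -> 00, 0 -> 01, 1 -> 10.
CODE = {-1: "00", 0: "01", 1: "10"}
DECODE = {bits: value for value, bits in CODE.items()}


def model(n):
    """Assumed straight-line model (hypothetical): predicted value at index n."""
    return n + 8


def encode_residuals(values):
    """Concatenate the 2-bit codeword of every residual."""
    return "".join(CODE[v] for v in values)


def decode_residuals(bitstring):
    """Split the bit string into 2-bit codewords and map them back."""
    return [DECODE[bitstring[i:i + 2]] for i in range(0, len(bitstring), 2)]


bitstream = encode_residuals(residuals)
decoded = decode_residuals(bitstream)

# The decoder rebuilds each sample as model prediction + residual (n starts at 1).
reconstructed = [model(n) + e for n, e in enumerate(decoded, start=1)]

raw_bits = 5 * len(residuals)   # 5 bits per sample without a model
residual_bits = len(bitstream)  # 2 bits per residual; model parameters not counted
print(reconstructed)
print(raw_bits, residual_bits)
```

Under these assumptions the 12 samples cost 60 bits when stored directly and 24 bits as residuals, plus whatever is needed to describe the model parameters.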

The encoding can be exact if the required compression is to be lossless, or approximate if the compression can be lossy.


The number of distinct values has been reduced.

Fewer bits are required to represent each number, and compression is achieved.

The decoder adds each received value to the previous decoded value to obtain the reconstruction corresponding to the received value.
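The decoding rule described here can be sketched in Python as follows. The sample sequence is hypothetical and chosen only for illustration (the sequence used on the slide is not reproduced in this transcript), and the function names are assumptions. The sketch assumes a non-empty sequence whose first value is sent unchanged.

```python
def diff_encode(samples):
    """Send the first sample as is, then each sample minus the one before it."""
    encoded = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        encoded.append(cur - prev)
    return encoded


def diff_decode(received):
    """Add each received value to the previous decoded value."""
    decoded = [received[0]]
    for delta in received[1:]:
        decoded.append(decoded[-1] + delta)
    return decoded


# Hypothetical slowly varying sequence, for illustration only.
samples = [27, 28, 29, 28, 26, 27, 29, 28, 30, 32, 34, 36, 38]
encoded = diff_encode(samples)
restored = diff_decode(encoded)

print(encoded)
# Distinct values before and after differencing (first value excluded).
print(len(set(samples)), len(set(encoded[1:])))
```

For this illustrative input, the differences take far fewer distinct values than the original samples, which is what allows a shorter code per value.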


The sequence is made up of eight different symbols.

To represent eight symbols with a fixed-length code, we need to use 3 bits per symbol.

Suppose instead that we assign a codeword with only a single bit to the symbol that occurs most often, and correspondingly longer codewords to symbols that occur less often.


If we substitute the codes for each symbol, we will use 106 bits to encode the entire sequence.

As there are 41 symbols in the sequence, this works out to approximately 2.58 bits per symbol.

This means we have obtained a compression ratio of 1.16:1.
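The arithmetic behind these figures can be checked with a few lines of Python. The numbers are taken from the text; the variable names are assumptions.

```python
total_bits = 106           # bits used by the variable-length code (from the text)
num_symbols = 41           # symbols in the sequence (from the text)
fixed_bits_per_symbol = 3  # fixed-length code for eight symbols

# Average codeword length actually spent per symbol.
avg_bits = total_bits / num_symbols

# Compression ratio: fixed-length cost divided by variable-length cost.
ratio = (fixed_bits_per_symbol * num_symbols) / total_bits

print(f"{avg_bits:.3f} bits/symbol, ratio {ratio:.2f}:1")
```

The fixed-length code would need 3 × 41 = 123 bits, so the ratio is 123/106, or about 1.16:1.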


Information

Amount of information

Entropy

Maximum entropy

Condition for maximum entropy


Compression is achieved by removing data redundancy while preserving information content.

Entropy is the measure of the information content of a message (a group of bytes).

Messages with higher entropy carry more information than messages with lower entropy.

Data with low entropy permit a larger compression ratio than data with high entropy.


How to determine the entropy

Find the probability p(x) of symbol x in the message

The entropy H(x) of the symbol x is:

H(x) = −p(x) · log2 p(x)

The average entropy over the entire message is the sum of the entropies of all n distinct symbols in the message: H = −Σ p(x) · log2 p(x), where the sum runs over the n distinct symbols x.
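A minimal Python sketch of this procedure follows. It estimates p(x) from symbol counts in the message, which is an assumption about how the probabilities are obtained; the function name and the sample strings are hypothetical.

```python
from collections import Counter
from math import log2


def message_entropy(message):
    """Average entropy in bits per symbol, with p(x) estimated from counts."""
    counts = Counter(message)
    total = len(message)
    entropy = 0.0
    for count in counts.values():
        p = count / total          # relative frequency of this symbol
        entropy -= p * log2(p)     # add this symbol's term -p(x) * log2 p(x)
    return entropy


low = message_entropy("aaaaaaab")   # one symbol dominates
high = message_entropy("abcdefgh")  # eight equally likely symbols
print(low, high)
```

The skewed message gives a much lower value than the uniform one, which reaches 3 bits per symbol for eight equally likely symbols. This matches the earlier remark that data with low entropy permit a larger compression ratio.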
