
Page 1: Lec5 Compression

Compression

For sending and storing information

Text, audio, images, videos

Page 2: Lec5 Compression

Common Applications

• Text compression

– lossless; gzip uses Lempel-Ziv coding, 3:1 compression

– better than Huffman

• Audio compression

– lossy, mpeg 3:1 to 24:1 compression

– MPEG = Moving Picture Experts Group

• Image compression

– lossy, jpeg 3:1 compression

– JPEG = Joint Photographic Experts Group

• Video compression

– lossy, mpeg 27:1 compression

Page 3: Lec5 Compression

Text Compression

• Prefix code: one of many approaches

– no code is a prefix of any other code

– constraint: lossless

– tasks

• encode: text (string) -> code

• decode: code -> text

– main goal: maximally reduce storage, measured by compression ratio

– minor goals:

• simplicity

• efficiency: time and space

– some methods require a code dictionary or two passes over the data

Page 4: Lec5 Compression

Simplest Text Encoding

• Run-length encoding

• Requires special character, say @

• Example Source:

– ACCCTGGGGGAAAACCCCCC

• Encoding:

– A@C3T@G5@A4@C6

• Method

– any run of 3 or more identical characters is replaced by @char#

• +: simple

• -: special characters, non-optimal
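A minimal Python sketch of this scheme (the function name and details are illustrative, not part of the original slides):

    def rle_encode(text, escape="@"):
        """Replace runs of 3+ identical characters with escape+char+count."""
        out = []
        i = 0
        while i < len(text):
            j = i
            while j < len(text) and text[j] == text[i]:
                j += 1                      # extend the current run
            run = j - i
            if run >= 3:
                out.append(f"{escape}{text[i]}{run}")
            else:
                out.append(text[i] * run)   # short runs stay literal
            i = j
        return "".join(out)

    print(rle_encode("ACCCTGGGGGAAAACCCCCC"))  # A@C3T@G5@A4@C6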

Page 5: Lec5 Compression

Shannon’s Information Theory (1948): How well can we encode?

• Shannon’s goal: reduce size of messages for improved communication

• What messages would be easiest/hardest to send?

– Random bits are hardest: no redundancy or pattern to exploit

• Formal definition: S, a set of symbols s_i, each occurring with probability p_i

• Information content of S = -sum p_i*log(p_i)

– measure of randomness

– more random, less predictable, higher information content!

• Theorem: it is the only measure with several natural properties

• Information is not knowledge

• Compression relies on finding regularities or redundancies.

Page 6: Lec5 Compression

Example

• Send ACTG each occurring 1/4 of the time

• Code: A--00, C--01, T--10, G--11

• 2 bits per letter: no surprise

• Average message length:

– prob(A)*codelength(A) + prob(C)*codelength(C) + …

– 1/4*2+…. = 2 bits.

• Now suppose:

– prob(A) = 13/16 and other 1/16

– Codes: A - 1; C-00, G-010, T-011 (prefix)

– 13/16*1 + 1/16*2 + 1/16*3 + 1/16*3 = 21/16 ≈ 1.31 bits

• What is best result? Part of the answer:

• The information content! But how to get it?
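One way to check these numbers is a short Python computation of the average code length and the information content of the skewed distribution:

    from math import log2

    probs = {"A": 13/16, "C": 1/16, "G": 1/16, "T": 1/16}
    codes = {"A": "1", "C": "00", "G": "010", "T": "011"}

    avg_len = sum(p * len(codes[s]) for s, p in probs.items())
    entropy = -sum(p * log2(p) for p in probs.values())

    print(avg_len)   # 1.3125 = 21/16 bits per symbol
    print(entropy)   # ~0.993 bits: the theoretical lower bound, so this code is close but not optimal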

Page 7: Lec5 Compression

Understanding Entropy/Information

• Suppose a set S is divided into k classes

• Let ni be the number of elements in class i

• Let N be the sum of all ni.

• Let pi be ni/N (the frequency of class i)

• Entropy(S) = -p1*log(p1) - p2*log(p2) - … - pk*log(pk).

• Note if k = 2, same as before.

• If all classes are equally likely (p_i = 1/k), then

– Entropy(S) = -1/k*log(1/k) - … = -log(1/k) = log(k)

– If k = power of 2, then this is number of bits to distinguish all classes

• If one class has probability 1, then

– Entropy(S) = -0*log(0) - … - 1*log(1) = 0 (taking 0*log(0) = 0)

– Set isn’t mixed up at all.

• Intuitively, entropy gives the right answers.

• Learning Hint: To understand equations, try special cases.
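Following that hint, a quick Python check of the special cases above:

    from math import log2

    def entropy(ps):
        return -sum(p * log2(p) for p in ps if p > 0)   # convention: 0*log(0) = 0

    print(entropy([1/8] * 8))    # 3.0 = log(8): bits needed to distinguish 8 equal classes
    print(entropy([1.0]))        # 0.0: one class with probability 1, no mixing
    print(entropy([0.5, 0.5]))   # 1.0: the two-class case from before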

Page 8: Lec5 Compression

The Shannon-Fano Algorithm

• Earliest algorithm: Heuristic divide and conquer

• Illustration: source text with only letters ABCDE

Symbol   A    B    C    D    E
------------------------------
Count    15   7    6    6    5

• Intuition: frequent letters get short codes

• 1. Sort symbols by decreasing frequency/probability, giving A B C D E.

• 2. Recursively divide into two parts, each with approx. same number of counts.

• This is an instance of the “balanced partition” problem, which is NP-complete.

• Note: variable length codes.
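A possible Python sketch of the algorithm; this version greedily picks the split point that best balances the two halves rather than solving the NP-complete partition exactly:

    def shannon_fano(symbols):
        """symbols: list of (symbol, count) sorted by decreasing count.
        Returns {symbol: code string}."""
        if len(symbols) == 1:
            return {symbols[0][0]: ""}
        total = sum(count for _, count in symbols)
        running, best_split, best_diff = 0, 1, float("inf")
        for i in range(len(symbols) - 1):
            running += symbols[i][1]
            diff = abs(total - 2 * running)          # |left count - right count|
            if diff < best_diff:
                best_diff, best_split = diff, i + 1
        codes = {}
        for sym, code in shannon_fano(symbols[:best_split]).items():
            codes[sym] = "0" + code                  # left half gets 0
        for sym, code in shannon_fano(symbols[best_split:]).items():
            codes[sym] = "1" + code                  # right half gets 1
        return codes

    print(shannon_fano([("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]))
    # {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}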

Page 9: Lec5 Compression

Shannon-Fano Tree

[Tree diagram, illustrating the prefix property: each internal node branches 0 (left) and 1 (right); the leaves read a = 00, b = 01, c = 10, d = 110, e = 111]

Page 10: Lec5 Compression

Result for this distribution

Symbol   Count   -log(p)   Code   Bits (count × code length)
------------------------------------------------------------
A        15      1.38      00     30
B        7       2.48      01     14
C        6       2.70      10     12
D        6       2.70      110    18
E        5       2.96      111    15

TOTAL (# of bits): 89

average message length = 89/39 ≈ 2.28 bits per symbol

Note: Prefix property for decoding

Can you do better?

Theoretical optimum = -sum p_i*log(p_i) = entropy ≈ 2.19 bits here, versus the 2.28 achieved

Page 11: Lec5 Compression

Code Tree Method/Analysis

• Binary tree method

• Internal nodes have left/right references:

– 0 means go to the left

– 1 means go to the right

• Leaf nodes store the value

• Decode time-cost is O(log N) per character (one step per code bit)

• Decode space-cost is O(N)

– quick argument: number of leaves > number of internal nodes.

– Proof: induction on the number of internal nodes.

• Prefix property: no code is a prefix of another, so each code word uniquely identifies its char.
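A minimal Python sketch of this decode method (the Node/decode names are illustrative):

    class Node:
        def __init__(self, char=None, left=None, right=None):
            self.char, self.left, self.right = char, left, right   # leaf iff char is set

    def decode(root, bits):
        out, node = [], root
        for b in bits:
            node = node.left if b == "0" else node.right   # 0 = left, 1 = right
            if node.char is not None:                      # reached a leaf
                out.append(node.char)
                node = root                                # prefix property: restart at root
        return "".join(out)

    # Tree for the Shannon-Fano codes A=00, B=01, C=10, D=110, E=111
    root = Node(left=Node(left=Node("A"), right=Node("B")),
                right=Node(left=Node("C"),
                           right=Node(left=Node("D"), right=Node("E"))))
    print(decode(root, "0001110111"))   # ABDE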

Page 12: Lec5 Compression

Code Encode(character)

• Again can use binary prefix tree

• For encode and decode could use hashing

– yields O(1) encode/decode time

– O(N) space cost (N is the size of the alphabet)

• For compression, main goal is reducing storage size

– in the example it’s the total number of bits

– code size for single character = depth of tree

– code size for document = sum of (frequency of char * depth of character)

– different trees yield different storage efficiency

– What’s the best tree?
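A Python sketch of hash-based encoding and the storage-cost sum, reproducing the 89-bit total from the Shannon-Fano table (function names are illustrative):

    from collections import Counter

    def encode(text, codes):
        return "".join(codes[ch] for ch in text)    # O(1) lookup per character

    def storage_cost(text, codes):
        freq = Counter(text)
        return sum(n * len(codes[ch]) for ch, n in freq.items())   # frequency * depth

    codes = {"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}
    doc = "A" * 15 + "B" * 7 + "C" * 6 + "D" * 6 + "E" * 5
    print(storage_cost(doc, codes))   # 89 bits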

Page 13: Lec5 Compression

Huffman Code

• Provably optimal: i.e. yields minimum storage cost

• Algorithm: CodeTree huff(document)

1. Compute the frequency of, and create a leaf node for, each char

• leaf node has a count field and a character

2. Remove the 2 nodes with the least counts and create a new node whose count is the sum of their counts and whose sons are the removed nodes.

• internal node has 2 node ptrs and count field

3. Repeat 2 until only 1 node left.

4. That’s it!
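A Python sketch of the algorithm using a heap for the repeated minimum extraction; the exact codes can differ from the figures when counts tie, but the total cost comes out the same:

    import heapq, itertools

    def huff(freqs):
        """freqs: {char: count}. Returns {char: code}."""
        tiebreak = itertools.count()          # breaks count ties in the heap
        heap = [(count, next(tiebreak), char) for char, count in freqs.items()]
        heapq.heapify(heap)
        codes = {char: "" for char in freqs}
        while len(heap) > 1:
            c1, _, left = heapq.heappop(heap)    # the two least counts...
            c2, _, right = heapq.heappop(heap)
            for ch in left:                      # ...all leaves under the merged
                codes[ch] = "0" + codes[ch]      # node gain one bit
            for ch in right:
                codes[ch] = "1" + codes[ch]
            # merged node is represented as the string of its leaf chars
            heapq.heappush(heap, (c1 + c2, next(tiebreak), left + right))
        return codes

    freqs = {"a": 10, "e": 15, "i": 12, "s": 3, "t": 4}
    codes = huff(freqs)
    print(sum(freqs[c] * len(codes[c]) for c in freqs))   # 95 bits, versus 128 before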

Page 14: Lec5 Compression

Bad code example

char   code   frequency   bits
------------------------------
a      000    10          30
e      001    15          45
i      010    12          36
s      011    3           9
t      10     4           8

Total bits: 128

Page 15: Lec5 Compression

Tree, a la Huffman

[Tree diagram. Repeat: merge the two lowest-frequency nodes. Leaves 3 (s) and 4 (t) merge into 7; 7 and 10 (a) merge into 17; 15 (e) and 12 (i) merge into 27; 17 and 27 merge into the root, 44.]

Page 16: Lec5 Compression

Tree with codes: note Prefix property

[Tree diagram labeled freq/code/char at each leaf: a = 10/00, s = 3/010, t = 4/011, e = 15/10, i = 12/11; internal counts 7, 17, 27 and root 44. Note the prefix property.]

Page 17: Lec5 Compression

Tree Cost

[Tree diagram labeled freq/depth/bits at each leaf: a = 10/2/20, s = 3/3/9, t = 4/3/12, e = 15/2/30, i = 12/2/24; internal counts 7, 17, 27 and root 44. Total bits: 95 (versus 128 before).]

Page 18: Lec5 Compression

Analysis

• Intuition: least frequent chars get longest codes or most frequent chars get shortest codes.

• Let T be a minimal code tree. (Induction)

– All internal nodes have 2 sons (by construction)

– Lemma: if c1 and c2 are the least frequently used chars, then they sit at the deepest depth

• Proof:

– if they were not at the deepest depth, exchanging them with deeper nodes would reduce the total cost (number of bits), contradicting minimality

Page 19: Lec5 Compression

Analysis (continued)

• Sk: the Huffman algorithm on k chars produces an optimal code.

– S2: obvious

– Sk => Sk+1

• Let T be an optimal code on k+1 chars

• By the lemma, the two least frequent chars are at the deepest depth

• Replace the two least frequent chars by a single new char whose frequency is their sum

• This gives a problem on k chars; expanding the new char back into its two sons adds the same amount (the sum of the two least frequencies) to the cost of any tree

• By induction, Huffman yields an optimal tree.

Page 20: Lec5 Compression

Lempel-Ziv

• Input: string of characters

• Internal: dictionary of (codewords, words)

• Output: string of codewords and characters.

• Codewords are distinct from characters.

• In the algorithm, w is a string, c is a character, and w+c means concatenation.

• When adding a new word to the dictionary, a new code word needs to be assigned.

Page 21: Lec5 Compression

Lempel-Ziv Algorithm

w = NIL;
while ( read a character c )
{
    if w+c exists in the dictionary
        w = w+c;
    else
    {
        add w+c to the dictionary;
        output the code for w;
        w = c;
    }
}
output the code for w;  /* flush the final match */
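The same algorithm as runnable Python, adopting the common LZW convention of integer codewords and a dictionary pre-seeded with single characters (the slides leave codeword assignment abstract):

    def lzw_encode(text):
        dictionary = {chr(i): i for i in range(256)}   # single chars get codes 0..255
        next_code = 256
        w, out = "", []
        for c in text:
            if w + c in dictionary:
                w = w + c                      # keep growing the current match
            else:
                out.append(dictionary[w])      # emit code for longest known word
                dictionary[w + c] = next_code  # new word gets the next codeword
                next_code += 1
                w = c
        if w:
            out.append(dictionary[w])          # flush the final match
        return out

    print(lzw_encode("ABABABA"))   # [65, 66, 256, 258]: repeats collapse to single codes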

Page 22: Lec5 Compression

Adaptive Encoding

• Webster’s has 157,000 entries: each could be encoded in a fixed 18 bits (2^18 = 262,144 > 157,000)

– but only works for this document

– Don’t want to do two passes

• Adaptive Huffman

– modify model on the fly

• Lempel-Ziv 1977 (LZ77)

• LZW: Lempel-Ziv-Welch

– 1984 used in compress (UNIX)

– uses dictionary method

– variable number of symbols to fixed length code

– better with large documents: finds repetitive patterns

Page 23: Lec5 Compression

Audio Compression

• Sounds can be represented as a vector valued function

• At any point in time, a sound is a combination of different frequencies of different strengths

• For example, each note on a piano yields a specific frequency.

• Also, our ears, like pianos, have cilia that respond to specific frequencies.

• Just as sin(x) can be approximated by a small number of terms, e.g. x - x^3/6 + x^5/120 - …, so can sound.

• Transforming a sound into its “spectrum” is done mathematically by a Fourier transform.

• The spectrum can be played back, as on a computer with a sound card.
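A NumPy sketch of this idea on a synthetic two-tone “sound”; the signal and the keep-two-components choice are illustrative:

    import numpy as np

    rate = 8000                                   # samples per second
    t = np.arange(rate) / rate                    # one second of time points
    sound = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)

    spectrum = np.fft.rfft(sound)                 # Fourier transform -> spectrum
    strongest = np.argsort(np.abs(spectrum))[-2:]
    print(strongest)                              # [660 440]: the two tones

    compressed = np.zeros_like(spectrum)          # drop everything else (lossy)
    compressed[strongest] = spectrum[strongest]
    approx = np.fft.irfft(compressed)             # the spectrum can be played back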

Page 24: Lec5 Compression

Audio

• Using many frequencies, as in CDs, yields a good approximation; using few frequencies, as in telephones, yields a poor one

• Sampling frequencies yields compression ratios between 6:1 and 24:1, depending on the sound and the quality desired

• High-priced electronic pianos store and reuse “samples” of concert pianos

• Low-pass filter: removes/reduces high frequencies (losing the highs is a common problem with aging ears)

• High-pass filter: removes/reduces low frequencies

• Can use differential methods:

– only report change in sounds

Page 25: Lec5 Compression

Image Compression

• with or without loss, mostly with

– who cares about what the eye can’t see

• Black and white images can be regarded as functions from the plane (R^2) into the reals (R), as in old TVs

– positions vary continuously, but our eyes can’t see the discreteness beyond about 100 pixels per inch.

• Color images can be regarded as functions from the plane into R^3, the RGB space.

– Colors vary continuously, but our eyes sample colors with only 3 different receptors (RGB)

• Mathematical theory yields close approximations

– there are spatial analogues to Fourier transforms

Page 26: Lec5 Compression

Image Compression

• face images can be compressed with eigenfaces

– images can be regarded as points in R^(big)

– choose good bases and use most important vectors

– i.e. approximate with fewer dimensions (see the sketch below)

– JPEG, MPEG, GIF are compressed image formats
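A NumPy sketch of the eigenface idea; random data stands in for real face images here:

    import numpy as np

    faces = np.random.rand(100, 64 * 64)          # 100 "images" as points in R^4096
    mean = faces.mean(axis=0)

    # SVD finds good bases; keep only the k most important vectors
    U, S, Vt = np.linalg.svd(faces - mean, full_matrices=False)
    k = 20
    eigenfaces = Vt[:k]                           # k basis "images"
    coords = (faces - mean) @ eigenfaces.T        # k numbers per face, not 4096

    approx = coords @ eigenfaces + mean           # reconstruction from few dimensions
    print(coords.shape)                           # (100, 20): ~200x fewer numbers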

Page 27: Lec5 Compression

Video Compression

• Uses DCT (discrete cosine transform)

– Note: Nice functions can be approximated by

• sum of x, x^2,… with appropriate coefficients

• sum of sin(x), sin(2x),… with right coefficients

• almost any infinite sum of functions

– DCT is good because few terms give good results on images (see the sketch at the end of this page).

– Differential methods used:

• only report changes in video
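A NumPy sketch of the DCT point above: transform a smooth signal, keep a few terms, and reconstruct (dct2/idct2 are hand-rolled from the defining sums to stay self-contained):

    import numpy as np

    def dct2(x):
        """DCT-II of a 1-D signal, straight from the defining sum."""
        N = len(x)
        n = np.arange(N)
        return np.array([np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                         for k in range(N)])

    def idct2(c):
        """Inverse of dct2 (a scaled DCT-III)."""
        N = len(c)
        k = np.arange(1, N)
        return np.array([(c[0] / 2 +
                          np.sum(c[1:] * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))) * 2 / N
                         for n in range(N)])

    row = np.linspace(0, 1, 32) ** 2        # a smooth "row of pixels"
    coeffs = dct2(row)
    coeffs[8:] = 0                          # keep only the first 8 of 32 terms
    print(np.max(np.abs(idct2(coeffs) - row)))   # small: few terms suffice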

Page 28: Lec5 Compression

Summary

• Issues:

– Context: what problem are you solving and what is an acceptable solution.

– evaluation: compression ratios

– fidelity, if lossy

• approximation, quantization, transforms, differential

– adaptive, if encoding on the fly, e.g. movies, TV

– Different sources yield different best approaches

• cartoons versus cities versus outdoors

– code book separate or not

– fixed or variable length codes