
ECE 566 - INFORMATION THEORY
Prof. Thinh Nguyen

1

LOGISTICS

Textbook: Elements of Information Theory, Thomas Cover and Joy Thomas, Wiley, second edition

Class email list: ece566-w20@engr.orst.edu

2

LOGISTICS

Grading Policy

Quiz 1: 15% Quiz 2: 15% Quiz 3: 15% Quiz 4: 15%

Lowest score quiz will be dropped

Homework: 15% Final: 40%

3

WHAT IS INFORMATION?

Information theory deals with the concept of information, its measurement, and its applications.

4

ORIGIN OF INFORMATION THEORY

Two schools of thought:

British
Semantic information: related to the meaning of messages
Pragmatic information: related to the usage and effect of messages

American
Syntactic information: related to the symbols from which messages are composed, and their interrelations

5

ORIGIN OF INFORMATION THEORY

Consider the following sentences:
1. Jacqueline was awarded the gold medal by the judges at the national skating competition.
2. The judges awarded Jacqueline the gold medal at the national skating competition.
3. There is a traffic jam on I-5 between Corvallis and Eugene in Oregon.
4. There is a traffic jam on I-5 in Oregon.

(1) and (2): syntactically different but semantically and pragmatically identical.

(3) and (4): syntactically and semantically different; (3) gives more precise information than (4).

(3) and (4) are irrelevant for Californians. 6

ORIGIN OF INFORMATION THEORY

The British tradition is closely related to philosophy, psychology, and biology. Influential scientists include MacKay, Carnap, Bar-Hillel, … Very hard, if not impossible, to quantify.

The American tradition deals with the syntactic aspects of information. Influential scientists include Shannon, Renyi, Gallager, and Csiszar, … Can be rigorously quantified by mathematics.

7

ORIGIN OF INFORMATION THEORY

Basic questions of information theory in the American tradition involve the measurement of syntactic information

The fundamental limits on the amount of information which can be transmitted

The fundamental limits on the compression of information which can be achieved

How to build information processing systems approaching these limits?

8

ORIGIN OF INFORMATION THEORY

H. Nyquist (1924) published an article wherein he discussed how messages (characters) could be sent over a telegraph channel with the maximum possible speed, but without distortion.

R. Hartley (1928) first defined a measure of information as the logarithm of the number of distinguishable messages one can represent using n characters, each of which can take s possible letters.

9

ORIGIN OF INFORMATION THEORY

For a message of length n, Hartley's measure of information is:

$$H_H(s^n) = \log(s^n) = n \log(s)$$

For a message of length 1, Hartley's measure of information is:

$$H_H(s^1) = \log(s) = 1 \cdot \log(s)$$

This definition corresponds with the intuitive idea that a message consisting of n symbols contains n times as much information as a message consisting of only one symbol.

10

ORIGIN OF INFORMATION THEORY

Note that any function satisfying

$$f(s^n) = n f(s)$$

would satisfy our intuition. One can show that the only functions satisfying this equation are of the form:

$$f(s) = \log_a(s)$$

The choice of a is arbitrary, and is more a matter of normalization.

11

AND NOW…, SHANNON'S INFORMATION THEORY
(a.k.a. communication theory, mathematical information theory, or, in short, information theory)

12

CLAUDE SHANNON

“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.” (Claude Shannon 1948)

Channel Coding Theorem: It is possible to achieve near perfect communication

of information over a noisy channel.

In this course we will:
Define what we mean by information
Show how we can compress the information in a source to its theoretically minimum value, and show the tradeoff between data compression and distortion
Prove the Channel Coding Theorem and derive the information capacity of different channels

1916-2001

13

$$H(X) = E[-\log_2 p(X)] = -\sum_{x \in \mathcal{A}} p(x) \log_2 p(x)$$

The little formula that starts it all

14
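As a quick illustration (not part of the original slides), here is a minimal Python sketch of this formula; the distributions used are arbitrary examples.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum p(x) log p(x), ignoring zero-probability outcomes."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin carries 1 bit of entropy; a biased one carries less.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # ~0.469
```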

SOME FUNDAMENTAL QUESTIONS

What is the minimum number of bits that can be used to represent the following texts?

(that can be addressed by information theory)

"There is a wisdom that is woe; but there is a woe that is madness. And there is a Catskill eagle in some souls that can alike dive down into the blackest gorges, and soar out of them again and become invisible in the sunny spaces. And even if he for ever flies within the gorge, that gorge is in the mountains; so that even in his lowest swoop the mountain eagle is still higher than other birds upon the plain, even though they soar."

- Moby Dick, Herman Melville

15

SOME FUNDAMENTAL QUESTIONS

How fast can information be sent reliably over an error prone channel?

(that can be addressed by information theory)

"Thera is a wisdem that is woe; but thera is a woe that is madness. And there is a Catskill eagle in sme souls that can alike divi down into the blackst goages, and soar out of thea agein and become invisble in the sunny speces. And even if he for ever flies within the gorge, that gorge is in the mountains; so that even in his lotest swoap the mountin eagle is still highhr than other berds upon the plain, even though they soar."

- Moby Dick, Herman Melville

16

FROM QUESTIONS TO APPLICATIONS

17

SOME NOT SO FUNDAMENTAL QUESTIONS(that can be addressed by information theory)

How to make money? (Kelly’s Portfolio Theory)

What is the optimal dating strategy?

18

FUNDAMENTAL ASSUMPTION OFSHANNON’S INFORMATION THEORY

E.g., Newtonian universe is deterministic, all events can be predicted with 100% accuracy by Newton’s laws, given the initial conditions.

[Diagram: UNIVERSE → Physical Model (Laws)]

If there is no irreducible randomness, the description/history of our universe can therefore be compressed into a few equations.

19

19

FUNDAMENTAL ASSUMPTION OFSHANNON’S INFORMATION THEORY

Some well-defined stochastic process that generates all the observations

[Diagram: UNIVERSE → Probabilistic model]

Thus, in a certain sense, Shannon information theory is somewhat unsatisfactory since it is based on an "ignorance model." Nevertheless, it is a good approximation, depending on how accurately the probabilistic model describes the phenomenon of interest. 20

FUNDAMENTAL ASSUMPTION OFSHANNON’S INFORMATION THEORY

"There is a wisdom that is woe; but there is a woe that is madness. And there is a Catskill eagle in some souls that can alike dive down into the blackest gorges, and soar out of them again and become invisible in the sunny spaces. And even if he for ever flies within the gorge, that gorge is in the mountains; so that even in his lowest swoop the mountain eagle is still higher than other birds upon the plain, even though they soar."

- Moby Dick, Herman Melville

We know the text above is not random, but we can build a probabilistic model for it.

For example, a first-order approximation could be to build a histogram of the different letters in the text, then approximate the probability that a particular letter appears based on the histogram (assuming i.i.d. letters).

21
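A minimal sketch of this first-order (i.i.d. letter) approximation, not from the slides; `passage` stands in for the quoted Moby Dick text.

```python
from collections import Counter
import math

passage = "There is a wisdom that is woe; but there is a woe that is madness. ..."  # the quoted text

# First-order model: estimate letter probabilities from the histogram, assume i.i.d. letters.
counts = Counter(passage)
total = sum(counts.values())
probs = {ch: c / total for ch, c in counts.items()}

# Entropy of the fitted model, in bits per character.
H = -sum(p * math.log2(p) for p in probs.values())
print(f"first-order model entropy: {H:.2f} bits/char")
print(f"=> roughly {H * len(passage):.0f} bits to represent the passage under this model")
```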

WHY IS SHANNON INFORMATION SOMEWHAT UNSATISFACTORY?

How much Shannon information does this picture contain?

22

ALGORITHMIC INFORMATION THEORY(KOLMOGOROV COMPLEXITY)

The Kolmogorov complexity K(x) of a sequence x is the minimum size of the program and its inputs needed to generate x.

Example: If x was a sequence of all ones, a highly compressible sequence, the program would simply be a print statement in a loop. On the other extreme, if x were a random sequence with no structure, then the only program that could generate it would contain the sequence itself. 23

REVIEW OF BASIC PROBABILITY

A discrete random variable X takes a value x from the alphabet A with probability p(x).

24

EXPECTED VALUE

If g(x) is real valued and defined on A then

$$E_X[g(X)] = \sum_{x \in \mathcal{A}} p(x)\, g(x)$$

25

SHANNON INFORMATION CONTENT

The Shannon Information Content of an outcome with probability p is –log2p

26

MINESWEEPER

Where is the bomb? 16 possibilities - needs 4 bits to specify

27

ENTROPY

$$H(X) = E[-\log_2 p(X)] = -\sum_{x \in \mathcal{A}} p(x) \log_2\big(p(x)\big)$$

28

ENTROPY EXAMPLES

Bernoulli Random Variable

29

ENTROPY EXAMPLES

Four Colored Shapes: A = [four colored shapes (images)]

p(X) = [1/2; 1/4; 1/8; 1/8]

30
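A short sketch (not from the slides) verifying the two examples above: the Bernoulli entropy and the four-shape distribution.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Bernoulli(q): entropy is maximized (1 bit) at q = 0.5.
for q in (0.1, 0.3, 0.5):
    print(f"Bernoulli({q}): H = {H([q, 1 - q]):.3f} bits")

# Four colored shapes with p(X) = [1/2, 1/4, 1/8, 1/8]: H = 1.75 bits.
print("four shapes:", H([1/2, 1/4, 1/8, 1/8]), "bits")
```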

DERIVATION OF SHANNON ENTROPY

$$H(X) = \sum_{i=1}^{n} p_i \log \frac{1}{p_i}$$

Intuitive Requirements:

1. We want H to be a continuous function of probabilities pi . That is, a small change in pi should only cause a small change in H

2. If all events are equally likely, that is, pi = 1/n for all i, then H should be a monotonically increasing function of n. The more possible outcomes there are, the more information should be contained in the occurrence of any particular outcome.

3. It does not matter how we divide the group of outcomes, the total of information should be the same.

31

DERIVATION OF SHANNON ENTROPY

32

DERIVATION OF SHANNON ENTROPY

33

DERIVATION OF SHANNON ENTROPY

34

DERIVATION OF SHANNON ENTROPY

35

LITTLE PUZZLE

Given twelve coins, eleven of which are identical, and one of which is either lighter or heavier than the rest. You have a balance scale that can determine whether what you put on the left pan is heavier than, lighter than, or the same as what you put on the right pan. What is the minimum number of weighings needed to determine the odd coin, and whether it is heavier or lighter than the rest?

36

JOINT ENTROPY H(X,Y)

$$H(X,Y) = -E[\log p(X,Y)]$$

p(X,Y)   Y = 0   Y = 1
X = 0     1/2     1/4
X = 1      0      1/4

p(X,Y)   Y = 0   Y = 1
X = 0     1/4     1/4
X = 1     1/4     1/4

37

JOINT ENTROPY H(X,Y)

Example: $Y = X^2$, $X \in \{-1, 1\}$, $p_X = [0.1,\ 0.9]$

p(X,Y)   Y = 1
X = -1    0.1
X = 1     0.9

38

CONDITIONAL ENTROPY H(Y|X)

$$H(Y|X) = -E[\log p(Y|X)]$$

p(Y|X)   Y = 0   Y = 1
X = 0     2/3     1/3
X = 1      0       1

39

CONDITIONAL ENTROPY H(Y|X)

Example: $Y = X^2$, $X \in \{-1, 1\}$, $p_X = [0.1,\ 0.9]$

p(Y|X)   Y = 1
X = -1     1
X = 1      1

p(X|Y)   Y = 1
X = -1    0.1
X = 1     0.9

40

INTERPRETATIONS OF CONDITIONAL ENTROPY H(Y|X)

H(Y|X): the average uncertainty/information in Y when you know X

H(Y|X): the weighted average of row entropies

p(X,Y)   Y = 0   Y = 1   H(Y|X = x)   p(X = x)
X = 0     1/2     1/4     H(1/3)        3/4
X = 1      0      1/4     H(1)          1/4

41
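A small sketch (not from the slides) computing H(Y|X) for the joint table above in two equivalent ways: the direct expectation and the weighted average of row entropies.

```python
import math

# Joint distribution p(X, Y) from the slide: rows X = 0, 1; columns Y = 0, 1.
p = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 0.0, (1, 1): 1/4}

px = {x: sum(v for (xx, _), v in p.items() if xx == x) for x in (0, 1)}

# Direct definition: H(Y|X) = -E[log p(Y|X)].
H_Y_given_X = -sum(v * math.log2(v / px[x]) for (x, y), v in p.items() if v > 0)

# Weighted average of row entropies: sum_x p(x) H(Y | X = x).
def row_entropy(x):
    row = [p[(x, y)] / px[x] for y in (0, 1) if p[(x, y)] > 0]
    return -sum(q * math.log2(q) for q in row)

weighted = sum(px[x] * row_entropy(x) for x in (0, 1))
print(H_Y_given_X, weighted)   # both ~0.689 bits
```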

CHAIN RULES

For two RVs,

$$H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$

Proof:

In general,

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$$

Proof:

42

GRAPHICAL VIEW OF ENTROPY, JOINTENTROPY, AND CONDITIONAL ENTROPY

[Venn diagram: H(X,Y) contains H(X) and H(Y); H(X|Y) and H(Y|X) are the non-overlapping parts]

$$H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$

43

MUTUAL INFORMATION

The mutual information is the average amount of information that you get about X from observing the value of Y

$$I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)$$

[Venn diagram: I(X;Y) is the overlap of H(X) and H(Y) inside H(X,Y); H(X|Y) and H(Y|X) are the non-overlapping parts]

44

MUTUAL INFORMATION

The mutual information is symmetrical

$$I(X;Y) = I(Y;X)$$

Proof:

45

MUTUAL INFORMATION EXAMPLE

If you try to guess Y, you have a 50% chance of being correct.

However, what if you know X? Best guess: choose Y = X. What is the overall probability of guessing correctly?

p(X,Y)   Y = 0   Y = 1
X = 0     1/2     1/4
X = 1      0      1/4

H(X,Y) = 1.5
H(X) = 0.811, H(Y) = 1
H(X|Y) = 0.5, H(Y|X) = 0.689
I(X;Y) = 0.311

46
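A sketch (not from the slides) reproducing the numbers on this slide from the joint table.

```python
import math

p = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 0.0, (1, 1): 1/4}  # p(X, Y) from the slide

def H(dist):
    return -sum(q * math.log2(q) for q in dist if q > 0)

px = [sum(v for (x, _), v in p.items() if x == i) for i in (0, 1)]  # [3/4, 1/4]
py = [sum(v for (_, y), v in p.items() if y == j) for j in (0, 1)]  # [1/2, 1/2]

H_XY = H(p.values())            # 1.5
H_X, H_Y = H(px), H(py)         # 0.811, 1.0
I_XY = H_X + H_Y - H_XY         # 0.311
print(H_XY, H_X, H_Y, I_XY)
print("H(X|Y) =", H_XY - H_Y, " H(Y|X) =", H_XY - H_X)
```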

CONDITIONAL MUTUAL INFORMATION

$$I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z)$$

Note: Z conditioning applies to both X and Y

47

CHAIN RULE FOR MUTUAL INFORMATION

$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, \ldots, X_1)$$

Proof:

48

EXAMPLE OF USING CHAIN RULE FORMUTUAL INFORMATION

[Diagram: Y = X + Z]

Find I(X, Z; Y).

p(X,Z)   Z = 0   Z = 1
X = 0     1/4     1/4
X = 1     1/4     1/4

49

49

CONCAVE AND CONVEX FUNCTIONS

f(x) is strictly convex over (a,b) if

$$f(\lambda u + (1-\lambda)v) < \lambda f(u) + (1-\lambda) f(v), \quad \forall u \neq v \in (a,b),\ 0 < \lambda < 1$$

Examples:

Technique to determine the convexity of a function:

50

JENSEN’S INEQUALITY

If $f(X)$ is convex, then $E[f(X)] \ge f(E[X])$.

If $f(X)$ is strictly convex, then $E[f(X)] > f(E[X])$ (unless X is constant).

Proof:

51

JENSEN’S INEQUALITY EXAMPLE

Mnemonic example: $f(x) = x^2$

52
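A quick numerical check (not from the slides) of Jensen's inequality with the mnemonic $f(x) = x^2$; the sample distribution is arbitrary.

```python
import random

random.seed(0)
xs = [random.gauss(1.0, 2.0) for _ in range(100_000)]  # arbitrary random sample

f = lambda x: x * x                     # convex function
E_fx = sum(f(x) for x in xs) / len(xs)  # E[f(X)]
f_Ex = f(sum(xs) / len(xs))             # f(E[X])

# Jensen: E[f(X)] >= f(E[X]); for f(x) = x^2 the gap is exactly Var(X).
print(E_fx, ">=", f_Ex, "gap ~ Var(X) =", E_fx - f_Ex)
```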

RELATIVE ENTROPY

The relative entropy or Kullback-Leibler divergence between two probability mass functions (vectors) p and q is defined as:

$$D(p \| q) = \sum_{x \in \mathcal{A}} p(x) \log \frac{p(x)}{q(x)} = E_p\left[\log \frac{p(X)}{q(X)}\right] = -E_p[\log q(X)] - H(X)$$

Properties of $D(p\|q)$:

53

RELATIVE ENTROPY EXAMPLE

                             Rain   Cloudy   Sunny
Weather at Seattle    p(x)   1/4    1/2      1/4
Weather at Corvallis  q(x)   1/3    1/3      1/3

$D(p\|q) = ?$

$D(q\|p) = ?$

54
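A sketch (not from the slides) computing both divergences for the Seattle/Corvallis distributions, showing that relative entropy is not symmetric.

```python
import math

def D(p, q):
    """Relative entropy D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/4, 1/2, 1/4]   # Seattle:   Rain, Cloudy, Sunny
q = [1/3, 1/3, 1/3]   # Corvallis: uniform

print("D(p||q) =", D(p, q))   # ~0.085 bits
print("D(q||p) =", D(q, p))   # ~0.082 bits -- not symmetric in general
```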

INFORMATION INEQUALITY

$$D(p\|q) \ge 0$$

Proof:

55

INFORMATION INEQUALITY

Uniform distribution has the highest entropy: $H(X) \le \log|\mathcal{A}|$
Proof:

Mutual information is non-negative: $I(X;Y) \ge 0$
Proof:

56

INFORMATION INEQUALITY

Conditioning reduces entropy: $H(X|Y) \le H(X)$
Proof:

Independence bound: $H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i)$
Proof:

57

INFORMATION INEQUALITY

Conditional independence bound:
$$H(X_1, \ldots, X_n \mid Y_1, \ldots, Y_n) \le \sum_{i=1}^{n} H(X_i \mid Y_i)$$
Proof:

Mutual information independence bound:
$$I(X_1, \ldots, X_n;\ Y_1, \ldots, Y_n) \ge \sum_{i=1}^{n} I(X_i; Y_i)$$
Proof:

58

SUMMARY

59

LECTURE 2: SYMBOL CODE

Symbol codes
Non-singular codes
Uniquely decodable codes
Prefix codes (instantaneous codes)
Kraft inequality
Minimum code length

60

SYMBOL CODES

Mapping of letters from one alphabet to another

ASCII codes: computers do not process English language directly. Instead, English letters are first mapped into bits, then processed by the computer.

Morse codes

Decimal to binary conversion

Process of mapping from letters in one alphabet to letters in another is called coding. 61

SYMBOL CODES

A symbol code is a simple mapping. Formally, a symbol code C is a mapping X → D+

X = the source alphabet
D = the destination alphabet
D+ = the set of all finite strings from D

Example:

C: {X, Y, Z} → {0,1}+, C(X) = 0, C(Y) = 10, C(Z) = 11

Codeword: the result of a mapping, e.g., 0, 10, … Code: a set of valid codewords. 62

SYMBOL CODES

Extension: C+ is a mapping X+ → D+ formed by concatenating C(x_i) without punctuation

Example: C(XXYXZY) = 001001110

Non-singular: x1≠x2 → C(x1) ≠C(x2)

Uniquely decodable: C+ is non-singular. That is, C+(x+) is unambiguous.

63

PREFIX CODES

Instantaneous or Prefix code: No codeword is a prefix of another

Prefix → uniquely decodable → non-singular

Examples:

     C1    C2    C3     C4
W    0     00    0      0
X    11    01    10     01
Y    00    10    110    011
Z    01    11    1110   0111

64

CODE TREE

Form a D-ary tree where D = |D|

D branches at each node
Each branch denotes a symbol d_i in D
Each leaf denotes a source symbol x_i in X
C(x_i) = the concatenation of the symbols d_i along the path from the root to the leaf that represents x_i
Some leaves may be unused

[Figure: binary code tree with leaves W, X, Y, Z]

C(WWYXZ) = ?

65

KRAFT INEQUALITY FOR PREFIX CODES

For any prefix code, we have:

$$\sum_{i=1}^{|\mathcal{X}|} D^{-l_i} \le 1$$

where $l_i$ is the length of codeword i and D = |D|.

Proof:

66
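A sketch (not from the slides) checking the Kraft sum and the prefix condition for the four example codes C1-C4 above; it illustrates that a code can satisfy the Kraft inequality without being prefix-free (C4).

```python
codes = {
    "C1": ["0", "11", "00", "01"],
    "C2": ["00", "01", "10", "11"],
    "C3": ["0", "10", "110", "1110"],
    "C4": ["0", "01", "011", "0111"],
}

def kraft_sum(codewords, D=2):
    """Sum of D^(-l_i) over all codewords; <= 1 is necessary for unique decodability."""
    return sum(D ** -len(w) for w in codewords)

def is_prefix_free(codewords):
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

for name, cw in codes.items():
    print(name, "Kraft sum =", kraft_sum(cw), "prefix-free:", is_prefix_free(cw))
```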

KRAFT INEQUALITY FOR UNIQUELYDECODABLE CODES

For any uniquely decodable code with codewords of lengths $l_1, l_2, \ldots, l_{|\mathcal{X}|}$:

$$\sum_{i=1}^{|\mathcal{X}|} D^{-l_i} \le 1$$

Proof:

67

CONVERSE OF KRAFT INEQUALITY

If $\sum_{i=1}^{|\mathcal{X}|} D^{-l_i} \le 1$, then there exists a prefix code with codeword lengths $l_1, l_2, \ldots, l_{|\mathcal{X}|}$.

Proof:

68

EXAMPLE OF CONVERSE OF KRAFT INEQUALITY

69

OPTIMAL CODE

If l(C(x)) = the length of C(x), then C is optimal if $L_C = E[l(C(X))]$ is as small as possible for a given p(X).

C is a uniquely decodable code → $L_C \ge H(X)/\log_2 D$

Proof:

70

SUMMARY

Symbol code

Kraft inequality for uniquely decodable code

Lower bound for any uniquely decodable code

71

LECTURE 3: PRACTICAL SYMBOL CODES

Fano code
Shannon code
Huffman code

72

FANO CODE

Put the probabilities in decreasing order. Split as close to 50-50 as possible. Repeat with each half.

73

H(p) = 2.81 bits
$L_F$ = 2.89 bits
Not optimal!
$L_{opt}$ = 2.85 bits

CONDITIONS FOR OPTIMAL PREFIX CODE (ASSUMING BINARY PREFIX CODE)

An optimal prefix code must satisfy:

p(x_i) > p(x_j) → l(x_i) ≤ l(x_j) (else swap them)

The two longest codewords must have the same length (else chop a bit off the longer codeword)

In the tree corresponding to the optimum code, there must be two branches stemming from each intermediate node.

74

HUFFMAN CODE CONSTRUCTION

1. Take the two smallest p(xi) and assign each a different last bit. Then merge into a single symbol.

2. Repeat step 1 until only one symbol remains

75

HUFFMAN CODE EXAMPLE

76

HUFFMAN CODE EXAMPLE (CONT)

Letter   Probability   Codeword
x1       0.4
x2       0.2
x3       0.2
x4       0.1
x5       0.1

77
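A minimal heap-based sketch of the two-step construction above (not the slides' worked example; ties may be broken differently, so the codewords can differ from the slides while the average length is the same).

```python
import heapq
from itertools import count

def huffman(probs):
    """Return a codeword per symbol by repeatedly merging the two least-probable nodes."""
    tiebreak = count()  # keeps heap entries comparable when probabilities are equal
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # two smallest probabilities
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {"x1": 0.4, "x2": 0.2, "x3": 0.2, "x4": 0.1, "x5": 0.1}
code = huffman(probs)
L = sum(probs[s] * len(w) for s, w in code.items())
print(code, "average length =", L)   # one optimal code; L = 2.2 bits
```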

HUFFMAN CODE IS AN OPTIMAL PREFIX CODE

78

We want to show that all of these codes (including C5) are optimal.

HUFFMAN OPTIMALITY PROOF

79

HOW SHORT ARE OPTIMAL CODES?

If l(x) = length(C(x)), then C is optimal if $L_C = E[l(X)]$ is as small as possible.

We want to minimize

$$\sum_i p(x_i)\, l(x_i)$$

subject to:

$$\sum_i D^{-l(x_i)} \le 1$$

and the $l(x_i)$ are integers.

80

OPTIMAL CODES (NON-INTEGER LENGTHS)

Minimize

$$\sum_i p(x_i)\, l(x_i)$$

subject to:

$$\sum_i D^{-l(x_i)} \le 1$$

Use the Lagrange multiplier method:

81

SHANNON CODE

Round up the optimal code lengths:

$$l_i = \left\lceil -\log_D p(x_i) \right\rceil$$

The $l_i$ satisfy the Kraft inequality → a prefix code exists. The construction of such a code was done earlier.

Average length:

$$\frac{H(X)}{\log D} \le L_S \le \frac{H(X)}{\log D} + 1$$

Proof:

82

SHANNON CODE EXAMPLES

x1 x2 x3 x4

p(xi) 1/2 1/4 1/8 1/8

83

x1 x2

p(xi) 0.99 0.01

HUFFMAN CODE VS. SHANNON CODE

84

             x1     x2     x3     x4
p(xi)        0.34   0.36   0.25   0.05
−log p(xi)   ≈1.56  ≈1.47  2.00   ≈4.32
Shannon      2      2      2      5
Huffman      2      1      3      3

L_S = 2.15 bits, L_H = 1.94 bits

H(X) = 1.78 bits
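A sketch (not from the slides) computing the Shannon lengths and both average lengths for the distribution in the table above; it shows that an individual Shannon codeword can be much longer than the Huffman one even though both averages stay within one bit of H(X).

```python
import math

p = [0.34, 0.36, 0.25, 0.05]                # p(x1)..p(x4) from the table
shannon = [math.ceil(-math.log2(pi)) for pi in p]
huffman = [2, 1, 3, 3]                      # lengths of an optimal (Huffman) code for p

H = -sum(pi * math.log2(pi) for pi in p)
L_S = sum(pi * li for pi, li in zip(p, shannon))
L_H = sum(pi * li for pi, li in zip(p, huffman))
print("Shannon lengths:", shannon)          # [2, 2, 2, 5]
print(f"H = {H:.2f}, L_S = {L_S:.2f}, L_H = {L_H:.2f}")
```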

SHANNON COMPETITIVE OPTIMALITY

If l(x) is the length of a uniquely decodable code and $l_S(x)$ is the length of the Shannon code, then

$$P\big(l(X) < l_S(X) - c\big) \le 2^{1-c}$$

85

Proof:

DYADIC COMPETITIVE OPTIMALITY

If p(x) is dyadic ↔ $\log_2 p(x_i)$ is an integer for all i, then

$$P\big(l(X) < l_S(X)\big) \le P\big(l_S(X) < l(X)\big)$$

with equality iff $l(X) \equiv l_S(X)$.

Proof:

86

SHANNON WITH WRONG DISTRIBUTION

If the real distribution of X is p(x) but you assign Shannon lengths using the distribution q(x), what is the penalty? Answer: D(p||q)

Proof:

87

SUMMARY

88

LECTURE 4

Stochastic Processes
Entropy Rate
Markov Processes
Hidden Markov Processes

89

STOCHASTIC PROCESS

Stochastic process {X_i} = X_1, X_2, X_3, …

Define the entropy of {X_i} as
H({X_i}) = H(X_1, X_2, X_3, …) = H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1) + …

Define the entropy rate of {X_i} as

$$\bar{H}(X) = \lim_{n \to \infty} \frac{H(X_1, X_2, \ldots, X_n)}{n}$$

The entropy rate estimates the additional entropy per new sample. It gives a lower bound on the number of code bits per sample. If the X_i are not i.i.d., the entropy rate limit may not exist.

90

EXAMPLES

91

LIMIT OF CESARO MEAN

92

If $a_k \to b$ as $k \to \infty$, then

$$\frac{1}{n}\sum_{k=1}^{n} a_k \to b \quad \text{as } n \to \infty$$

Proof:

93

STATIONARY PROCESS

A stochastic process {X_i} is stationary iff

$$P(X_1 = a_1, X_2 = a_2, \ldots, X_n = a_n) = P(X_{1+k} = a_1, X_{2+k} = a_2, \ldots, X_{n+k} = a_n) \quad \forall n, k$$

If {X_i} is stationary, then $\bar{H}(X)$ exists.

Proof:

93

BLOCK CODING IS MORE EFFICIENT THAN SINGLE SYMBOL CODING

94

EXAMPLE OF BLOCK CODE

95

MARKOV PROCESS

A discrete-valued stochastic process {X_i} is Markov iff

$$P(X_n \mid X_{n-1}, \ldots, X_1, X_0) = P(X_n \mid X_{n-1})$$

Time independent if

$$P(X_n = a \mid X_{n-1} = b) = P_{ab}$$

96

STATIONARY MARKOV PROCESS

Finite state Markov process can be described using Markov chain diagram

97

[Markov chain diagram: two states a and b with transition probabilities 0.9 and 0.1]

Any finite, aperiodic, irreducible Markov chain will converge to a stationary distribution regardless of starting distribution

STATIONARY MARKOV PROCESS

98

As n grows, $P^n$ approaches a limit in which each row is the stationary distribution.

STATIONARY MARKOV PROCESS EXAMPLE

99

$$p_S = l^T, \qquad l^T = l^T P$$
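A sketch (not from the slides) for the two-state chain on the previous slides. The exact placement of the 0.1/0.9 labels in the original diagram is an assumption here (stay with probability 0.9, switch with 0.1); the entropy rate of a stationary Markov chain is $\sum_i \pi_i H(P_{i\cdot})$.

```python
import math

# Assumed two-state chain: from either state, stay with prob 0.9, switch with prob 0.1.
P = [[0.9, 0.1],
     [0.1, 0.9]]

# Find the stationary distribution by iterating pi <- pi P (power iteration).
pi = [0.5, 0.5]
for _ in range(1000):
    pi = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]

def H(row):
    return -sum(q * math.log2(q) for q in row if q > 0)

# Entropy rate of a stationary Markov chain: sum_i pi_i * H(row i of P).
rate = sum(pi[i] * H(P[i]) for i in range(2))
print("stationary distribution:", pi)        # [0.5, 0.5] by symmetry
print("entropy rate:", rate, "bits/symbol")  # H(0.1) ~ 0.469
```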

CHESS BOARD

100

EXAMPLE OF CODING FOR A MARKOV SOURCE

[Markov chain diagram: two states a and b with transition probabilities 0.9 and 0.1]

Find $\bar{H}(X)$:
a) Assuming the Markov model above, with a block code of n = 2
b) Assuming i.i.d. and a symbol code
c) Assuming i.i.d. and a block code with n = 2
d) Assuming i.i.d. and a block code with n = 3

Which one of the above is minimum? Do you think that when n = 1000, a block code assuming i.i.d. would be lower or higher than the answer in a)? Provide a reason for your answer.

101

ALOHA WIRELESS EXAMPLE

M users share a wireless transmission channel. For each user, independently in each time slot:

If the queue is non-empty, transmit with probability q
A new packet arrives for transmission with probability p

If two packets collide, they stay in the queue. At time t, the queue sizes are X_t = (n_1, …, n_M).

{X_t} is Markov since P(X_t|X_{t-1}) depends only on X_{t-1}

Transmit vector Y_t

{Y_t} is not Markov since P(Y_t) depends on X_{t-1}, not Y_{t-1}.

{Y_t} is called a Hidden Markov Process

102

ALOHA WIRELESS EXAMPLE

103

If {X_i} is a stationary Markov process and Y = f(X), then {Y_i} is a stationary Hidden Markov Process.

What is the entropy rate $\bar{H}(Y)$?

HIDDEN MARKOV PROCESS

104

HIDDEN MARKOV PROCESS

105

HIDDEN MARKOV PROCESS

106

SUMMARY Entropy rate

Stationary

Stationary Markov

Hidden Markov

107

LECTURE 5

Stream codes
Arithmetic code
Lempel-Ziv code

108

WHY NOT USE HUFFMAN CODE? (SINCE IT IS OPTIMAL)

Good:
It is optimal, i.e., the shortest possible code

Bad:
Bad for skewed probabilities. Employing block coding helps, i.e., use all N symbols in a block; the redundancy is now 1/N. However:
The entire table of block symbols must be re-computed if the probability of a single symbol changes.
For N symbols, a table of size |X|^N must be pre-calculated to build the tree.
Symbol decoding must wait until an entire block is received.

109

ARITHMETIC CODE

Developed by IBM (too bad, they didn't patent it). The majority of current compression schemes use it.

Good:
No need for pre-calculated tables of big size
Easy to adapt to changes in probability
A single symbol can be decoded immediately, without waiting for the entire block of symbols to be received.

110

BASIC IDEA IN ARITHMETIC CODING

Sort the symbols in lexical order.

Represent each string x of length n by a unique interval $[F(x_{i-1}), F(x_i))$ in [0,1), where $F(x_i)$ is the cumulative distribution.

$F(x_i) - F(x_{i-1})$ represents the probability of $x_i$ occurring.

The interval $[F(x_{i-1}), F(x_i))$ can itself be represented by any number, called a tag, within the half-open interval.

The significant bits of the tag $0.t_1t_2t_3\ldots$ form the code of x. That is, $0.t_1t_2t_3\ldots t_k000\ldots$ is in the interval $[F(x_{i-1}), F(x_i))$, with

$$l(x) = \left\lceil \log \frac{1}{P(x)} \right\rceil + 1$$

111

TAG IN ARITHMETIC CODE

Choose the tag

$$T_X(x_i) = \sum_{k=1}^{i-1} P(X = x_k) + \tfrac{1}{2} P(X = x_i) = F_X(x_{i-1}) + \frac{P(X = x_i)}{2} = \frac{F_X(x_{i-1}) + F_X(x_i)}{2}$$

112
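A sketch (not from the slides) computing the midpoint tag and its truncated binary expansion for the single-symbol example on the following slides (P = [0.5, 0.25, 0.125, 0.125]).

```python
import math

def tag_code(probs):
    """Midpoint tag T(x) = F(x_{i-1}) + P(x_i)/2, truncated to ceil(log2 1/P) + 1 bits."""
    F = 0.0
    for i, p in enumerate(probs, start=1):
        T = F + p / 2
        l = math.ceil(math.log2(1 / p)) + 1
        bits, frac = "", T
        for _ in range(l):              # binary expansion of the tag, l bits
            frac *= 2
            bits += "1" if frac >= 1 else "0"
            frac -= int(frac)
        print(f"symbol {i}: F(x_i)={F + p}, T={T}, l={l}, code={bits}")
        F += p

tag_code([0.5, 0.25, 0.125, 0.125])
# symbol 1: T=0.25   -> 01      symbol 2: T=0.625  -> 101
# symbol 3: T=0.8125 -> 1101    symbol 4: T=0.9375 -> 1111
```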

ARITHMETIC CODING EXAMPLE

113


ARITHMETIC CODING EXAMPLE

114

EXAMPLE OF ARITHMETIC CODING

115

P(X):  1 → 0.5,  2 → 0.25,  3 → 0.125,  4 → 0.125

Symbol   F(x)    T_X(x)   T_X(x) in binary   ⌈log 1/P(x)⌉+1   Code
1        .5      .25      .010               2                01
2        .75     .625     .101               3                101
3        .875    .8125    .1101              4                1101
4        1.0     .9375    .1111              4                1111

ARITHMETIC CODE FOR TWO-SYMBOL SEQUENCES

Message   P(x)       T_X(x)      T_X(x) in binary   ⌈log 1/P(x)⌉+1   Code
11        .25        .125        .001               3                001
12        .125       .3125       .0101              4                0101
13        .0625      .40625      .01101             5                01101
14        .0625      .46875      .01111             5                01111
21        .125       .5625       .1001              4                1001
22        .0625      .65625      .10101             5                10101
23        .03125     .703125     .101101            6                101101
24        .03125     .734375     .101111            6                101111
31        .0625      .78125      .11001             5                11001
32        .03125     .828125     .110101            6                110101
33        .015625    .8515625    .1101101           7                1101101
34        .015625    .8671875    .1101111           7                1101111
41        .0625      .90625      .11101             5                11101
42        .03125     .953125     .111101            6                111101
43        .015625    .9765625    .1111101           7                1111101
44        .015625    .9921875    .1111111           7                1111111

PROGRESSIVE DECODING

117

[Figure: progressive decoding on the cumulative distribution F(x)]

UNIQUENESS OF THE ARITHMETIC CODE

Idea for proving that arithmetic coding is a uniquely decodable code:
Show that the truncation of the tag lies entirely within $[F(x_{i-1}), F(x_i))$, hence the code is non-singular.
Show that the truncation of the tag results in a prefix code, hence it is uniquely decodable.

Proof:

118

UNIQUENESS OF THE ARITHMETIC CODE

119

UNIQUENESS OF THE ARITHMETIC CODE

120

EFFICIENCY OF THE ARITHMETIC CODE

Off by at most 2/N:

$$\frac{H(X_1, X_2, \ldots, X_N)}{N} \le l_A < \frac{H(X_1, X_2, \ldots, X_N)}{N} + \frac{2}{N}$$

Proof:

121

EFFICIENCY OF THE ARITHMETIC CODE

122

ADAPTIVE ARITHMETIC CODE

123

DICTIONARY CODING

index   pattern
1       a
2       b
3       ab
…
n       abc

The encoder codes the index.

[Diagram: Encoder → indices → Decoder]

Both the encoder and the decoder are assumed to have the same dictionary (table).

ZIV-LEMPEL CODING (ZL OR LZ)

Named after J. Ziv and A. Lempel (1977).

Adaptive dictionary technique:
Store previously coded symbols in a buffer.
Search for the current sequence of symbols to code.
If found, transmit the buffer offset and the length.

LZ77

[Figure: search buffer and look-ahead buffer sliding over the sequence a b c a b d a c a b d e e e f; output triplets <offset, length, next> transmitted to the decoder]

If the size of the search buffer is N and the size of the alphabet is M, we need … bits to code a triplet.

Variation: use a VLC to code the triplets! (PKZip, Zip, Lharc, PNG, gzip, ARJ)
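A minimal sketch (not the slides' implementation) of LZ77 triplet encoding with a bounded search buffer; the window sizes are illustrative, and the exact triplets depend on window-size and tie-breaking conventions, so they need not match the figure.

```python
def lz77_encode(data, search_size=8, lookahead_size=4):
    """Emit (offset, length, next_symbol) triplets; offset 0 means 'no match found'."""
    i, triplets = 0, []
    while i < len(data):
        start = max(0, i - search_size)
        best_len, best_off = 0, 0
        # Longest match of the look-ahead buffer starting inside the search buffer.
        for j in range(start, i):
            length = 0
            while (length < lookahead_size and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        nxt = data[i + best_len]                  # symbol following the match
        triplets.append((best_off, best_len, nxt))
        i += best_len + 1
    return triplets

print(lz77_encode("abcabdacabdeeef"))
```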

DRAWBACKS OF LZ77

Repetitive patterns with a period longer than the search buffer size are not found.

If the search buffer size is 4, the sequence
a b c d e a b c d e a b c d e a b c d e …
will be expanded, not compressed.

SUMMARY

Stream codes: Arithmetic, Lempel-Ziv

The encoder and decoder operate sequentially; no blocking of input symbols is required.
Not forced to send ≥ 1 bit per input symbol; can achieve the entropy rate even when H(X) < 1.
Require a perfect channel: a single transmission error causes multiple wrong output symbols.
Use finite-length blocks to limit the damage. 128

LECTURE 6

Markov chains

Data Processing Theorem
You can’t create information from nothing (no free lunch)

Fano’s inequality
Lower bound for the error in estimating X from Y

129

MARKOV CHAIN

Given 3 random variables X, Y, Z, they form a Markov chain, denoted X→Y→Z, if

$$p(x,y,z) = p(x)\, p(y|x)\, p(z|y) \iff p(z|x,y) = p(z|y)$$

A Markov chain X→Y→Z means that:
The only way that X affects Z is through the value of Y
I(X;Z|Y) = 0 ⇔ H(Z|Y) = H(Z|X,Y)
If you already know Y, then observing X gives you no additional information about Z
If you know Y, then observing Z gives you no additional information about X

A common special case of a Markov chain is when Z = f(Y).

130

MARKOV CHAIN SYMMETRY

X→Y→Z ↔ X and Z are conditionally independent given Y:

$$p(x, z \mid y) = p(x|y)\, p(z|y)$$

X→Y→Z ↔ Z→Y→X

131

DATA PROCESSING THEOREM

If X→Y→Z then I(X;Y) ≥ I(X;Z): processing Y cannot add new information about X.

If X→Y→Z then I(X;Y) ≥ I(X;Y|Z): knowing Z can only decrease the amount X tells you about Y.

Proof:

132

NON-MARKOV: CONDITIONING CAN INCREASE MUTUAL INFORMATION

Noisy channel: Z = X + Y

133

LONG MARKOV CHAIN

For X_1 → X_2 → X_3 → … → X_n, the mutual information increases as you get closer and closer to X_1:

$$I(X_1; X_2) \ge I(X_1; X_3) \ge I(X_1; X_4) \ge \ldots \ge I(X_1; X_n)$$

134

SUFFICIENT STATISTICS

If the pdf of X depends on a parameter θ and you extract a statistic T(X) from your observation, then

$$\theta \to X \to T(X) \;\Rightarrow\; I(\theta; T(X)) \le I(\theta; X)$$

T(X) is sufficient for θ if the stronger condition holds:

$$\theta \to T(X) \to X \;\Leftrightarrow\; I(\theta; T(X)) = I(\theta; X) \;\Leftrightarrow\; p(x \mid T(X), \theta) = p(x \mid T(X))$$

135

EXAMPLES SUFFICIENT STATISTICS

136

FANO’S INEQUALITY

If we estimate X from Y, what is $P(\hat{X} \neq X)$?

$$P_e \equiv P(\hat{X} \neq X) \ge \frac{H(X|Y) - H(P_e)}{\log(N-1)} \ge \frac{H(X|Y) - 1}{\log(N-1)}$$

where N is the size of the outcome set of X.

Proof:

137
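A numerical sketch (not from the slides) of the weaker form of the bound; the values of N and H(X|Y) are assumed for illustration.

```python
import math

def fano_lower_bound(H_X_given_Y, N):
    """Weaker form of Fano's inequality: P_e >= (H(X|Y) - 1) / log2(N - 1)."""
    return max(0.0, (H_X_given_Y - 1) / math.log2(N - 1))

# Assumed example: X takes N = 8 values and the observation Y leaves H(X|Y) = 2.5 bits
# of uncertainty; then any estimator of X from Y errs at least ~53% of the time.
print(fano_lower_bound(2.5, 8))
```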

FANO’S INEQUALITY EXAMPLE

138

SUMMARY

Markov

Data Processing Theorem

Fano’s Inequality

139

LECTURE 7

Weak Law of Large Numbers
The Typical Set
Asymptotic Equipartition Principle

140

STRONG AND WEAK TYPICALITY

Strongly typical sequence: ABAACABCAABC (correct proportions of A, B, C)

H(X) = -(0.5 log(0.5) + 0.5 log(0.25)) = 1.5 bits

Weakly typical sequence: BBBBBBBBBBAA (incorrect proportions)

141

P(X):  A = 0.5,  B = 0.25,  C = 0.25

CONVERGENCE OF RANDOM NUMBERS

Almost sure convergence:

$$\Pr\left(\lim_{n \to \infty} X_n = X\right) = 1$$

Example:

142

CONVERGENCE OF RANDOM NUMBERS

Convergence in probability:

$$\lim_{n \to \infty} \Pr\big(|X_n - X| \ge \varepsilon\big) = 0, \quad \forall \varepsilon > 0$$

Example:

143

WEAK LAW OF LARGE NUMBERS

Given i.i.d. $\{X_i\}$, let $s_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then

$$E[s_n] = E[X] = \mu, \qquad \mathrm{Var}(s_n) = \frac{1}{n}\mathrm{Var}(X) = \frac{\sigma^2}{n}$$

As n increases, the variance of $s_n$ decreases, i.e., values of $s_n$ become clustered around the mean.

WLLN:

$$s_n \xrightarrow{\text{prob}} \mu, \qquad P(|s_n - \mu| \ge \varepsilon) \to 0 \ \text{ as } n \to \infty$$

144

PROOF OF WEAK LAW OF LARGE NUMBERS

Chebyshev’s Inequality

WLLN

145

TYPICAL SET

$x^n$ is the i.i.d. sequence $\{X_i\}$ for i = 1 to n.

$$T_n^{\varepsilon} = \left\{ x^n \in \mathcal{X}^n : \left| -\tfrac{1}{n}\log p(x^n) - H(X) \right| < \varepsilon \right\}$$

146

EXAMPLE OF TYPICAL SET

Bernoulli with p

147
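A brute-force sketch (not from the slides) enumerating the typical set for i.i.d. Bernoulli(p) sequences of small length n; the parameter values are illustrative and the enumeration is only feasible for small n.

```python
import math
from itertools import product

def typical_set(p, n, eps):
    """All x^n with | -(1/n) log2 p(x^n) - H(X) | < eps, for an i.i.d. Bernoulli(p) source."""
    H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    members, prob = 0, 0.0
    for x in product([0, 1], repeat=n):
        k = sum(x)                              # number of ones
        px = (p ** k) * ((1 - p) ** (n - k))    # probability of the sequence
        if abs(-math.log2(px) / n - H) < eps:
            members += 1
            prob += px
    return members, prob, H

size, prob, H = typical_set(p=0.3, n=12, eps=0.1)
print(f"H = {H:.3f}, |T| = {size} of {2**12}, P(T) = {prob:.3f}")
# The typical set is a small fraction of all sequences but carries most of the probability
# as n grows (here n is small, so P(T) is still moderate).
```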

TYPICAL SETS

148

TYPICAL SET PROPERTIES

149

ASYMPTOTIC EQUIPARTITION PRINCIPLE

For any ε, and any n > N_ε, almost any event is almost equally surprising.

150

SOURCE CODING AND DATA COMPRESSION

151

152

What $N_\varepsilon$ do we need to ensure that $P(x^n \in T_n^{\varepsilon}) > 1 - \varepsilon$?

SMALLEST HIGH PROBABILITY SET

153

SUMMARY

154

LECTURE 8

Source and Channel Coding

Discrete Memoryless Channels
Binary Symmetric Channel
Binary Erasure Channel
Asymmetric Channel

Channel Capacity

155

SOURCE AND CHANNEL CODING

Source Coding

Channel Coding

156

DISCRETE MEMORYLESS CHANNEL

Discrete input, discrete output Channel matrix Q

Memoryless

157

BINARY CHANNELS

Binary symmetric channel

Binary erasure channel

Z channel

158

WEAKLY SYMMETRIC CHANNELS

1. All columns of Q have the same sum

2. Rows of Q are permutation of each other

159

SYMMETRIC CHANNEL

1. Rows of Q are permutations of each other

2. Columns of Q are permutations of each other

160

CHANNEL CAPACITY

Capacity of a discrete memoryless channel:

$$C = \max_{p(x)} I(X;Y)$$

Capacity of n uses of the channel:

$$C_n = \frac{1}{n} \max_{p(x^n)} I(X_1, X_2, \ldots, X_n;\ Y_1, Y_2, \ldots, Y_n)$$

161
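A sketch (not from the slides) that evaluates I(X;Y) over input distributions for a binary symmetric channel and confirms the maximum is 1 − H(f) at the uniform input; the crossover probability is illustrative.

```python
import math

def H2(q):
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def mutual_information(px0, f):
    """I(X;Y) for a BSC with crossover probability f and input P(X=0) = px0."""
    py0 = px0 * (1 - f) + (1 - px0) * f       # output distribution
    return H2(py0) - H2(f)                    # I = H(Y) - H(Y|X)

f = 0.1                                       # illustrative crossover probability
best = max((mutual_information(a / 1000, f), a / 1000) for a in range(1001))
print("capacity ~", best[0], "achieved at P(X=0) ~", best[1])
print("closed form 1 - H(f) =", 1 - H2(f))    # ~0.531 bits per channel use
```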

MUTUAL INFORMATION

162

MUTUAL INFORMATION IS CONCAVE IN p(x)

Mutual information is concave in p(x) for a fixed p(y|x).

Proof:

163

MUTUAL INFORMATION IS CONVEX IN p(y|x)

Mutual information is convex in p(y|x) for a fixed p(x).

Proof:

164

N USE CHANNEL CAPACITY

165

CAPACITY OF WEAKLY SYMMETRIC CHANNEL

166

BINARY ERASURE CHANNEL

167

ASYMMETRIC CHANNEL CAPACITY

168

SUMMARY

169

LECTURE 9

Jointly Typical Sets
Joint AEP
Channel Coding Theorem

170

SIGNIFICANCE OF MUTUAL INFORMATION

171

JOINTLY TYPICAL SETS

172

$$J_n^{\varepsilon} = \Big\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n :\ \left| -\tfrac{1}{n}\log p(x^n) - H(X) \right| < \varepsilon,\ \left| -\tfrac{1}{n}\log p(y^n) - H(Y) \right| < \varepsilon,\ \left| -\tfrac{1}{n}\log p(x^n, y^n) - H(X,Y) \right| < \varepsilon \Big\}$$

JOINTLY TYPICAL EXAMPLE

173

p(x,y)   Y = 0   Y = 1
X = 0     0.5     0.125
X = 1     0.125   0.25

JOINTLY TYPICAL DIAGRAM

174

JOINTLY TYPICAL SET PROPERTIES

175

JOINT AEP

176

If $p(\tilde{x}^n) = p(x^n)$, $p(\tilde{y}^n) = p(y^n)$, and $\tilde{x}^n, \tilde{y}^n$ are independent, then

$$(1-\varepsilon)\, 2^{-n(I(X;Y) + 3\varepsilon)} \le P\big((\tilde{x}^n, \tilde{y}^n) \in J_n^{\varepsilon}\big) \le 2^{-n(I(X;Y) - 3\varepsilon)}$$

CHANNEL CODE

177

[Diagram: $w \in \{1, 2, \ldots, M\}$ → Encoder → $x^n$ → Noisy channel → $y^n$ → Decoder → $\hat{w} = g(y^n) \in \{1, 2, \ldots, M\}$]

ACHIEVABLE CODE RATES

178

CHANNEL CODING THEOREM

179

CHANNEL CODING THEOREM

180

SUMMARY

181

LECTURE 10 Channel Coding Theorem

182

CHANNEL CODING: MAIN IDEA

183

[Diagram: $x^n$ → Noisy channel → $y^n$]

RANDOM CODE

184

RANDOM CODING IDEA

185

[Diagram: random codewords $x_1, \ldots$; a chosen $x^n$ → Noisy channel → $y^n$]

ERROR PROBABILITY OF RANDOM CODE

186

CODE SELECTION AND EXPURGATION

187

PROCEDURE SUMMARY

188

CONVERSE OF CHANNEL CODING THEOREM

189

LECTURE 11

Minimum Bit Error Rate
Channel with Feedback
Joint Source Channel Coding

190

MINIMUM BIT ERROR RATE

191

CHANNEL WITH FEEDBACK

192

CAPACITY WITH FEEDBACK

193

EXAMPLE: BINARY ERASURE CHANNEL WITH FEEDBACK

194

JOINT SOURCE CHANNEL CODING

195

[Diagram: $v^n$ → Encoder → $x^n$ → Channel → $y^n$ → Decoder → $\hat{v}^n$]

SOURCE CHANNEL PROOF (→)

196

SOURCE CHANNEL PROOF (←)

197

SEPARATION THEOREM

198
