Chapter 8: Differential entropy (UIC ECE 534, Fall 2009, Natasha Devroye; source: devroye/courses/ECE534/lectures/ch8.pdf)


Page 1

Chapter 8: Differential entropy


Chapter 8 outline

• Motivation

• Definitions

• Relation to discrete entropy

• Joint and conditional differential entropy

• Relative entropy and mutual information

• Properties

• AEP for Continuous Random Variables

Page 2

Motivation

• Our goal is to determine the capacity of an AWGN channel

Y = hX + N, where N is Gaussian noise ~ N(0, P_N)

Wireless channel with fading [figure: plots versus time]


C = \frac{1}{2}\log\left(\frac{|h|^2 P + P_N}{P_N}\right) = \frac{1}{2}\log(1 + \mathrm{SNR}) \quad \text{(bits/channel use)}
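As a quick numerical companion to the formula above, a minimal Python sketch; the function name and the example values are illustrative, not from the slides:

import numpy as np

def awgn_capacity_bits(P, P_N, h=1.0):
    """Capacity of the real AWGN channel Y = h*X + N, N ~ N(0, P_N),
    with input power constraint E[X^2] <= P, in bits per channel use."""
    snr = (abs(h) ** 2) * P / P_N
    return 0.5 * np.log2(1.0 + snr)

# Unit channel gain at 10 dB SNR gives about 1.73 bits per channel use.
print(awgn_capacity_bits(P=10.0, P_N=1.0))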

Page 3

Motivation

• need to define entropy and mutual information between CONTINUOUS random variables

• Can you guess?

• Discrete X, p(x):

• Continuous X, f(x):
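For reference, the answer the slide is leading up to (both formulas also appear in the summaries reproduced later in these slides; the base of the logarithm is left unspecified, as in the text):

H(X) = -\sum_{x} p(x) \log p(x) \qquad \text{(discrete entropy)}

h(X) = -\int_{S} f(x) \log f(x)\,dx \qquad \text{(differential entropy, with } S \text{ the support set of } f)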


Definitions - densities

Page 4

Properties - densities


Properties - densities

Page 5

Properties - densities


Quantized random variables

... an interpretation of the differential entropy: it is the logarithm of the equivalent side length of the smallest set that contains most of the probability. Hence low entropy implies that the random variable is confined to a small effective volume and high entropy indicates that the random variable is widely dispersed.

Note. Just as the entropy is related to the volume of the typical set, there is a quantity called Fisher information which is related to the surface area of the typical set. We discuss Fisher information in more detail in Sections 11.10 and 17.8.

8.3 RELATION OF DIFFERENTIAL ENTROPY TO DISCRETE ENTROPY

Consider a random variable X with density f(x) illustrated in Figure 8.1. Suppose that we divide the range of X into bins of length Δ. Let us assume that the density is continuous within the bins. Then, by the mean value theorem, there exists a value x_i within each bin such that

f(x_i)\,\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx.   (8.23)

Consider the quantized random variable X^Δ, which is defined by

X^\Delta = x_i \quad \text{if } i\Delta \le X < (i+1)\Delta.   (8.24)

[Figure 8.1: Quantization of a continuous random variable; the density f(x) is partitioned into bins of width Δ.]
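The point of this construction, developed in the continuation of this section in the text, is that H(X^Δ) + log Δ → h(f) as Δ → 0; equivalently, an n-bit quantization of X has discrete entropy of roughly h(X) + n (this is (8.84) in the chapter summary). A small Python sketch illustrating the convergence for a standard Gaussian; the helper names and the truncation of the tails at ±10 are choices made only for this illustration:

import math

def entropy_bits(probs):
    """Discrete entropy -sum p*log2(p), skipping zero-probability bins."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def quantized_gaussian_entropy_bits(delta, x_max=10.0):
    """H(X^Delta) for a standard Gaussian quantized into bins of width delta on [-x_max, x_max]."""
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))    # Gaussian CDF
    n_bins = int(round(2 * x_max / delta))
    edges = [-x_max + i * delta for i in range(n_bins + 1)]
    return entropy_bits([Phi(edges[i + 1]) - Phi(edges[i]) for i in range(n_bins)])

h = 0.5 * math.log2(2 * math.pi * math.e)          # h(N(0,1)), about 2.047 bits
for delta in [0.5, 0.1, 0.01]:
    print(delta, quantized_gaussian_entropy_bits(delta) + math.log2(delta), h)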

Page 6

Quantized random variables


Page 7

Differential entropy - definition


Examples

[Figure: a density f(x) plotted against x, constant on the interval [a, b] (the uniform example).]
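The figure points to the standard first example; as a reminder of the calculation (it follows directly from the definition h(X) = -∫ f log f):

For X uniform on [a, b], f(x) = 1/(b - a) on [a, b] and 0 elsewhere, so

h(X) = -\int_a^b \frac{1}{b-a}\,\log\frac{1}{b-a}\,dx = \log(b - a).

In particular h(X) < 0 whenever b - a < 1: unlike discrete entropy, differential entropy can be negative.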

Page 8

Examples
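The other standard example, stated here for reference; it matches (8.85) in the chapter summary:

For X ~ N(0, σ²),

h(X) = -\int f \ln f = \frac{1}{2}\ln 2\pi e \sigma^2 \ \text{nats} = \frac{1}{2}\log_2 2\pi e \sigma^2 \ \text{bits}.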


Differential entropy - the good the bad and the ugly

Page 9

Differential entropy - the good the bad and the ugly


Differential entropy - multiple RVs
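For reference, the joint definition and the two facts recorded in the chapter summary as (8.88) and (8.89):

h(X_1, X_2, \ldots, X_n) = -\int f(x_1, \ldots, x_n) \log f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n

h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i \mid X_1, \ldots, X_{i-1}) \qquad \text{(chain rule)}

h(X \mid Y) \le h(X) \qquad \text{(conditioning reduces differential entropy)}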

Page 10

Differential entropy of a multi-variate Gaussian
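The result derived on this slide is h(N_n(μ, K)) = ½ log((2πe)^n |K|), equation (8.86) in the chapter summary. A minimal Python sketch; the example covariance matrix is chosen only for illustration:

import numpy as np

def mvn_differential_entropy_bits(K):
    """h(N_n(mu, K)) = 0.5 * log2((2*pi*e)^n * |K|) in bits; the mean does not matter."""
    n = K.shape[0]
    sign, logdet = np.linalg.slogdet(K)        # numerically stable log-determinant (natural log)
    assert sign > 0, "K must be positive definite"
    return 0.5 * (n * np.log2(2 * np.pi * np.e) + logdet / np.log(2))

K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(mvn_differential_entropy_bits(K))        # about 4.50 bits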


Parallels with discrete entropy....

Proof: We have

2^{-H(p)-D(p||r)} = 2^{\sum p(x)\log p(x) + \sum p(x)\log\frac{r(x)}{p(x)}}   (2.151)
= 2^{\sum p(x)\log r(x)}   (2.152)
\le \sum p(x)\, 2^{\log r(x)}   (2.153)
= \sum p(x)\, r(x)   (2.154)
= \Pr(X = X'),   (2.155)

where the inequality follows from Jensen's inequality and the convexity of the function f(y) = 2^y. □

The following telegraphic summary omits qualifying conditions.

SUMMARY

Definition. The entropy H(X) of a discrete random variable X is defined by

H(X) = -\sum_{x \in X} p(x) \log p(x).   (2.156)

Properties of H

1. H(X) ≥ 0.
2. H_b(X) = (log_b a) H_a(X).
3. (Conditioning reduces entropy) For any two random variables X and Y, we have H(X|Y) ≤ H(X)   (2.157), with equality if and only if X and Y are independent.
4. H(X_1, X_2, ..., X_n) ≤ \sum_{i=1}^n H(X_i), with equality if and only if the X_i are independent.
5. H(X) ≤ log |X|, with equality if and only if X is distributed uniformly over X.
6. H(p) is concave in p.

Page 11

Parallels with discrete entropy....

Definition. The relative entropy D(p||q) of the probability mass function p with respect to the probability mass function q is defined by

D(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)}.   (2.158)

Definition. The mutual information between two random variables X and Y is defined as

I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}.   (2.159)

Alternative expressions

H(X) = E_p \log \frac{1}{p(X)},   (2.160)
H(X,Y) = E_p \log \frac{1}{p(X,Y)},   (2.161)
H(X|Y) = E_p \log \frac{1}{p(X|Y)},   (2.162)
I(X;Y) = E_p \log \frac{p(X,Y)}{p(X)p(Y)},   (2.163)
D(p||q) = E_p \log \frac{p(X)}{q(X)}.   (2.164)

Properties of D and I

1. I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y).
2. D(p||q) ≥ 0, with equality if and only if p(x) = q(x) for all x ∈ X.
3. I(X;Y) = D(p(x,y)||p(x)p(y)) ≥ 0, with equality if and only if p(x,y) = p(x)p(y) (i.e., X and Y are independent).
4. If |X| = m, and u is the uniform distribution over X, then D(p||u) = log m - H(p).
5. D(p||q) is convex in the pair (p, q).

Chain rules

Entropy: H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i | X_{i-1}, \ldots, X_1).
Mutual information: I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^n I(X_i; Y | X_1, X_2, \ldots, X_{i-1}).



Page 12

Parallels with discrete entropy...

Relative entropy: D(p(x,y)||q(x,y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x)).

Jensen's inequality. If f is a convex function, then E f(X) ≥ f(EX).

Log sum inequality. For n positive numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,

\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}   (2.165)

with equality if and only if a_i / b_i = constant.

Data-processing inequality. If X → Y → Z forms a Markov chain, then I(X;Y) ≥ I(X;Z).

Sufficient statistic. T(X) is sufficient relative to {f_θ(x)} if and only if I(θ; X) = I(θ; T(X)) for all distributions on θ.

Fano's inequality. Let P_e = Pr{X̂(Y) ≠ X}. Then

H(P_e) + P_e \log|X| \ge H(X|Y).   (2.166)

Inequality. If X and X' are independent and identically distributed, then

\Pr(X = X') \ge 2^{-H(X)}.   (2.167)

PROBLEMS

2.1 Coin flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips required.
(a) Find the entropy H(X) in bits. The following expressions may be useful:

\sum_{n=0}^{\infty} r^n = \frac{1}{1-r}, \qquad \sum_{n=0}^{\infty} n r^n = \frac{r}{(1-r)^2}.

(b) A random variable X is drawn according to this distribution. Find an "efficient" sequence of yes-no questions of the form, ...


Relative entropy and mutual information
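For reference, the continuous-case counterparts of (2.158) and (2.159), as they appear in the chapter summary ((8.87) and (8.91)), together with the usual identities:

D(f||g) = \int f(x) \log\frac{f(x)}{g(x)}\,dx \ge 0

I(X;Y) = \int f(x,y) \log\frac{f(x,y)}{f(x)f(y)}\,dx\,dy = h(X) - h(X|Y) = h(Y) - h(Y|X) \ge 0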

Page 13

Properties


ASIDE: A general definition of mutual information

Definition. The mutual information between two random variables X and Y is given by

I(X;Y) = \sup_{P,Q} I([X]_P; [Y]_Q),   (8.54)

where the supremum is over all finite partitions P and Q.

This is the master definition of mutual information that always applies, even to joint distributions with atoms, densities, and singular parts. Moreover, by continuing to refine the partitions P and Q, one finds a monotonically increasing sequence I([X]_P; [Y]_Q) ↑ I.

By arguments similar to (8.52), we can show that this definition of mutual information is equivalent to (8.47) for random variables that have a density. For discrete random variables, this definition is equivalent to the definition of mutual information in (2.28).

Example 8.5.1 (Mutual information between correlated Gaussian random variables with correlation ρ). Let (X, Y) ~ N(0, K), where

K = \begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix}.   (8.55)

Then h(X) = h(Y) = \frac{1}{2}\log(2\pi e)\sigma^2 and h(X,Y) = \frac{1}{2}\log(2\pi e)^2|K| = \frac{1}{2}\log(2\pi e)^2\sigma^4(1-\rho^2), and therefore

I(X;Y) = h(X) + h(Y) - h(X,Y) = -\frac{1}{2}\log(1-\rho^2).   (8.56)

If ρ = 0, X and Y are independent and the mutual information is 0. If ρ = ±1, X and Y are perfectly correlated and the mutual information is infinite.

8.6 PROPERTIES OF DIFFERENTIAL ENTROPY, RELATIVE ENTROPY, AND MUTUAL INFORMATION

Theorem 8.6.1

D(f||g) \ge 0   (8.57)

with equality iff f = g almost everywhere (a.e.).

Proof: Let S be the support set of f. Then

-D(f||g) = \int_S f \log\frac{g}{f}   (8.58)
\le \log \int_S f\,\frac{g}{f} \quad \text{(by Jensen's inequality)}   (8.59)

Page 14

A quick example

• Find the mutual information between the correlated Gaussian random variables with correlation coefficient ρ

• What is I(X;Y)?
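A minimal Python check of the answer worked out in Example 8.5.1 above (equation (8.56)); the unit variance is an illustrative choice:

import numpy as np

def gaussian_mi_bits(rho, sigma2=1.0):
    """I(X;Y) = h(X) + h(Y) - h(X,Y) for (X,Y) ~ N(0, K), K = sigma2 * [[1, rho], [rho, 1]], in bits."""
    K = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
    h_x = 0.5 * np.log2(2 * np.pi * np.e * sigma2)
    h_xy = 0.5 * np.log2((2 * np.pi * np.e) ** 2 * np.linalg.det(K))
    return 2 * h_x - h_xy

for rho in [0.0, 0.5, 0.9]:
    print(rho, gaussian_mi_bits(rho), -0.5 * np.log2(1 - rho ** 2))   # the two columns should match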


More properties of differential entropy

Page 15

More properties of differential entropy


Examples of changes in variables
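For reference, the scaling facts this slide's title refers to; the scalar rule is (8.90) in the chapter summary, and the translation and vector versions are its standard companions in the text:

h(X + c) = h(X)

h(aX) = h(X) + \log|a|

h(AX) = h(X) + \log|\det A| \qquad \text{(} X \text{ a random vector, } A \text{ nonsingular)}

For example, if X ~ N(0, σ²) then h(aX) = ½ log 2πe a²σ² = h(X) + log|a|, consistent with the scaling rule.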

Page 16

Concavity and convexity

• Same as in the discrete entropy and mutual information....


Maximum entropy distributions

• For a discrete random variable taking on K values, what distribution maximizes the entropy?

• Can you think of a continuous counterpart? (See the note below.)

[Look ahead to Ch.12, pg. 409-412]
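As a hedged note on the answers (the discrete fact is property 5 of the Chapter 2 summary reproduced earlier; the continuous fact is (8.92) in this chapter's summary):

Discrete: H(X) ≤ log K, with equality if and only if X is uniform over its K values.

Continuous: among all densities with a fixed covariance, \max_{E[XX^t] = K} h(X) = \frac{1}{2}\log(2\pi e)^n |K|, and the maximum is achieved by the Gaussian N(0, K).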

Page 17

Maximum entropy distributions

[Look ahead to Ch.12, pg. 409-412]


Maximum entropy examples
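A quick numerical illustration in Python of the kind of comparison these examples make, using closed-form differential entropies; the particular competitors (uniform and Laplace) and the unit variance are assumptions made for illustration. Among densities with the same variance, the Gaussian has the largest differential entropy:

import math

sigma2 = 1.0   # common variance for all three densities

h_gauss   = 0.5 * math.log2(2 * math.pi * math.e * sigma2)     # N(0, sigma2)
h_laplace = math.log2(2 * math.e * math.sqrt(sigma2 / 2.0))    # Laplace with scale b = sqrt(sigma2/2)
h_uniform = math.log2(math.sqrt(12.0 * sigma2))                # uniform of width sqrt(12*sigma2)

print(h_gauss, h_laplace, h_uniform)   # about 2.05 > 1.94 > 1.79 bits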

Page 18

Maximum entropy examples

[Look ahead to Ch.12, pg. 409-412]


Estimation error and differential entropy

• A counterpart to Fano's inequality for discrete RVs...

... calculate a function g(Y) = X̂, where X̂ is an estimate of X and takes on values in X̂'s alphabet. We will not restrict that alphabet to be equal to X's, and we will also allow the function g(Y) to be random. We wish to bound the probability that X̂ ≠ X. We observe that X → Y → X̂ forms a Markov chain. Define the probability of error

P_e = \Pr\{\hat{X} \ne X\}.   (2.129)

Theorem 2.10.1 (Fano's Inequality) For any estimator X̂ such that X → Y → X̂, with P_e = Pr(X ≠ X̂), we have

H(P_e) + P_e \log|X| \ge H(X|\hat{X}) \ge H(X|Y).   (2.130)

This inequality can be weakened to

1 + P_e \log|X| \ge H(X|Y)   (2.131)

or

P_e \ge \frac{H(X|Y) - 1}{\log|X|}.   (2.132)

Remark. Note from (2.130) that P_e = 0 implies that H(X|Y) = 0, as intuition suggests.

Proof: We first ignore the role of Y and prove the first inequality in (2.130). We will then use the data-processing inequality to prove the more traditional form of Fano's inequality, given by the second inequality in (2.130). Define an error random variable,

E = \begin{cases} 1 & \text{if } \hat{X} \ne X, \\ 0 & \text{if } \hat{X} = X. \end{cases}   (2.133)

Then, using the chain rule for entropies to expand H(E, X | X̂) in two different ways, we have

H(E, X|\hat{X}) = H(X|\hat{X}) + H(E|X, \hat{X}) \quad [\text{last term} = 0]   (2.134)
= H(E|\hat{X}) + H(X|E, \hat{X}) \quad [\le H(P_e) \text{ and } \le P_e \log|X|, \text{ respectively}]   (2.135)

Since conditioning reduces entropy, H(E|X̂) ≤ H(E) = H(P_e). Now since E is a function of X and X̂, the conditional entropy H(E|X, X̂) is ...

Why can’t we use Fano’s?

Page 19

Estimation error and differential entropy
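The bound this slide develops appears in the chapter summary:

E(X - \hat{X}(Y))^2 \ge \frac{1}{2\pi e}\, e^{2 h(X|Y)}

so a small mean-squared estimation error forces a small conditional differential entropy. A standard sanity check (an illustration, not from the slides): if X ~ N(0, σ²) and Y is independent of X, then h(X|Y) = h(X) = ½ ln 2πeσ² (nats), and the bound reads E(X - X̂)² ≥ σ², which the constant estimator X̂ = E[X] = 0 achieves.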


The AEP for continuous RVs

• The AEP for discrete RVs said.....

... probability distribution. Here it turns out that p(X_1, X_2, ..., X_n) is close to 2^{-nH} with high probability.

We summarize this by saying, "Almost all events are almost equally surprising." This is a way of saying that

\Pr\{(X_1, X_2, \ldots, X_n) : p(X_1, X_2, \ldots, X_n) = 2^{-n(H \pm \epsilon)}\} \approx 1   (3.1)

if X_1, X_2, ..., X_n are i.i.d. ~ p(x).

In the example just given, where p(X_1, X_2, \ldots, X_n) = p^{\sum X_i} q^{n - \sum X_i}, we are simply saying that the number of 1's in the sequence is close to np (with high probability), and all such sequences have (roughly) the same probability 2^{-nH(p)}. We use the idea of convergence in probability, defined as follows:

Definition (Convergence of random variables). Given a sequence of random variables X_1, X_2, ..., we say that the sequence X_1, X_2, ... converges to a random variable X:

1. In probability if for every ε > 0, Pr{|X_n - X| > ε} → 0
2. In mean square if E(X_n - X)² → 0
3. With probability 1 (also called almost surely) if Pr{lim_{n→∞} X_n = X} = 1

3.1 ASYMPTOTIC EQUIPARTITION PROPERTY THEOREM

The asymptotic equipartition property is formalized in the following theorem.

Theorem 3.1.1 (AEP) If X_1, X_2, ... are i.i.d. ~ p(x), then

-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) \to H(X) \quad \text{in probability}.   (3.2)

Proof: Functions of independent random variables are also independent random variables. Thus, since the X_i are i.i.d., so are log p(X_i). Hence, by the weak law of large numbers,

-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) = -\frac{1}{n}\sum_i \log p(X_i)   (3.3)
\to -E\log p(X) \quad \text{in probability}   (3.4)
= H(X),   (3.5)

which proves the theorem. □

• The AEP for continuous RVs says.....
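For reference, the continuous statement (consistent with f(X^n) ≐ 2^{-nh(X)} in the chapter summary): if X_1, X_2, ... are i.i.d. with density f(x), then

-\frac{1}{n}\log f(X_1, X_2, \ldots, X_n) \to E[-\log f(X)] = h(X) \quad \text{in probability}.

A minimal Monte Carlo sketch in Python; the Gaussian source and its parameters are chosen only for illustration:

import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
n = 100_000

# -(1/n) * log2 f(X_1, ..., X_n) for i.i.d. N(0, sigma^2) samples
x = rng.normal(0.0, sigma, size=n)
log2_f = -0.5 * np.log2(2 * np.pi * sigma ** 2) - (x ** 2) / (2 * sigma ** 2) * np.log2(np.e)
empirical = -np.mean(log2_f)

h = 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)   # differential entropy of N(0, sigma^2) in bits
print(empirical, h)                                # the two values should be close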

Page 20

Typical sets

• One of the points of the AEP is to define typical sets.

• Typical set for discrete RVs...

• Typical set of continuous RVs....
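For comparison, the two definitions the bullets point to; the continuous volume statement matches (8.83) in the chapter summary:

Discrete: A_\epsilon^{(n)} = \{(x_1,\ldots,x_n) : \left| -\tfrac{1}{n}\log p(x_1,\ldots,x_n) - H(X) \right| \le \epsilon\}, with |A_\epsilon^{(n)}| \doteq 2^{nH(X)}.

Continuous: A_\epsilon^{(n)} = \{(x_1,\ldots,x_n) \in S^n : \left| -\tfrac{1}{n}\log f(x_1,\ldots,x_n) - h(X) \right| \le \epsilon\}, with \mathrm{Vol}(A_\epsilon^{(n)}) \doteq 2^{nh(X)}.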


Typical sets and volumes

Page 21

Summary

SUMMARY

h(X) = h(f) = -\int_S f(x) \log f(x)\,dx   (8.81)

f(X^n) \doteq 2^{-n h(X)}   (8.82)

\mathrm{Vol}(A_\epsilon^{(n)}) \doteq 2^{n h(X)}.   (8.83)

H([X]_{2^{-n}}) \approx h(X) + n.   (8.84)

h(N(0, \sigma^2)) = \frac{1}{2}\log 2\pi e \sigma^2.   (8.85)

h(N_n(\mu, K)) = \frac{1}{2}\log(2\pi e)^n |K|.   (8.86)

D(f||g) = \int f \log\frac{f}{g} \ge 0.   (8.87)

h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i | X_1, X_2, \ldots, X_{i-1}).   (8.88)

h(X|Y) \le h(X).   (8.89)

h(aX) = h(X) + \log|a|.   (8.90)

I(X;Y) = \int f(x,y) \log\frac{f(x,y)}{f(x)f(y)} \ge 0.   (8.91)

\max_{E XX^t = K} h(X) = \frac{1}{2}\log(2\pi e)^n |K|.   (8.92)

E(X - \hat{X}(Y))^2 \ge \frac{1}{2\pi e}\, e^{2 h(X|Y)}.

2^{nH(X)} is the effective alphabet size for a discrete random variable. 2^{nh(X)} is the effective support set size for a continuous random variable. 2^C is the effective alphabet size of a channel of capacity C.

PROBLEMS

8.1 Differential entropy. Evaluate the differential entropy h(X) = -\int f \ln f for the following:
(a) The exponential density, f(x) = \lambda e^{-\lambda x}, x \ge 0.
