ECE 562: Information Theory Spring 2006

Lecture 4 — February 2

Lecturer: Sergio D. Servetto Scribe: Frank Ciaramello

4.1 Some Useful Information Inequalities

This section proves some useful inequalities that will be used often.

First, we will show that conditioning a random variable cannot increase its entropy. Intuitively, this makes sense: conditioning adds information about the random variable, so its uncertainty must go down (or stay the same, if the conditioning adds no information, i.e. the conditioning variable is independent of the random variable).

Theorem 4.1. “Conditioning Does Not Increase Entropy”

H(X|Y) ≤ H(X) for any random variables X, Y    (4.1)

Proof:

H(X|Y ) = H(X) − I(X; Y )

I(X; Y ) ≥ 0

∴ H(X|Y ) ≤ H(X)

Two results that can be taken from Theorem 4.1 are that equality holds only in the case of independence and that we can condition on more than one random variable:

1. H(X|Y ) = H(X) ⇐⇒ X and Y are independent

2. H(X|Y Z) ≤ H(X|Y ) ≤ H(X)
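As a quick numerical sanity check (added here, not part of the original lecture), the following Python snippet computes H(X), H(X|Y) = H(X,Y) − H(Y), and I(X;Y) for a small joint pmf and confirms Theorem 4.1. The joint distribution p_xy is an arbitrary example chosen only for illustration.

    # Numerical sanity check for Theorem 4.1: H(X|Y) <= H(X).
    # The joint pmf below is an arbitrary example, not from the lecture.
    from math import log2

    def H(probs):
        """Shannon entropy in bits of a collection of probabilities."""
        return -sum(p * log2(p) for p in probs if p > 0)

    # Joint pmf p(x, y) over X = {0, 1}, Y = {0, 1}.
    p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.15, (1, 1): 0.35}

    p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
    p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

    H_X  = H(p_x.values())
    H_Y  = H(p_y.values())
    H_XY = H(p_xy.values())
    H_X_given_Y = H_XY - H_Y          # chain rule: H(X|Y) = H(X,Y) - H(Y)
    I_XY = H_X - H_X_given_Y          # mutual information

    print(f"H(X) = {H_X:.4f}, H(X|Y) = {H_X_given_Y:.4f}, I(X;Y) = {I_XY:.4f}")
    assert H_X_given_Y <= H_X + 1e-12   # conditioning does not increase entropy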

The next inequality we will prove shows that joint entropy is upper bounded by its value in the case when the random variables are independent. This means that dependence among random variables decreases joint entropy. We can prove it using two different methods.


Theorem 4.2. “Independence Bound”

H(X1, X2, ..., Xn) ≤ ∑_{i=1}^{n} H(Xi)    (4.2)

Proof: Method 1 uses the chain rule for entropy.

H(X1, X2, ..., Xn) = ∑_{i=1}^{n} H(Xi | X1, ..., X_{i−1}) ≤ ∑_{i=1}^{n} H(Xi)

Proof: Method 2 expands the entropies and relates them to a relative entropy, or divergence.

∑_{i=1}^{n} H(Xi) − H(X1, ..., Xn) = −∑_{i=1}^{n} E[log p(Xi)] + E[log p(X1, ..., Xn)]

= −E[log p(X1) ··· p(Xn)] + E[log p(X1, ..., Xn)]

= E[ log ( p(X1, ..., Xn) / (p(X1) ··· p(Xn)) ) ]

= D( p(X1, ..., Xn) || p(X1) ··· p(Xn) ) ≥ 0
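The second proof can also be checked numerically. The sketch below (an added illustration; the joint pmf is an arbitrary example) verifies the independence bound for n = 2 and checks that the slack ∑ H(Xi) − H(X1, X2) equals the divergence D(p(X1, X2) || p(X1)p(X2)).

    # Check Theorem 4.2 for n = 2 and that the slack equals the divergence
    # D(p(X1,X2) || p(X1)p(X2)); the joint pmf is an arbitrary example.
    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    p = {(0, 0): 0.5, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.4}   # joint pmf
    p1 = {a: sum(v for (x1, _), v in p.items() if x1 == a) for a in (0, 1)}
    p2 = {b: sum(v for (_, x2), v in p.items() if x2 == b) for b in (0, 1)}

    sum_marginals = H(p1.values()) + H(p2.values())
    joint = H(p.values())
    # D(p || p1*p2) = sum over (x1,x2) of p(x1,x2) log [ p(x1,x2) / (p1(x1) p2(x2)) ]
    divergence = sum(v * log2(v / (p1[a] * p2[b])) for (a, b), v in p.items() if v > 0)

    print(f"sum H(Xi) = {sum_marginals:.4f}, H(X1,X2) = {joint:.4f}, D = {divergence:.4f}")
    assert joint <= sum_marginals + 1e-12
    assert abs((sum_marginals - joint) - divergence) < 1e-9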

4.2 Data Processing Inequality

This section provides the necessary theorems and lemmas to prove the data processing inequality.

Theorem 4.3.

I(X; Y, Z) ≥ I(X; Y)    (4.3)

Equality holds ⇐⇒ X-Y-Z forms a Markov chain.

Proof: Using the chain rule for mutual information, we show

I(X; Y, Z) = I(X; Y) + I(X; Z|Y) ≥ I(X; Y),

since I(X; Z|Y) ≥ 0.


The following theorem, Theorem 4.4, shows that the closer together two variables are in the Markov chain, the more information they share; i.e., variables that are far apart are closer to being independent.

Theorem 4.4. If X-Y-Z forms a Markov chain, then

I(X; Z) ≤ I(X; Y)    (4.4)

I(X; Z) ≤ I(Y; Z)    (4.5)

Proof: We prove this by expanding the mutual information I(X; Y, Z) in two different ways.

I(X; Y, Z) = I(X; Z) + I(X; Y |Z)

I(X; Y, Z) = I(X; Y ) + I(X; Z|Y )

By the definition of a Markov chain, X⊥Z|Y , therefore, I(X; Z|Y ) = 0 and

I(X; Y ) = I(X; Z) + I(X; Y |Z)

Mutual information is always greater than or equal to zero, therefore

I(X; Y ) ≥ I(X; Z)

Since X-Y-Z being a Markov chain is equivalent to Z-Y-X being a Markov chain, the same argument proves (4.5). □
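As an added illustration (not from the lecture), the sketch below builds a Markov chain X-Y-Z from two binary symmetric channels with arbitrarily chosen crossover probabilities and checks both inequalities of Theorem 4.4.

    # Check Theorem 4.4 on a Markov chain X-Y-Z built from two binary
    # symmetric channels; the crossover probabilities are arbitrary choices.
    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    def mutual_info(pab):
        """I(A;B) from a dict {(a, b): prob}."""
        pa, pb = {}, {}
        for (a, b), v in pab.items():
            pa[a] = pa.get(a, 0) + v
            pb[b] = pb.get(b, 0) + v
        return H(pa.values()) + H(pb.values()) - H(pab.values())

    px = {0: 0.5, 1: 0.5}
    eps1, eps2 = 0.1, 0.2                    # BSC crossover probabilities
    py_given_x = lambda y, x: 1 - eps1 if y == x else eps1
    pz_given_y = lambda z, y: 1 - eps2 if z == y else eps2

    # Joint p(x, y, z) = p(x) p(y|x) p(z|y) -- the Markov factorization.
    pxyz = {(x, y, z): px[x] * py_given_x(y, x) * pz_given_y(z, y)
            for x in (0, 1) for y in (0, 1) for z in (0, 1)}

    pxy, pyz, pxz = {}, {}, {}
    for (x, y, z), v in pxyz.items():
        pxy[(x, y)] = pxy.get((x, y), 0) + v
        pyz[(y, z)] = pyz.get((y, z), 0) + v
        pxz[(x, z)] = pxz.get((x, z), 0) + v

    I_xy, I_yz, I_xz = mutual_info(pxy), mutual_info(pyz), mutual_info(pxz)
    print(f"I(X;Y)={I_xy:.4f}  I(Y;Z)={I_yz:.4f}  I(X;Z)={I_xz:.4f}")
    assert I_xz <= I_xy + 1e-12 and I_xz <= I_yz + 1e-12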

Theorem 4.5. “Data Processing Inequality”

If U-X-Y-V is a Markov Chain, then

I(U ; V ) ≤ I(X; Y ) (4.6)

Proof: Since U-X-Y-V is a Markov chain, both U-X-Y and U-Y-V are Markov chains. The proof then follows directly from Theorem 4.4:

I(U ; Y ) ≤ I(X; Y )

I(U ; V ) ≤ I(U ; Y )

∴ I(U ; V ) ≤ I(X; Y )

The data processing inequality tells us that if we want to infer X from Y, the best we can do is use the unprocessed Y. Processing Y (either deterministically or probabilistically) can only increase the uncertainty about X given the processed version of Y.
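This interpretation can be checked numerically. In the sketch below (added here as an illustration; the joint pmf and the map g are arbitrary), Y is "processed" by a deterministic function g, so X-Y-g(Y) is a Markov chain, and I(X; g(Y)) never exceeds I(X; Y).

    # Illustration of the data processing inequality: for any deterministic
    # g, X - Y - g(Y) is a Markov chain, so I(X; g(Y)) <= I(X; Y).
    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    def mutual_info(pab):
        pa, pb = {}, {}
        for (a, b), v in pab.items():
            pa[a] = pa.get(a, 0) + v
            pb[b] = pb.get(b, 0) + v
        return H(pa.values()) + H(pb.values()) - H(pab.values())

    # Arbitrary joint pmf over X in {0,1} and Y in {0,1,2,3}.
    p_xy = {(0, 0): 0.20, (0, 1): 0.15, (0, 2): 0.10, (0, 3): 0.05,
            (1, 0): 0.05, (1, 1): 0.10, (1, 2): 0.15, (1, 3): 0.20}

    g = lambda y: y // 2          # lossy processing: merges {0,1} and {2,3}

    p_x_gy = {}
    for (x, y), v in p_xy.items():
        key = (x, g(y))
        p_x_gy[key] = p_x_gy.get(key, 0) + v

    I_full, I_processed = mutual_info(p_xy), mutual_info(p_x_gy)
    print(f"I(X;Y) = {I_full:.4f},  I(X;g(Y)) = {I_processed:.4f}")
    assert I_processed <= I_full + 1e-12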


4.3 Fano’s Inequality

The following are lemmas and definitions required for Fano’s Inequality.

Lemma 4.6 shows that the entropy of a random variable is always less than or equal to the log of the size of its alphabet.

Lemma 4.6.

H(X) ≤ log |X|    (4.7)

Equality holds ⇐⇒ P(X = x) = 1/|X| for all x ∈ X.

Proof: We prove this by expanding the terms into their summations and relating them to a relative entropy measure.

log |X| − H(X) = −∑_{x∈X} p(x) log |X|^{−1} + ∑_{x∈X} p(x) log p(x)

= −∑_{x∈X} p(x) log u(x) + ∑_{x∈X} p(x) log p(x),   where u(x) = 1/|X|

= ∑_{x∈X} p(x) log ( p(x) / u(x) )

= D(p(x) || u(x)) ≥ 0
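A quick numerical check of Lemma 4.6 (added here; the pmf p is an arbitrary example): the gap log |X| − H(X) is computed directly and also as D(p || u) with u uniform, and the two agree.

    # Check Lemma 4.6: H(X) <= log|X|, with log|X| - H(X) = D(p || u),
    # where u is the uniform pmf.  The pmf p is an arbitrary example.
    from math import log2

    p = [0.5, 0.25, 0.15, 0.1]                  # pmf on an alphabet of size 4
    u = [1 / len(p)] * len(p)                   # uniform distribution

    H_p = -sum(pi * log2(pi) for pi in p if pi > 0)
    gap = log2(len(p)) - H_p
    D_pu = sum(pi * log2(pi / ui) for pi, ui in zip(p, u) if pi > 0)

    print(f"H(X) = {H_p:.4f} <= log|X| = {log2(len(p)):.4f}; gap = {gap:.4f}, D(p||u) = {D_pu:.4f}")
    assert H_p <= log2(len(p)) + 1e-12
    assert abs(gap - D_pu) < 1e-9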

One consequence is that equality holds if and only if p(x) = u(x), i.e. p(x) is a uniform distribution. This says that entropy is maximized when all outcomes are equally likely. This makes sense intuitively, since entropy measures uncertainty in the random variable X.

Another consequence is the following corollary:

Corollary 4.7. H(X) can be any non-negative real number.

Proof: The proof follows from the intermediate value theorem. We know that H(X) = 0 for a deterministic random variable and H(X) = log |X| for a uniform distribution, so for any value 0 < a < log |X| there exists an X such that H(X) = a. For |X| sufficiently large, H(X) can therefore take any non-negative value. □
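The intermediate value argument can be made concrete. The sketch below (an added illustration; the alphabet size and target are arbitrary) bisects over the family p_t = (1 − t)·δ + t·u, whose entropy increases continuously from 0 at t = 0 to log |X| at t = 1, to find a distribution with entropy (numerically) equal to any target a in (0, log |X|).

    # Illustration of Corollary 4.7: for any target 0 < a < log|X|, bisection
    # over the family p_t = (1 - t) * delta + t * uniform finds a distribution
    # whose entropy is (numerically) a.  Alphabet size and target are arbitrary.
    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    def p_t(t, k):
        """Mixture of a point mass on symbol 0 and the uniform pmf on k symbols."""
        return [(1 - t) + t / k] + [t / k] * (k - 1)

    k = 8                        # alphabet size, so 0 <= H(X) <= log2(8) = 3 bits
    target = 2.2                 # any value in (0, 3)

    lo, hi = 0.0, 1.0
    for _ in range(60):          # bisection: H(p_t) increases continuously with t
        mid = (lo + hi) / 2
        if H(p_t(mid, k)) < target:
            lo = mid
        else:
            hi = mid

    print(f"t = {lo:.6f}, H = {H(p_t(lo, k)):.6f} (target {target})")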

Theorem 4.8. “Fano’s Inequality”


First, we define Pe, the probability of error. Let X, X̂ be two random variables on X:

Pe = P(X ≠ X̂)    (4.8)

Fano's Inequality:

H(X|X̂) ≤ hb(Pe) + Pe log(|X| − 1)    (4.9)

Proof: We prove Fano's inequality by expanding the entropy and using Theorem 4.1.

Define an indicator random variable:

Y = 0 if X = X̂, and Y = 1 if X ≠ X̂.

Note that:

P(Y = 1) = Pe
P(Y = 0) = 1 − Pe
H(Y) = hb(Pe)
H(Y | X, X̂) = 0

H(X|X̂) = I(X; Y|X̂) + H(X|X̂, Y)

= H(X|X̂) − H(X|X̂, Y) + H(X|X̂, Y)    (expanding I(X; Y|X̂) in terms of X)

= H(Y|X̂) − H(Y|X̂, X) + H(X|X̂, Y)    (equivalently, expanding I(X; Y|X̂) in terms of Y)

= H(Y|X̂) + H(X|X̂, Y)    (since H(Y|X̂, X) = 0)

≤ H(Y) + H(X|X̂, Y)    (by Theorem 4.1)

= H(Y) + ∑_{x̂∈X} [ P(X̂ = x̂, Y = 0) H(X|X̂ = x̂, Y = 0) + P(X̂ = x̂, Y = 1) H(X|X̂ = x̂, Y = 1) ]

When Y = 0, we have X = X̂. Therefore, the first term in the summation is 0:

H(X|X̂ = x̂, Y = 0) = 0 (4.10)

Lemma 4.6 says that the entropy is less than or equal to the log of the size of the alphabet of a random variable. When Y = 1 we know that X ≠ X̂, so the possible alphabet for X has size |X| − 1: the original alphabet minus the value that X̂ has taken. Therefore,

H(X|X̂ = x̂, Y = 1) ≤ log(|X | − 1) (4.11)


Using (4.10) and (4.11), we can show

H(X|X̂) ≤ hb(Pe) + log(|X| − 1) ∑_{x̂∈X} P(X̂ = x̂, Y = 1)

= hb(Pe) + Pe log(|X| − 1),

since ∑_{x̂∈X} P(X̂ = x̂, Y = 1) = P(Y = 1) = Pe. □
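As a numerical check of (4.9) (added here; the joint pmf of (X, X̂) over a 3-symbol alphabet is an arbitrary example), the sketch below computes H(X|X̂), Pe, and the right-hand side of Fano's inequality.

    # Numerical check of Fano's inequality (4.9).  The joint pmf of (X, Xhat)
    # over a 3-symbol alphabet is an arbitrary example.
    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    def hb(p):
        """Binary entropy function."""
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    alphabet = (0, 1, 2)
    # p(x, xhat): mostly correct guesses (diagonal), some errors off-diagonal.
    p = {(0, 0): 0.30, (1, 1): 0.25, (2, 2): 0.25,
         (0, 1): 0.05, (1, 2): 0.05, (2, 0): 0.05, (0, 2): 0.05}

    Pe = sum(v for (x, xh), v in p.items() if x != xh)

    # H(X | Xhat) = H(X, Xhat) - H(Xhat)
    p_xhat = {}
    for (x, xh), v in p.items():
        p_xhat[xh] = p_xhat.get(xh, 0) + v
    H_X_given_Xhat = H(p.values()) - H(p_xhat.values())

    bound = hb(Pe) + Pe * log2(len(alphabet) - 1)
    print(f"H(X|Xhat) = {H_X_given_Xhat:.4f} <= {bound:.4f} = hb(Pe) + Pe*log(|X|-1)")
    assert H_X_given_Xhat <= bound + 1e-12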

Corollary 4.9. “Weak Fano’s Inequality”

H(X|X̂) ≤ 1 + Pe log |X | (4.12)

Proof: We know that the binary entropy is upper bounded by 1 and that log is an increasing function, therefore log(|X| − 1) ≤ log |X|. The corollary then follows from (4.9). □
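A small sketch (added here; the alphabet size and grid of Pe values are arbitrary) comparing the two bounds: the weak bound (4.12) always dominates (4.9), since hb(Pe) ≤ 1 and log(|X| − 1) ≤ log |X|.

    # The weak bound (4.12) always dominates the bound (4.9), since hb(Pe) <= 1
    # and log(|X| - 1) <= log|X|.  Quick comparison over a grid of Pe values
    # for an arbitrary alphabet size.
    from math import log2

    def hb(p):
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    size = 4                                        # |X|, arbitrary choice
    for Pe in [0.0, 0.1, 0.25, 0.5, 0.75]:
        strong = hb(Pe) + Pe * log2(size - 1)
        weak = 1 + Pe * log2(size)
        print(f"Pe = {Pe:.2f}: (4.9) gives {strong:.4f}, (4.12) gives {weak:.4f}")
        assert strong <= weak + 1e-12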
