TRANSCRIPT
ECE 562: Information Theory Spring 2006
Lecture 4 — February 2
Lecturer: Sergio D. Servetto Scribe: Frank Ciaramello
4.1 Some Useful Information Inequalities
This section proves some useful inequalities that will be used often.

First, we will show that conditioning on a random variable cannot increase entropy. Intuitively, this makes sense: conditioning adds information about a particular random variable, so the uncertainty must go down (or stay the same, if the conditioning adds no information, i.e. the conditioning variable is independent of the random variable).
Theorem 4.1. “Conditioning Does Not Increase Entropy”
H(X|Y) ≤ H(X) for any random variables X, Y (4.1)
Proof:
H(X|Y ) = H(X) − I(X; Y )
I(X; Y ) ≥ 0
∴ H(X|Y ) ≤ H(X)
□
Two results that follow from Theorem 4.1 are that equality holds only in the case of independence and that we can condition on more than one random variable:
1. H(X|Y ) = H(X) ⇐⇒ X and Y are independent
2. H(X|Y, Z) ≤ H(X|Y) ≤ H(X)
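Theorem 4.1 is easy to check numerically. The sketch below is a minimal illustration (the joint pmf values are arbitrary, not from the lecture): it computes H(X|Y) via the chain rule as H(X, Y) − H(Y) and confirms it does not exceed H(X).

```python
import math

def H(pmf):
    """Shannon entropy in bits of a pmf given as a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# Arbitrary illustrative joint pmf p(x, y) on {0,1} x {0,1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y).
p_x = {x: sum(p for (a, _), p in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (_, b), p in p_xy.items() if b == y) for y in (0, 1)}

# By the chain rule, H(X|Y) = H(X, Y) - H(Y).
H_X_given_Y = H(p_xy) - H(p_y)

# Theorem 4.1: conditioning does not increase entropy.
assert H_X_given_Y <= H(p_x) + 1e-12
```

Here X is a fair bit, so H(X) = 1, while observing the correlated Y drops the conditional entropy below 1.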
The next inequality we will prove shows that joint entropy is upper bounded by its value in the case where the random variables are independent. This means that dependence among random variables decreases entropy. We can prove it using two different methods.
Theorem 4.2. “Independence Bound”
H(X1, X2, ..., Xn) ≤ ∑_{i=1}^{n} H(Xi) (4.2)
Proof: Method 1 uses the chain rule for entropy.
H(X1, X2, ..., Xn) = ∑_{i=1}^{n} H(Xi|X1, ..., Xi−1) ≤ ∑_{i=1}^{n} H(Xi)
□
Proof: Method 2 expands the entropies and relates them to a relative entropy, or divergence.
∑_{i=1}^{n} H(Xi) − H(X1, ..., Xn) = −∑_{i=1}^{n} E[log p(Xi)] + E[log p(X1, ..., Xn)]

= −E[log p(X1) · · · p(Xn)] + E[log p(X1, ..., Xn)]

= E[log p(X1, ..., Xn) / (p(X1) · · · p(Xn))]

= D(p(X1, ..., Xn) || p(X1) · · · p(Xn)) ≥ 0
□
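Method 2 can also be verified numerically: for any joint pmf, the gap ∑ H(Xi) − H(X1, ..., Xn) equals the divergence between the joint and the product of its marginals. A minimal Python sketch (the joint pmf is an arbitrary illustrative choice):

```python
import math

def H(pmf):
    """Shannon entropy in bits of a pmf given as a dict {outcome: probability}."""
    return -sum(q * math.log2(q) for q in pmf.values() if q > 0)

# Arbitrary illustrative joint pmf of (X1, X2) with dependence.
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

p1 = {a: sum(q for (x, _), q in p.items() if x == a) for a in (0, 1)}
p2 = {b: sum(q for (_, y), q in p.items() if y == b) for b in (0, 1)}

# Divergence between the joint and the product of marginals.
D = sum(q * math.log2(q / (p1[x] * p2[y])) for (x, y), q in p.items() if q > 0)

# Entropy gap from the left-hand side of the proof.
gap = H(p1) + H(p2) - H(p)

assert abs(gap - D) < 1e-9  # the two quantities in the proof coincide
assert D >= 0               # hence the independence bound holds
```

The divergence is strictly positive here because (X1, X2) are dependent; it vanishes exactly when the joint factors.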
4.2 Data Processing Inequality
This section provides the necessary theorems and lemmas to prove the data processing inequality.
Theorem 4.3.

I(X; Y, Z) ≥ I(X; Y) (4.3)
equality holds ⇐⇒ X-Y-Z forms a Markov chain.
Proof: Using the chain rule for mutual information, we show
I(X; Y, Z) = I(X; Y) + I(X; Z|Y) ≥ I(X; Y)

since I(X; Z|Y) ≥ 0.
□
The following theorem, Theorem 4.4, shows that the closer two variables are in a Markov chain, the more information they share; i.e. variables that are far apart are closer to being independent.
Theorem 4.4. If X-Y-Z forms a Markov chain, then

I(X; Z) ≤ I(X; Y) (4.4)

I(X; Z) ≤ I(Y; Z) (4.5)
Proof: We prove this by expanding the mutual information in two different ways.
I(X; Y, Z) = I(X; Z) + I(X; Y |Z)
I(X; Y, Z) = I(X; Y ) + I(X; Z|Y )
By the definition of a Markov chain, X⊥Z|Y , therefore, I(X; Z|Y ) = 0 and
I(X; Y ) = I(X; Z) + I(X; Y |Z)
Mutual information is always greater than or equal to zero, therefore
I(X; Y ) ≥ I(X; Z)
Since X-Y-Z is equivalent to Z-Y-X, the same method can be used to prove (4.5). □
Theorem 4.5. “Data Processing Inequality”
If U-X-Y-V is a Markov Chain, then
I(U ; V ) ≤ I(X; Y ) (4.6)
Proof: Since U-X-Y-V is a Markov chain, U-X-Y and U-Y-V are also Markov chains. The proof follows directly from Theorem 4.4:
I(U ; Y ) ≤ I(X; Y )
I(U ; V ) ≤ I(U ; Y )
∴ I(U ; V ) ≤ I(X; Y )
□
The data processing inequality shows us that if we want to infer X from Y, the best we can do is use an unprocessed version of Y. Processing Y (either deterministically or probabilistically) cannot decrease the uncertainty about X given the processed version of Y.
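The inequality I(X; Z) ≤ I(X; Y) for a Markov chain X-Y-Z can be checked numerically. The sketch below is illustrative only (the crossover probabilities are arbitrary): a uniform bit X is passed through two cascaded binary symmetric channels to produce Y and then Z.

```python
import math

def H(pmf):
    """Shannon entropy in bits of a pmf given as a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def mutual_information(joint):
    """I(A; B) = H(A) + H(B) - H(A, B) from a joint pmf dict {(a, b): p}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return H(pa) + H(pb) - H(joint)

# Markov chain X - Y - Z: X is a uniform bit, each arrow a binary
# symmetric channel (crossover probabilities chosen arbitrarily).
eps1, eps2 = 0.1, 0.2
p_xy, p_xz = {}, {}
for x in (0, 1):
    for y in (0, 1):
        pxy = 0.5 * ((1 - eps1) if y == x else eps1)
        p_xy[(x, y)] = p_xy.get((x, y), 0.0) + pxy
        for z in (0, 1):
            pz_given_y = (1 - eps2) if z == y else eps2
            p_xz[(x, z)] = p_xz.get((x, z), 0.0) + pxy * pz_given_y

# Theorem 4.4 / data processing: the second channel can only lose information.
assert mutual_information(p_xz) <= mutual_information(p_xy)
```

Here I(X; Y) = 1 − hb(0.1) ≈ 0.53 bits, while the cascade has effective crossover 0.26, giving I(X; Z) ≈ 0.17 bits.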
4.3 Fano’s Inequality
The following are lemmas and definitions required for Fano’s Inequality.
Lemma 4.6 shows that the entropy of a random variable is always less than or equal to the log of the size of its alphabet.
Lemma 4.6.

H(X) ≤ log |X| (4.7)

equality holds ⇐⇒ P(X = x) = 1/|X|, ∀x ∈ X
Proof: We prove this by expanding the terms into their summations and relating them to a relative entropy measure.
log |X| − H(X) = −∑_{x∈X} p(x) log |X|^{−1} + ∑_{x∈X} p(x) log p(x)

= −∑_{x∈X} p(x) log u(x) + ∑_{x∈X} p(x) log p(x), where u(x) = 1/|X|

= ∑_{x∈X} p(x) log p(x)/u(x)

= D(p(x)||u(x)) ≥ 0

□
One consequence is that equality holds if and only if p(x) = u(x), i.e. p(x) is the uniform distribution. This says that entropy is maximized when all outcomes are equally likely. This makes sense intuitively, since entropy measures uncertainty in the random variable X.
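A quick numerical illustration of Lemma 4.6 (the non-uniform pmf is an arbitrary example on an alphabet of size 4):

```python
import math

def H(pmf):
    """Entropy in bits of a pmf given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# An arbitrary non-uniform pmf on an alphabet of size 4, and the uniform pmf.
p = [0.5, 0.25, 0.125, 0.125]
u = [0.25] * 4

assert H(p) <= math.log2(4)                  # Lemma 4.6: H(X) <= log |X|
assert abs(H(u) - math.log2(4)) < 1e-12      # equality exactly for the uniform pmf
```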
Another consequence is the following corollary:
Corollary 4.7. H(X) can be any non-negative real number.
Proof: The proof follows from the intermediate value theorem. We know that H(X) = 0 for a deterministic variable and H(X) = log |X| for a uniform distribution. Since entropy is continuous in the pmf, for any value 0 < a < log |X| there exists an X such that H(X) = a. For |X| sufficiently large, log |X| exceeds any bound, so H(X) can take any non-negative value. □
Theorem 4.8. “Fano’s Inequality”
First, we define Pe, the probability of error. Let X, X̂ be two random variables on X:

Pe = P(X ≠ X̂) (4.8)

Fano's Inequality:

H(X|X̂) ≤ hb(Pe) + Pe log(|X| − 1) (4.9)
Proof: We will prove Fano’s inequality by expanding the entropy and by using theorem 4.1.
Define an indicator random variable:

Y = 0 if X = X̂, Y = 1 if X ≠ X̂

Note:

P(Y = 1) = Pe

P(Y = 0) = 1 − Pe

H(Y) = hb(Pe)

H(Y|X, X̂) = 0
H(X|X̂) = I(X; Y |X̂) + H(X|X̂, Y )
= H(X|X̂) − H(X|X̂, Y ) + H(X|X̂, Y )
= H(Y |X̂) − H(Y |X̂, X) + H(X|X̂, Y )
= H(Y |X̂) + H(X|X̂, Y )
≤ H(Y ) + H(X|X̂, Y )
= H(Y) + ∑_{x̂∈X} [P(X̂ = x̂, Y = 0) H(X|X̂ = x̂, Y = 0) + P(X̂ = x̂, Y = 1) H(X|X̂ = x̂, Y = 1)]
When Y = 0, we have X = X̂, so the first term in the summation is 0:
H(X|X̂ = x̂, Y = 0) = 0 (4.10)
Lemma 4.6 says that the entropy is less than or equal to the log of the size of the alphabet of a random variable. Since Y = 1 means X ≠ X̂, the possible alphabet for X has size |X| − 1: the original alphabet minus the value that X̂ has taken. Therefore,
H(X|X̂ = x̂, Y = 1) ≤ log(|X | − 1) (4.11)
Using (4.10) and (4.11), we can show
H(X|X̂) ≤ hb(Pe) + log(|X| − 1) ∑_{x̂∈X} P(X̂ = x̂, Y = 1)

The remaining sum is P(Y = 1) = Pe, therefore

H(X|X̂) ≤ hb(Pe) + Pe log(|X| − 1)
□
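Fano's inequality can be sanity-checked numerically. The sketch below uses an arbitrary illustrative joint pmf of X and its estimate X̂ on a ternary alphabet (mostly correct guesses), computes H(X|X̂) and Pe directly, and verifies (4.9):

```python
import math

def hb(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Arbitrary illustrative joint pmf of (X, Xhat) on {0, 1, 2}^2.
p = {(0, 0): 0.30, (1, 1): 0.30, (2, 2): 0.25,
     (0, 1): 0.05, (1, 2): 0.05, (2, 0): 0.05}

Pe = sum(q for (x, xh), q in p.items() if x != xh)  # P(X != Xhat)

# Marginal of Xhat.
p_xh = {}
for (_, xh), q in p.items():
    p_xh[xh] = p_xh.get(xh, 0.0) + q

# H(X | Xhat) = sum over xh of p(xh) * H(X | Xhat = xh).
H_cond = 0.0
for xh, pxh in p_xh.items():
    cond = [q / pxh for (x, b), q in p.items() if b == xh]
    H_cond -= pxh * sum(c * math.log2(c) for c in cond if c > 0)

# Fano's inequality (4.9) with |X| = 3.
assert H_cond <= hb(Pe) + Pe * math.log2(3 - 1) + 1e-12
```

With Pe = 0.15 the bound hb(0.15) + 0.15 · log 2 is comfortably above the actual conditional entropy.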
Corollary 4.9. Weak Fano’s Inequality
H(X|X̂) ≤ 1 + Pe log |X | (4.12)
Proof: We know that binary entropy is upper bounded by 1 and that log is an increasing function, so log(|X| − 1) ≤ log |X|. The corollary is proven. □