Chapter 2: Entropy, Relative Entropy, and Mutual Information
Xiaojun Hei
Internet Technology and Engineering R&D Center, Department of Electronics and Information Engineering
Email: heixj@hust.edu.cn
Web: http://itec.hust.edu.cn/heixj
Phone: 027-87544704
Chapter 2: Entropy, Relative Entropy, and Mutual Information
Entropy
Joint and conditional entropy
Relative entropy and mutual information
Chain rules
Jensen's inequality
Log sum inequality
Data processing inequality
Block diagram of communication systems
The transmission and processing of information in communication systems
Data compression limit, given by Shannon's first theorem
Commonly used coding algorithms for zero-error source coding
Source encoder side
The output sequence of an information source is stochastic: how do we characterize it?
"We can think of a discrete source as generating the message, symbol by symbol... a mathematical model of a system... is known as a stochastic process." -- C. E. Shannon
Source coding
A source code C for a random variable X is a mapping $C : \mathcal{X} \to \mathcal{D}^*$, $x \mapsto C(x)$, where $\mathcal{D}^*$ is the set of finite-length strings of symbols from a D-ary alphabet.(1)
Let C(x) denote the codeword corresponding to x.
Let l(x) denote the length of C(x).
Expected length of a source code: $L(C) = \sum_{x \in \mathcal{X}} p(x) l(x)$
Example: X = {red, blue}, D = {0, 1}, C(red) = 0, C(blue) = 1
(1) When D = 2, the code is binary, and the alphabet is {0, 1}.
Outcomes of the source
Single outcome or outcome sequence
Continuous or Discrete
Modeling single outcome
Continuous source: $x \in \mathbb{R}$ with density $p(x)$, $\int_{\mathbb{R}} p(x)\,dx = 1$
Discrete source: $\begin{pmatrix} X \\ P(X) \end{pmatrix} = \begin{pmatrix} a_1 & a_2 & \cdots & a_q \\ P(a_1) & P(a_2) & \cdots & P(a_q) \end{pmatrix}$, $\sum_{i=1}^{q} P(a_i) = 1$
Modeling outcome sequence
Waveform source: continuous in both time and amplitude; modeled as a continuous stochastic process {x(t)}
Sequence source: sampled from a waveform source; discrete in time or space; modeled as a stochastic sequence {X_i(t_i)}
Classification of sources
Stationarity: does the distribution change with time?
Stationary source: a "good" source (easy to analyze)
Non-stationary source: can sometimes be simplified to a Markov source
Memory: are the variables in the sequence statistically dependent?
Source without memory: a "good" source (easy to analyze)
Source with memory: can be modeled as a Markov source
Sources studied in our course
Motivation: we study ideal sources with "good" properties, then use them to approximate real sources
Discrete source
Single-outcome discrete source
Outcome-sequence discrete source: discrete stationary memoryless source; discrete stationary source with memory
Continuous source
Waveform source
Diagram of communication systems
The transmission and processing of information in communication systems
Information source model
Notations
Sample space: $\mathcal{X}$
Random variable (r.v.): X
Outcome or realization of X: x
Cardinality of the set $\mathcal{X}$ (the number of elements): $|\mathcal{X}|$
Probability mass function (p.m.f.)
$P(x) = \Pr[X = x]$, $x \in \mathcal{X}$
$P(x, y) = \Pr[X = x, Y = y]$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$
Information properties
First, we investigate the measure of information for a single outcome.
It should have the following properties:
(Property 1) The larger the measure, the more surprising the outcome is.
(Property 2) It is a function of the probability distribution: proportional to the inverse of the probability, a non-linear mapping from probability to information ($[0, 1] \to [0, \infty)$).
(Property 3) The information content of two independent r.v.s is the sum of the individual information contents: the logarithm of the inverse probability.
Definition
The self-information of a realization x of r.v. X is defined as
$I(x) = -\log p(x) = \log \frac{1}{p(x)}$
It can be proved that this is the only form satisfying the properties of information.
The base of the logarithm can be anything:
base 2: information measured in binary units (bits)
base e: information measured in natural units (nats)
Self-information: example
Given a source with M equally likely outcomes, the self-information of each outcome is $k = \log_2 M$ bits.
This means each outcome can be described by k bits of information.
For instance, if a source has M = 4 outcomes {a, b, c, d}, then each outcome can be described by $\log_2 4 = 2$ bits, e.g., with the code set {00, 01, 10, 11}:
a -> 00, b -> 01, c -> 10, d -> 11
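As a quick illustration, here is a minimal Python sketch (not from the slides; the helper name is ours) that evaluates the self-information of an equiprobable outcome:

```python
import math

def self_information(p, base=2):
    """Self-information I(x) = log(1/p(x)) of an outcome with probability p."""
    return -math.log(p, base)

# Source with M = 4 equally likely outcomes {a, b, c, d}
M = 4
print(self_information(1 / M))  # 2.0 bits, matching the 2-bit codewords {00, 01, 10, 11}
```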
Entropy definition
The average information of r.v. X is called the entropy of X:
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$
A (convenient) measure of the uncertainty of the r.v.
Entropy is a function of the probability distribution: it is independent of the outcome values themselves; only the distribution matters.
The logarithm base can be anything, 2 by default; base b is sometimes indicated as $H_b(X)$.
By a continuity argument, if p(x) = 0 we take $p(x) \log \frac{1}{p(x)} = 0 \cdot \log \frac{1}{0} = 0$, so zero-probability outcomes have no impact on the entropy.
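A minimal Python sketch of this definition (the function name is ours, not from the slides); note how zero-probability terms are skipped, consistent with the continuity convention above:

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum_x p(x) log p(x); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits
```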
Logarithm: properties (http://en.wikipedia.org/wiki/Logarithm)
Product, quotient, power, and root
Product: $\log_b(xy) = \log_b(x) + \log_b(y)$. Example: $\log_3(243) = \log_3(9 \cdot 27) = \log_3(9) + \log_3(27) = 2 + 3 = 5$
Quotient: $\log_b(x/y) = \log_b(x) - \log_b(y)$. Example: $\log_2(16) = \log_2(64/4) = \log_2(64) - \log_2(4) = 6 - 2 = 4$
Power: $\log_b(x^p) = p \log_b(x)$. Example: $\log_2(64) = \log_2(2^6) = 6 \log_2(2) = 6$
Root: $\log_b(\sqrt[p]{x}) = \frac{\log_b(x)}{p}$. Example: $\log_{10}(\sqrt{1000}) = \frac{1}{2} \log_{10}(1000) = \frac{3}{2} = 1.5$
Change of base: $\log_b(x) = \frac{\log_k(x)}{\log_k(b)}$
Derivative and antiderivative: $\frac{d}{dx} \log_b(x) = \frac{1}{x \ln(b)}$, $\frac{d}{dx} \ln(f(x)) = \frac{f'(x)}{f(x)}$, $\int \ln(x)\,dx = x \ln(x) - x + C$
Integral representation of the natural logarithm: $\ln(t) = \int_1^t \frac{1}{x}\,dx$
Logarithm: functions (http://en.wikipedia.org/wiki/Logarithm)
The graph of the logarithm to base 2 crosses the x-axis (horizontal axis) at 1 and passes through the points with coordinates (2, 1), (4, 2), and (8, 3).
Logarithm: functions (http://en.wikipedia.org/wiki/Logarithm)
The graph of the logarithm function $\log_b(x)$ (blue) is obtained by reflecting the graph of the function $b^x$ (red) at the diagonal line (x = y).
Logarithm: functions (http://en.wikipedia.org/wiki/Logarithm)
The graph of the natural logarithm (green) and its tangent at x = 1.5 (black).
Entropy: basic properties
Entropy is the expected value of self-information:
$H(X) = E\{-\log p(x)\} = E\left[\log \frac{1}{p(x)}\right]$
Entropy H(X) is non-negative:
$0 \le p(x) \le 1 \;\Rightarrow\; \log \frac{1}{p(x)} \ge 0 \;\Rightarrow\; H(X) = E\left[\log \frac{1}{p(x)}\right] \ge 0$
Change of base: $H_b(X) = (\log_b a)\, H_a(X)$, since $\log_b p = (\log_b a) \log_a p$
Example #1: entropy of uniform r.v.
Consider a uniform random variable with $M = 32 = 2^5$ possible outcomes.
Then 5 bits are sufficient to describe each outcome.
The entropy of this r.v. is
$H(X) = -\sum_{i=1}^{32} p(x_i) \log_2 p(x_i) = -\sum_{i=1}^{32} \frac{1}{32} \log_2 \frac{1}{32} = 5$ bits
Entropy agrees with $\log_2 M$: $H(X) = \log_2 M$.
Example #2: entropy of non-uniform r.v.
Consider a non-uniform random variable with $M = 8 = 2^3$ possible outcomes and probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).
Then 3 bits are sufficient to describe each outcome.
The entropy of this r.v. is
$H(X) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{8}\log_2\frac{1}{8} - \frac{1}{16}\log_2\frac{1}{16} - \frac{4}{64}\log_2\frac{1}{64} = 2$ bits
Entropy and $\log_2 M$ disagree: $H(X) < \log_2 M$.
The average length of a source codeword can be made shorter than 3 bits.
Example: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)
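The following Python sketch (assuming the probabilities and the code set listed above) checks both numbers: the entropy is 2 bits, and the expected codeword length of the example code is also 2 bits, i.e., shorter than 3:

```python
import math

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
code  = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

H = -sum(p * math.log2(p) for p in probs)         # entropy of the source
L = sum(p * len(c) for p, c in zip(probs, code))  # expected codeword length

print(H, L)  # 2.0 bits and 2.0 bits
```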
Definition
Joint entropy
$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y) = -E\{\log p(x, y)\}$
Conditional entropy
$H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x) = -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log p(y|x) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x) = -E\{\log p(y|x)\}$
Note that in general $H(Y|X) \ne H(X|Y)$.
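A small Python sketch of these two definitions, using an illustrative joint p.m.f. of our own choosing (rows indexed by x, columns by y):

```python
import math

def joint_entropy(pxy):
    """H(X,Y) = -sum_{x,y} p(x,y) log2 p(x,y) over a 2-D joint p.m.f."""
    return -sum(p * math.log2(p) for row in pxy for p in row if p > 0)

def cond_entropy_y_given_x(pxy):
    """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with rows of pxy indexed by x."""
    h = 0.0
    for row in pxy:
        px = sum(row)  # marginal p(x)
        for p in row:
            if p > 0:
                h -= p * math.log2(p / px)
    return h

pxy = [[1/4, 1/4],
       [1/2, 0.0]]
print(joint_entropy(pxy), cond_entropy_y_given_x(pxy))  # 1.5 bits and 0.5 bits
```

Here H(X) = 1 bit, so the chain rule H(X, Y) = H(X) + H(Y|X) is visible in the two printed values.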
Venn Diagram
Motivation
In a geometric space, the distance between two points A and B is $|AB| = \sqrt{(x_A - x_B)^2 + (y_A - y_B)^2}$.
Analogously, the difference between two information contents is $I_B - I_A = \log \frac{1}{q(x)} - \log \frac{1}{p(x)} = \log \frac{p(x)}{q(x)}$.
Averaging over p: $\sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$
Relative entropy (or Kullback-Leibler distance)
Definition: a measure of the information distance or the informational divergence between two p.m.f.s, p(x) and q(x):
$D(p(x)\|q(x)) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = E_p\left[\log \frac{p(X)}{q(X)}\right]$
When p(x) is the true p.m.f. of X, this measures the inefficiency of assuming that q(x) is the p.m.f. of X.
It is distance-like in many respects, but it is not a true distance, since it
is not symmetric: $D(p\|q)$ vs. $D(q\|p)$
does not satisfy the triangle inequality: $D(p\|q) + D(q\|r)$ vs. $D(p\|r)$
Geometric space
Characterize the position of any point with coordinates
Compute distances using coordinates
Symmetric distance: $|P_1 P_2| = |P_2 P_1|$
Triangle inequality: $|P_1 P_2| + |P_2 P_3| \ge |P_1 P_3|$
Relative entropy: example
Let $x \in \mathcal{X} = \{0, 1\}$, $p(0) = 1 - r$, $p(1) = r$, $q(0) = 1 - s$, $q(1) = s$.
$D(p(x)\|q(x)) = p(0) \log \frac{p(0)}{q(0)} + p(1) \log \frac{p(1)}{q(1)} = (1 - r) \log \frac{1 - r}{1 - s} + r \log \frac{r}{s}$
$D(q(x)\|p(x)) = q(0) \log \frac{q(0)}{p(0)} + q(1) \log \frac{q(1)}{p(1)} = (1 - s) \log \frac{1 - s}{1 - r} + s \log \frac{s}{r}$
If r = s, then $D(p\|q) = D(q\|p) = 0$.
If $r \ne s$, for example r = 1/2, s = 1/4: $D(p\|q) = 0.2075$ bits, $D(q\|p) = 0.1887$ bits.
Thus, in general $D(p\|q) \ne D(q\|p)$.
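A short Python sketch (our own helper, not part of the slides) reproduces these two numbers:

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

r, s = 1/2, 1/4
p = [1 - r, r]
q = [1 - s, s]
print(kl_divergence(p, q))  # ~0.2075 bits
print(kl_divergence(q, p))  # ~0.1887 bits, so D(p||q) differs from D(q||p)
```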
Mutual information
Things are commonly related; two random variables are usually related.
From an information perspective, how do we characterize the relationship between r.v. X and r.v. Y?
Observing X alone, the information of X is H(X).
Knowing Y, the information of X becomes H(X|Y).
Knowing Y, the information of X is reduced by H(X) - H(X|Y).
This reduction is the uncertainty about X that is removed by knowing Y.
Mutual information
Mutual information is the relative entropy between the joint distribution and the product distribution of two random variables X, Y:
$I(X; Y) = D[p(x, y)\|p(x)p(y)] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} = E_{(X,Y)}\left[\log \frac{p(X, Y)}{p(X)p(Y)}\right]$
A measure of the information one random variable (say, X) contains about the other (Y).
Special cases: if X and Y are independent, I(X; Y) = 0; if Y = X, I(X; X) = H(X).
Mutual information
Conditional relative entropy
$D(p(y|x)\|q(y|x)) = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(y|x)}{q(y|x)}$
Conditional mutual information
$I(X; Y|Z) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \sum_{z \in \mathcal{Z}} p(x, y, z) \log \frac{p(x, y|z)}{p(x|z)p(y|z)} = E_{p(x,y,z)}\left[\log \frac{p(X, Y|Z)}{p(X|Z)p(Y|Z)}\right]$
Mutual information vs Entropy
I(X; Y) = H(X) - H(X|Y)
Proof:
$I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)p(y)} = \sum_x \sum_y p(x, y) \log \frac{p(x|y)}{p(x)}$
$= -\sum_x \sum_y p(x, y) \log p(x) + \sum_x \sum_y p(x, y) \log p(x|y)$
$= -\sum_x p(x) \log p(x) + \sum_x \sum_y p(x, y) \log p(x|y) = H(X) - H(X|Y)$
Mutual information vs Entropy
Expressions:
I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = I(Y; X)
I(X; Y) = H(X) + H(Y) - H(X, Y)
I(X; X) = H(X)
I(X; Y|Z) = H(X|Z) - H(X|Y, Z)
Venn Diagram
Example #1
Joint p.m.f. is:
Y \ X    1      2      3      4      p(y)
1        1/8    1/16   1/32   1/32   1/4
2        1/16   1/8    1/32   1/32   1/4
3        1/16   1/16   1/16   1/16   1/4
4        1/4    0      0      0      1/4
p(x)     1/2    1/4    1/8    1/8
What is H(X), H(Y), H(X|Y), H(Y|X), H(X,Y), I(X; Y)?
Solution of example #1
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = H\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}\right) = -\left[\tfrac{1}{2}\log\tfrac{1}{2} + \tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{1}{8}\log\tfrac{1}{8} + \tfrac{1}{8}\log\tfrac{1}{8}\right] = 1.75$ bits
$H(Y) = -\sum_{y \in \mathcal{Y}} p(y) \log p(y) = H\left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}\right) = -\left[\tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{1}{4}\log\tfrac{1}{4}\right] = 2$ bits
Solution of example #1
$H(X|Y) = \sum_{y \in \mathcal{Y}} p(y) H(X|Y = y) = \sum_{y \in \mathcal{Y}} p(y) \sum_{x \in \mathcal{X}} p(x|y) \log \frac{1}{p(x|y)} = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(y)}$
$= -\left[\tfrac{1}{8}\log\tfrac{1/8}{1/4} + \tfrac{1}{16}\log\tfrac{1/16}{1/4} + \tfrac{1}{32}\log\tfrac{1/32}{1/4} + \tfrac{1}{32}\log\tfrac{1/32}{1/4} + \tfrac{1}{16}\log\tfrac{1/16}{1/4} + \tfrac{1}{8}\log\tfrac{1/8}{1/4} + \tfrac{1}{32}\log\tfrac{1/32}{1/4} + \tfrac{1}{32}\log\tfrac{1/32}{1/4} + 4 \cdot \tfrac{1}{16}\log\tfrac{1/16}{1/4} + \tfrac{1}{4}\log\tfrac{1/4}{1/4} + 3 \cdot 0\log\tfrac{0}{1/4}\right] = 1.375$ bits
Solution of example #1
$H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x) = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{1}{p(y|x)} = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)}$
$= -\left[\tfrac{1}{8}\log\tfrac{1/8}{1/2} + \tfrac{1}{16}\log\tfrac{1/16}{1/4} + \tfrac{1}{32}\log\tfrac{1/32}{1/8} + \tfrac{1}{32}\log\tfrac{1/32}{1/8} + \tfrac{1}{16}\log\tfrac{1/16}{1/2} + \tfrac{1}{8}\log\tfrac{1/8}{1/4} + \tfrac{1}{32}\log\tfrac{1/32}{1/8} + \tfrac{1}{32}\log\tfrac{1/32}{1/8} + \tfrac{1}{16}\log\tfrac{1/16}{1/2} + \tfrac{1}{16}\log\tfrac{1/16}{1/4} + \tfrac{1}{16}\log\tfrac{1/16}{1/8} + \tfrac{1}{16}\log\tfrac{1/16}{1/8} + \tfrac{1}{4}\log\tfrac{1/4}{1/2} + 0\log\tfrac{0}{1/4} + 0\log\tfrac{0}{1/8} + 0\log\tfrac{0}{1/8}\right] = 1.625$ bits
Solution of example #1
$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$
$= -\left[\tfrac{1}{8}\log\tfrac{1}{8} + \tfrac{1}{16}\log\tfrac{1}{16} + \tfrac{1}{32}\log\tfrac{1}{32} + \tfrac{1}{32}\log\tfrac{1}{32} + \tfrac{1}{16}\log\tfrac{1}{16} + \tfrac{1}{8}\log\tfrac{1}{8} + \tfrac{1}{32}\log\tfrac{1}{32} + \tfrac{1}{32}\log\tfrac{1}{32} + \tfrac{1}{16}\log\tfrac{1}{16} + \tfrac{1}{16}\log\tfrac{1}{16} + \tfrac{1}{16}\log\tfrac{1}{16} + \tfrac{1}{16}\log\tfrac{1}{16} + \tfrac{1}{4}\log\tfrac{1}{4} + 0\log 0 + 0\log 0 + 0\log 0\right] = 3.375$ bits
Note that H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y), by observation in this example.
Solution of example #1
Method 1:
I(X; Y) = H(X) - H(X|Y) = 1.75 - 1.375 = 0.375 bits
I(X; Y) = H(Y) - H(Y|X) = 2 - 1.625 = 0.375 bits
Method 2:
$I(X; Y) = D[p(x, y)\|p(x)p(y)] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}$
$= \tfrac{1}{8}\log\tfrac{1/8}{(1/2)(1/4)} + \tfrac{1}{16}\log\tfrac{1/16}{(1/4)(1/4)} + \tfrac{1}{32}\log\tfrac{1/32}{(1/8)(1/4)} + \tfrac{1}{32}\log\tfrac{1/32}{(1/8)(1/4)} + \tfrac{1}{16}\log\tfrac{1/16}{(1/2)(1/4)} + \tfrac{1}{8}\log\tfrac{1/8}{(1/4)(1/4)} + \tfrac{1}{32}\log\tfrac{1/32}{(1/8)(1/4)} + \tfrac{1}{32}\log\tfrac{1/32}{(1/8)(1/4)} + \tfrac{1}{16}\log\tfrac{1/16}{(1/2)(1/4)} + \tfrac{1}{16}\log\tfrac{1/16}{(1/4)(1/4)} + \tfrac{1}{16}\log\tfrac{1/16}{(1/8)(1/4)} + \tfrac{1}{16}\log\tfrac{1/16}{(1/8)(1/4)} + \tfrac{1}{4}\log\tfrac{1/4}{(1/2)(1/4)} + 0\log 0 + 0\log 0 + 0\log 0 = 0.375$ bits
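A Python sketch that recomputes every quantity of this example directly from the joint table (the helper names are ours):

```python
import math

# Joint p.m.f. of the example; rows indexed by y = 1..4, columns by x = 1..4
pxy = [[1/8, 1/16, 1/32, 1/32],
       [1/16, 1/8, 1/32, 1/32],
       [1/16, 1/16, 1/16, 1/16],
       [1/4, 0, 0, 0]]

px = [sum(row[j] for row in pxy) for j in range(4)]  # (1/2, 1/4, 1/8, 1/8)
py = [sum(row) for row in pxy]                       # (1/4, 1/4, 1/4, 1/4)

def H(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

HX, HY = H(px), H(py)
HXY = -sum(p * math.log2(p) for row in pxy for p in row if p > 0)

print(HX, HY, HXY)         # 1.75, 2.0, 3.375 bits
print(HXY - HY, HXY - HX)  # H(X|Y) = 1.375 bits, H(Y|X) = 1.625 bits (chain rule)
print(HX + HY - HXY)       # I(X;Y) = 0.375 bits
```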
The chain rule: motivation
In calculus, the chain rule is a formula for computing the derivative of the composition of two or more functions.
Let y = f(u) and u = g(x). Then $[f(g(x))]' = f'(g(x))\, g'(x)$, i.e., $\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}$.
In information theory, the chain rule is a formula for computing the entropies of the composition of two or more random variables.
The chain rule
H(X, Y) = H(X) + H(Y|X)
Proof:
$H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y) = -\sum_x \sum_y p(x, y) \log [p(x) p(y|x)]$
$= -\sum_x \sum_y p(x, y) \log p(x) - \sum_x \sum_y p(x, y) \log p(y|x)$
$= -\sum_x p(x) \log p(x) - \sum_x \sum_y p(x, y) \log p(y|x) = H(X) + H(Y|X)$
Corollary: H(X, Y|Z) = H(X|Z) + H(Y|X, Z)
Example #2
Joint p.m.f. is:
Y \ X    1      2      3      4      p(y)
1        1/8    1/16   1/32   1/32   1/4
2        1/16   1/8    1/32   1/32   1/4
3        1/16   1/16   1/16   1/16   1/4
4        1/4    0      0      0      1/4
p(x)     1/2    1/4    1/8    1/8
What is H(X), H(Y), H(X|Y), H(Y|X), H(X, Y)?
Compute H(X), H(Y)
$H(X) = H(1/2, 1/4, 1/8, 1/8) = -\sum_i p(X = i) \log p(X = i) = -\left[\tfrac{1}{2}\log\tfrac{1}{2} + \tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{1}{8}\log\tfrac{1}{8} + \tfrac{1}{8}\log\tfrac{1}{8}\right] = 1.75$ bits
$H(Y) = H(1/4, 1/4, 1/4, 1/4) = -\sum_i p(Y = i) \log p(Y = i) = -\left[\tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{1}{4}\log\tfrac{1}{4}\right] = 2$ bits
Compute H(X|Y)
$H(X|Y) = \sum_j p(Y = j) H(X|Y = j) = -\sum_j p(Y = j) \sum_i p(X = i|Y = j) \log p(X = i|Y = j) = -\sum_{i,j} p(X = i, Y = j) \log \frac{p(X = i, Y = j)}{p(Y = j)}$
$= -\left[\tfrac{1}{8}\log\tfrac{1/8}{1/4} + \tfrac{1}{16}\log\tfrac{1/16}{1/4} + \tfrac{1}{16}\log\tfrac{1/16}{1/4} + \tfrac{1}{4}\log\tfrac{1/4}{1/4} + \tfrac{1}{16}\log\tfrac{1/16}{1/4} + \tfrac{1}{8}\log\tfrac{1/8}{1/4} + \tfrac{1}{16}\log\tfrac{1/16}{1/4} + 0\log\tfrac{0}{1/4} + \tfrac{1}{32}\log\tfrac{1/32}{1/4} + \tfrac{1}{32}\log\tfrac{1/32}{1/4} + \tfrac{1}{16}\log\tfrac{1/16}{1/4} + 0\log\tfrac{0}{1/4} + \tfrac{1}{32}\log\tfrac{1/32}{1/4} + \tfrac{1}{32}\log\tfrac{1/32}{1/4} + \tfrac{1}{16}\log\tfrac{1/16}{1/4} + 0\log\tfrac{0}{1/4}\right] = 1.375$ bits
Compute H(Y|X)
$H(Y|X) = \sum_i p(X = i) H(Y|X = i) = -\sum_i p(X = i) \sum_j p(Y = j|X = i) \log p(Y = j|X = i) = -\sum_{i,j} p(X = i, Y = j) \log \frac{p(X = i, Y = j)}{p(X = i)}$
$= -\left[\tfrac{1}{8}\log\tfrac{1/8}{1/2} + \tfrac{1}{16}\log\tfrac{1/16}{1/2} + \tfrac{1}{16}\log\tfrac{1/16}{1/2} + \tfrac{1}{4}\log\tfrac{1/4}{1/2} + \tfrac{1}{16}\log\tfrac{1/16}{1/4} + \tfrac{1}{8}\log\tfrac{1/8}{1/4} + \tfrac{1}{16}\log\tfrac{1/16}{1/4} + 0\log\tfrac{0}{1/4} + \tfrac{1}{32}\log\tfrac{1/32}{1/8} + \tfrac{1}{32}\log\tfrac{1/32}{1/8} + \tfrac{1}{16}\log\tfrac{1/16}{1/8} + 0\log\tfrac{0}{1/8} + \tfrac{1}{32}\log\tfrac{1/32}{1/8} + \tfrac{1}{32}\log\tfrac{1/32}{1/8} + \tfrac{1}{16}\log\tfrac{1/16}{1/8} + 0\log\tfrac{0}{1/8}\right] = 1.625$ bits
Compute H(X,Y)
H(X) = H(1/2, 1/4, 1/8, 1/8) = 1.75 bits
H(Y) = H(1/4, 1/4, 1/4, 1/4) = 2 bits
$H(X|Y) = \sum_i \Pr(Y = i) H(X|Y = i) = 1.375$ bits
$H(Y|X) = \sum_i \Pr(X = i) H(Y|X = i) = 1.625$ bits
H(X, Y) = H(X) + H(Y|X) = 1.75 + 1.625 = 3.375 bits (chain rule)
H(X) - H(X|Y) = 1.75 - 1.375 = 0.375 bits
H(Y) - H(Y|X) = 2 - 1.625 = 0.375 bits
I(X; Y) = H(X) - H(X|Y)
I(X; Y) = H(Y) - H(Y|X)
Chain rules
Chain rules can be derived by repeated application of the two-variable expansion rule
H(X, Y) = H(X) + H(Y|X)
Entropy
$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, X_{i-2}, \ldots, X_1)$
Mutual information
$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, X_{i-2}, \ldots, X_1)$
Relative entropy
$D(p(x, y)\|q(x, y)) = D(p(x)\|q(x)) + D(p(y|x)\|q(y|x))$
Chain rule examples
$H(X_1, X_2, X_3) = \sum_{i=1}^{3} H(X_i | X_{i-1}, X_{i-2}, \ldots, X_1) = H(X_1) + H(X_2|X_1) + H(X_3|X_2, X_1)$
$I(X_1, X_2, X_3; Y) = \sum_{i=1}^{3} I(X_i; Y | X_{i-1}, X_{i-2}, \ldots, X_1) = I(X_1; Y) + I(X_2; Y|X_1) + I(X_3; Y|X_2, X_1)$
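As a numeric sanity check of the three-variable entropy chain rule, here is a Python sketch over a hypothetical joint p.m.f. of three binary variables (the distribution is invented purely for illustration):

```python
import math

# Hypothetical joint p.m.f. p(x1, x2, x3) over three binary variables
p = {(0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.15,
     (1, 0, 0): 0.05, (1, 0, 1): 0.15, (1, 1, 0): 0.10, (1, 1, 1): 0.20}

def H(dist):
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def marginal(p, idx):
    """Marginal p.m.f. over the coordinates listed in idx."""
    out = {}
    for k, v in p.items():
        key = tuple(k[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

def cond_entropy(p, target, given):
    """H(X_target | X_given) = -sum p(given, target) log2 p(target | given)."""
    pg = marginal(p, given)
    pgt = marginal(p, given + [target])
    return -sum(v * math.log2(v / pg[k[:len(given)]]) for k, v in pgt.items() if v > 0)

lhs = H(p)                              # H(X1, X2, X3)
rhs = (H(marginal(p, [0]))              # H(X1)
       + cond_entropy(p, 1, [0])        # H(X2 | X1)
       + cond_entropy(p, 2, [0, 1]))    # H(X3 | X1, X2)
print(lhs, rhs)                         # both about 2.85 bits: the chain rule holds
```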
Conditional entropies in communication systems
System model: the source sends r.v. X, the destination receives r.v. Y; a realization of X (or Y) is $x_i$ (or $y_i$).
How much information is transferred from the source to the destination?
Options: H(X), H(Y), H(X, Y), H(X|Y), H(Y|X), I(X; Y)
$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x) p(y|x) \log \frac{p(y|x)}{p(y)}$
I(X; Y) is a function of the input distribution p(x) and the channel characteristics p(y|x).
Channel capacity: $C = \max_{p(x)} I(X; Y)$
H(X|Y) in communication systems
Ideally, H(X) should be transmitted from the source to the destination.
H(X) = H(X|Y) + I(X; Y)
I(X; Y) = H(X) - H(X|Y)
At the destination, after Y is received, there still exists average uncertainty about the source X due to the transmission distortion in the channel.
H(X|Y): loss entropy
H(Y|X) in communication systems
Ideally, if there is no noise in the channel, there should exist a deterministic relationship between the sender and the receiver.
H(Y) = I(Y; X) + H(Y|X)
I(Y; X) = H(Y) - H(Y|X)
At the source, after X is sent, there still exists average uncertainty about the destination Y due to the channel noise.
H(Y|X): noise entropy
Mutual information of realization: at destination
A priori probability $p(x_i)$: uncertainty about $x_i$ before receiving $y_j$
A posteriori probability $p(x_i|y_j)$: uncertainty about $x_i$ after receiving $y_j$
$I(x_i; y_j)$: the amount of uncertainty reduction obtained by receiving $y_j$
$I(x_i; y_j) = I(x_i) - I(x_i|y_j) = \log \frac{1}{p(x_i)} - \log \frac{1}{p(x_i|y_j)} = \log \frac{p(x_i|y_j)}{p(x_i)}$
Mutual information of realization: at source
A priori probability $p(y_j)$: uncertainty about $y_j$ before sending $x_i$
A posteriori probability $p(y_j|x_i)$: uncertainty about $y_j$ after sending $x_i$
$I(y_j; x_i)$: the amount of uncertainty reduction obtained by sending $x_i$
$I(y_j; x_i) = I(y_j) - I(y_j|x_i) = \log \frac{1}{p(y_j)} - \log \frac{1}{p(y_j|x_i)} = \log \frac{p(y_j|x_i)}{p(y_j)}$
I(xi;yj) vs I(yj; xi)
$I(x_i; y_j) = I(y_j; x_i)$
Proof:
$I(x_i; y_j) = \log \frac{p(x_i|y_j)}{p(x_i)} = \log \frac{p(x_i|y_j) p(y_j)}{p(x_i) p(y_j)} = \log \frac{p(x_i, y_j)}{p(x_i) p(y_j)} = \log \frac{p(y_j|x_i)}{p(y_j)} = I(y_j; x_i)$
Mutual information of realization: system
Before communication, X and Y are considered to be statistically independent:
$p(x_i, y_j) = p(x_i) p(y_j)$
$I_{before}(x_i, y_j) = \log \frac{1}{p(x_i, y_j)} = \log \frac{1}{p(x_i) p(y_j)} = \log \frac{1}{p(x_i)} + \log \frac{1}{p(y_j)} = I(x_i) + I(y_j)$
After communication, X and Y are related due to the channel characteristics:
$p(x_i, y_j) = p(x_i) p(y_j|x_i) = p(y_j) p(x_i|y_j)$
$I_{after}(x_i, y_j) = \log \frac{1}{p(x_i, y_j)} = I(x_i, y_j)$
$I(x_i; y_j)$ is the reduction of uncertainty from before to after communication:
$I(x_i; y_j) = I_{before}(x_i, y_j) - I_{after}(x_i, y_j) = I(x_i) + I(y_j) - I(x_i, y_j)$
Mutual information of realization: equivalency
At the destination: $I(x_i; y_j) = I(x_i) - I(x_i|y_j)$.
At the source: $I(y_j; x_i) = I(y_j) - I(y_j|x_i)$.
From the system view: $I(x_i; y_j) = I(x_i) + I(y_j) - I(x_i, y_j)$.
$I(x_i, y_j) = \log \frac{1}{p(x_i, y_j)} = \log \frac{1}{p(x_i) p(y_j|x_i)} = \log \frac{1}{p(x_i)} + \log \frac{1}{p(y_j|x_i)} = I(x_i) + I(y_j|x_i)$
$I(x_i; y_j) = I(x_i) + I(y_j) - I(x_i, y_j) = I(x_i) + I(y_j) - [I(x_i) + I(y_j|x_i)] = I(y_j) - I(y_j|x_i)$
$I(y_j, x_i) = \log \frac{1}{p(y_j, x_i)} = \log \frac{1}{p(y_j) p(x_i|y_j)} = \log \frac{1}{p(y_j)} + \log \frac{1}{p(x_i|y_j)} = I(y_j) + I(x_i|y_j)$
$I(x_i; y_j) = I(x_i) + I(y_j) - I(x_i, y_j) = I(x_i) + I(y_j) - [I(y_j) + I(x_i|y_j)] = I(x_i) - I(x_i|y_j)$
Mutual information in communication systems
Mutual information of a realization, at the micro level: $I(x_i; y_j) = \log \frac{p(x_i|y_j)}{p(x_i)} = \log \frac{1}{p(x_i)} - \log \frac{1}{p(x_i|y_j)}$
At the destination: $I(x_i; y_j) = I(x_i) - I(x_i|y_j)$; at the source: $I(y_j; x_i) = I(y_j) - I(y_j|x_i)$; from the system view: $I(x_i; y_j) = I(x_i) + I(y_j) - I(x_i, y_j)$
Mutual information at the macro level:
$I(X; Y) = D[p(x, y)\|p(x)p(y)] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}$
I(X; Y) = I(Y; X)
I(xi;yj) vs I(X; Y)
$I(x_i; y_j) = \log \frac{p(x_i, y_j)}{p(x_i) p(y_j)}$
$I(X; Y) = D[p(x, y)\|p(x)p(y)] = \sum_{x_i \in \mathcal{X}} \sum_{y_j \in \mathcal{Y}} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i) p(y_j)} = \sum_{x_i \in \mathcal{X}} \sum_{y_j \in \mathcal{Y}} p(x_i, y_j) I(x_i; y_j) = E_{X,Y}[I(x; y)]$
An example communication system
Given a discrete source $\begin{pmatrix} X \\ p(X) \end{pmatrix} = \begin{pmatrix} x_1 & x_2 \\ 0.2 & 0.8 \end{pmatrix}$, the output messages pass through a noisy channel; the received messages are modeled as $Y = [y_1, y_2]$.
Self-information of the event $x_1$: $I(x_1) = \log \frac{1}{p(x_1)} = \log_2 \frac{1}{0.2} = 2.322$ bits
$p(y_1) = \sum_{x_i} p(x_i) p(y_1|x_i) = 0.335$
$I(x_1; y_1) = \log_2 \frac{p(y_1|x_1)}{p(y_1)} = \log_2 \frac{7/8}{0.335} = 1.39$ bits
$I(x_1; y_2) = \log_2 \frac{p(y_2|x_1)}{p(y_2)} = \log_2 \frac{1/8}{0.665} \approx -2.41$ bits
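These numbers can be reproduced with the Python sketch below; note that p(y1|x2) = 0.2 is not stated explicitly on the slide and is inferred here from the given p(y1) = 0.335, so treat it as an assumption:

```python
import math

px = {"x1": 0.2, "x2": 0.8}
py_given_x = {"x1": {"y1": 7/8, "y2": 1/8},
              "x2": {"y1": 0.2, "y2": 0.8}}   # p(y1|x2) = 0.2 inferred from p(y1) = 0.335

py = {y: sum(px[x] * py_given_x[x][y] for x in px) for y in ("y1", "y2")}

I_x1 = math.log2(1 / px["x1"])                          # ~2.32 bits
I_x1_y1 = math.log2(py_given_x["x1"]["y1"] / py["y1"])  # ~1.39 bits
I_x1_y2 = math.log2(py_given_x["x1"]["y2"] / py["y2"])  # ~-2.41 bits (y2 makes x1 less likely)
print(py["y1"], I_x1, I_x1_y1, I_x1_y2)
```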
Motivation
Jensen's inequality is based on convexity.
Jensen's inequality: preview
It is used very widely in information theory.
Most of the basic theorems are proved using Jensen's inequality.
Preview: if f is a convex function, then $E[f(X)] \ge f(E[X])$.
What is convexity?
Convex functions lie below any chord.
Convex = concave upwards = concave up = convex cup
A function f(x) is convex over (a, b) if
$\forall x_1, x_2 \in (a, b),\ \forall\, 0 \le \lambda \le 1:\quad f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$
A function f(x) is strictly convex over (a, b) if it is convex and equality, $f(\lambda x_1 + (1 - \lambda) x_2) = \lambda f(x_1) + (1 - \lambda) f(x_2)$, holds only when $\lambda = 0$ or $\lambda = 1$.
What is concavity?
Concave functions lie above any chord
A function f(x) is concave over (a, b) if
$\forall x_1, x_2 \in (a, b),\ \forall\, 0 \le \lambda \le 1:\quad f(\lambda x_1 + (1 - \lambda) x_2) \ge \lambda f(x_1) + (1 - \lambda) f(x_2)$
A function f(x) is strictly concave over (a, b) if it is concave and equality, $f(\lambda x_1 + (1 - \lambda) x_2) = \lambda f(x_1) + (1 - \lambda) f(x_2)$, holds only when $\lambda = 0$ or $\lambda = 1$.
Examples
Test of convexity and concavity: if a function f(x) has a second derivative f''(x) that is non-negative (positive) everywhere, then f(x) is convex (strictly convex).
Examples of convex and concave functions
Jensen's inequality: proof
If f is convex, then for a r.v. X, $E[f(X)] \ge f(E[X])$.
If f is strictly convex, equality implies X = E[X] with probability 1.
Sketch of the proof: we prove this for discrete distributions by mathematical induction(2) on the number of mass points.
For n = 2 the inequality becomes $p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2)$, which holds by convexity.
Suppose the theorem is true for distributions with n - 1 mass points: $\sum_{i=1}^{n-1} q_i f(x_i) \ge f\left(\sum_{i=1}^{n-1} q_i x_i\right)$
Then, prove that the inequality holds for n.
(2) http://en.wikipedia.org/wiki/Mathematical_induction
Jensen's inequality: proof
If f is convex, then for a r.v. X, $E[f(X)] \ge f(E[X])$.
If f is strictly convex, equality implies X = E[X] with probability 1.
$E[f(X)] = \sum_{i=1}^{n} p_i f(x_i) = p_n f(x_n) + \sum_{i=1}^{n-1} p_i f(x_i) = p_n f(x_n) + (1 - p_n) \sum_{i=1}^{n-1} \frac{p_i}{1 - p_n} f(x_i)$
$\ge p_n f(x_n) + (1 - p_n) f\left(\sum_{i=1}^{n-1} \frac{p_i}{1 - p_n} x_i\right) \ge f\left(p_n x_n + (1 - p_n) \sum_{i=1}^{n-1} \frac{p_i}{1 - p_n} x_i\right) = f\left(\sum_{i=1}^{n} p_i x_i\right) = f(E[X])$
Relative-entropy properties
We can use Jensen's inequality to prove some of the properties of relative entropy.
Theorem (information inequality): let p(x), q(x), $x \in \mathcal{X}$, be two p.m.f.s. Then $D(p(x)\|q(x)) \ge 0$, with $D(p(x)\|q(x)) = 0$ if and only if p(x) = q(x).
Corollary (non-negativity of mutual information): $I(X; Y) \ge 0$, with I(X; Y) = 0 if and only if X and Y are independent.
Entropy properties proved by Jensen's inequality
Theorem: Uniform PMF maximizes the entropy
$H(X) \le \log |\mathcal{X}|$, with $H(X) = \log |\mathcal{X}|$ if and only if $p(x) = 1/|\mathcal{X}|$ (uniform).
Theorem: Conditioning reduces entropy
$H(X|Y) \le H(X)$
Theorem: Independence bound on entropy
$H(X_1, X_2, \ldots, X_n) \le \sum_i H(X_i)$, with equality if and only if the $X_i$ are mutually independent.
Information inequality
Theorem: Let p(x), q(x), $x \in \mathcal{X}$, be two p.m.f.s. Then $D(p(x)\|q(x)) \ge 0$, with $D(p(x)\|q(x)) = 0$ if and only if p(x) = q(x).
Proof: Let $A = \{x : p(x) > 0\}$ be the support set of p(x). Then
$-D(p(x)\|q(x)) = -\sum_{x \in A} p(x) \log \frac{p(x)}{q(x)} = \sum_{x \in A} p(x) \log \frac{q(x)}{p(x)} \le \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)}$ (by Jensen's inequality)
$= \log \sum_{x \in A} q(x) \le \log \sum_{x \in \mathcal{X}} q(x) = \log 1 = 0$,
hence $D(p(x)\|q(x)) \ge 0$.
Note on information inequality
$\sum_{x \in A} p(x) \log \frac{q(x)}{p(x)} \le \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)}$
Note that $y = \log(t)$ is a strictly concave function of t.
Consider the simple case n = 2. Let $\lambda_1 = p(x_1)$, $\lambda_2 = p(x_2)$, $t_1 = \frac{q(x_1)}{p(x_1)}$, $t_2 = \frac{q(x_2)}{p(x_2)}$. Then
$\lambda_1 \log t_1 + \lambda_2 \log t_2 \le \log(\lambda_1 t_1 + \lambda_2 t_2)$,
i.e., $p(x_1) \log \frac{q(x_1)}{p(x_1)} + p(x_2) \log \frac{q(x_2)}{p(x_2)} \le \log\left(p(x_1) \frac{q(x_1)}{p(x_1)} + p(x_2) \frac{q(x_2)}{p(x_2)}\right)$.
Corollary: non-negativity of mutual information
$I(X; Y) \ge 0$, with I(X; Y) = 0 if and only if X and Y are independent.
Proof: $I(X; Y) = D(p(x, y)\|p(x)p(y)) \ge 0$,
with equality if and only if p(x, y) = p(x)p(y), i.e., X and Y are independent.
A Binary Source: Entropy
Consider a binary source $\begin{pmatrix} X \\ p(X) \end{pmatrix} = \begin{pmatrix} x_1 & x_2 \\ p & 1 - p \end{pmatrix}$.
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = -p \log_2 p - (1 - p) \log_2(1 - p)$
When p = 0.5, H(X) = 1 bit; $H(X) \le \log_2 |\mathcal{X}|$.
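A minimal Python sketch of the binary entropy function (our own helper); tabulating it over a few values shows the maximum of 1 bit at p = 0.5:

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2(1-p), with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
    print(p, binary_entropy(p))   # peaks at 1 bit when p = 0.5
```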
Theorem: uniform PMF maximizes the entropy
$H(X) \le \log |\mathcal{X}|$;
$H(X) = \log |\mathcal{X}|$ if and only if $p(x) = u(x) = \frac{1}{|\mathcal{X}|}$.
Proof: Let $u(x) = \frac{1}{|\mathcal{X}|}$ be the uniform p.m.f. over $\mathcal{X}$, and let p(x) be the p.m.f. of r.v. X. Then
$D(p(x)\|u(x)) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{u(x)} = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{u(x)} - \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} = \sum_{x \in \mathcal{X}} p(x) \log |\mathcal{X}| - H(X) = \log |\mathcal{X}| - H(X)$
Hence, by the non-negativity of relative entropy,
$0 \le D(p(x)\|u(x)) = \log |\mathcal{X}| - H(X)$.
Theorem: conditioning reduces entropy
$H(X|Y) \le H(X)$
Proof:
$0 \le I(X; Y) = H(X) - H(X|Y)$.
Comments: knowing another r.v. Y can only reduce the uncertainty in X. This is true only on the average.
Theorem: independence bound on entropy
$H(X_1, X_2, \ldots, X_n) \le \sum_i H(X_i)$,
with $H(X_1, X_2, \ldots, X_n) = \sum_i H(X_i)$ if and only if the $X_i$ are mutually independent.
Proof: by the chain rule for entropy and the fact that conditioning reduces entropy,
$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, X_{i-2}, \ldots, X_1) \le \sum_{i=1}^{n} H(X_i)$
Log sum inequality: theorem
For non-negative numbers $a_i$ and $b_i$ ($i = 1, 2, \ldots, n$),
$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}$,
with equality if and only if $\frac{a_i}{b_i}$ is constant.
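A quick numeric check of the log sum inequality in Python, with arbitrarily chosen positive numbers; the second pair has $a_i/b_i$ constant, so the two sides coincide:

```python
import math

def lhs(a, b):
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def rhs(a, b):
    A, B = sum(a), sum(b)
    return A * math.log2(A / B)

print(lhs([1, 2, 3], [2, 1, 1]), rhs([1, 2, 3], [2, 1, 1]))  # ~5.75 >= ~3.51
print(lhs([1, 2, 3], [2, 4, 6]), rhs([1, 2, 3], [2, 4, 6]))  # both -6.0 (a_i/b_i constant)
```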
Log sum inequality: proof
Proof (a brief sketch):
Assume $a_i$ and $b_i$ are positive.
Construct $f(t) = t \log t$; this function is strictly convex for all positive t.
Set $\alpha_i = \frac{b_i}{\sum_j b_j}$ and $t_i = \frac{a_i}{b_i}$.
By Jensen's inequality, $\sum_i \alpha_i f(t_i) \ge f\left(\sum_i \alpha_i t_i\right)$.
Then we obtain the log sum inequality.
Log sum inequality: elaboration
$a_i \ge 0$ and $b_i \ge 0$. Conventions: $0 \log 0 = 0$, $a \log \frac{a}{0} = \infty$ if $a > 0$, and $0 \log \frac{0}{0} = 0$.
Construct $f(t) = t \log t$. The function $f(t) = t \log t$ is strictly convex, since $f''(t) = \frac{1}{t} > 0$ for all positive t.
By Jensen's inequality, $\sum_i \alpha_i f(t_i) \ge f\left(\sum_i \alpha_i t_i\right)$ for $\alpha_i \ge 0$ with $\sum_i \alpha_i = 1$.
Set $\alpha_i = \frac{b_i}{\sum_{j=1}^{n} b_j}$ and $t_i = \frac{a_i}{b_i}$.
$f(t_i) = \frac{a_i}{b_i} \log \frac{a_i}{b_i}$
$\sum_{i=1}^{n} \alpha_i f(t_i) = \sum_{i=1}^{n} \frac{b_i}{\sum_{j=1}^{n} b_j} \cdot \frac{a_i}{b_i} \log \frac{a_i}{b_i} = \sum_{i=1}^{n} \frac{a_i}{\sum_{j=1}^{n} b_j} \log \frac{a_i}{b_i}$
Log sum inequality: elaboration
$\sum_{i=1}^{n} \alpha_i f(t_i) = \sum_{i=1}^{n} \frac{a_i}{\sum_{j=1}^{n} b_j} \log \frac{a_i}{b_i}$
$f\left(\sum_{i=1}^{n} \alpha_i t_i\right) = \left(\sum_{i=1}^{n} \alpha_i t_i\right) \log \left(\sum_{i=1}^{n} \alpha_i t_i\right) = \left(\sum_{i=1}^{n} \frac{b_i}{\sum_{j=1}^{n} b_j} \cdot \frac{a_i}{b_i}\right) \log \left(\sum_{i=1}^{n} \frac{b_i}{\sum_{j=1}^{n} b_j} \cdot \frac{a_i}{b_i}\right) = \frac{\sum_{i=1}^{n} a_i}{\sum_{j=1}^{n} b_j} \log \frac{\sum_{i=1}^{n} a_i}{\sum_{j=1}^{n} b_j}$
Log sum inequality: elaboration
By Jensen's inequality, $\sum_i \alpha_i f(t_i) \ge f\left(\sum_i \alpha_i t_i\right)$:
$\sum_{i=1}^{n} \frac{a_i}{\sum_{j=1}^{n} b_j} \log \frac{a_i}{b_i} \ge \frac{\sum_{i=1}^{n} a_i}{\sum_{j=1}^{n} b_j} \log \frac{\sum_{i=1}^{n} a_i}{\sum_{j=1}^{n} b_j}$
$\Rightarrow\ \sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{j=1}^{n} b_j}$
Log sum inequality: applications
Theorem: convexity of relative entropy
D(p||q) is convex in the pair (p, q):
$D[\lambda p_1 + (1 - \lambda) p_2 \,\|\, \lambda q_1 + (1 - \lambda) q_2] \le \lambda D(p_1\|q_1) + (1 - \lambda) D(p_2\|q_2)$
Corollary: convexity of mutual information
Theorem: concavity of entropy
H(p) is a concave function of p.
Data processing inequality
Markov chain: random variables X, Y, Z form a Markov chain (X → Y → Z) if
p(x, y, z) = p(x) p(y|x) p(z|y).
Note that by the chain rule,
p(x, y, z) = p(x) p(y, z|x) = p(x) p(y|x) p(z|y, x).
Consequence: Markovity implies conditional independence, because
$p(x, z|y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x, y) p(z|y)}{p(y)} = p(x|y) p(z|y)$.
Data processing inequality: theorem
If X → Y → Z, then
$I(X; Y) \ge I(X; Z)$.
Proof:
I(X; Y, Z) = I(X; Z) + I(X; Y|Z) = I(X; Y) + I(X; Z|Y),
with I(X; Z|Y) = 0 and $I(X; Y|Z) \ge 0$.
Thus, we have $I(X; Y) \ge I(X; Z)$.
Comment: manipulation of the data cannot increase its information.
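The inequality can be checked numerically; the Python sketch below builds a hypothetical binary Markov chain X → Y → Z (all numbers invented for illustration), composes the two channels to get p(z|x), and compares I(X;Y) with I(X;Z):

```python
import math

px = [0.3, 0.7]
py_x = [[0.9, 0.1], [0.2, 0.8]]   # p(y|x), rows indexed by x
pz_y = [[0.8, 0.2], [0.3, 0.7]]   # p(z|y), rows indexed by y

def mutual_info(pa, pb_a):
    """I(A;B) from the input p.m.f. p(a) and the channel p(b|a)."""
    nb = len(pb_a[0])
    pb = [sum(pa[a] * pb_a[a][b] for a in range(len(pa))) for b in range(nb)]
    return sum(pa[a] * pb_a[a][b] * math.log2(pb_a[a][b] / pb[b])
               for a in range(len(pa)) for b in range(nb) if pb_a[a][b] > 0)

# p(z|x) obtained by composing the two channels
pz_x = [[sum(py_x[x][y] * pz_y[y][z] for y in range(2)) for z in range(2)] for x in range(2)]

print(mutual_info(px, py_x), mutual_info(px, pz_x))  # I(X;Y) ~0.33 >= I(X;Z) ~0.08
```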
Summary: model
Single outcome or outcome sequence
Continuous or Discrete
Summary: basic system of the simplest discrete source
Notations
Sample space: $\mathcal{X}$
Random variable (r.v.): X
Outcome or realization of X: x
Cardinality of the set $\mathcal{X}$ (the number of elements): $|\mathcal{X}|$
Probability mass function (p.m.f.)
$P(x) = \Pr[X = x]$, $x \in \mathcal{X}$
$P(x, y) = \Pr[X = x, Y = y]$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$
Summary: concept
Self-information I(x)
$I(x) = -\log p(x) = \log \frac{1}{p(x)}$
Measure of uncertainty of single outcome
Non-negative
Summary: concept
Entropy H(X)
$H(X) = -E[\log p(x)] = E\left[\log \frac{1}{p(x)}\right]$
$H(X) = E_X[I(x)]$
Measure of uncertainty of information source
Non-negative
Summary: concept
Joint entropy H(X,Y)
Conditional entropy H(X|Y) or H(Y|X)
Chain rule: H(X, Y) = H(X) + H(Y|X)
Summary: concept
Self-Information I(x)
Measure of the uncertainty of a single outcome; non-negative
Entropy H(X)
$H(X) = E_X[I(x)]$; measure of the uncertainty of an information source; non-negative
Relative entropy D(p(x)||q(x))
Measure of the divergence between two distributions; non-negative
Mutual information I(X; Y)
$I(X; Y) = D[p(x, y)\|p(x)p(y)] = E_{X,Y}[I(x; y)]$; measure of the divergence between the joint and the product p.m.f.s; a special case of relative entropy (non-negative)
Summary: entropy properties
Non-negativity: $H(X) \ge 0$
Chain rule: H(X, Y) = H(X) + H(Y|X)
Uniform p.m.f. maximization: $H(X) \le \log |\mathcal{X}|$
Conditional reduction: $H(X|Y) \le H(X)$
Independence bound: $H(X_1, X_2, \ldots, X_n) \le \sum_i H(X_i)$
Concavity: $H(\lambda p_1(x) + (1 - \lambda) p_2(x)) \ge \lambda H(p_1(x)) + (1 - \lambda) H(p_2(x))$
Having a maximum in a given range
Summary: mutual information properties
Non-negativity: $I(X; Y) \ge 0$
Maximum: $I(X; Y) = H(X) + H(Y) - H(X, Y) \le H(X)$
Symmetry: I(X; Y) = I(Y; X)
Convexity: $D[\lambda p_1 + (1 - \lambda) p_2 \,\|\, \lambda q_1 + (1 - \lambda) q_2] \le \lambda D(p_1\|q_1) + (1 - \lambda) D(p_2\|q_2)$
Having a minimum in a given range
Summary: entropy
H(X): $H(X) \ge H(X|Y)$; H(X) = H(X|Y) + I(X; Y); H(X) = H(X, Y) - H(Y|X)
H(Y): $H(Y) \ge H(Y|X)$; H(Y) = H(Y|X) + I(X; Y); H(Y) = H(X, Y) - H(X|Y)
Summary: conditional entropy
H(X|Y): H(X|Y) = H(X, Y) - H(Y); H(X|Y) = H(X) - I(X; Y)
H(Y|X): H(Y|X) = H(X, Y) - H(X); H(Y|X) = H(Y) - I(X; Y)
Summary: joint entropy and mutual information
H(X, Y): H(X, Y) = H(X) + H(Y|X); H(X, Y) = H(Y) + H(X|Y); H(X, Y) = H(X) + H(Y) - I(X; Y); H(X, Y) = H(X|Y) + H(Y|X) + I(X; Y)
I(X; Y): I(X; Y) = H(X) - H(X|Y); I(X; Y) = H(Y) - H(Y|X); I(X; Y) = H(X, Y) - H(Y|X) - H(X|Y); I(X; Y) = H(X) + H(Y) - H(X, Y)
Summary
Entropy
Joint and conditional entropy
Relative entropy and mutual information
Chain rules
Jensen's inequality
Log sum inequality
Data processing inequality
Reference
T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley, 2006.