ee515a information theory ii spring 2012 -...
TRANSCRIPT
EE515A – Information Theory II, Spring 2012
Prof. Jeff Bilmes
University of Washington, Seattle, Department of Electrical Engineering
Spring Quarter, 2012
http://j.ee.washington.edu/~bilmes/classes/ee515a_spring_2012/
Lecture 19 - March 27th, 2012
Outstanding Reading
Read all chapters assigned from IT-I (EE514, Winter 2012).
Read Chapter 8 in the book (but what book, you might ask? You'll soon see if you don't know).
Information Theory I and II
This two-quarter course will be a thorough introduction to information theory.
Information Theory I (EE514, Winter 2012): entropy, mutual information, asymptotic equipartition properties, data compression to the entropy limit (source coding theorem), Huffman, communication at the channel capacity limit (channel coding theorem), method of types, arithmetic coding, Fano codes.
Information Theory II (EE515, Spring 2012): Lempel-Ziv, convolutional codes, differential entropy, maximum entropy, ECC, turbo, LDPC and other codes, Kolmogorov complexity, spectral estimation, rate-distortion theory, alternating minimization for computation of the RD curve and channel capacity, more on the Gaussian channel, network information theory, information geometry, and some recent results on the use of polymatroids in information theory.
Additional topics throughout will include information theory as it is applicable to pattern recognition, natural language processing, computer science and complexity, biological science, and communications.
Course Web Pages
our web page (http://j.ee.washington.edu/~bilmes/classes/ee515a_spring_2012/)
our dropbox (https://catalyst.uw.edu/collectit/dropbox/bilmes/21171), which is where all homework will be due, electronically, in PDF format. No paper homework accepted.
our discussion board (https://catalyst.uw.edu/gopost/board/bilmes/27386/), which is where you can ask questions. Please use this rather than email so that all can benefit from answers to your questions.
Prerequisites
basic probability and statistics
random processes (e.g., EE505 or a Stat 5xx class).
Knowledge of matlab.
EE514A - Information Theory I (or equivalent).
The course is open to students in all UW departments.
Homework
There will be about 4-5 homeworks this quarter.
You will have approximately 1 to 1.5 weeks to solve the problem set.
Problem sets might also include matlab exercises, so you will need to have access to matlab; please let me know if you currently do not.
The problem sets that are longer will take longer to do, so please do not wait until the night before they are due to start them.
Homework Grading Strategy
We will do swap grading. I.e., you will turn in your homeworks, and I'll send them out randomly for others to grade.
You will receive a grade on your homework from the person who is grading it.
You will also assign a grade to your grader, for the quality of the grading (but not the grade) given to you.
This means, for each HW you will get two grades: one for your HW, and one for the quality of grading you do.
No collusion!
No grading your own HW.
Exams
No exams this quarter.
Final presentations: Monday, June 4th, in the late afternoon/evening (currently scheduled for 8:30am, but that is too early).
Grading
Final grades will be based on a combination of the homeworks (60%) and the final presentations (40%).
If you are active in the class (via attendance, asking good questions,etc.), this may boost your final grade.
On Final Presentations
Your task is to give a 15-20 minute presentation that summarizes 2-3 related and significant papers that come from IEEE Transactions on Information Theory.
The papers must not be ones that we covered in class, although they can be related.
You need to do the research to find the papers yourself (i.e., that is part of the assignment).
The papers must have been published in the last 10 years (so no old or classic papers).
Your grade will be based on how clear, understandable, and accurate your presentation is.
This is a real challenge and will require significant work! Many of the papers are complex. To get a good grade, you will need to work to present complex ideas in a simple way.
Our Main Text
“Elements of Information Theory” by Thomas Cover and Joy Thomas, 2nd edition (2006).
It should be available at the UW bookstore, or you can get it, say, via amazon.com.
Reading assignment: Read Chapter 8.
Other Relevant Texts
“A First Course in Information Theory”, by Raymond W. Yeung, 2002. Very good chapters on information measures and network information theory.
“Elements of Information Theory”, 1991, Thomas M. Cover, Joy A. Thomas, Q360.C68 (first edition of our book).
“Information Theory and Reliable Communication”, 1968, Robert G. Gallager, Q360.G3 (classic text by a foundational researcher).
“Information Theory and Statistics”, 1968, Solomon Kullback, Math QA276.K8.
“Information Theory: Coding Theorems for Discrete Memoryless Systems”, 1981, Imre Csiszár and János Körner, Q360.C75 (another key book, but a little harder to read).
“Information Theory, Inference, and Learning Algorithms”, David J.C. MacKay, 2003.
Other Relevant Texts
“The Theory of Information and Coding”, 2nd edition, Robert McEliece, Cambridge, April 2002.
“An Introduction to Information Theory: Symbols, Signals, and Noise”, John R. Pierce, Dover, 1980.
“Information Theory”, Robert Ash, Dover, 1965.
“An Introduction to Information Theory”, Fazlollah M. Reza, Dover, 1991.
“Mathematical Foundations of Information Theory”, A. Khinchin, Dover, 1957 (best brief summary of key results).
Other Relevant Texts
“Convex Optimization”, Boyd and Vandenberghe
“Probability & Measure”, Billingsley
“Probability with Martingales”, Williams
“Probability, Theory & Examples”, Durrett
“Probability and Random Processes”, Grimmett and Stirzaker (great book on stochastic processes).
On Our Lecture Slides
This is the first time the class is using these lecture slides.
Slides will (hopefully) be available by the early morning before lecture.
Slides are available on the web page, as well as a printable version.
Updated slides with typo and bug fixes will be posted, as well as (buggy) ones with hand annotations/corrections/fill-ins.
Annotations are (currently) being made with Adobe Acrobat, but the iPhone/iPad PDF previewer has trouble viewing them.
Class Road Map - IT-II
L1 (3/28): Overview, Diff Entropy
L2 (3/30):
L3 (4/4):
L4 (4/6): No class.
L5 (4/11):
L6 (4/13):
L7 (4/18):
L8 (4/20): No class.
L9 (4/25):
L10 (4/27):
L11 (5/2):
L12 (5/4):
L13 (5/9):
L14 (5/11):
L15 (5/16):
L16 (5/18):
L17 (5/23):
L18 (5/25):
L19 (5/30):
L20 (6/1):
Final Presentations: June 4th, 2012.
Shannon’s model of communication
[Figure: Shannon's general model of communication: source → source encoder → channel encoder → channel (with noise) → channel decoder → source decoder → receiver.]
This has been our guiding principle so far, and will continue to guide us for a while (a key exception is network information theory, which we will cover, where the above picture may become an arbitrary directed graph with multiple sources/receivers).
So far, the channel has been discrete, but real-world channels are not discrete; we thus need differential entropy.
Sources and Channels
Source | Channel
voice | telephone line
text, words | disk
pictures | disk
pictures | Ethernet or internet
music | mp3 file
human cells | mitosis
population of living organisms | life
So the channel can be a typical communication link (communication over space).
The channel can also be storage (communication over time), or generations (biology).
There are many applications of IT, not just communication (e.g., Kolmogorov complexity).
This is IT-II
This is lecture 19. You know that the source compresses down to the entropy H, but no further.
[Plot: error exponent E(R) (log Pe) versus rate R, as R → H.]
You also know that the signal may be sent through the channel at a rate no more than C.
[Plot: error exponent E(R) (log Pe) versus rate R, as R → C.]
This is IT-II
What if we want to compress R < H or transmit R > C? ⇒ Error.
But are all errors created equal? Are all errors as bad as each other?
We can measure errors with a distortion function, and we have generalizations of the previously stated results.
Rate-distortion curves with achievable region:
[Plot: rate R versus distortion; the curve starts at rate H at zero distortion and separates the achievable region from the unachievable region.]
No equal rights for equal errors.
This is IT-II
Continuous Entropy
Differential Entropy
Gaussian Channel
Shannon Capacity for the Gaussian Channel
Information Measures
Generalized inequalities for entropy functions
Negative discrete information measures
The class of entropy functions
More on Compression
Lempel-Ziv compression
Kolmogorov complexity (algorithmic compression)
Philosophy
This is IT-II
Coding with errors
Rate distortion theory
Alternating minimization, Blahut-Arimoto algorithm, EM algorithm,variational EM, information geometry.
Network information theory
multi-port communication
Slepian-Wolf
Multiple sensors/receivers, and models for this.
Polymatroidal theory and network information theory
Codes on Graphs
Convolutional coding
Turbo codes
LDPC codes
inference and LBP
This is IT-II
And more applications in
Communications
Pattern recognition and machine learning
Computer Science
Biology
Philosophy (and poetry).
Social science
Entropy
$H(X) = -\sum_x p(x) \log p(x)$
All entropic quantities we’ve encountered in IT-I have been discrete.
The world is continuous: channels are continuous, noise is continuous.
We need a theory of compression, entropy, and channel capacity that applies to such continuous domains.
We explore this next.
Continuous/Differential Entropy
Let $X$ now be a continuous r.v. with cumulative distribution
$$F(x) = \Pr(X \le x) \qquad (1)$$
and $f(x) = \frac{d}{dx}F(x)$ is the density function.
Let $S = \{x : f(x) > 0\}$ be the support set. Then
Definition 4.1 (differential entropy h(X))
$$h(X) = -\int_S f(x)\log f(x)\, dx \qquad (2)$$
Since we integrate over only the support set, no worries about log 0.
Perhaps it is best to do some examples.
Continuous Entropy Of Uniform Distribution
Here, $X \sim U[0, a]$ with $a \in \mathbb{R}_{++}$.
Then
$$h(X) = -\int_0^a \frac{1}{a}\log\frac{1}{a}\, dx = -\log\frac{1}{a} = \log a \qquad (3)$$
Note: continuous entropy can be either positive or negative.
How can entropy (which we know to mean "uncertainty", or "information") be negative?
In fact, entropy (as we've seen perhaps once or twice) can be interpreted as the exponent of the "volume" of a typical set.
Example: $2^{H(X)}$ is the number of things that happen, on average, and we can have $2^{H(X)} \le |\mathcal{X}|$. Consider a uniform r.v. $Y$ such that $2^{H(X)} = |\mathcal{Y}|$. Thus having a negative exponent just means the volume is small.
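A quick numerical illustration of eq. (3), as a minimal Python sketch (assuming only numpy): $h(U[0,a]) = \log a$, which is negative exactly when the support length satisfies $a < 1$, and $2^{h}$ recovers that length.

```python
import numpy as np

# Differential entropy of X ~ U[0, a]: h(X) = -log(1/a) = log a  (eq. (3)).
def uniform_entropy_bits(a):
    return np.log2(a)

for a in [4.0, 1.0, 0.25]:
    h = uniform_entropy_bits(a)
    # 2**h recovers the support length a (the "volume"), so a negative h
    # simply means the support is shorter than 1.
    print(f"a = {a:5.2f}   h(X) = {h:+.2f} bits   2**h = {2**h:.2f}")
```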
Continuous Entropy Of Normal Distribution
Normal (Gaussian) distributions are very important. We have:
$$X \sim N(0, \sigma^2) \;\Leftrightarrow\; f(x) = \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-\frac{1}{2}x^2/\sigma^2} \qquad (4)$$
Let's compute the entropy of $f$ in nats.
$$h(X) = -\int f \ln f = -\int f(x)\left[-\frac{x^2}{2\sigma^2} - \ln\sqrt{2\pi\sigma^2}\right] dx \qquad (5)$$
$$= \frac{E X^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2) = \frac{1}{2} + \frac{1}{2}\ln(2\pi\sigma^2) \qquad (6)$$
$$= \frac{1}{2}\ln e + \frac{1}{2}\ln(2\pi\sigma^2) = \frac{1}{2}\ln(2\pi e\sigma^2)\ \text{nats} \times \left(\frac{1}{\ln 2}\ \text{bits/nat}\right) = \frac{1}{2}\log(2\pi e\sigma^2)\ \text{bits} \qquad (7)$$
Note: this is only a function of the variance $\sigma^2$, not the mean. Why? So the entropy of a Gaussian is monotonically related to the variance.
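A minimal Python sketch (assuming numpy) that checks eq. (7) against a Monte Carlo estimate of $E[-\log f(X)]$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0

# Closed form from eq. (7): h(X) = (1/2) log2(2*pi*e*sigma^2) bits.
h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

# Monte Carlo estimate of h(X) = E[-ln f(X)] / ln 2 using samples from N(0, sigma^2).
x = rng.normal(0.0, sigma, size=200_000)
ln_f = -0.5 * x**2 / sigma**2 - 0.5 * np.log(2 * np.pi * sigma**2)
h_mc = -np.mean(ln_f) / np.log(2)   # convert nats to bits

print(f"closed form : {h_closed:.4f} bits")
print(f"Monte Carlo : {h_mc:.4f} bits")
```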
AEP lives
We even have our own AEP in the continuous case, but before that, a bit more intuition.
In the discrete case, we have $\Pr(x_1, x_2, \ldots, x_n) \approx 2^{-nH(X)}$ for big $n$,
and $|A_\epsilon^{(n)}| = 2^{nH} = (2^H)^n$.
Thus, $2^H$ can be seen like a "side length" of an $n$-dimensional hypercube, and $2^{nH}$ is like the volume of this hypercube (or volume of the typical set).
So $H$ being negative could mean a small side length (small $2^H$, but still positive).
AEP
Things are similar for the continuous case. Indeed
Theorem 4.2
Let $X_1, X_2, \ldots, X_n$ be a sequence of r.v.'s, i.i.d. $\sim f(x)$. Then
$$-\frac{1}{n}\log f(X_1, X_2, \ldots, X_n) \to E[-\log f(X)] = h(X) \qquad (8)$$
This follows via the weak law of large numbers (WLLN), just like in the discrete case.
Definition 4.3
$$A_\epsilon^{(n)} = \left\{ x_{1:n} \in S^n : \left| -\tfrac{1}{n}\log f(x_1, \ldots, x_n) - h(X) \right| \le \epsilon \right\}$$
Note: $f(x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i)$.
Thus, for $x_{1:n} \in A_\epsilon^{(n)}$, we have upper/lower bounds on the probability density
$$2^{-n(h+\epsilon)} \le f(x_{1:n}) \le 2^{-n(h-\epsilon)} \qquad (9)$$
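The convergence in eq. (8) is easy to see numerically; here is a minimal Python sketch (assuming numpy) for i.i.d. $N(0, \sigma^2)$ samples:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.5
h = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # h(X) in bits

def normalized_neg_log_density(x, sigma):
    # -(1/n) log2 f(x_1, ..., x_n) for i.i.d. N(0, sigma^2): the left-hand side of eq. (8).
    ln_f = -0.5 * x**2 / sigma**2 - 0.5 * np.log(2 * np.pi * sigma**2)
    return -np.sum(ln_f) / (len(x) * np.log(2))

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.normal(0.0, sigma, size=n)
    val = normalized_neg_log_density(x, sigma)
    print(f"n = {n:>9}   -(1/n) log f(x_1:n) = {val:.4f}   (h(X) = {h:.4f})")
```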
AEP
The volume of $A \subseteq \mathbb{R}^n$ is well defined as:
$$\operatorname{Vol}(A) = \int_A dx_1\, dx_2 \ldots dx_n \qquad (10)$$
then we have
Theorem 4.4
1. $\Pr(A_\epsilon^{(n)}) > 1 - \epsilon$ for $n$ sufficiently large
2. $(1-\epsilon)\, 2^{n(h(X)-\epsilon)} \le \operatorname{Vol}(A_\epsilon^{(n)}) \le 2^{n(h(X)+\epsilon)}$
Note this is a bound on the volume $\operatorname{Vol}(A_\epsilon^{(n)})$ of the typical set.
In the discrete AEP, we bound the cardinality of the typical set $|A_\epsilon^{(n)}|$, and never need entropy to be negative: $H(X) \ge 0$ suffices to limit $|A_\epsilon^{(n)}|$ down to its lowest sensible value, namely 1.
AEP
Proof of Theorem 4.4.
1: First,
$$\Pr(A_\epsilon^{(n)}) = \int_{x_{1:n} \in A_\epsilon^{(n)}} f(x_1, \ldots, x_n)\, dx_1 \ldots dx_n \qquad (11)$$
$$= \Pr\left( \left| -\tfrac{1}{n}\log f(X_1, X_2, \ldots, X_n) - h(X) \right| \le \epsilon \right) \ge 1 - \epsilon \qquad (12)$$
for big enough $n$, which follows from the WLLN.
2: Next, we have
$$1 = \int_{S^n} f(x_1, \ldots, x_n)\, dx_1 \ldots dx_n \ge \int_{A_\epsilon^{(n)}} f(x_1, \ldots, x_n)\, dx_1 \ldots dx_n \qquad (13)$$
$$\ge \int_{A_\epsilon^{(n)}} 2^{-n(h(X)+\epsilon)}\, dx_{1:n} = 2^{-n(h(X)+\epsilon)} \operatorname{Vol}(A_\epsilon^{(n)}) \qquad (14)$$
$$\Rightarrow \operatorname{Vol}(A_\epsilon^{(n)}) \le 2^{n(h(X)+\epsilon)}. \quad \ldots$$
AEP
Proof of Theorem 4.4, cont.
Similarly,
$$1 - \epsilon \le \Pr(A_\epsilon^{(n)}) = \int_{A_\epsilon^{(n)}} f(x_{1:n})\, dx_{1:n} \qquad (15)$$
$$\le \int_{A_\epsilon^{(n)}} 2^{-n(h(X)-\epsilon)}\, dx_{1:n} = 2^{-n(h(X)-\epsilon)} \operatorname{Vol}(A_\epsilon^{(n)}) \qquad (16)$$
$$\Rightarrow \operatorname{Vol}(A_\epsilon^{(n)}) \ge (1-\epsilon)\, 2^{n(h(X)-\epsilon)}.$$
Like in the discrete case, $A_\epsilon^{(n)}$ is the smallest volume that contains, essentially, all of the probability, and that volume is $\approx 2^{nh}$.
If we look at $(2^{nh})^{1/n}$, we get a "side length" of $2^h$.
So, $-\infty < h < \infty$ is a meaningful range for entropy since it is the exponent of the equivalent side length of the $n$-D volume.
Large negative entropy just means small volume.
Differential vs. Discrete Entropy
Let X ∼ f(x), and divide the range of X up into bins of length ∆.
E.g., quantize the range of $X$ using $n$ bits, so that $\Delta = 2^{-n}$.
We can then view this as follows:
[Figure: the density $f(x)$ with the $x$-axis divided into bins of width $\Delta = 2^{-n}$.]
Mean value theorem, i.e., if $f$ is continuous within a bin, $\exists x_i$ such that
$$f(x_i) = \frac{1}{\Delta}\int_{i\Delta}^{(i+1)\Delta} f(x)\, dx \qquad (17)$$
[Figure: the mean-value point $x_i$ in the bin $[i\Delta, (i+1)\Delta]$, with height $f(x_i)$.]
Differential vs. Discrete Entropy
Create a quantized random variable $X^\Delta$ having those values, so that
$$X^\Delta = x_i \quad \text{if } i\Delta \le X < (i+1)\Delta \qquad (18)$$
This gives a discrete distribution
$$\Pr(X^\Delta = x_i) = p_i = \int_{i\Delta}^{(i+1)\Delta} f(x)\, dx = \Delta f(x_i) \qquad (19)$$
and we can calculate the entropy
$$H(X^\Delta) = -\sum_{i=-\infty}^{\infty} p_i \log p_i = -\sum_i f(x_i)\Delta \log\bigl(f(x_i)\Delta\bigr) \qquad (20)$$
$$= -\sum_i \Delta f(x_i)\log f(x_i) - \sum_i f(x_i)\Delta \log\Delta \qquad (21)$$
$$= -\sum_i \Delta f(x_i)\log f(x_i) - \log\Delta \qquad (22)$$
Differential vs. Discrete Entropy
This follows since (as expected)
$$\sum_i \Delta f(x_i) = \Delta \sum_i \frac{1}{\Delta}\int_{i\Delta}^{(i+1)\Delta} f(x)\, dx = \Delta\, \frac{1}{\Delta}\int f(x)\, dx = 1 \qquad (23)$$
Also, as $\Delta \to 0$, we have $-\log\Delta \to \infty$ and (assuming everything is integrable in the Riemann sense)
$$-\sum_i \Delta f(x_i)\log f(x_i) \to -\int f(x)\log f(x)\, dx \qquad (24)$$
So, $H(X^\Delta) + \log\Delta \to h(f)$ as $\Delta \to 0$.
Loosely, $h(f) \approx H(X^\Delta) + \log\Delta$, and for an $n$-bit quantization with $\Delta = 2^{-n}$, we have
$$H(X^\Delta) \approx h(f) - \log\Delta = h(f) + n \qquad (25)$$
This means that as $n \to \infty$, $H(X^\Delta)$ gets larger. Why?
Differential vs. Discrete Entropy
This makes sense. We start with a continuous random variable $X$ and quantize it at $n$-bit accuracy.
For a discrete representation to represent $2^n$ values, we expect the entropy to go up with $n$; as $n$ gets large, so does the entropy, adjusted by $h(X)$.
$H(X^\Delta)$ is the number of bits to describe this $n$-bit, equally spaced quantization of the continuous random variable $X$.
$H(X^\Delta) \approx h(f) + n$ says that it might take either more than $n$ bits or fewer than $n$ bits to describe $X$ at $n$-bit accuracy.
If $X$ is very concentrated ($h(f) < 0$), then fewer bits; if $X$ is very spread out, then more than $n$ bits.
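A minimal Python sketch (assuming numpy and the standard-library erf) that checks $H(X^\Delta) \approx h(f) + n$ by binning a $N(0, 1)$ density at width $\Delta = 2^{-n}$:

```python
import numpy as np
from math import erf

sigma = 1.0
h = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # h(f) in bits (about 2.05 for sigma = 1)

def quantized_entropy_bits(n_bits, sigma, x_max=8.0):
    """Entropy of X^Delta when X ~ N(0, sigma^2) is binned with width Delta = 2**-n_bits."""
    delta = 2.0 ** (-n_bits)
    edges = np.arange(-x_max, x_max + delta, delta)
    cdf = 0.5 * (1.0 + np.vectorize(erf)(edges / (sigma * np.sqrt(2.0))))
    p = np.diff(cdf)          # bin probabilities p_i, as in eq. (19)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

for n in [2, 4, 8, 12]:
    H = quantized_entropy_bits(n, sigma)
    print(f"n = {n:2d}   H(X^Delta) = {H:7.3f} bits   h + n = {h + n:7.3f} bits")
```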
Joint Differential Entropy
As in the discrete case, we have entropy for vectors of r.v.s.
The joint differential entropy is defined as:
$$h(X_1, X_2, \ldots, X_n) = -\int f(x_{1:n})\log f(x_{1:n})\, dx_{1:n} \qquad (26)$$
Conditional differential entropy:
$$h(X|Y) = -\int f(x, y)\log f(x|y)\, dx\, dy = h(X, Y) - h(Y) \qquad (27)$$
Entropy of a Multivariate Gaussian
When $X$ is distributed according to a multivariate Gaussian distribution, i.e.,
$$X \sim \mathcal{N}(\mu, \Sigma), \quad f(x) = \frac{1}{|2\pi\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^\intercal\Sigma^{-1}(x-\mu)} \qquad (28)$$
then the entropy of $X$ has a nice form, in particular
$$h(X) = \frac{1}{2}\log\bigl[(2\pi e)^n|\Sigma|\bigr]\ \text{bits} \qquad (29)$$
Notice that the entropy is monotonically related to the determinant of the covariance matrix $\Sigma$ and is not at all dependent on the mean $\mu$.
The determinant is a form of spread, or dispersion, of the distribution.
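A minimal Python sketch (assuming numpy) of eq. (29), using a log-determinant for numerical stability:

```python
import numpy as np

def multivariate_gaussian_entropy_bits(Sigma):
    """h(X) = (1/2) log2((2*pi*e)^n |Sigma|) for X ~ N(mu, Sigma); independent of mu."""
    n = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)   # natural log of |Sigma|
    assert sign > 0, "Sigma must be positive definite"
    return 0.5 * (n * np.log(2 * np.pi * np.e) + logdet) / np.log(2)

Sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])
print(f"h(X) = {multivariate_gaussian_entropy_bits(Sigma):.4f} bits")
```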
Entropy of a Multivariate Gaussian: Derivation
$$h(X) = -\int f(x)\left[-\frac{1}{2}(x-\mu)^\intercal\Sigma^{-1}(x-\mu) - \ln\bigl((2\pi)^{n/2}|\Sigma|^{1/2}\bigr)\right] dx \qquad (30)$$
$$= \frac{1}{2} E_f\bigl[\operatorname{tr}\,(x-\mu)^\intercal\Sigma^{-1}(x-\mu)\bigr] + \frac{1}{2}\ln\bigl[(2\pi)^n|\Sigma|\bigr] \qquad (31)$$
$$= \frac{1}{2} E_f\bigl[\operatorname{tr}\,(x-\mu)(x-\mu)^\intercal\Sigma^{-1}\bigr] + \frac{1}{2}\ln\bigl[(2\pi)^n|\Sigma|\bigr] \qquad (32)$$
$$= \frac{1}{2}\operatorname{tr}\, E_f\bigl[(x-\mu)(x-\mu)^\intercal\bigr]\Sigma^{-1} + \frac{1}{2}\ln\bigl[(2\pi)^n|\Sigma|\bigr] \qquad (33)$$
$$= \frac{1}{2}\operatorname{tr}\,\Sigma\Sigma^{-1} + \frac{1}{2}\ln\bigl[(2\pi)^n|\Sigma|\bigr] \qquad (34)$$
$$= \frac{1}{2}\operatorname{tr}\, I + \frac{1}{2}\ln\bigl[(2\pi)^n|\Sigma|\bigr] \qquad (35)$$
$$= \frac{n}{2} + \frac{1}{2}\ln\bigl[(2\pi)^n|\Sigma|\bigr] = \frac{1}{2}\ln\bigl[(2\pi e)^n|\Sigma|\bigr]\ \text{nats} \qquad (36)$$
This uses the "trace trick", that $\operatorname{tr}(ABC) = \operatorname{tr}(CAB)$. Converting via $1/\ln 2$ bits/nat gives (29) in bits.
Relative Entropy/KL-Divergence & Mutual Information
The relative entropy (or Kullback-Leibler divergence) for continuous distributions also has a familiar form
$$D(f\|g) = \int f(x)\log\frac{f(x)}{g(x)}\, dx \ge 0 \qquad (37)$$
We can, like in the discrete case, use Jensen's inequality to prove the non-negativity of $D(f\|g)$.
Mutual information:
$$D(f(X, Y)\|f(X)f(Y)) = I(X;Y) = h(X) - h(X|Y) \qquad (38)$$
$$= h(Y) - h(Y|X) \ge 0 \qquad (39)$$
Thus, since $I(X;Y) \ge 0$, we have again that conditioning reduces entropy, i.e., $h(Y) \ge h(Y|X)$.
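A minimal Python sketch (assuming numpy) that evaluates eq. (37) for two univariate Gaussians by numerical integration and compares it to the standard Gaussian KL closed form (the closed form is not derived on these slides):

```python
import numpy as np

mu_f, s_f = 0.0, 1.0      # f = N(mu_f, s_f^2)
mu_g, s_g = 1.0, 2.0      # g = N(mu_g, s_g^2)

def ln_normal_pdf(x, mu, s):
    return -0.5 * (x - mu)**2 / s**2 - 0.5 * np.log(2 * np.pi * s**2)

# Numerical integration of D(f||g) = integral of f(x) ln(f(x)/g(x)) dx, eq. (37), in nats.
x = np.linspace(-20.0, 20.0, 400_001)
f = np.exp(ln_normal_pdf(x, mu_f, s_f))
kl_numeric = np.trapz(f * (ln_normal_pdf(x, mu_f, s_f) - ln_normal_pdf(x, mu_g, s_g)), x)

# Standard closed form for the KL divergence between two univariate Gaussians (in nats).
kl_closed = np.log(s_g / s_f) + (s_f**2 + (mu_f - mu_g)**2) / (2 * s_g**2) - 0.5

print(f"numerical : {kl_numeric:.6f} nats")
print(f"closed    : {kl_closed:.6f} nats")
```

Both values are nonnegative, consistent with $D(f\|g) \ge 0$ in eq. (37).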
Chain rules and more
We still have chain rules:
$$h(X_1, X_2, \ldots, X_n) = \sum_i h(X_i | X_{1:i-1}) \qquad (40)$$
And bounds of the form
$$h(X_1, X_2, \ldots, X_n) \le \sum_i h(X_i) \qquad (41)$$
For discrete entropy, we have monotonicity, i.e., $H(X_1, X_2, \ldots, X_k) \le H(X_1, X_2, \ldots, X_k, X_{k+1})$. More generally,
$$f(A) = H(X_A) \qquad (42)$$
is monotonic non-decreasing in the set $A$ (i.e., $f(A) \le f(B)$, $\forall A \subseteq B$).
Is $f(A) = h(X_A)$ monotonic? No: consider Gaussian entropy with diagonal $\Sigma$ with small diagonal values. So $h(X) = \frac{1}{2}\log\bigl[(2\pi e)^n|\Sigma|\bigr]$ can get smaller with more random variables. A sketch of this appears below.
Similarly, when some variables are independent, adding independent variables with negative entropy can decrease the overall entropy.