Information Theory From a Functional Viewpoint
Jingbo Liu
Ph.D. Defense
December 12, 2017
Department of Electrical Engineering, Princeton University
1/46
Overview of Main Contributions
A functional approach to converses in information theory;
General functional-entropic duality theory needed for extensions of this approach to multiuser settings;
Other applications of the duality theory.
2/46
Information Theory: A Bridge between Information Measures and Operational Problems
Information Measures: entropy, mutual information, common information, relative entropy, hypercontractivity, information spectrum, ...
Operational Problems: data compression, data transmission, randomness generation, hypothesis testing, data hiding, computation complexity, ...
3/46
(Schematic: information measures are linked to two kinds of observables: integrals of functions (via convex duality, data processing, and change of measure) and measures of sets (via data processing, change of measure, and taking indicators or the like).)
4/46
Information Theory in the Non-vanishing Error Regime
Achievability: basically well-known.
The study of second-order asymptotics for multi-terminal problems is at its infancy. ... The primary difficulty is our inability to deal, in a systematic and principled way, with auxiliary random variables for the (strong) converse part. Thus, genuinely new non-asymptotic converses need to be developed...
—Tan, F&T monograph, “Open problems and challenges.”
5/46
Weak Converse; Strong Converse; Second-Order Converse
Weak converse: infeasibility of coding rate for vanishing error probability;
Strong converse: infeasibility of coding rate for any non-vanishing error probability in (0, 1);
Second-order converse: behavior of the second-order term in the coding rate when the error probability is non-vanishing (typically O(√n)).
6/46
Comparison of Strong Converse Techniques
Information spectrum (meta-converse): goes beyond |Y| < ∞ and stationary memoryless; optimal second-order term; extends to selected multiuser problems.
Method of types (sphere-packing bound): does not go beyond |Y| < ∞ and stationary memoryless; optimal second-order term; extends to selected multiuser problems.
Blowing-up (image-size characterization): does not go beyond |Y| < ∞ and stationary memoryless; second-order term O_ε(√n log^{3/2} n); extends to all source-channel networks with known first-order region.
7/46
Example: Channel Coding
Figure: Shannon, “A Mathematical Theory of Communication,” 1948.
8/46
Channel Coding: Formulation
Notation: P(f) := ∫ f dP.
Given:
Random transformation P_{Y|X};
Message W equiprobably selected from {1, . . . , M}.
Goal: find
Codebook {c_1, . . . , c_M};
Decoding functions f_m : Y → [0, 1], m = 1, . . . , M
to minimize:
Average error: 1 − (1/M) Σ_{m=1}^{M} P_{Y|X=c_m}(f_m), or
Max error: 1 − min_{1≤m≤M} P_{Y|X=c_m}(f_m).
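As a concrete, purely illustrative instance of this formulation, the following sketch computes the average and maximal error probabilities of a hypothetical two-codeword repetition code over a BSC(0.1); the channel, blocklength, codebook, and decoder are assumptions for illustration, not part of the talk.

```python
import itertools
import numpy as np

# Illustrative setting: BSC(0.1), blocklength n = 3, M = 2 codewords (repetition code).
p, n = 0.1, 3
codebook = [(0, 0, 0), (1, 1, 1)]                      # c_1, c_2

def p_y_given_x(y, x):
    """P_{Y^n|X^n=x}(y) for a memoryless binary symmetric channel."""
    flips = sum(yi != xi for yi, xi in zip(y, x))
    return p ** flips * (1 - p) ** (n - flips)

def decode(y):
    """Deterministic decoder f_m = 1_{D_m}: majority vote."""
    return 0 if sum(y) <= 1 else 1

# P_{Y|X=c_m}(f_m): probability that the channel output falls in decoding set D_m.
correct = [
    sum(p_y_given_x(y, c) for y in itertools.product((0, 1), repeat=n) if decode(y) == m)
    for m, c in enumerate(codebook)
]
avg_error = 1 - np.mean(correct)          # 1 - (1/M) * sum_m P_{Y|X=c_m}(f_m)
max_error = 1 - min(correct)              # 1 - min_m   P_{Y|X=c_m}(f_m)
print(avg_error, max_error)
```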
9/46
Previous Converse Method 1: Fano's Inequality
Real distribution: P_{Y|X} P_{XW}.
If under P_Y P_{XW}, the error probability would be 1 − 1/M.
sup_{P_X} I(P_X, P_{Y|X}) = sup_{P_X} D(P_{XY} ‖ P_X P_Y)
  ≥ d(P_e ‖ 1 − 1/M)        (data processing)
  ≥ (1 − P_e) log M − 1.
(Schematic: information measures → integrals of functions → measures of sets, via data processing.)
10/46
Previous Converse Method 2: Information Spectrum (Change of Measure)
Real distribution: P_{Y|X} P_{XW}.
If under P_Y P_{XW}, the correct probability would be 1/M.
1 − P_e − P_{XY}[ dP_{XY}/d(P_X × P_Y) > M/γ ]
  ≤ P_{XY}[ no error ∧ dP_{XY}/d(P_X × P_Y) ≤ M/γ ]
  ≤ (M/γ) (P_X × P_Y)[no error]        (change of measure)
  = (M/γ) · (1/M),  ∀γ > 0.
Hence
P_e ≥ P_{XY}[ dP_{XY}/d(P_X × P_Y) ≤ M/γ ] − 1/γ.
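A minimal numerical sanity check of this bound, on an assumed toy setting (BSC(0.1), n = 3, two equiprobable codewords); the parameters are illustrative only, and the point is simply that the right-hand side never exceeds the true error probability (the bound is weak at such a tiny blocklength).

```python
import itertools
import numpy as np

# Illustrative toy setting: BSC(0.1), n = 3, M = 2 equiprobable codewords.
p, n, M = 0.1, 3, 2
codebook = [(0, 0, 0), (1, 1, 1)]
ys = list(itertools.product((0, 1), repeat=n))

def p_y_given_x(y, x):
    flips = sum(yi != xi for yi, xi in zip(y, x))
    return p ** flips * (1 - p) ** (n - flips)

# Output distribution P_{Y^n} induced by the equiprobable codebook.
p_y = {y: sum(p_y_given_x(y, c) for c in codebook) / M for y in ys}

def spectrum_bound(gamma):
    """P_XY[ dP_XY / d(P_X x P_Y) <= M/gamma ] - 1/gamma."""
    mass = sum(p_y_given_x(y, c) / M
               for c in codebook for y in ys
               if p_y_given_x(y, c) / p_y[y] <= M / gamma)
    return mass - 1.0 / gamma

# Maximum-likelihood error probability of this code, for comparison.
pe = 1 - sum(max(p_y_given_x(y, c) for c in codebook) for y in ys) / M
best_bound = max(spectrum_bound(g) for g in np.linspace(1.05, 20, 200))
print(pe, best_bound)        # pe >= best_bound always holds
```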
(Schematic: information measures → integrals of functions → measures of sets, via change of measure.)
11/46
What Can We Do with Sets?
Union bounds;
Usually in dimensional analysis, Entropy ≈ log |sets|;
Blowing-up lemma.
12/46
The Blowing-up Lemma
Define the r-blow-up of A ⊆ Y^n as its r-neighborhood in the Hamming distance:
A^r := {v^n ∈ Y^n : d_n(v^n, A) ≤ r}.   (1)
(Figures: a set A and its blow-up A^r; also a screenshot from A. El Gamal's 2010 slides on Katalin Marton's work, illustrating A ⊆ X^n and Γ_l(A) = {x^n : min_{y^n ∈ A} d(x^n, y^n) ≤ l}.)
Lemma
Given any P_X and ε_n → 0 as n → ∞, there exist δ_n, η_n → 0 as n → ∞ such that for any A with P_X^{⊗n}[A] ≥ 2^{−nε_n}, we have P_X^{⊗n}[A^{nδ_n}] ≥ 1 − η_n.
13/46
Blowing-up Lemma in Information Theory
Ahlswede-Gacs-Korner 1976: used Margulis's result to prove the blowing-up lemma (motivated by the strong converse problem)
Csiszar-Korner 1981, 2011: used BUL to prove the strong converse for all source and channel networks with known first-order region.
Marton 1985: introduced a simple proof using the transportation method
14/46
Concentration of Measure in High Dimensional Analysis
Solutions to numerous mathematical problems, as well as applications in data science:
sparse signal recovery, matrix completion
randomized algorithms
group testing
. . .
15/46
BUL Approach to Channel Coding
Deterministic decoders f_m = 1_{D_m}; max error probability ε.
Step 1: Construct an L-list code whose decoding sets are the blow-ups (D_m)^r.
Step 2: Data processing:
I(X^n; Y^n) ≥ d(P_e ‖ 1 − L/M) ≥ (1 − P_e) log(M/L) − 1.
Step 3 (justify): upper bound P_e using ε, r (BUL), and upper bound L using r (counting argument).
I(X^n; Y^n) ≥ log M − O_ε(√n log^{3/2} n)
16/46
The BUL Approach: Key Arguments
Key observations: let r = 100√(n log n). Consider any D.
Blowing-up: if P satisfies P^{⊗n}[D] ≥ 1 − ε, then
P^{⊗n}[D^r] ≥ 1 − o_ε(n^{−10}).
Union bound:
L ≤ |Hamming ball of radius r| ≤ exp(O_ε(√n log^{3/2} n)).
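A small numerical check of the counting step, restricted to a binary alphabet and with the constant 100 dropped so that r < n for moderate n (both are simplifying assumptions):

```python
import math

# Counting step, binary alphabet, constant 100 dropped: the Hamming ball of radius
# r = sqrt(n log n) has volume exp(O(sqrt(n) log^{3/2} n)).
def log_ball_volume(n, r):
    return math.log(sum(math.comb(n, i) for i in range(r + 1)))

for n in (10**3, 10**4, 10**5):
    r = int(math.sqrt(n * math.log(n)))
    ratio = log_ball_volume(n, r) / (math.sqrt(n) * math.log(n) ** 1.5)
    print(n, r, round(ratio, 3))     # the ratio stays bounded, matching the claimed scaling
```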
17/46
Data Processing Argument Responsible for Second-Order Sub-optimality
I(X^n; Y^n) ≥ log M − O_ε(√n log^{3/2} n)
Claims:
The BUL and the union bound are asymptotically sharp.
Even if the blowing-up operation is replaced by some "fancy operation", one cannot beat O_ε(√(n log n)).
The rescue: functional inequalities.
18/46
Gibbs Variational Formula: Convex Duality
D(P‖Q) := E[log (dP/dQ)(X)] = sup_f {P(log f) − log Q(f)}    (Gibbs)
where X ∼ P, f : X → [0, ∞).
Legendre Transform:
Λ*(P) := sup_g {P(g) − Λ(g)}.
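A numerical sanity check of the Gibbs variational formula on a randomly drawn pair (P, Q) over a small alphabet; the supremum is attained at f = dP/dQ (alphabet size and seed are arbitrary illustrative choices):

```python
import numpy as np

# Gibbs variational formula on a random pair (P, Q) over a 5-letter alphabet:
# D(P||Q) = sup_f { P(log f) - log Q(f) }, with the sup attained at f = dP/dQ.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5))
Q = rng.dirichlet(np.ones(5))

def objective(f):
    return P @ np.log(f) - np.log(Q @ f)

kl = float(np.sum(P * np.log(P / Q)))
print("D(P||Q)                :", kl)
print("objective at f = dP/dQ :", objective(P / Q))    # equals D(P||Q)
print("best of 1000 random f  :", max(objective(rng.random(5) + 0.1) for _ in range(1000)))
```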
19/46
What Can We Do with Functions?
Richer mathematical structures:
Linear space structure
Norm of functions
Operators (linear operators, maximal operators. . . )
Relevant tools:
Convex analysis
Hypercontractivity and its reverse
Operator theory
20/46
Some Philosophical Underpinnings
Construction of measure theory from functional analysis (Riesz representation theorem).
Large deviation theory.
Functional versions of some geometric inequalities (e.g. the Prekopa-Leindler inequality and the Brunn-Minkowski inequality).
21/46
Functional Inequality in High Dimensional Data Analysis
Functional inequalities, such as spectral inequalities, log-Sobolev inequalities and hypercontractivity, have been applied in:
Boolean influence,
discrete Fourier analysis,
learning,
hardness of approximation,
communication complexity,
analysis of Markov chain Monte Carlo
22/46
Simple and Semi-simple Markov Semigroups
Definition
(T_t)_{t≥0} is a simple semigroup with stationary measure P if
T_t : H_+(Y) → H_+(Y),  f ↦ e^{−t} f + (1 − e^{−t}) P(f).   (2)
Markov process interpretation: a particle sits at x at time t = 0. Whenever a Poisson clock of rate 1 ticks, we replace the value by a random sample ∼ P. Let P_{x,t} be the distribution at time t. Then
(T_t f)(x) := P_{x,t}(f), ∀f.   (3)
Semi-simple semigroup: in the i.i.d. case P ← P^{⊗n}, consider
T_t := [e^{−t} + (1 − e^{−t}) P]^{⊗n}.   (4)
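A minimal sketch of these two operators on a finite alphabet (the array shapes and example inputs are illustrative assumptions):

```python
import numpy as np

# Simple semigroup T_t f = e^{-t} f + (1 - e^{-t}) P(f) on a finite alphabet,
# and its tensorization (the semi-simple semigroup) acting coordinate by coordinate.
def simple_semigroup(f, P, t):
    return np.exp(-t) * f + (1 - np.exp(-t)) * np.dot(P, f)

def semi_simple_semigroup(f, P, t):
    """Apply the single-letter operator independently in every coordinate of Y^n."""
    g = np.array(f, dtype=float)
    for axis in range(g.ndim):
        avg = np.tensordot(P, g, axes=([0], [axis]))       # integrate out this coordinate
        g = np.exp(-t) * g + (1 - np.exp(-t)) * np.expand_dims(avg, axis)
    return g

# Illustrative example: uniform P on a binary alphabet, n = 3, f an indicator of a point.
P = np.array([0.5, 0.5])
f = np.zeros((2, 2, 2))
f[0, 0, 0] = 1.0
print(semi_simple_semigroup(f, P, t=0.5))   # mass spreads out; the result stays nonnegative
```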
23/46
Reverse Hypercontractivity of Markov Semigroups
Theorem (Reverse hypercontractivity [Mossel et al. 13])
Let (T_t)_{t≥0} be a simple or semi-simple semigroup. Then for all 0 < p < 1, nonnegative f, and t ≥ ln(1/(1 − p)),
‖T_t f‖_0 := exp(P(log T_t f)) ≥ ‖f‖_p.   (5)
History: Borell proved RHC for Gaussian and symmetric Bernoulli. However, RHC has a universal property, in contrast to HC. Nevertheless, applications of RHC are far less known than those of HC.
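A numerical spot-check of inequality (5) for the simple semigroup at the critical time t = ln(1/(1 − p)), on randomly drawn P and f (illustrative only, not a proof):

```python
import numpy as np

# Check of (5) for the simple semigroup at the critical time t = ln(1/(1-p)):
#   ||T_t f||_0 = exp(P(log T_t f))  >=  ||f||_p = (P(f^p))^{1/p},  0 < p < 1, f >= 0.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(6))                     # random stationary measure

def T(f, t):
    return np.exp(-t) * f + (1 - np.exp(-t)) * np.dot(P, f)

for _ in range(5):
    f = rng.random(6)                             # nonnegative test function
    p = rng.uniform(0.05, 0.95)
    t = np.log(1.0 / (1.0 - p))                   # critical time in the theorem
    lhs = np.exp(P @ np.log(T(f, t)))             # ||T_t f||_0
    rhs = (P @ f ** p) ** (1.0 / p)               # ||f||_p
    print(bool(lhs >= rhs - 1e-12), lhs, rhs)
```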
24/46
New Proof of Channel Coding via Convex Duality
Assume: max error probability ≤ ε.
I(X^n; Y^n)
= (1/M) Σ_{m=1}^{M} D(P_{Y^n|X^n=c_m} ‖ P_{Y^n})
≥ (1/M) Σ_{m=1}^{M} [P_{Y^n|X^n=c_m}(ln Λf_m) − ln P_{Y^n}(Λf_m)]        (Gibbs)
≥ O(√n) ln(1 − ε) − ln P_{Y^n}(sup {(1/M) Σ_{m=1}^{M} f_m}) − O(√n)        (justify)
= ln M − O(√n).
(Schematic: information measures → integrals of functions, via convex duality.)
25/46
Justification 1
Want: P_{Y^n|X^n=c_m}(ln Λf_m) ≥ O(√n) ln P_{Y^n|X^n=c_m}(f_m).
t := 1/√n;
T_{x^n,t} := ⊗_{i=1}^{n} [e^{−t} + (1 − e^{−t}) P_{Y|X=x_i}];
Λf := sup_{x^n} T_{x^n,t} f, ∀f.
Proof:
L.H.S. ≥ ln ‖T_{c_m,t} f_m‖_{L^0(P_{Y^n|X^n=c_m})}
≥ ln ‖f_m‖_{L^{1−e^{−t}}(P_{Y^n|X^n=c_m})}        (reverse hypercontractivity)
≥ (1/(1 − e^{−t})) ln P_{Y^n|X^n=c_m}(f_m)
≥ O(√n) ln P_{Y^n|X^n=c_m}(f_m).
26/46
Justification 2
Want: ln P_{Y^n}((1/M) Σ_{m=1}^{M} Λf_m) ≤ O(√n) + ln P_{Y^n}(sup {(1/M) Σ_{m=1}^{M} f_m}).
t := 1/√n;
T_{x^n,t} := ⊗_{i=1}^{n} [e^{−t} + (1 − e^{−t}) P_{Y|X=x_i}];
Λf := sup_{x^n} T_{x^n,t} f, ∀f;
find ν on Y such that α := sup_x ‖dP_{Y|X=x}/dν‖_∞ < ∞.
Proof: since Λf_m = sup_{x^n} T_{x^n,t} f_m ≤ ⊗_{i=1}^{n} [e^{−t} + (1 − e^{−t}) α ν] f_m,
(1/M) Σ_{m=1}^{M} Λf_m ≤ ⊗_{i=1}^{n} [e^{−t} + (1 − e^{−t}) α ν] · (1/M) Σ_{m=1}^{M} f_m
≤ [e^{−t} + (1 − e^{−t}) α]^n · sup {(1/M) Σ_{m=1}^{M} f_m}.
27/46
Summary: Optimal Second-Order Fano
Theorem
Fix P_{Y|X} and positive integers n and M. If there exist c_1, . . . , c_M ∈ X^n and disjoint D_1, . . . , D_M ⊆ Y^n such that the geometric average of the correct decoding probabilities over the codewords exceeds 1 − ε, then
I(X^n; Y^n) ≥ ln M − 2√((α − 1) n ln(1/(1 − ε))) − ln(1/(1 − ε)),   (6)
where X^n is equiprobable on {c_1, . . . , c_M}, Y^n is its output from P_{Y^n|X^n} := P_{Y|X}^{⊗n}, and α := exp(I_∞(X;Y)).
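To see what the theorem buys quantitatively, the following sketch evaluates the penalty term in (6) for assumed values of α and ε, confirming that the gap between ln M and I(X^n; Y^n) grows only like √n:

```python
import math

# Penalty term in (6): 2*sqrt((alpha - 1) * n * ln(1/(1-eps))) + ln(1/(1-eps)).
# alpha and eps below are hypothetical values chosen only to illustrate the scaling.
def penalty(n, eps, alpha):
    return 2 * math.sqrt((alpha - 1) * n * math.log(1 / (1 - eps))) + math.log(1 / (1 - eps))

alpha = math.exp(0.7)     # hypothetical exp(I_infty(X;Y))
eps = 0.25                # non-vanishing error probability
for n in (100, 1000, 10_000, 100_000):
    print(n, round(penalty(n, eps, alpha), 1), round(penalty(n, eps, alpha) / math.sqrt(n), 3))
# The last column is essentially constant: the gap between ln M and I(X^n;Y^n) is O(sqrt(n)).
```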
28/46
Comparison between Blowing-up and Markov Semigroups
Figure: schematic comparison of 1_A, 1_{A^{nt}}, and T_t^{⊗n} 1_A, plotted against the Hamming weight of x^n, where A is a Hamming ball.
29/46
Comparison of the BUL and the Functional Approach
BUL approach vs. new (functional) approach:
Connecting information measures to observables: data processing property (BUL) vs. convex duality (new).
Lower bound w.r.t. a given measure: concentration of measure (BUL) vs. reverse hypercontractivity (new).
Upper bound w.r.t. the reference measure: |A^r| ≤ |A| |B(r)| (BUL) vs. use of ‖dP/dQ‖_∞ (new).
Second-order term: O_ε(√n log^{3/2} n) (BUL) vs. O(√(n log(1/(1 − ε)))), optimal in n and ε (new).
Requirement: |Y| < ∞ (BUL) vs. ‖dP/dQ‖_∞ < ∞ (new).
Applicability to multiuser problems: same.
30/46
Gaussian counterpart: Ornstein-Uhlenbeck Semigroup
T_{x^n,t} f(y^n) := E[f(e^{−t} y^n + (1 − e^{−t}) x^n + √(1 − e^{−2t}) V^n)]
Figure: Illustration of the action of T_{x^n,t}. The original function (an indicator function) is convolved with a Gaussian measure and then dilated (with center x^n).
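A Monte Carlo sketch of this action (the test function, center x^n, and parameters below are illustrative assumptions):

```python
import numpy as np

# Monte Carlo evaluation of (T_{x,t} f)(y) = E[f(e^{-t} y + (1 - e^{-t}) x + sqrt(1 - e^{-2t}) V)],
# with V ~ N(0, I_n).  The test function, center x, and evaluation point y are illustrative.
def T(f, x, t, y, num_samples=100_000, rng=np.random.default_rng(2)):
    V = rng.standard_normal((num_samples, len(y)))
    pts = np.exp(-t) * y + (1 - np.exp(-t)) * x + np.sqrt(1 - np.exp(-2 * t)) * V
    return f(pts).mean()

f = lambda z: (np.linalg.norm(z, axis=-1) <= 2.0).astype(float)   # indicator of a ball
x = np.zeros(3)
y = np.array([3.0, 0.0, 0.0])       # f(y) = 0 at this point,
print(T(f, x, t=0.5, y=y))          # but (T_{x,t} f)(y) > 0: smoothed and recentered toward x
```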
31/46
Optimal Second-Order Fano: Gaussian Case
Theorem
P_{Y|X=x} = N(x, σ²); c_1, . . . , c_M ∈ X^n; the geometric average of correct decoding probabilities over the codewords exceeds 1 − ε. Then
I(W; Y^n) ≥ ln M − √(2n ln(1/(1 − ε))) − ln(1/(1 − ε))   (7)
where W is equiprobable on {1, . . . , M}, X^n = c_W, and Y^n is the output from P_{Y^n|X^n} := P_{Y|X}^{⊗n}.
32/46
Applications of Optimal Second-order Fano
The optimal second-order Fano allows us to improve previous results (obtaining the optimal second-order term, under weaker assumptions, ...) on
Empirical distribution of good channel codes (previously [Polyanskiy, Verdu 2014]);
Degraded discrete broadcast channel (previously [Ahlswede, Gacs, Korner 1976]);
Gaussian broadcast channel (previously [Fong, Tan 2016]);
...
33/46
Extensions to multiuser settings:
General duality theory
34/46
Source Coding with Compressed Side Information
(Block diagram: Encoder 1 observes X^n and sends W_1; Encoder 2 observes Y^n and sends W_2; the decoder outputs Ŷ^n.)
R_1 := (1/n) log |W_1|, R_2 := (1/n) log |W_2|.
Goal: Ŷ^n = Y^n with high probability.
R_1 ≥ I(U;X);
R_2 ≥ H(Y|U),
where U − X − Y.
35/46
CR Generation with One Communicator
(Block diagram: terminal T_0 observes X and broadcasts messages W_1, . . . , W_m to terminals T_1, . . . , T_m, which observe Y_1, . . . , Y_m; T_0 outputs key K and T_j outputs key K_j.)
R := (1/n) log |K|, R_j := (1/n) log |W_j|, j = 1, . . . , m.
Goal: K = K_1 = · · · = K_m, equiprobable.
R ≤ I(U;X);
R_l ≥ I(U;X) − I(U;Y_l), 1 ≤ l ≤ m,
where U − X − Y^m.
36/46
Degraded Broadcast Channel
(Block diagram: the transmitter encodes (W_1, W_2); Receiver 1 observes the output of P_{Y|X} and decodes W_1; Receiver 2 observes the output of P_{Z|X} and decodes W_2.)
Degradedness: P_{Z|X} is the concatenation of P_{Y|X} and some P_{Z|Y}.
R_1 ≤ I(X;Y|U)
R_2 ≤ I(U;Z)
for some P_{UX}.
37/46
Bridge 1: from MI to Relative Entropy
Fix Q_{XY}, c > 0.
inf_{μ_n : |μ_n − Q_X^{⊗n}| < 0.5}  sup_{P_{X^n}} {c D(P_{Y^n} ‖ Q_Y^{⊗n}) − D(P_{X^n} ‖ μ_n)}
= n sup_{P_{U|X}} {c I(U;Y) − I(U;X)} + O(√n).
Ahlswede, Gacs, Korner, "Bounds on conditional probabilities with applications in multi-user communication," 1976;
Csiszar and Korner, "Information Theory: Coding Theorems for Discrete Memoryless Systems," 1981, 2011.
Liu, Courtade, Cuff, Verdu, "Smoothing Brascamp-Lieb inequalities and strong converses for common randomness generation," 2016.
38/46
Bridge 2: from Relative Entropy to Observables
Previous idea (Ahlswede, Gacs, Korner, Csiszar, Marton, . . . ): apply data processing and the following fact: let P equal Q conditioned on a set A:
P[C] := Q[C ∩ A] / Q[A].
Then
D(P‖Q) = log(1/Q[A]).
New approach: use convex duality to convert an entropic inequality directly and losslessly into a functional inequality.
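A quick finite-alphabet check of the classical fact above (the alphabet size and the set A are arbitrary illustrative choices):

```python
import numpy as np

# If P is Q conditioned on a set A, i.e. P[C] = Q[C ∩ A] / Q[A], then D(P||Q) = log(1/Q[A]).
rng = np.random.default_rng(3)
Q = rng.dirichlet(np.ones(8))
A = np.array([1, 1, 0, 1, 0, 0, 1, 0], dtype=bool)       # an arbitrary subset of the alphabet

P = np.where(A, Q, 0.0) / Q[A].sum()                      # Q conditioned on A
kl = float(np.sum(P[A] * np.log(P[A] / Q[A])))            # D(P||Q) over the support of P
print(kl, float(np.log(1.0 / Q[A].sum())))                # the two values coincide
```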
39/46
Optimal Second-order Image-size
(Figure: a set A ⊆ Y and its preimage {x ∈ X : Q_{Y|X=x}[A] ≥ 1 − ε}.)
For "regular" A,
ln Q_X^{⊗n}[preimage of A] − c ln Q_Y^{⊗n}[A] ≤ n sup_{Q_{U|X}} {c I(U;Y) − I(U;X)} + O(√n).
Liu, Courtade, Cuff, Verdu, "Brascamp-Lieb Inequality and Its Reverse: An Information Theoretic View," ISIT 2016.
Liu, Courtade, Cuff, Verdu, "Smoothing Brascamp-Lieb inequalities and strong converses for CR generation," ISIT 2016.
40/46
Image-size with a Reverse Channel
Figure: The tradeoff between the sizes of the two images of a set A ⊆ X under Q_{Y|X} and Q_{Z|X}.
41/46
Key Duality Result
Theorem
Consider Polish spaces X, Y, Z, random transformations Q_{Y|X}, Q_{Z|X}, and nonnegative measures ν_Z fully supported on Z and ν_Y fully supported on Y. For d ∈ R, the following statements are equivalent:
inf_{g : Q_{Z|X}(ln g) ≥ Q_{Y|X}(ln f)} ν_Z(g) ≤ e^d ν_Y(f), ∀f ≥ 0;   (8)
D(P_Z ‖ ν_Z) + d ≥ D(P_Y ‖ ν_Y), ∀P_X,   (9)
where P_X → Q_{Y|X} → P_Y and P_X → Q_{Z|X} → P_Z.
42/46
Fenchel-Legendre Duality Theory
Theorem. Let V be a topological vector space and V* be its dual. Let f and g be convex functions V → R ∪ {+∞} satisfying certain regularity conditions. Then
sup_{x* ∈ V*} {−f*(x*) − g*(−x*)} = inf_{x ∈ V} {f(x) + g(x)}.
Villani, "Topics in Optimal Transportation."
43/46
Other Applications of the Duality Theory
E_γ(P‖Q) = sup_f {P(f) − γ Q(f)}, f : X → [0, 1].
Liu, Cuff, Verdu, "E_γ-Resolvability," IT Trans. 2017.
Liu, Cuff, Verdu, "One-Shot Mutual Covering Lemma and Marton's Inner Bound with a Common Message," ISIT 2015.
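A numerical check of the E_γ variational formula above, on a randomly drawn pair (P, Q); the supremum is attained by the indicator f = 1{dP/dQ > γ}, i.e. the closed form Σ_x (P(x) − γQ(x))_+ (stated here as an assumption for the check):

```python
import numpy as np

# E_gamma(P||Q) = sup_{f : X -> [0,1]} { P(f) - gamma * Q(f) };
# the sup is attained at f = 1{dP/dQ > gamma}, i.e. sum_x (P(x) - gamma * Q(x))_+.
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(10))
Q = rng.dirichlet(np.ones(10))
gamma = 1.5

closed_form = float(np.maximum(P - gamma * Q, 0.0).sum())
f_opt = (P / Q > gamma).astype(float)
attained = float(P @ f_opt - gamma * (Q @ f_opt))
best_random = max(float(P @ f - gamma * (Q @ f)) for f in rng.random((1000, 10)))
print(closed_form, attained, best_random)   # closed_form == attained >= best_random
```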
Information-theoretic approach to Brascamp-Lieb inequalities
[Lieb 1990]
[Geng and Nair 2014]
[Liu, Cuff, Courtade, Verdu, 2015]
44/46
Summary
Conclusion: concentration of measure is not the final say.
To be explored: secrecy, interactive settings, more than one helper. . .
45/46
Acknowledgements
Thesis Committee: Emmanuel Abbe, Mark Braverman, Paul Cuff (advisor), Sergio Verdu (advisor)
Thesis Readers: Yuxin Chen, H. Vincent Poor, Sergio Verdu (advisor)
46/46