Information Theory From a Functional Viewpoint
Jingbo Liu
Ph.D. Defense
December 12, 2017
Department of Electrical Engineering, Princeton University
1/46
Overview of Main Contributions
A functional approach to converses in information theory;
General functional-entropic duality theory needed for extensions of this approach to multiuser settings;
Other applications of the duality theory.
2/46
Information Theory: A Bridge between Information Measures and Operational Problems
Information Measures: entropy, mutual information, common information, relative entropy, hypercontractivity, information spectrum, ...
Operational Problems: data compression, data transmission, randomness generation, hypothesis testing, data hiding, computation complexity, ...
3/46
(Schematic: information measures are linked to two kinds of observables: integrals of functions (via convex duality, data processing, and change of measure) and measures of sets (via data processing, change of measure, and taking indicators or the like).)
4/46
Information Theory in the Non-vanishing Error Regime
Achievability: basically well-known.
The study of second-order asymptotics for multi-terminal problems is at its infancy. ... The primary difficulty is our inability to deal, in a systematic and principled way, with auxiliary random variables for the (strong) converse part. Thus, genuinely new non-asymptotic converses need to be developed...
—Tan, F&T monograph, “Open problems and challenges.”
5/46
Weak Converse; Strong Converse; Second-Order Converse
Weak converse: infeasibility of coding rate for vanishing error probability;
Strong converse: infeasibility of coding rate for any non-vanishing error probability in (0, 1);
Second-order converse: behavior of the second-order term in the coding rate when the error probability is non-vanishing (typically O(√n)).
6/46
Comparison of Strong Converse Techniques
Information spectrum (meta-converse): goes beyond |Y| < ∞ and stationary memoryless; optimal second-order term; extends to selected multiuser problems.
Method of types (sphere-packing bound): does not go beyond |Y| < ∞ and stationary memoryless; optimal second-order term; extends to selected multiuser problems.
Blowing-up (image-size characterization): does not go beyond |Y| < ∞ and stationary memoryless; second-order term O_ε(√n log^{3/2} n); extends to all source-channel networks with known first-order region.
7/46
Example: Channel Coding
Figure: Shannon, “A Mathematical Theory of Communication,” 1948.
8/46
Channel Coding: Formulation
Notation: P(f) := ∫ f dP.
Given:
Random transformation P_{Y|X};
Message W equiprobably selected from {1, . . . , M}.
Goal: find
Codebook {c_1, . . . , c_M};
Decoding functions f_m : Y → [0, 1], m = 1, . . . , M
to minimize:
Average error: 1 − (1/M) Σ_{m=1}^{M} P_{Y|X=c_m}(f_m), or
Max error: 1 − min_{1≤m≤M} P_{Y|X=c_m}(f_m).
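As a concrete, purely illustrative instance of this formulation, the following sketch computes the average and maximal error probabilities of a hypothetical two-codeword repetition code over a BSC(0.1); the channel, blocklength, codebook, and decoder are assumptions for illustration, not part of the talk.

```python
import itertools
import numpy as np

# Illustrative setting: BSC(0.1), blocklength n = 3, M = 2 codewords (repetition code).
p, n = 0.1, 3
codebook = [(0, 0, 0), (1, 1, 1)]                      # c_1, c_2

def p_y_given_x(y, x):
    """P_{Y^n|X^n=x}(y) for a memoryless binary symmetric channel."""
    flips = sum(yi != xi for yi, xi in zip(y, x))
    return p ** flips * (1 - p) ** (n - flips)

def decode(y):
    """Deterministic decoder f_m = 1_{D_m}: majority vote."""
    return 0 if sum(y) <= 1 else 1

# P_{Y|X=c_m}(f_m): probability that the channel output falls in decoding set D_m.
correct = [
    sum(p_y_given_x(y, c) for y in itertools.product((0, 1), repeat=n) if decode(y) == m)
    for m, c in enumerate(codebook)
]
avg_error = 1 - np.mean(correct)          # 1 - (1/M) * sum_m P_{Y|X=c_m}(f_m)
max_error = 1 - min(correct)              # 1 - min_m   P_{Y|X=c_m}(f_m)
print(avg_error, max_error)
```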
9/46
Previous Converse Method 1: Fano's Inequality
Real distribution: P_{Y|X} P_{XW}.
If under P_Y P_{XW}, the error probability would be 1 − 1/M.
sup_{P_X} I(P_X, P_{Y|X}) = sup_{P_X} D(P_{XY} ‖ P_X P_Y)
  ≥ d(P_e ‖ 1 − 1/M)        (data processing)
  ≥ (1 − P_e) log M − 1.
(Schematic: information measures → integrals of functions → measures of sets, via data processing.)
10/46
Previous Converse Method 2: Information Spectrum (Change of Measure)
Real distribution: P_{Y|X} P_{XW}.
If under P_Y P_{XW}, the correct probability would be 1/M.
1 − P_e − P_{XY}[ dP_{XY}/d(P_X × P_Y) > M/γ ]
  ≤ P_{XY}[ no error ∧ dP_{XY}/d(P_X × P_Y) ≤ M/γ ]
  ≤ (M/γ) (P_X × P_Y)[no error]        (change of measure)
  = (M/γ) · (1/M),  ∀γ > 0.
Hence
P_e ≥ P_{XY}[ dP_{XY}/d(P_X × P_Y) ≤ M/γ ] − 1/γ.
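A minimal numerical sanity check of this bound, on an assumed toy setting (BSC(0.1), n = 3, two equiprobable codewords); the parameters are illustrative only, and the point is simply that the right-hand side never exceeds the true error probability (the bound is weak at such a tiny blocklength).

```python
import itertools
import numpy as np

# Illustrative toy setting: BSC(0.1), n = 3, M = 2 equiprobable codewords.
p, n, M = 0.1, 3, 2
codebook = [(0, 0, 0), (1, 1, 1)]
ys = list(itertools.product((0, 1), repeat=n))

def p_y_given_x(y, x):
    flips = sum(yi != xi for yi, xi in zip(y, x))
    return p ** flips * (1 - p) ** (n - flips)

# Output distribution P_{Y^n} induced by the equiprobable codebook.
p_y = {y: sum(p_y_given_x(y, c) for c in codebook) / M for y in ys}

def spectrum_bound(gamma):
    """P_XY[ dP_XY / d(P_X x P_Y) <= M/gamma ] - 1/gamma."""
    mass = sum(p_y_given_x(y, c) / M
               for c in codebook for y in ys
               if p_y_given_x(y, c) / p_y[y] <= M / gamma)
    return mass - 1.0 / gamma

# Maximum-likelihood error probability of this code, for comparison.
pe = 1 - sum(max(p_y_given_x(y, c) for c in codebook) for y in ys) / M
best_bound = max(spectrum_bound(g) for g in np.linspace(1.05, 20, 200))
print(pe, best_bound)        # pe >= best_bound always holds
```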
(Schematic: information measures → integrals of functions → measures of sets, via change of measure.)
11/46
What Can We Do with Sets?
Union bounds;
Usually in dimensional analysis, Entropy ≈ log |sets|;
Blowing-up lemma.
12/46
The Blowing-up Lemma
Define the r-blow-up of A ⊆ Y^n as its r-neighborhood in the Hamming distance:
A^r := {v^n ∈ Y^n : d_n(v^n, A) ≤ r}.   (1)
(Figures: a set A and its blow-up A^r; also a screenshot from A. El Gamal's 2010 slides on Katalin Marton's work, illustrating A ⊆ X^n and Γ_l(A) = {x^n : min_{y^n ∈ A} d(x^n, y^n) ≤ l}.)
Lemma
Given any P_X and ε_n → 0 as n → ∞, there exist δ_n, η_n → 0 as n → ∞ such that for any A with P_X^{⊗n}[A] ≥ 2^{−nε_n}, we have P_X^{⊗n}[A^{nδ_n}] ≥ 1 − η_n.
13/46
Blowing-up Lemma in Information Theory
Ahlswede-Gacs-Korner 1976: used Margulis's result to prove the blowing-up lemma (motivated by the strong converse problem)
Csiszar-Korner 1981, 2011: used BUL to prove the strong converse for all source and channel networks with known first-order region.
Marton 1985: introduced a simple proof using the transportation method
14/46
Concentration of Measure in High Dimensional Analysis
Solutions to numerous mathematical problems, as well as applications in data science:
sparse signal recovery, matrix completion
randomized algorithms
group testing
. . .
15/46
BUL Approach to Channel Coding
Deterministic decoders f_m = 1_{D_m}; max error probability ε.
Step 1: Construct an L-list code whose decoding sets are the blow-ups (D_m)^r.
Step 2: Data processing:
I(X^n; Y^n) ≥ d(P_e ‖ 1 − L/M) ≥ (1 − P_e) log(M/L) − 1.
Step 3 (justify): upper bound P_e using ε, r (BUL), and upper bound L using r (counting argument).
I(X^n; Y^n) ≥ log M − O_ε(√n log^{3/2} n)
16/46
The BUL Approach: Key Arguments
Key observations: let r = 100√(n log n). Consider any D.
Blowing-up: if P satisfies P^{⊗n}[D] ≥ 1 − ε, then
P^{⊗n}[D^r] ≥ 1 − o_ε(n^{−10}).
Union bound:
L ≤ |Hamming ball of radius r| ≤ exp(O_ε(√n log^{3/2} n)).
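A small numerical check of the counting step, restricted to a binary alphabet and with the constant 100 dropped so that r < n for moderate n (both are simplifying assumptions):

```python
import math

# Counting step, binary alphabet, constant 100 dropped: the Hamming ball of radius
# r = sqrt(n log n) has volume exp(O(sqrt(n) log^{3/2} n)).
def log_ball_volume(n, r):
    return math.log(sum(math.comb(n, i) for i in range(r + 1)))

for n in (10**3, 10**4, 10**5):
    r = int(math.sqrt(n * math.log(n)))
    ratio = log_ball_volume(n, r) / (math.sqrt(n) * math.log(n) ** 1.5)
    print(n, r, round(ratio, 3))     # the ratio stays bounded, matching the claimed scaling
```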
17/46
Data Processing Argument Responsible for Second-Order Sub-optimality
I(X^n; Y^n) ≥ log M − O_ε(√n log^{3/2} n)
Claims:
The BUL and the union bound are asymptotically sharp.
Even if the blowing-up operation is replaced by some "fancy operation", one cannot beat O_ε(√(n log n)).
The rescue: functional inequalities.
18/46
Gibbs Variational Formula: Convex Duality
D(P‖Q) := E[log (dP/dQ)(X)] = sup_f {P(log f) − log Q(f)}    (Gibbs)
where X ∼ P, f : X → [0, ∞).
Legendre Transform:
Λ*(P) := sup_g {P(g) − Λ(g)}.
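A numerical sanity check of the Gibbs variational formula on a randomly drawn pair (P, Q) over a small alphabet; the supremum is attained at f = dP/dQ (alphabet size and seed are arbitrary illustrative choices):

```python
import numpy as np

# Gibbs variational formula on a random pair (P, Q) over a 5-letter alphabet:
# D(P||Q) = sup_f { P(log f) - log Q(f) }, with the sup attained at f = dP/dQ.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5))
Q = rng.dirichlet(np.ones(5))

def objective(f):
    return P @ np.log(f) - np.log(Q @ f)

kl = float(np.sum(P * np.log(P / Q)))
print("D(P||Q)                :", kl)
print("objective at f = dP/dQ :", objective(P / Q))    # equals D(P||Q)
print("best of 1000 random f  :", max(objective(rng.random(5) + 0.1) for _ in range(1000)))
```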
19/46
What Can We Do with Functions?
Richer mathematical structures:
Linear space structure
Norm of functions
Operators (linear operators, maximal operators. . . )
Relevant tools:
Convex analysis
Hypercontractivity and its reverse
Operator theory
20/46
Some Philosophical Underpinnings
Construction of measure theory from functional analysis (Riesz representation theorem).
Large deviation theory.
Functional versions of some geometric inequalities (e.g. the Prekopa-Leindler inequality and the Brunn-Minkowski inequality).
21/46
Functional Inequality in High Dimensional Data Analysis
Functional inequalities, such as spectral inequalities, log-Sobolev inequalities and hypercontractivity, have been applied in:
Boolean influence,
discrete Fourier analysis,
learning,
hardness of approximation,
communication complexity,
analysis of Markov chain Monte Carlo
22/46
Simple and Semi-simple Markov Semigroups
Definition
(T_t)_{t≥0} is a simple semigroup with stationary measure P if
T_t : H_+(Y) → H_+(Y),  f ↦ e^{−t} f + (1 − e^{−t}) P(f).   (2)
Markov process interpretation: a particle sits at x at time t = 0. Whenever a Poisson clock of rate 1 ticks, we replace the value by a random sample ∼ P. Let P_{x,t} be the distribution at time t. Then
(T_t f)(x) := P_{x,t}(f), ∀f.   (3)
Semi-simple semigroup: in the i.i.d. case P ← P^{⊗n}, consider
T_t := [e^{−t} + (1 − e^{−t}) P]^{⊗n}.   (4)
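A minimal sketch of these two operators on a finite alphabet (the array shapes and example inputs are illustrative assumptions):

```python
import numpy as np

# Simple semigroup T_t f = e^{-t} f + (1 - e^{-t}) P(f) on a finite alphabet,
# and its tensorization (the semi-simple semigroup) acting coordinate by coordinate.
def simple_semigroup(f, P, t):
    return np.exp(-t) * f + (1 - np.exp(-t)) * np.dot(P, f)

def semi_simple_semigroup(f, P, t):
    """Apply the single-letter operator independently in every coordinate of Y^n."""
    g = np.array(f, dtype=float)
    for axis in range(g.ndim):
        avg = np.tensordot(P, g, axes=([0], [axis]))       # integrate out this coordinate
        g = np.exp(-t) * g + (1 - np.exp(-t)) * np.expand_dims(avg, axis)
    return g

# Illustrative example: uniform P on a binary alphabet, n = 3, f an indicator of a point.
P = np.array([0.5, 0.5])
f = np.zeros((2, 2, 2))
f[0, 0, 0] = 1.0
print(semi_simple_semigroup(f, P, t=0.5))   # mass spreads out; the result stays nonnegative
```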
23/46
Reverse Hypercontractivity of Markov Semigroups
Theorem (Reverse hypercontractivity [Mossel et al. 13])
Let (T_t)_{t≥0} be a simple or semi-simple semigroup. Then for all 0 < p < 1, nonnegative f, and t ≥ ln(1/(1 − p)),
‖T_t f‖_0 := exp(P(log T_t f)) ≥ ‖f‖_p.   (5)
History: Borell proved RHC for Gaussian and symmetric Bernoulli. However, RHC has a universal property, in contrast to HC. Nevertheless, applications of RHC are far less known than those of HC.
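A numerical spot-check of inequality (5) for the simple semigroup at the critical time t = ln(1/(1 − p)), on randomly drawn P and f (illustrative only, not a proof):

```python
import numpy as np

# Check of (5) for the simple semigroup at the critical time t = ln(1/(1-p)):
#   ||T_t f||_0 = exp(P(log T_t f))  >=  ||f||_p = (P(f^p))^{1/p},  0 < p < 1, f >= 0.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(6))                     # random stationary measure

def T(f, t):
    return np.exp(-t) * f + (1 - np.exp(-t)) * np.dot(P, f)

for _ in range(5):
    f = rng.random(6)                             # nonnegative test function
    p = rng.uniform(0.05, 0.95)
    t = np.log(1.0 / (1.0 - p))                   # critical time in the theorem
    lhs = np.exp(P @ np.log(T(f, t)))             # ||T_t f||_0
    rhs = (P @ f ** p) ** (1.0 / p)               # ||f||_p
    print(bool(lhs >= rhs - 1e-12), lhs, rhs)
```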
24/46
New Proof of Channel Coding via Convex Duality
Assume: max error probability ≤ ε.
I(X^n; Y^n)
= (1/M) Σ_{m=1}^{M} D(P_{Y^n|X^n=c_m} ‖ P_{Y^n})
≥ (1/M) Σ_{m=1}^{M} [P_{Y^n|X^n=c_m}(ln Λf_m) − ln P_{Y^n}(Λf_m)]        (Gibbs)
≥ O(√n) ln(1 − ε) − ln P_{Y^n}(sup {(1/M) Σ_{m=1}^{M} f_m}) − O(√n)        (justify)
= ln M − O(√n).
(Schematic: information measures → integrals of functions, via convex duality.)
25/46
Justification 1
Want: P_{Y^n|X^n=c_m}(ln Λf_m) ≥ O(√n) ln P_{Y^n|X^n=c_m}(f_m).
t := 1/√n;
T_{x^n,t} := ⊗_{i=1}^{n} [e^{−t} + (1 − e^{−t}) P_{Y|X=x_i}];
Λf := sup_{x^n} T_{x^n,t} f, ∀f.
Proof:
L.H.S. ≥ ln ‖T_{c_m,t} f_m‖_{L^0(P_{Y^n|X^n=c_m})}
≥ ln ‖f_m‖_{L^{1−e^{−t}}(P_{Y^n|X^n=c_m})}        (reverse hypercontractivity)
≥ (1/(1 − e^{−t})) ln P_{Y^n|X^n=c_m}(f_m)
≥ O(√n) ln P_{Y^n|X^n=c_m}(f_m).
26/46
Justification 2
Want: ln P_{Y^n}((1/M) Σ_{m=1}^{M} Λf_m) ≤ O(√n) + ln P_{Y^n}(sup {(1/M) Σ_{m=1}^{M} f_m}).
t := 1/√n;
T_{x^n,t} := ⊗_{i=1}^{n} [e^{−t} + (1 − e^{−t}) P_{Y|X=x_i}];
Λf := sup_{x^n} T_{x^n,t} f, ∀f;
find ν on Y such that α := sup_x ‖dP_{Y|X=x}/dν‖_∞ < ∞.
Proof: since Λf_m = sup_{x^n} T_{x^n,t} f_m ≤ ⊗_{i=1}^{n} [e^{−t} + (1 − e^{−t}) α ν] f_m,
(1/M) Σ_{m=1}^{M} Λf_m ≤ ⊗_{i=1}^{n} [e^{−t} + (1 − e^{−t}) α ν] · (1/M) Σ_{m=1}^{M} f_m
≤ [e^{−t} + (1 − e^{−t}) α]^n · sup {(1/M) Σ_{m=1}^{M} f_m}.
27/46
Summary: Optimal Second-Order Fano
Theorem
Fix P_{Y|X} and positive integers n and M. If there exist c_1, . . . , c_M ∈ X^n and disjoint D_1, . . . , D_M ⊆ Y^n such that the geometric average of the correct decoding probabilities over the codewords exceeds 1 − ε, then
I(X^n; Y^n) ≥ ln M − 2√((α − 1) n ln(1/(1 − ε))) − ln(1/(1 − ε)),   (6)
where X^n is equiprobable on {c_1, . . . , c_M}, Y^n is its output from P_{Y^n|X^n} := P_{Y|X}^{⊗n}, and α := exp(I_∞(X;Y)).
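To see what the theorem buys quantitatively, the following sketch evaluates the penalty term in (6) for assumed values of α and ε, confirming that the gap between ln M and I(X^n; Y^n) grows only like √n:

```python
import math

# Penalty term in (6): 2*sqrt((alpha - 1) * n * ln(1/(1-eps))) + ln(1/(1-eps)).
# alpha and eps below are hypothetical values chosen only to illustrate the scaling.
def penalty(n, eps, alpha):
    return 2 * math.sqrt((alpha - 1) * n * math.log(1 / (1 - eps))) + math.log(1 / (1 - eps))

alpha = math.exp(0.7)     # hypothetical exp(I_infty(X;Y))
eps = 0.25                # non-vanishing error probability
for n in (100, 1000, 10_000, 100_000):
    print(n, round(penalty(n, eps, alpha), 1), round(penalty(n, eps, alpha) / math.sqrt(n), 3))
# The last column is essentially constant: the gap between ln M and I(X^n;Y^n) is O(sqrt(n)).
```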
28/46
Comparison between Blowing-up and Markov Semigroups
Figure: schematic comparison of 1_A, 1_{A^{nt}}, and T_t^{⊗n} 1_A, plotted against the Hamming weight of x^n, where A is a Hamming ball.
29/46
Comparison of the BUL and the Functional Approach
BUL approach vs. new (functional) approach:
Connecting information measures to observables: data processing property (BUL) vs. convex duality (new).
Lower bound w.r.t. a given measure: concentration of measure (BUL) vs. reverse hypercontractivity (new).
Upper bound w.r.t. the reference measure: |A^r| ≤ |A| |B(r)| (BUL) vs. use of ‖dP/dQ‖_∞ (new).
Second-order term: O_ε(√n log^{3/2} n) (BUL) vs. O(√(n log(1/(1 − ε)))), optimal in n and ε (new).
Requirement: |Y| < ∞ (BUL) vs. ‖dP/dQ‖_∞ < ∞ (new).
Applicability to multiuser problems: same.
30/46
Gaussian counterpart: Ornstein-Uhlenbeck Semigroup
T_{x^n,t} f(y^n) := E[f(e^{−t} y^n + (1 − e^{−t}) x^n + √(1 − e^{−2t}) V^n)]
Figure: Illustration of the action of T_{x^n,t}. The original function (an indicator function) is convolved with a Gaussian measure and then dilated (with center x^n).
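A Monte Carlo sketch of this action (the test function, center x^n, and parameters below are illustrative assumptions):

```python
import numpy as np

# Monte Carlo evaluation of (T_{x,t} f)(y) = E[f(e^{-t} y + (1 - e^{-t}) x + sqrt(1 - e^{-2t}) V)],
# with V ~ N(0, I_n).  The test function, center x, and evaluation point y are illustrative.
def T(f, x, t, y, num_samples=100_000, rng=np.random.default_rng(2)):
    V = rng.standard_normal((num_samples, len(y)))
    pts = np.exp(-t) * y + (1 - np.exp(-t)) * x + np.sqrt(1 - np.exp(-2 * t)) * V
    return f(pts).mean()

f = lambda z: (np.linalg.norm(z, axis=-1) <= 2.0).astype(float)   # indicator of a ball
x = np.zeros(3)
y = np.array([3.0, 0.0, 0.0])       # f(y) = 0 at this point,
print(T(f, x, t=0.5, y=y))          # but (T_{x,t} f)(y) > 0: smoothed and recentered toward x
```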
31/46
Optimal Second-Order Fano: Gaussian Case
Theorem
P_{Y|X=x} = N(x, σ²); c_1, . . . , c_M ∈ X^n; the geometric average of correct decoding probabilities over the codewords exceeds 1 − ε. Then
I(W; Y^n) ≥ ln M − √(2n ln(1/(1 − ε))) − ln(1/(1 − ε))   (7)
where W is equiprobable on {1, . . . , M}, X^n = c_W, and Y^n is the output from P_{Y^n|X^n} := P_{Y|X}^{⊗n}.
32/46
Applications of Optimal Second-order Fano
The optimal second-order Fano allows us to improve previous results (obtaining the optimal second-order term, under weaker assumptions, ...) on
Empirical distribution of good channel codes (previously [Polyanskiy, Verdu 2014]);
Degraded discrete broadcast channel (previously [Ahlswede, Gacs, Korner 1976]);
Gaussian broadcast channel (previously [Fong, Tan 2016]);
...
33/46
Extensions to multiuser settings:
General duality theory
34/46
Source Coding with Compressed Side Information
(Block diagram: Encoder 1 observes X^n and sends W_1; Encoder 2 observes Y^n and sends W_2; the decoder outputs Ŷ^n.)
R_1 := (1/n) log |W_1|, R_2 := (1/n) log |W_2|.
Goal: Ŷ^n = Y^n with high probability.
R_1 ≥ I(U;X);
R_2 ≥ H(Y|U),
where U − X − Y.
35/46
CR Generation with One Communicator
(Block diagram: terminal T_0 observes X and broadcasts messages W_1, . . . , W_m to terminals T_1, . . . , T_m, which observe Y_1, . . . , Y_m; T_0 outputs key K and T_j outputs key K_j.)
R := (1/n) log |K|, R_j := (1/n) log |W_j|, j = 1, . . . , m.
Goal: K = K_1 = · · · = K_m, equiprobable.
R ≤ I(U;X);
R_l ≥ I(U;X) − I(U;Y_l), 1 ≤ l ≤ m,
where U − X − Y^m.
36/46
Degraded Broadcast Channel
(Block diagram: the transmitter encodes (W_1, W_2); Receiver 1 observes the output of P_{Y|X} and decodes W_1; Receiver 2 observes the output of P_{Z|X} and decodes W_2.)
Degradedness: P_{Z|X} is the concatenation of P_{Y|X} and some P_{Z|Y}.
R_1 ≤ I(X;Y|U)
R_2 ≤ I(U;Z)
for some P_{UX}.
37/46
Bridge 1: from MI to Relative Entropy
Fix Q_{XY}, c > 0.
inf_{μ_n : |μ_n − Q_X^{⊗n}| < 0.5}  sup_{P_{X^n}} {c D(P_{Y^n} ‖ Q_Y^{⊗n}) − D(P_{X^n} ‖ μ_n)}
= n sup_{P_{U|X}} {c I(U;Y) − I(U;X)} + O(√n).
Ahlswede, Gacs, Korner, "Bounds on conditional probabilities with applications in multi-user communication," 1976;
Csiszar and Korner, "Information Theory: Coding Theorems for Discrete Memoryless Systems," 1981, 2011.
Liu, Courtade, Cuff, Verdu, "Smoothing Brascamp-Lieb inequalities and strong converses for common randomness generation," 2016.
38/46
Bridge 2: from Relative Entropy to Observables
Previous idea (Ahlswede, Gacs, Korner, Csiszar, Marton, . . . ): apply data processing and the following fact: let P equal Q conditioned on a set A:
P[C] := Q[C ∩ A] / Q[A].
Then
D(P‖Q) = log(1/Q[A]).
New approach: use convex duality to convert an entropic inequality directly and losslessly into a functional inequality.
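A quick finite-alphabet check of the classical fact above (the alphabet size and the set A are arbitrary illustrative choices):

```python
import numpy as np

# If P is Q conditioned on a set A, i.e. P[C] = Q[C ∩ A] / Q[A], then D(P||Q) = log(1/Q[A]).
rng = np.random.default_rng(3)
Q = rng.dirichlet(np.ones(8))
A = np.array([1, 1, 0, 1, 0, 0, 1, 0], dtype=bool)       # an arbitrary subset of the alphabet

P = np.where(A, Q, 0.0) / Q[A].sum()                      # Q conditioned on A
kl = float(np.sum(P[A] * np.log(P[A] / Q[A])))            # D(P||Q) over the support of P
print(kl, float(np.log(1.0 / Q[A].sum())))                # the two values coincide
```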
39/46
Optimal Second-order Image-size
(Figure: a set A ⊆ Y and its preimage {x ∈ X : Q_{Y|X=x}[A] ≥ 1 − ε}.)
For "regular" A,
ln Q_X^{⊗n}[preimage of A] − c ln Q_Y^{⊗n}[A] ≤ n sup_{Q_{U|X}} {c I(U;Y) − I(U;X)} + O(√n).
Liu, Courtade, Cuff, Verdu, "Brascamp-Lieb Inequality and Its Reverse: An Information Theoretic View," ISIT 2016.
Liu, Courtade, Cuff, Verdu, "Smoothing Brascamp-Lieb inequalities and strong converses for CR generation," ISIT 2016.
40/46
Image-size with a Reverse Channel
Figure: The tradeoff between the sizes of the two images of a set A ⊆ X under Q_{Y|X} and Q_{Z|X}.
41/46
Key Duality Result
Theorem
Consider Polish spaces X, Y, Z, random transformations Q_{Y|X}, Q_{Z|X}, and nonnegative measures ν_Z fully supported on Z and ν_Y fully supported on Y. For d ∈ R, the following statements are equivalent:
inf_{g : Q_{Z|X}(ln g) ≥ Q_{Y|X}(ln f)} ν_Z(g) ≤ e^d ν_Y(f), ∀f ≥ 0;   (8)
D(P_Z ‖ ν_Z) + d ≥ D(P_Y ‖ ν_Y), ∀P_X,   (9)
where P_X → Q_{Y|X} → P_Y and P_X → Q_{Z|X} → P_Z.
42/46
Fenchel-Legendre Duality Theory
Theorem. Let V be a topological vector space and V* be its dual. Let f and g be convex functions V → R ∪ {+∞} satisfying certain regularity conditions. Then
sup_{x* ∈ V*} {−f*(x*) − g*(−x*)} = inf_{x ∈ V} {f(x) + g(x)}.
Villani, "Topics in Optimal Transportation."
43/46
Other Applications of the Duality Theory
E_γ(P‖Q) = sup_f {P(f) − γ Q(f)}, f : X → [0, 1].
Liu, Cuff, Verdu, "E_γ-Resolvability," IT Trans. 2017.
Liu, Cuff, Verdu, "One-Shot Mutual Covering Lemma and Marton's Inner Bound with a Common Message," ISIT 2015.
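A numerical check of the E_γ variational formula above, on a randomly drawn pair (P, Q); the supremum is attained by the indicator f = 1{dP/dQ > γ}, i.e. the closed form Σ_x (P(x) − γQ(x))_+ (stated here as an assumption for the check):

```python
import numpy as np

# E_gamma(P||Q) = sup_{f : X -> [0,1]} { P(f) - gamma * Q(f) };
# the sup is attained at f = 1{dP/dQ > gamma}, i.e. sum_x (P(x) - gamma * Q(x))_+.
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(10))
Q = rng.dirichlet(np.ones(10))
gamma = 1.5

closed_form = float(np.maximum(P - gamma * Q, 0.0).sum())
f_opt = (P / Q > gamma).astype(float)
attained = float(P @ f_opt - gamma * (Q @ f_opt))
best_random = max(float(P @ f - gamma * (Q @ f)) for f in rng.random((1000, 10)))
print(closed_form, attained, best_random)   # closed_form == attained >= best_random
```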
Information-theoretic approach to Brascamp-Lieb inequalities
[Lieb 1990]
[Geng and Nair 2014]
[Liu, Cuff, Courtade, Verdu, 2015]
44/46
Summary
Conclusion: concentration of measure is not the final say.
To be explored: secrecy, interactive settings, more than one helper. . .
45/46
Acknowledgements
Thesis Committee: Emmanuel Abbe, Mark Braverman, Paul Cuff (advisor), Sergio Verdu (advisor)
Thesis Readers: Yuxin Chen, H. Vincent Poor, Sergio Verdu (advisor)
46/46