
Math 710: Measure Concentration

Alexander Barvinok

These are day-by-day lecture notes, not really proofread or edited and with no references, for the course that I taught in the Winter 2005 term. The course also included presentations given by the participants. The notes for the presentations are not included here, but the topics are listed on pages 113–115.


Contents

1. Introduction ..... 3
2. “Condensing” the sphere from the Gaussian measure ..... 4
3. The Johnson-Lindenstrauss “Flattening” Lemma ..... 8
4. Concentration in the Boolean cube ..... 13
5. Isoperimetric inequalities in the Boolean cube: a soft version ..... 19
6. The Hamming ball ..... 23
7. The martingale method and its application to the Boolean cube ..... 26
8. Concentration in the product spaces. An application: the law of large numbers for bounded functions ..... 32
9. How to sharpen martingale inequalities? ..... 35
10. Concentration and isoperimetry ..... 39
11. Why do we care: some examples leading to discrete measure concentration questions ..... 44
12. Large deviations and isoperimetry ..... 44
13. Gaussian measure on Euclidean space as a limit of the projection of the uniform measure on the sphere ..... 53
14. Isoperimetric inequalities for the sphere and for the Gaussian measure ..... 57
15. Another concentration inequality for the Gaussian measure ..... 63
16. Smoothing Lipschitz functions ..... 67
17. Concentration on the sphere and functions almost constant on a subspace ..... 69
18. Dvoretzky’s Theorem ..... 71
19. The Prekopa-Leindler Inequality ..... 77
20. Logarithmically concave measures. The Brunn-Minkowski Inequality ..... 79
21. The isoperimetric inequality for the Lebesgue measure in R^n ..... 81
22. The concentration on the sphere and other strictly convex surfaces ..... 82
23. Concentration for the Gaussian measure and other strictly log-concave measures ..... 85
24. Concentration for general log-concave measures ..... 86
25. An application: reverse Holder inequalities for norms ..... 88
26. Log-concave measures as projections of the Lebesgue measure ..... 94
27. The needle decomposition technique for proving dimension-free inequalities ..... 98
28. Some applications: proving reverse Holder inequalities ..... 106
29. Graphs, eigenvalues, and isoperimetry ..... 110
30. Expansion and concentration ..... 117

Lecture 1. Wednesday, January 5

1. Introduction

Measure concentration is a fairly general phenomenon, which asserts that a “reasonable” function f : X −→ R defined on a “large” probability space X “almost always” takes values that are “very close” to the average value of f on X. Here is an introductory example, which allows us to remove the quotation marks, although we don't prove anything yet.

(1.1) Example: concentration on the sphere. Let R^n be the n-dimensional Euclidean space of all n-tuples x = (ξ_1, . . . , ξ_n) with the scalar product

    ⟨x, y⟩ = ∑_{i=1}^{n} ξ_i η_i for x = (ξ_1, . . . , ξ_n) and y = (η_1, . . . , η_n),

and the norm

    ‖x‖ = √⟨x, x⟩ = ( ∑_{i=1}^{n} ξ_i² )^{1/2} for x = (ξ_1, . . . , ξ_n).

Let

    S^{n−1} = { x ∈ R^n : ‖x‖ = 1 }

be the unit sphere. We introduce the geodesic metric on the sphere,

    dist(x, y) = arccos ⟨x, y⟩,

the distance between x and y being the angle between x and y (check that this is indeed a metric), and the rotation-invariant probability measure µ (the existence and uniqueness of such a measure is pretty intuitive, although not so easy to prove formally).

Let f : S^{n−1} −→ R be a function which is 1-Lipschitz:

    |f(x) − f(y)| ≤ dist(x, y) for all x, y ∈ S^{n−1}.

Let m_f be the median of f, that is, the number such that

    µ{ x : f(x) ≥ m_f } ≥ 1/2 and µ{ x : f(x) ≤ m_f } ≥ 1/2

(prove that for a Lipschitz function f, the median m_f exists and is unique). Then, for any ε ≥ 0,

    µ{ x ∈ S^{n−1} : |f(x) − m_f| ≤ ε } ≥ 1 − √(π/2) e^{−ε²(n−2)/2}.


In particular, the value of f(x) at a “typical” point x ∈ S^{n−1} deviates from the median only by about 1/√n: that is, if ε_n √n −→ +∞, however slowly, the probability that f(x) does not deviate from m_f by more than ε_n tends to 1.

PROBLEM. Check the assertion for a linear function f : S^{n−1} −→ R, f(x) = ⟨a, x⟩ for some a ∈ R^n.

Although the sphere looks like a nice and intuitive object, it is not so easy to work with. The Gaussian measure, although not so intuitive, is much nicer to work with analytically.

2. “Condensing” the Sphere from the Gaussian Measure

As we all know,

    (1/√(2π)) ∫_{−∞}^{+∞} e^{−x²/2} dx = 1.

The probability measure on R^1 with the density

    (1/√(2π)) e^{−x²/2}

is called the standard Gaussian measure on the line R^1. We define the standard Gaussian measure γ_n on R^n as the probability measure with the density

    (2π)^{−n/2} e^{−‖x‖²/2},

so

    γ_n(A) = (2π)^{−n/2} ∫_A e^{−‖x‖²/2} dx

for a measurable set A. We have

    γ_n(R^n) = (2π)^{−n/2} ∫_{R^n} e^{−‖x‖²/2} dx = 1.

Our immediate goal is to prove that for an overwhelming majority of vectors x ∈ R^n with respect to the Gaussian measure γ_n, we have ‖x‖ ∼ √n. In other words, from the point of view of the standard Gaussian measure, the space R^n “looks like” the sphere of radius √n. This will be our first rigorous concentration result, and the proof will demonstrate some important technique.

(2.1) How to prove inequalities: the Laplace transform method. Let X be a space with a probability measure µ and let f : X −→ R be a function. Suppose we want to estimate

    µ{ x : f(x) ≥ a } for some a ∈ R.

Here is an amazingly efficient way to do it. Let us choose a λ > 0. Then

    f(x) ≥ a =⇒ λf ≥ λa =⇒ e^{λf} ≥ e^{λa}.

Now, e^{λf} is a positive function on X and e^{λa} is a positive number, hence

    ∫_X e^{λf} dµ ≥ ∫_{x: f(x)≥a} e^{λf} dµ ≥ µ{ x ∈ X : f(x) ≥ a } e^{λa},

from which

    µ{ x : f(x) ≥ a } ≤ e^{−λa} ∫_X e^{λf} dµ.

Often, we can compute exactly or estimate the value of the integral ∫_X e^{λf} dµ. At that point, we “tune up” λ so that we get the strongest inequality possible.

Suppose we want to estimate

    µ{ x : f(x) ≤ a } for some a ∈ R.

Let us choose a λ > 0. Then

    f(x) ≤ a =⇒ −λf(x) ≥ −λa =⇒ e^{−λf} ≥ e^{−λa}.

Now, e^{−λf} is still a positive function and e^{−λa} is still a positive number, hence

    ∫_X e^{−λf} dµ ≥ ∫_{x: f(x)≤a} e^{−λf} dµ ≥ µ{ x ∈ X : f(x) ≤ a } e^{−λa},

from which

    µ{ x : f(x) ≤ a } ≤ e^{λa} ∫_X e^{−λf} dµ.

Often, we can compute exactly or estimate the value of the integral ∫_X e^{−λf} dµ. At that point, we “tune up” λ so that we get the strongest inequality possible.
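
For readers who like to experiment, here is a small Python sketch (an illustration, not part of the notes) of the “tune up” step for the standard Gaussian measure on the line, where ∫ e^{λx} dµ = e^{λ²/2} can be computed exactly; the grid search over λ and the comparison with the exact tail are choices made here purely for illustration.

    # Laplace transform method for mu{x >= a}, mu the standard Gaussian on R:
    # the bound is exp(-lambda*a) * E exp(lambda*x) = exp(-lambda*a + lambda^2/2).
    import math

    def laplace_bound(a, lam):
        return math.exp(-lam * a + lam ** 2 / 2.0)

    a = 3.0
    # scan lambda and keep the strongest bound; the optimum is at lambda = a
    best = min(laplace_bound(a, lam / 100.0) for lam in range(1, 1000))
    exact_tail = 0.5 * math.erfc(a / math.sqrt(2.0))
    print(best, math.exp(-a * a / 2.0), exact_tail)

The tuned bound e^{−a²/2} misses the exact tail only by a factor that grows polynomially in a, which is typical of the method.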

Let us apply the method of Section 2.1 to the Gaussian measure.

(2.2) Proposition.
(1) For any δ ≥ 0,

    γ_n{ x ∈ R^n : ‖x‖² ≥ n + δ } ≤ ( n/(n + δ) )^{−n/2} e^{−δ/2}.

(2) For any 0 < δ ≤ n,

    γ_n{ x ∈ R^n : ‖x‖² ≤ n − δ } ≤ ( n/(n − δ) )^{−n/2} e^{δ/2}.

A more or less straightforward corollary:

(2.3) Corollary. For any 0 < ε < 1,

    γ_n{ x ∈ R^n : ‖x‖² ≥ n/(1 − ε) } ≤ e^{−ε²n/4} and
    γ_n{ x ∈ R^n : ‖x‖² ≤ (1 − ε)n } ≤ e^{−ε²n/4}.

Now to proofs.

Proof of Proposition 2.2. To prove Part 1, let us choose a 0 < λ < 1 (to be adjusted later). Then

    ‖x‖² ≥ n + δ =⇒ λ‖x‖²/2 ≥ λ(n + δ)/2 =⇒ e^{λ‖x‖²/2} ≥ e^{λ(n+δ)/2}.

Reasoning as in Section 2.1, we get

    γ_n{ x ∈ R^n : ‖x‖² ≥ n + δ } ≤ e^{−λ(n+δ)/2} ∫_{R^n} e^{λ‖x‖²/2} dγ_n
        = e^{−λ(n+δ)/2} (2π)^{−n/2} ∫_{R^n} e^{(λ−1)‖x‖²/2} dx.

Quite handily, the integral can be computed exactly, because it factors into the product of 1-dimensional integrals:

    (2π)^{−n/2} ∫_{R^n} e^{(λ−1)‖x‖²/2} dx = ( (1/√(2π)) ∫_{−∞}^{+∞} e^{(λ−1)x²/2} dx )^n = (1 − λ)^{−n/2}

(to compute the 1-dimensional integral, make the substitution x = y/√(1 − λ)). Hence

    γ_n{ x ∈ R^n : ‖x‖² ≥ n + δ } ≤ (1 − λ)^{−n/2} e^{−λ(n+δ)/2}.

Now we choose λ = δ/(n + δ) and Part 1 follows.

To prove Part 2, let us choose λ > 0 (to be adjusted later). Then

    ‖x‖² ≤ n − δ =⇒ −λ‖x‖²/2 ≥ −λ(n − δ)/2 =⇒ e^{−λ‖x‖²/2} ≥ e^{−λ(n−δ)/2}.

Reasoning as in Section 2.1, we get

    γ_n{ x ∈ R^n : ‖x‖² ≤ n − δ } ≤ e^{λ(n−δ)/2} ∫_{R^n} e^{−λ‖x‖²/2} dγ_n
        = e^{λ(n−δ)/2} (2π)^{−n/2} ∫_{R^n} e^{−(λ+1)‖x‖²/2} dx.

Again, the integral is computed exactly:

    (2π)^{−n/2} ∫_{R^n} e^{−(λ+1)‖x‖²/2} dx = ( (1/√(2π)) ∫_{−∞}^{+∞} e^{−(λ+1)x²/2} dx )^n = (1 + λ)^{−n/2}

(to compute the 1-dimensional integral, make the substitution x = y/√(1 + λ)). Hence

    γ_n{ x ∈ R^n : ‖x‖² ≤ n − δ } ≤ (1 + λ)^{−n/2} e^{λ(n−δ)/2}.

Now we choose λ = δ/(n − δ) and Part 2 follows.

Lecture 2. Friday, January 7

2. “Condensing” the Sphere from the Gaussian Measure, Continued

Proof of Corollary 2.3. Let us choose δ = nε/(1 − ε) in Part 1 of Proposition 2.2. Then n + δ = n/(1 − ε) and

    γ_n{ x ∈ R^n : ‖x‖² ≥ n/(1 − ε) } ≤ (1 − ε)^{−n/2} exp{ −nε/(2(1 − ε)) } = exp{ −(n/2) ( ε/(1 − ε) + ln(1 − ε) ) }.

Now,

    ε/(1 − ε) = ε + ε² + ε³ + . . . and ln(1 − ε) = −ε − ε²/2 − ε³/3 − . . . ,

from which

    ε/(1 − ε) + ln(1 − ε) ≥ ε²/2

and

    γ_n{ x ∈ R^n : ‖x‖² ≥ n/(1 − ε) } ≤ e^{−ε²n/4},

as promised.

Similarly, let us choose δ = nε in Part 2 of Proposition 2.2. Then n − δ = (1 − ε)n and

    γ_n{ x ∈ R^n : ‖x‖² ≤ (1 − ε)n } ≤ (1 − ε)^{n/2} e^{εn/2} = exp{ (n/2) ( ln(1 − ε) + ε ) }.

Now,

    ln(1 − ε) = −ε − ε²/2 − ε³/3 − . . . ,

from which

    ln(1 − ε) + ε ≤ −ε²/2

and

    γ_n{ x ∈ R^n : ‖x‖² ≤ (1 − ε)n } ≤ e^{−ε²n/4},

as promised.

One geometric interpretation of Corollary 2.3 is as follows: let us choose a sequence ρ_n −→ +∞, however slowly, and consider a “fattening” of the sphere

    { x ∈ R^n : √n − ρ_n ≤ ‖x‖ ≤ √n + ρ_n }.

Then the Gaussian measure γ_n of that “fattening” approaches 1 as n increases (let ε_n = ρ_n/√n in Corollary 2.3).
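
A quick numerical illustration (not from the notes, assuming the numpy library is available): sampling from γ_n and looking at ‖x‖/√n shows exactly the concentration promised by Corollary 2.3.

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (10, 100, 1000, 10000):
        x = rng.standard_normal((2000, n))           # 2000 samples from gamma_n
        r = np.linalg.norm(x, axis=1) / np.sqrt(n)   # ||x|| / sqrt(n)
        print(n, round(r.mean(), 3), round(r.std(), 4))
    # the ratio tends to 1 and its spread shrinks on the scale 1/sqrt(n)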

3. The Johnson-Lindenstrauss “Flattening” Lemma

The Gaussian measure behaves nicely with respect to orthogonal projections.

(3.1) Gaussian measures and projections. Let γ_n be the standard Gaussian measure in R^n and let L ⊂ R^n be a k-dimensional subspace. Let p : R^n −→ L be the orthogonal projection. Then the “push-forward” measure p(γ_n) is just the standard Gaussian measure on L, with the density (2π)^{−k/2} e^{−‖x‖²/2} for x ∈ L.

Recall (or just be aware) that if we have a map φ : X −→ Y and a measure µ on X, then the push-forward measure ν = φ(µ) on Y is defined by ν(A) = µ( φ^{−1}(A) ) for subsets A ⊂ Y. The reason why the push-forward of the Gaussian measure under an orthogonal projection is the Gaussian measure on the image is as follows: since the Gaussian density is invariant under orthogonal transformations of R^n, rotating the subspace, if necessary, we may assume that L is the coordinate subspace consisting of the points (ξ_1, . . . , ξ_k, 0, . . . , 0), where the last n − k coordinates are 0's. Suppose that A ⊂ L is a measurable set and that B = p^{−1}(A) is the preimage of A. Since the Gaussian density splits into the product of the coordinate Gaussian densities, we have

    γ_n(B) = (2π)^{−n/2} ∫_B exp{ −(ξ_1² + . . . + ξ_n²)/2 } dξ_1 · · · dξ_n
           = (2π)^{−k/2} ∫_A exp{ −(ξ_1² + . . . + ξ_k²)/2 } dξ_1 · · · dξ_k × ∏_{i=k+1}^{n} (1/√(2π)) ∫_{−∞}^{+∞} e^{−ξ_i²/2} dξ_i
           = γ_k(A).

This fact that the projection of the standard Gaussian measure is the standard Gaussian measure on the image is very useful and important. Often, we will word it more informally as: “if a vector x has the standard Gaussian distribution in R^n, its orthogonal projection y onto a given subspace L has the standard Gaussian distribution in L”.
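
As a sanity check, one can test this numerically; the sketch below (not part of the notes, assuming numpy) projects Gaussian samples onto a randomly chosen k-dimensional subspace and verifies that the coordinates of the projection have mean about 0 and covariance about the identity.

    import numpy as np

    rng = np.random.default_rng(1)
    n, k, samples = 20, 5, 200_000
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal basis of L
    x = rng.standard_normal((samples, n))              # samples from gamma_n
    y = x @ Q                                          # coordinates of x_L in that basis
    print(np.round(y.mean(axis=0), 2))                 # ~ zero vector
    print(np.round(np.cov(y.T), 2))                    # ~ k x k identity matrix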

(3.2) Lemma. Let γ_n be the standard Gaussian measure in R^n and let L ⊂ R^n be a k-dimensional subspace. For a vector x ∈ R^n, let x_L denote the orthogonal projection of x onto L. Then, for any 0 < ε < 1,

    γ_n{ x ∈ R^n : √(n/k) ‖x_L‖ ≥ (1 − ε)^{−1} ‖x‖ } ≤ e^{−ε²k/4} + e^{−ε²n/4} and
    γ_n{ x ∈ R^n : √(n/k) ‖x_L‖ ≤ (1 − ε) ‖x‖ } ≤ e^{−ε²k/4} + e^{−ε²n/4}.

Proof. By Corollary 2.3,

    γ_n{ x ∈ R^n : ‖x‖ ≥ √((1 − ε)n) } ≥ 1 − e^{−ε²n/4}.

If x has the standard Gaussian distribution in R^n, then the projection x_L has the standard Gaussian distribution γ_k in L (cf. Section 3.1), and so by Corollary 2.3 applied to L we have

    γ_n{ x ∈ R^n : ‖x_L‖ ≤ √(k/(1 − ε)) } ≥ 1 − e^{−ε²k/4}.

Therefore,

    γ_n{ x ∈ R^n : √(n/k) ‖x_L‖ ≤ (1 − ε)^{−1} ‖x‖ } ≥ 1 − e^{−ε²n/4} − e^{−ε²k/4}.

Similarly,

    γ_n{ x ∈ R^n : ‖x‖ ≤ √(n/(1 − ε)) } ≥ 1 − e^{−ε²n/4} and
    γ_n{ x ∈ R^n : ‖x_L‖ ≥ √((1 − ε)k) } ≥ 1 − e^{−ε²k/4},

from which

    γ_n{ x ∈ R^n : √(n/k) ‖x_L‖ ≥ (1 − ε) ‖x‖ } ≥ 1 − e^{−ε²n/4} − e^{−ε²k/4}.

(3.3) The Gaussian measure in R^n and the uniform measure on the sphere. Let us consider the radial projection φ : R^n \ {0} −→ S^{n−1}, x ↦ x/‖x‖. Not surprisingly, the push-forward of the standard Gaussian measure γ_n on R^n will be the uniform probability measure on the sphere S^{n−1}. This is so because the Gaussian measure is rotation-invariant, so the push-forward must be rotation-invariant, and there is a unique (Borel) rotation-invariant probability measure on S^{n−1}. We don't prove the existence and uniqueness of the rotation-invariant Borel probability measure on S^{n−1}; instead, we describe a simple procedure to sample a point from this distribution: sample a random vector x from the Gaussian measure on R^n and project radially: x ↦ x/‖x‖ (note that x ≠ 0 with probability 1).
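
In code, this sampling procedure is a two-liner; the sketch below (an illustration, not part of the notes, assuming numpy) produces a point distributed uniformly on S^{n−1}.

    import numpy as np

    def random_sphere_point(n, rng=np.random.default_rng()):
        x = rng.standard_normal(n)       # x has the standard Gaussian distribution
        return x / np.linalg.norm(x)     # radial projection; x != 0 with probability 1

    u = random_sphere_point(5)
    print(u, np.linalg.norm(u))          # a unit vector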

(3.4) Corollary. Let µ_n be the rotation-invariant probability measure on the unit sphere S^{n−1} and let L ⊂ R^n be a k-dimensional subspace. For a vector x ∈ S^{n−1}, let x_L denote the orthogonal projection of x onto L. Then, for any 0 < ε < 1,

    µ_n{ x ∈ S^{n−1} : √(n/k) ‖x_L‖ ≥ (1 − ε)^{−1} ‖x‖ } ≤ e^{−ε²k/4} + e^{−ε²n/4} and
    µ_n{ x ∈ S^{n−1} : √(n/k) ‖x_L‖ ≤ (1 − ε) ‖x‖ } ≤ e^{−ε²k/4} + e^{−ε²n/4}.

Proof. The ratio ‖x_L‖/‖x‖ does not change under the radial projection φ : x ↦ x/‖x‖. Hence the result follows from Lemma 3.2 and the discussion of Section 3.3 above.

Now, instead of fixing a subspace L ⊂ R^n and choosing a random x ∈ S^{n−1}, we fix a vector x ∈ S^{n−1} and choose a random k-dimensional subspace L ⊂ R^n. It is intuitively obvious that “nothing changes”.

(3.5) The orthogonal group, the Grassmannian, and the Gaussian measure. To explain rigorously why “nothing changes” would require us to prove some basic results about the existence and uniqueness of the Haar measure. This would lead us too far off the course. Instead, we present some informal discussion. Let G_k(R^n) be the Grassmannian, that is, the set of all k-dimensional subspaces of R^n. The set G_k(R^n) has a natural structure of a compact metric space: for example, one can define the distance between two subspaces as the Hausdorff distance between the unit balls in those subspaces. The metric is invariant under the action of the orthogonal group, and the action is transitive on G_k(R^n). Therefore, there exists a unique (Borel) probability measure µ_{n,k} on G_k(R^n) invariant under the action of the orthogonal group. Again, we don't define this measure; instead, we describe a procedure to sample a random subspace with respect to µ_{n,k}: sample k vectors x_1, . . . , x_k ∈ R^n independently from the standard Gaussian measure in R^n and let L be the subspace spanned by x_1, . . . , x_k (note that x_1, . . . , x_k are linearly independent with probability 1, so dim L = k). In other words, we get the measure µ_{n,k} as the push-forward of the product γ_n × . . . × γ_n on R^n ⊕ . . . ⊕ R^n under the map (x_1, . . . , x_k) ↦ span{ x_1, . . . , x_k }. Since the result is invariant under orthogonal transformations, the resulting measure should coincide with µ_{n,k}.

Similarly, let O_n be the group of the orthogonal transformations of R^n. Then O_n has an invariant metric: for example, for two orthogonal transformations A and B, let dist(A, B) = max_{x∈S^{n−1}} dist(Ax, Bx). This makes O_n a compact metric space, so there is a unique Borel probability measure ν_n invariant under the right and left multiplications by orthogonal matrices. To sample an orthogonal matrix A from ν_n, we sample n vectors x_1, . . . , x_n independently from the standard Gaussian measure in R^n, apply the Gram-Schmidt orthogonalization process x_1, . . . , x_n ↦ u_1, . . . , u_n, and let A be the matrix with the columns u_1, . . . , u_n (note that x_1, . . . , x_n are linearly independent with probability 1, so the process works). This allows us to understand ν_n as the push-forward of the product γ_n × . . . × γ_n on R^n ⊕ . . . ⊕ R^n under the map (x_1, . . . , x_n) ↦ Gram-Schmidt orthogonalization of x_1, . . . , x_n.
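
Both sampling procedures are easy to carry out; the following sketch (not from the notes, assuming numpy, and using a QR factorization in place of an explicit Gram-Schmidt loop) produces a random k-dimensional subspace and a random orthogonal matrix.

    import numpy as np

    rng = np.random.default_rng(2)

    def random_subspace_basis(n, k):
        # an orthonormal basis of span{x_1,...,x_k} for Gaussian x_1,...,x_k;
        # the subspace it spans is distributed according to mu_{n,k}
        g = rng.standard_normal((n, k))
        q, _ = np.linalg.qr(g)
        return q

    def random_orthogonal(n):
        g = rng.standard_normal((n, n))
        q, r = np.linalg.qr(g)
        # fix the signs so that the factorization agrees with Gram-Schmidt
        return q * np.sign(np.diag(r))

    A = random_orthogonal(4)
    print(np.round(A @ A.T, 6))   # ~ identity: A is indeed orthogonal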

Now we can state a version of the famous Johnson-Lindenstrauss “Flattening” Lemma.

(3.6) Lemma. Let x ∈ R^n be a non-zero vector and let µ_{n,k} be the invariant probability measure on the Grassmannian G_k(R^n) of k-dimensional subspaces in R^n. For L ∈ G_k(R^n), let x_L be the orthogonal projection of x onto L.

Then, for any 0 < ε < 1,

    µ_{n,k}{ L ∈ G_k(R^n) : √(n/k) ‖x_L‖ ≥ (1 − ε)^{−1} ‖x‖ } ≤ e^{−ε²k/4} + e^{−ε²n/4} and
    µ_{n,k}{ L ∈ G_k(R^n) : √(n/k) ‖x_L‖ ≤ (1 − ε) ‖x‖ } ≤ e^{−ε²k/4} + e^{−ε²n/4}.

Lecture 3. Monday, January 10

3. The Johnson-Lindenstrauss “Flattening” Lemma, Continued

Proof of Lemma 3.6. Scaling, if necessary, we may assume that ‖x‖ = 1, so that x ∈ S^{n−1}. Let us choose a k-dimensional subspace L_0 ⊂ R^n. If U ∈ O_n is an orthogonal transformation, then L = U(L_0) = { Uy : y ∈ L_0 } is a k-dimensional subspace of R^n. Moreover, as U ranges over the orthogonal group O_n, the subspace L ranges over the Grassmannian G_k(R^n), and if we apply a “random” U, we get a “random” L, meaning that the invariant probability measure µ_{n,k} on G_k(R^n) is the push-forward of the invariant probability measure ν_n on O_n under the map U ↦ U(L_0). Therefore,

    µ_{n,k}{ L ∈ G_k(R^n) : √(n/k) ‖x_L‖ ≥ (1 − ε)^{−1} } = ν_n{ U ∈ O_n : √(n/k) ‖x_{U(L_0)}‖ ≥ (1 − ε)^{−1} } and
    µ_{n,k}{ L ∈ G_k(R^n) : √(n/k) ‖x_L‖ ≤ (1 − ε) } = ν_n{ U ∈ O_n : √(n/k) ‖x_{U(L_0)}‖ ≤ (1 − ε) }.

Now, the length ‖x_{U(L_0)}‖ of the projection of x onto U(L_0) is equal to the length of the projection of y = U^{−1}x onto L_0. As U ranges over the orthogonal group O_n, the vector y = U^{−1}x ranges over the unit sphere S^{n−1}, and if we apply a “random” U, we get a “random” y, meaning that the invariant probability measure µ_n on S^{n−1} is the push-forward of the invariant probability measure ν_n on O_n under the map U ↦ U^{−1}x.

Therefore,

    ν_n{ U ∈ O_n : √(n/k) ‖x_{U(L_0)}‖ ≥ (1 − ε)^{−1} } = µ_n{ y ∈ S^{n−1} : √(n/k) ‖y_{L_0}‖ ≥ (1 − ε)^{−1} } and
    ν_n{ U ∈ O_n : √(n/k) ‖x_{U(L_0)}‖ ≤ (1 − ε) } = µ_n{ y ∈ S^{n−1} : √(n/k) ‖y_{L_0}‖ ≤ (1 − ε) }.

The proof follows by Corollary 3.4.

Finally, we arrive at the following result, which is also referred to as a version of the Johnson-Lindenstrauss “Flattening” Lemma.

(3.7) Theorem. Let a_1, . . . , a_N be points in R^n. Given an ε > 0, let us choose an integer k such that

    N(N − 1) ( e^{−kε²/4} + e^{−nε²/4} ) ≤ 1/3.

Remark: for example, k ≥ 4ε^{−2} ln(6N²) will do. Assuming that k ≤ n, let L ⊂ R^n be a k-dimensional subspace chosen at random with respect to the invariant probability measure µ_{n,k} on the Grassmannian G_k(R^n). Let a′_1, . . . , a′_N be the orthogonal projections of a_1, . . . , a_N onto L. Then,

    (1 − ε) ‖a_i − a_j‖ ≤ √(n/k) ‖a′_i − a′_j‖ ≤ (1 − ε)^{−1} ‖a_i − a_j‖ for all 1 ≤ i, j ≤ N

with probability at least 2/3.

Proof. Let c_{ij} = a_i − a_j for i > j. There are (N choose 2) = N(N − 1)/2 vectors c_{ij}, and the length of the orthogonal projection c′_{ij} of c_{ij} onto the subspace L is equal to ‖a′_i − a′_j‖. Using Lemma 3.6, we conclude that the probability that for some pair i, j we have either

    √(n/k) ‖c′_{ij}‖ ≥ (1 − ε)^{−1} ‖c_{ij}‖ or √(n/k) ‖c′_{ij}‖ ≤ (1 − ε) ‖c_{ij}‖

(or both) is at most 1/3.
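
The whole argument is easy to test in a computer experiment. The sketch below (an illustration, not part of the notes, assuming numpy) takes N = 50 random points in R^n with n = 2000, projects them onto a random subspace of dimension k = 1000 (which satisfies the condition of Theorem 3.7 for ε = 0.2), and reports the worst relative distortion of a pairwise distance.

    import numpy as np

    rng = np.random.default_rng(3)
    n, N, k, eps = 2000, 50, 1000, 0.2
    a = rng.standard_normal((N, n))                    # N points in R^n
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))   # basis of a random L
    b = (a @ Q) * np.sqrt(n / k)                       # scaled projections

    worst = 0.0
    for i in range(N):
        for j in range(i):
            ratio = np.linalg.norm(b[i] - b[j]) / np.linalg.norm(a[i] - a[j])
            worst = max(worst, abs(ratio - 1.0))
    print(worst)   # typically far inside the allowed window (1 - eps, 1/(1 - eps))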


(3.8) Sharpening the estimates. Given a positive integer N and an ε > 0, what is the smallest possible dimension k so that for any N points a_1, . . . , a_N in Euclidean space R^n there are N points b_1, . . . , b_N in R^k with the property that

    (1 − ε) dist(a_i, a_j) ≤ dist(b_i, b_j) ≤ (1 − ε)^{−1} dist(a_i, a_j) for all i, j,

where dist is the Euclidean distance? It looks like a difficult question, but one thing is clear: the dimension k should not depend on n, since we can always consider R^n as a subspace of a higher-dimensional space. Reasoning as in the proof of Theorem 3.7 and making n −→ +∞ (which, in principle, provides us with more choice for a random subspace L), we can choose any positive integer k such that

    k > 4ε^{−2} ln( N(N − 1) ).

For example, if N = 10^9 and ε = 0.1 (so that we allow a 10% distortion), we can choose k = 16,579. It is not clear whether better estimates are possible; if they are, we need better methods.

Lecture 4. Wednesday, January 12

4. Concentration in The Boolean Cube

Let us switch gears and consider something very discrete.

(4.1) The Boolean cube. Let I_n = {0, 1}^n be the Boolean cube, that is, the set of all 2^n sequences of length n of 0's and 1's. Combinatorially, we can think of I_n as the set of all 2^n subsets of the set {1, . . . , n}.

We make I_n a metric space by introducing the Hamming distance: the distance between two points x = (ξ_1, . . . , ξ_n) and y = (η_1, . . . , η_n) is the number of coordinates where x and y disagree:

    dist(x, y) = |{ i : ξ_i ≠ η_i }|.

Furthermore, let us introduce the counting probability measure µ_n on I_n by µ_n{x} = 2^{−n}.

QUESTIONS: what is the diameter of I_n? What is the distance between two “typical” points in I_n (whatever that means)?

The Boolean cube provides a very good model to observe various concentration phenomena. Since I_n is finite, all functions f : I_n −→ R are measurable, all integrals are sums, and we don't have to worry about subtleties. Later, the Boolean cube will serve as an inspiration to tackle more complicated spaces. In view of that, we will freely interchange between

    ∫_{I_n} f dµ_n and (1/2^n) ∑_{x∈I_n} f(x).

Our next goal is to prove that sets A ⊂ I_n with µ_n(A) = 1/2 have large (measure-wise) ε-neighborhoods for moderate ε's of order √n. To do that, we prove a concentration inequality for the function f(x) = dist(x, A), where, as usual, dist(x, A) = min_{y∈A} dist(x, y). We use the Laplace transform method of Section 2.1. The result below is a particular case of a more general result by M. Talagrand. We also adopt Talagrand's proof (he used it in a more general situation).

(4.2) Theorem. Let A ⊂ I_n be a non-empty set and let f : I_n −→ R, f(x) = dist(x, A), be the distance from x to A. Then, for any t > 0, we have

    ∫_{I_n} e^{tf} dµ_n ≤ (1/µ_n(A)) ( 1/2 + (e^t + e^{−t})/4 )^n.

Proof. Let us denote

    c(t) = 1/2 + (e^t + e^{−t})/4.

We use induction on the dimension n of the cube. If n = 1, then either A consists of 1 point, in which case µ_1(A) = 1/2 and

    ∫_{I_1} e^{tf} dµ_1 = 1/2 + (1/2) e^t ≤ 2c(t),

so the result holds; or A consists of 2 points, in which case µ_1(A) = 1,

    ∫_{I_1} e^{tf} dµ_1 = 1 ≤ c(t),

and the result holds as well.

Suppose that n > 1. Let us split the cube I_n into two “facets”, depending on the value of the last coordinate: I^1_n = { x : ξ_n = 1 } and I^0_n = { x : ξ_n = 0 }. Note that I^1_n and I^0_n can be identified with the (n − 1)-dimensional cube I_{n−1}. Consequently, let us define subsets A_1, A_0 ⊂ I_{n−1} by

    A_0 = { x ∈ I_{n−1} : (x, 0) ∈ A } and A_1 = { x ∈ I_{n−1} : (x, 1) ∈ A }.

Note that |A| = |A_1| + |A_0|, or, in measure terms,

    µ_n(A) = ( µ_{n−1}(A_1) + µ_{n−1}(A_0) ) / 2.

For a point x ∈ I_n, let x′ ∈ I_{n−1} be the point with the last coordinate omitted. Now, we have

    dist(x, A) = min{ dist(x′, A_1), dist(x′, A_0) + 1 } for every point x ∈ I^1_n and
    dist(x, A) = min{ dist(x′, A_0), dist(x′, A_1) + 1 } for every point x ∈ I^0_n.

Let us define functions f_0, f_1 by f_0(x) = dist(x, A_0) and f_1(x) = dist(x, A_1) for x ∈ I_{n−1}. Then

    ∫_{I_n} e^{tf} dµ_n = ∫_{I^1_n} e^{tf} dµ_n + ∫_{I^0_n} e^{tf} dµ_n
      = (1/2^n) ∑_{x∈I^1_n} exp{ t dist(x, A) } + (1/2^n) ∑_{x∈I^0_n} exp{ t dist(x, A) }
      = (1/2^n) ∑_{x∈I^1_n} exp{ min{ t dist(x′, A_1), t + t dist(x′, A_0) } } + (1/2^n) ∑_{x∈I^0_n} exp{ min{ t dist(x′, A_0), t + t dist(x′, A_1) } }
      = (1/2^n) ∑_{x∈I^1_n} min{ exp{ t dist(x′, A_1) }, e^t exp{ t dist(x′, A_0) } } + (1/2^n) ∑_{x∈I^0_n} min{ exp{ t dist(x′, A_0) }, e^t exp{ t dist(x′, A_1) } }
      = (1/2) ∫_{I_{n−1}} min{ e^{tf_1}, e^t e^{tf_0} } dµ_{n−1} + (1/2) ∫_{I_{n−1}} min{ e^{tf_0}, e^t e^{tf_1} } dµ_{n−1}.

However, the integral of the minimum does not exceed the minimum of the integrals. We carry on, and use the induction hypothesis:

    ∫_{I_n} e^{tf} dµ_n ≤ (1/2) min{ ∫_{I_{n−1}} e^{tf_1} dµ_{n−1}, e^t ∫_{I_{n−1}} e^{tf_0} dµ_{n−1} } + (1/2) min{ ∫_{I_{n−1}} e^{tf_0} dµ_{n−1}, e^t ∫_{I_{n−1}} e^{tf_1} dµ_{n−1} }
      ≤ (1/2) min{ c^{n−1}(t)/µ_{n−1}(A_1), e^t c^{n−1}(t)/µ_{n−1}(A_0) } + (1/2) min{ c^{n−1}(t)/µ_{n−1}(A_0), e^t c^{n−1}(t)/µ_{n−1}(A_1) }
      = ( c^{n−1}(t)/µ_n(A) ) ( (1/2) min{ µ_n(A)/µ_{n−1}(A_1), e^t µ_n(A)/µ_{n−1}(A_0) } + (1/2) min{ µ_n(A)/µ_{n−1}(A_0), e^t µ_n(A)/µ_{n−1}(A_1) } ).

Let us denote

    a_0 = µ_{n−1}(A_0)/µ_n(A) and a_1 = µ_{n−1}(A_1)/µ_n(A), so that a_0 + a_1 = 2.

It all boils down to the question: what is the maximum possible value of

    (1/2) min{ a_1^{−1}, e^t a_0^{−1} } + (1/2) min{ a_0^{−1}, e^t a_1^{−1} },

where a_0 and a_1 are non-negative numbers such that a_0 + a_1 = 2? If a_0 = 0, then a_1 = 2 and the value is

    1/4 + e^t/4 ≤ c(t).

The same value is attained if a_1 = 0 and a_0 = 2.

Suppose that at a maximum point we have a_0, a_1 > 0. If a_1 = a_0 = 1, the value is

    1/2 + 1/2 ≤ c(t).

If a_0 ≠ a_1, without loss of generality we can assume that a_0 > a_1, so that a_0 = 1 + b and a_1 = 1 − b for some 0 < b < 1. Then a_0^{−1} < e^t a_1^{−1}. If a_1^{−1} < e^t a_0^{−1}, the value is

    (1/2)(1 − b)^{−1} + (1/2)(1 + b)^{−1} = 1/(1 − b²).

Increasing b slightly, we increase the value, which is a contradiction. If a_1^{−1} > e^t a_0^{−1}, then the value is

    (1/2) e^t a_0^{−1} + (1/2) a_0^{−1} = (e^t + 1)/(2(1 + b)).

Decreasing b slightly, we increase the value, which is a contradiction. Hence at the maximum point we must have

    a_1^{−1} = e^t a_0^{−1}, that is, a_0 = 2e^t/(1 + e^t) and a_1 = 2/(1 + e^t),

with the value

    (1/2) a_1^{−1} + (1/2) a_0^{−1} = 1/2 + (e^t + e^{−t})/4 = c(t).

This allows us to complete the proof:

    ∫_{I_n} e^{tf} dµ_n ≤ ( c^{n−1}(t)/µ_n(A) ) c(t) = c^n(t)/µ_n(A).

Here is a useful corollary.

(4.3) Corollary. Let A ⊂ I_n be a non-empty set and let f : I_n −→ R, f(x) = dist(x, A), be the distance from x to A. Then, for any t > 0, we have

    ∫_{I_n} e^{tf} dµ_n ≤ (1/µ_n(A)) e^{t²n/4}.

Proof. Follows from Theorem 4.2, since

    e^t = 1 + t + t²/2! + t³/3! + . . . and e^{−t} = 1 − t + t²/2! − t³/3! + . . . ,

so

    1/2 + (e^t + e^{−t})/4 = 1 + t²/4 + t⁴/(2 · 4!) + . . . + t^{2k}/(2 · (2k)!) + . . .
      ≤ 1 + t²/4 + . . . + t^{2k}/(4^k k!) + . . . = e^{t²/4}.

Lecture 6. Wednesday, January 19

Lecture 5 on Friday, January 14, covered the proof of Theorem 4.2 from the previous handout.

4. Concentration in The Boolean Cube, Continued

With Corollary 4.3 in hand, we can show that “almost all” points in the Boolean cube I_n are “very close” to any “sufficiently large” set A ⊂ I_n.

(4.4) Corollary. Let A ⊂ I_n be a non-empty set. Then, for any ε > 0, we have

    µ_n{ x ∈ I_n : dist(x, A) ≥ ε√n } ≤ e^{−ε²}/µ_n(A).

Proof. We use the inequality of the Laplace transform method of Section 2.1. For any t ≥ 0, we have

    µ_n{ x ∈ I_n : dist(x, A) ≥ ε√n } ≤ e^{−tε√n} ∫_{I_n} exp{ t dist(x, A) } dµ_n ≤ ( e^{−tε√n}/µ_n(A) ) e^{t²n/4},

where the last inequality follows by Corollary 4.3. Now, we optimize on t by substituting t = 2ε/√n.

For example, if µ_n(A) = 0.01, that is, the set A contains 1% of the points of the cube, and we choose ε = 2√(ln 10) ≈ 3.035 in Corollary 4.4, we conclude that 99% of the points of the cube I_n are within distance 3.035√n from A.
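
Because the cube is finite, the corollary can be checked by brute force for small n. The sketch below (an illustration, not part of the notes, assuming numpy; the choice of A as a random subset of density 1/8 is arbitrary) compares the two sides of the inequality; for such small n the bound is of course very loose.

    import itertools, math
    import numpy as np

    n = 12
    cube = np.array(list(itertools.product((0, 1), repeat=n)))
    rng = np.random.default_rng(4)
    A = cube[rng.choice(len(cube), size=len(cube) // 8, replace=False)]  # mu_n(A) = 1/8

    # dist(x, A): minimum Hamming distance from each point of the cube to A
    dist_to_A = np.array([np.abs(cube - a).sum(axis=1) for a in A]).min(axis=0)

    for eps in (1.0, 1.5, 2.0):
        lhs = np.mean(dist_to_A >= eps * math.sqrt(n))
        rhs = math.exp(-eps ** 2) / (1 / 8)
        print(eps, round(float(lhs), 4), round(rhs, 4))   # lhs <= rhs every time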

Let us consider a “reasonable” function f : I_n −→ R. By “reasonable” we mean reasonably slowly varying, for example 1-Lipschitz:

    |f(x) − f(y)| ≤ dist(x, y) for all x, y ∈ I_n.

We can show that f “almost always” remains very close to its median, that is, the number m_f such that

    µ_n{ x ∈ I_n : f(x) ≥ m_f } ≥ 1/2 and µ_n{ x ∈ I_n : f(x) ≤ m_f } ≥ 1/2.

(4.5) Theorem. Let f : I_n −→ R be a function such that |f(x) − f(y)| ≤ dist(x, y) for all x, y ∈ I_n. Let m_f be the median of f. Then, for all ε > 0,

    µ_n{ x ∈ I_n : |f(x) − m_f| ≥ ε√n } ≤ 4e^{−ε²}.

Proof. Let us consider two sets

    A_+ = { x ∈ I_n : f(x) ≥ m_f } and A_− = { x ∈ I_n : f(x) ≤ m_f }.

Then µ_n(A_+), µ_n(A_−) ≥ 1/2. Let A_+(ε) be the ε√n-neighborhood of A_+ and A_−(ε) be the ε√n-neighborhood of A_−:

    A_+(ε) = { x ∈ I_n : dist(x, A_+) ≤ ε√n } and
    A_−(ε) = { x ∈ I_n : dist(x, A_−) ≤ ε√n }.

By Corollary 4.4,

    µ_n( A_+(ε) ) ≥ 1 − 2e^{−ε²} and µ_n( A_−(ε) ) ≥ 1 − 2e^{−ε²}.

Therefore, for

    A(ε) = A_+(ε) ∩ A_−(ε) we have µ_n( A(ε) ) ≥ 1 − 4e^{−ε²}.

Now, for every x ∈ A(ε), we have |f(x) − m_f| ≤ ε√n, which completes the proof.

For example, if we choose ε = √(ln 400) ≈ 2.45, we conclude that for 99% of the points of I_n, the function f does not deviate from the median value by more than 2.45√n.


5. Isoperimetric Inequalities in the Boolean Cube: A Soft Version

Looking at Corollary 4.4, we can ask ourselves if we can sharpen the bound for small sets A, where “small” means that µ_n(A) is exponentially small in n, such as (1.1)^{−n}, say (it cannot be smaller than 2^{−n}). That is, what can be the minimum measure of the ε-neighborhood

    A(ε) = { x ∈ I_n : dist(x, A) ≤ ε√n }

of a set A ⊂ I_n of a given measure µ_n(A)? The answer to this question is known, but the construction is particular to the Boolean cube. Instead of discussing it now, let us discuss a softer version first: for a set A ⊂ I_n, let us consider the function f(x) = dist(x, A). Since f is 1-Lipschitz, by Theorem 4.5 it concentrates around its median. Just as well, it should concentrate around its mean value.

PROBLEMS.
1. Let A ⊂ I_n be a non-empty set and let us consider f : I_n −→ R defined by f(x) = dist(x, A). Let m_f be the median of f and let

    a_f = ∫_{I_n} dist(x, A) dµ_n

be the average value of f. Deduce from Theorem 4.5 that

    |m_f − a_f| ≤ √( (n ln n)/2 ) + 4√n.

2. Let A ⊂ I_n be a set consisting of a single point, A = {a}. Prove that

    ∫_{I_n} dist(x, A) dµ_n = n/2.

Deduce that for all non-empty A ⊂ I_n,

    ∫_{I_n} dist(x, A) dµ_n ≤ n/2.

Thus we ask ourselves: given µ_n(A) > 0, how small can the average value of dist(x, A) be? Our next goal is to prove the following result.


(5.1) Theorem. Let A ⊂ I_n be a non-empty set and let

    ρ = 1/2 − (1/n) ∫_{I_n} dist(x, A) dµ_n, so 0 ≤ ρ ≤ 1/2.

Then

    (ln |A|)/n ≤ ρ ln(1/ρ) + (1 − ρ) ln(1/(1 − ρ)).

Remark. The function

    H(ρ) = ρ ln(1/ρ) + (1 − ρ) ln(1/(1 − ρ)) for 0 ≤ ρ ≤ 1

is called the entropy of ρ. Note that H(ρ) = H(1 − ρ) and that H(0) = H(1) = 0. The maximum value of H is attained at ρ = 1/2 and is equal to H(1/2) = ln 2. For 0 ≤ ρ ≤ 1/2, H(ρ) is increasing, and for 1/2 ≤ ρ ≤ 1, H(ρ) is decreasing.

The inequality of Theorem 5.1 should be understood as follows: if the set A is reasonably large, then the average distance dist(x, A) from x ∈ I_n to A cannot be very large. For example, if |A| = 2^n, the left hand side of the inequality is equal to ln 2, so that ρ must be equal to 1/2 and the average distance from x ∈ I_n to A is 0, which is indeed the case since A is the whole cube. On the opposite end, the minimum possible value of the right hand side is 0, at ρ = 0. In this case, the average distance from x to A is n/2 and |A| should be equal to 1, so A must be a point. If |A| ≥ (1.3)^n, then we must have H(ρ) ≥ 0.26, so ρ ≥ 0.07, and the average distance from x ∈ I_n to A should be at most 0.43n. If we try to apply Corollary 4.4 to get the right hand side of the inequality less than 1, we should choose ε ≥ √(−n ln 0.65) ≈ 0.66√n, and we get the vacuous estimate that there are points x ∈ I_n with dist(x, A) ≤ 0.66n. The moral of the story is that Corollary 4.4 is not optimal for sets having exponentially small measures.

We will see that, as n −→ +∞, the estimate of Theorem 5.1 is optimal. We will deduce Theorem 5.1 from the following lemma.

(5.2) Lemma. Let A ⊂ I_n be a non-empty set. Then, for every t ≥ 0,

    ln |A| + t ∫_{I_n} dist(x, A) dµ_n ≤ n ln( e^{t/2} + e^{−t/2} ).

Proof. We prove the inequality by induction on n, and in doing so we mimic the proof of Theorem 4.2. Let c(t) = ln( e^{t/2} + e^{−t/2} ).

For n = 1, there are two cases: if |A| = 1, then the inequality reads t/2 ≤ c(t), which is indeed the case. If |A| = 2, the inequality reads ln 2 ≤ c(t), which is also the case.

Suppose that n > 1. Let us define the facets I^0_n, I^1_n ⊂ I_n, the sets A_0, A_1 ⊂ I_{n−1}, and x′ ∈ I_{n−1} for x ∈ I_n as in the proof of Theorem 4.2. Let us also denote f(x) = dist(x, A), f_0(x) = dist(x, A_0), and f_1(x) = dist(x, A_1). So

    dist(x, A) = min{ dist(x′, A_1), dist(x′, A_0) + 1 } for every point x ∈ I^1_n and
    dist(x, A) = min{ dist(x′, A_0), dist(x′, A_1) + 1 } for every point x ∈ I^0_n.

Then

    ∫_{I_n} f(x) dµ_n = ∫_{I^0_n} f(x) dµ_n + ∫_{I^1_n} f(x) dµ_n.

Besides,

    ∫_{I^0_n} f(x) dµ_n ≤ min{ (1/2) ∫_{I_{n−1}} f_0 dµ_{n−1}, 1/2 + (1/2) ∫_{I_{n−1}} f_1 dµ_{n−1} } and
    ∫_{I^1_n} f(x) dµ_n ≤ min{ (1/2) ∫_{I_{n−1}} f_1 dµ_{n−1}, 1/2 + (1/2) ∫_{I_{n−1}} f_0 dµ_{n−1} }.

Summarizing,

    ∫_{I_n} f dµ_n ≤ min{ (1/2) ∫_{I_{n−1}} f_0 dµ_{n−1} + (1/2) ∫_{I_{n−1}} f_1 dµ_{n−1}, 1/2 + ∫_{I_{n−1}} f_0 dµ_{n−1}, 1/2 + ∫_{I_{n−1}} f_1 dµ_{n−1} }.

Since |A| = |A_0| + |A_1|, we can write |A_0| = λ|A|, |A_1| = (1 − λ)|A| for some 0 ≤ λ ≤ 1. In other words,

    ln |A| = ln |A_0| + ln(1/λ), ln |A| = ln |A_1| + ln(1/(1 − λ)), and
    ln |A| = (1/2) ln |A_0| + (1/2) ln |A_1| + (1/2) ln(1/λ) + (1/2) ln(1/(1 − λ)).

Using the induction hypothesis, we conclude that

    ln |A| + (t/2) ∫_{I_{n−1}} f_0 dµ_{n−1} + (t/2) ∫_{I_{n−1}} f_1 dµ_{n−1} ≤ (n − 1)c(t) + (1/2) ln(1/λ) + (1/2) ln(1/(1 − λ)),
    ln |A| + t/2 + t ∫_{I_{n−1}} f_0 dµ_{n−1} ≤ (n − 1)c(t) + t/2 + ln(1/λ), and
    ln |A| + t/2 + t ∫_{I_{n−1}} f_1 dµ_{n−1} ≤ (n − 1)c(t) + t/2 + ln(1/(1 − λ)),

from which

    ln |A| + t ∫_{I_n} f dµ_n ≤ (n − 1)c(t) + min{ (1/2) ln(1/λ) + (1/2) ln(1/(1 − λ)), t/2 + ln(1/λ), t/2 + ln(1/(1 − λ)) }.


Hence it all boils down to proving that

    min{ (1/2) ln(1/λ) + (1/2) ln(1/(1 − λ)), t/2 + ln(1/λ), t/2 + ln(1/(1 − λ)) } ≤ c(t)

for all 0 ≤ λ ≤ 1.

If λ = 0 or λ = 1, the minimum is t/2 ≤ c(t). If λ = 1/2, the minimum is ln 2 ≤ c(t). Hence, without loss of generality, we may assume that λ > 1/2. Next, we claim that the maximum of the minimum is attained when

    (1/2) ln(1/λ) + (1/2) ln(1/(1 − λ)) = t/2 + ln(1/λ).

Indeed, if the left hand side is greater, we can increase the minimum by decreasing λ slightly. If the right hand side is greater, we can increase the minimum by increasing λ slightly. Thus at the maximum point we have the equality, from which λ = e^t/(1 + e^t), and the value of the minimum is exactly c(t).

Now we are ready to prove Theorem 5.1.

Proof of Theorem 5.1. Using Lemma 5.2, we conclude that for any t ≥ 0 we have

    (ln |A|)/n ≤ ln( e^{t/2} + e^{−t/2} ) − (t/n) ∫_{I_n} dist(x, A) dµ_n = ln( e^{t/2} + e^{−t/2} ) + t( ρ − 1/2 ).

Now we optimize on t and substitute

    t = ln(1 − ρ) − ln ρ,

which completes the proof.

PROBLEM. Prove that under the conditions of Theorem 5.1,

    (ln |A|)/n ≥ ln 2 − H( 1/2 − ρ ), where H(x) = x ln(1/x) + (1 − x) ln(1/(1 − x))

is the entropy function. The bound is asymptotically sharp as n −→ +∞. This problem is harder than our problems have been so far.

Lecture 8. Monday, January 24

Lecture 7 on Friday, January 21, covered the proof of Theorem 5.1 from the previous handout.


6. The Hamming Ball

Let us consider a ball in the Hamming metric in the Boolean cube I_n. All such balls look the same, so we consider one particular ball, centered at the point 0 = (0, . . . , 0).

(6.1) Definition. The Hamming ball B(r) of radius r is the set of points in the Boolean cube within Hamming distance r from 0, that is,

    B(r) = { x : dist(x, 0) ≤ r }.

Equivalently, B(r) consists of the points (ξ_1, . . . , ξ_n) such that

    ∑_{i=1}^{n} ξ_i ≤ r and ξ_i ∈ {0, 1} for all i = 1, . . . , n.

How many points are there in the Hamming ball? We have

    |B(r)| = ∑_{k=0}^{r} (n choose k),

since to choose a point x ∈ B(r) we have to choose k ≤ r positions, fill them with 1's, and fill the remaining n − k positions with 0's.

Here is a useful asymptotic estimate.

(6.2) Lemma. Let

    H(x) = x ln(1/x) + (1 − x) ln(1/(1 − x)) for 0 ≤ x ≤ 1

be the entropy function. Let us choose a 0 ≤ λ ≤ 1/2 and let B_n be the Hamming ball in I_n of radius ⌊λn⌋. Then

    (1) ln |B_n| ≤ nH(λ);
    (2) lim_{n→+∞} n^{−1} ln |B_n| = H(λ).

Proof. Clearly, we may assume that λ > 0.

To prove (1), we use the Binomial Theorem:

    1 = ( λ + (1 − λ) )^n = ∑_{k=0}^{n} (n choose k) λ^k (1 − λ)^{n−k} ≥ ∑_{0≤k≤λn} (n choose k) λ^k (1 − λ)^{n−k}.

Now, since λ ≤ 1/2, we have λ/(1 − λ) ≤ 1, and, therefore, λ^i (1 − λ)^{n−i} ≥ λ^j (1 − λ)^{n−j} for j ≥ i. Continuing the chain of inequalities above, we have

    1 ≥ ∑_{0≤k≤λn} (n choose k) λ^{λn} (1 − λ)^{n−λn} = |B_n| e^{−nH(λ)}.

Thus |B_n| ≤ e^{nH(λ)}, as desired.

To prove (2), we use (1) and Stirling's formula

    ln n! = n ln n − n + O(ln n) as n −→ +∞.

Letting m = ⌊λn⌋ = λn + O(1) as n −→ +∞, we get

    n^{−1} ln |B_n| ≥ n^{−1} ln (n choose m) = n^{−1} ( ln n! − ln m! − ln(n − m)! )
      = n^{−1} ( n ln n − n − m ln m + m − (n − m) ln(n − m) + (n − m) + O(ln n) )
      = n^{−1} ( n ln n − n − λn ln(λn) + λn − (1 − λ)n ln((1 − λ)n) + (1 − λ)n + o(n) )
      = n^{−1} ( nH(λ) + o(n) ) = H(λ) + o(1).

Together with (1), this concludes the proof.
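
A quick numerical check of the lemma (an illustration, not part of the notes; plain Python, no extra libraries):

    import math

    def H(x):
        return x * math.log(1 / x) + (1 - x) * math.log(1 / (1 - x))

    lam = 0.3
    for n in (20, 100, 500, 2500):
        r = int(lam * n)                                   # radius of the Hamming ball
        size = sum(math.comb(n, k) for k in range(r + 1))  # |B_n|
        print(n, round(math.log(size) / n, 4), round(H(lam), 4))
    # n^{-1} ln |B_n| stays below H(lambda) and approaches it as n grows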

(6.3) The Hamming ball as an asymptotic solution to the isoperimetric inequality. Let us fix a 0 ≤ λ ≤ 1/2. In Theorem 5.1, let us choose A = B(r) with r = ⌊λn⌋ as n −→ +∞.

What is the average distance from a point x ∈ I_n to A = B(r)? If the coordinates of x = (ξ_1, . . . , ξ_n) contain k ones and n − k zeroes, then dist(x, A) = max{0, k − r}. As follows from Corollary 4.4, for example, a “typical” point x ∈ I_n contains n/2 + O(√n) ones, so the distance dist(x, A) should be roughly 0.5n − r ≈ n(0.5 − λ). This is indeed so.

PROBLEM. Let A = B(r) with r = ⌊λn⌋ for some 0 ≤ λ ≤ 1/2. Prove that

    lim_{n→+∞} (1/n) ∫_{I_n} dist(x, A) dµ_n = 1/2 − λ.

On the other hand, as follows by Lemma 6.2,

    lim_{n→+∞} (ln |A|)/n = H(λ) = λ ln(1/λ) + (1 − λ) ln(1/(1 − λ)).

This shows that the inequality of Theorem 5.1 is sharp on Hamming balls.

(6.3) The exact solution to the isoperimetric inequality. As we mentioned, the Hamming ball is also the exact solution to the isoperimetric inequality in the Boolean cube: among all sets of a given cardinality, it has the smallest cardinality of the ε-neighborhood, for any given ε > 0, not necessarily small. There is, of course, a catch here: not every positive integer can be the cardinality of a Hamming ball, because |B(r)| is always a certain sum of binomial coefficients. Thus, to cover truly all possibilities, we should consider partially filled Hamming balls. This calls for a definition.

(6.3.1) Definition. Let us define the simplicial order in the Boolean cube I_n as follows: for x = (ξ_1, . . . , ξ_n) and y = (η_1, . . . , η_n), we say that x < y if either ξ_1 + . . . + ξ_n < η_1 + . . . + η_n, or ξ_1 + . . . + ξ_n = η_1 + . . . + η_n and for the smallest i such that ξ_i ≠ η_i we have ξ_i = 1 and η_i = 0.

Given an integer 0 ≤ k ≤ 2^n, we can consider the first k elements of the simplicial order in I_n. For example, if n = 5 and k = 12, then the first 12 elements in the simplicial order are

    (0, 0, 0, 0, 0), (1, 0, 0, 0, 0), (0, 1, 0, 0, 0), (0, 0, 1, 0, 0), (0, 0, 0, 1, 0),
    (0, 0, 0, 0, 1), (1, 1, 0, 0, 0), (1, 0, 1, 0, 0), (1, 0, 0, 1, 0), (1, 0, 0, 0, 1),
    (0, 1, 1, 0, 0), (0, 1, 0, 1, 0).
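
This order is easy to generate by machine; the sketch below (not from the notes; plain Python) sorts the cube by the rule just described and reproduces the twelve points above.

    from itertools import product

    def simplicial_key(x):
        # compare by the number of ones first; among points with equal sums,
        # the point whose leftmost differing coordinate is a 1 comes first,
        # which is exactly what sorting the negated coordinates achieves
        return (sum(x), tuple(-c for c in x))

    def first_k_simplicial(n, k):
        return sorted(product((0, 1), repeat=n), key=simplicial_key)[:k]

    for p in first_k_simplicial(5, 12):
        print(p)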

For a non-empty set A ⊂ I_n and a positive integer t, let us define

    A(t) = { x ∈ I_n : dist(x, A) ≤ t }.

PROBLEMS.
1. Suppose that A = B(r) is the Hamming ball of radius r. Prove that A(t) = B(r + t).
2. Check that the Hamming ball is an interval in the simplicial order, that is, the set of all points not exceeding a particular x ∈ I_n in the simplicial order.
3. Let B ⊂ I_n be an interval in the simplicial order. Prove that B(t) is an interval in the simplicial order for any positive integer t.

Now, Harper’s Theorem.

(6.3.1) Theorem. Given a non-empty set A ⊂ I_n, let B ⊂ I_n be the first |A| elements of I_n in the simplicial order. Then

    |A(1)| ≥ |B(1)|.

Although we don't prove Theorem 6.3.1, here is a useful corollary.

(6.3.2) Corollary. Let A ⊂ I_n be a set such that

    |A| ≥ ∑_{k=0}^{r} (n choose k).

Then

    |A(t)| ≥ ∑_{k=0}^{r+t} (n choose k) for all t > 0.

Here is another useful corollary.

(6.3.3) Corollary. Suppose that n is even and let A ⊂ I_n be a set such that |A| ≥ 2^{n−1} (that is, µ_n(A) ≥ 1/2). Then

    |A(t)| ≥ 2^n − exp{ nH( 1/2 − t/n ) }, that is,
    µ_n( A(t) ) ≥ 1 − exp{ −n( ln 2 − H( 1/2 − t/n ) ) } for t = 1, . . . , n/2,

where

    H(x) = x ln(1/x) + (1 − x) ln(1/(1 − x)), 0 ≤ x ≤ 1/2,

is the entropy function.

PROBLEM.
1. Deduce Corollaries 6.3.2 and 6.3.3 from Theorem 6.3.1.

Lecture 9. Wednesday, January 26

7. The Martingale Method and Its Application to the Boolean Cube

Having the Boolean cube I_n as an example, we describe a general method of obtaining concentration inequalities via martingales. Although the inequalities we obtain through the martingale approach are not as sharp as the ones we get via isoperimetric inequalities (exact or asymptotic), the martingale approach is very general, very simple, and allows us to get concentration results in a variety of instances. Besides, it produces estimates for the concentration about the average value of the function (as opposed to the median value), which comes in handy under a variety of circumstances. We start with an abstract definition.

(7.1) Conditional expectation. Let X be a space with a probability measure µ, let f : X −→ R be an integrable function, and let F be a σ-algebra of some measurable subsets of X. In other words, F is a collection of subsets which contains ∅ and X, if A ∈ F then X \ A ∈ F, and F is closed under the operations of taking countable unions and intersections.

A function h : X −→ R is called the conditional expectation of f with respect to F, and denoted h = E(f|F), if h is measurable with respect to F (that is, h^{−1}(A) ∈ F for every Borel set A ⊂ R) and

    ∫_Y h dµ = ∫_Y f dµ for all Y ∈ F.

Although conditional expectations exist (and are unique) under the widest assumptions (the Radon-Nikodym Theorem), we will use them in the following simple (but sufficiently general) situation. The space X will be finite (but large, keep the Boolean cube in mind), so that µ assigns some positive real weights to the elements x ∈ X and the integral is just the sum

    ∫_Y f dµ = ∑_{x∈Y} f(x) µ{x}.

The family F will be a collection of pairwise disjoint subsets, called blocks, whose union is X. In this case, the conditional expectation h = E(f|F) is defined as follows: the value of h(y) is just the average value of f on the block containing y:

    h(y) = (1/µ(Y)) ∫_Y f(x) dµ provided y ∈ Y and Y ∈ F.

To be consistent, let us denote

    E(f) = ∫_X f dµ.

If F consists of a single block X, the conditional expectation E(f|F) is just the constant function on X whose value at every point is equal to the average value of f on X. Here are some trivial, yet useful, properties of the conditional expectation:

• We have

    E(f) = E( E(f|F) ).

In words: to compute the average of f, one can first compute the averages of f over each block of the partition F and then average those averages;

• Similarly, if a partition F_2 refines a partition F_1, that is, if every block of F_1 is a union of some blocks of F_2, then

    E( E(f|F_2) | F_1 ) = E(f|F_1).

In words: if we average over the smaller blocks and then over the larger blocks, the result is the same as if we average over the larger blocks;

• Suppose that f(x) ≤ g(x) for all x ∈ X. Then

    E(f|F) ≤ E(g|F) pointwise on X.

In words: if f does not exceed g at every point of X, then the average of f does not exceed the average of g over every block of the partition F;

• Suppose that g is constant on every block of the partition F. Then

    E(gf|F) = g E(f|F).

In words: if every value of the function on a block of the partition F is multiplied by a constant, the average value of the function on that block gets multiplied by that constant.

(7.2) The idea of the martingale approach. Suppose we have a space X as above and a function f : X −→ R. Suppose that we have a sequence of partitions F_0, . . . , F_n of X into blocks. That is, each F_i is just a collection of some pairwise disjoint blocks Y ⊂ X whose union is X. We also assume that the following holds:

The first partition F_0 consists of a single block, the set X itself;
The blocks of the last partition F_n are the singletons {x} for x ∈ X;
Each F_{i+1} refines F_i, that is, every block Y ∈ F_i is a union of some blocks Y_j from F_{i+1}.

Then, for each i = 0, . . . , n, we have a function f_i : X −→ R which is obtained by averaging f over the blocks of F_i. That is, f_i = E(f|F_i). Note that f_0 = E(f|F_0) is the constant function equal to the average value of f on X, while f_n = f is the function f itself. Hence the functions f_i kind of interpolate between f and its average. The collection of functions f_0, . . . , f_n is called a martingale. The name comes from games of chance, where it refers to a specific strategy of playing roulette. The idea of using the term martingale in a more general context is that it refers to a strategy of playing (averaging) that does not change the final average (expected win).

Our main result is as follows.

(7.3) Theorem. Let f_i : X −→ R, i = 0, . . . , n, be a martingale. Suppose that for some numbers d_1, . . . , d_n we have

    |f_i(x) − f_{i−1}(x)| ≤ d_i for i = 1, . . . , n and all x ∈ X.

Let

    a = E(f) = ∫_X f dµ,

so that f_0(x) = a for all x ∈ X. Let

    D = ∑_{i=1}^{n} d_i².

Then, for any t ≥ 0, we have

    µ{ x : f(x) ≥ a + t } ≤ e^{−t²/2D} and µ{ x : f(x) ≤ a − t } ≤ e^{−t²/2D}.

To prove Theorem 7.3, we need a little technical lemma.

(7.4) Lemma. Let f : X −→ R be a function such that

    ∫_X f dµ = 0 and |f(x)| ≤ d for all x ∈ X.

Then, for any λ ≥ 0,

    ∫_X e^{λf} dµ ≤ ( e^{λd} + e^{−λd} )/2 ≤ e^{λ²d²/2}.

Proof. Without loss of generality we assume that d = 1. Next, we note that e^{λt} is a convex function and hence its graph lies beneath the chord connecting the points (−1, e^{−λ}) and (1, e^{λ}). This gives us the inequality

    e^{λt} ≤ ( e^{λ} + e^{−λ} )/2 + ( ( e^{λ} − e^{−λ} )/2 ) t for −1 ≤ t ≤ 1.

The substitution t = f(x) and integration with respect to x produce

    ∫_X e^{λf} dµ ≤ ( e^{λ} + e^{−λ} )/2 ≤ e^{λ²/2},

where the last inequality follows by comparing the Taylor series expansions.

Proof of Theorem 7.3. Clearly, it suffices to prove the first inequality, as the second follows by switching f to −f.

We use the Laplace transform method of Section 2.1. That is, for every λ ≥ 0, we can write

    µ{ x ∈ X : f(x) − a ≥ t } ≤ e^{−λt} ∫_X e^{λ(f−a)} dµ,

and our goal is to estimate the integral.

We can write

    f − a = f_n − a = ∑_{i=1}^{n} ( f_i − f_{i−1} ).

Letting g_i = f_i − f_{i−1}, we have

    ∫_X e^{λ(f−a)} dµ = ∫_X ∏_{i=1}^{n} e^{λg_i} dµ = E( e^{λg_1} · · · e^{λg_n} ).

Let us denote

    E_k(h) = E(h|F_k) for any function h : X −→ R.

Hence we can write

    E( e^{λg_1} · · · e^{λg_n} ) = E( E_0 E_1 . . . E_{n−1} ( e^{λg_1} · · · e^{λg_n} ) ).

In words: we compute the average first by averaging over the finest partition, then over the less fine partition, and so on, till we finally average over the partition consisting of the whole space.

Let us take a closer look at

    E_0 . . . E_{k−1} ( e^{λg_1} · · · e^{λg_k} ).

We observe that g_i is constant on the blocks of the partition F_i and hence on the blocks of each finer partition F_j with j ≥ i. In particular, g_1, . . . , g_{k−1} are constant on the blocks of F_{k−1}, which enables us to write

    E_0 . . . E_{k−1} ( e^{λg_1} · · · e^{λg_k} ) = E_0 · · · E_{k−2} ( e^{λg_1} · · · e^{λg_{k−1}} E_{k−1}( e^{λg_k} ) ).

Now, what do we know about g_k? First, |g_k(x)| ≤ d_k for all x, and second,

    E_{k−1} g_k = E(f_k|F_{k−1}) − E(f_{k−1}|F_{k−1}) = f_{k−1} − f_{k−1} = 0

(the function that is identically zero on X). Therefore, by Lemma 7.4,

    E_{k−1} e^{λg_k} ≤ e^{λ²d_k²/2}

(a pointwise inequality on X). Hence

    E_0 . . . E_{k−1} ( e^{λg_1} · · · e^{λg_k} ) ≤ e^{λ²d_k²/2} E_0 · · · E_{k−2} ( e^{λg_1} · · · e^{λg_{k−1}} ).

Proceeding as above, we conclude that

    E( e^{λg_1} · · · e^{λg_n} ) = E( E_0 E_1 . . . E_{n−1} ( e^{λg_1} · · · e^{λg_n} ) ) ≤ exp{ (λ²/2) ∑_{k=1}^{n} d_k² } = e^{λ²D/2}.

Now we choose λ = t/D and conclude that

    µ{ x ∈ X : f(x) − a ≥ t } ≤ e^{−λt} ∫_X e^{λ(f−a)} dµ ≤ e^{−t²/D} e^{t²/2D} = e^{−t²/2D},

as claimed.

As an application, let us see what kind of concentration result we can get for the Boolean cube.


(7.5) Theorem. Let f : I_n −→ R be a function such that

    |f(x) − f(y)| ≤ dist(x, y) for all x, y ∈ I_n.

Let

    a_f = ∫_{I_n} f dµ_n

be the average value of f on the Boolean cube. Then, for all t ≥ 0,

    µ_n{ x : f(x) ≥ a_f + t } ≤ e^{−2t²/n} and µ_n{ x : f(x) ≤ a_f − t } ≤ e^{−2t²/n}.

Proof. To apply Theorem 7.3, we need to associate a martingale f_0, . . . , f_n to f. Let F_0 be the partition consisting of a single block, the cube I_n itself, and let F_n be the partition whose blocks are the points of I_n. To interpolate between F_0 and F_n, let us define F_1 as the partition consisting of the two blocks, the “upper facet”

    I^1_n = { (ξ_1, . . . , ξ_n) : ξ_n = 1 }

and the “lower facet”

    I^0_n = { (ξ_1, . . . , ξ_n) : ξ_n = 0 }.

Since each of the facets looks like the Boolean cube I_{n−1}, we can proceed by subdividing them further, finally arriving at F_n. This defines the martingale f_0, . . . , f_n, where f_i : I_n −→ R is the function obtained by averaging f on the blocks of F_i (each of the 2^i blocks of F_i is obtained by fixing the last i coordinates of a point x ∈ I_n). To apply Theorem 7.3, we need to estimate |f_i − f_{i−1}|.

Let us estimate |f_1 − f_0|. We observe that f_0 is constant on the whole cube, equal to the average value of f on I_n. Next, we observe that f_1 takes two values: if x ∈ I^1_n, then f_1(x) is the average value of f on the upper facet I^1_n, and if x ∈ I^0_n, then f_1(x) is the average value of f on the lower facet I^0_n. These two averages cannot be too different: if we switch the last coordinate of x ∈ I_n, we switch between the upper and the lower facets, and while we are doing that, the value of f can change by at most 1. This proves that the difference between the average values of f on I^1_n and on I^0_n does not exceed 1. Since the average value of f over the whole cube is the average of the averages over the facets, we conclude that we can choose d_1 = 1/2 in Theorem 7.3. Just the same, we can choose d_i = 1/2 for i = 1, . . . , n, which completes the proof.

Rescaling, we obtain

    µ_n{ x ∈ I_n : |f(x) − a_f| ≥ ε√n } ≤ 2e^{−2ε²},

which is similar in spirit to Theorem 4.5, only instead of estimating the deviation from the median m_f, we estimate the deviation from the average a_f.

PROBLEMS.
1. Let S_n be the symmetric group, that is, the group of all bijections σ : {1, . . . , n} −→ {1, . . . , n}. Let us make S_n a metric space by introducing the Hamming distance

    dist(σ, τ) = |{ i : σ(i) ≠ τ(i) }|

and a probability space by introducing the uniform probability measure µ_n{σ} = 1/n! for all σ ∈ S_n. Using the martingale approach, prove that for a 1-Lipschitz function f : S_n −→ R and any t ≥ 0,

    µ_n{ x : f(x) ≥ a_f + t } ≤ e^{−t²/2n} and µ_n{ x : f(x) ≤ a_f − t } ≤ e^{−t²/2n}.

2. Let us modify the measure µ_n on the Boolean cube as follows. Let p_1, . . . , p_n be numbers such that 0 < p_i < 1 for i = 1, . . . , n. For x ∈ I_n, x = (ξ_1, . . . , ξ_n), let us define µ_n{x} as the product of n factors, where the ith factor is p_i if ξ_i = 1 and the ith factor is 1 − p_i if ξ_i = 0. Prove that under the conditions of Theorem 7.5, we have

    µ_n{ x : f(x) ≥ a_f + t } ≤ e^{−t²/2n} and µ_n{ x : f(x) ≤ a_f − t } ≤ e^{−t²/2n}.

Lecture 11. Monday, January 31

Lecture 10 on Friday, January 28, covered the proof of Theorem 7.3 from the previous handout.

8. Concentration in the product spaces. An application: the law of large numbers for bounded functions

Let us apply the martingale approach in the following situation. Suppose that for i = 1, . . . , n we have a space X_i with a probability measure µ_i, and let

    X = X_1 × · · · × X_n

be the direct product of the spaces X_i, that is, the set of all n-tuples (x_1, . . . , x_n) with x_i ∈ X_i. Let us introduce the product measure µ = µ_1 × · · · × µ_n on X. Thus the measurable sets in X are countable unions of the sets A = A_1 × · · · × A_n, where A_i ⊂ X_i is measurable, and

    µ(A) = µ_1(A_1) · · · µ_n(A_n).

We also consider the Hamming distance on X:

    dist(x, y) = |{ i : x_i ≠ y_i }| for x = (x_1, . . . , x_n) and y = (y_1, . . . , y_n).


(8.1) Theorem. Let f : X −→ R be an integrable function and let d_1, . . . , d_n be numbers such that |f(x) − f(y)| ≤ d_i whenever x and y differ in the ith coordinate only. Let

    a = ∫_X f dµ

be the average value of f and let

    D = ∑_{i=1}^{n} d_i².

Then, for any t ≥ 0,

    µ{ x : f(x) ≥ a + t } ≤ e^{−t²/2D} and µ{ x : f(x) ≤ a − t } ≤ e^{−t²/2D}.

Proof. Of course, we construct a martingale associated with f. Namely, we define f_i, i = 0, . . . , n, as follows: f_0 : X −→ R is a constant function on X, f_n = f, and f_i is obtained by integrating f with respect to the last n − i coordinates:

    f_i(x_1, . . . , x_n) = ∫_{X_{i+1}×···×X_n} f(x_1, . . . , x_n) dµ_{i+1} · · · dµ_n.

In words: f_i(x_1, . . . , x_n) depends on the first i coordinates only and is obtained from f by averaging over the last n − i coordinates. One can view f_i as the conditional expectation E(f|F_i), where F_i is the σ-algebra consisting of the countable unions of the sets of the type

    A_1 × · · · × A_i × X_{i+1} × · · · × X_n, where A_j ⊂ X_j are measurable for j ≤ i.

If we think of F_i in terms of partitions and blocks, then the blocks of F_i consist of singletons of X_1 × · · · × X_i multiplied by the whole spaces X_{i+1} × · · · × X_n. Letting g_i = f_i − f_{i−1}, it is pretty clear that

    |g_i(x)| ≤ d_i for i = 1, . . . , n

and that

    E(g_i|F_{i−1}) = E(f_i|F_{i−1}) − E(f_{i−1}|F_{i−1}) = ∫_{X_i} f_i dµ_i − f_{i−1} = f_{i−1} − f_{i−1} = 0,

so the whole proof of Theorem 7.3 carries over.


(8.2) A typical application: the law of large numbers. Typically, Theorem 8.1 is applied under the following circumstances. We have a probability space Y, say, with a measure ν, say, and an integrable function h : Y −→ R such that

    Eh = ∫_Y h dν = a,

say. We sample n points y_1, . . . , y_n ∈ Y independently at random, compute the sample average

    ( h(y_1) + . . . + h(y_n) )/n,

and ask ourselves how far and how often it can deviate from the average a. We suppose that 0 ≤ h(y) ≤ d for some d and all y ∈ Y.

Let us make n copies X_1, . . . , X_n of Y and let µ_i be the copy of ν on X_i. To sample n points y_1, . . . , y_n ∈ Y “independently at random” is the same as to sample a single point x = (y_1, . . . , y_n) ∈ X for X = X_1 × · · · × X_n with respect to the product measure µ = µ_1 × · · · × µ_n. Now,

    f(y_1, . . . , y_n) = ( h(y_1) + . . . + h(y_n) )/n,

so if we change only one coordinate, the value of f changes by not more than d/n. Hence we can apply Theorem 8.1 with

    D = ∑_{i=1}^{n} (d/n)² = d²/n.

This gives us

    µ_n{ x : f(x) ≥ a + t } ≤ e^{−nt²/2d²} and µ_n{ x : f(x) ≤ a − t } ≤ e^{−nt²/2d²}.

Rescaling t = εd/√n, we get that the probability that

    ( h(y_1) + . . . + h(y_n) )/n

deviates from a by more than εd/√n does not exceed 2e^{−ε²/2}. Taking ε = 3, for example, we conclude that the probability that the sample average of h over n random samples does not deviate from the average by more than 3d/√n is at least 97%. Hence, if we take n = 10,000, say, the probability that the sample average will not deviate from the average by more than 0.03d will be at least 97%.


(8.3) Another typical application: serving many functions at the same time. Let us adjust the situation of Section 8.2 a bit. Suppose that instead of one function h we have N functions h_1, . . . , h_N, having averages a_1, . . . , a_N, and such that 0 ≤ h_i(y) ≤ d for all y ∈ Y and all i = 1, . . . , N. We sample n points y_1, . . . , y_n and compute the sample average of every function h_i:

    ( h_i(y_1) + . . . + h_i(y_n) )/n for i = 1, . . . , N,

on the same set of samples. We ask ourselves how many points we should sample so that, with high probability, each of the N sample averages is reasonably close to the corresponding average a_i. Applying the estimate, we conclude that the probability of getting within t of the expectation a_i for any particular sample average is at least 1 − 2e^{−nt²/2d²}. Therefore, the probability that every single sample average is within εd of the corresponding expectation is at least 1 − 2Ne^{−nε²/2}. To make that reasonably high, we should choose n ∼ ε^{−2} ln N. In other words, given ε and d, the number of samples should be only logarithmic in the number N of functions.

For example, to get the probability at least 2/3, we can take any n > 2ε^{−2} ln(6N). Thus if N = 10^9 and ε = 0.1 (so that we stay within 10% of d from each of the 10^9 averages), we can choose n = 4,504.

PROBLEMS.
1. Suppose that we require |h(y)| ≤ d in Sections 8.2 and 8.3 above (that is, we allow h to be negative). Check that the inequalities read

    µ_n{ x : f(x) ≥ a + t } ≤ e^{−nt²/8d²} and µ_n{ x : f(x) ≤ a − t } ≤ e^{−nt²/8d²}.

2. Suppose that h_1, . . . , h_n : Y −→ R are functions such that 0 ≤ h_i ≤ 1 for i = 1, . . . , n and let a_i = Eh_i. For n points y_1, . . . , y_n sampled independently at random, let

    f(y_1, . . . , y_n) = ( h_1(y_1) + . . . + h_n(y_n) )/n

and let a = (a_1 + . . . + a_n)/n. Prove that

    µ_n{ x : f(x) ≥ a + t } ≤ e^{−nt²/2} and µ_n{ x : f(x) ≤ a − t } ≤ e^{−nt²/2}.

9. How to sharpen martingale inequalities?

Although often quite helpful, martingale inequalities are rarely sharp. One of the rare examples where we can figure out optimal estimates is the Boolean cube.

(9.1) Example: tossing a fair coin. Let I_n = {0, 1}^n be the Boolean cube with the uniform probability measure µ_n{x} = 2^{−n}. Let us consider f : I_n −→ R defined by f(ξ_1, . . . , ξ_n) = ξ_1 + . . . + ξ_n. That is, we toss a fair coin n times and count the number of heads, say. Then the average value a_f of f is n/2 (why?) and f is 1-Lipschitz. Hence Theorem 7.5 gives us

µ_n{x : f(x) − n/2 ≤ −t} ≤ e^{−2t²/n}.

Suppose that t grows linearly with n, t = λn for some 0 < λ < 1/2. Then the estimate becomes

µ_n{x : f(x) − n/2 ≤ −λn} ≤ e^{−2nλ²}.

Suppose that n is even, n = 2m, and that t is an integer. Then the points x ∈ I_n where f(x) − m ≤ −t are the points x having 0, 1, . . . , m − t one coordinates. Hence

µ_n{x : f(x) − m ≤ −t} = 2^{−n} Σ_{k=0}^{m−t} \binom{n}{k} = 2^{−n} |B(m − t)|,

where B(m − t) is the Hamming ball of radius m − t. If t ≈ λn, by Part (2) of Lemma 6.2, we can estimate the right hand side by

2^{−n} exp{n H(1/2 − λ)},

where H is the entropy function. Assuming that λ ≈ 0, we get

H(1/2 − λ) = (1/2 − λ) ln(1/(1 − 2λ)) + (1/2 + λ) ln(1/(1 + 2λ)) + ln 2.

Using that

ln(1 + x) = x − x²/2 + O(x³),

we get

H(1/2 − λ) = ln 2 − 2λ² + O(λ³).

Hence, for λ ≈ 0, the true estimate

µ_n{x : f(x) − m ≤ −λn} ≈ e^{−2nλ²}

agrees with the one given by Theorem 7.5. However, as λ grows, the margin between the martingale bound and the true bound starts to widen, since

H(1/2 − λ) − ln 2

starts to deviate from −2λ². But even for the endpoint λ = 1/2, the difference is not too great: the martingale bound estimates the probabilities by e^{−0.5n}, while the true value is e^{−(ln 2)n} ≈ e^{−0.69n}.

Lecture 12. Wednesday, February 2


9. How to sharpen martingale inequalities? Continued

(9.2) Example: tossing an unfair coin. Let us choose a number 0 < p < 1 and let q = 1 − p. Let I_n = {0, 1}^n be the Boolean cube and let us define the measure µ_n on I_n by µ_n{x} = p^k q^{n−k}, where k is the number of 1's among the coordinates of x (check that this indeed defines a probability measure on I_n). Let us consider f : I_n −→ R defined by f(ξ_1, . . . , ξ_n) = ξ_1 + . . . + ξ_n. That is, we have a coin which turns up heads with probability p, we toss it n times and count heads. The average value a_f of f is np (why?) and f is 1-Lipschitz. It follows from Theorem 7.3 (cf. also Problem 2 after Theorem 7.5) that

µ_n{x : f(x) − np ≤ −t} ≤ e^{−t²/2n} and µ_n{x : f(x) − np ≥ t} ≤ e^{−t²/2n}.

After some thought, we conclude that it is natural to scale t = αnp for some α ≥ 0, which gives us

µ_n{x : f(x) − np ≤ −αnp} ≤ e^{−np²α²/2} and µ_n{x : f(x) − np ≥ αnp} ≤ e^{−np²α²/2}.

These estimates do not look too good for small p. We would rather have something of the order of e^{−npα²/2}. Can we make it?

In fact, we can, if we examine the proof of Theorem 7.3 and its application to the Boolean cube. Let us construct the martingale f_0, . . . , f_n : I_n −→ R as in the proof of Theorem 7.5, repeatedly cutting the cube into the "upper" and "lower" facets I_n^1 and I_n^0 and averaging f over the facets. Thus, f_0 is the constant function on I_n, which at every point x ∈ I_n is equal to the average value of f on I_n. Also, f_n = f. Furthermore, for every k, the function f_k depends on the first k coordinates only. That is, f_k(ξ_1, . . . , ξ_n) is equal to the average value of f on the face of the Boolean cube consisting of the points where the first k coordinates have the prescribed values ξ_1, . . . , ξ_k. In terms of the coin, f_k(ξ_1, . . . , ξ_n) is the expected number of heads in n tosses given that the first k tosses resulted in ξ_1, . . . , ξ_k, where ξ_i = 1 if we got heads at the ith toss and ξ_i = 0 if we got tails. Hence we conclude that

f_k(ξ_1, . . . , ξ_n) = (n − k)p + Σ_{i=1}^k ξ_i.

Indeed, after k tosses we got ξ_1 + . . . + ξ_k heads and have n − k tosses left. Since the coin has no memory, the expected number of heads obtained in the course of those n − k tosses is (n − k)p. Going back to the proof of Theorem 7.3, we should refrain from using Lemma 7.4 as too crude and compute E_{k−1}(e^{λg_k}) directly. Recall that g_k = f_k − f_{k−1}. Hence g_k is given by a very simple formula:

g_k(ξ_1, . . . , ξ_n) = ξ_k − p.


Consequently,

e^{λg_k(x)} = e^{λq} if ξ_k = 1 and e^{−λp} if ξ_k = 0, where x = (ξ_1, . . . , ξ_n).

Since E_{k−1}(e^{λg_k}) is obtained by averaging over the kth coordinate, we conclude that E_{k−1}(e^{λg_k}) is the constant function equal to pe^{λq} + qe^{−λp}. Therefore, in the proof of Theorem 7.3, we have

∫_{I_n} e^{λ(f−a)} dµ_n = (pe^{λq} + qe^{−λp})^n,

which gives us the bound

µ_n{x ∈ I_n : f(x) − np ≥ t} ≤ e^{−λt}(pe^{λq} + qe^{−λp})^n.

To obtain an inequality in the other direction, by the Laplace transform method we have

µ_n{x ∈ I_n : f(x) − np ≤ −t} ≤ e^{−λt} ∫_{I_n} e^{λ(a−f)} dµ_n.

It remains to notice that n − a and n − f are obtained from a and f respectively by switching p and q. Therefore,

µ_n{x ∈ I_n : f(x) − np ≤ −t} ≤ e^{−λt}(qe^{λp} + pe^{−λq})^n.

Optimizing on λ in the first and in the second inequality, we get

t/(npq) = 1 − e^{−λ}, so λ = −ln(1 − t/(npq)).

Note that we should assume that t < npq, which we may or may not want to do.

Now, if t = αnpq for sufficiently small α > 0 then the optimal λ ≈ α. Thus it makes sense to substitute λ = α, optimal or not. This gives us the bounds

µ_n{x ∈ I_n : f(x) − np ≥ αnpq} ≤ (pe^{αq(1−αp)} + qe^{−αp(1+αq)})^n and
µ_n{x ∈ I_n : f(x) − np ≤ −αnpq} ≤ (qe^{αp(1−αq)} + pe^{−αq(1+αp)})^n.

If p = o(1) and α ≈ 0 then both bounds are about (1 − α²p/2)^n ≈ e^{−α²pn/2}, as we wanted.
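The difference between the crude and the refined bounds is easy to see numerically. Here is a sketch (not part of the original notes; the particular values of n, p, α are illustrative only) comparing the crude bound e^{−t²/2n}, the refined bound above, and the exact binomial tail.

```python
# Compare the martingale bound, the refined Chernoff-type bound of Section 9.2,
# and the exact tail of Bin(n, p) for the event {heads - n*p >= a*n*p*q}.
import math

def exact_upper_tail(n, p, t):
    """P{Bin(n,p) - n*p >= t}, from the binomial distribution."""
    k0 = math.ceil(n * p + t)
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k0, n + 1))

n, p, a = 1000, 0.05, 0.5
q = 1 - p
t = a * n * p * q

crude   = math.exp(-t ** 2 / (2 * n))
refined = (p * math.exp(a * q * (1 - a * p)) + q * math.exp(-a * p * (1 + a * q))) ** n
exact   = exact_upper_tail(n, p, t)
print(f"exact tail: {exact:.3e}   refined bound: {refined:.3e}   crude bound: {crude:.3e}")
```

For small p the crude bound is essentially useless while the refined bound tracks the exact tail within a few orders of magnitude, which is the point of the computation above.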

Lecture 13. Friday, February 4


10. Concentration and Isoperimetry

We saw in Section 4 how to deduce concentration from isoperimetric inequalities. We worked with the Boolean cube as an example, but the construction is fairly general. Namely, if X is a metric space with a probability measure µ such that the t-neighborhoods

A(t) = {x : dist(x, A) ≤ t}

of sets A ⊂ X with µ(A) ≥ 1/2 have a large measure, say,

µ(A(t)) ≥ 1 − e^{−ct²}

for some constant c > 0, then 1-Lipschitz functions f : X −→ R concentrate around their median m_f. To show that, we consider the sets

A_+ = {x : f(x) ≥ m_f} and A_− = {x : f(x) ≤ m_f}.

Then µ(A_−), µ(A_+) ≥ 1/2 and so µ(A_−(t)) ≥ 1 − e^{−ct²} and µ(A_+(t)) ≥ 1 − e^{−ct²}. Therefore,

µ(A_−(t) ∩ A_+(t)) ≥ 1 − 2e^{−ct²}.

Furthermore, we have

|f(x) − m_f| ≤ t for x ∈ A_−(t) ∩ A_+(t).

Hence

µ{x : |f(x) − m_f| ≤ t} ≥ 1 − 2e^{−ct²}.

It works in the other direction too. Suppose that we have a concentration result for 1-Lipschitz functions f. That is, we have

µ{x : f(x) ≥ a_f + t} ≤ e^{−ct²} and µ{x : f(x) ≤ a_f − t} ≤ e^{−ct²}

for some constant c > 0, where a_f is the average value of f. Then sets A with µ(A) ≥ 1/2 have large t-neighborhoods. To see that, let us consider the function

f(x) = dist(x, A).

Let us choose δ > 0 such that e^{−cδ²} < 1/2. Then there exists an x ∈ A such that f(x) > a_f − δ. But f(x) = 0 for x ∈ A, which implies that a_f < δ. Therefore,

µ{x : dist(x, A) ≥ δ + t} ≤ e^{−ct²},

that is, µ(A(δ + t)) ≥ 1 − e^{−ct²}.


Therefore, martingale concentration results imply some kind of isoperimetric inequalities. Consider, for example, Theorem 8.1. It describes the following situation. For i = 1, . . . , n we have a space X_i with a probability measure µ_i. We let

X = X_1 × · · · × X_n and µ = µ_1 × · · · × µ_n.

We consider the Hamming distance on X:

dist(x, y) = |{i : x_i ≠ y_i}| for x = (x_1, . . . , x_n) and y = (y_1, . . . , y_n).

Given a set A ⊂ X such that µ(A) ≥ 1/2, we consider the function f : X −→ R defined by f(x) = dist(x, A) (we assume it is measurable). Then we can choose d_i = 1 in Theorem 8.1, so D = n. As above, we need a δ with e^{−δ²/2n} < 1/2, that is, δ > √(2n ln 2); it follows that a_f ≤ √(2n ln 2). This gives us the isoperimetric inequality

µ(A(t + √(2n ln 2))) ≥ 1 − e^{−t²/2n}.

Rescaling t = (ε − √(2 ln 2))√n for ε > √(2 ln 2), we have

µ{x : dist(x, A) ≤ ε√n} ≥ 1 − exp{−(ε − √(2 ln 2))²/2}.

PROBLEM. Let f : I_n −→ R be a 1-Lipschitz function on the Boolean cube I_n endowed with the standard probability measure µ_n{x} = 2^{−n}. Let a_f be the average value of f and let m_f be the median of f. Prove that

|a_f − m_f| ≤ √((n ln 2)/2),

which improves the bound of Problem 1 of Section 5.

In view of what's been said, an isoperimetric inequality is a natural way to supplement a martingale bound. A counterpart to the martingale concentration of Theorem 8.1 is an isoperimetric inequality for general product spaces proved by M. Talagrand. It turns out that the estimate of Theorem 4.2 holds in this general situation.

(10.1) Theorem. Let (X_i, µ_i) be probability spaces and let X = X_1 × · · · × X_n be the product space with the product measure µ = µ_1 × · · · × µ_n. Let A ⊂ X be a non-empty set. Then, for f(x) = dist(x, A) and t > 0 we have

∫_X e^{tf} dµ ≤ (1/µ(A)) (1/2 + (e^t + e^{−t})/4)^n ≤ e^{t²n/4}/µ(A).


Disclaimer: all sets and functions encountered in the course of the proof are assumed to be measurable. This is indeed so in all interesting cases (X finite or A compact, and so on).

Let us denote

c(t) = 1/2 + (e^t + e^{−t})/4.

The proof (due to M. Talagrand) is by induction on n and is based on a lemma (also due to M. Talagrand), which simultaneously takes care of the case n = 1.

(10.2) Lemma. Let Y be a space with a probability measure ν and let g : Y −→ R be an integrable function such that 0 ≤ g(y) ≤ 1 for all y. Then, for any t ≥ 0,

(∫_Y min{e^t, 1/g} dν) (∫_Y g dν) ≤ 1/2 + (e^t + e^{−t})/4 = c(t).

Proof. First, we note that if we replace g by max{g, e^{−t}}, the first integral does not change, while the second can only increase. Thus we have to prove that if e^{−t} ≤ g(y) ≤ 1 for all y, then

(10.2.1) (∫_Y (1/g) dν) (∫_Y g dν) ≤ c(t).

Let us assume that ν does not have atoms, that is, for every measurable B ⊂ Y there is a measurable C ⊂ B with ν(C) = ν(B)/2. Let us consider g ∈ L^∞(Y, ν), let us fix e^{−t} ≤ b ≤ 1 and let us consider the set G_b ⊂ L^∞(Y, ν) of functions g such that e^{−t} ≤ g ≤ 1 and

∫_Y g dν = b.

Thus G_b is a weak* compact set, so the function

φ : g ↦ ∫_Y (1/g) dν

attains its maximum on G_b. The next observation is that G_b is convex, that φ is convex, that G_b is the closed convex hull of the set of its extreme points (in the weak* topology), and that the extreme points of G_b are the functions g such that g = e^{−t} or g = 1 almost everywhere (this is the point where we use that ν has no atoms: if e^{−t} + ε < g < 1 − ε on a set of positive measure, we can perturb g a bit, thus representing g as a midpoint of two functions g_1, g_2 ∈ G_b). Hence the maximum of φ is attained at a function g with g = e^{−t} or g = 1 almost everywhere, so it suffices to check (10.2.1) for such functions g. We have

g(y) ∈ {e^{−t}, 1} and ∫_Y g dν = b implies ∫_Y (1/g) dν = e^t − be^t + 1.


Hence (10.2.1) reduces to the inequality

b(e^t − be^t + 1) ≤ c(t) for e^{−t} ≤ b ≤ 1.

But the left hand side is a quadratic function of b. The maximum of the left hand side occurs at b = (1 + e^{−t})/2 and is equal to c(t).

This completes the proof under the additional assumption that ν has no atoms. For a general ν, we note that the probability space Y is not so important. Introducing the cumulative distribution function F of g, we can write (10.2.1) as

(∫_R (1/x) dF(x)) (∫_R x dF(x)) ≤ c(t).

It remains to notice that F can be arbitrarily well approximated by continuous cumulative distribution functions and that such functions correspond to measures without atoms.

Now we can prove Theorem 10.1.

Proof of Theorem 10.1. We proceed by induction on n. For n = 1, let g(x) = 1 − f(x). Thus g(x) = 1 if x ∈ A and g(x) = 0 if x ∉ A. Then

(∫_{X_1} e^{tf} dµ_1) µ_1(A) = (∫_{X_1} min{e^t, 1/g} dµ_1) (∫_{X_1} g dµ_1)

and the result follows by Lemma 10.2.

Suppose that n > 1. For z ∈ X_n, let us define

A_z = {(x_1, . . . , x_{n−1}) : (x_1, . . . , x_{n−1}, z) ∈ A}

and let B be the projection of A onto the first (n − 1) coordinates:

B = {(x_1, . . . , x_{n−1}) : (x_1, . . . , x_{n−1}, z) ∈ A for some z ∈ X_n}.

Hence A_z, B ⊂ X_1 × · · · × X_{n−1}. Let us denote Y = X_1 × · · · × X_{n−1} and ν = µ_1 × · · · × µ_{n−1}. We denote y = (x_1, . . . , x_{n−1}) ∈ Y. By Fubini's theorem,

∫_X e^{tf} dµ = ∫_{X_n} (∫_Y e^{t f(y,z)} dν(y)) dµ_n(z).

Now,

dist((y, z), A) ≤ dist(y, A_z) and dist((y, z), A) ≤ dist(y, B) + 1.


Therefore, by the induction hypothesis,

∫_Y e^{t f(y,z)} dν(y) ≤ ∫_Y e^{t dist(y, A_z)} dν(y) ≤ c^{n−1}(t)/ν(A_z)

and

∫_Y e^{t f(y,z)} dν(y) ≤ e^t ∫_Y e^{t dist(y, B)} dν(y) ≤ (e^t/ν(B)) c^{n−1}(t).

Besides, by Fubini's Theorem,

∫_{X_n} ν(A_z) dµ_n(z) = µ(A).

Let us define a function g : X_n −→ R by g(z) = ν(A_z)/ν(B). Since A_z ⊂ B for all z, we have ν(A_z) ≤ ν(B) and 0 ≤ g(z) ≤ 1 for all z. Moreover,

∫_{X_n} g dµ_n = µ(A)/ν(B).

Applying Lemma 10.2, we have

∫_X e^{tf} dµ ≤ c^{n−1}(t) ∫_{X_n} min{1/ν(A_z), e^t/ν(B)} dµ_n(z)
= (c^{n−1}(t)/ν(B)) ∫_{X_n} min{ν(B)/ν(A_z), e^t} dµ_n(z)
= (c^{n−1}(t)/ν(B)) ∫_{X_n} min{1/g, e^t} dµ_n
≤ (c^n(t)/ν(B)) (∫_{X_n} g dµ_n)^{−1}
= c^n(t)/µ(A),

which completes the proof of the first inequality. The second inequality is proved in Corollary 4.3.

Thus, as in Section 4 (cf. Corollary 4.4 and Theorem 4.5), we get the isoperimetric inequality in the product space.

(10.3) Corollary. Let A ⊂ X = X_1 × · · · × X_n be a non-empty set. Then, for any ε > 0, we have

µ{x ∈ X : dist(x, A) ≥ ε√n} ≤ e^{−ε²}/µ(A).

Similarly, we get concentration about the median for Lipschitz functions on the product space.


(10.4) Theorem. Let (X_i, µ_i) be probability spaces, let X = X_1 × · · · × X_n and let µ = µ_1 × · · · × µ_n be the product probability measure. Let f : X −→ R be a function such that |f(x) − f(y)| ≤ 1 if x and y differ in at most 1 coordinate and let m_f be the median of f. Then for all ε > 0,

µ{x ∈ X : |f(x) − m_f| ≥ ε√n} ≤ 4e^{−ε²}.

Lecture 15. Wednesday, February 9

Lecture 14 on Monday, February 7, covered the proof of Theorem 10.1 from theprevious handout.

11. Why do we care: some examples leading to discrete measure concentration questions

A rich source of questions regarding concentration inequalities is the theory of random graphs. Suppose we have n points labeled 1, . . . , n and we connect each pair (i, j) by an edge at random with some probability p, independently of the others. Thus we get a random graph G and we may start asking about its various characteristics. For example, how many connected components may it have? What is the chromatic number of G, that is, the minimum number of colors we need to color the vertices of G so that the endpoints of every edge have different colors? What is the clique number of G, that is, the maximum cardinality of a subset S of vertices of G such that every two vertices from S are connected by an edge?

Let m = \binom{n}{2} and let I_m = {0, 1}^m be the corresponding Boolean cube. We interpret a point x = (ξ_1, . . . , ξ_m) as a graph G, where ξ_k = 1 if the corresponding pair of vertices is connected by an edge and ξ_k = 0 otherwise. All the above characteristics (the number of connected components, the chromatic number, the clique number) are represented by a function f : I_m −→ N. Moreover, in all of the above examples, the function is 1-Lipschitz (why?), so it must concentrate for large n; a small simulation along these lines is sketched below.

An example of an interesting function that is not 1-Lipschitz is the number of triangles in the graph G.
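Here is a minimal sketch (not part of the original notes) illustrating the encoding above for one of the 1-Lipschitz examples: the number of connected components of G(n, p) changes by at most 1 when a single edge-coordinate is flipped, and the simulated values indeed stay in a narrow window. All parameter values are illustrative choices.

```python
# Concentration of the number of connected components of a random graph G(n, p).
import random

def components(n, p, rng):
    """Number of connected components of G(n, p), via union-find."""
    parent = list(range(n))
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v
    comps = n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:            # flip the coin for edge (i, j)
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj
                    comps -= 1
    return comps

rng = random.Random(0)
n, p, trials = 200, 1.0 / 200, 300
values = [components(n, p, rng) for _ in range(trials)]
mean = sum(values) / trials
spread = max(values) - min(values)
print(f"mean #components ~ {mean:.1f}, max - min over {trials} samples = {spread}")
```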

12. Large Deviations and Isoperimetry

A natural question is to ask whether the bounds in Corollary 10.3 and Theorem 10.4 can be sharpened, for instance, when ε is large. That is, given a set A ⊂ X with, say, µ(A) ≈ 1/2, we want to estimate more accurately

µ{x : dist(x, A) ≥ t},


when t is large. Since this measure is small anyway, the correct scale is logarithmic, so we will be estimating

ln µ{x : dist(x, A) ≥ t}.

Equivalently, we may think of a small set

B = {x : dist(x, A) ≥ t}

and ask ourselves how large a neighborhood of B we should take so that it will occupy approximately half of X:

µ{x : dist(x, B) ≤ s} ≈ 1/2.

Questions of this kind are called large deviation questions. To answer them, we would need sharper isoperimetric inequalities in product spaces. Unlike the case of the standard Boolean cube, where exact isoperimetric inequalities are known (cf. Section 6), in most product spaces no exact isoperimetric inequalities are known. However, asymptotic isoperimetric inequalities in product spaces have been discovered recently. To be able to discuss them, we need some useful notions.

(12.1) The log-moment function. Let (X, µ) be a probability space and let f : X −→ R be a function. The log-moment function L_f : R −→ R associated to f is defined as follows:

L_f(λ) = ln ∫_X e^{λf} dµ = ln E e^{λf}.

If the integral diverges, we let L_f(λ) = +∞. The function L_f is also known as the cumulant generating function of f.

Here are some useful properties of L_f. We have

L_f(0) = ln 1 = 0.

Since e^{λx} is a convex function, by Jensen's inequality

L_f(λ) ≥ ln e^{λ E f} = λ E f.

In particular, if E f = 0 (which we will soon assume), then L_f is non-negative. Next,

L′_f = E(f e^{λf}) / E(e^{λf}) and L″_f = [E(f² e^{λf}) E(e^{λf}) − E²(f e^{λf})] / E²(e^{λf}),

provided all integrals converge, which we assume. By the Cauchy-Schwarz inequality,

E²(f e^{λf}) = E²(f e^{λf/2} · e^{λf/2}) ≤ E(f² e^{λf}) E(e^{λf}),

from which L″_f ≥ 0 and L_f is convex. Alternatively, convexity can be extracted directly from the definition. Likewise, for any fixed λ, the function f ↦ L_f(λ) is convex.


(12.2) The rate function. Let g : R −→ R ∪ {+∞} be a convex function. The function

g*(t) = sup_{λ∈R} (tλ − g(λ))

is called the Legendre-Fenchel transform of g. Thus g* : R −→ R ∪ {+∞}. The function g* is also called the conjugate of g. If g and g* are proper, that is, do not attain the value +∞, there is a remarkable duality (g*)* = g. If g is finite in a neighborhood of λ = 0, then g* is finite in a neighborhood of t = 0.

For example, if g(λ) = λ² then g*(t) = t²/4, and if g(λ) = λ²/4 then g*(t) = t².

The Legendre-Fenchel transform R_f(t) = L_f*(t) of the log-moment function L_f is called the rate function of f. The importance of the rate function is given by the following result. It is known as the Large Deviations Theorem and, apparently, goes back to Cramer.

(12.3) Theorem. Let (X, µ) be a probability space, let f : X −→ R be a function and let t ∈ R be a number such that

(1) E f = ∫_X f dµ = a_f exists;
(2) L_f(λ) = ln ∫_X e^{λf} dµ is finite in a neighborhood of λ = 0;
(3) t > a_f and µ{x : f(x) > t} > 0.

For a positive integer n, let us consider the product space X^n = X × · · · × X with the product measure µ^n = µ × · · · × µ. Let us define F : X^n −→ R by

F(x) = f(x_1) + . . . + f(x_n) for x = (x_1, . . . , x_n).

Then R_f(t) > 0 and

lim_{n→+∞} n^{−1} ln µ^n{x ∈ X^n : F(x) > nt} = −R_f(t),

where R_f is the rate function. Besides,

µ^n{x ∈ X^n : F(x) > nt} ≤ e^{−nR_f(t)} for all n.

In other words, if we sample n points x_1, . . . , x_n ∈ X independently and at random, the probability that the average (f(x_1) + . . . + f(x_n))/n exceeds t is roughly of the order of exp{−nR_f(t)}. This should be contrasted with the estimates of Section 8.2.


(12.4) Tossing a fair and an unfair coin again. Suppose that X = {0, 1} and µ{0} = µ{1} = 1/2. Let f(x) = x − 1/2 for x ∈ X. Then

L_f(λ) = ln((e^{λ/2} + e^{−λ/2})/2).

The maximum of tλ − L_f(λ) is attained at

λ = ln((1 + 2t)/(1 − 2t)) if −1/2 < t < 1/2,

so

R_f(t) = −(1/2 + t) ln(1/(1 + 2t)) − (1/2 − t) ln(1/(1 − 2t)) = −H(1/2 − t) + ln 2,

where

H(x) = x ln(1/x) + (1 − x) ln(1/(1 − x))

is the entropy function.

Let us define the product space X^n, µ^n and the function F : X^n −→ R as in Theorem 12.3. Then

F(x_1, . . . , x_n) = x_1 + . . . + x_n − n/2

is interpreted as the deviation of the number of heads shown by a fair coin in n tosses from the expected number n/2 of heads. Thus Theorem 12.3 tells us that the probability that the number of heads exceeds n(t + 0.5) is of the order of

2^{−n} exp{n H(1/2 − t)} for 0 < t < 1/2,

which we established already in Example 9.1.

Let us modify the above example. We still have X = {0, 1}, but now µ{1} = p and µ{0} = 1 − p = q for some 0 < p < 1. Also, f(x) = x − p. In this case,

L_f(λ) = ln(pe^{λq} + qe^{−λp}).

The maximum of tλ − L_f(λ) is attained at

λ = ln(q(t + p)/(p(q − t))) for −p < t < q.

Consequently,

R_f(t) = (t + p) ln(q(p + t)/(p(q − t))) − ln(q/(q − t)).


Let us define the product space X^n, µ^n and let

F(x_1, . . . , x_n) = x_1 + . . . + x_n − np.

Thus F(x) is the deviation of the number of heads shown by an unfair coin from the expected number np of heads. Theorem 12.3 tells us that the probability that the number of heads exceeds n(p + t) is of the order of

exp{−nR_f(t)} = ((p/(p + t))^{p+t} (q/(q − t))^{q−t})^n for 0 < t < q.

We don't prove Theorem 12.3, but note that the upper bound for

µ^n{x ∈ X^n : F(x) > nt}

is by now a simple exercise in the Laplace transform method: for λ ≥ 0,

µ^n{x ∈ X^n : F(x) > nt} ≤ e^{−ntλ} E e^{λF(x)} = (e^{−tλ} E e^{λf})^n = exp{(−tλ + ln E e^{λf}) n} = exp{−(tλ − L_f(λ)) n},

and optimizing over λ ≥ 0 gives the bound e^{−R_f(t) n}.
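A small numerical sketch (not from the original notes) of the computation in Example 12.4: maximize tλ − L_f(λ) on a grid and compare with the closed-form rate function derived above. The grid ranges and the particular values of p, t are arbitrary choices.

```python
# Rate function of the unfair coin: numerical Legendre-Fenchel transform vs. closed form.
import math

def L(lam, p):                       # log-moment function of f(x) = x - p
    q = 1 - p
    return math.log(p * math.exp(lam * q) + q * math.exp(-lam * p))

def rate_numeric(t, p, grid=20000, lam_max=20.0):
    return max(t * lam - L(lam, p)
               for lam in (lam_max * k / grid for k in range(grid + 1)))

def rate_closed(t, p):               # R_f(t) for 0 < t < q, as in the text
    q = 1 - p
    return (t + p) * math.log(q * (p + t) / (p * (q - t))) - math.log(q / (q - t))

p, t = 0.3, 0.2
print(rate_numeric(t, p), rate_closed(t, p))   # the two values should agree closely
```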

(12.5) Large deviations in product spaces. Let (X, µ) be a finite probability metric space with metric dist. For a positive integer n, let us consider the product

X^n = X × · · · × X

with the product measure µ^n = µ × · · · × µ and the metric

dist_n(x, y) = Σ_{i=1}^n dist(x_i, y_i) for x = (x_1, . . . , x_n) and y = (y_1, . . . , y_n).

Let us choose a t > 0 and consider the following quantity:

max_{A ⊂ X^n, µ^n(A) ≥ 1/2} µ^n{x : dist_n(x, A) ≥ t}.

In words: we are interested in the maximum possible measure of the complement of the t-neighborhood of a set of measure at least 1/2. Furthermore, we will be interested in the case when t grows proportionately with n, t ≈ αn for some α > 0. Clearly, we shouldn't try to choose α too large. Let us define

κ = max_{x ∈ X, µ{x} > 0} E dist(x, ·).

In words: we pick a point x ∈ X such that µ{x} > 0, compute the average distance to x and take the maximum over all such points x ∈ X. It is more or less clear that we should choose t < κn, since otherwise the measure of the set we care about is way too small even by our standards.

The following remarkable result was obtained by N. Alon, R. Boppana, and J. Spencer (in fact, they proved a more general result).


(12.6) Theorem. For a real λ, let us define L(λ) as the maximum of L_f(λ) = ln E e^{λf} taken over all 1-Lipschitz functions f : X −→ R such that E f = 0. For a t > 0, let us define

R(t) = sup_{λ∈R} (λt − L(λ)).

Let us choose 0 < α < κ. Then

lim_{n→+∞} n^{−1} ln max_{A ⊂ X^n, µ^n(A) ≥ 1/2} µ^n{x : dist_n(x, A) ≥ αn} = −R(α).

(12.7) Example: weighted Boolean cube. Let X = {0, 1} with the metric dist(0, 1) = 1 and the probability measure µ{1} = p, µ{0} = 1 − p = q for some 0 < p < 1. We assume that p ≤ 1/2. Then X^n = {0, 1}^n is the Boolean cube, dist_n is the familiar Hamming distance, and µ^n is the familiar measure of the "unfair coin":

µ^n{x} = p^k q^{n−k} provided x = (ξ_1, . . . , ξ_n) and Σ_{i=1}^n ξ_i = k.

If x = 0 then the average distance to x is p and if x = 1 then the average distance to x is q. Since we assumed that p ≤ q, we must choose κ = q in Theorem 12.6.

Moreover, it is not hard to see that the 1-Lipschitz function f : {0, 1} −→ R such that E f = 0 and E e^{λf} is the largest possible (for λ ≥ 0) is given by f(1) = q and f(0) = −p (again, we used that p ≤ q). Thus L(λ) = ln(pe^{λq} + qe^{−λp}) for λ ≥ 0, just as in Example 12.4. Consequently,

R(t) = (t + p) ln(q(p + t)/(p(q − t))) − ln(q/(q − t)) for 0 < t < q.

PROBLEM. In the above example, let A_n ⊂ {0, 1}^n be a Hamming ball centered at (0, . . . , 0) such that µ^n(A_n) = 1/2 + o(1). Prove that

lim_{n→+∞} n^{−1} ln µ^n{x : dist_n(x, A_n) ≥ αn} = −R(α).

Lecture 16. Friday, February 11


12. Large Deviations and Isoperimetry, Continued

We don’t prove Theorem 12.6, but discuss some ideas of the proof by Alon,Boppana, and Spencer. This is a martingale proof in spirit.

(12.8) The additive property of L(λ). Let us look closely at the quantityL(λ) = LX(λ) introduced in Theorem 12.6, that is, the maximum of ln Eeλf takenover all 1-Lipschitz functions f : X −→ R such that Ef = 0. The crucial ob-servation is that L(λ) is additive with respect to the direct product of spaces.Namely, suppose that (X1, µ1) and (X2, µ2) are probability metric spaces withmetrics dist1 and dist2 and let X = X1 ×X2 be the space with the product mea-sure µ = µ1 × µ2 and the distance dist(x, y) = dist1(x1, y1) + dist2(x2, y2) forx = (x1, x2) and y = (y1, y2). We claim that

LX(λ) = LX1(λ) + LX2(λ).

It is easy to see that LX(λ) ≥ LX1(λ) + LX2(λ). Indeed, if fi : Xi −→ R are 1-Lipschitz functions with Ef1 = Ef2 = 0 and we define f : X −→ R by f(x1, x2) =f1(x1) + f2(x2) then f is 1-Lipschitz, Ef = 0 and∫

X

eλf dµ =(∫

X1

eλf1 dµ1

)(∫X2

eλf2 dµ2

),

from which Lf (λ) = Lf1(λ) + Lf2(λ), so we have

LX1(λ) + LX2(λ) = maxf1

Lf1(λ) + maxf2

Lf2(λ)

= maxf1,f2

(Lf1(λ) + Lf2(λ)

)= maxf=f1+f2

Lf (λ) ≤ LX(λ).

The heart of the argument is the inequality in the opposite direction: LX(λ) ≤LX1(λ) + LX2(λ). Let us choose a 1-Lipschitz function f : X −→ R such thatEf = 0. Let us define g : X1 −→ R by

g(x1) =∫X2

f(x1, x2) dµ2(x2).

Thus g is the conditional expectation of f with respect to the first variable x1.Clearly, E(g) = 0. Moreover, g is 1-Lipschitz since

|g(x1)− g(y1)| =∣∣∣ ∫X2

f(x1, x2)− f(y1, x2) dµ2(x2)∣∣∣

≤∫X2

|f(x1, x2)− f(y1, x2)| dµ2(x2)

≤∫X2

dist1(x1, y1) dµ2 = dist1(x1, y1).


Now, for each x_1 ∈ X_1, let us define a function h_{x_1} : X_2 −→ R by h_{x_1}(x_2) = f(x_1, x_2) − g(x_1). Again, E h_{x_1} = 0 for all x_1 ∈ X_1, since

E h_{x_1} = ∫_{X_2} h_{x_1}(x_2) dµ_2(x_2) = ∫_{X_2} (f(x_1, x_2) − g(x_1)) dµ_2(x_2) = g(x_1) − g(x_1) = 0.

Moreover, h_{x_1} is 1-Lipschitz, since

|h_{x_1}(x_2) − h_{x_1}(y_2)| = |f(x_1, x_2) − g(x_1) − f(x_1, y_2) + g(x_1)| = |f(x_1, x_2) − f(x_1, y_2)| ≤ dist_2(x_2, y_2).

Hence we conclude that

ln Eeλg ≤ LX1(λ) and ln Eeλhx1 ≤ LX2(λ) for all λ.

Besides, f(x1, x2) = g(x1) + hx1(x2). Therefore,∫X

eλf dµ =∫X1

(∫X2

eλf dµ2

)dµ1 =

∫X1

(∫X2

eλgeλhx1 dµ2

)dµ1

=∫X1

eλg(∫

X2

eλhx1 dµ2

)dµ1 ≤

∫X1

eλgeLX2 (λ) dµ1

=eLX2 (λ)

∫X1

eλg dµ1 ≤ eLX2 (λ)eLX1 (λ).

Taking the logarithm, we happily conclude that

ln Eeλf ≤ LX1(λ) + LX2(λ),

and since the function f was arbitrary, LX(λ) ≤ LX1(λ) + LX2(λ), which finallyproves that LX(λ) = LX1(λ) + LX2(λ).

PROBLEMS.
1. Suppose that X consists of 3 points, X = {1, 2, 3}, and that µ{1} = µ{2} = µ{3} = 1/3. Prove that

L_X(λ) = ln((1/3)e^{2λ/3} + (2/3)e^{−λ/3}) for λ ≥ 0.

2. Let (X, µ) be a finite probability space with the Hamming metric

dist(x, y) = 1 if x ≠ y and dist(x, y) = 0 if x = y.

Let f : X −→ R be a 1-Lipschitz function with E f = 0 which achieves the maximum of E e^{λf} for some λ. Prove that f takes at most two values.

Now we go back to Theorem 12.6.


(12.9) Proving the upper bound in Theorem 12.6. Let us choose a set A ⊂ X^n with µ^n(A) ≥ 1/2. Let us define f : X^n −→ R by f(x) = dist_n(x, A) and let a = E f. It follows from almost any result we proved so far (see Corollary 10.3, Theorem 8.1, etc.) that a = O(√n) = o(n). Next, we note that f − a is 1-Lipschitz and that E(f − a) = 0. Therefore,

ln E e^{λ(f−a)} ≤ L_{X^n}(λ) = nL_X(λ),

as we proved above.

Now, applying the Laplace transform bound, let us choose λ ≥ 0. Then

µ^n{x : f(x) ≥ αn} ≤ e^{−λαn} E e^{λf} = e^{−λαn} e^{λa} E e^{λ(f−a)} ≤ e^{−λ(α+o(1))n} e^{L_{X^n}(λ)} = exp{(−λ(α + o(1)) + L_X(λ)) n}.

Hence

n^{−1} ln µ^n{x : f(x) ≥ αn} ≤ −λ(α + o(1)) + L_X(λ) = −(λ(α + o(1)) − L_X(λ)).

Optimizing on λ (note that since L_X(λ) ≥ 0, the optimal λ is non-negative), we get that

n^{−1} ln µ^n{x : f(x) ≥ αn} ≤ −R(α + o(1)).

(12.10) The spread constant of a space. In the same paper, Alon, Boppana, and Spencer study another interesting parameter associated with a probability metric space, the spread constant. The spread constant c(X) is the maximum value of E(f²) taken over all 1-Lipschitz functions f : X −→ R such that E f = 0. They prove that in the situation of Theorem 12.6, for n^{1/2} ≪ t ≪ n, we have

max_{A ⊂ X^n, µ^n(A) ≥ 1/2} µ^n{x : dist_n(x, A) ≥ t} = exp{−(t²/2cn)(1 + o(1))},

where c = c(X), which sharpens the estimates of Corollary 10.3 and Theorem 8.1.

PROBLEMS.
1. Let X be a finite probability metric space. Prove that

lim_{λ→0} L_X(λ)/λ² = c(X)/2.

2. Deduce that for X = X_1 × X_2 with the product measure µ = µ_1 × µ_2 and the distance dist(x, y) = dist_1(x_1, y_1) + dist_2(x_2, y_2) for x = (x_1, x_2) and y = (y_1, y_2), one has c(X) = c(X_1) + c(X_2).


3. Let I_n = {0, 1}^n be the Boolean cube. Let us choose a number 0 < p ≤ 1/2 and let q = 1 − p. Let us introduce the product measure µ^n by µ^n{x} = p^k q^{n−k} if there are exactly k 1's among the coordinates of x. Prove the following asymptotic isoperimetric inequality. Let A_n ⊂ I_n be a sequence of sets and let B_n ⊂ I_n be a Hamming ball centered at (1, . . . , 1) such that |ln µ^n(B_n) − ln µ^n(A_n)| = o(n). Let us choose a number 0 < α < 1. Prove that

ln µ^n{x : dist(x, A_n) ≤ αn} ≥ ln µ^n{x : dist(x, B_n) ≤ αn} + o(n).

4. Let X = {1, 2, 3} with pairwise distances equal to 1 and the uniform probability measure µ{1} = µ{2} = µ{3} = 1/3. Let X^n = X × · · · × X be the product space with the product measure and the Hamming metric. Prove the following asymptotic isoperimetric inequality. Let A_n ⊂ X^n be a sequence of sets and let B_n ⊂ X^n be a ball such that |ln µ^n(A_n) − ln µ^n(B_n)| = o(n). Let us choose a number 0 < α < 1. Prove that

ln µ^n{x : dist(x, A_n) ≤ αn} ≥ ln µ^n{x : dist(x, B_n) ≤ αn} + o(n).

Lecture 18. Wednesday, February 16

Lecture 17 on Monday, February 14, covered the material in the previous hand-out.

13. Gaussian measure on Euclidean Space as a limit ofthe projection of the uniform measure on the sphere

Having studied product spaces for a while, we go back now with the newlyacquired wisdom to where we started, to the Gaussian measure on Euclidean spaceand the uniform measure on the sphere.

We will need to compute some things on the unit sphere Sn−1 ⊂ Rn. First, wewill need to compute the “surface area” (which we just call “volume”) |Sn−1| of theunit sphere.

(13.1) Lemma. For the unit sphere S^{n−1} ⊂ R^n, we have

|S^{n−1}| = 2π^{n/2}/Γ(n/2),

where

Γ(t) = ∫_0^{+∞} x^{t−1} e^{−x} dx for t > 0

is the Gamma-function.


Proof. Let

p_n(x) = (2π)^{−n/2} e^{−‖x‖²/2} for x ∈ R^n

be the standard Gaussian density in R^n and let

S^{n−1}(r) = {x : ‖x‖ = r}

be the sphere of radius r. We integrate p_n over R^n using polar coordinates:

1 = ∫_{R^n} p_n(x) dx = (2π)^{−n/2} ∫_0^{+∞} e^{−r²/2} |S^{n−1}(r)| dr = (2π)^{−n/2} |S^{n−1}| ∫_0^{+∞} r^{n−1} e^{−r²/2} dr.

The integral reduces to the Gamma-function by the substitution r²/2 = t, that is, r = (2t)^{1/2}:

∫_0^{+∞} r^{n−1} e^{−r²/2} dr = 2^{(n−2)/2} ∫_0^{+∞} t^{(n−2)/2} e^{−t} dt = 2^{(n−2)/2} Γ(n/2).

Therefore,

|S^{n−1}| = (2π)^{n/2} 2^{(2−n)/2}/Γ(n/2) = 2π^{n/2}/Γ(n/2),

as claimed.

We recall that Γ(t + 1) = tΓ(t), so that Γ(t) = (t − 1)! for positive integer t, and that Γ(1/2) = √π. Stirling's formula says that

Γ(t + 1) = √(2πt) (t/e)^t (1 + O(t^{−1})) as t −→ +∞.

We have already shown how to "condense" the uniform probability measure on the sphere of radius ∼ √n in R^n (Section 2) and how to obtain the uniform probability measure on the unit sphere S^{n−1} from the Gaussian measure on R^n via the radial projection R^n \ {0} −→ S^{n−1} (Section 3.3). However, the relation between the Gaussian measure on Euclidean space and the uniform measure on the sphere is more interesting. We show how to obtain the Gaussian measure on Euclidean space as a limit of the push-forward of the uniform measure on the sphere.

(13.2) Theorem. Let Σ_n ⊂ R^n be the sphere centered at the origin and of radius √n,

Σ_n = {x ∈ R^n : ‖x‖ = √n},


and let µ_n be the uniform (that is, rotationally invariant, Borel) probability measure on Σ_n. Let us consider the projection φ : Σ_n −→ R,

(ξ_1, . . . , ξ_n) ↦ ξ_1,

and let ν_n be the push-forward measure on the line R^1, that is,

ν_n(A) = µ_n(φ^{−1}(A))

for any Borel set A ⊂ R. Let γ_1 be the standard Gaussian measure on R with the standard normal density

(1/√(2π)) e^{−ξ²/2}.

Then, for any Borel set A ⊂ R, we have

lim_{n→+∞} ν_n(A) = γ_1(A) = (1/√(2π)) ∫_A e^{−ξ²/2} dξ.

Moreover, the density of ν_n converges to the standard normal density uniformly on compact subsets of R.

Proof. Let us choose an interval A = (a, b) ⊂ R and compute νn(A). Let B =φ−1(A) ⊂ Σn. Then B is the subset of the sphere Σn consisting of the points suchthat a < ξ1 < b. If we fix the value of ξ1, we obtain the set of points that is the(n − 2)-dimensional sphere of radius

√n− ξ21 . Note that since a and b are fixed

and n grows, for a < ξ1 < b, we have dξ1 almost tangent to the sphere, so

(1 + o(1)

)µn(B) =

|Sn−2|n(n−1)/2|Sn−1|

∫ b

a

(n− ξ21)(n−2)/2 dξ1

=|Sn−2|n(n−2)/2

n(n−1)/2|Sn−1|

∫ b

a

(1− ξ21

n

)(n−2)/2

dξ1

=1√π

Γ(n2

)√nΓ(n−1

2

) ∫ b

a

(1− ξ21

n

)(n−2)/2

dξ1.

It follows from Stirling’s formula that

limn−→+∞

Γ(n2

)√nΓ(n−1

2

) =1√2.

Next, (1− ξ2

n

)n−22

= expn− 2

2ln(

1− ξ2

n

).


Using that

ln(

1− ξ2

n

)= −ξ

2

n+O

(n−2

),

we conclude that (1− ξ2

n

)n−22

−→ e−ξ2/2

uniformly on the interval (a, b), which completes the proof.
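Theorem 13.2 is easy to observe numerically. Here is a sketch (not part of the original notes): sample uniform points on the sphere of radius √n by normalizing Gaussian vectors, take the first coordinate, and compare a few tail probabilities with the standard normal ones; the sample sizes are arbitrary choices.

```python
# Projection of the uniform measure on sqrt(n)*S^{n-1} vs. the standard normal.
import math
import random

def first_coordinate_sample(n, samples, rng):
    out = []
    for _ in range(samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in g))
        out.append(math.sqrt(n) * g[0] / norm)   # x uniform on sqrt(n)*S^{n-1}
    return out

def std_normal_tail(t):
    return 0.5 * math.erfc(t / math.sqrt(2.0))

rng = random.Random(1)
xs = first_coordinate_sample(n=500, samples=20000, rng=rng)
for t in (0.5, 1.0, 2.0):
    emp = sum(1 for x in xs if x > t) / len(xs)
    print(f"t={t}: empirical {emp:.4f}  vs  Gaussian tail {std_normal_tail(t):.4f}")
```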

Similarly, the standard Gaussian measure on Euclidean space of any dimensioncan be obtained as the limit of the push-forward of the uniform probability measureon the sphere under the projection onto a subset of coordinates.

(13.3) Corollary. Let Σn ⊂ Rn be the sphere centered at the origin and of radius√n and let µn be the uniform probability measure on Σn. For a fixed k, let us

consider the projection Σn −→ Rk

(ξ1, . . . , ξn) 7−→ (ξ1, . . . , ξk)

and let νn be the push-forward measure on Rk. Let γk be the standard Gaussianmeasure on Rk with the standard normal density

(2π)−k/2e−‖x‖2/2.

Then, for any Borel set A ⊂ Rk, we have

limn−→+∞

νn(A) = γk(A) = (2π)−k/2∫A

e−‖x‖2/2 dx.

Moreover, the density of νn converges to the standard normal density uniformly oncompact subsets of Rk.

Proof. Perhaps the easiest way to obtain the corollary is via the Fourier transform(characteristic functions) method. Let us consider the characteristic function Fn :Rk −→ C of the measure νn: for a vector c ∈ Rk, we have

Fn(c) =∫

Rk

ei〈c,x〉 dνn(x).

Similarly, let G : Rk −→ R be the characteristic function of the standard Gaussianmeasure on Rk:

G(c) = (2π)−k/2∫

Rk

ei〈c,x〉e−‖x‖2/2 dx = e−‖c‖

2/2

(check the formula). The result would follow if we can prove that F_n(c) −→ G(c) uniformly on compact subsets of R^k (we can recover the densities from F_n(c) and G(c) by the inverse Fourier transform).


We can writeFn(c) =

∫Σn

ei〈c,x〉 dµn(x).

Since the measure µn on Σn is rotation-invariant, we can assume thatc = (‖c‖, 0, . . . , 0). Therefore,

Fn(c) =∫

Σn

ei‖c‖ξ1 dµn(x).

Now, using Theorem 13.2, we conclude that as n −→ +∞ we have

Fn(c) −→∫

Rei‖c‖ξ dγ1(ξ) =

1√2π

∫Rei‖c‖ξe−ξ

2/2 dξ = e−‖c‖2/2

and that the convergence is uniform on compact subsets. This is what is needed.

Lecture 19. Friday, February 18

14. Isoperimetric inequalities for the sphere and for the Gaussian measure

Let S^{n−1} ⊂ R^n be the unit sphere

S^{n−1} = {x : ‖x‖ = 1}

with the rotation invariant (Haar) probability measure µ = µ_n and the geodesic metric

dist(x, y) = arccos⟨x, y⟩,

cf. Section 1.1. We define a spherical cap as a ball in the metric of S^{n−1}:

B_a(r) = {x ∈ S^{n−1} : dist(x, a) ≤ r}.

The famous result of P. Levy states that among all subsets A ⊂ S^{n−1} of a given measure, the spherical cap has the smallest measure of the neighborhood.

(14.1) Theorem. Let A ⊂ S^{n−1} be a closed set and let t ≥ 0 be a number. Let B = B_a(r) ⊂ S^{n−1} be a spherical cap such that

µ(A) = µ(B).

Then

µ{x : dist(x, A) ≤ t} ≥ µ{x : dist(x, B) ≤ t} = µ(B_a(r + t)).


We don’t prove this nice theorem here, but extract some corollaries instead.Clearly, if B ⊂ Sn−1 is a spherical cap such that µ(B) = 1/2 then the radius of Bis π/2, so B = Ba(π/2) for some a. In this case,

Ba(π/2 + t) = Sn−1 \B−a(π/2− t)

(why?), soµ (Ba(π/2 + t)) = 1− µ (B−a(π/2− t)) .

We want to estimate the measure of the spherical cap of radius π/2 − t for 0 ≤t ≤ π/2. One way to do it is to note that the cap consists of the vectors x whoseorthogonal projection onto the hyperplane a⊥ has length at most cos(t) ≈ 1− t2/2for small t and use Corollary 3.4, say. This would be good enough, though it doesnot give the optimal constant. One can compute the measure directly. Note thatit is convenient to “shift” the dimension and to prove the result for the (n + 1)-dimensional sphere Sn+2 ⊂ Rn+2.

(14.2) Lemma. For the spherical cap B ⊂ S^{n+1} of radius π/2 − t, we have

µ_{n+2}(B) ≤ √(π/8) e^{−t²n/2}.

Besides, for any t > 0,

µ_{n+2}(B) ≤ (1/2) e^{−t²n/2} (1 + o(1)) as n −→ +∞.

Proof. We introduce a coordinate system so that (1, 0, . . . , 0) becomes the centerof the cap. Let us slice the cap by the hyperplanes ξ1 = cosφ onto n-dimensionalspheres of radius sinφ. Then

µ(B) =|Sn||Sn+1|

∫ π/2−t

0

sinn φ dφ =2π

n+12

Γ(n+1

2

) Γ(n+2

2

)2π

n+22

∫ π/2

t

cosn φ dφ

=Γ(n+2

2

)Γ(n+1

2

)√π

∫ π/2

t

cosn φ dφ =Γ(n+2

2

)Γ(n+1

2

)√π√n

∫ π√n/2

t√n

cosn(ψ√n

)dψ

Now we use the inequality

cosα ≤ e−α2/2 for 0 ≤ α ≤ π

2to bound

µ(B) ≤Γ(n+2

2

)Γ(n+1

2

)√πn

∫ π√n/2

t√n

e−ψ2/2 dψ

≤Γ(n+2

2

)Γ(n+1

2

)√πn

e−t2n/2

∫ (π/2−t)√n

0

e−ψ2/2 dψ

≤Γ(n+2

2

)Γ(n+1

2

)√πn

e−t2n/2

∫ +∞

0

e−ψ2/2 dψ =

Γ(n+2

2

)Γ(n+1

2

)√2ne−t

2n/2


Now it remains to show that

Γ(n+2

2

)Γ(n+1

2

)√2n

≤√π

8,

which is done by observing that replacing n by n + 2 gets the fraction multipliedby

(n+ 4)(n+ 3)

√n

n+ 2,

that is makes is slightly smaller. Hence the maximum value is attained at n = 2 orat n = 1, which is indeed the case and is equal to

√π/8.

As follows from Stirling’s formula (cf. also the proof of Theorem 13.2),

limn−→+∞

Γ(n+2

2

)Γ(n+1

2

)√2n

=12.

An immediate corollary of Theorem 14.1 and Lemma 14.2 is the concentrationresult for Lipschitz functions on the unit sphere.

(14.3) Theorem. Let f : Sn+1 −→ R be a 1-Lipschitz function and let mf be itsmedian, that is, the number such that

µx : f(x) ≥ mf

≥ 1

2and µ

x : f(x) ≤ mf

≥ 1

2.

Then, for any ε > 0,

µx : |f(x)−mf | ≥ ε

≤√π

2e−ε

2n/2.

Besides, for any ε > 0,

µx : |f(x)−mf | ≥ ε

≤ e−ε

2n/2(

1 + o(1))

as n −→ +∞.

Proof. The proof follows the same lines as the proof of Theorem 4.5 (see also Section 10). We define

A_+ = {x : f(x) ≥ m_f} and A_− = {x : f(x) ≤ m_f},

so that µ(A_+), µ(A_−) ≥ 1/2. Therefore, by Theorem 14.1, the measure of the ε-neighborhoods of A_− and A_+ is at least as large as the measure of the ε-neighborhood of a hemisphere, which, by Lemma 14.2, is at least


1 − √(π/8) e^{−ε²n/2}. This, in turn, implies that the measure of the intersection of the ε-neighborhoods of A_+ and A_− is at least 1 − √(π/2) e^{−ε²n/2}. However, for all x ∈ A_+(ε) ∩ A_−(ε) we have |f(x) − m_f| ≤ ε, which completes the proof.
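A numerical sketch of this concentration phenomenon (not from the original notes): on S^{n−1} the coordinate function f(x) = ξ_1 is 1-Lipschitz with median 0, and its deviations are compared with the bound √(π/2)e^{−ε²n/2}. The theorem above is stated for S^{n+1} ⊂ R^{n+2}; below the ambient dimension is plugged into the exponent, which only changes n by 2 and is immaterial for the illustration. All parameter values are arbitrary choices.

```python
# Concentration of the 1-Lipschitz function f(x) = x_1 on the unit sphere.
import math
import random

rng = random.Random(2)
n, samples, eps = 400, 50000, 0.15

count = 0
for _ in range(samples):
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in g))
    if abs(g[0] / norm) >= eps:           # |f(x) - m_f| >= eps with m_f = 0
        count += 1

print("empirical probability:", count / samples)
print("bound               :", math.sqrt(math.pi / 2) * math.exp(-eps ** 2 * n / 2))
```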

Let us turn now to the space R^n with the standard Euclidean distance

dist(x, y) = ‖x − y‖

and the standard Gaussian measure γ = γ_n with the density

(2π)^{−n/2} e^{−‖x‖²/2}.

We define a halfspace as a set of the type

H(α) = {(ξ_1, . . . , ξ_n) : ξ_1 ≤ α}

for some α ∈ R. In general, we call a halfspace the set of points whose scalar product with a given non-zero vector does not exceed a given number.

The following remarkable result, due to C. Borell, states that among all subsets A ⊂ R^n of a given Gaussian measure, halfspaces have the smallest measure of the neighborhood.

(14.4) Theorem. Let A ⊂ R^n be a closed set and let t ≥ 0 be a number. Let H = H(α) ⊂ R^n be a halfspace such that

γ(A) = γ(H).

Then

γ{x : dist(x, A) ≤ t} ≥ γ{x : dist(x, H) ≤ t} = γ(H(α + t)).

We don’t prove this nice theorem (at least, not now), but extract some corollariesinstead. Clearly, if H = H(α) is a halfspace such that µ(H) = 1/2 then α = 0. Inthis case,

γ(H(t)

)=

1√2π

∫ t

−∞e−ξ

2/2 dξ = 1− 1√2π

∫ +∞

t

e−ξ2/2 dξ,

so we want to estimate the integral.

(14.5) Lemma. For the halfspace H = H(t) with t ≥ 0, we have

γ(H) ≥ 1 − e^{−t²/2}.


Besides,

γ(H) ≥ 1 − e^{−t²/2}/t.

Proof. We use the by now standard Laplace transform method. For any λ ≥ 0,

γ{x : ξ_1 ≥ t} ≤ e^{−λt} E e^{λξ_1} = e^{−λt} (1/√(2π)) ∫_{−∞}^{+∞} e^{λξ} e^{−ξ²/2} dξ = e^{λ²/2 − λt}.

Optimizing on λ, we substitute λ = t, from which we get the first estimate. To get the second estimate, we note that

∫_t^{+∞} e^{−ξ²/2} dξ ≤ t^{−1} ∫_t^{+∞} ξ e^{−ξ²/2} dξ = e^{−t²/2}/t.

Just as before, we obtain a concentration result.

(14.6) Theorem. Let f : R^n −→ R be a 1-Lipschitz function and let m_f be its median. Then, for any ε > 0,

γ{x : |f(x) − m_f| ≥ ε} ≤ 2e^{−ε²/2}.

Besides,

γ{x : |f(x) − m_f| ≥ ε} ≤ (2/ε) e^{−ε²/2}.

Hence we observe that as long as ε −→ +∞, the probability that f(x) deviates from the median by more than ε tends to 0.

(14.7) Relations between the isoperimetric inequalities on the sphere and for the Gaussian measure in R^n. One can deduce Theorem 14.4 as a "limit case" of Theorem 14.1. Let us choose a large number N, consider the unit sphere S^{N−1} ⊂ R^N with the uniform probability measure µ and the scaled projection φ : S^{N−1} −→ R^n,

φ : (ξ_1, . . . , ξ_N) ↦ √N (ξ_1, . . . , ξ_n).

Corollary 13.3 tells us that the push-forward ν converges to the Gaussian measure γ on R^n as N grows. Let A ⊂ R^n be a closed bounded set. Then C = φ^{−1}(A) ⊂ S^{N−1} is a closed set and, since we assumed that A is bounded, for every x = (ξ_1, . . . , ξ_N) ∈ C we have |ξ_i| = O(N^{−1/2}) for i = 1, . . . , n. Besides,

µ(C) = γ(A) + o(1).


Let us choose a halfspace H ⊂ Rn, H =x : ξ1 ≤ α

such that γ(H) = γ(A).

Then B = φ−1(H) is the spherical cap in SN−1 centered at (−1, 0, . . . , 0) anddefined by the inequality ξ1 ≤ N−1/2α. Similarly, we have

µ(B) = γ(H) + o(1).

Let us compare the t-neighborhoods of At and Ht of A and H respectively:

At =x ∈ Rn : dist(x,A) ≤ t

and Ht =

x ∈ Rn : ξ1 ≤ α+ t

.

Then

γ(At) = µ(φ−1(At)

)+ o(1) and γ(Ht) = µ

(φ−1(Ht)

)+ o(1).

Now, φ−1(Ht) is the spherical cap in SN−1 defined by the inequality ξ1 ≤ N−1/2(α+t). This is not the N−1/2t-neighborhood of the spherical cap B but somethingvery close to it. On the sphere SN−1 we measure distances by the length of thegeodesic arc. However, if the points on the sphere are close enough, the lengthof the arc approaches the Euclidean distance between the points. That is, if thegeodesic length is β, say, then the Euclidean distance is 2 sinβ/2 = β + O(β3) forsmall β. This allows us to prove that if we choose ε = N−1/2t and consider theε-neighborhood Bε of B, we get

µ(Bε) = µ(φ−1(Ht)

)+ o(1).

To claim that, we need the estimate of Lemma 13.1 for the volume of the sphere.Namely, we need that |SN−1|/|SN−2| = O(N1/2), so that by modifying distancesby something of the order of O(N−3/2) we don’t lose any substantial volume.

Similarly, φ−1(At) is not the ε-neighborhood Cε of C but something very closeto it, so

µ(Cε) = µ(φ−1(At)

)+ o(1).

By the isoperimetric inequality on the sphere, µ(Cε) ≥ µ(Bε) and hence

γ(At) ≥ γ(Ht) + o(1).

Taking the limit as N −→ +∞, we get

γ(At) ≥ γ(Ht),

that is, the isoperimetric inequality for the Gaussian measure.We assumed that A is bounded, but the inequality extends to unbounded sets

by a limit argument (approximate A by bounded subsets).


(14.8) Looking at the unit sphere through a microscope. What we did in Section 14.7 provokes the following thought. Let us look at a high-dimensional unit sphere S^{N−1}. Let us choose some constant number n of coordinates ξ_1, . . . , ξ_n and let us consider the t-neighborhood of the section ξ_1 = . . . = ξ_n = 0 for t = ε_N N^{−1/2}, where ε_N −→ +∞, however slowly. Then the measure of the sphere outside of the neighborhood tends to 0 as N grows. In the "normal direction" the neighborhood of any particular point of the section ξ_1 = . . . = ξ_n = 0 looks something akin to R^n endowed with the Gaussian measure.

Lecture 20. Monday, February 21

15. Another concentration inequality for the Gaussian measure

We discuss a relatively short, though far from straightforward, proof of a concen-tration result for the Gaussian measure in Rn. The result concerns concentrationwith respect to the average, not the median, and extends to a somewhat largerclass of functions f : Rn −→ R than Lipschitz functions. The proof is due to B.Maurey and G. Pisier.

The proof is based on several new (to us) ideas.

The first idea: a function f : R^n −→ R does not deviate much from its average if and only if f does not deviate much from itself. Namely, let us take a second copy of R^n and consider the direct sum R^{2n} = R^n ⊕ R^n endowed with the product Gaussian measure γ_{2n} = γ_n × γ_n. Let us define F : R^{2n} −→ R by F(x, y) = f(x) − f(y). Then f concentrates around some number if and only if F concentrates around 0.

The second idea: the Gaussian measure is invariant under orthogonal transformations. That is, for any measurable A ⊂ R^n and any orthogonal transformation U : R^n −→ R^n, for B = {Ux : x ∈ A} we have γ_n(B) = γ_n(A). We will also state this as follows: if x = (ξ_1, . . . , ξ_n) is a vector of independent standard Gaussian random variables and U is an orthogonal transformation, then Ux is a vector of independent standard Gaussian random variables. In particular, we will use the following orthogonal transformation of R^{2n} = R^n ⊕ R^n:

(x, y) ↦ ((sin θ)x + (cos θ)y, (cos θ)x − (sin θ)y).

Next, we will work with differentiable functions f : R^n −→ R. Let

∇f = (∂f/∂ξ_1, . . . , ∂f/∂ξ_n)

denote the gradient of f. As usual,

‖∇f‖² = Σ_{i=1}^n (∂f/∂ξ_i)²


is the squared length of the gradient. If ‖∇f‖ ≤ 1 then f is 1-Lipschitz. If f is 1-Lipschitz, it can be approximated by smooth functions with ‖∇f‖ ≤ 1 + ε (we will not discuss how and why).

For a function f : R^n −→ R we denote by E f its average with respect to the standard Gaussian measure in R^n:

E f = (2π)^{−n/2} ∫_{R^n} f(x) e^{−‖x‖²/2} dx.

If f is the exponential of a linear function, the expectation can be computed exactly:

E f = e^{‖a‖²/2} for f(x) = e^{⟨a,x⟩}.

We will repeatedly (namely, twice) use the following particular version of Jensen's inequality: if f : X −→ R is a function on a probability space (X, µ), then

∫_X e^f dµ ≥ exp{∫_X f dµ},

which follows since e^x is a convex function.

One, by now standard, feature of the result is that we use the Laplace transform estimate.

estimate.

(15.1) Theorem. Let f : Rn −→ R be a smooth function such that E(f) = 0.Then

E expf ≤ E expπ2

8‖∇f‖2

,

assuming that both integrals are finite.

Proof. Let us consider the second copy of R^n endowed with the standard Gaussian measure γ_n and let R^{2n} = R^n ⊕ R^n be endowed with the product measure γ_{2n}.

Let us define F : R^{2n} −→ R by F(x, y) = f(x) − f(y). We claim that

E exp{f} ≤ E exp{F}.

Indeed,

E exp{F} = ∫_{R^n ⊕ R^n} e^{f(x)} e^{−f(y)} dγ_{2n}(x, y) = ∫_{R^n} e^{f(x)} (∫_{R^n} e^{−f(y)} dγ_n(y)) dγ_n(x).

Now, E(−f) = 0 and hence by Jensen's inequality (first use) the inner integral is at least 1. Therefore, E e^F ≥ E e^f, as claimed.

We will be proving that

E exp{F} ≤ E exp{(π²/8) ‖∇f‖²}.


Let us consider the following function

G : Rn × Rn × [0, π/2] −→ R, G(x, y, θ) = f(

(sin θ)x+ (cos θ)y).

Then

G(x, y, 0) = f(y), G(x, y, π/2) = f(x), and

F (x, y) = G(x, y, π/2)−G(x, y, 0) =∫ π/2

0

∂θG(x, y, θ) dθ.

Therefore,

E expF = E exp∫ π/2

0

∂θG(x, y, θ) dθ

.

Of course, we would like to apply Jensen’s inequality once again and switch expand the integral against dθ. This can be done but with some care since dθ is not aprobability measure on the interval [0, π/2]. To make it a probability measure, wehave to replace dθ by (2/π) dθ thus multiplying the integrand by π/2. This leadsto

exp∫ π/2

0

∂θG(x, y, θ) dθ

≤ 2π

∫ π/2

0

expπ

2∂

∂θG(x, y, θ)

dθ.

Hence

E expF ≤ 2π

∫ π/2

0

E expπ

2∂

∂θG(x, y, θ)

dθ.

Now (note that we switched two integrals in the process).Now,

∂θG(x, y, θ) =〈∇f(x′), y′〉 where

x′ = (sin θ)x+ (cos θ)y and y′ = (cos θ)x− (sin θ)y.

In words: we compute the gradient of f at the point x′ = (sin θ)x + (cos θ)y andtake its scalar product with the vector y′ = (cos θ)x− (sin θ)y.

Here is the punch-line: we want to compute the expectation of

expπ

2〈∇f(x′), y′〉

.

However, the vector (x′, y′) is obtained from the vector (x, y) by a rotation. Since(x, y) has the standard Gaussian distribution, (x′, y′) also has the standard Gauss-ian distribution. In particular, x′ and y′ are independent. Therefore, we can firsttake the expectation with respect to y′ and then with respect to x′. But with respectto y′, we are dealing with the exponentiation of a linear function. Therefore,

Ey′ expπ

2〈∇f(x′), y′〉

= exp

π2

8‖∇f(x′)‖2

.


In the end, we get

E expF ≤ Ex′2π

∫ π/2

0

expπ2

8‖∇f(x′)‖2

dθ = E exp

π2

8‖∇f‖2

,

as claimed

(15.2) Corollary. Let f : R^n −→ R be a 1-Lipschitz function and let a = E f be the average value of f. Then

γ_n{x : f(x) − a ≥ t} ≤ exp{−2t²/π²} and γ_n{x : f(x) − a ≤ −t} ≤ exp{−2t²/π²}.

Proof. Without loss of generality, we assume that f is smooth and that ‖∇f‖ ≤ 1. We apply the Laplace transform method. To prove the first inequality, we choose a λ ≥ 0 and claim that

γ_n{x : f(x) − a ≥ t} ≤ e^{−λt} E e^{λ(f−a)} ≤ e^{−λt} exp{(π²/8) λ²},

where the last inequality follows from Theorem 15.1 applied to λ(f − a). Now we optimize on λ by substituting

λ = 4t/π².

The second inequality is proved in a similar way.

If we ask ourselves, what are the “internal reasons” why the proof of Theorem15.1 worked, the following picture seems to emerge. Instead of proving that f :X −→ R concentrates about any particular number, say, its expectation or median,we prove that f concentrates “by itself”, that is, that |f(x)−f(y)| is small typically.For that, we double the space X to its direct product X×X with itself. This givesus more freedom to choose a path connecting x and y. We show that if we choosea path Γ the right way, the gradient of f will be more or less independent on thedirection of Γ, so for typical x and y the change |f(x)− f(y)| will not be that big.

(15.3) Concentration on the sphere as a corollary. One can obtain a concentration result for 1-Lipschitz functions on the unit sphere S^{n−1} by thinking of the invariant probability measure µ_n on S^{n−1} as the push-forward of the Gaussian measure γ_n via the radial projection R^n \ {0} −→ S^{n−1}. If f : S^{n−1} −→ R is a 1-Lipschitz function with E f = 0, then the function g : R^n −→ R defined by g(x) = ‖x‖ f(x/‖x‖) is 3-Lipschitz on R^n.

Lecture 21. Wednesday, February 23


16. Smoothing Lipschitz functions

In Section 15, we mentioned that any 1-Lipschitz function f : Rn −→ R canbe approximated by a smooth 1-Lipschitz function. In this section, we describe asimple way to do it.

(16.1) Theorem. Let g : R^n −→ R be a 1-Lipschitz function. For ε > 0 let

B_ε(x) = {y : ‖y − x‖ ≤ ε}

be the ball of radius ε centered at x. Let us define f : R^n −→ R by

f(x) = (1/vol B_ε(x)) ∫_{B_ε(x)} g(y) dy

(in words: f(x) is the average of g over the ball of radius ε centered at x). Then

(1) the function f is smooth and ‖∇f(x)‖ ≤ 1 for all x ∈ R^n;
(2) we have |f(x) − g(x)| ≤ εn/(n + 1) ≤ ε for all x ∈ R^n.

Proof. We check the case of n = 1 first when computations are especially simple.In this case, g : R −→ R is a 1-Lipschitz function and f : R −→ R is defined by

f(x) =12ε

∫ x+ε

x−εg(y) dy.

Therefore, we have

f ′(x) =g(x+ ε)− g(x− ε)

2ε.

Since g is 1-Lipschitz, we have |f ′(x)| ≤ 1 for all x ∈ R and

|f(x)− g(x)| =∣∣∣ 12ε

∫ x+ε

x−εg(y) dy − 1

∫ x+ε

x−εg(x) dy

∣∣∣≤ 1

∫ x+ε

x−ε|g(y)− g(x)| dy ≤ 1

∫ x+ε

x−ε|y − x| dy

=ε2

2ε=ε

2.


Suppose now that n > 1. Let us choose a unit vector u ∈ Rn. Strictly speaking,we will not prove that f is smooth, although it can be deduced from our construc-tion. Instead, we are going to prove that f is differentiable in the direction of uand that the derivative of f in the direction of u does not exceed 1 in the absolutevalue, that is,

|〈∇f, u〉| ≤ 1.

Without loss of generality, we assume that x = 0 and that u = (0, . . . , 0, 1). Thenthe orthogonal complement u⊥ can be identified with Rn−1.

LetDε =

z ∈ Rn−1 : ‖z‖ < ε

be the open ball centered at 0 of radius ε in Rn−1. Then, for any z ∈ Dε, theintersection of the line in the direction of u through z and Bε = Bε(0) is theinterval with the endpoints

(z,−a(z)

)and

(z, a(z)

), where a(z) > 0.

We note that

(16.1.1) volBε =∫Dε

2a(z) dz.

Let us estimate the derivative of f in the direction of u, that is 〈∇f, u〉. For anyτ ∈ R, we have

f(τu) =1

volBε

∫Bε+τu

g(z) dz =1

volBε

∫Dε

(∫ a(z)+τ

−a(z)+τg(z, ξ)dξ

)dz

and∂f(τu)∂τ

∣∣∣τ=0

=1

volBε

∫Dε

g(z, a(z)

)− g(z,−a(z)

)dz.

Since g is 1-Lipschitz, we have |g(z, a(z)

)− g(z,−a(z)

)| ≤ 2a(z). Therefore, in

view of (16.1.1),∂f(τu)∂τ

∣∣∣τ=0

= 〈∇f, u〉 ≤ 1

and since u was arbitrary, we get Part (1).To prove Part (2), we write

|f(0)− g(0)| =∣∣∣ 1volBε

∫Bε

g(y)− g(0) dy∣∣∣ ≤ 1

volBε

∫Bε

|g(y)− g(0)| dy

≤ 1volBε

∫Bε

‖y‖ dy =|Sn−1|volBε

∫ ε

0

rn dr =ε|Sn−1|

(n+ 1) volB1.

It is not hard to show that|Sn−1| = n volB1,


from which Part (2) follows.

PROBLEM. Prove that

‖∇f(x) − ∇f(y)‖ ≤ (c√n/ε) ‖x − y‖

for some absolute constant c > 0.

17. Concentration on the sphere and functions almost constant on a subspace

We discuss one notable corollary of the concentration on the sphere. It claims that for any fixed ε > 0 and any 1-Lipschitz function f : S^{n−1} −→ R there exists a subspace L ⊂ R^n of dimension linear in n such that the restriction of f onto L is ε-close to a constant on the unit sphere L ∩ S^{n−1} in L.

(17.1) Theorem. There exists an absolute constant κ > 0 with the following property. For any ε > 0, for any n, and for any 1-Lipschitz function f : S^{n−1} −→ R, there exists a number c (which can be chosen to be the median of f or the average value of f) and a subspace L ⊂ R^n such that

|f(x) − c| ≤ ε for all x ∈ L ∩ S^{n−1}

and

dim L ≥ (κε²/ln(1/ε)) n.

One crucial argument in the proof is the existence of moderately sized δ-nets. It is interesting in its own right.

(17.2) Lemma. Let V be an n-dimensional normed space with norm ‖ · ‖ and let

Σ = {x ∈ V : ‖x‖ = 1}

be the unit sphere in V. Then, for any δ > 0, there exists a set S ⊂ Σ such that

(1) for every x ∈ Σ there is a y ∈ S such that ‖x − y‖ ≤ δ, so S is a δ-net in Σ;
(2) we have |S| ≤ (1 + 2/δ)^n for the cardinality |S| of S.

Similarly, for the unit ball

B = {x ∈ V : ‖x‖ ≤ 1}

there exists a δ-net S ⊂ B of cardinality |S| ≤ (1 + 2/δ)^n.


Proof. Let us choose S ⊂ Σ to be the maximal subset with the property that‖y1 − y2‖ > δ for any two points y1, y2 ∈ S. In other words, we cannot includeinto S any additional point x ∈ Σ without this property being violated. Clearly, Smust be a δ-net for Σ. For every y ∈ S, let us consider a ball B(y, δ/2) of radiusδ/2 centered at y. Then the balls are pairwise disjoint:

B(y1, δ/2) ∩B(y2, δ/2) = ∅ provided y1 6= y2.

Moreover, each ball B(y, δ/2) lies within the ball centered at the origin and ofradius 1 + δ/2 = (2 + δ)/2, so

⋃y∈S

B(y, δ/2) ⊂ B

(0, 1 +

δ

2

)= B

(0,

2 + δ

2

).

Estimating the volume of the union, we get that

|S| volB(0, δ/2) ≤ volB(

0,2 + δ

2

).

Since the volume of the n-dimensional ball is proportional to the nth power of theradius, we get

|S|(δ

2

)n≤(

2 + δ

2

)n,

from which (2) follows.

Now we are ready to prove Theorem 17.1.

Proof. Let us choose a k-dimensional subspace A ⊂ Rn (k will be adjusted later).The goal is to prove that for a random orthogonal transformation U ∈ On, the“rotated” subspace L = U(A) =

Ux : x ∈ A

will satisfy the desired properties.

For that, let us choose an ε/2-net S ⊂ A ∩ Sn−1. As follows from Lemma 17.2, wecan choose S such that

|S| ≤(1 + 4ε−1

)k= exp

k ln(1 + 4ε−1)

.

LetX =

x ∈ Sn−1 : |f(x)− c| ≤ ε/2

.

As follows from the concentration results (cf. Theorem 14.3 or Section 15.3),

µ(X) ≥ 1− e−αnε2

for some absolute constant α > 0.70

Page 71: Measure Concentration

If we manage to find an orthogonal transformation U such that U(S) ⊂ X thenfor L = U(A) the restriction of f onto L ∩ Sn−1 does not deviate from c by morethan ε. Indeed, for every x ∈ L∩ Sn−1 there is a y ∈ U(S) such that ‖y− x‖ ≤ ε/2(since U(S) is an ε/2-net for L ∩ Sn−1) and |f(y) − c| ≤ ε/2 (since y ∈ X). Sincef is 1-Lipschitz, we get that |f(x)− c| ≤ ε.

Now, since X is pretty big and S is reasonably small, one can hope that a randomorthogonal transformation U will do.

Let ν = νn be the Haar probability measure on the orthogonal group On (cf.Section 3.5). Let us pick a particular x ∈ S. As U ranges over the group On, thepoint Ux ranges over the unit sphere Sn−1. Therefore,

νU : Ux /∈ X

= µ

(Sn−1 \X

)≤ e−αnε

2

(we used a similar reasoning in Section 3). Therefore,

ν{U : Ux ∉ X for some x ∈ S} ≤ |S| e^{−αnε²} ≤ exp{k ln(1 + 4ε^{−1}) − αnε²}.

We can make the upper bound less than 1 by choosing

k = O(ε²n/ln(1/ε)).
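The dimension count at the end of the proof is just a one-line computation; here is a sketch (not from the original notes). The constant alpha below is an arbitrary placeholder: the text only asserts that some positive absolute constant α works.

```python
# Largest admissible dimension k with k*ln(1 + 4/eps) < alpha*n*eps^2,
# illustrating the bound dim L = O(eps^2 * n / ln(1/eps)) of Theorem 17.1.
import math

def max_dimension(n, eps, alpha=0.05):
    return int(alpha * n * eps ** 2 / math.log(1.0 + 4.0 / eps))

for n in (10 ** 3, 10 ** 4, 10 ** 5):
    print(n, max_dimension(n, eps=0.1))   # grows linearly in n for fixed eps
```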

Lecture 22. Friday, February 25

18. Dvoretzky’s Theorem

We discuss one of the first and most famous applications of measure concen-tration, Dvoretzky’s Theorem, conjectured by A. Grothendieck in 1956, proved byA. Dvoretzky in 1961, reproved by V. Milman in 1971 using the concentration ofmeasure on the unit sphere, and by T. Figiel, J. Lindenstrauss, and V. Milmanwith better constants and broad extensions and ramifications in 1977.

(18.1) Definition. Let V be an n-dimensional normed space with norm ‖ · ‖.We say that V is ε-close to Euclidean space Rn if there exists a vector spacesisomorphism φ : V −→ Rn such that

(1− ε)‖φ(x)‖Rn ≤ ‖x‖V ≤ (1 + ε)‖φ(x)‖Rn , for all x ∈ V

where ‖x‖V is measured with respect to the norm in V and ‖φ(x)‖Rn is measuredwith respect to the Euclidean norm in Rn.

First we state an infinite-dimensional version of the theorem.71

Page 72: Measure Concentration

(18.2) Theorem. Let W be an infinite-dimensional normed space with norm ‖ · ‖and let ε > 0 be a number. Then for any positive integer n there exists a subspaceV ⊂ W with dimV = n and such that the space V with the norm ‖ · ‖ inheritedfrom W is ε-close to Euclidean space Rn.

This does sound counterintuitive: for example, choose W to be the space ofcontinuous functions f : [0, 1] −→ R with the norm

‖f‖ = max0≤x≤1

|f(x)|

choose ε = 0.1 and try to present an n-dimensional subspace V of W ε-close toEuclidean space.

Theorem 18.2 is deduced from its finite-dimensional version.

(18.3) Theorem. For any ε > 0 and any positive integer k there exists a positiveinteger N = N(k, ε) such that for any normed space W with dimW ≥ N thereexists a k-dimensional subspace V of W which is ε-close to Euclidean space Rk.

Currently, the best bound for N is N = expO(ε−2k

). We will not prove

Dvoretzky’s Theorem but explain how one could get a slightly weaker bound N =expO

(ε−2 ln(1/ε)k

). The exponential dependence on k is optimal, the corre-

sponding example is given by W = `∞N consisting of all N -tuples x = (ξ1, . . . , ξN )of real numbers with the norm

‖x‖∞ = maxi=1,... ,N

|ξi|.

The main thrust of the proof is Theorem 17.1.

(18.4) A plan of the proof of Dvoretzky’s Theorem. Let W be an n-dimensional normed space with norm ‖ · ‖W . Our goal is to find a k-dimensionalsubspace V ⊂W which is ε-close to Euclidean space.

Step 1. Let us round. We may think of W as of Rn endowed with the norm‖ · ‖W as opposed to the standard Euclidean norm ‖ · ‖. This way, we can freelyoperate with the standard Euclidean scalar product 〈x, y〉 in W .

An ellipsoid E centered at a ∈ Rn is a set of the type

E =x ∈ Rn : q(x− a) ≤ 1

,

where q : Rn −→ R is a positive definite quadratic form. A famous result dueto F. John states that every convex body K contains a unique ellipsoid of themaximum volume and is contained in a unique ellipsoid of the minimum volume. Ifthe convex body is symmetric about the origin, that is, K = −K, then the situationis especially attractive: in this case, both ellipsoids have to be symmetric and so

72

Page 73: Measure Concentration

have to be centered at the origin. If E ⊂ K is the maximum volume ellipsoid thenK ⊂

√nE and if E ⊃ K is the minimum volume ellipsoid then n−1/2E ⊂ K (if K

is not symmetric, we should dilate by a factor of n in the worst case).Let

K =x ∈ Rn : ‖x‖W ≤ 1

be the unit ball of the norm ‖ · ‖W . We find the maximum volume ellipsoid E ⊂ Kand choose a new scalar product in Rn in which E is the standard unit ball B:

B =x ∈ Rn : ‖x‖ ≤ 1

.

Hence we have B ⊂ K ⊂√nB. After this step, we can be more specific: we will

look for a k-dimensional subspace V ⊂ Rn for which the restriction of ‖ · ‖W ontoV is ε-close to the restriction of the Euclidean norm ‖ · ‖ onto V .

Step 2. Let us dualize. Let

K =x ∈ Rn : 〈x, y〉 ≤ 1 for all y ∈ D

be the polar dual of D. The standard duality argument implies that

‖x‖W = maxy∈K

〈x, y〉.

Besides,n−1/2B ⊂ K ⊂ B.

Step 3. Let us estimate the average or the median. Let us consider the norm‖x‖W as a function on the unit sphere Sn−1. We want to obtain a lower boundfor its median m or average a on the sphere. Since K contains the ball of radiusn−1/2, the median is at least n−1/2, but this bound is too weak for our purposes.A stronger (though still correct) bound is

m,a ≥ α

√lnnn

for some absolute constant α > 0. This is deduced from Dvoretzky-Rogers Lemma,which we don’t discuss here. Instead, we discuss some intuitive reasons where anextra

√lnn appears from.

One can argue that B is the smallest volume ellipsoid containing K. Therefore,there must be some points of K on the boundary of B, and, in some sense, theyshould be spread sufficiently evenly on the surface of B. If, for example, ∂B ∩K

73

Page 74: Measure Concentration

contains an orthonormal basis (which happens, for example, if K is a cube so K

is an octahedron), then

‖x‖W ≥ maxi=1,... ,n

|ξi| for x = (ξ1, . . . , ξn)

and the average value of ‖xW ‖ is at least as large as∫Sn−1

maxi=1,... ,n

|ξi| dµn(x).

It is not hard to argue that the integral is of the order of

c

√lnn

for some positive c. Perhaps the easiest way to deduce this is by passing to theGaussian measure and proving that∫

Rn

maxi=1,... ,n

|ξi| dγn(x) ∼√

lnn as n −→ +∞.

In general, it is not true that K contains n orthonormal vectors on the boundaryof B, but something close to it is true. F. John’s criterion for the optimality of theminimum volume ellipsoid states that there are vectors xi ∈ ∂B ∩K, i ∈ I suchthat ∑

i∈Iλixi ⊗ xi = I,

where λi ≥ 0, x ⊗ x denotes the square matrix with the entries ξiξj for x =(ξ1, . . . , ξn) and I is the identity matrix. This turns out to be enough to produceenough (in fact, about n/2 ) orthonormal vectors xi ∈ K

i that are sufficiently long(‖xi‖ ≥ 1/2). This is what Dvoretzky-Rogers Lemma is about, and this is enoughto establish the lower bound.

Step 4. Let us apply (modified) Theorem 17.1. Considering ‖x‖W as a functionon the unit sphere Sn−1, we would like to use Theorem 17.1 to claim that there existsa section of the sphere of a moderately high dimension such that the restriction of‖x‖W onto that section is almost a constant. This would do the job. However,if we just view ‖x‖W as a 1-Lipschitz function (which it is) without any specialproperties (which it does have) and just apply Theorem 17.1, we’ll get nothing.

We will go back to the proof of Theorem 17.1 and use that ‖x‖W is homoge-neous of degree 1 and convex with the goal of replacing the additive error by themultiplicative error.

Let a be the average or the median of ‖x‖W on Sn−1, so

a = Ω

(√lnn

).

74

Page 75: Measure Concentration

We are given an ε > 0 and an integer k. We would like to prove that if n is largeenough, there will be a k-dimensional subspace L for which the restriction of ‖ · ‖Wonto L is ε-close to the average. Let us choose a k-dimensional subspace A ⊂ Rnand consider the intersection Sk−1 = A∩ Sn−1. Let us choose a (small) δ = δ(ε) tobe adjusted later and let us construct a net S ⊂ Sk−1 such that

conv(S) ⊂ Sk−1 ⊂ (1 + δ) conv(S),

where conv is the convex hull. If we manage to rotate A 7−→ L = U(A) by anorthogonal transformation U in such a way that

(1− δ)a ≤ ‖U(x)‖W ≤ (1 + δ)a for all x ∈ S,

we’ll have(1− δ)a ≤ ‖x‖W ≤ (1 + δ)2a for all x ∈ L.

Thus we should choose δ ≤ ε such that (1 + δ)2 < (1 + ε).The net S, however large, is finite. Going back to the proof of Theorem 17.1,

we need to estimate the probability, that for every particular x ∈ X, for a randomorthogonal transformation U we don’t have

(1− δ)a ≤ ‖U(x)‖W ≤ (1 + δ)a.

From the concentration inequality on the sphere, this probability is at most

exp−Ω(nδ2a2)

= exp

−Ω

(nδ2

lnnn

)= exp

−Ω(lnn)

,

that is, tends to 0 as n grows. This implies that for a sufficiently large n, we willbe able to produce such a transformation U .

Lecture 23. Monday, March 7

18. Dvoretzky’s Theorem, continued

The following simple estimate shows that the dimension of an “almost Euclideansection” V ⊂W cannot grow faster than log dimW in the worst case.

(18.5) Lemma. Let B ⊂ Rn+2 be the unit ball and let P ⊂ Rn+2 be a polyhedrondefined by m inequalities

P =x ∈ Rn+2 : 〈ui, x〉 ≤ αi for i = 1, . . . ,m

.

75

Page 76: Measure Concentration

Suppose that ‖ui‖ = 1 for i = 1, . . . ,m and that αi ≥ 1, so B ⊂ P . Supposefurther, that P ⊂ ρB for some ρ ≥ 1. Then

2ρ2 lnm ≥ n.

Proof. Let us choose a t > 0 such that me−t2n/2 < 1 and let us consider the

spherical caps of radius π/2 − t centered at ui for i = 1, . . . ,m. As follows byLemma 14.2, these caps fail to cover the whole sphere. Therefore, there is a pointv ∈ Sn+1 such that

〈v, ui〉 < cos(π

2− t)

= sin t < t for i = 1, . . . ,m.

Therefore, t−1v ∈ P and so t−1 ≤ ρ.Now, we can choose any

t ≥√

2 lnmn

and so √n

2 lnm≤ ρ,

which completes the proof.

(18.6) Almost Euclidean sections of `∞. Let W be the space Rn albeit withthe norm

‖x‖W = maxi=1,... ,n

|ξi| for x = (ξ1, . . . , ξn).

Such a space W is called the `∞n space. The unit ball

P =x ∈W : ‖x‖W ≤ 1

is the cube |ξi| ≤ 1 defined by 2n inequalities. If we manage to find an m-dimensional subspace V ⊂ W which is ε-close to Euclidean space, then, after alinear transformation, the polyhedron P ∩ V contains the Euclidean unit ball Band is contained in ρB for ρ = (1 + ε)/(1− ε) ≈ 1 + 2ε for ε ≈ 0. Applying Lemma18.5, we get

2ρ2 ln(2n) ≥ m+ 2.

Thus we must havem = O(lnn).

Thus the dimension of an almost Euclidean section cannot grow faster than thelogarithm of the dimension of the ambient space in the worst case. However, some-what disappointingly, we are unable to recover the worst possible dependence onε.

76

Page 77: Measure Concentration

PROBLEM. Let W be the space Rn with the norm

‖x‖W =n∑i=1

|ξi| for x = (ξ1, . . . , ξn).

Such a space W is called the `1n space. Let Sn−1 ⊂ Rn be the Euclidean unit spherein Rn. Let c be the average value of ‖x‖W on Sn−1. Prove that c =

√2n/π and

deduce from Theorem 17.1 that for any ε > 0 space W has a subspace V ⊂ W ,which is ε-close to Euclidean space and such that dimV = Ω(n).

In fact, there is a subspace of dimension n/2 which is constant-close to Eu-clidean space. This cannot be deduced from Theorem 17.1 and requires a differenttechnique.

Lecture 24. Wednesday, March 9

19. The Prekopa-Leindler Inequality

We prove a very useful inequality, called the Prekopa-Leindler inequality, whichallows us to establish concentration in a wide variety of situations.

(19.1) Theorem. Let f, g, h : Rn −→ R be non-negative integrable functions andlet α, β > 0 be numbers such that α+ β = 1 and

h(αx+ βy) ≥ fα(x)gβ(y) for all x, y ∈ Rn.

Then ∫Rn

h dx ≥(∫

Rn

f(x) dx)α(∫

Rn

g(x) dx)β

.

Proof. The proof is by induction on the dimension n.Suppose that n = 1. We may assume that f(x) > 0 and g(x) > 0 for all x ∈ R.

Scaling, if needed, we assume that∫ +∞

−∞f(x) dx =

∫ +∞

−∞g(x) dx = 1.

This allows us to think of f and g as densities of probability distributions withstrictly increasing cumulative distribution functions

F (t) =∫ t

−∞f(x) dx and G(t) =

∫ t

−∞g(x) dx.

77

Page 78: Measure Concentration

Thus, F,G : R −→ (0, 1). Let u(t) be the inverse of F and v(t) be the inverse of G,so u, v : (0, 1) −→ R and∫ u(t)

−∞f(x) dx = t and

∫ v(t)

−∞g(x) dx = t for t ∈ (0, 1).

Besides, u(t) and v(t) are both smooth and increasing. Differentiating both inte-grals, we get

(19.1.1) u′(t)f(u(t)

)= v′(t)g

(v(t)

)= 1.

Letw(t) = αu(t) + βv(t).

Then w(t) is smooth and increasing. Furthermore,

w′(t) = αu′(t) + βv′(t) ≥(u′(t)

)α(v′(t)

)β,

since ln is a concave function. In particular, w : (0, 1) −→ R is a smooth increasingfunction. Let us make the substitution x = w(t) in the integral of h over R. Wecan write∫ +∞

−∞h(x) dx =

∫ 1

0

h(w(t)

)w′(t) dt ≥

∫ 1

0

f(u(t)

)αg(v(t)

)β(u′(t)

)α(v′(t)

)βdt

=∫ 1

0

(f(u(t)

)u′(t)

)α (g(v(t)

)v′(t)

)βdt = 1,

where we used (19.1.1) in the last equality. This proves the inequality in the caseof n = 1.

Suppose that n > 1. Let us slice Rn into flats ξn = τ . Each such a flat can beidentified with Rn−1. Let us define

f1(τ) =∫

Rn−1f(y, τ) dy, g1(τ) =

∫Rn−1

g(y, τ) dy, and

h1(τ) =∫

Rn−1h(y, τ) dy,

where dy is the Lebesgue measure on Rn−1. Then f1, g1, and h1 are univariate non-negative integrable functions. Let us fix some τ1, τ2 ∈ R. Then for any y1, y2 ∈ Rn−1

we haveh(αy1 + βy2, ατ1 + βτ2) ≥ f(y1, τ1)αg(y2, τ2)β .

Applying the induction hypothesis to the (n− 1)-variate functions h(·, ατ1 + βτ2),f(·, τ1), and g(·, τ2), we get

h1(ατ1 + βτ2) ≥ (f1(τ1))α (g1(τ2))β .78

Page 79: Measure Concentration

Therefore, by the already proven univariate case,∫ +∞

−∞h1(τ) dτ ≥

(∫ ∞

−∞f1(τ) dτ

)α(∫ +∞

−∞g1(τ) dτ

)β.

However, by Fubini’s Theorem,∫ +∞

−∞h1(τ) dτ =

∫Rn

h(x) dx,∫ ∞

−∞f1(τ) dτ =

∫Rn

f(x) dx, and∫ ∞

−∞g1(τ) dτ =

∫Rn

g(x) dx.

This completes the proof.

20. Logarithmically concave measures.The Brunn-Minkowski inequality

We start with some definitions.

(20.1) Definitions. A function p : Rn −→ R is called logarithmically concave orlog-concave if it is non-negative: p(x) ≥ 0 for all x ∈ Rn and

p(αx+ βy) ≥ pα(x)pβ(y) for all α, β ≥ 0 such that α+ β = 1and all x, y ∈ Rn.

We agree that 00 = 1. Equivalently, p is log-concave if p(x) ≥ 0 and ln f is concave.A measure µ on Rn is called log-concave if µ has a density which is a measurable

log-concave function f . Examples of log-concave measures include the Lebesguemeasure dx and the standard Gaussian measure γn.

Let A,B ⊂ Rn be sets. The set

A+B =x+ y : x ∈ A and y ∈ B

is called the Minkowski sum of A and B.

For a X ⊂ Rn, let [X] : Rn −→ R be the indicator function of X:

[X](x) =

1 if x ∈ X0 if x /∈ X.

PROBLEMS.1. Prove that the product of log-concave functions is log-concave.2. Let A ⊂ Rn be a convex body (a compact convex set with non-empty interior).

Let us define a measure ν on Rn by ν(X) = vol(X∩A). Prove that ν is a log-concavemeasure.

79

Page 80: Measure Concentration

3. Prove that the Minkowski sum of convex sets is a convex set.4. Let A,B ⊂ Rn be sets and let α be a number. Prove that α(A+B) = αA+αB.5. Let A be a convex set and let α, β ≥ 0 be numbers. Prove that (α + β)A =

αA + βA. Prove that in general the convexity and non-negativity assumptionscannot be dropped.

The following is the famous Brunn-Minkowski inequality.

(20.2) Theorem. Let µ be a log-concave measure on Rn, let α, β ≥ 0 be numberssuch that α+β = 1, and let A,B ⊂ Rn be measurable sets such that the set αA+βBis measurable. Then

µ(αA+ βB) ≥ µα(A)µβ(B) for all α, β ≥ 0 such that α+ β = 1.

Proof. Let p be the log-concave density of µ. Let us define functions f, g, h : Rn −→R by f = p[A], g = p[B], and h = p[αA+ βB] (we mean the product of the densityand indicator functions). Then

h(αx+ βy) ≥ fα(x)gβ(y) for all α, β ≥ 0 such that α+ β = 1and all x, y ∈ Rn.

Now we apply the Prekopa-Leindler Inequality (Theorem 19.1) and notice that∫Rn

f(x) dx = µ(A),∫

Rn

g(x) dx = µ(B), and∫

Rn

h(x) dx = µ(αA+ βB).

In the classical case of the Lebesgue measure µ, the Brunn-Minkowski inequalityis often stated in the equivalent additive form:

(20.3) Corollary. Let µ be the Lebesgue meausre on Rn and let A,B ⊂ Rn bemeasurable sets such that αA+ βB is measurable for all α, β ≥ 0. Then

µ1/n(A+B) ≥ µ1/n(A) + µ1/n(B).

Proof. Without loss of generality, we assume that µ(A), µ(B) > 0. Let

α =µ1/n(A)

µ1/n(A) + µ1/n(B)and β =

µ1/n(B)µ1/n(A) + µ1/n(B)

,

so α, β ≥ 0 and α+ β = 1. Let A1 = α−1A and B1 = β−1B. Then

µ(A1) = α−nµ(A) =(µ1/n(A) + µ1/n(B)

)nand

µ(B1) = β−nµ(B) =(µ1/n(A) + µ1/n(B)

)n.

80

Page 81: Measure Concentration

Here we used that the Lebesgue measure is homogenous of degree n, that is, µ(tA) =tnµ(A). Applying Theorem 20.2, we get

µ(A+B) = µ(αA1 + βB1) ≥ µα(A1)µβ(B1) =(µ1/n(A) + µ1/n(B)

)n.

The proof now follows.

PROBLEM. Let µ be a measure on Rn such that µ(tA) = tnµ(A) for all measur-able A ⊂ Rn and all t ≥ 0. Let α, β ≥ 0 be numbers such that α+ β = 1. Supposefurther that µ1/n(αA + βB) ≥ µ1/n(αA) + µ1/n(βB) for some measurable A andB such that αA+ βB is measurable. Prove that µ(αA+ βB) ≥ µα(A)µβ(B).

Lecture 25. Friday, March 11

21. The isoperimetric inequality for the Lebesgue measure in Rn

The Brunn-Minkowski inequality provides a simple solution for the isoperimetricproblem for the Lebesgue measure in Euclidean space Rn. For a closed set A ⊂ Rnand a number ρ ≥ 0, let us define the ρ-neighborhood Aρ of A by

A(ρ) =x : dist(x, y) ≤ ρ for some y ∈ A

.

Using Minkowski addition, we can write

A(ρ) = A+Bρ,

whereBρ =

x : ‖x‖ ≤ ρ

is the ball of radius ρ centered at the origin.

In this section, µ denotes the standard Lebesgue measure in Rn. The isoperi-metric inequality for µ states that among all sets of a given measure, the Euclideanball has the smallest measure of any ρ-neighborhood.

(21.1) Theorem. For a compact set A ⊂ Rn let Br ⊂ Rn be a ball such that

µ (Br) = µ(A).

Thenµ(A(ρ)

)≥ µ (Br+ρ) for any ρ ≥ 0.

Proof. Applying the Brunn-Minkowski inequality (Corollary 2.3), we get

µ1/n(A(ρ)

)= µ1/n

(A+Bρ

)≥ µ1/n(A) + µ1/n

(Bρ)

= µ1/n(Br)

+ µ1/n(Bρ).

81

Page 82: Measure Concentration

However, the volume of an n-dimensional ball is proportional to the nth power ofits radius. Therefore the volume1/n is a linear function of the radius. Hence

µ1/n(Br)

+ µ1/n(Bρ)

= µ1/n(Bρ+r

),

and the proof follows.

If A is “reasonable”, so that the limit

limρ−→0+

µ(A(ρ))− µ(A)

ρ

exists and can be interpreted as the surface area |∂A| of A, Theorem 21.1 impliesthat among all sets of a given volume, the ball has the smallest surface area.

22. The concentration on the sphereand other strictly convex surfaces

We apply the Brunn-Minkowski inequality to re-establish the measure concen-tration on the unit sphere (see Section 14), though with weaker constants. On theother hand, we’ll obtain concentration for more general surfaces than the sphere,namely strictly convex surfaces. Originally, the concentration on strictly convex sur-faces was obtained by M. Gromov and V. Milman via a different approach. Herewe present a simple proof due to J. Arias-de-Reyna, K. Ball, and R. Villa.

(22.1) Definition. Let K ⊂ Rn be a convex body (that is, a convex compact setwith a non-empty interior). Suppose that K contains the origin in its interior andlet S = ∂K be the surface of K. We say that K is strictly convex if for any ε > 0there exists a δ = δ(ε) > 0 such that whenever x, y ∈ S and dist(x, y) ≥ ε, we have(x + y)/2 ∈ (1 − δ)K. The function ε 7−→ δ(ε) is called a modulus of convexity ofK.

PROBLEM. Let B =x ∈ Rn : ‖x‖ ≤ 1

be a unit ball. Prove that B is

strictly convex and that one can choose

δ(ε) = 1−√

1− ε2

4≥ ε2

8for 0 ≤ ε ≤ 2.

Let S = ∂K be the surface of K. We introduce a probability measure ν on S asfollows: for a subset A ⊂ S, let

A =αx : x ∈ A and 0 ≤ α ≤ 1

82

Page 83: Measure Concentration

be the “pyramid” over A centered at the origin. We let

ν(A) =µ(A)µ(K)

,

where µ is the Lebesgue measure. Check that if K is the unit ball and S = ∂Kis the unit sphere then ν is the rotation invariant probability measure on S. Ingeneral, however, ν is not the (normalized) surface area of S induced from Rn. Ina small neighborhood of a point a ∈ S, the measure ν is roughly proportional tothe surface area times the distance from the tangent plane to the origin.

We claim that for strictly convex surfaces, the measure of an ε-neighborhood ofa set of measure 1/2 is almost 1. Note that we measure distances in the ambientspace Rn and not intrinsically via a geodesic on the surface.

(22.2) Theorem. Let K be a strictly convex body with a modulus of convexityδ = δ(ε). Let S = ∂K be the surface of K and let A ⊂ S be a set such thatν(A) ≥ 1/2. Then, for any ε > 0 such that δ(ε) ≤ 1/2, we have

νx ∈ S : dist(x,A) ≥ ε

≤ 2(1− δ(ε)

)2n ≤ 2e−2nδ(ε).

Proof. LetB =

x ∈ S : dist(x,A) ≥ ε

.

Then for all pairs x ∈ A and y ∈ B we have (x + y)/2 ∈ (1 − δ)K. Let A and Bbe the pyramids over A and B respectively, as defined above. We claim that for allx ∈ A and y ∈ B we have (x + y)/2 ∈ (1 − δ)K. Indeed, x = αx and y = βy forsome x ∈ A, y ∈ B and 0 ≤ α, β ≤ 1. Without loss of generality, we can assumethat α ≥ β and α > 0, so β/α = γ ≤ 1. Then

x+ y

2=αx+ βy

2= α

(x+ γy

2

)= α

(γx+ y

2+ (1− γ)

x

2

)=αγ

(x+ y

2

)+ α(1− γ)

x

2.

Therefore,

x+ y

2∈ αγ(1− δ)K + α(1− γ)(1− δ)K = α(1− δ)K ⊂ (1− δ)K,

cf. Problem 5 of Section 20.1.Hence

12A+

12B ⊂ (1− δ)K, and, therefore, µ

(12A+

12B

)≤ (1− δ)nµ(K).

83

Page 84: Measure Concentration

Now we apply the multiplicative form of the Brunn-Minkowski inequality (Theorem20.2) to claim that

µ

(12A+

12B

)≥ µ1/2(A)µ1/2(B).

Hence

ν(B) =µ(B)µ(K)

≤ (1− δ)2nµ(K)µ(A)

=(1− δ)2n

ν(A)≤ 2(1− δ)2n ≤ 2e−2nδ(ε).

In particular, if K = B is the unit ball, we get the inequality

νx ∈ S : dist(x,A) ≥ ε

≤ 2e−ε

2n/4,

which should be compared with Theorem 14.3. Note that in Section 14 we mea-sure distances intrinsically, via a geodesic in Sn−1. Here we measure distancesextrinsically, via a chord in the space Rn. We have

chord distance ≤ geodesic distance ≤ π chord distance, and

chord distance ≈ geodesic distance for close points

so the results are very much alike in spirit, although the constants we get hereare worse than those in Section 14. However, the result applies to more generalsurfaces.

One can modify the strict convexity definition as follows: let us fix a norm p(x)on Rn, so p(x) ≥ 0 for all x ∈ Rn and p(x) = 0 implies x = 0, p(αx) = |α|p(x) forall x ∈ Rn, and p(x+ y) ≤ p(x) + p(y) for all x, y ∈ Rn.

LetK =

x ∈ Rn : p(x) ≤ 1

be the unit ball of this norm. We say that K (and p) are strictly convex if for anyε > 0 there is a δ = δ(ε) > 0 such that if p(x) = p(y) = 1 and p(x − y) ≥ ε thenp((x+ y)/2

)≤ (1− δ).

Let S = ∂K be the surface of K as before, and let ν be the measure on S definedas before. Let us measure the distance between x, y ∈ K as p(x−y). Then Theorem22.2 holds in this modified situation.

Lecture 26. Monday, March 14

84

Page 85: Measure Concentration

23. Concentration for the Gaussian measureand other strictly log-concave measures

In Section 22, we applied the Brunn-Minkowski inequality for the Lebesgue mea-sure in Rn to recover concentration on the unit sphere and strictly convex surfaces.Now we apply the Prekopa-Leindler inequality of Section 19 to recover concentra-tion for the Gaussian measure with some extensions. The inequalities we provehave non-optimal constants, but they are easy to deduce and they extend to a wideclass of measures.

The proof is by the Laplace transform method, for which we need a (by nowstandard) estimate.

(23.1) Theorem. Let µ be a probability measure in Rn with the density e−u, whereu : Rn −→ R is a function satisfying

u(x) + u(y)− 2u(x+ y

2

)≥ c‖x− y‖2

for some absolute constant c > 0. Let A ⊂ Rn be a closed set. Then∫Rn

expcdist2(x,A)

dµ ≤ 1

µ(A).

Proof. Let us define three functions f, g, h : Rn −→ R as follows:

f(x) = expcdist2(x,A)− u(x)

, g(x) = [A]e−u(x), and h(x) = e−u(x).

We are going to apply the Prekopa-Leindler inequality (Theorem 19.1) with α =β = 1/2. We have to check that

h

(x+ y

2

)≥ f1/2(x)g1/2(y).

Clearly, it suffices to check the inequality in the case when y ∈ A since otherwiseg(y) = 0. If y ∈ A then dist(x,A) ≤ ‖x− y‖ for all x ∈ Rn. Therefore, for y ∈ A,

f(x)g(y) = expcdist2(x,A)− u(x)− u(x)

≤ exp

c‖x− y‖2 − u(x)− u(y)

≤ exp

−2u

(x+ y

2

)= h2

(x+ y

2

),

which is what we need. Therefore, by Theorem 19.1,∫Rn

h(x) dx ≥(∫

Rn

f(x) dx)1/2(∫

Rn

g(x) dx)1/2

.

However, ∫Rn

h(x) dx = µ (Rn) = 1,∫

Rn

g(x) dx = µ(A), and∫Rn

f(x) dx =∫

Rn

expcdist2(x,A)

dµ,

and the proof follows.

An immediate corollary is a concentration inequality.85

Page 86: Measure Concentration

(23.2) Corollary. Let µ be a probability measure as in Theorem 23.1 and letA ⊂ Rn be a set such that µ(A) ≥ 1/2. Then, for any t ≥ 0, we have

µx ∈ Rn : dist(x,A) ≥ t

≤ 2e−ct

2.

Proof. Using the Laplace transform estimate (Section 2.1), we write

µx ∈ Rn : cdist2(x,A) ≥ ct2

≤e−ct

2∫

Rn

expcdist2(x,A)

≤2e−ct2.

The concentration for the Gaussian measure γn follows instantly. We choose

u(x) = ‖x‖2/2 + (n/2) ln(2π).

In this case,

u(x) + u(y)− 2u(x+ y

2

)=

2‖x‖2 + 2‖y‖2 − ‖x+ y‖2

4=‖x‖2 + ‖y‖2 − 2〈x, y〉

4

=‖x− y‖2

4,

so we can choose c = 1/4 in Theorem 23.1. This gives us the inequality

γn

x ∈ Rn : dist(x,A) ≥ t

≤ 2e−t

2/4 provided γn(A) ≥ 1/2,

which is a bit weaker than the inequalities of Section 14, but of the same spirit.

24. Concentration for general log-concave measures

Let us consider a log-concave probability measure µ on Rn, see Section 20. Infact, the only thing we need from µ is the Brunn-Minkowski inequality

µ(αA+ βB) ≥ µα(A)µβ(B) for α, β ≥ 0 such that α+ β = 1.

This is not quite the type of concentration we were dealing with so far. We takea sufficiently large convex body A ⊂ Rn, where “sufficiently large” means thatµ(A) > 1/2, “inflate” it A 7−→ tA for some t > 1 and see how much is left, that is,what is µ (Rn \ tA). We claim that what is left decreases exponentially with t.

86

Page 87: Measure Concentration

(24.1) Theorem. Let µ be a log-concave probability measure on Rn and let A ⊂ Rnbe a convex symmetric body (that is, −A = A). Then, for all t > 1,

µ (Rn \ tA) ≤ µ(A)(

1− µ(A)µ(A)

)(t+1)/2

.

Proof. Let

B = Rn \ (tA) and let α =t− 1t+ 1

and β =2

t+ 1.

Hence α, β ≥ 0 and α+ β = 1. Applying the Brunn-Minkowski inequality, we get

µ(αA+ βB) ≥ µα(A)µβ(B).

On the other hand, we claim that

αA+ βB ⊂ Rn \A.

Indeed, if there is an a ∈ A, a b ∈ B, and a c ∈ A such that αa+ βb = c, then

b = β−1(c− αa) =t+ 1

2c− t− 1

2a =

t+ 12

c+t− 1

2(−a) ∈ tA,

which is a contradiction. This gives us

1− µ(A) ≥ µα(A)µβ(B),

that is,

µ(B) ≤(

1− µ(A)µα(A)

)1/β

=(

1− µ(A)µ(t−1)/(t+1)(A)

)(t+1)/2

= µ(A)(

1− µ(A)µ(A)

)(t+1)/2

.

Suppose, for example, that µ(A) = 2/3. Then Theorem 24.1 states that

µ (Rn \ tA) ≤ 23

2−(t+1)/2,

so the measure of complement of tA indeed decreases exponentially. This is notquite the sort of decreasing we are used to, we’d prefer something of the typee−ct

2. However, this is the best we can get. The estimate of Theorem 24.1 is

“dimension-free”. Let us choose n = 1 and let µ be the symmetric exponentialmeasure with the density 0.5e−|x|. This is certainly a log-concave measure. Let uschoose, A = [−2, 2], say, so that µ(A) = 1−e−2 ≈ 0.865 > 2/3. Then tA = [−2t, 2t]and µ(A) = 1− e−2t and µ (R \A) = e−2t, so the measure outside of tA decreasesexponentially with t, as promised. This is due to the fact that µ is not strictlylog-concave. A typical application of Theorem 24.1 concerns norms.

87

Page 88: Measure Concentration

(24.2) Corollary. Let p : Rn −→ R be a norm and let µ be a log-concave measureon Rn. Let us choose a number r > 1/2 and let ρ be a number such that

µx ∈ Rn : p(x) ≤ ρ

= r.

Then, for all t > 1,

µx ∈ Rn : p(x) > tρ

≤ r

(1− r

r

)(t+1)/2

.

Proof. We apply Theorem 24.1 to

A =x : p(x) ≤ ρ

.

Lecture 27. Wednesday, March 16

25. An application: reverse Holder inequalities for norms

Let µ be a probability measure on Rn and let f : Rn −→ R be a non-negativeintegrable function. Holder inequalities state that(∫

Rn

fp dµ

)1/p

≤(∫

Rn

fq(x) dµ)1/q

provided q ≥ p > 0.

Furthermore, for p = 0, the above inequality reads

exp∫

Rn

ln f dµ≤(∫

Rn

fq(x) dµ)1/q

provided q > 0.

Indeed,fp = exp

p ln f

= 1 + p ln f +O

(p2 ln2 f

).

Assuming that ln2 f is integrable, we get(∫Rn

fp(x) dµ)1/p

=(

1 + p

∫Rn

(ln f) dµ+O(p2))1/p

= exp

1p

ln(

1 + p

∫Rn

ln f dµ+O(p2))

= exp∫

Rn

ln dµ

(1 +O(p)

)for p ≈ 0.

Suppose now that µ is log-concave and that f is a norm. In this case, theinequalities can be reversed up to some constants.

88

Page 89: Measure Concentration

(25.1) Theorem. Let µ be a log-concave probability measure on Rn and let f :Rn −→ R be a norm. Then(∫

Rn

fp dµ

)1/p

≤ cp

∫Rn

f dµ

for all p ≥ 1 and some absolute constant c > 0.

Proof. Without loss of generality, we assume that∫Rn

f dµ = 1.

Let us choose a ρ > 2. Since f is non-negative,

µx ∈ Rn : f(x) ≥ ρ

≤ 1ρ.

Thenµx ∈ Rn : f(x) ≤ ρ

≥ ρ− 1

ρ

and as in Corollary 24.2, we get

µx ∈ Rn : f(x) ≥ ρt

≤ ρ− 1

ρ

(1

ρ− 1

)(t+1)/2

≤(

1ρ− 1

)t/2for t > 1.

To make the computations simpler, let us choose ρ = 1 + e < 4. Thus

µx ∈ Rn : f(x) ≥ 4t

≤ e−t/2 for t > 1.

For t ≥ 0, letF (t) = µ

x ∈ Rn : f(x) ≤ t

be the cumulative distribution function of f . In particular,

1− F (t) ≤ e−t/8 for t > 4.

Then ∫Rn

fp dµ =∫ +∞

0

tp dF (t) = −∫ +∞

0

tp d(1− F (t)

)=tp

(1− F (t)

)∣∣t=+∞t=0

+∫ +∞

0

ptp−1(1− F (t)

)dt.

89

Page 90: Measure Concentration

Since 1−F (t) is exponentially decreasing as t grows, the first term is 0. Therefore,∫Rn

fp dµ =∫ +∞

0

ptp−1(1− F (t)

)dt ≤

∫ 4

0

ptp−1 dt+∫ +∞

4

ptp−1e−t/8 dt

≤4p + p

∫ +∞

0

tp−1e−t/8 dt = 4p + p8pΓ(p).

Since Γ(p) ≤ pp, the proof follows.

The proof of Theorem 25.1 is due to C. Borell.

PROBLEM. Deduce from Theorem 25.1 that for any p > q > 0 and someconstant c(p, q) > 0, we have(∫

Rn

fp dµ

)1/p

≤ c(p, q)(∫

Rn

fq dµ

)1/q

for any norm f .

The following extension of Theorem 25.1 was obtained by R. Lata la.

(25.2) Theorem. There exists an absolute constant c > 0 such that

ln(∫

Rn

f dµ

)≤ c+

∫Rn

(ln f

)dµ

for any log-concave probability measure µ on Rn and any norm f : Rn −→ R.

Note that Jensen’s inequality implies that∫Rn

(ln f

)dµ ≤ ln

(∫Rn

f dµ

).

Theorem 25.2 can be recast as(∫Rn

fp dµ

)1/p

≥ c

∫Rn

f dµ

for some other absolute constant c and all p > 0. As we remarked before, asp −→ 0+, the left hand side approaches

exp∫

Rn

(ln f

)dµ

).

If the main idea of the proof of Theorem 25.1 is to bound the measure outsideof a large ball, the main idea of the proof of Theorem 25.2 is to bound the measureinside a small ball in the norm f . More precisely, we want to prove that

µx ∈ Rn : f(x) ≤ t

≤ ct

for some absolute constant c and all sufficiently small t.The core of the argument is the following lemma.

90

Page 91: Measure Concentration

(25.3) Lemma. For a norm f : Rn −→ R, let

B =x ∈ Rn : f(x) ≤ 1

be the unit ball. Let us choose a δ > 0 and a ρ > 1 such that

µ(ρB) ≥ (1 + δ)µ(B).

Then for some c = c(ρ/δ) > 0, we have

µ(tB) ≤ ctµ(B) for all 0 < t < 1.

Proof. It suffices to prove Lemma 25.3 for any t of the form t = (2m)−1, where mis a positive integer.

Let us pick a particular m and let

µ

(1

2mB

)= κ(m)

µ(B)m

.

We must prove that κ = κ(m) ≥ 0 is bounded from above by a universal constantdepending only on the ratio ρ/δ.

The idea is to slice the ball B onto m concentric “rings” with the “core” beingthe ball of radius 1/2m. Using the conditions of the Lemma, we prove that themeasure of the outside ring is large enough. If the measure of the core is large,then by the Brunn-Minkowski inequality the measure of each ring should be largeenough. This would contradict to the fact that all those measures sum up to themeasure of the ball.

Without loss of generality, we can assume that

(25.3.1) κ(m) ≥ 2δρ

and hence µ

(1

2mB

)≥ 2δµ(B)

ρm>δµ(B)ρm

.

We haveµ(ρB \B) ≥ δµ(B).

For τ ≥ 0 let

Aτ =x ∈ Rn : τ − 1

2m< f(x) < τ +

12m

.

The interval [1, ρ] can be covered by a disjoint union of at most ρm non-overlappingintervals of length 1/m centered at points τ ∈ [1, ρ]. Therefore, there there is aτ ′ ≥ 1 such that

µ (Aτ ′) ≥δµ(B)ρm

.

91

Page 92: Measure Concentration

Now,

λAτ +(1− λ)

2mB ⊂ Aλτ for 0 < λ < 1,

from which by the Brunn-Minkowski inequality

(25.3.2) µ(Aλτ ) ≥ µλ(Aτ )µ1−λ(

12m

B

).

Choosing τ = τ ′ and λ = 1/τ ′ in (25.3.2) and using (25.3.1), we get

(25.3.3) µ (A1) ≥ δµ(B)ρm

.

Let us look at the sets A(m−1)/m, A(m−2)/m, . . . , A1/m (these are our “rings”) andA0 = (2m)−1B (the core). These sets are disjoint and lie in B. Combining (25.3.1)–(25.3.3), we get

µ(Ai/m

)≥ µ(B)

m

ρ

)(m−i)/m

κi/m.

Summing up over i = 1, . . . ,m, we get

µ(B) ≥ κµ(B)m

1− δ/ρκ

1− (δ/ρκ)1/m.

Now, κ = κ(m) ≥ 2δ/ρ, from which

µ(B) ≥ κ

2mµ(B)

11− (δ/ρκ)1/m

.

This gives us the inequality

κ ≤ 2m

(1−

ρκ

)1/m).

It remains to notice that for a < 1 and x > 0 we have

x(

1− a1/x)

= x(1− exp

x−1 ln a

)≤ x

(1− 1− x−1 ln a

)= ln a−1,

from which we get

κ(m) ≤ 2 ln(ρκ(m)δ

).

It follows now that κ(m) must be bounded by some constant depending on ρ/δonly.

Now we can prove Theorem 25.2.92

Page 93: Measure Concentration

Proof of Theorem 25.2. Without loss of generality, we assume that∫Rn

f dµ = 1,

so our goal is to prove that ∫Rn

(ln f) dµ ≥ c

for some absolute constant c.For t > 0, let

Bt =x ∈ Rn : f(x) ≤ t

denote the ball of radius t in the norm f .

Let us choose a ρ such that

µ (Bρ) = 2/3.

Recall that by our Definition 20.1, µ has a density, so such a ρ exists. This densityassumption is not crucial though.

We observe that by Markov inequality

µx ∈ Rn : f(x) ≥ 4

≤ 1

4,

so ρ ≤ 4.Next, by Corollary 24.2 with r = 2/3 and t = 3,

µ (B3ρ) ≥ 1− 23· 1

4=

56.

Rescaling the norm f 7−→ ρf and applying Lemma 25.3, we conclude that for someabsolute constant ω > 0 and some absolute constant c > 0, we have

µ (Bt) ≤ ct for t ≤ ω.

Without loss of generality, we assume that ω ≤ 1. This is enough to complete theproof. Denoting F (t) = µ (Bt), we have∫

Rn

(ln f) dµ =∫ +∞

0

(ln t) dF (t) ≥∫ 1

0

(ln t) dF (t) = (ln t)F (t)∣∣∣1t=0

−∫ 1

0

t−1F (t) dt.

Now, the first term is 0 since F (t) ≤ ct in a neighborhood of t = 0. Hence we haveto estimate the second term.∫ 1

0

t−1F (t) dt =∫ ω

0

t−1F (t) dt+∫ 1

ω

t−1F (t) dt

≤∫ ω

0

t−1(ct) dt+∫ 1

0

ω−1 dt = cω + ω−1,

93

Page 94: Measure Concentration

and the proof follows.

PROBLEM. Let γn be the standard Gaussian measure in Rn and let f : Rn −→ Rbe a positive definite quadratic form such that∫

Rn

f dγn = 1.

Prove that ∫Rn

(ln f) dγn ≥ c,

where c = −λ− ln 2 ≈ −1.27, where λ ≈ 0.577 is the Euler constant, that is,

λ = limn−→+∞

lnn−n∑k=1

1k.

The bound is the best possible.

Lecture 28. Friday, March 18

26. Log-concave measures as projections of the Lebesgue measure

Generally, we call a Borel probability measure µ on Rn (with density or not)log-concave if it satisfies the Brunn-Minkowski inequality

µ(αA+ βB) ≥ µα(A)µβ(B) where α, β ≥ 0 and α+ β = 1

and A,B ⊂ Rn are reasonable (say, closed) subsets, so that A,B, and A + B aremeasurable.

As we established in Section 20, measure with densities e−u(x), where u : Rn −→R are convex functions, are log-concave. This can be rephrased as “locally log-concave measures are globally log-concave”, since the condition of having a log-concave density is more or less equivalent to the Brunn-Minkowski inequality forsmall neighborhoods A and B of points.

The goal of this section is to present a number of simple geometric constructionsshowing that the class of all log-concave measures is a very natural object.

We start with a simple observation: the projection of a log-concave measure islog-concave.

94

Page 95: Measure Concentration

(26.1) Theorem. Let µ be a log-concave measure on Rn and let T : Rn −→ Rmbe a linear transformation. Then the push-forward ν = T (µ) defined by

ν(A) = µ(T−1(A)

)is a log-concave measure on Rm.

Proof. Let us choose A,B ⊂ Rm and let A1 = T−1(A) and B = T−1(B), soA1, B1 ⊂ Rn. Let us choose α, β ≥ 0 with α+ β = 1. We claim that

αA1 + βB1 ⊂ T−1(αA+ βB).

Indeed, if a ∈ A1 and b ∈ B1 then Ta ∈ A and Tb ∈ B so T (αa+βb) = αTa+βTb ∈αA+ βB. This implies that αa+ βb ∈ T−1(αA+ βB).

Therefore,

ν(αA+ βB) =µ(T−1(αA+ βB)

)≥ µ(αA1 + βB1) ≥ µα(A1)µβ(B1)

=να(A)νβ(B),

as claimed.

An important corollary of Theorem 26.1 is that the convolution of log-concavedensities is a log-concave density.

PROBLEMS.1. Let f, g : Rn −→ R be log-concave probability densities. Let h : Rn −→ R be

their convolutionh(x) =

∫Rn

f(y)g(x− y) dy.

Prove that h is a log-concave density.2. Check that T−1 (αA+ βB) = αT−1(A) + βT−1B.

One can think of h as of the density of the push-forward of the probabilitymeasure on R2n with the density f(x)g(y) under a linear transformation. Anothercorollary of Theorem 26.1 is that the sum of independent random variables withlog-concave distributions has a log-concave distribution.

The standard example of a log-concave measure is the Lebesgue measure re-stricted to a convex set. Namely, we fix a convex body K ⊂ Rn and let

µ(A) = vol(A ∩K) for A ⊂ Rn.

Theorem 26.1 tells us that by projecting such measures we get log-concave measures.A natural question is: what kind of a measure can we get by projecting the Lebesguemeasure restricted to a convex set. The answer is: pretty much any log-concavemeasure, if we allow taking limits.

95

Page 96: Measure Concentration

(26.2) Theorem. Let u : Rm −→ R be a convex function, let K ⊂ Rm be aconvex body and let us consider the measure ν with the density e−u[K] (that is,with the density e−u restricted to K). Then there exists a sequence of convex bodiesKn ⊂ Rdn and projections Tn : Rdn −→ Rm such that

T (µn) −→ ν as n −→ +∞,

where µn is the Lebesgue measure restricted to Kn and the density of T (µn) uni-formly converges to the density of ν.

Proof. Let dn = m+ n. We define the projection

Tn : Rm+n −→ Rm, (ξ1, . . . , ξm+n) 7−→ (ξ1, . . . , ξm) .

We represent vectors x ∈ Rm+n in the form x = (y, ξm+1, . . . , ξm+n), where y ∈ Rmand define Kn ⊂ Rm+n by

Kn =

(y, ξm+1, . . . , ξm+n) : y ∈ K and

0 ≤ ξi ≤ 1− u(y)n

for i = m+ 1, . . . ,m+ n.

Since u(y) is bounded on K, for all sufficiently large n we have u(y) < n for ally ∈ K and Kn are not empty for all sufficiently large n.

Next, we claim that Kn are convex, since u is convex. Finally, let us consider thepush-forward Tn(µn). Clearly, T (Kn) = K and the preimage of every point y ∈ Kis an n-dimensional cube with the side 1 − u(y)/n and the n-dimensional volume(1− u(y)/n)n. Since (

1− u(y)n

)n−→ e−u(y)

uniformly on K, the proof follows.

In view of Theorems 26.1 and 26.2, the Brunn-Minkowski inequality for anylog-concave density follows from the Brunn-Minkowski inequality for the Lebesguemeasure (though in a higher dimension). Wouldn’t it be nice to complete the circleby providing a fairly elementary proof of the Brunn-Minkowski inequality for theLebesgue measure?

(26.3) An elementary proof of the Brunn-Minkowski inequality for theLebesgue measure. As before, our main tool is the concavity of lnx:

ln(αx+ βy) ≥ α lnx+ β ln y for all α, β ≥ 0 such that α+ β = 1

and all x, y > 0. In the multiplicative form:

αx+ βy ≥ xαyβ .96

Page 97: Measure Concentration

Our first observation is that both sides of the inequality

vol(αA+ βB) ≥ volα(A) volβ(B)

do not change if we apply translations A 7−→ A+a, B 7−→ B+ b. Indeed, αA+βBgets translated by αa+ βb and volumes do not change under translations.

Next, we establish the inequality when both A and B are axis-parallel paral-lelepipeds, called “bricks”:

A =

(ξ1, . . . , ξn) : s′i ≤ ξi ≤ si for i = 1, . . . , n

and

B =

(ξ1, . . . , ξn) : t′i ≤ ξi ≤ ti for i = 1, . . . , n.

Translating, if necessary, we assume that s′i = t′i = 0. Then A+B is also a brick

αA+ βB =

(ξ1, . . . , ξn) : 0 ≤ ξi ≤ αsi + βti for i = 1, . . . , n.

In this case, the Brunn-Minkowski inequality is equivalent to the concavity of thelogarithm:

vol(αA+ βB) =n∏i=1

(αsi + βti) ≥n∏i=1

sαi tβi = volα(A) volβ(B).

Next, we prove the inequality for sets that are finite unions of non-overlappingbricks. We use induction on the total number of bricks in A and B.

Let k be the total number of bricks, so k ≥ 2. If k = 2 then each set consistsof a single brick, the case we have already handled. Suppose that k > 2. Thenone of the sets, say A, contains at least two bricks. These two bricks differ in atleast one coordinate, say ξ1. The hyperplane (“wall”) ξ1 = 0 cuts Rn into twoclosed halfspaces and cuts each brick which it cuts into two bricks. Translating A,if necessary, we can assure that at least one brick lies exactly inside each of thehalfspaces. Let A1 and A2 be the intersections of A with the halfspaces. Then bothA1 and A2 contain fewer bricks than A. Now, we start translating B. We translateB in such a way that the corresponding sets B1 and B2 (the intersections of B withthe halfspaces) satisfy

volB1

volB=

volA1

volA= γ,

say. We observe that both B1 and B2 are made of at most as many bricks as B.Hence the total number of bricks in the pair (A1, B1) is at most k− 1 and the totalnumber of bricks in the pair (A2, B2) is at most k − 1. We apply the inductionhypothesis to each pair and get

vol(αA1 + βB1) ≥ volα(A1) ln volβ(B1) and

vol(αA2 + βB2) ≥ volα(A2) volβ(B2).97

Page 98: Measure Concentration

However, αA1 + βB1 ⊂ αA + βB and αA2 + βB2 ⊂ αA + βB lie on the oppositesides of the wall ξ1 = 0 and so do not overlap.

Therefore,

vol(αA+ βB) ≥ vol(αA1 + βB1) + vol(αA2 + βB2)

≥ volα(A1) volβ(B1) + volα(A2) volβ(B2)

=γα volα(A)γβ volβ(B) + (1− γ)α volα(A)(1− γ)β volβ(B)

=γ volα(A) volβ(B) + (1− γ) volα volβ(B) = volα(A) volβ(B).

This completes the proof for sets that are finite unions of bricks.The final step consists of approximating reasonable sets A,B ⊂ Rn (say, Jordan

measurable) by finite unions of bricks.

Lecture 29. Monday, March 21

27. The needle decomposition techniquefor proving dimension-free inequalities

In this section we discuss a remarkable technique, called “the needle decomposi-tion” technique or the “localization technique” which allows one to prove inequali-ties in Rn by reducing them to an inequality in R1. This technique was developedand used for a variety of purposes by S. Bobkov, M. Gromov, V. Milman, L. Lovasz,M. Simonovits, R. Kannan, F. Nazarov, M. Sodin, A. Volberg, M. Fradelizi, and O.Guedon among others. While it is not easy to describe the method in all generality,we state one particular application as a theorem below in the hope that it capturesthe spirit of the method.

(27.1) Theorem. Let K ⊂ Rn be a convex body, let f : K −→ R be a continuousfunction, let µ be a log-concave measure on K, and let Φ,Ψ : R −→ R be continuousfunctions.

Suppose that for every interval I ⊂ K and for every log-concave measure ν on Ithe inequality

(27.1.1) Φ(

1ν(I)

∫I

f dν

)≤ 1ν(I)

∫I

Ψ(f) dν

holds.Then the inequality

(27.1.2) Φ(

1µ(K)

∫K

f dµ

)≤ 1µ(K)

∫K

Ψ(f) dµ

holds.98

Page 99: Measure Concentration

For example, suppose that Φ(x) = lnx and that Ψ(x) = c + lnx, where c is aconstant. Hence to establish the inequality

ln(

1µ(K)

∫K

f dµ

)≤ c+

1µ(K)

∫K

(ln f

)dµ,

for every log-concave measure µ on K, it suffices to establish the one-dimensionalversion of the inequality for every interval I ⊂ K.

Similarly, suppose that Φ(x) = xp for x > 0 and Ψ(x) = cpxp for x > 0, where cis a constant. Hence to establish the inequality

1µ(K)

∫K

f dµ ≤ c

(1

µ(K)

∫K

fpdµ

)1/p

,

for every log-concave measure µ on K, it suffices to establish the one-dimensionalversion of the inequality for every interval I ⊂ K and any log-concave measure νon I.

One more example: let us choose Φ(x) = −cex, where c > 0 is a constant andΨ(x) = −ex. Thus to prove the inequality

1µ(K)

∫K

ef dµ ≤ c exp 1µ(K)

∫K

f dµ,

it suffices to prove the one-dimensional version of the inequality.The idea of the proof is to iterate a certain construction, which we call “halving”,

which reduces the inequality over K to the inequalities over smaller convex bodies.These bodies become thinner and thinner, needle-like, and, in the limit look likeone-dimensional intervals.

(27.2) Halving a convex body.Let K ⊂ Rn be a convex body, let µ be any (not necessarily log-concave) mea-

sure on K with a positive continuous density function and let f : K −→ R be acontinuous function on K.

Our goal is to cut the convex body K by an affine hyperplane H into two non-overlapping convex bodies K1 and K2 in such a way that

(27.2.1)1

µ(K1)

∫K1

f dµ =1

µ(K2)

∫K2

f dµ.

Note that then we necessarily have

(27.2.2)1

µ(K)

∫K

f dµ =1

µ(Ki)

∫Ki

f dµ for i = 1, 2.

99

Page 100: Measure Concentration

Indeed,

1µ(K)

∫K

f dµ =1

µ(K)

(∫K1

f dµ+∫K2

f dµ

)=µ(K1)µ(K)

(1

µ(K1)

∫K1

f dµ

)+µ(K2)µ(K)

(1

µ(K2)

∫K2

fdµ

)=µ(K1) + µ(K2)

µ(K)

(1

µ(Ki)

∫K1

f dµ

)=

1µ(Ki)

∫Ki

f dµ.

Suppose for a moment that we managed to do that. Suppose further that theinequalities

Φ(

1µ(Ki)

∫Ki

f dµi

)≤ 1µ(Ki)

∫Ki

Ψ(f) dµi for i = 1, 2

hold. Then, in view of (27.2.2),

Φ(

1µ(K)

∫K

f dµ

)=Φ

(1

µ(Ki)

∫Ki

f dµi

)≤ 1µ(Ki)

∫Ki

Ψ(f) dµi for i = 1, 2.

Taking the convex combination of the above inequalities for i = 1, 2 with thecoefficients µ(Ki)/µ(K), we get the inequality

Φ(

1µ(K)

∫K

f dµ

)≤ 1µ(K)

∫K

Ψ(f) dµ.

Thus we reduced the inequality for a larger body K to the inequalities for smallerbodies K1 and K2.

There are plenty of ways to cut K into K1 and K2 so that (27.2.1) is satisfied.We impose one extra condition, after which there is essentially one way to do it.

Let us fix an (n − 2)-dimensional affine subspace L ⊂ Rn passing through aninterior point of K. Let H ⊂ Rn be a hyperplane passing through L. We have onedegree of freedom in choosing H. Indeed, let us choose a 2-dimensional plane Aorthogonal to L such that L∩A = a, say. Let us consider the orthogonal projectionpr : Rn −→ A along L. Then every hyperplane H ⊃ L is projected onto a linepassing through a, and to choose a line through a in A is to choose a hyperplaneH containing L. Hence we got a one-parametric continuous family of hyperplanesH(t), where 0 ≤ t ≤ π and H(0) = H(π). In other words, H(t) is obtained fromH(0) by rotating through an angle of t about L.

Each hyperplane H(t) defines two (closed) halfspaces H+(t) and H−(t) and,consequently, two convex bodies

K+(t) = K ∩H+(t) and K−(t) = K ∩H−(t).100

Page 101: Measure Concentration

It really doesn’t matter which one is which, but if we make L oriented, we canchoose K+(t) and K−(t) consistently, that is, continuously in t and in such a waythat

(27.2.3) K+(0) = K−(π) and K−(0) = K+(π).

Here is the moment of truth: as t changes from 0 to π, the two integrals

1µ (K+(t))

∫K+(t)

f dµ and1

µ (K−(t))

∫K−(t)

f dµ

interchange because of (27.2.3). Therefore, by continuity, there is t′ ∈ [0, π] suchthat

1µ (K+(t′))

∫K+(t′)

f dµ =1

µ (K−(t′))

∫K−(t′)

f dµ.

Hence we letK1 = K+(t′) and K2 = K−(t′)

and (27.2.1) is satisfied.

This construction allows us to reduce proving the inequality of Theorem 27.1for a convex body K to proving the inequality for smaller convex bodies. We aregoing to iterate the construction making bodies smaller and smaller. In the limit,we would like to make them 1-dimensional.

(27.3) Definition. A convex body K ⊂ Rn is called an ε-needle if there exists aline l ⊂ Rn such that all points of K are within distance ε from l.

(27.4) Constructing a needle decomposition. Let us choose an ε > 0. Let uschoose many, but finitely many, affine (n − 2)-dimensional subspaces L1, . . . , LNpiercing K in various directions. What we want is the following: for every affineplane A, the set of intersectionsA∩L1, . . . , A∩LN is an (ε/4)-net for the intersectionA∩K. More precisely, we want to choose L1, . . . , LN so dense that for every affinetwo-dimensional plane A the intersection A ∩ K contains no disc of radius ε/4without a point of the form A ∩ Li inside. A standard compactness argumentimplies the existence of such L1, . . . , LN . For example, choose a two-dimensionalplane A such that A ∩ K 6= ∅. Choose a sufficiently dense net in K ∩ A andconstruct affine subspaces Li orthogonal to A and passing through the points ofthe net. The set Li ∩ A will provide a sufficiently dense net for A ∩ K if a two-dimensional plane A is sufficiently close to A. Since the set of two-dimensionalaffine subspaces A intersecting K is compact, by repeating this construction withfinitely many subspaces A, we construct L1, . . . , LN .

Let us take the subspaces L1, . . . , LN one after another and apply the halvingprocedure. More precisely, we apply the halving procedure with L1, thus obtainingK1 and K2. Taking the subspace Li, we find all previously constructed convexbodies Kj , check for which of them Li passes through an interior point and halve

101

Page 102: Measure Concentration

those using the procedure of (27.2). In the end, we represent K as a finite union ofnon-overlapping convex bodies Ki, i ∈ I, such that

1µ(Ki)

∫Ki

f dµ =1

µ(K)

∫K

f dµ for all i ∈ I.

We claim that every Ki is an ε-needle.Indeed, let us pick a particular C = Ki. One thing is certain: none of the

subspaces L1, . . . , LN passes through an interior point of C. Because if it did, wewould have halved C along that subspace some time ago. Let a, b ∈ C be two pointswith the maximum value of dist(a, b) for a, b ∈ C. Let us draw a line through lthrough a and b. Let c ∈ C be any point and consider the triangle (abc). Theinterior of (abc) does not intersect any of the Li (otherwise, we would have halvedC). Therefore, the triangle (abc) does not contain a disc of radius greater than ε/4.Since a and b are the maximum distance apart, the angles at a an b are acute anda picture shows that the distance from c to l cannot be larger than ε.

Proof of Theorem 27.1. Since f,Φ, and Ψ are uniformly continuous on K, thereexist functions ωf , ωΦ, and ωΨ such that

dist(x, y) ≤ ε =⇒ |f(x)− f(y)| ≤ ωf (ε),

|x− y| ≤ ε =⇒ |Φ(x)− Φ(y)| ≤ ωΦ(ε), and

|x− y| ≤ ε =⇒ |Ψ(x)−Ψ(y)| ≤ ωΨ(ε)

for all ε > 0 andωf (ε), ωΦ(ε), ωΨ(ε) −→ 0 as ε −→ 0.

Let us choose an ε > 0 and let us construct a decomposition K =⋃i∈I Ki, where

Ki are non-overlapping ε-needles and

1µ(Ki)

∫Ki

f dµ =1

µ(K)

∫K

f dµ for i ∈ I.

For each needle Ki let us pick an interval I = Ii ⊂ Ki with the endpoints maximumdistance apart in Ki. Let pr : K −→ I be the orthogonal projection onto I, cf.Section 27.4. Let ν = νi be the push-forward measure on I. Then, by Theorem26.1, ν is log-concave.

Thus we have µ(Ki) = νi(Ii). Moreover, since dist (x, pr(x)) ≤ δ,∣∣∣ 1µ(Ki)

∫Ki

f dµ− 1νi(Ii)

∫Ii

f dνi

∣∣∣=∣∣∣ 1µ(Ki)

∫Ki

f(x) dµ− 1µ(Ki)

∫Ki

f(pr(x)

)dµ∣∣∣

≤ 1µ(Ki)

∫Ki

∣∣f(x)− f(pr(x)

)∣∣ dµ ≤ ω(ε)

102

Page 103: Measure Concentration

Therefore, ∣∣∣Φ( 1µ(Ki)

∫K

f dµ

)− Φ

(1

νi(Ii)

∫Ii

f dνi

) ∣∣∣ ≤ ωΦ (ωf (ε)) .

Similarly, ∣∣∣ 1µ(Ki)

∫Ki

Ψ(f) dµ− 1νi(Ii)

∫Ii

Ψ(f) dνi∣∣∣ ≤ ωΨ(ε).

Since

Φ(

1νi(Ii)

∫Ii

f dνi

)≤ 1νi(Ii)

∫Ii

Ψ(f) dνi,

we conclude that

Φ(

1µ(K)

∫K

f dµ

)=Φ

(1

µ(Ki)

∫Ki

f dµ

)≤ 1µ(Ki)

∫Ki

Ψ(f) dµ+ ωΦ (ωf (ε)) + ωΨ(ε) for all i ∈ I.

Taking the convex combination of the above inequalities with the coefficientsµ(Ki)/µ(K), we get

Φ(

1µ(K)

∫K

f dµ

)≤ 1µ(K)

∫K

Ψ(f) dµ+ ωΦ (ωf (ε)) + ωΨ(ε).

Taking the limit as ε −→ 0, we complete the proof.

Theorem 27.1 admits generalizations. Here is an interesting one. Let k < n bea positive integer and let Φ,Ψ : Rk −→ R be continuous functions. Let f1, . . . , fk :K −→ R be functions. Suppose we want to establish the inequality

Φ(

1µ(K)

∫K

f1 dµ, . . . ,1

µ(K)

∫K

fk dµ

)≤ 1µ(K)

∫K

Ψ (f1, . . . , fk) dµ

for any log-concave measure µ. Then it suffices to establish the inequality for all k-dimensional convex compact subsets of K with any log-concave measure ν on them.The “halving procedure” 27.2 still works, but it has to be modified as follows.

Lecture 31. Friday, March 25

Lecture 30 on Wednesday, March 23, covered the material from the previoushandout.

103

Page 104: Measure Concentration

27. The needle decomposition technique forproving dimension-free inequalities, continued

(27.5) Definition. Let us call a convex body K ⊂ Rn an (k, ε)-pancake if thereexists a k-dimensional affine subspace A ⊂ Rn such that all points of K are withindistance ε from A.

Thus we get the following version (extension) of Theorem 27.1.

(27.6) Theorem. Let K ⊂ Rn be a convex body, let µ be a measure on K with apositive continuous density, and let f1, . . . , fk : K −→ R be continuous functions.Then for any ε > 0 there is decomposition

K =⋃i∈I

Ki

of K into a finite union of non-overlapping (that is, with pairwise disjoint interiors)convex bodies Ki ⊂ K, i ∈ I, such that

(1) Each Ki is a (k, ε)-pancake;(2) The average value of every function fj on every piece Ki is equal to the

average value of fj on the whole convex body K:

1µ(Ki)

∫Ki

fj dµ =1

µ(K)

∫K

fj dµ

for all i ∈ I and j = 1, . . . , k.

The proof is very similar to that of Theorem 27.1. We have the following modi-fication of the “halving” procedure of Section 27.4.

(27.7) Lemma. Let K ⊂ Rn be a convex body, let µ be a measure on K with apositive continuous density, and let f1, . . . , fk : K −→ R be continuous functions.Let L ⊂ Rn be an affine subspace passing through an interior point of K and suchthat dimL = n − k − 1. Then there exists an affine hyperplane H ⊃ L which cutsK into two bodies K1 and K2 such that

1µ(K1)

∫K1

fj dµ =1

µ(K2)

∫K2

fj dµ for j = 1, . . . , k.

Proof. We parameterize oriented hyperplanes H ⊃ L by the points of the sphereSk. To do that, we choose the origin 0 ∈ L and project Rn along L onto L⊥

which we identify with Rk+1. Then each unit vector u ∈ Sk defines an orientedhyperplane H ⊃ L with the normal vector u. Note that u and −u define the samenon-oriented hyperplane but different oriented hyperplanes. Let K+(u) be the partof the convex body lying in the same halfspace as u and let K−(u) be the part of

104

Page 105: Measure Concentration

the convex body lying in the opposite halfspace. Note that K+(u) = K−(−u) andK−(u) = K+(−u). Let us consider the map

φ : Sk −→ Rk

defined by

φ(u) =(

1µ (K+(u))

∫K+(u)

f1 dµ−1

µ (K−(u))

∫K−(u)

f1 dµ, . . . ,

1µ (K+(u))

∫K+(u)

fk dµ−1

µ (K−(u))

∫K−(u)

fk dµ

).

Then φ is continuous and φ(−u) = −φ(u). Hence by the Borsuk-Ulam Theorem,there is a point u ∈ Sk such that φ(u) = 0. This point defines the desired affinehyperplane.

Note that we necessarily have

1µ(Ki)

∫Ki

fj dµ =1

µ(K)

∫K

fj dµ for j = 1, . . . , k and i = 1, 2.

Proof of Theorem 27.6. We choose many but finitely many (n−k− 1)-dimensionalaffine subspaces L1, . . . , LN , each intersecting the interior of K and such that forevery (k + 1)-dimensional affine subspace A, if the intersection A ∩ K containsa ball of radius δ = δ(ε), the ball contains a point of the type A ∩ Li for somei. The existence of such a set of subspaces follows by the standard compactnessargument. We construct the bodies Ki as follows: starting with K, for each of thesubspaces L1, . . . , LN , one after another, we find all previously constructed convexbodies for which the subspace passes through an interior point, and halve all suchbodies using Lemma 27.7. Eventually, we get the set of bodies Ki and Part (2) isclear. Moreover, for each Ki and any (k + 1)-dimensional affine subspace A, theintersection Ki ∩A does not contain a ball of radius δ, since otherwise there wouldhave been a subspace Lj intersecting Ki in its interior point and we would havehalved Ki along the way. We claim that we can choose δ (it may depend on thebody K as well) so that every Ki is a (k, ε)-pancake. One possible way to prove itgoes via the Blaschke selection principle. Suppose that there is an ε > 0 such thatfor any positive integer m there is a convex body Cm ⊂ K which does not containa (k + 1)-dimensional ball of radius 1/m but not a (k, ε)-pancake. Then there is aconvex body C ⊂ K which is the limit point of Cm in the Hausdorff metric. ThenC contains no (k+ 1)-dimensional ball at all and hence must lie in a k-dimensionalaffine subspace A ⊂ Rn. But then all Cm with sufficiently large m must be ε-closeto A, which is a contradiction.

Lecture 32. Monday, March 28

105

Page 106: Measure Concentration

28. Some applications: proving reverse Holder inequalities

(28.1) Reducing inequalities for general functions to inequalities for par-ticular functions. Suppose we want to prove reverse Holder inequalities for someclass of functions, say, polynomials of a given degree, which are valid for all convexbodies. Namely, we want to prove that for any positive integer d there exists a con-stant c(d) such that for all convex bodies K ⊂ Rn, for all polynomials f : Rn −→ Rsuch that deg f ≤ d and f(x) > 0 for all x ∈ K, one has

ln(

1volK

∫K

f dx

)≤ c(d) +

1volK

∫K

ln f dx.

Validity of such an inequality follows from results of J. Bourgain. It follows fromTheorem 27.1 that its suffices to check the above inequality only for polynomialsthat are essentially univariate, that is, have the structure

f(x) = g(T (x)

),

where g is a univariate polynomial and T : Rn −→ R is a projection. Indeed,choosing Φ(x) = lnx, Ψ(x) = c(d) + lnx, by Theorem 27.1, we conclude that itsuffices to check the inequality

ln(

1ν(I)

∫I

f dν

)≤ c(d) +

1ν(I)

∫I

ln f dν

For any interval I ⊂ K and any log-concave measure ν on I. By changing co-ordinates, we can reduce the inequality to the following one: Given a univariatepolynomial g with deg g ≤ d, which is non-negative on the interval [0, 1] and alog-concave measure ν on [0, 1], prove that

ln

(1

ν([0, 1]

) ∫ 1

0

g dν

)≤ c(d) +

1ν([0, 1]

) ∫ 1

0

ln g dν.

The structure of polynomials g non-negative on [0, 1] is well-known. Every such apolynomial is a convex combination of polynomials with all roots real and in theinterval [0, 1].

On the other hand, every log-concave measure is the limit of a projection of theLebesgue measure restricted to a convex body. Therefore, it suffices to prove theinequality

ln( (1/vol K) ∫_K g(T(x)) dx ) ≤ c(d) + (1/vol K) ∫_K ln g(T(x)) dx,

where K ⊂ Rn is a convex body, T : K −→ R is a projection such that T(K) ⊂ [0, 1], and g is a univariate polynomial non-negative on the interval [0, 1]. Hence Theorem 27.1 allows us to restrict the class of polynomials considerably.
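As a quick numerical aside (not part of the notes), here is a small Python sketch of the reduced one-dimensional inequality for the Lebesgue measure on [0, 1]: for a polynomial g with all roots in [0, 1] it evaluates ln((1/vol) ∫ g dx) − (1/vol) ∫ ln g dx, the gap that the constant c(d) must absorb. The particular choices of roots and the grid size are assumptions made only for the illustration.

import numpy as np

def reverse_holder_gap(roots, npts=200_000):
    # Gap ln(mean of g) - mean of ln g on [0, 1] with respect to the Lebesgue measure,
    # for g(x) = prod_i (x - r_i)^2, which is non-negative with all roots in [0, 1].
    x = (np.arange(npts) + 0.5) / npts            # midpoint grid on [0, 1]
    g = np.ones_like(x)
    for r in roots:
        g *= (x - r) ** 2
    return np.log(g.mean()) - np.log(g).mean()

for roots in ([0.5], [0.0, 1.0], [0.1, 0.5, 0.9], [0.0, 0.25, 0.5, 0.75, 1.0]):
    print(f"deg g = {2 * len(roots)},  gap = {reverse_holder_gap(roots):.3f}")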



Moreover, suppose we do care about the ambient dimension n. Analyzing the proof of Theorem 27.1, we observe that ν is the limit of the projections of the Lebesgue measure restricted to convex bodies K ⊂ Rn. Then ν is not only log-concave, but a bit more is true. One can deduce from the Brunn-Minkowski Theorem for the Lebesgue measure (Corollary 20.3) that if φ is the density of ν, then φ^{1/(n−1)} is concave. Let

K = { (ξ, η1, . . . , ηn−1) : 0 ≤ ξ ≤ 1 and 0 ≤ ηi ≤ φ^{1/(n−1)}(ξ) for i = 1, . . . , n−1 }.

Then K ⊂ Rn is a convex body and ν is the push-forward of the Lebesgue measure restricted to K under the projection Rn −→ R, (ξ, η1, . . . , ηn−1) ↦ ξ. This shows that to establish the inequality

ln( (1/vol K) ∫_K f dx ) ≤ c(d, n) + (1/vol K) ∫_K ln f dx

for any convex body K ⊂ Rn and any polynomial f with deg f ≤ d which is non-negative on K, it suffices to check the inequality for polynomials f = g(T(x)), where T : Rn −→ R is a projection with T(K) = [0, 1] and g is a univariate polynomial with deg g ≤ d, non-negative on [0, 1].
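The construction of the body K above is easy to probe numerically. The following sketch (an illustration only) takes n = 3 and the density φ(ξ) = (ξ(1−ξ))², so that φ^{1/(n−1)} = ξ(1−ξ) is concave, builds K, samples K uniformly by rejection from a bounding box, and compares the empirical distribution of the coordinate ξ with the normalized density φ; the choice of φ and the sample size are assumptions of the illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 3                                             # ambient dimension of the body K
phi_root = lambda t: t * (1.0 - t)                # phi^{1/(n-1)}, concave on [0, 1]
phi = lambda t: phi_root(t) ** (n - 1)            # the log-concave density phi

# Rejection sampling of the uniform measure on K from the box [0,1] x [0, 1/4]^{n-1}.
N = 400_000
pts = rng.uniform(size=(N, n)) * np.array([1.0, 0.25, 0.25])
inside = np.all(pts[:, 1:] <= phi_root(pts[:, 0])[:, None], axis=1)
xi = np.sort(pts[inside, 0])                      # the projected coordinate

# Compare the empirical CDF of xi with the CDF of the normalized density phi.
grid = np.linspace(0.0, 1.0, 201)
cdf = np.cumsum(phi(grid))
cdf /= cdf[-1]
emp = np.searchsorted(xi, grid) / len(xi)
print("max |empirical CDF - CDF of phi| =", float(np.max(np.abs(emp - cdf))))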

(28.2) Reducing an arbitrary log-concave measure to a log-affine measure. In view of what has been said, it would be helpful to understand the structure of log-concave measures on the interval. An important example of a log-concave measure on an interval is provided by a measure with the density e^{ax+b} for some numbers a and b. We call such measures log-affine.

The following construction is a particular case of a more general result by M. Fradelizi and O. Guedon. Let us fix the interval [0, 1], a continuous function h : [0, 1] −→ R, and a number a, and consider the set Xa(h) of all log-concave probability measures ν on the interval [0, 1] that satisfy the condition

∫_0^1 h dν = a.

Suppose that ν ∈ Xa(h) is a measure with a continuous density. Then, unless ν is a log-affine measure, it can be expressed as a convex combination

ν = α1 ν1 + α2 ν2, where α1 + α2 = 1 and α1, α2 ≥ 0,

of two distinct measures ν1, ν2 ∈ Xa(h). We sketch a proof below.

Without loss of generality, we may assume that a = 0 (otherwise, we replace h with h − a). Next, we may assume that

∫_0^x h(t) dν(t), 0 < x < 1,



does not change sign on the interval. Indeed, if it does, then there is a point 0 < c < 1 such that

∫_0^c h dν = ∫_c^1 h dν = 0.

We define ν1 as the normalized restriction of ν onto [0, c] and ν2 as the normalized restriction of ν onto [c, 1].

Hence we assume that

∫_0^x h(t) dν(t) ≥ 0 for all 0 ≤ x ≤ 1.

Let e^{φ(x)} be the density of ν, so φ is a concave function on [0, 1]. We are going to modify φ as follows. Let us pick a c ∈ (0, 1) and let ℓ be an affine function on the interval [0, 1] such that ℓ(c) < φ(c) and the slope of ℓ is some number α (to be adjusted later). Let

ψ(x) = min{φ(x), ℓ(x)} for 0 ≤ x ≤ 1

and let us consider the measure ν1 with the density e^ψ. Note that ψ is a concave function. Now, if α = +∞ (or close), so that ℓ(x) steeply increases, then e^{ψ(x)} = e^{φ(x)} for x > c and e^{ψ(x)} = 0 for x < c. Thus we have

∫_0^1 h dν1 = ∫_c^1 h dν ≤ 0.

Similarly, if α = −∞ (or close), then ℓ(x) steeply decreases and e^{ψ(x)} = e^{φ(x)} for x < c and e^{ψ(x)} = 0 for x > c. Thus we have

∫_0^1 h dν1 = ∫_0^c h dν ≥ 0.

By continuity, this proves that there exists a slope α such that

∫_0^1 h dν1 = 0.

Let ν2 = ν − ν1. Then ν2 is supported on the interval where φ(x) > ℓ(x) and its density there is e^{φ(x)} − e^{ℓ(x)}. We claim that ν2 is log-concave. Factoring the density of ν2 into the product e^{ℓ(x)} (e^{φ(x)−ℓ(x)} − 1), we reduce the proof to the

following statement: if ρ(x) = φ(x) − ℓ(x) is a non-negative concave function, then ln(e^{ρ(x)} − 1) is concave. Since a concave non-negative function is a pointwise minimum of a family of linear (affine) non-negative functions, it suffices to prove that

ln(e^{αx+β} − 1)


is concave whenever defined. This is obtained by checking that the second derivative, equal to

−α² e^{αx+β} / (e^{αx+β} − 1)²,

is non-positive.

Hence we obtain a decomposition ν = ν1 + ν2, where ν1 and ν2 are log-concave,

and

∫_0^1 h dν1 = ∫_0^1 h dν2 = 0.

Normalizing ν1 and ν2, we complete the proof.

Suppose now that we want to check the inequality

ln( ∫_0^1 g dν ) ≤ c(d) + ∫_0^1 ln g dν

for all log-concave probability measures ν on the interval [0, 1]. Let us fix the value of

(28.2.1)   ∫_0^1 g dν = a,

say. Then the infimum of

∫_0^1 ln g dν

is attained at an extreme point of the set of log-concave measures ν satisfying (28.2.1), which must be a log-affine measure or a δ-measure.

Hence in proving reverse Holder inequalities on an interval, we can always restrict ourselves to log-affine measures on the interval.
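This reduction is easy to explore numerically. The sketch below (an illustration, not part of the argument) fixes a polynomial g non-negative on [0, 1], sweeps over log-affine probability measures with density proportional to e^{ax}, and records the largest value of ln(∫ g dν) − ∫ ln g dν over the sweep; by the discussion above, this worst case over log-affine measures (together with δ-measures, for which the two sides coincide) should control the constant needed for all log-concave measures. The polynomial, the range of slopes a, and the discretization are assumptions of the illustration.

import numpy as np

x = np.linspace(0.0, 1.0, 100_001)
g = (x - 0.3) ** 2 * (x - 0.8) ** 2 + 1e-12       # a polynomial non-negative on [0, 1]

def gap(a):
    # ln(int g dnu) - int ln g dnu for the log-affine probability measure dnu ~ e^{a x} dx,
    # discretized as weights on the grid x.
    w = np.exp(a * x)
    w /= w.sum()
    return np.log((g * w).sum()) - (np.log(g) * w).sum()

slopes = np.linspace(-40.0, 40.0, 161)
gaps = np.array([gap(a) for a in slopes])
print(f"worst gap over the log-affine family: {gaps.max():.3f} at slope a = {slopes[gaps.argmax()]:.1f}")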

Similarly, M. Fradelizi and O. Guedon show that if Xa is the set of probability densities φ such that

∫_0^1 h φ(x) dx = a

and φ^{1/n} is concave, then the extreme points of Xa are the densities ℓ^n(x), where ℓ : [0, 1] −→ R is an affine function non-negative on [0, 1].

Lecture 33. Wednesday, March 30



29. Graphs, eigenvalues, and isoperimetry

We consider some global invariants that govern the isoperimetric properties of a metric space. We discuss the discrete situation first.

Let G = (V,E) be a finite undirected graph without loops or multiple edges, with the set V of vertices and the set E of edges. With this, V becomes a metric space: the distance dist(u, v) between two vertices u, v ∈ V is the length (the number of edges) of a shortest path in G connecting u and v. Thus the diameter of this metric space is finite if and only if G is connected. We consider connected graphs only. Given a vertex v ∈ V, the number of edges incident to v is called the degree of v and denoted deg v. With a graph G, we associate a square matrix A(G), called the Laplacian of G.

(29.1) Definition. Let G = (V,E) be a graph. Let us consider the |V| × |V| matrix A = A(G), A = (au,v), where

au,v = deg u   if u = v,
au,v = −1      if {u, v} ∈ E,
au,v = 0       otherwise.

We think of A(G) as an operator on the space R^V of functions f : V −→ R. Namely, g = Af if

g(v) = (deg v) f(v) − ∑_{u : {u,v}∈E} f(u).

We consider the scalar product on R^V defined by

〈f, g〉 = 〈f, g〉_V = ∑_{v∈V} f(v) g(v).

Here is the first result.

(29.2) Lemma. Let G = (V,E) be a graph. Let us orient every edge e ∈ E, so that one vertex of e becomes the beginning (denoted e+) and the other becomes the end (denoted e−). Let L, L = (le,v), be the |E| × |V| incidence matrix of the resulting oriented graph (it depends on the chosen orientation):

le,v = 1     if v = e+,
le,v = −1    if v = e−,
le,v = 0     otherwise.

Then

A(G) = L^T L.



Proof. The (u, v)th entry of L^T L is

∑_{e∈E} le,u le,v.

If u = v then the eth term is 1 if and only if e is incident to v and 0 otherwise, so the entry is deg v. If u ≠ v and {u, v} is an edge, then the only non-zero term corresponds to the edge e = {u, v} and is equal to −1. If u ≠ v and {u, v} ∉ E, then all the terms are 0.

We can think of L as the linear transformation R^V −→ R^E which, for each function f : V −→ R, computes the function g = Lf, g : E −→ R, defined by g(e) = f(e+) − f(e−). We introduce the scalar product on R^E by

〈f, g〉 = 〈f, g〉_E = ∑_{e∈E} f(e) g(e).

Then L^T is the matrix of the conjugate linear transformation R^E −→ R^V. For a function g : E −→ R, the function f = L^T g, f : V −→ R, is defined by

f(v) = ∑_{e∈E: e+=v} g(e) − ∑_{e∈E: e−=v} g(e).

We have

〈Lf, g〉_E = 〈f, L^T g〉_V for all f ∈ R^V and all g ∈ R^E.
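As a quick numerical illustration (not part of the notes), the Python sketch below builds the Laplacian A(G) of a small graph directly from Definition 29.1, builds the incidence matrix L of an arbitrary orientation, and checks that A = L^T L, that A is positive semidefinite, and that the constant vector lies in the kernel, in line with Lemma 29.2 and the corollary that follows. The example graph is an arbitrary choice.

import numpy as np

# A small connected graph on 5 vertices (an arbitrary example).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
n = 5

# Laplacian from Definition 29.1: degrees on the diagonal, -1 for edges.
A = np.zeros((n, n))
for u, v in edges:
    A[u, u] += 1
    A[v, v] += 1
    A[u, v] = A[v, u] = -1

# Incidence matrix of an arbitrary orientation: +1 at the beginning, -1 at the end.
L = np.zeros((len(edges), n))
for i, (u, v) in enumerate(edges):
    L[i, u], L[i, v] = 1, -1

print("A = L^T L:", np.allclose(A, L.T @ L))
print("eigenvalues (all >= 0):", np.round(np.linalg.eigvalsh(A), 6))
print("constant vector in the kernel:", np.allclose(A @ np.ones(n), 0))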

(29.3) Corollary. The matrix A(G) is positive semidefinite. If G is connected, then the eigenspace of A(G) corresponding to the eigenvalue λ = 0 consists of the functions f : V −→ R that are constant on all vertices:

f(v) = c for some c ∈ R and all v ∈ V.

Proof. Since A = L^T L, we have

〈Af, f〉 = 〈L^T Lf, f〉 = 〈Lf, Lf〉 ≥ 0.

If f is constant on the vertices of G, then Lf = 0 and 〈Af, f〉 = 0. On the other hand, suppose that 〈Af, f〉 = 0. Let us pick two different vertices u, v ∈ V. Since G is connected, there is an orientation of G in which u and v are connected by a directed path (say, u is the beginning of the path and v is the end); let L be the corresponding incidence operator. Since 0 = 〈Af, f〉 = 〈Lf, Lf〉, we must have Lf = 0, which means that f(e+) = f(e−) for every edge e of G. Since u and v are connected by a directed path, we must have f(u) = f(v). Since u and v were arbitrary, f must be constant on V.



PROBLEM. Check that in general, the 0th eigenspace of A consists of the functions that are constant on connected components of G.

In what follows, the crucial role is played by the smallest positive eigenvalue λ = λ(G) of A(G). We will rely on the following simple consideration: if f ∈ R^V is a function orthogonal to the 0th eigenspace, then, of course,

〈Af, f〉 ≥ λ〈f, f〉 = λ‖f‖².

This is sometimes called the Poincare inequality, sometimes the Rayleigh principle. If G is connected, being orthogonal to the 0th eigenspace means that

∑_{v∈V} f(v) = 0.
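Here is a short numerical check of the Poincare inequality (an illustration, reusing the example graph from the previous sketch): it computes λ(G) as the smallest positive eigenvalue of the Laplacian and verifies 〈Af, f〉 ≥ λ‖f‖² for random mean-zero vectors f.

import numpy as np

# Rebuild the Laplacian of the example graph from the previous sketch.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
n = 5
A = np.zeros((n, n))
for u, v in edges:
    A[u, u] += 1; A[v, v] += 1; A[u, v] = A[v, u] = -1

lam = np.sort(np.linalg.eigvalsh(A))[1]           # smallest positive eigenvalue (G is connected)
rng = np.random.default_rng(1)
for _ in range(5):
    f = rng.normal(size=n)
    f -= f.mean()                                 # orthogonal to the 0th eigenspace
    print(f"<Af, f> = {float(f @ A @ f):.4f}  >=  lambda*||f||^2 = {float(lam * f @ f):.4f}")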

We are going to relate λ(G) with isoperimetric properties of V .

(29.4) Theorem. Let G = (V,E) be a connected graph and let X ⊂ V be a set of vertices. Let E(X, V \ X) be the set of all edges e ∈ E with one endpoint in X and the other in V \ X. Then

|E(X, V \ X)| ≥ λ(G) |X| |V \ X| / |V|.

Proof. Let us consider the indicator [X] : V −→ R, that is, the function equal to 1 on the points of X and to 0 otherwise. To be able to apply the Poincare inequality, we modify [X] to make it orthogonal to the 0th eigenspace; for this purpose we subtract from [X] its average value on V, which is equal to |X|/|V| = p, say. Thus we consider the function f : V −→ R, f = [X] − p[V],

f(v) = 1 − p if v ∈ X and f(v) = −p if v ∉ X,

where p = |X|/|V|. We have

‖f‖² = 〈f, f〉 = 〈[X] − p[V], [X] − p[V]〉
     = 〈[X], [X]〉 − 2p〈[X], [V]〉 + p²〈[V], [V]〉 = |X| − 2p|X| + p²|V|
     = |X| − 2p|X| + p|X| = (1 − p)|X| = |X| |V \ X| / |V|.

We have

〈Af, f〉 ≥ λ‖f‖² = λ |X| |V \ X| / |V|.



Now we compute 〈Af, f〉 directly. We have

Af = A[X] − pA[V] = A[X].

So

〈Af, f〉 = 〈A[X], [X] − p[V]〉 = 〈A[X], [X]〉 − p〈A[X], [V]〉 = 〈A[X], [X]〉,

where 〈A[X], [V]〉 = 0 since [V] lies in the 0th eigenspace while A[X] = Af is orthogonal to it (f is orthogonal to the 0th eigenspace and so is Af).

Let us denote g = A[X]. Then

g(v) = (deg v)[X](v) − ∑_{u : {u,v}∈E} [X](u).

If v ∈ X then g(v) is equal to the number of edges with one endpoint at v and the other outside of X. Therefore,

〈A[X], [X]〉 = ∑_{v∈X} g(v) = |E(X, V \ X)|,

from which the proof follows.
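A brief numerical spot check of Theorem 29.4 (illustration only): for a random connected graph and random subsets X, the sketch compares |E(X, V\X)| with λ(G)|X||V\X|/|V|. The random graph model and the sizes are arbitrary choices.

import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 12
# A random connected graph: a spanning cycle plus random chords.
edges = {tuple(sorted((i, (i + 1) % n))) for i in range(n)}
edges |= {e for e in itertools.combinations(range(n), 2) if rng.random() < 0.2}

A = np.zeros((n, n))
for u, v in edges:
    A[u, u] += 1; A[v, v] += 1; A[u, v] = A[v, u] = -1
lam = np.sort(np.linalg.eigvalsh(A))[1]

for _ in range(5):
    X = set(rng.choice(n, size=int(rng.integers(1, n)), replace=False).tolist())
    cut = sum(1 for u, v in edges if (u in X) != (v in X))
    bound = lam * len(X) * (n - len(X)) / n
    print(f"|E(X, V\\X)| = {cut},  bound = {bound:.3f},  ok = {cut >= bound - 1e-9}")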

(29.5) Eigenvalues and products. Given two graphs Gi = (Vi, Ei), i = 1, 2, we define their product G = (V,E) as the graph with the set V = V1 × V2 of vertices and with an edge between the vertices (u1, u2) and (w1, w2) if either u1 = w1 ∈ V1 and {u2, w2} ∈ E2, or u2 = w2 ∈ V2 and {u1, w1} ∈ E1. Then

A(G) = A(G1) ⊗ I_{V2} + I_{V1} ⊗ A(G2).

From this, the eigenvalues of A(G) are the pairwise sums of the eigenvalues of A(G1) and A(G2) (the corresponding eigenvectors are the tensor products of the eigenvectors of A(G1) and A(G2)). Hence

λ(G) = min{λ(G1), λ(G2)}.

For example, the 1-skeleton In of the n-dimensional cube is the nth power of the graph consisting of two vertices and the edge connecting them. Hence λ(In) = 2. If we choose X to be the set of vertices on one facet of the cube, then the left hand side in the formula of Theorem 29.4 is equal to 2^{n−1}, and this is exactly the right hand side.
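The product formula can be verified in a few lines (an illustration, not part of the notes): the sketch builds A(In) as an n-fold Kronecker sum of the Laplacian of a single edge, checks that the smallest positive eigenvalue is 2, and evaluates the facet example for Theorem 29.4. The value n = 6 is an arbitrary choice.

from functools import reduce
import numpy as np

def kron_sum(A, B):
    # Laplacian of a product graph: A (x) I + I (x) B.
    return np.kron(A, np.eye(len(B))) + np.kron(np.eye(len(A)), B)

A1 = np.array([[1.0, -1.0], [-1.0, 1.0]])         # Laplacian of a single edge
n = 6
An = reduce(kron_sum, [A1] * n)                   # Laplacian of the cube skeleton I_n

eigs = np.sort(np.linalg.eigvalsh(An))
print("lambda(I_n) =", round(float(eigs[1]), 6))  # smallest positive eigenvalue (the cube is connected)

# Facet X = {vertices with first coordinate 0}: cut size vs. the bound of Theorem 29.4.
N = 2 ** n
print("cut across a facet =", 2 ** (n - 1), "  bound =", 2 * (N // 2) * (N // 2) / N)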

Friday, April 1



Student’s presentation: Bourgain’s Theorem on embedding of finite metric spaces

Monday, April 4

Student’s presentation: Talagrand’s convex hull inequality

Wednesday, April 6

Student’s presentation: Talagrand’s convex hull inequality, continued

Friday, April 8

Student’s presentation: the isoperimetric inequality on the sphere

Monday, April 11

Student’s presentation: the isoperimetric inequality on the sphere, continued

Wednesday, April 13

Visitor’s presentation: concentration for the singular values of a random matrix

Friday, April 15

Visitor’s presentation: concentration for the singular values of a random matrix, continued



Monday, April 18

Student’s presentation: a proof of the isoperimetric inequality in the Boolean cube

The last lecture: prepared but not delivered (end of term)

29. Graphs, eigenvalues, and isoperimetry, continued

Theorem 29.4 implies the following bound.

(29.6) Corollary. Let G = (V,E) be a connected graph and let d be the minimum degree of a vertex v ∈ V. Then

λ(G) ≤ (|V| / (|V| − 1)) d.

Proof. Let X = {v} with deg v = d in Theorem 29.4. Then |E(X, V \ X)| = d, and the bound follows.

Theorem 29.4 can be extended to a more general situation. The following result is due to N. Alon and V. Milman.

(29.7) Theorem. Let G = (V,E) be a connected graph and let A, B ⊂ V be two disjoint subsets. Let E(A) be the set of all edges with both endpoints in A and let E(B) be the set of edges with both endpoints in B. Let ρ be the distance between A and B, that is, the minimum of dist(u, v) with u ∈ A and v ∈ B. Then

|E| − |E(A)| − |E(B)| ≥ λ(G) ρ² |A||B| / (|A| + |B|).

Indeed, taking A = X, B = V \X and ρ = 1, we get Theorem 29.4.

Proof. The proof consists of applying the inequality 〈Af, f〉 ≥ λ‖f‖² to a specially constructed function.

Let

a = |A|/|V| and b = |B|/|V|

and let

g(v) = 1/a − (1/ρ)(1/a + 1/b) min(dist(v, A), ρ).

Furthermore, let

p = (1/|V|) ∑_{v∈V} g(v)



and let us define f : V −→ R by

f(v) = g(v) − p.

Now,

∑_{v∈V} f(v) = 0,

so we may (and will) apply the inequality

〈Af, f〉 ≥ λ〈f, f〉

to this particular function f. First, we note that g(v) = 1/a for v ∈ A and g(v) = −1/b for v ∈ B. Then

〈f, f〉 = ∑_{v∈V} f²(v) ≥ ∑_{v∈A∪B} f²(v) = ∑_{v∈A∪B} (g(v) − p)²
       = ∑_{v∈A} (g(v) − p)² + ∑_{v∈B} (g(v) − p)² = ∑_{v∈A} (1/a − p)² + ∑_{v∈B} (−1/b − p)²
       = ∑_{v∈A} (1/a² − 2p/a + p²) + ∑_{v∈B} (1/b² + 2p/b + p²)
       = |V|²/|A| + |V|²/|B| + p²(|A| + |B|) ≥ |V| (1/a + 1/b).

Next, we observe that if {u, v} ∈ E then

|g(u) − g(v)| ≤ (1/ρ)(1/a + 1/b), and, consequently, |f(u) − f(v)| ≤ (1/ρ)(1/a + 1/b).

Let us orient G in an arbitrary way, and let L be the corresponding operator, so that A = L^T L. Then

〈Af, f〉 = 〈Lf, Lf〉 = 〈Lg, Lg〉 = ∑_{e∈E} (g(e+) − g(e−))²
        = ∑_{e∈E\(E(A)∪E(B))} (g(e+) − g(e−))² ≤ (|E| − |E(A)| − |E(B)|) (1/ρ²)(1/a + 1/b)².

Therefore, 〈Af, f〉 ≥ λ〈f, f〉 implies that

|E| − |E(A)| − |E(B)| ≥ λρ² |V| (1/a + 1/b)^{−1} = λρ² |A||B| / (|A| + |B|).

Theorem 29.7 admits an extension to the continuous case.
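A numerical spot check of the Alon–Milman bound (illustration only): on a random connected graph the sketch picks disjoint sets A and B, computes their distance ρ by breadth-first search, and compares |E| − |E(A)| − |E(B)| with λ(G)ρ²|A||B|/(|A| + |B|). The graph model and the choice of A and B are arbitrary.

import collections
import numpy as np

rng = np.random.default_rng(3)
n = 30
# A sparse connected graph: a spanning path plus a few random chords.
edges = {(i, i + 1) for i in range(n - 1)}
for _ in range(8):
    u, v = sorted(rng.choice(n, size=2, replace=False).tolist())
    edges.add((u, v))

adj = collections.defaultdict(set)
A_mat = np.zeros((n, n))
for u, v in edges:
    adj[u].add(v); adj[v].add(u)
    A_mat[u, u] += 1; A_mat[v, v] += 1; A_mat[u, v] = A_mat[v, u] = -1
lam = np.sort(np.linalg.eigvalsh(A_mat))[1]

def dist_from(src_set):
    # Breadth-first search distances from a set of vertices.
    d = {v: 0 for v in src_set}
    queue = collections.deque(src_set)
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in d:
                d[w] = d[v] + 1
                queue.append(w)
    return d

A, B = {0, 1, 2}, {n - 3, n - 2, n - 1}           # disjoint sets far apart along the path
rho = min(dist_from(A)[v] for v in B)
lhs = sum(1 for u, v in edges if not ({u, v} <= A or {u, v} <= B))
rhs = lam * rho ** 2 * len(A) * len(B) / (len(A) + len(B))
print(f"|E|-|E(A)|-|E(B)| = {lhs},  bound = {rhs:.3f},  ok = {lhs >= rhs - 1e-9}")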


30. Expansion and concentration

Let G = (V,E) be a graph and let X ⊂ V be a set of vertices. We would like to estimate how fast the neighborhood of X grows. Theorem 29.4 provides a way to do it. Suppose that deg v ≤ d for all v ∈ V. For an X ⊂ V and a positive r, let

X(r) = {v ∈ V : dist(v, X) ≤ r}

be the r-neighborhood of X. Let us introduce the counting probability measure µ on V:

µ(A) = |A|/|V| for A ⊂ V.

We note that every edge e ∈ E(X, V \ X) is incident to one vertex in X and to one vertex in X(1) \ X, and every vertex of X(1) \ X is incident to at most d such edges. Therefore,

|X(1) \ X| ≥ (1/d)|E(X, V \ X)| ≥ (λ(G)/d) |X| |V \ X| / |V|.

Therefore,

µ(X(1) \ X) ≥ (λ/d) µ(X)(1 − µ(X))

and

µ(X(1)) ≥ (1 + (λ/d)(1 − µ(X))) µ(X).

In particular,

(30.1)   µ(X(1)) ≥ (1 + λ/(2d)) µ(X) if µ(X) ≤ 1/2,

which demonstrates a sustained growth of the neighborhood as long as the measure of the set stays below 1/2. Iterating, we get, for a positive integer r,

µ(X(r)) ≥ (1 + λ/(2d))^r µ(X) provided µ(X(r − 1)) ≤ 1/2.

One can deduce a concentration result from this estimate. It follows that

µ(X(r)) ≤ 1/2 =⇒ µ(X) ≤ (1/2)(1 + λ/(2d))^{−r}.

Let Y ⊂ V be a set such that µ(Y) ≥ 1/2 and let X = V \ Y(r). Then X(r) ⊂ V \ Y and so µ(X(r)) ≤ 1/2. Thus we obtain

µ(Y) ≥ 1/2 =⇒ µ{v ∈ V : dist(v, Y) > r} ≤ (1/2)(1 + λ/(2d))^{−r},

which is a concentration result of some sort. There are examples of graphs for which the estimate thus obtained is quite reasonable, but often it is too weak. For example, for the 1-skeleton of the n-dimensional cube, with λ/(2d) = 1/n (see Section 29.5), it is nearly vacuous.
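To see how weak the bound (30.1) is on the cube, here is a small sketch (illustration only) comparing, for the 1-skeleton of the n-cube with X a single vertex, the measure of the Hamming ball X(r) with the iterated lower bound (1 + λ/(2d))^r µ(X), using λ = 2 and d = n from Section 29.5; the value n = 20 is an arbitrary choice.

from math import comb

n = 20
lam, d = 2, n                                     # lambda(I_n) = 2, maximum degree d = n
mu = 1 / 2 ** n                                   # X = {0}, a single vertex
bound = mu
prev = mu                                         # mu(X(r-1))
print(" r   mu(X(r))   iterated bound (30.1)")
for r in range(1, n + 1):
    if prev > 0.5:                                # the iteration requires mu(X(r-1)) <= 1/2
        break
    bound *= 1 + lam / (2 * d)
    cur = sum(comb(n, k) for k in range(r + 1)) / 2 ** n    # Hamming ball of radius r
    print(f"{r:2d}   {cur:.6f}   {bound:.3e}")
    prev = cur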

One can get stronger estimates if instead of Theorem 29.4 we use Theorem 29.7.


(30.2) Theorem. Let G = (V,E) be a connected graph and let X ⊂ V be a set. Suppose that deg v ≤ d for all v ∈ V. Then, for r ≥ 1, we have

µ(X(r)) ≥ (1 + λ(G)r²/(2d)) µ(X) provided µ(X(r)) ≤ 1/2.

Proof. Let us apply Theorem 29.7 to A = X and B = V \ X(r), so that the distance between A and B exceeds r. Every edge in the set E \ (E(A) ∪ E(B)) has an endpoint in X(r) \ X (there are no edges between A and B), and each vertex of X(r) \ X is incident to at most d such edges. Therefore,

|X(r) \ X| ≥ (λ(G)r²/d) |X| |V \ X(r)| / (|X| + |V \ X(r)|).

Since µ(X(r)) ≤ 1/2 implies |X| ≤ |X(r)| ≤ |V \ X(r)|, we have

|V \ X(r)| / (|X| + |V \ X(r)|) = (1 + |X|/|V \ X(r)|)^{−1} ≥ 1/2,

and the result follows.

Note that the estimate holds for not necessarily integer r.

Comparing the estimate of Theorem 30.2 with that of (30.1), we observe that the former is stronger if the ratio λ/d is small, since the iteration of (30.1) produces the lower bound

(1 + λ/(2d))^r ≈ 1 + λr/(2d) < 1 + λr²/(2d).

We obtain the following corollary.

(30.3) Corollary. Let G = (V,E) be a connected graph and let A ⊂ V be a set such that µ(A) ≥ 1/2. Suppose that deg v ≤ d for all v ∈ V. Then, for t ≥ 1, we have

µ{v ∈ V : dist(v, A) > t} ≤ (1/2) inf_{r : t > r > 1} (1 + λr²/(2d))^{−⌊t/r⌋}.

Proof. We consider B = V \ A(t). Thus B(t) ⊂ V \ A and hence µ(B(t)) ≤ 1/2. Let s = ⌊t/r⌋ and let us construct a sequence of subsets X0 = B, X1 = X0(r), . . . , Xk = Xk−1(r) for k ≤ s. Thus Xs = B(rs) ⊂ B(t), so µ(Xk) ≤ 1/2 for k = 0, 1, . . . , s.

Applying Theorem 30.2 to X0, X1, . . . , Xs, we conclude that

1/2 ≥ µ(Xs) ≥ (1 + λr²/(2d))^s µ(B)

and hence

µ(B) ≤ (1/2)(1 + λr²/(2d))^{−s}.



Now we optimize on r.

As follows from Corollary 29.6, λ/(2d) ≤ 1, so we can choose

r = √(2d/λ)

in Corollary 30.3. This gives us the estimate

µ{v ∈ V : dist(v, A) > t} ≤ 2^{−⌊t√(λ/(2d))⌋ − 1}.

In the case of the 1-skeleton of the cube In, we have λ = 2 and 2d = 2n, and we get an upper bound of the type exp(−ct/√n), which, although substantially weaker than the bound of Corollary 4.4, say, is not obvious and provides, at least, the right scale for t so that A(t) is almost everything.
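As a final illustration (not from the notes), the sketch below compares, on the cube In with A the Hamming half-cube {x : Σ x_i ≤ n/2}, the exact measure of {v : dist(v, A) > t} (a binomial tail) with the spectral bound 2^{−⌊t/√n⌋−1} obtained above with λ = 2 and d = n. The dimension n = 100 and the values of t are arbitrary choices.

from math import comb, floor, sqrt

n = 100                                           # dimension of the cube (even)
total = 2 ** n

def tail(t):
    # mu{v : dist(v, A) > t} for A = {x : sum x_i <= n/2}: a binomial tail.
    return sum(comb(n, k) for k in range(n // 2 + t + 1, n + 1)) / total

for t in (1, 5, 10, 20, 40):
    bound = 2.0 ** (-(floor(t / sqrt(n)) + 1))    # the estimate with lambda = 2, d = n
    print(f"t = {t:3d}   exact tail = {tail(t):.3e}   spectral bound = {bound:.3e}")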
