
Optimal transport meets graph spectra

Shiping Liu

Department of Mathematical Sciences, Durham University

Abstract

This note stems from a mini-course delivered to students in physics at the Murray Gell-Mann Forum, October 8 - 12, 2013, in Central China Normal University, based on materials in the two books of Villani (particularly Chapter 1 and Sections 2.1-2.2 of [6], and Chapters 3, 5, 6 of [7]) and the papers [2, 3, 1]. We explain, in the setting of discrete spaces, the definition of the optimal transport distance, its dual formula, the exact formula for probability measures supported on a one-dimensional space, its relation to linear programming, and various interactions with graph spectra.

Remark 1. I will not pursue strict mathematical proofs, but instead try to explain what is going on with each formula and the underlying intuitions.

1. Basics about optimal transport problem on discrete spaces

Let’s start from a simple (even naive) transport problem.

[Figure: a factory (production unit) x and a shop (consumption place) y]

Suppose you are the manager who encounters the problem of transporting products from a factory x to a shop y. Here x can be a bakery producing bread, and y can be a café. The amount of products at x, denoted by µ(x), equals the amount needed at y, denoted by ν(y). That is

µ(x) = ν(y). (1.1)

Then it makes no difference in mathematics to normalize both quantities to be 1 unit. That is, we can suppose

µ(x) = ν(y) = 1. (1.2)

Email address: [email protected] (Shiping Liu)
URL: http://www.maths.dur.ac.uk/~cttr73/ (Shiping Liu)


Now, according to your financial interest, you are definitely considering the cost of this transportation. It is natural to say that the cost of transporting one unit of products from x to y, denoted by c(x, y), is proportional to the distance between x and y, denoted by d(x, y). Throughout this note, we will always consider the case

c(x, y) = d(x, y). (1.3)

Therefore in the simple transport problem above, the cost you should pay is d(x, y). However, the transport problem can be more complicated.

[Figure: three factories x1, x2, x3 and three shops y1, y2, y3]

Usually, there are multiple locations for both factories and shops. We again normalize µ, ν such that

∑_{i=1}^{3} µ(xi) = ∑_{j=1}^{3} ν(yj) = 1. (1.4)

Let’s first suppose every factory has the same production ability, and every shop has the same consumption ability. Then what you have to decide is from which factory to which shop the products should be transported. Generally, you have n sites xi, i = 1, 2, . . . , n for factories and another n sites yj, j = 1, 2, . . . , n for shops such that

µ(xi) = ν(yj) = 1/n, ∀ i, j. (1.5)

You are supposed to find an appropriate one-to-one map

φ : {xi, i = 1, 2, . . . , n} → {yj, j = 1, 2, . . . , n} (1.6)

such that the cost you need to pay

c = ∑_{i=1}^{n} d(xi, φ(xi)) µ(xi) (1.7)

is the lowest possible one. Then we denote the optimal transport cost as

W1({xi}, {yj}) = inf_φ ∑_{i=1}^{n} d(xi, φ(xi)) µ(xi). (1.8)


Remark 2. If this inf can be attained, one can replace it by min. For those who are not familiar with these two notions, think about the following example.

inf_{n∈N} (1 + 1/n) = 1,

whereas the minimum doesn’t exist.

This is called Monge’s problem. (Monge considered the more general case in which the factories and shops are continuously distributed.) Monge was a French mathematician, born in 1746. His 1781 paper first studied the problem of minimizing the cost of transporting soil from one place to construction sites. He was one of Napoleon’s close friends.

Let’s think about this problem more carefully. In reality, such a map φ may not exist, due to the fact that factories can have different production abilities and shops can have different amounts of consumption. Let’s depict this phenomenon in the following figure.

[Figure: factories x1, x2, x3 and shops y1, y2, y3 carrying unequal masses (3/5, 1/5, 1/5 and 2/5, 2/5, 1/5)]

In the above case, the transport plan cannot be represented by a map φ : {xi} → {yj}. In this case, a more workable concept is a coupling of measures.

In the present generality, the values µ(xi) are not necessarily equal. Instead, µ is now a function

µ : {xi, i = 1, 2, . . . , n} → R+, such that ∑_{i=1}^{n} µ(xi) = 1. (1.9)

In mathematics, we call µ a probability measure. In fact, here it is a discrete measure with finite support (i.e. it takes values only on finitely many discrete points). Similarly, ν is a (finite discrete) probability measure on {yj, j = 1, 2, . . . , m} (m can be different from n). Sometimes, we also write

µ = ∑_{i=1}^{n} µ(xi) δ_{xi}, where δ_{xi}(xj) = { 1, if j = i; 0, if j ≠ i. (1.10)

Now the cost is that of the transport between the two measures µ and ν, so we adjust our notation to W1(µ, ν).


You now need to decide how many units of products to transport from xi to yj, for every i = 1, 2, . . . , n and j = 1, 2, . . . , m. That is, you need to decide a quantity ξ(xi, yj) for every pair of i and j. In fact this is the coupling of µ and ν. Explicitly, ξ is a function

ξ : {xi}_{i=1}^{n} × {yj}_{j=1}^{m} → R≥0. (1.11)

Naturally, this function should satisfy the following constraints.

1. The amount of products moving out of xi should be µ(xi), i.e.,

∑_{j=1}^{m} ξ(xi, yj) = µ(xi), ∀ i; (1.12)

2. The amount of products moving into yj should be ν(yj), i.e.,

∑_{i=1}^{n} ξ(xi, yj) = ν(yj), ∀ j. (1.13)

(1.12) and (1.13) together define a coupling of two measures.
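In matrix terms, (1.12) and (1.13) say that the row sums of the matrix (ξ(xi, yj)) recover µ and the column sums recover ν. A quick numpy check with made-up masses, using the product coupling ξ = µ ν^T (which always satisfies both constraints):

```python
# Checking the coupling constraints (1.12)-(1.13) for the product coupling.
# The masses mu and nu are illustrative.
import numpy as np

mu = np.array([0.2, 0.3, 0.5])
nu = np.array([0.4, 0.6])
xi = np.outer(mu, nu)          # xi(x_i, y_j) = mu(x_i) * nu(y_j)

assert np.allclose(xi.sum(axis=1), mu)  # (1.12): mass leaving each x_i
assert np.allclose(xi.sum(axis=0), nu)  # (1.13): mass arriving at each y_j
print("product coupling satisfies both marginal constraints")
```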

Example 1. In our previous case, the map can be written as the following coupling.

ξ(xi, yj) = { µ(xi), if yj = φ(xi); 0, otherwise. (1.14)

The difference between this coupling and the previous map is that, in the coupling case, products from the same factory can be split into several parts for different destinations.

Now we can write the optimal cost for the transport as

W1(µ, ν) = inf_{ξ coupling of µ,ν} ∑_{i=1}^{n} ∑_{j=1}^{m} d(xi, yj) ξ(xi, yj). (1.15)

Remark 3. (1.15) is not only more suitable for describing the real case, but also easier to handle mathematically. (1.15) is a linear minimization problem, while (1.8) is a nonlinear problem. In fact, the infimum in (1.15) can always be attained. Therefore one can replace the inf by min.
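Because (1.15) is a linear program, it can be solved with an off-the-shelf LP solver. A sketch using scipy.optimize.linprog on two small measures on the real line (illustrative data, with cost d(x, y) = |x − y|); the constraint matrix encodes exactly (1.12) and (1.13):

```python
# Solving the Kantorovich problem (1.15) as a linear program with scipy.
import numpy as np
from scipy.optimize import linprog

xs, mu = np.array([0.0, 1.0]), np.array([0.5, 0.5])
ys, nu = np.array([0.0, 0.5, 1.5]), np.array([1/3, 1/3, 1/3])
n, m = len(xs), len(ys)

d = np.abs(xs[:, None] - ys[None, :]).ravel()   # cost vector, row by row
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0            # mass leaving x_i equals mu(x_i)
for j in range(m):
    A_eq[n + j, j::m] = 1.0                     # mass entering y_j equals nu(y_j)

res = linprog(d, A_eq=A_eq, b_eq=np.concatenate([mu, nu]), bounds=(0, None))
print(res.fun)   # W1(mu, nu), here ≈ 1/3
```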

(1.15) was studied by Kantorovich, a Russian mathematician, born in 1912. In 1938, a laboratory consulted him for the solution of a certain optimization problem. Starting from that motivation, he developed the tools of linear programming, which later became prominent in economics. In 1975, he was awarded the Nobel Prize for


economics, jointly with Tjalling Koopmans, “for their contributions to the theory of optimum allocation of resources”.

It was only several years after his main results that Kantorovich made the connection with Monge’s work. The problem has since then been called the Monge-Kantorovich problem. The explicit connection between this problem and linear programming will come later.

In fact, when we consider the space (or set) of all probability measures with finite discrete support, W1 serves as a metric. That’s why we also call it the Wasserstein distance (Wasserstein is another mathematician).

Theorem 1. The space PD of probability measures with finite discrete support, coupled with W1, is a metric space.

What is a metric space?

A metric gives the basic geometric information of a space. Let’s take the space R2 as a toy example.

d(p1 = (x1, y1), p2 = (x2, y2)) = √((x2 − x1)² + (y2 − y1)²). (1.16)

[Figure: three points p1 = (x1, y1), p2 = (x2, y2), p3 in the plane]

The important property is the following triangle inequality.

d(p1, p3) + d(p3, p2) ≥ d(p1, p2). (1.17)

In general, we can consider an abstract space X, say PD, and a metric (a notion of distance) d : X × X → R. Just think about when d can serve as a distance function, playing a role similar to that in R2.

1. d(x, y) ≥ 0 (nonnegativity);

2. d(x, y) = 0 ⇔ x = y;

3. d(x, y) = d(y, x);

4. d(x, y) + d(y, z) ≥ d(x, z).


(X, d) satisfying the above properties for any x, y, z ∈ X is called a metric space.

Remark 4. • Playing a little more with those rules, one can find that the first one can be omitted (i.e. it can be derived from the other properties);

• If the second property doesn’t hold, we call d a pseudo-metric.

For the theorem, we only need to argue that W1 satisfies the triangle inequality, i.e. for any µ1, µ2, µ3 ∈ PD, we have

W1(µ1, µ3) ≤ W1(µ1, µ2) +W1(µ2, µ3). (1.18)

Remark 5. If the supports of the µi are all single points, this reduces to the triangle inequality of d.

Recall why it is true for the Euclidean distance function on R2. That is because

d(p1, p2) = min_{all possible paths joining p1, p2} {length of a path connecting p1, p2}, (1.19)

and the segments p1p3 and p3p2 are the shortest paths connecting p1, p3 and p3, p2 respectively, while the union of p1p3 and p3p2 is only one possible path joining p1, p2. Here for the Wasserstein distance, the situation is quite similar. Recall (1.15) and the fact that couplings ξ12 of µ1, µ2 and ξ23 of µ2, µ3 naturally induce ξ13, a coupling of µ1, µ3 (formally shown in the figure below).

[Figure: composing a coupling ξ12 of µ1, µ2 with a coupling ξ23 of µ2, µ3 to obtain a coupling ξ13 of µ1, µ3]

We can choose ξ12, ξ23 to be the optimal couplings in their own transport problems respectively. Due to the fact that the cost function d satisfies the triangle inequality, the transport cost of ξ13 is no larger than

W1(µ1, µ2) + W1(µ2, µ3).

Since ξ13 is only one possible coupling of µ1, µ3, we get (1.18).


2. One-dimensional case and graph spectral distance

As we saw in the last section, W1(µ, ν) for two probability measures µ, ν is defined by a minimization problem. A natural question now is

Is it possible to give an explicit formula for calculating W1(µ, ν)?

The answer in general would be no. The only known case is for probability measures supported on the real line, i.e. the support points of the measure lie on R. Let’s keep in mind the following picture.

[Figure: support points x1, x2, x3, x4 of µ and y1, y2, y3, y4 of ν on the real line]

The explicit formula for W1 in this case is expressed in terms of another equivalent description (or characterization) of a probability measure.

Definition 1. The cumulative distribution function of a one-dimensional probability measure µ is defined as

Fµ(x) := ∫_{−∞}^{x} dµ = µ((−∞, x]). (2.1)

Note by definition, Fµ(−∞) = 0, Fµ(+∞) = 1, and Fµ is a nondecreasing step function (the graph of which typically looks like the following figure).

[Figure: a nondecreasing step function with jumps at x1, x2, x3]

Theorem 2. For two one dimensional probability measures µ, ν, we have

W1(µ, ν) = ∫_R |Fµ(x) − Fν(x)| dx. (2.2)


Remark 6. This is the L1 distance between the cumulative distribution functions. In general, if the cost of transporting one unit of products from x to y equals d^p(x, y), where p is a positive integer, the formula is

Wp(µ, ν) = ( inf_{ξ coupling of µ,ν} ∑_{i=1}^{n} ∑_{j=1}^{m} d^p(xi, yj) ξ(xi, yj) )^{1/p} (2.3)

= ( ∫_0^1 |Fµ^{−1}(t) − Fν^{−1}(t)|^p dt )^{1/p}, (2.4)

for some properly defined inverse function Fµ^{−1}. When p = 1, (2.4) equals (2.2). This can be easily seen by drawing pictures of the cumulative distribution functions as in the following example. One will find that both formulas for p = 1 stand for the area of the shadowed region there.

Example 2. Consider the following two probability measures.

[Figure: µ puts mass 1/2 at each of the points 0 and 1; ν puts mass 1/3 at each of the points 0, 1/2 and 3/2]

Their cumulative distribution functions are shown as follows.

[Figure: the cumulative distribution functions Fµ and Fν on [0, 2]]

Instead of the precise proof, let’s only try to understand the formula (2.2) by thinking of the above example.

[Figure: one possible transport plan moving the mass of µ onto ν]


The above picture provides one possible transport “coupling”, which produces an upper bound on W1. Precisely,

W1(µ, ν) ≤ (1/2 − 1/3) × 1/2 + (2/3 − 1/2) × 1/2 + (1 − 2/3) × 1/2 (2.5)

= ∫_R |Fµ(x) − Fν(x)| dx. (2.6)

Explanations of (2.5): Look at the figure above. At the point 0, µ has 1/2 − 1/3 units of products more than ν, which should be moved to the point 1/2. Then at the point 1/2, ν still needs 1/3 − (1/2 − 1/3) = 2/3 − 1/2 units, which can be taken from µ(1). After that, µ has 1/2 − (2/3 − 1/2) = 1 − 2/3 units of excess products at the point 1, which should be moved to the point 3/2.
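This bookkeeping can be reproduced exactly in rational arithmetic: integrating |Fµ − Fν| over the intervals between the support points gives 1/3, the same number as the three terms of (2.5). A minimal sketch:

```python
# Evaluate the right-hand side of (2.2) for Example 2 with exact fractions.
from fractions import Fraction as F

mu = {F(0): F(1, 2), F(1): F(1, 2)}
nu = {F(0): F(1, 3), F(1, 2): F(1, 3), F(3, 2): F(1, 3)}

def cdf(measure, x):
    return sum(w for p, w in measure.items() if p <= x)

pts = sorted(set(mu) | set(nu))          # 0, 1/2, 1, 3/2
total = F(0)
for a, b in zip(pts, pts[1:]):           # both CDFs are constant on [a, b)
    total += abs(cdf(mu, a) - cdf(nu, a)) * (b - a)
print(total)   # 1/3, matching the transport cost in (2.5)
```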

Hence we check the correctness of the formula (2.2) in one direction in (2.6). For this simple example, it is not difficult to check that this upper bound is indeed the optimal cost, by an ad-hoc argument of listing all possible couplings. However, in general,

How to argue that “≤” is an “=”?

This will be answered in the next section using a dual formula. For quantities defined by an infimum problem, it is always easy to get an upper bound estimate. Dually, for quantities defined by a supremum problem, it is easy to get lower bound estimates. If the lower and upper bounds of one quantity coincide, we then get the exact value. This is the philosophy we will explore in the next section.

Applications:

Recall that in the last section we introduced the abstract notion of a metric space. Now let’s consider the set of all finite graphs. One may ask,

Is there any proper distance function (metric) for the space of finite graphs?

This question is partially motivated by requirements in biology when one studies the evolution of living creatures. Recently, jointly with Jiao Gu and Bobo Hua, we employed W1 to suggest one possible choice of a metric [2].

Given a finite graph G with N vertices, we have the eigenvalue sequence (of its normalized Laplacian, whose spectrum lies in [0, 2])

0 = λ0 ≤ λ1 ≤ λ2 ≤ · · · ≤ λN−1 ≤ 2. (2.7)

This provides N points on the real line. We then assign the spectral (probability) measure

µG = (1/N) ∑_{i=0}^{N−1} δ_{λi} (2.8)


to the graph G. The spectral distance between two graphs G, G′ can then be defined as

d(G, G′) := W1(µG, µG′). (2.9)

This is in fact a pseudometric on the space of finite graphs, since when d(G, G′) = 0, G and G′ can still be different. (Readers who are interested in this respect may find it helpful to explore more material about cospectral graphs.)

Importantly, this distance is monotonic with respect to the evolutionary distance when applied to biological networks.
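A minimal sketch of the spectral distance (2.9) for two small graphs with the same number of vertices, assuming (as in [2]) that the λi are eigenvalues of the normalized graph Laplacian. With N atoms of mass 1/N each on the real line, the optimal coupling simply matches eigenvalues in increasing order, a consequence of Theorem 2:

```python
# Spectral distance between the triangle K3 and the path P3 (both on 3 vertices).
import numpy as np

def normalized_laplacian_eigs(adj):
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt
    return np.sort(np.linalg.eigvalsh(L))

triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
path3    = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)

lam_g  = normalized_laplacian_eigs(triangle)   # 0, 3/2, 3/2
lam_g2 = normalized_laplacian_eigs(path3)      # 0, 1, 2

# Equal numbers of equal-mass atoms: W1 matches sorted eigenvalues pairwise.
d_spec = np.abs(lam_g - lam_g2).sum() / len(lam_g)
print(d_spec)   # ≈ 1/3
```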

For two general probability measures µ, ν, which are not necessarily graph spectral measures, we have

W1(µ, ν) ≤ 2. (2.10)

Furthermore, “=” can be attained. (Think of the example where µ, ν have the single points 0, 1 as their respective supports.) However, when we restrict ourselves to graph spectral measures, we have a better estimate.

Theorem 3 (Gu-Hua-Liu [2]).

d1(G,G′) ≤ 1.

Let us try to understand this result. (The proof involves quite a few technical details.) The question is: what is special about a graph spectral measure? It is the following trace formula.

∑_{i=0}^{N−1} λi = N, or (1/N) ∑_{i=0}^{N−1} λi = 1. (2.11)

This property can be depicted in the figure below as the fact that the area of the shadowed region equals 1.

[Figure: the graph of FµG in the rectangle [0, 2] × [0, 1], with jumps at λ0, λ1, λ2; the shadowed region has area 1]

Noting further that the area of the rectangle [0, 2] × [0, 1] is 2, we get

∫_0^2 FµG(x) dx = 1. (2.12)


This means the graphs of the cumulative distribution functions of spectral graph measures stay around the diagonal of the rectangle in the above figure. Therefore

∫_R |FµG(x) − FµG′(x)| dx

looks very likely to be smaller than one. In fact, the above observations are exactly the main point of our proof.

Moreover, “=” can be attained if we adopt the following convention. If we insist that the trace formula (2.11) still hold, then the spectrum of the graph with a single vertex should be assigned the value 1. Recall that a connected graph with two vertices has the spectrum 0, 2. Hence we have

d1(K1, K2) = 1, (2.13)

where K1 denotes the single-vertex graph and K2 the connected graph on two vertices. Therefore we can say the diameter of the space of finite connected graphs equals one.

Concerning our purpose of application, sizes of graphs are usually very large. For two large graphs, if their difference is small, then their distance should be close to zero. This phenomenon is captured by the following result.

For a graph G of size N, let G′ be the graph obtained from G by a finite number C of edit operations. Here by edit operations we include, for example, the following cases.

• deleting an edge;

• vertex replication;

• contracting an edge.


Theorem 4 (Gu-Hua-Liu [2]). Let G, G′ be as above. Then

d1(G, G′) ≤ C · (1/N), (2.14)

where C is independent of N, depending only on the types and number of edit operations.

To get such an upper bound estimate, one only needs to choose a particular transfer plan (coupling). The ingredient here is the so-called “interlacing inequalities” for the eigenvalues of two graphs G and G′ as described above. The point here is that C does not depend on N. Note further that while Theorem 3 is a diameter estimate, Theorem 4 describes the behavior of the distance when N is large.

3. Duality formula and linear programming

Let us come back to the question we asked after the explanation of (2.5). Since the distance W1 is defined by an “inf”, it is comparably easy to get an upper bound. Then how to prove that “=” holds? Imagine that we had a quantity defined by a “sup”; then it would be comparably easy to obtain a lower bound. Once we have both, it is easier to prove “=”. Explicitly, if we have

W1 ≤ c and W1 ≥ c, (3.1)

then W1 = c. We now start to pursue such a situation. Recall

W1(µ, ν) = inf_{ξ≥0, ∑_{i=1}^{n} ξ(xi,yj)=ν(yj), ∑_{j=1}^{m} ξ(xi,yj)=µ(xi)} ∑_{i=1}^{n} ∑_{j=1}^{m} d(xi, yj) ξ(xi, yj). (3.2)

We try to rewrite the above formula in a more compact way. It is easy to arrange ξ as an n × m matrix,

⎛ ξ(x1, y1) ξ(x1, y2) · · · ξ(x1, ym) ⎞
⎜ ξ(x2, y1) ξ(x2, y2) · · · ξ(x2, ym) ⎟
⎜ · · · ⎟
⎝ ξ(xn, y1) ξ(xn, y2) · · · ξ(xn, ym) ⎠. (3.3)

We can also write it as a vector (the vectorization of the above matrix), i.e., we write it line by line such that ξ ∈ R^{nm} and

ξ = (ξ(x1, y1), ξ(x1, y2), . . . , ξ(x1, ym), ξ(x2, y1), . . . , ξ(xn, y1), . . . , ξ(xn, ym))^T. (3.4)


Note that we take the transpose to make it a column vector. Similarly we can write the distances between the xi’s and yj’s as a column vector d ∈ R^{nm}. Then we have

∑_{i=1}^{n} ∑_{j=1}^{m} d(xi, yj) ξ(xi, yj) = d · ξ (inner product of two vectors). (3.5)

We further denote η ∈ R^{n+m},

η := (µ(x1), . . . , µ(xn), ν(y1), . . . , ν(ym))^T. (3.6)

Then we can rewrite the constraints

{ ∑_{i=1}^{n} ξ(xi, yj) = ν(yj), ∀ j,
  ∑_{j=1}^{m} ξ(xi, yj) = µ(xi), ∀ i, (3.7)

as

A^T ξ = η (A^T is an (n + m) × nm matrix). (3.8)

Exercise 1. Find A.
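One possible answer to Exercise 1 (under the vectorization convention (3.4)) builds A^T from Kronecker products; a numpy sketch, verified against the product coupling:

```python
# One candidate constraint matrix A^T for (3.8), built with Kronecker products.
import numpy as np

n, m = 2, 3
A_T = np.vstack([
    np.kron(np.eye(n), np.ones((1, m))),   # sum over j: gives mu(x_i)
    np.kron(np.ones((1, n)), np.eye(m)),   # sum over i: gives nu(y_j)
])                                          # shape (n + m, n * m)

mu = np.array([0.5, 0.5])
nu = np.array([0.2, 0.3, 0.5])
xi = np.outer(mu, nu).ravel()               # product coupling, vectorized as in (3.4)
eta = np.concatenate([mu, nu])
assert np.allclose(A_T @ xi, eta)           # A^T xi = eta, i.e. (3.7) holds
print(A_T.shape)
```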

In conclusion, we have rewritten the definition of W1 as

W1(µ, ν) = inf_{ξ≥0, A^T ξ=η} d · ξ. (3.9)

Now we can start the following calculations.

inf_{ξ≥0, A^T ξ=η} d · ξ = inf_{ξ≥0} ( d · ξ + { 0, if A^T ξ = η; +∞, otherwise } )

= inf_{ξ≥0} ( d · ξ + sup_{Φ∈R^{n+m}} (−Φ · (A^T ξ − η)) ).

Here we introduce a new variable Φ ∈ R^{n+m},

Φ = (φ(x1), φ(x2), . . . , φ(xn), φ(y1), φ(y2), . . . , φ(ym))^T.

Bearing in mind that {xi} and {yj} may overlap, and that it would be beneficial if we could regard the entries as values of functions, we modify the new variable to be

Φ = (φ(x1), φ(x2), . . . , φ(xn), ψ(y1), ψ(y2), . . . , ψ(ym))^T.


Now we continue the calculation as follows,

inf_{ξ≥0, A^T ξ=η} d · ξ = inf_{ξ≥0} sup_Φ ( d · ξ − Φ · A^T ξ + Φ · η ) = inf_{ξ≥0} sup_Φ ( (d − AΦ) · ξ + Φ · η ) [using Φ · A^T ξ = AΦ · ξ]

= sup_Φ ( inf_{ξ≥0} (d − AΦ) · ξ + Φ · η )

= sup_Φ ( η · Φ − sup_{ξ≥0} (AΦ − d) · ξ )

= sup_Φ ( η · Φ − { 0, if AΦ ≤ d; +∞, otherwise } )

= sup_{AΦ≤d} η · Φ.

Remark 7. The change of the order of “inf” and “sup”, i.e. the first “=” in the second line of the above calculation, is not trivial. One can check that for the function f : R × R → R, (x, y) ↦ sin(x + y),

sup_x inf_y f(x, y) = −1, while inf_y sup_x f(x, y) = 1.

In fact, it is easy to prove the following.

Exercise 2. Prove that for any function f of two variables, it holds that

sup_x inf_y f(x, y) ≤ inf_y sup_x f(x, y).

However, the reverse direction is not so trivial, and its correctness needs further constraints on the function f(x, y) (typically, convexity properties). In fact, this is one of the main topics in Game Theory. It can be used to prove the existence of Nash equilibria. We call such kinds of results minimax principles. (Those who are interested are invited to the appendix at the end, where more details about this connection are explained.)

But here in our case, the minimax principle does work. Consider for our function the simple case in which every vector is a number; then the function takes the following form:

f(x, y) = ax + by − cxy, where a, b, c ∈ R are coefficients. (3.10)

The graph of such a function typically looks like

[Figure: saddle surface, the graph of f(x, y) = x + y − 4xy]

On such a saddle surface, the order of “inf” and “sup” does not matter. Imagine you are standing at one point on this surface; you will arrive at the same destination when you adopt the following two different strategies.

• First walk in the x-axis direction to the highest point and then in the y-axis direction to the lowest point;

• First walk in the y-axis direction to the lowest point and then in the x-axis direction to the highest point.

OK, let’s stop the discussion about the minimax principle and come back to our calculation. Actually we have obtained

inf_{ξ≥0, A^T ξ=η} d · ξ = sup_{AΦ≤d} η · Φ, (3.11)

where η and d are fixed vectors, A is a matrix, and ξ, Φ are our variables. The right hand side of (3.11) is a linear programming problem, i.e. a linear optimization problem with convex constraints. (Recall that linear programming was invented by Kantorovich.)

It is easy to see that

η · Φ = ∑_{i=1}^{n} φ(xi)µ(xi) + ∑_{j=1}^{m} ψ(yj)ν(yj).

Exercise 3. Check that AΦ ≤ d can be written as φ(xi) + ψ(yj) ≤ d(xi, yj),∀i, j.


In conclusion, we arrive at the following dual formula:

W1(µ, ν) = sup_{φ,ψ: φ(xi)+ψ(yj)≤d(xi,yj), ∀i,j} ( ∑_{i=1}^{n} φ(xi)µ(xi) + ∑_{j=1}^{m} ψ(yj)ν(yj) ). (3.12)

For later purposes, we make a tiny modification of the above formula, replacing φ by −φ:

W1(µ, ν) = sup_{φ,ψ: ψ(yj)−φ(xi)≤d(xi,yj), ∀i,j} ( ∑_{j=1}^{m} ψ(yj)ν(yj) − ∑_{i=1}^{n} φ(xi)µ(xi) ). (3.13)

So nothing really changes.

How to understand this dual formula?

Let’s come back to our factory-shop model. Suppose now there comes another guy, say Bob, who tells you: “Hi, don’t worry about what is to be moved where. Let me handle this for you. I will ship the products from the factories to your shops. At the factory xi, I will pay them a price φ(xi) for taking out every unit of products. And at the shop yj I will ask you for a price ψ(yj) for every unit of products. In order to make you want me to handle these deals, according to your financial interests I should set the prices φ, ψ such that

ψ(yj) − φ(xi) ≤ d(xi, yj), ∀ i, j,

since otherwise you would buy the products at xi and ship them to yj by yourself.” In the above process, Bob’s income is

∑_{j=1}^{m} ψ(yj)ν(yj) − ∑_{i=1}^{n} φ(xi)µ(xi).

Then the duality formula (3.13) tells us that if Bob is clever enough, he can get as much money as what you would need to pay if you managed the transport yourself.

Can we reduce the number of variables in the ”sup” from two to one?

Let’s solve this problem by reconsidering Bob’s situation. Say, at present we have a fixed buying price φ and selling price ψ such that

−φ(xi) + ψ(yj) ≤ d(xi, yj), ∀ i, j.


How can Bob increase his income by resetting the price ψ? Since he knows from the above that ψ(yj) ≤ φ(xi) + d(xi, yj), he can improve ψ(yj) to the greatest lower bound of φ(xi) + d(xi, yj) over all i, i.e. he improves it to

ψ1(yj) = inf_{xi} (φ(xi) + d(xi, yj)). (3.14)

Symmetrically, suppose ψ1 is already fixed. Since φ(xi) ≥ ψ1(yj) − d(xi, yj), we can improve φ(xi) to the least upper bound of ψ1(yj) − d(xi, yj) over all j, i.e.

φ1(xi) = sup_{yj} (ψ1(yj) − d(xi, yj)). (3.15)

One may expect that this process could be continued recursively infinitely many times,

φ → ψ1 → φ1 → ψ2 → φ2 → · · · .

However, it turns out that it stops very soon. In fact, we have

Proposition 1. ψ2 = ψ1.

The argument for this proposition is easy. By the intuition above, ψ2 improves ψ1; therefore we have

ψ2 ≥ ψ1. (3.16)

Similarly, φ1 improves φ, hence φ1 ≤ φ. Then we obtain

ψ2(yj) = inf_{xi} (φ1(xi) + d(xi, yj)) ≤ inf_{xi} (φ(xi) + d(xi, yj)) = ψ1(yj). (3.17)

Now (3.16) and (3.17) together imply the proposition. Therefore the formula (3.13) still holds if we replace φ, ψ by φ1, ψ1. (Note that given a function φ, we can then derive ψ1 and φ1.)

W1(µ, ν) = sup_φ ( ∑_{j=1}^{m} ψ1(yj)ν(yj) − ∑_{i=1}^{n} φ1(xi)µ(xi) ). (3.18)

Proposition 2. If x ∈ {xi}_{i=1}^{n} ∩ {yj}_{j=1}^{m}, i.e. the supports of µ, ν share the point x, then

φ1(x) = ψ1(x).


Arguments: By (3.17), we have

ψ1(x) = ψ2(x) ≤ φ1(x), by taking xi = x in the inf. (3.19)

We need the concept of a so-called 1-Lipschitz function here. A function ψ : {xi}_{i=1}^{n} → R is called 1-Lipschitz if

|ψ(xi) − ψ(xj)| ≤ d(xi, xj), ∀ i, j.

Note that we call it 1-Lipschitz because the constant in front of d(xi, xj) can be taken to be 1.

Example 3. The distance function to a fixed point in a metric space is a 1-Lipschitz function, by the triangle inequality.

Recall that in the definition of ψ1, the distance function d(xi, yj) is the main ingredient. Therefore it is not so hard to prove the following fact.

Exercise 4. Prove ψ1 is a 1-Lipschitz function.

By this fact, we can derive the other direction of the estimate,

φ1(x) = sup_{yj} {ψ1(yj) − d(x, yj)} ≤ ψ1(x). (3.20)

Now (3.19) and (3.20) together imply φ1(x) = ψ1(x). Thanks to Zhao Longfeng’s comments, we can also easily understand this fact from the factory-shop model. If the factory and the shop are at the same place, then the prices should coincide, since the transportation is free.

Now we can combine φ1, ψ1 into one function:

Ψ(x) = { φ1(xi), if x = xi; ψ1(yj), if x = yj. (3.21)

Due to Proposition 2, this function is well-defined. Then we can continue the calculation from (3.18).

W1(µ, ν) = sup_φ ( ∑_{j=1}^{m} ψ1(yj)ν(yj) − ∑_{i=1}^{n} φ1(xi)µ(xi) )

= sup_φ ( ∑_{j=1}^{m} Ψ(yj)ν(yj) − ∑_{i=1}^{n} Ψ(xi)µ(xi) )

≤ sup_{Ψ 1-Lipschitz} ( ∑_{j=1}^{m} Ψ(yj)ν(yj) − ∑_{i=1}^{n} Ψ(xi)µ(xi) ) ≤ W1(µ, ν).


The last inequality above uses the original duality formula (3.13). Then we know all the “≤” must be “=”. Therefore we successfully obtain the following formula, which answers our question positively.

W1(µ, ν) = sup_{Ψ 1-Lipschitz} ( ∑_{j=1}^{m} Ψ(yj)ν(yj) − ∑_{i=1}^{n} Ψ(xi)µ(xi) ). (3.22)

This is called the Kantorovich-Rubinstein duality formula. One only needs to find a particular 1-Lipschitz function on supp(µ) ∪ supp(ν) to get a lower bound on W1(µ, ν).
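For Example 2 of Section 2, one such witness can be read off from the sign of Fµ − Fν: a piecewise linear Ψ with slopes +1, −1, +1 between the support points. It is 1-Lipschitz, and its dual value already matches the upper bound 1/3 from (2.5), so W1(µ, ν) = 1/3. A sketch:

```python
# A 1-Lipschitz witness for Example 2 via the dual formula (3.22).
pts = [0.0, 0.5, 1.0, 1.5]
Psi = {0.0: 0.0, 0.5: 0.5, 1.0: 0.0, 1.5: 0.5}   # slopes +1, -1, +1
mu  = {0.0: 0.5, 1.0: 0.5}
nu  = {0.0: 1/3, 0.5: 1/3, 1.5: 1/3}

# Check the 1-Lipschitz condition on the union of the supports.
assert all(abs(Psi[p] - Psi[q]) <= abs(p - q) + 1e-12 for p in pts for q in pts)

lower = sum(Psi[y] * w for y, w in nu.items()) - sum(Psi[x] * w for x, w in mu.items())
print(lower)   # ≈ 1/3, a lower bound matching the upper bound (2.5)
```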

Exercise 5. Prove the one dimensional formula for Example 2 in section 2.

4. Triangles and related geometric topics

Let’s work on a finite graph in this section. For any two neighboring vertices x and y, we consider the corresponding two probability measures, mx and my, where

mx(z) = { 1/dx, when z ∼ x; 0, otherwise, (4.1)

where dx denotes the degree of x.

mx(z) is the probability that a person standing at x will be at z in the next step if he moves randomly. Bear in mind the following figure.

[Figure: two neighboring vertices x and y together with their neighbors]

Solid points are the support of mx, whereas empty circles are that of my. We observe that the common points of the supports of mx and my, together with x, y, constitute triangles. We denote the number of common neighbors of x, y, or equivalently the number of triangles containing x, y, by ♯(x, y).

Then we may ask for the optimal transport distance

W1(mx,my).

In fact, this quantity is used to define some kind of curvature notion by Ollivier [5].

Definition 2 (Ollivier). For any two points x, y ∈ V , the curvature κ(x, y) along x, y is defined as

κ(x, y) := 1 − W1(mx, my)/d(x, y). (4.2)


In the above, d(x, y) is the graph distance, i.e. the number of edges in a shortest path connecting x and y. For neighboring vertices, this curvature reads

\[
\kappa(x, y) = 1 - W_1(m_x, m_y).
\]
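On a small regular graph this curvature can be computed by brute force: mx and my are then uniform measures on equally many points, so by the Birkhoff-von Neumann theorem some optimal coupling is a permutation matching. A sketch for an edge of the complete graph K4 (a hypothetical example, not from the text):

```python
from itertools import permutations

def w1_uniform(pts_mu, pts_nu, dist):
    """Exact W1 between uniform measures on two equal-size point sets.

    By Birkhoff-von Neumann, some optimal coupling is a permutation matching;
    brute force over permutations is fine only for small supports.
    """
    d = len(pts_mu)
    assert len(pts_nu) == d
    return min(sum(dist(a, b) for a, b in zip(pts_mu, perm))
               for perm in permutations(pts_nu)) / d

# Complete graph K4 on {0, 1, 2, 3}: distinct vertices are at distance 1.
dist = lambda a, b: 0 if a == b else 1
N = {0: [1, 2, 3], 1: [0, 2, 3]}      # neighbors of x = 0 and y = 1
kappa = 1 - w1_uniform(N[0], N[1], dist)
print(kappa)  # 2/3: the shared mass on {2, 3} stays put, only 1/3 moves
```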

Let me first introduce some intuitions about curvature in geometry; we then come back to the curvature and geometry of graphs.

We know that the three typical surfaces of constant curvature look like the following figures.

[Figure: three surfaces, each with a marked point x: a sphere (curvature > 0), a plane (curvature = 0), and a saddle (curvature < 0).]

Imagine squashing these three surfaces onto a table; we would then roughly get the following.

[Figure: the flattened surfaces. The zero curvature case lies flat; the positive curvature case leaves a gap; the negative curvature case produces an excess of material.]

Or, mathematically, we calculate the area of the disk

\[
D(x, r) := \{ y \in \text{surface} : d(x, y) < r \}.
\]

For the Euclidean plane, i.e. the zero curvature case, this area is πr². For the sphere, i.e. the positive curvature case, the area of D(x, r) grows more slowly than r², and when r is large enough it simply stops increasing. For the saddle surface, i.e. the negative curvature case, the growth rate is faster than r².

Intuition: the smaller the curvature, the faster the volume grows.

We can think about the volume growth of graphs in the same way. If we fix the degree of every vertex, trees should have the fastest volume growth rate.


[Figure: a tree (left) and a graph containing triangles (right), each with marked vertices x and y.]

Counting vertices, D(x, 1) in the tree (left figure) contains 4 vertices and D(x, 2) contains 10. For the graph on the right, D(x, 1) also contains 4 vertices, but D(x, 2) only 7. Evidently the existence of triangles prevents the volume from growing fast. By this intuition, the larger the number of triangles, the slower the volume growth rate, and hence the larger the curvature. The following results should convince the reader of this fact.
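Ball counts like these can be checked by breadth-first search; a sketch in Python (the two adjacency lists below are hypothetical graphs chosen to match the counts in the text, not the exact figures):

```python
from collections import deque

def ball(adj, x, r):
    """Vertices within graph distance r of x, by breadth-first search."""
    dist = {x: 0}
    q = deque([x])
    while q:
        u = q.popleft()
        if dist[u] < r:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
    return set(dist)

# A tree in which x = 0 and all its neighbors have degree 3.
tree = {0: [1, 2, 3], 1: [0, 4, 5], 2: [0, 6, 7], 3: [0, 8, 9],
        4: [1], 5: [1], 6: [2], 7: [2], 8: [3], 9: [3]}
# A graph with triangles: edges 0-1, 0-2, 0-3, 1-2, 1-4, 2-5, 3-4, 3-6.
tri = {0: [1, 2, 3], 1: [0, 2, 4], 2: [0, 1, 5], 3: [0, 4, 6],
       4: [1, 3], 5: [2], 6: [3]}

print(len(ball(tree, 0, 1)), len(ball(tree, 0, 2)))  # 4 10
print(len(ball(tri, 0, 1)), len(ball(tri, 0, 2)))    # 4 7
```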

Theorem 5 (Lin-Yau [4], Jost-Liu [3]). For a locally finite graph G = (V, E) and any pair of neighboring vertices x ∼ y, we have

\[
\kappa(x, y) \ge -2 \left( 1 - \frac{1}{d_x} - \frac{1}{d_y} \right)_+. \tag{4.3}
\]

Moreover, trees attain equality above.

In the above, we use the positive part of a real number,

\[
a_+ := \begin{cases} a, & \text{if } a \ge 0;\\ 0, & \text{otherwise.} \end{cases}
\]

We use it here only to include the case where the degree dx or dy equals 1, i.e. one of the vertices is a leaf; it is easy to see that in that case κ(x, y) = 0. This slightly complicated notation, used here only for a tiny special case, nevertheless represents the right form for the general formula.

Theorem 6 (Jost-Liu [3]). For x ∼ y, we have

\[
\kappa(x, y) \ge -\left( 1 - \frac{1}{d_x} - \frac{1}{d_y} - \frac{\sharp(x, y)}{d_x \wedge d_y} \right)_+ - \left( 1 - \frac{1}{d_x} - \frac{1}{d_y} - \frac{\sharp(x, y)}{d_x \vee d_y} \right)_+ + \frac{\sharp(x, y)}{d_x \vee d_y}.
\]

In the above, ∧ stands for the minimum of two numbers and ∨ for the maximum. We also refer the reader to Bauer-Jost-Liu [1] for a version of this result for weighted graphs (also permitting self-loops). This formula convinces us that the curvature becomes larger as the number of triangles increases: ♯(x, y) ≠ 0


means that some of the mass does not need to be moved at all, hence the transport cost becomes smaller and the curvature larger.

There is also a quite direct upper bound for the curvature; see [3]:

\[
\kappa(x, y) \le \frac{\sharp(x, y)}{d_x \vee d_y}. \tag{4.4}
\]
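The lower bound of Theorem 6 and the upper bound (4.4) are elementary to evaluate; a small sketch (the edge of K4, with dx = dy = 3 and two common neighbors, is a hypothetical example where the two bounds coincide and so determine κ exactly):

```python
def pos(a):
    """Positive part a_+ of a real number."""
    return max(a, 0.0)

def kappa_bounds(dx, dy, tri):
    """Jost-Liu lower bound (Theorem 6) and upper bound (4.4) on kappa(x, y).

    tri is the number of triangles on the edge, i.e. sharp(x, y).
    """
    lo = (-pos(1 - 1/dx - 1/dy - tri / min(dx, dy))
          - pos(1 - 1/dx - 1/dy - tri / max(dx, dy))
          + tri / max(dx, dy))
    hi = tri / max(dx, dy)
    return lo, hi

# Edge of K4: dx = dy = 3, two common neighbors.
lo, hi = kappa_bounds(3, 3, 2)
print(lo, hi)  # both equal 2/3, so kappa = 2/3 on every edge of K4
```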

The point of this estimate is that positive curvature implies ♯(x, y) > 0, and then the graph cannot be bipartite. Recall that a basic property of the spectrum 0 = λ₀ ≤ λ₁ ≤ · · · ≤ λ_{N−1} ≤ 2 of the normalized Laplace operator on a finite graph is

\[
\lambda_{N-1} = 2 \iff G \text{ is bipartite.}
\]

Therefore it is reasonable to have

\[
\kappa > 0 \implies \lambda_{N-1} < 2.
\]

In fact, the following holds.

Theorem 7 (Ollivier [5]). If κ(x, y) ≥ k > 0 for all x ∼ y, then

\[
k \le \lambda_1 \le \cdots \le \lambda_{N-1} \le 2 - k. \tag{4.5}
\]

An extension of this result that produces nontrivial estimates for general finite graphs (whether or not the curvature is strictly positive) is given in [1].
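Theorem 7 can be checked numerically on a small example. The sketch below (assuming NumPy, and using K4 as a hypothetical test graph; from Theorem 6 together with (4.4), every edge of K4 has curvature k = 2/3) builds the normalized Laplacian L = I − D^{−1/2} A D^{−1/2} and compares its spectrum with k:

```python
import numpy as np

# Normalized Laplacian of the complete graph K4.
n = 4
A = np.ones((n, n)) - np.eye(n)                   # adjacency matrix of K4
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt
lam = np.sort(np.linalg.eigvalsh(L))
print(lam)  # eigenvalues 0, 4/3, 4/3, 4/3

# Every edge of K4 has curvature k = 2/3, and (4.5) holds with equality:
k = 2.0 / 3.0
assert k <= lam[1] + 1e-9 and lam[-1] <= 2 - k + 1e-9
```

Here both inequalities of (4.5) are attained, so the estimate is sharp on K4.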

I will conclude this section by explaining how to prove that, on a tree, equality in (4.3) holds for two neighboring vertices x, y with dx, dy > 1. That is, we prove that on a tree, for x ∼ y satisfying dx, dy > 1,

\[
W_1(m_x, m_y) = 3 - \frac{2}{d_x} - \frac{2}{d_y}. \tag{4.6}
\]

This will be a perfect demonstration of the formulas for the optimal transport distance, (1.15) and (3.22), discussed in this lecture.

First, we use the dual formula (3.22). We choose the 1-Lipschitz function Ψ as follows. (The number at each vertex stands for the function value we assign there.)

[Figure: Ψ takes the value 0 on the neighbors of x other than y, the value 1 at x, the value 2 at y, and the value 3 on the neighbors of y other than x.]


Then we calculate

\[
W_1(m_x, m_y) \ge \frac{1}{d_y}\bigl(3(d_y - 1) + 1\bigr) - \frac{1}{d_x} \times 2 = 3 - \frac{2}{d_x} - \frac{2}{d_y}. \tag{4.7}
\]

Second, we use the formula (1.15). To this end, we need to find a particular coupling ξ; that is, equivalently, a transfer plan which allows splitting the products from the same factory. In our case,

\[
1 - \frac{1}{d_x} - \frac{1}{d_y} \ge 0,
\]

i.e. 1 − 1/dx ≥ 1/dy and 1 − 1/dy ≥ 1/dx. Therefore we can supply the products needed at vertex x exclusively from x's neighbors (not including y), and transfer all the products at y to y's neighbors (but not to x). This changes the distribution of products from the first figure to the second one below. The cost of each individual transport in this step is 1.

[Figure: the distribution of products around x and y at the successive stages of the transfer plan, from the initial distribution mx to the final distribution my.]

From the second distribution to the final one, we only need to transport the products at x's neighbors (not including y) to the neighbors of y (not including x); on a tree these vertices are at distance 3, so the cost of each individual transport in this step is 3. In conclusion, we get an upper bound on the total cost:

\[
W_1(m_x, m_y) \le \frac{1}{d_x} \times 1 + \frac{1}{d_y} \times 1 + \left( 1 - \frac{1}{d_x} - \frac{1}{d_y} \right) \times 3 = 3 - \frac{2}{d_x} - \frac{2}{d_y}. \tag{4.8}
\]

This together with (4.7) proves the equality (4.6).
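The two halves of this argument can be checked numerically; a minimal sketch in Python (the function names are mine, not from the text):

```python
def tree_plan_cost(dx, dy):
    """Cost of the explicit two-step plan of the text, an upper bound on W1,
    for neighbors x ~ y in a tree with dx, dy > 1."""
    # mass 1/dx sitting at y moves distance 1 to y's other neighbors,
    # mass 1/dy needed at x comes distance 1 from x's other neighbors,
    # the remaining mass travels distance 3.
    return (1/dx) * 1 + (1/dy) * 1 + (1 - 1/dx - 1/dy) * 3

def tree_dual_bound(dx, dy):
    """Dual lower bound from the 1-Lipschitz Psi of the figure:
    Psi = 0 on x's other neighbors, 1 at x, 2 at y, 3 on y's other neighbors."""
    return (3 * (dy - 1) + 1) / dy - 2 / dx

for dx, dy in [(2, 2), (3, 5), (4, 7)]:
    lo, hi = tree_dual_bound(dx, dy), tree_plan_cost(dx, dy)
    assert abs(lo - hi) < 1e-12                    # bounds coincide ...
    assert abs(hi - (3 - 2/dx - 2/dy)) < 1e-12     # ... at 3 - 2/dx - 2/dy
print("equality (4.6) verified")
```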


5. Answers to Exercises

Answer to Exercise 1:

\[
A^T = \begin{pmatrix}
\mathbf{1}_m^T & 0 & \cdots & 0 \\
0 & \mathbf{1}_m^T & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \mathbf{1}_m^T \\
I_m & I_m & \cdots & I_m
\end{pmatrix}, \tag{5.1}
\]

written in block form with n block-columns, each of width m (so nm columns in total). The upper part consists of n rows: the i-th row carries the all-ones row vector 1ₘᵀ in the i-th block and zeros elsewhere. The lower part consists of m rows: every block is the m × m identity matrix Iₘ.
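The matrix (5.1) is two Kronecker products stacked on top of each other; a sketch assuming NumPy (the function name is mine):

```python
import numpy as np

def transport_constraint_matrix(n, m):
    """The (n + m) x (n*m) matrix A^T of (5.1): columns are indexed by pairs
    (i, j); row i of the upper part sums the mass shipped out of x_i, and
    row j of the lower part sums the mass shipped into y_j."""
    upper = np.kron(np.eye(n), np.ones((1, m)))  # n rows of m consecutive ones
    lower = np.kron(np.ones((1, n)), np.eye(m))  # n copies of the m x m identity
    return np.vstack([upper, lower])

AT = transport_constraint_matrix(2, 3)
print(AT.astype(int))  # 5 x 6: two rows of block sums, then I_3 repeated twice
```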

Answer to Exercise 2: For any x, y ∈ R, we have

\[
\inf_s f(x, s) \le f(x, y) \le \sup_t f(t, y).
\]

This immediately implies supₓ inf_s f(x, s) ≤ inf_y sup_t f(t, y).
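This easy direction of the minimax inequality can be sanity-checked on a random finite payoff matrix (a hypothetical example):

```python
import random

random.seed(0)
# A random finite "payoff matrix" f(x, y), with x indexing rows and y columns.
f = [[random.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(5)]

sup_inf = max(min(row) for row in f)                             # sup_x inf_y f
inf_sup = min(max(f[i][j] for i in range(5)) for j in range(4))  # inf_y sup_x f
assert sup_inf <= inf_sup  # the easy minimax inequality always holds
print("sup inf <= inf sup")
```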

Answer to Exercise 4: Recall ψ₁(y) = inf_x{φ(x) + d(x, y)}. For any i, j, let x₀ be a point where

\[
\psi_1(y_j) = \phi(x_0) + d(x_0, y_j).
\]

Then we calculate

\[
\psi_1(y_i) - \psi_1(y_j) \le \phi(x_0) + d(x_0, y_i) - \bigl(\phi(x_0) + d(x_0, y_j)\bigr) = d(x_0, y_i) - d(x_0, y_j) \le d(y_i, y_j),
\]

where the last step is the triangle inequality. Symmetrically, one argues ψ₁(y_j) − ψ₁(y_i) ≤ d(y_j, y_i). This completes the proof.
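The inf-convolution ψ₁(y) = inf_x(φ(x) + d(x, y)) can also be tested numerically; a sketch on hypothetical random points of the real line:

```python
import random

def lipschitz_smoothing(phi, points, d):
    """psi_1(y) = min_x (phi(x) + d(x, y)); such an inf-convolution is 1-Lipschitz."""
    return {y: min(phi[x] + d(x, y) for x in points) for y in points}

random.seed(1)
pts = [random.uniform(0.0, 10.0) for _ in range(8)]
phi = {x: random.uniform(-5.0, 5.0) for x in pts}
d = lambda a, b: abs(a - b)

psi = lipschitz_smoothing(phi, pts, d)
# The 1-Lipschitz property proved in Exercise 4:
assert all(abs(psi[a] - psi[b]) <= d(a, b) + 1e-12 for a in pts for b in pts)
print("psi_1 is 1-Lipschitz")
```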

Appendix: Minimax principle and Nash equilibrium

The nontrivial part of the minimax principle, namely that for a function f(x, y),

\[
\inf_y \sup_x f(x, y) \le \sup_x \inf_y f(x, y), \tag{5.2}
\]

is closely related to the existence of Nash equilibria. The contents of this appendix are based on notes taken in the classes of Prof. Jürgen Jost at the MPI for Mathematics in the Sciences, Leipzig, on the topic of game theory.


Consider a two-player game. Alice chooses an action from a space K₁ and Bob from a space K₂; write V = K₁ × K₂ for the space of strategy profiles, and denote the payoff functions of the two players by

\[
\pi_A, \pi_B : K_1 \times K_2 \to \mathbb{R}.
\]

We take an auxiliary function f : V × V → R defined by

\[
f(x, y) = \pi_A(x_1, y_2) - \pi_A(y_1, y_2) + \pi_B(y_1, x_2) - \pi_B(y_1, y_2),
\]

where x = (x₁, x₂) ∈ V and y = (y₁, y₂) ∈ V.

Suppose f satisfies the minimax principle (5.2) (whose validity is certainly not automatic). Then we have

\[
\inf_y \sup_x f(x, y) \le \sup_x \inf_y f(x, y) \le \sup_x f(x, x) = 0.
\]

These inequalities imply that there exists x* ∈ V such that, for any x ∈ V,

\[
f(x, x^*) \le 0.
\]

This then yields, for any x ∈ V,

\[
\pi_A(x_1, x^*_2) \le \pi_A(x^*_1, x^*_2), \qquad \pi_B(x^*_1, x_2) \le \pi_B(x^*_1, x^*_2).
\]

(This is because, for fixed x*, the first two terms of f form a function of x₁ alone, whereas the last two terms form a function of x₂ alone; they therefore vary independently as x varies freely.) That is to say, at x* = (x*₁, x*₂), a unilateral change of action by either Alice or Bob cannot increase that player's income. Hence x* is a Nash equilibrium point.

As we have remarked, f does not have property (5.2) for free. A very general condition under which f does have property (5.2) is due to a result of Ky Fan, a famous Chinese mathematician who was born in Hangzhou in 1914 and studied in Paris during 1939-1941. Let us end this appendix with his famous theorem.

Theorem 8 (Ky Fan). Let V be a compact, convex, non-empty subset of a topological vector space, and let f : V × V → R satisfy

• concavity in the first argument,

\[
f\Bigl( \sum_{i=1}^m a_i x_i, \, y \Bigr) \ge \sum_{i=1}^m a_i f(x_i, y),
\]

for all xᵢ, y ∈ V and all aᵢ > 0 with ∑ᵢ₌₁ᵐ aᵢ = 1;


• lower semi-continuity in the second argument,

\[
f(x, y) \le \liminf_{n \to \infty} f(x, y_n)
\]

whenever yₙ converges to y ∈ V.

Then we have

\[
\inf_y \sup_x f(x, y) \le \sup_y f(y, y). \tag{5.3}
\]

This is slightly weaker than (5.2); however, it is enough to derive the existence of a Nash equilibrium point as discussed above.
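As a concrete finite illustration of the equilibrium condition derived above (a brute-force deviation check for pure strategies, not an application of Ky Fan's theorem; the payoff matrices are a hypothetical Prisoner's Dilemma):

```python
# Payoffs for a 2x2 game (Prisoner's Dilemma): entry [i][j] is the payoff when
# Alice plays i and Bob plays j (0 = cooperate, 1 = defect).
piA = [[3, 0], [5, 1]]
piB = [[3, 5], [0, 1]]

def pure_nash(piA, piB):
    """All pure-strategy profiles at which no unilateral deviation raises a payoff."""
    n, m = len(piA), len(piA[0])
    return [(i, j) for i in range(n) for j in range(m)
            if piA[i][j] == max(piA[k][j] for k in range(n))
            and piB[i][j] == max(piB[i][l] for l in range(m))]

print(pure_nash(piA, piB))  # [(1, 1)]: mutual defection is the unique equilibrium
```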

References

[1] F. Bauer, J. Jost, S. Liu, Ollivier-Ricci curvature and the spectrum of the nor-malized graph Laplace operator, Math. Res. Lett. 19(2012), no. 6, 1185-1205.

[2] J. Gu, B. Hua, S. Liu, Spectral distance on graphs, preprint, 2013.

[3] J. Jost, S. Liu, Ollivier's Ricci curvature, local clustering and curvature-dimension inequalities on graphs, Discrete Comput. Geom., online first.

[4] Y. Lin and S. T. Yau, Ricci curvature and eigenvalue estimate on locally finitegraphs, Math. Res. Lett. 17 (2010) 343-356.

[5] Y. Ollivier, Ricci curvature of Markov chains on metric spaces, J. Funct. Anal.256(2009), no. 3, 810-864.

[6] C. Villani, Topics in optimal transportation, Graduate Studies in Mathematics,58. American Mathematical Society, Providence, RI, 2003.

[7] C. Villani, Optimal transport: old and new, Grundlehren der Mathematischen Wissenschaften, 338. Springer-Verlag, Berlin, 2009.
