Page 1: The Geometry of Distributions I Classification of Distancescgl.uni-jena.de/pub/Workshops/WebHome/SureshLecture1.pdf · Geodesics between distributions are great circles on the sphere

The Geometry of Distributions I

Classification of Distances

Suresh Venkatasubramanian, University of Utah

Histograms And Distributions

Finite (and fixed) domain: { the, apple, orange, and }

“Text” over that domain: “the apple and the orange and the orange”

(Normalized) “frequency counts” over the domain: { 3/8, 1/8, 1/4, 1/4 }

A distribution is a point on the d-simplex:

$\Delta_d = \{(x_1, x_2, \ldots, x_{d+1}) \mid \sum_i x_i = 1,\ x_i \ge 0\}$
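
As a small illustration of the example on this slide, the following Python sketch turns a text over a fixed domain into normalized frequency counts. The function name `histogram` and the use of exact fractions are choices made here for illustration and are not part of the lecture.

```python
from collections import Counter
from fractions import Fraction

def histogram(text, domain):
    """Return normalized frequency counts of `text` over a fixed `domain`."""
    words = text.split()
    counts = Counter(words)
    total = len(words)
    # One coordinate per domain element, in domain order.
    return [Fraction(counts[w], total) for w in domain]

domain = ["the", "apple", "orange", "and"]
text = "the apple and the orange and the orange"
p = histogram(text, domain)

print(p)       # coordinates of a point on the simplex
print(sum(p))  # the coordinates sum to 1
```

The sketch assumes every word of the text belongs to the domain; otherwise the coordinates would sum to less than 1.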

Comparing Distributions

Data Analysis ≡ Geometry

Problem: Find interesting patterns in a collection of data.

Geometry Must Possess Right Properties

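Assume that all points lie in Euclidean space, except...

If $v \in \mathbb{R}^d$, then $c \cdot v \in \mathbb{R}^d$ for all $c \in \mathbb{R}$, but the result can have negative entries:

$-5 \cdot (0.1, 0.2, 0.5) = (-0.5, -1.0, -2.5)$

If $v, w \in \mathbb{R}^d$, then $v + w \in \mathbb{R}^d$, but the result can have entries greater than 1:

$(0.6, 0.1, 0.7) + (0.6, 0.2, 0.1) = (1.2, 0.3, 0.8)$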
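
The point of this slide can be checked mechanically. The sketch below uses two made-up distributions `p` and `q` (not the vectors on the slide) and a hypothetical helper `is_distribution` to show that scalar multiples and sums leave the simplex, while a convex combination stays on it.

```python
def is_distribution(v, tol=1e-9):
    """True if v has nonnegative entries that sum to 1 (a point on the simplex)."""
    return all(x >= -tol for x in v) and abs(sum(v) - 1.0) <= tol

# Illustrative distributions (assumed for this example).
p = [0.2, 0.3, 0.5]
q = [0.6, 0.3, 0.1]

scaled = [-5 * x for x in p]                       # scalar multiple
summed = [a + b for a, b in zip(p, q)]             # vector sum
mixed = [0.5 * a + 0.5 * b for a, b in zip(p, q)]  # convex combination

print(is_distribution(p), is_distribution(q))
print(is_distribution(scaled), is_distribution(summed))
print(is_distribution(mixed))
```

Only the convex combination remains a distribution, which is one way to see that the simplex is convex but not a vector space.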

Distance Must Have Meaning

Instead of

“The distance between the two objects is 5.3”

we need

“The distance between the two objects is 5.3, and against the null hypothesis that they are the same, this has a p-value of 0.001.”

Comparing Models Rather Than Data

View data as being generated by models, and compare models instead of data.

Lecture Plan

Information geometry

Algorithms for information distances

Spatially-aware information distances

Distributions are generated by parameters

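Gaussian: $x \sim \mathcal{N}(\mu, \sigma)$, $p(x; \mu, \sigma) \propto \exp\left(-\frac{\|x-\mu\|^2}{2\sigma^2}\right)$

Poisson: $k \sim \mathrm{Pois}(\lambda)$, $p(k; \lambda) = \frac{\lambda^k \exp(-\lambda)}{k!}$

Multinomial: $p(x_1, \ldots, x_k; n, \theta_1, \ldots, \theta_k) = \frac{n!}{x_1! \cdots x_k!}\, \theta_1^{x_1} \theta_2^{x_2} \cdots \theta_k^{x_k}$, with $\sum_i \theta_i = 1$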
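
To make the role of the parameters concrete, here is a minimal Python sketch that evaluates the Poisson and multinomial formulas on this slide. The function names and the example parameter values are assumptions made for illustration.

```python
import math

def poisson_pmf(k, lam):
    """p(k; lam) = lam^k exp(-lam) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

def multinomial_pmf(x, theta):
    """p(x; n, theta) = n! / (x_1! ... x_k!) * prod_i theta_i^x_i, with n = sum(x)."""
    coef = math.factorial(sum(x))
    for xi in x:
        coef //= math.factorial(xi)   # stays an exact integer at every step
    prob = float(coef)
    for xi, ti in zip(x, theta):
        prob *= ti**xi
    return prob

# Each choice of parameters picks out one distribution in the family.
print([round(poisson_pmf(k, 2.0), 4) for k in range(5)])
print(multinomial_pmf((3, 1, 2, 2), (3/8, 1/8, 1/4, 1/4)))
```

Varying `lam` or `theta` moves through the family, which is the sense in which the parameters generate the distributions.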

Manifold of distributions

Space of parameters of a distribution forms a manifold.

Geodesics measure distance.

Riemannian Geometry

$L(\gamma) = \int_\gamma \sqrt{\|ds\|^2}$

Length of a tangent is set by an inner product:

$\|ds\|^2 = \sum_{i,j} g_{ij}\, ds_i\, ds_j$

$g_{ij}$ is the metric tensor.

For example, in Euclidean space, $g_{ij} = \delta_{ij}$ and $\|ds\|^2 = dx_1^2 + dx_2^2 + \cdots$
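
The length integral can be approximated by summing $\sqrt{\sum_{i,j} g_{ij}\, ds_i\, ds_j}$ over short segments of the curve. The sketch below is one possible discretization, not a method from the lecture; the function names, the midpoint rule, and the second metric (`inverse_square`, with $g_{ij} = \delta_{ij} / x_2^2$) are assumptions chosen for illustration.

```python
import numpy as np

def curve_length(gamma, metric, n=2000):
    """Approximate L(gamma) = integral over gamma of sqrt(sum_ij g_ij ds_i ds_j).

    gamma:  function mapping t in [0, 1] to a point (1-D array)
    metric: function mapping a point to the matrix G = (g_ij) at that point
    """
    t = np.linspace(0.0, 1.0, n + 1)
    pts = np.array([gamma(ti) for ti in t])
    length = 0.0
    for a, b in zip(pts[:-1], pts[1:]):
        ds = b - a                  # small displacement along the curve
        G = metric(0.5 * (a + b))   # metric tensor at the segment midpoint
        length += np.sqrt(ds @ G @ ds)
    return float(length)

def euclidean(x):
    return np.eye(len(x))           # g_ij = delta_ij

def inverse_square(x):
    return np.eye(len(x)) / x[1] ** 2   # illustrative non-Euclidean metric

def quarter_circle(t):
    return np.array([np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)])

def vertical_segment(t):
    return np.array([0.0, np.exp(t)])   # from (0, 1) to (0, e)

print(curve_length(quarter_circle, euclidean))         # approximately pi / 2
print(curve_length(vertical_segment, inverse_square))  # approximately 1
```

The same segment from (0, 1) to (0, e) has Euclidean length e - 1, so changing the metric tensor changes the length of the curve.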

Fisher Information

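Let $p(x; \theta)$ be a parametric family of distributions. Set $s = (s_1, s_2, \ldots, s_k)^\top$, $s_i = \frac{\partial \log p(x; \theta)}{\partial \theta_i}$.

$g_{ij} = E[s_i s_j] = \int p(x; \theta)\, \frac{\partial \log p(x; \theta)}{\partial \theta_i}\, \frac{\partial \log p(x; \theta)}{\partial \theta_j}\, dx = -\int p(x; \theta)\, \frac{\partial^2 \log p(x; \theta)}{\partial \theta_j\, \partial \theta_i}\, dx$

$G = \{g_{ij}\}$ is the Fisher information.

Fisher information acts like a “curvature” of the manifold.

High Fisher information implies easier estimation of $\theta$ from data.

Given a parametric family, Fisher information induces a metric structure on the manifold.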
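
The two expressions for the Fisher information can be compared numerically on a one-parameter family. The sketch below uses the Poisson family as an assumed example: there $\log p(k; \lambda) = k \log\lambda - \lambda - \log k!$, so the score is $k/\lambda - 1$ and the second derivative is $-k/\lambda^2$. The function names and the truncation of the sum at `kmax` are choices made here.

```python
import math

def poisson_pmf(k, lam):
    # Computed in log space to avoid overflow for large k.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def fisher_poisson(lam, kmax=200):
    """Return (E[s^2], -E[d^2 log p / d lam^2]) for the Poisson family."""
    score_form = 0.0
    hessian_form = 0.0
    for k in range(kmax):
        pk = poisson_pmf(k, lam)
        score = k / lam - 1.0            # d log p / d lam
        score_form += pk * score**2
        hessian_form += pk * (k / lam**2)  # minus the second derivative
    return score_form, hessian_form

score_form, hessian_form = fisher_poisson(4.0)
print(score_form, hessian_form, 1 / 4.0)
```

Both forms agree with the closed form $1/\lambda$ for this family, up to truncation and rounding error.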

Example: Gaussian distributions

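Consider $\{\mathcal{N}(\mu, \sigma I) \mid \mu \in \mathbb{R}^{d-1}, \sigma \in \mathbb{R}^+\}$, with parameters $\theta = (\mu_1, \ldots, \mu_{d-1}, \sigma)$.

$\log p(x; \theta) = -(d-1) \log \sigma - \sum_{l=1}^{d-1} \frac{(x_l - \mu_l)^2}{2\sigma^2} + \text{const}$

$-\frac{\partial^2 \log p(x; \theta)}{\partial \theta_j\, \partial \theta_i} = \frac{1}{\sigma^2}\, \delta_{ij}, \quad i, j < d$

$E\left[-\frac{\partial^2 \log p(x; \theta)}{\partial \sigma^2}\right] = \frac{2(d-1)}{\sigma^2}$

After some rescaling,

$g_{ij} = \frac{1}{\sigma^2}\, \delta_{ij}$

which induces d-dimensional hyperbolic space.

Note: If $\sigma$ is held fixed at 1, we recover Euclidean geometry.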
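
For the smallest case, a one-dimensional Gaussian with parameters $(\mu, \sigma)$, the computation can be written out in full. The following is a worked sketch under the assumption that $\sigma$ is the standard deviation, so the exponent of the density is $-(x-\mu)^2 / (2\sigma^2)$; the substitution $\mu = \sqrt{2}\, u$ is one possible choice of rescaling.

```latex
\begin{align*}
\log p(x; \mu, \sigma) &= -\log \sigma - \frac{(x-\mu)^2}{2\sigma^2} + \text{const} \\
\frac{\partial^2 \log p}{\partial \mu^2} &= -\frac{1}{\sigma^2},
\qquad
\frac{\partial^2 \log p}{\partial \mu\, \partial \sigma} = -\frac{2(x-\mu)}{\sigma^3},
\qquad
\frac{\partial^2 \log p}{\partial \sigma^2} = \frac{1}{\sigma^2} - \frac{3(x-\mu)^2}{\sigma^4} \\
E\!\left[-\frac{\partial^2 \log p}{\partial \mu^2}\right] &= \frac{1}{\sigma^2},
\qquad
E\!\left[-\frac{\partial^2 \log p}{\partial \mu\, \partial \sigma}\right] = 0,
\qquad
E\!\left[-\frac{\partial^2 \log p}{\partial \sigma^2}\right] = -\frac{1}{\sigma^2} + \frac{3\sigma^2}{\sigma^4} = \frac{2}{\sigma^2} \\
\|ds\|^2 &= \frac{d\mu^2 + 2\, d\sigma^2}{\sigma^2}
\;\overset{\mu = \sqrt{2}\, u}{=}\;
2\, \frac{du^2 + d\sigma^2}{\sigma^2}
\end{align*}
```

Up to the constant factor 2, the last expression is the metric of the hyperbolic upper half-plane in the coordinates $(u, \sigma)$.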

Example: Multinomials

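Consider $\{p(x_1, \ldots, x_d; n, \theta_1, \ldots, \theta_d) \mid \sum_i \theta_i = 1\}$

$\log p(x; \theta) = \sum_i x_i \log \theta_i + \text{const}$

$\frac{\partial^2 \log p(x; \theta)}{\partial \theta_j\, \partial \theta_i} = -\frac{x_i}{\theta_i^2}\, \delta_{ij}$

A few steps later...

$\sum_{i,j} g_{ij}\, ds_i\, ds_j = \sum_i \frac{ds_i^2}{\theta_i}$

By a standard transformation from the simplex to the sphere, this yields the Euclidean inner product!

Geodesics between distributions are great circles on the sphere.

Hellinger distance is now chordal distance on the sphere.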
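
The transformation referred to here is usually taken to be the coordinatewise square root, which sends a distribution $\theta$ to a point on the unit sphere because $\sum_i (\sqrt{\theta_i})^2 = 1$. The sketch below, with function names and test distributions chosen for illustration, checks that the straight-line (chord) distance between the images equals the Hellinger distance and computes the great-circle angle between them.

```python
import numpy as np

def to_sphere(p):
    """Map a distribution to the unit sphere via coordinatewise square root."""
    return np.sqrt(np.asarray(p, dtype=float))

def hellinger(p, q):
    """Chord length between the two images on the sphere."""
    return float(np.linalg.norm(to_sphere(p) - to_sphere(q)))

def great_circle_angle(p, q):
    """Angle between the two images, i.e. arc length on the unit sphere."""
    c = np.clip(np.dot(to_sphere(p), to_sphere(q)), -1.0, 1.0)
    return float(np.arccos(c))

p = [3/8, 1/8, 1/4, 1/4]
q = [1/4, 1/4, 1/4, 1/4]
angle = great_circle_angle(p, q)

print(np.linalg.norm(to_sphere(p)))               # 1: the image lies on the unit sphere
print(hellinger(p, q), 2 * np.sin(angle / 2))     # chord length computed two ways
print(2 * angle)                                  # arc length scaled by 2
```

The factor 2 in the last line comes from the substitution $\theta_i = z_i^2$, under which $\sum_i d\theta_i^2 / \theta_i = 4 \sum_i dz_i^2$, so lengths in the metric above are twice the arc lengths on the unit sphere.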

From a metric to a distance

Metric tensor only gives infinitesimal distance ($\|ds\|^2$).

To find shortest paths, we need to minimize path length.

Structure            What you get
Manifold             Topology
Differentiability    Tangent space
Metric tensor        Infinitesimal length
Affine connection    Globally minimum paths

Metric tensor induces a “natural” connection.

In statistical manifolds, many connections can be defined (parametrized by α).

Different values of α yield different Bregman divergences, f-divergences, α-divergences.

A Rogues’ Gallery

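Kullback-Leibler Distance

$KL(p, q) = \sum_i p_i \log \frac{p_i}{q_i}$

The Jensen-Shannon Distance

$JS_{\alpha,\beta}(p, q) = \alpha\, KL(p, m) + \beta\, KL(q, m)$

where $m = \alpha p + \beta q$, $\alpha + \beta = 1$

$\chi^2$-Distance

$\chi^2(p, q) = \sum_i \frac{(p_i - q_i)^2}{q_i}$

$\Delta$-Distance

$\Delta(p, q) = \sum_i \frac{(p_i - q_i)^2}{p_i + q_i}$

Hellinger-Matsusita-Bhattacharyya Distance

$d_H(p, q) = \left[\sum_i \left(\sqrt{p_i} - \sqrt{q_i}\right)^2\right]^{1/2}$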
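
The five formulas translate directly into a few lines of NumPy. This is a minimal sketch: the function names are chosen here, the Jensen-Shannon weights default to $\alpha = \beta = 1/2$, and `q` is assumed to be strictly positive so that no division by zero occurs.

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0   # terms with p_i = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q, alpha=0.5):
    p, q = np.asarray(p, float), np.asarray(q, float)
    beta = 1.0 - alpha
    m = alpha * p + beta * q
    return alpha * kl(p, m) + beta * kl(q, m)

def chi2(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum((p - q) ** 2 / q))

def delta(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum((p - q) ** 2 / (p + q)))

def hellinger(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

p = [3/8, 1/8, 1/4, 1/4]
q = [1/4, 1/4, 1/4, 1/4]
for d in (kl, js, chi2, delta, hellinger):
    print(d.__name__, d(p, q), d(q, p))
```

Printing both argument orders shows which of these are symmetric: KL and $\chi^2$ are not, while the equal-weight Jensen-Shannon, $\Delta$, and Hellinger distances are.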

The Rogues’ Club

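Bregman divergence: For convex $\phi : \mathbb{R}^d \to \mathbb{R}$,

$D_\phi(p, q) = \phi(p) - \phi(q) - \langle \nabla\phi(q), p - q \rangle$

α-divergence: For $|\alpha| < 1$,

$D_\alpha(p, q) = \frac{4}{1 - \alpha^2} \left[1 - \int p^{(1-\alpha)/2}\, q^{(1+\alpha)/2}\right]$

f-divergence: For convex $f : \mathbb{R} \to \mathbb{R}$ with $f(1) = 0$,

$D_f(p, q) = \sum_i p_i\, f\!\left(\frac{q_i}{p_i}\right)$

$f(x) = -\log x$, $\phi(x) = x \log x$ and $\alpha \to -1$ all give $KL(p, q)$.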
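
The closing claim of this slide can be checked numerically on discrete distributions. In the sketch below the function names are chosen for illustration, $\phi$ is taken to be $\sum_i x_i \log x_i$ (the coordinatewise form of $x \log x$), the integral in the α-divergence becomes a sum, and the limit $\alpha \to -1$ is approximated by a value of α slightly above $-1$.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def bregman(phi, grad_phi, p, q):
    return float(phi(p) - phi(q) - np.dot(grad_phi(q), p - q))

def f_divergence(f, p, q):
    return float(np.sum(p * f(q / p)))

def alpha_divergence(alpha, p, q):
    s = np.sum(p ** ((1 - alpha) / 2) * q ** ((1 + alpha) / 2))
    return float(4.0 / (1.0 - alpha**2) * (1.0 - s))

def neg_entropy(x):
    return np.sum(x * np.log(x))      # phi(x) = sum_i x_i log x_i

def grad_neg_entropy(x):
    return np.log(x) + 1.0

def neg_log(t):
    return -np.log(t)                 # f(t) = -log t, with f(1) = 0

p = np.array([3/8, 1/8, 1/4, 1/4])
q = np.array([1/4, 1/4, 1/4, 1/4])

print(kl(p, q))
print(bregman(neg_entropy, grad_neg_entropy, p, q))
print(f_divergence(neg_log, p, q))
print(alpha_divergence(-1 + 1e-6, p, q))
```

The Bregman form matches KL here because both arguments sum to 1; for unnormalized vectors it would differ by $\sum_i q_i - \sum_i p_i$.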

Invariance Properties and Cencov’s theorem

Euclidean Invariants

[Figure: scaling, shearing, and rotation; annotated lengths 14.64 mm, 28.22 mm, 14.64 mm, 6.48 mm]

Markov Transformations

A is a column-stochastic matrix if its entries are nonnegative and

$\forall j, \quad \sum_i a_{ij} = 1$

If p is a distribution, then so is Ap.

Information cannot be increased: for any such A,

$d(Ap, Aq) \le d(p, q)$
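
The inequality can be observed on a small example. The matrix `A` and the distributions `p` and `q` below are made up for illustration, and the check uses two specific choices of d, the KL divergence and the Hellinger distance, both of which are known to satisfy this monotonicity.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def hellinger(p, q):
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Hypothetical column-stochastic matrix: every column is a distribution.
# It maps distributions on 3 outcomes to distributions on 2 outcomes.
A = np.array([[0.9, 0.2, 0.5],
              [0.1, 0.8, 0.5]])

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
Ap, Aq = A @ p, A @ q

print(A.sum(axis=0))                     # each column sums to 1
print(Ap, Ap.sum())                      # Ap is again a distribution
print(kl(Ap, Aq), kl(p, q))              # divergence after and before
print(hellinger(Ap, Aq), hellinger(p, q))
```

In both printed pairs the first number is no larger than the second, as the slide states.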

Sufficient Statistics

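Let $X \sim p(x; \theta)$, and let T be a transformation of X. T is a sufficient statistic if

$p(x; \theta, T(X)) = p(x; T(X))$

Example: Let $X \sim \mathcal{N}(\mu, \sigma)$, let $x_1, \ldots, x_n$ be samples, and

$T(X) = \left(\frac{1}{n} \sum_i x_i,\ \frac{1}{n-1} \sum_i \Big(x_i - \frac{1}{n} \sum_j x_j\Big)^2\right)$

Informally, a sufficient statistic captures all the information the sample carries about the parameters of the distribution.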
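
One way to see what sufficiency means in the Gaussian example is that two samples with the same value of T have the same likelihood for every choice of parameters. The sketch below constructs such a pair; the sample values, the function names, and the reading of $\sigma$ as a standard deviation are assumptions made for illustration.

```python
import numpy as np

def T(x):
    """Sample mean and unbiased sample variance."""
    x = np.asarray(x, float)
    return x.mean(), x.var(ddof=1)

def gaussian_loglik(x, mu, sigma):
    """Log-likelihood of i.i.d. samples under N(mu, sigma), sigma = std. dev."""
    x = np.asarray(x, float)
    n = len(x)
    return float(-n * np.log(sigma) - 0.5 * n * np.log(2 * np.pi)
                 - np.sum((x - mu) ** 2) / (2 * sigma**2))

a = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
z = np.array([3.0, -1.0, 0.5, 2.0, 8.0])
# Shift and rescale z so that it has the same mean and variance as a.
b = (z - z.mean()) / z.std(ddof=1) * a.std(ddof=1) + a.mean()

print(T(a), T(b))
for mu, sigma in [(0.0, 1.0), (5.0, 2.0), (-3.0, 0.7)]:
    print(gaussian_loglik(a, mu, sigma), gaussian_loglik(b, mu, sigma))
```

The samples `a` and `b` differ point by point, yet the likelihoods agree because $\sum_i (x_i - \mu)^2 = (n-1)\, s^2 + n\, (\bar{x} - \mu)^2$ depends on the data only through the mean $\bar{x}$ and the variance $s^2$.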

Cencov’s theorem

Theorem: The Fisher information is the unique (modulo scaling) metric tensor that remains invariant under Markov transformations that are sufficient.

Coming up

Information geometry

Algorithms for information distances

Spatially-aware information distances

References

Shun-Ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.

L. L. Campbell. An extended Cencov characterization of the information metric. Proc. Amer. Math. Soc., 98(1):135–141, 1986.

Guy Lebanon. Riemannian Geometry and Statistical Machine Learning. PhD thesis, CMU, 2005.

N. N. Cencov. Statistical Decision Rules and Optimal Inference. American Mathematical Society, 1982. Originally published in Russian, Nauka, 1972.