The Geometry of Distributions I

Classification of Distances

Suresh Venkatasubramanian, University of Utah

Histograms And Distributions

Finite (and fixed) domain: { the, apple, orange, and }

"Text" over that domain: "the apple and the orange and the orange"

(Normalized) "frequency counts" over the domain: { 3/8, 1/8, 1/4, 1/4 }

A distribution is a point on the d-simplex:

$\Delta_d = \{(x_1, x_2, \ldots, x_{d+1}) \mid x_i \ge 0,\ \sum_i x_i = 1\}$
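The step from text to a point on the simplex can be sketched in a few lines of Python. This is only an illustration: the function name `histogram` is hypothetical, words outside the fixed domain are assumed to be ignored, and exact fractions are used so the result can be compared with the counts on the slide.

```python
from collections import Counter
from fractions import Fraction

def histogram(text, domain):
    """Normalized frequency counts of `text` over a fixed, ordered `domain`."""
    counts = Counter(text.split())
    # Only words in the domain contribute (an assumption of this sketch).
    total = sum(counts[w] for w in domain)
    return [Fraction(counts[w], total) for w in domain]

domain = ["the", "apple", "orange", "and"]
text = "the apple and the orange and the orange"
p = histogram(text, domain)

# A point on the simplex has nonnegative coordinates that sum to 1.
on_simplex = all(x >= 0 for x in p) and sum(p) == 1
```

With four words in the domain, `p` has four coordinates and therefore lies on the 3-simplex.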

Comparing Distributions

Data Analysis ≡ Geometry

Problem: Find interesting patterns in a collection of data

Geometry Must Possess Right Properties

Assume that all points lie in Euclidean space

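except...

if $v \in \mathbb{R}^d$, then $c \cdot v \in \mathbb{R}^d$ for all $c \in \mathbb{R}$, but the result is no longer a valid vector of frequencies:

$-5 \cdot (0.1, 0.2, 0.5) = (-0.5, -1.0, -2.5)$

if $v, w \in \mathbb{R}^d$, then $v + w \in \mathbb{R}^d$, but again the result is not a valid vector of frequencies:

$(0.6, 0.1, 0.7) + (0.6, 0.2, 0.1) = (1.2, 0.3, 0.8)$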
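The same point can be checked mechanically: the simplex is not closed under the vector-space operations of Euclidean space. The sketch below uses a hypothetical helper `on_simplex` and two example distributions chosen here for illustration (they are not the vectors from the slide).

```python
def on_simplex(x, tol=1e-9):
    """True if x has nonnegative entries that sum to 1 (within tol)."""
    return all(xi >= -tol for xi in x) and abs(sum(x) - 1.0) <= tol

p = (0.2, 0.3, 0.5)
q = (0.6, 0.3, 0.1)

scaled = tuple(-5 * xi for xi in p)                     # negative entries
summed = tuple(a + b for a, b in zip(p, q))             # entries sum to 2
mixed = tuple(0.5 * a + 0.5 * b for a, b in zip(p, q))  # convex combination
```

Only the convex combination `mixed` stays on the simplex; scaling and plain addition both leave it.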

Distance Must Have Meaning

Instead of

"The distance between the two objects is 5.3"

we need

"The distance between the two objects is 5.3, and against the null hypothesis that they are the same, this has a p-value of 0.001"

Comparing Models Rather Than Data

View data as being generated by models, and compare models instead of data.

Lecture Plan

Information geometry

Algorithms for information distances

Spatially-aware information distances

Distributions are generated by parameters

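Gaussian: $x \sim \mathcal{N}(\mu, \sigma)$, $p(x; \mu, \sigma) \propto \exp\left(-\frac{\|x - \mu\|^2}{2\sigma^2}\right)$

Poisson: $k \sim \mathrm{Pois}(\lambda)$, $p(k; \lambda) = \frac{\lambda^k \exp(-\lambda)}{k!}$

Multinomial: $p(x_1, \ldots, x_k; n, \theta_1, \ldots, \theta_k) = \frac{n!}{x_1! \cdots x_k!}\, \theta_1^{x_1} \theta_2^{x_2} \cdots \theta_k^{x_k}$, with $\sum_i \theta_i = 1$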

Manifold of distributions

Space of parameters of a distribution forms a manifold.

Geodesics measure distance.

Riemannian Geometry

$L(\gamma) = \int \sqrt{\|ds\|^2}$

Length of tangent is set by an inner product:

$\|ds\|^2 = \sum_{i,j} g_{ij}\, ds_i\, ds_j$

$g_{ij}$ is the metric tensor.

For example, in Euclidean space, $g_{ij} = \delta_{ij}$, and $\|ds\|^2 = dx_1^2 + dx_2^2 + \cdots$
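A numerical sketch of the path-length integral may help make the role of the metric tensor concrete. The code approximates $L(\gamma)$ by summing $\sqrt{\sum_{i,j} g_{ij}\, ds_i\, ds_j}$ over short segments of a curve parametrized on $[0, 1]$. The names `path_length`, `euclidean_metric` and `half_plane_metric` are hypothetical, and the second metric, $g_{ij} = \delta_{ij} / y^2$ on the upper half-plane, is an assumption brought in purely as a non-Euclidean example.

```python
import math

def path_length(gamma, metric, steps=10000):
    """Approximate L(gamma) = integral of sqrt(sum_ij g_ij ds_i ds_j), t in [0, 1]."""
    total = 0.0
    prev = gamma(0.0)
    for k in range(1, steps + 1):
        cur = gamma(k / steps)
        ds = [b - a for a, b in zip(prev, cur)]
        mid = [(a + b) / 2.0 for a, b in zip(prev, cur)]
        g = metric(mid)  # metric tensor evaluated at the segment midpoint
        n = len(ds)
        sq = sum(g[i][j] * ds[i] * ds[j] for i in range(n) for j in range(n))
        total += math.sqrt(sq)
        prev = cur
    return total

def euclidean_metric(x):
    n = len(x)
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def half_plane_metric(x):
    # g_ij = delta_ij / y^2, where y is the last coordinate (assumed positive).
    y, n = x[-1], len(x)
    return [[1.0 / y ** 2 if i == j else 0.0 for j in range(n)] for i in range(n)]

def straight(t):
    return (3.0 * t, 4.0 * t)       # from (0, 0) to (3, 4)

def vertical(t):
    return (0.0, math.exp(t))       # from (0, 1) to (0, e)

euclid_len = path_length(straight, euclidean_metric)
hyper_len = path_length(vertical, half_plane_metric)
```

Under the Euclidean metric the straight segment has length 5, as expected. Under the half-plane metric the vertical segment from height 1 to height $e$ has length $\int dy / y = 1$, even though its Euclidean length is $e - 1$.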

Fisher Information

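Let $p(x; \theta)$ be a parametric family of distributions. Set $s = (s_1, s_2, \ldots, s_k)^\top$, $s_i = \frac{\partial \log p(x; \theta)}{\partial \theta_i}$.

$g_{ij} = E[s_i s_j] = \int p(x; \theta)\, \frac{\partial \log p(x; \theta)}{\partial \theta_i}\, \frac{\partial \log p(x; \theta)}{\partial \theta_j}\, dx = -\int p(x; \theta)\, \frac{\partial^2 \log p(x; \theta)}{\partial \theta_j\, \partial \theta_i}\, dx$

$G = \{g_{ij}\}$ is the Fisher Information.

Fisher information acts like a "curvature" of the manifold.

High Fisher information implies easier estimation of $\theta$ from data.

Given a parametric family, Fisher information induces metric structure on the manifold.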
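The two expressions for $g_{ij}$ (expected product of scores, and negative expected second derivative) can be compared numerically on a one-parameter family. The sketch below uses the Poisson family $p(k; \lambda) = \lambda^k e^{-\lambda} / k!$, for which the score is $k/\lambda - 1$ and the second derivative of the log-likelihood is $-k/\lambda^2$. The function names and the truncation point `kmax` are assumptions of this example.

```python
import math

def poisson_pmf(k, lam):
    # Computed in log space to avoid overflow in lam**k and k!.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def fisher_poisson(lam, kmax=200):
    """Return (E[s^2], -E[d^2 log p / d lam^2]), truncating the sum at kmax."""
    score_sq = 0.0
    neg_hess = 0.0
    for k in range(kmax + 1):
        p = poisson_pmf(k, lam)
        score = k / lam - 1.0             # d/d lam of log p(k; lam)
        score_sq += p * score ** 2
        neg_hess += p * (k / lam ** 2)    # minus the second derivative
    return score_sq, neg_hess

score_form, hessian_form = fisher_poisson(4.0)
```

Both forms agree with the closed form $1/\lambda$, which is 0.25 for $\lambda = 4$. Smaller $\lambda$ gives larger Fisher information, matching the remark that high Fisher information makes the parameter easier to estimate.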

Example: Gaussian distributions

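Consider $\{\mathcal{N}(\mu, \sigma I) \mid \mu \in \mathbb{R}^{d-1}, \sigma \in \mathbb{R}_+\}$, where $\sigma$ is the per-coordinate standard deviation.

$\log p(x; \theta) = -\sum_{l=1}^{d-1} \frac{(x_l - \mu_l)^2}{2\sigma^2} - (d-1) \log \sigma + \text{const}$

$-\frac{\partial^2 \log p(x; \theta)}{\partial \theta_j\, \partial \theta_i} = \frac{1}{\sigma^2}\, \delta_{ij}, \quad i, j < d$

$E\left[-\frac{\partial^2 \log p(x; \theta)}{\partial \sigma^2}\right] = \frac{2(d-1)}{\sigma^2}$

After some rescaling,

$g_{ij} = \frac{1}{\sigma^2}\, \delta_{ij}$

which induces d-dimensional hyperbolic space.

Note: If $\sigma$ is fixed at 1, we recover Euclidean geometry.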
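For the smallest case, $d - 1 = 1$, the Fisher matrix in the coordinates $(\mu, \sigma)$ can be estimated by numerical integration of $p \cdot s_i s_j$. The sketch below assumes $\sigma$ is the standard deviation, uses the analytic scores $s_\mu = (x - \mu)/\sigma^2$ and $s_\sigma = ((x - \mu)^2/\sigma^2 - 1)/\sigma$, and integrates with the trapezoid rule over a window of 12 standard deviations on each side. All function names and the integration settings are choices of this example.

```python
import math

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def score(x, mu, sigma):
    """Partial derivatives of log p with respect to (mu, sigma)."""
    z = (x - mu) / sigma
    return (z / sigma, (z * z - 1.0) / sigma)

def fisher_matrix(mu, sigma, steps=20000, width=12.0):
    """Trapezoid-rule estimate of g_ij = E[s_i s_j] for a 1-D Gaussian."""
    lo = mu - width * sigma
    h = 2.0 * width * sigma / steps
    G = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(steps + 1):
        x = lo + k * h
        w = 0.5 if k in (0, steps) else 1.0   # trapezoid end-point weights
        p = gaussian_pdf(x, mu, sigma)
        s = score(x, mu, sigma)
        for i in range(2):
            for j in range(2):
                G[i][j] += w * h * p * s[i] * s[j]
    return G

G = fisher_matrix(mu=1.0, sigma=2.0)
```

For $\sigma = 2$ the estimate is close to $\mathrm{diag}(1/\sigma^2, 2/\sigma^2) = \mathrm{diag}(0.25, 0.5)$, with vanishing off-diagonal entries. Rescaling $\mu$ by a factor of $\sqrt{2}$ makes both diagonal entries equal to $2/\sigma^2$, a constant multiple of $\delta_{ij}/\sigma^2$.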

Example: Multinomials

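Consider $\{p(x_1, \ldots, x_d; n, \theta_1, \ldots, \theta_d) \mid \sum_i \theta_i = 1\}$

$\log p(x; \theta) = \sum_i x_i \log \theta_i$ (up to a term that does not depend on $\theta$)

$\frac{\partial^2 \log p(x; \theta)}{\partial \theta_j\, \partial \theta_i} = -\frac{x_i}{\theta_i^2}\, \delta_{ij}$

A few steps later (using $E[x_i] = n\theta_i$ and dropping the constant factor $n$)...

$\sum_{i,j} g_{ij}\, ds_i\, ds_j = \sum_i \frac{(ds_i)^2}{\theta_i}$

By a standard transformation from the simplex to the sphere ($\theta_i \mapsto \sqrt{\theta_i}$), this yields the Euclidean inner product!

Geodesics between distributions are great circles on the sphere.

Hellinger distance is now chordal distance on the sphere.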
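The simplex-to-sphere picture can be checked numerically. Mapping $p$ to $(\sqrt{p_1}, \ldots, \sqrt{p_d})$ lands on the unit sphere because $\sum_i p_i = 1$; the angle between two such points is $\arccos \sum_i \sqrt{p_i q_i}$, and the chord between them has length $2 \sin(\text{angle}/2)$. The sketch below assumes the unnormalized form of the Hellinger distance, $[\sum_i (\sqrt{p_i} - \sqrt{q_i})^2]^{1/2}$, and the function names are hypothetical.

```python
import math

def to_sphere(p):
    """Map a distribution to a point on the unit sphere."""
    return [math.sqrt(pi) for pi in p]

def hellinger(p, q):
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def great_circle_angle(p, q):
    """Angle between the sphere images of p and q."""
    cos_angle = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.acos(min(1.0, cos_angle))   # clamp against rounding above 1

p = (3 / 8, 1 / 8, 1 / 4, 1 / 4)
q = (1 / 4, 1 / 4, 1 / 4, 1 / 4)

norm_sq = sum(y * y for y in to_sphere(p))
angle = great_circle_angle(p, q)
chord = 2.0 * math.sin(angle / 2.0)
```

Here `chord` agrees with `hellinger(p, q)`, while `angle` is the great-circle distance along the sphere, which matches the geodesic distance up to a constant scaling.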

From a metric to a distance

Metric tensor only gives infinitesimal distance (‖ds‖2).

To find shortest paths, we need to minimize path length

Structure            What you get
Manifold             Topology
Differentiability    Tangent space
Metric tensor        Infinitesimal length
Affine connection    Globally minimum paths

Metric tensor induces a "natural" connection.

In statistical manifolds, many connections can be defined (parametrized by α).

Different values of α yield different Bregman divergences, f-divergences, α-divergences.

A Rogues’ Gallery

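Kullback-Leibler Distance

$KL(p, q) = \sum_i p_i \log \frac{p_i}{q_i}$

The Jensen-Shannon Distance

$JS_{\alpha,\beta}(p, q) = \alpha\, KL(p, m) + \beta\, KL(q, m)$

where $m = \alpha p + \beta q$, $\alpha + \beta = 1$

χ²-Distance

$\chi^2(p, q) = \sum_i \frac{(p_i - q_i)^2}{q_i}$

∆-Distance

$\Delta(p, q) = \sum_i \frac{(p_i - q_i)^2}{p_i + q_i}$

Hellinger-Matsusita-Bhattacharya Distance

$d_H(p, q) = \left[\sum_i \left(\sqrt{p_i} - \sqrt{q_i}\right)^2\right]^{1/2}$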
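For discrete distributions these distances are a few lines each. The sketch below assumes the standard forms: $KL(p,q) = \sum_i p_i \log(p_i/q_i)$ with the convention $0 \log 0 = 0$, the Pearson form of χ² with $q_i$ in the denominator (so $q_i > 0$ is required), the ∆-distance with $p_i + q_i$ in the denominator, and the unnormalized Hellinger distance. Function names and the example vectors are choices made here.

```python
import math

def kl(p, q):
    total = 0.0
    for a, b in zip(p, q):
        if a > 0.0:                 # 0 * log 0 is taken to be 0
            if b == 0.0:
                return math.inf     # p puts mass where q has none
            total += a * math.log(a / b)
    return total

def jensen_shannon(p, q, alpha=0.5):
    beta = 1.0 - alpha
    m = [alpha * a + beta * b for a, b in zip(p, q)]
    return alpha * kl(p, m) + beta * kl(q, m)

def chi_squared(p, q):
    # Assumes every q_i > 0.
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

def triangular(p, q):
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0.0)

def hellinger(p, q):
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

p = (0.5, 0.5)
q = (0.25, 0.75)
```

For this pair, `kl(p, q)` equals $\tfrac{1}{2} \log \tfrac{4}{3}$, `chi_squared(p, q)` equals $1/3$, and `triangular(p, q)` equals $2/15$. Note that `kl` and `chi_squared` are not symmetric in their arguments, while the other three are (Jensen-Shannon only for $\alpha = \beta = 1/2$).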

The Rogues’ Club

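Bregman divergence: For convex $\phi : \mathbb{R}^d \to \mathbb{R}$,

$D_\phi(p, q) = \phi(p) - \phi(q) - \langle \nabla\phi(q), p - q \rangle$

α-divergence: For $|\alpha| < 1$,

$D_\alpha(p, q) = \frac{4}{1 - \alpha^2} \left[1 - \int p^{(1-\alpha)/2}\, q^{(1+\alpha)/2}\right]$

f-divergence: For convex $f : \mathbb{R} \to \mathbb{R}$, $f(1) = 0$,

$D_f(p, q) = \sum_i p_i\, f\!\left(\frac{q_i}{p_i}\right)$

$f(x) = -\log x$, $\phi(x) = \sum_i x_i \log x_i$ and $\alpha \to -1$ all give $KL(p, q)$.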
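The claim that all three families contain KL can be spot-checked on a pair of discrete distributions. The sketch below assumes strictly positive entries, replaces the integral in the α-divergence by a sum, and approximates the limit $\alpha \to -1$ by evaluating at $\alpha = -0.9999$. Function names and the example vectors are hypothetical.

```python
import math

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def bregman_neg_entropy(p, q):
    """Bregman divergence for phi(x) = sum_i x_i log x_i."""
    def phi(x):
        return sum(xi * math.log(xi) for xi in x)
    grad_q = [math.log(b) + 1.0 for b in q]          # gradient of phi at q
    inner = sum(g * (a - b) for g, a, b in zip(grad_q, p, q))
    return phi(p) - phi(q) - inner

def f_divergence(p, q, f):
    return sum(a * f(b / a) for a, b in zip(p, q))

def alpha_divergence(p, q, alpha):
    s = sum(a ** ((1.0 - alpha) / 2.0) * b ** ((1.0 + alpha) / 2.0)
            for a, b in zip(p, q))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)

p = (0.2, 0.3, 0.5)
q = (0.4, 0.4, 0.2)

via_bregman = bregman_neg_entropy(p, q)
via_f = f_divergence(p, q, lambda x: -math.log(x))
via_alpha = alpha_divergence(p, q, -0.9999)
```

The Bregman and f-divergence values match `kl(p, q)` to rounding error, because the linear terms cancel when both vectors sum to 1. The α-divergence value is only approximately equal, since it is evaluated near the limit rather than at it.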

Invariance Properties and Cencov’s theorem

Euclidean Invariants

[Figure: panels labeled scaling, shearing, and rotation, annotated with length measurements]

Markov Transformations

$A$ is a column-stochastic matrix if its entries are nonnegative and

$\forall j, \quad \sum_i a_{ij} = 1$

If $p$ is a distribution, then so is $Ap$.

Information cannot be increased: for any such $A$,

$d(Ap, Aq) \le d(p, q)$
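A small numerical instance illustrates both statements. The sketch below takes $d$ to be the KL divergence, which is one distance for which the inequality is known to hold; the matrix `A` and the vectors `p`, `q` are made up for the example, and the helper names are hypothetical.

```python
import math

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def apply(A, p):
    """Matrix-vector product A p, with A given as a list of rows."""
    return [sum(A[i][j] * p[j] for j in range(len(p))) for i in range(len(A))]

def is_column_stochastic(A, tol=1e-12):
    nonneg = all(a >= 0.0 for row in A for a in row)
    cols_ok = all(abs(sum(A[i][j] for i in range(len(A))) - 1.0) <= tol
                  for j in range(len(A[0])))
    return nonneg and cols_ok

A = [[0.9, 0.2, 0.0],
     [0.1, 0.5, 0.3],
     [0.0, 0.3, 0.7]]

p = (0.2, 0.3, 0.5)
q = (0.4, 0.4, 0.2)

Ap, Aq = apply(A, p), apply(A, q)
before, after = kl(p, q), kl(Ap, Aq)
```

`Ap` and `Aq` are again distributions, and `after` is no larger than `before`: passing both distributions through the same noisy channel can only make them harder to tell apart.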

Sufficient Statistics

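Let $X \sim p(x; \theta)$, and let $T$ be a transformation of $X$. $T$ is a sufficient statistic if the conditional distribution of $X$ given $T(X)$ does not depend on $\theta$:

$p(x \mid T(X); \theta) = p(x \mid T(X))$

Example: Let $X \sim \mathcal{N}(\mu, \sigma)$, let $x_1, \ldots, x_n$ be samples, and

$T(X) = \left(\frac{1}{n} \sum_i x_i,\ \frac{1}{n-1} \sum_i \Big(x_i - \frac{1}{n} \sum_j x_j\Big)^2\right)$.

Informally, a sufficient statistic captures all the information in the sample about the parameters of the distribution.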
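One way to see sufficiency in the Gaussian example: two different samples with the same sample mean and sample variance have the same likelihood under every $(\mu, \sigma)$, so the data enter only through $T$. The sketch below constructs two such samples by hand; the sample values and function names are assumptions of this illustration, and $\sigma$ is treated as the standard deviation.

```python
import math
import statistics

def sufficient_statistic(xs):
    # Sample mean and sample variance (n - 1 in the denominator).
    return (statistics.mean(xs), statistics.variance(xs))

def log_likelihood(xs, mu, sigma):
    n = len(xs)
    sq = sum((x - mu) ** 2 for x in xs)
    return (-n * math.log(sigma)
            - 0.5 * n * math.log(2.0 * math.pi)
            - sq / (2.0 * sigma ** 2))

r = math.sqrt(2.0)
sample_a = [0.0, 0.0, 2.0, 2.0]
sample_b = [1.0 - r, 1.0, 1.0, 1.0 + r]   # different data, same T

t_a = sufficient_statistic(sample_a)
t_b = sufficient_statistic(sample_b)

params = [(0.0, 1.0), (1.5, 0.7), (-2.0, 3.0)]
gaps = [abs(log_likelihood(sample_a, m, s) - log_likelihood(sample_b, m, s))
        for m, s in params]
```

Both samples have mean 1 and sample variance 4/3, and every entry of `gaps` is zero up to rounding.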

Cencov’s theorem

Theorem: The Fisher information is the unique (modulo scaling) metric tensor that remains invariant under Markov transformations that are sufficient.

Coming up

Information geometry

References I

Shun-Ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.

L. L. Campbell. An extended Cencov characterization of the information metric. Proc. Amer. Math. Soc., 98(1):135–141, 1986.

Guy Lebanon. Riemannian Geometry and Statistical Machine Learning. PhD thesis, CMU, 2005.

N. N. Cencov. Statistical Decision Rules and Optimal Inference. American Mathematical Society, 1982. Originally published in Russian, Nauka, 1972.
