Pattern learning and recognition on statistical manifolds: An information-geometric review

Frank Nielsen — Sony Computer Science Laboratories, Inc.
Keynote talk, SIMBAD 2013, York, UK (July 2013).

DESCRIPTION

Slides of the keynote talk at SIMBAD. See LNCS survey paper.

TRANSCRIPT

Page 1: Pattern learning and recognition on statistical manifolds: An information-geometric review

Pattern learning and recognition on statistical manifolds: An information-geometric review

Frank Nielsen, [email protected]

www.informationgeometry.org

Sony Computer Science Laboratories, Inc.

July 2013, SIMBAD, York, UK.

Page 2: Pattern learning and recognition on statistical manifolds: An information-geometric review

Praise for Computational Information Geometry

◮ What is information? = the essence of data (datum = “thing”); make it tangible → e.g., the parameters of generative models.

◮ Can we do intrinsic computing? (unbiased by any particular “data representation” → same results after recoding the data)

◮ Geometry?! → the science of invariance (mother of science; compass & ruler; Descartes’ analytic = coordinate/Cartesian geometry; imaginaries, ...). The open-ended poetic mathematics...

Page 3: Pattern learning and recognition on statistical manifolds: An information-geometric review

Rationale for Computational Information Geometry

◮ Information is ... never void! → lower bounds:
  ◮ Cramer-Rao lower bound and Fisher information (estimation)
  ◮ Bayes error and Chernoff information (classification)
  ◮ Coding and Shannon entropy (communication)
  ◮ Program and Kolmogorov complexity (compression) — unfortunately not computable!

◮ Geometry:
  ◮ Language (point, line, ball, dimension, orthogonal, projection, geodesic, immersion, etc.)
  ◮ Power of characterization (e.g., the intersection of two pseudo-segments not admitting a closed-form expression)

◮ Computing: information computing. Seeking mathematical convenience and mathematical tricks (e.g., kernels). Do you know the “space of functions”?!? (infinity and 1-to-1 mappings; language vs. continuum)

Page 4: Pattern learning and recognition on statistical manifolds: An information-geometric review

This talk: Information-geometric Pattern Recognition

◮ Focus on statistical pattern recognition ↔ geometric computing.

◮ Consider probability distribution families (parameter spaces) and statistical manifolds. Go beyond the traditional Riemannian and Hilbert-sphere representations (p → √p).

◮ Describe dually flat spaces induced by convex functions:
  ◮ Legendre transformation → dual and mixed coordinate systems
  ◮ dual similarities/divergences
  ◮ computing-friendly dual affine geodesics

Page 5: Pattern learning and recognition on statistical manifolds: An information-geometric review

SIMBAD 2013

Information-geometric Pattern Recognition

By departing from vector-space representations one is confronted with the challenging problem of dealing with (dis)similarities that do not necessarily possess the Euclidean behavior or do not even obey the requirements of a metric. The lack of the Euclidean and/or metric properties undermines the very foundations of traditional pattern recognition theories and algorithms, and poses totally new theoretical/computational questions and challenges.

Thank you to Prof. Edwin Hancock (U. York, UK) and Prof. Marcello Pelillo (U. Venice, IT).

Page 6: Pattern learning and recognition on statistical manifolds: An information-geometric review

Statistical Pattern Recognition

Models data with distributions (generative models) or stochastic processes:

◮ parametric (Gaussians, histograms) [model size ∼ D],

◮ semi-parametric (mixtures) [model size ∼ kD],

◮ non-parametric (kernel density estimators [model size ∼ n], Dirichlet/Gaussian processes [model size ∼ D log n]).

Data = pattern (→ information) + noise (independent).

Hot topic in statistical machine learning: deep learning (restricted Boltzmann machines).

Page 7: Pattern learning and recognition on statistical manifolds: An information-geometric review

Example I: Information-geometric PR (I)
Pattern = Gaussian mixture models (a universal class).
Statistical (dis)similarity/distance: total Bregman divergence (tBD, tKL).
Invariance: for p_i(x) ∼ N(µ_i, Σ_i) and the affine map y = A(x) = Lx + t, we have y_i ∼ N(Lµ_i + t, LΣ_iL^⊤) and D(X_1 : X_2) = D(Y_1 : Y_2) (L: any invertible linear transformation, t: a translation).

Shape Retrieval using Hierarchical Total Bregman Soft Clustering [11], IEEE PAMI, 2012.

Page 8: Pattern learning and recognition on statistical manifolds: An information-geometric review

Example II: Information-geometric PR (II)
DTI: diffusion ellipsoids, tensor interpolation.
Pattern = zero-centered Gaussians.
Statistical (dis)similarity/distance: total Bregman divergence (tBD, tKL).
Invariance: D(A^⊤PA : A^⊤QA) = D(P : Q) for A ∈ SL(d), i.e., a unit-determinant (volume- and orientation-preserving) matrix.

(3D rat corpus callosum)

Total Bregman Divergence and its Applications to DTI Analysis [34], IEEE TMI, 2011.

Page 9: Pattern learning and recognition on statistical manifolds: An information-geometric review

Statistical mixtures: Generative models of data sets

GMM = feature descriptor for information retrieval (IR) → classification [21], matching, etc.
Increase the dimension using color image patches.
Low-frequency information is encoded into a compact statistical model.

Generative model → statistical image by GMM sampling.

→ A mixture $\sum_{i=1}^k w_i N(\mu_i, \Sigma_i)$ is interpreted as a weighted point set in a parameter space: $\{(w_i, \theta_i = (\mu_i, \Sigma_i))\}_{i=1}^k$.

Page 10: Pattern learning and recognition on statistical manifolds: An information-geometric review

Statistical invariance
Riemannian structure (M, g) on the family {p(x; θ) | θ ∈ Θ ⊂ R^D}.

◮ θ-invariance under non-singular reparameterization λ:

$$\rho(p(x;\theta), p(x;\theta')) = \rho(p(x;\lambda(\theta)), p(x;\lambda(\theta')))$$

The normal parameterizations (µ, σ) and (µ, σ²) yield the same distance.

◮ x-invariance under a different x-representation:
Sufficient statistics (Fisher, 1922):

$$\Pr(X = x \mid T(X) = t, \theta) = \Pr(X = x \mid T(X) = t)$$

All the information about θ is contained in T.
→ Lossless information data reduction (exponential families).
Markov kernel = statistical morphism (Chentsov 1972, [6, 7]).
A particular Markov kernel is a deterministic mapping T: X → Y with y = T(x), p_Y = p_X T^{-1}.

Invariance holds if and only if g ∝ the Fisher information matrix.

Page 11: Pattern learning and recognition on statistical manifolds: An information-geometric review

f-divergences (1960’s)
A statistical non-metric distance between two probability measures:

$$I_f(p : q) = \int f\!\left(\frac{p(x)}{q(x)}\right) q(x)\, \mathrm{d}x$$

f: a continuous convex function with f(1) = 0, f′(1) = 0, f″(1) = 1.
→ asymmetric (not a metric, except for TV), modulo an affine term.
→ can always be symmetrized using s = f + f*, with f*(x) = x f(1/x).
Includes many well-known statistical measures: Kullback-Leibler, α-divergences, Hellinger, chi-squared, total variation (TV), etc.

f-divergences are the only statistical divergences that preserve equivalence w.r.t. sufficient statistic mappings:

$$I_f(p : q) \geq I_f(p_M : q_M)$$

with equality if and only if M = T (monotonicity property).
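A small numerical illustration (a sketch, not from the slides): evaluating I_f on a grid with the generator f(u) = u log u recovers the Kullback-Leibler divergence between two univariate Gaussian densities; the grid-based quadrature and the helper names are assumptions for illustration only.

```python
import numpy as np

def f_divergence(p, q, f, x):
    """Numerically evaluate I_f(p:q) = \\int f(p/q) q dx on a grid x."""
    return np.trapz(f(p(x) / q(x)) * q(x), x)

# Generator of the Kullback-Leibler divergence: f(u) = u log u
f_kl = lambda u: u * np.log(u)

def gaussian(mu, sigma):
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-15.0, 15.0, 60001)
p, q = gaussian(0.0, 1.0), gaussian(1.0, 2.0)
kl_numeric = f_divergence(p, q, f_kl, x)
# Closed form: log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
kl_closed = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5
print(kl_numeric, kl_closed)   # both ≈ 0.4431
```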

Page 12: Pattern learning and recognition on statistical manifolds: An information-geometric review

Information geometry: Dually flat spaces, α-geometry

Statistical invariance is also obtained using (M, g, ∇, ∇*), where ∇ and ∇* are dual affine connections.

The Riemannian structure (M, g) is the particular case ∇ = ∇* = ∇^(0), the Levi-Civita connection: (M, g) ≡ (M, g, ∇^(0), ∇^(0)).

Dually flat spaces (α = ±1) are algorithmically friendly:

◮ statistical mixtures of exponential families,

◮ learning & simplifying mixtures (k-MLE),

◮ Bregman Voronoi diagrams & dually ⊥ triangulations.

Goal: algorithmics of Gaussians/histograms w.r.t. the Kullback-Leibler divergence.

Page 13: Pattern learning and recognition on statistical manifolds: An information-geometric review

Exponential Family Mixture Models (EFMMs)
Generalize Gaussian & Rayleigh MMs to many common distributions.

$$m(x) = \sum_{i=1}^k w_i\, p_F(x; \lambda_i), \qquad \forall i\ w_i > 0,\ \sum_{i=1}^k w_i = 1$$

$$p_F(x; \theta) = e^{\langle t(x), \theta\rangle - F(\theta) + k(x)}$$

F: log-Laplace transform (log-partition, cumulant function):

$$F(\theta) = \log \int_{x \in \mathcal{X}} e^{\langle t(x), \theta\rangle + k(x)}\, \mathrm{d}x, \qquad
\theta \in \Theta = \left\{\theta \,\middle|\, \int_{x\in\mathcal{X}} e^{\langle t(x),\theta\rangle + k(x)}\,\mathrm{d}x < \infty\right\}$$

the natural parameter space.

◮ d: dimension of the support X.

◮ D: order of the family (= dim Θ). Sufficient statistic: t(x): R^d → R^D.

Page 14: Pattern learning and recognition on statistical manifolds: An information-geometric review

Statistical mixtures: Rayleigh MMs [33, 22]
IntraVascular UltraSound (IVUS) imaging.

Rayleigh distribution:

$$p(x; \lambda) = \frac{x}{\lambda^2}\, e^{-\frac{x^2}{2\lambda^2}}, \qquad x \in \mathbb{R}^+$$

d = 1 (univariate), D = 1 (order 1),
θ = −1/(2λ²), Θ = (−∞, 0),
F(θ) = −log(−2θ),
t(x) = x², k(x) = log x.
(Weibull with shape k = 2.)

Coronary plaques: fibrotic tissues, calcified tissues, lipidic tissues.
Rayleigh Mixture Models (RMMs): for segmentation and classification tasks.
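As a small sanity check (an illustrative sketch, not from the slides): plugging θ = −1/(2λ²), t(x) = x², k(x) = log x and F(θ) = −log(−2θ) into the canonical exponential-family form reproduces the standard Rayleigh density.

```python
import numpy as np

def rayleigh_pdf(x, lam):
    """Standard Rayleigh density p(x; lambda)."""
    return (x / lam**2) * np.exp(-x**2 / (2 * lam**2))

def rayleigh_ef_pdf(x, lam):
    """Same density in canonical exponential-family form
    p_F(x; theta) = exp(<t(x), theta> - F(theta) + k(x))."""
    theta = -1.0 / (2 * lam**2)      # natural parameter
    F = -np.log(-2 * theta)          # log-normalizer
    t, k = x**2, np.log(x)           # sufficient statistic and carrier term
    return np.exp(t * theta - F + k)

x = np.linspace(0.1, 5.0, 50)
assert np.allclose(rayleigh_pdf(x, 1.3), rayleigh_ef_pdf(x, 1.3))
```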

Page 15: Pattern learning and recognition on statistical manifolds: An information-geometric review

Statistical mixtures: Gaussian MMs [8, 22, 9]
Gaussian mixture models (GMMs): model low-frequency content.
A color image is interpreted as a 5D xyRGB point set.

Gaussian density:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\sqrt{|\Sigma|}}\, e^{-\frac{1}{2} D_{\Sigma^{-1}}(x-\mu,\, x-\mu)}$$

with the squared Mahalanobis distance $D_Q(x, y) = (x - y)^\top Q (x - y)$.

x ∈ R^d (multivariate), D = d(d+3)/2 (order),
θ = (Σ⁻¹µ, ½Σ⁻¹) = (θ_v, θ_M), Θ = R^d × S^d_{++},
F(θ) = ¼ θ_v^⊤ θ_M^{-1} θ_v − ½ log|θ_M| + (d/2) log π,
t(x) = (x, −x x^⊤), k(x) = 0.

Page 16: Pattern learning and recognition on statistical manifolds: An information-geometric review

Sampling from a Gaussian Mixture Model
To sample a variate x from a GMM:

◮ choose a component l according to the weight distribution w_1, ..., w_k,

◮ draw a variate x according to N(µ_l, Σ_l).

→ Sampling is a doubly stochastic process:

◮ throw a biased die with k faces to choose the component:
l ∼ Multinomial(w_1, ..., w_k)
(the multinomial is also an EF: a normalized histogram);

◮ then draw at random a variate x from the l-th component:
x ∼ Normal(µ_l, Σ_l),
x = µ_l + Cz with the Cholesky factorization Σ_l = CC^⊤ and z = [z_1 ... z_d]^⊤ a standard normal random vector: z_i = √(−2 log U_1) cos(2πU_2) (Box-Muller).
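A minimal sketch of this doubly stochastic sampler (illustrative only; it uses NumPy’s Cholesky factorization and standard-normal generator instead of hand-rolled Box-Muller):

```python
import numpy as np

def sample_gmm(weights, mus, sigmas, n, rng=np.random.default_rng(0)):
    """Draw n variates from a GMM: pick a component, then draw from its Gaussian."""
    k, d = len(weights), len(mus[0])
    chols = [np.linalg.cholesky(S) for S in sigmas]   # Sigma_l = C C^T
    labels = rng.choice(k, size=n, p=weights)         # biased die with k faces
    z = rng.standard_normal((n, d))                   # standard normal vectors
    return np.array([mus[l] + chols[l] @ z[i] for i, l in enumerate(labels)])

# Example: a 2-component bivariate GMM
w = [0.7, 0.3]
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
x = sample_gmm(w, mus, sigmas, 1000)
print(x.mean(axis=0))   # ≈ 0.7*mu_1 + 0.3*mu_2 = [1.2, 1.2]
```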

Page 17: Pattern learning and recognition on statistical manifolds: An information-geometric review

Relative entropy for exponential families

◮ Distance between features (e.g., GMMs).

◮ Kullback-Leibler divergence (cross-entropy minus entropy):

$$\mathrm{KL}(P : Q) = \int p(x)\log\frac{p(x)}{q(x)}\,\mathrm{d}x \;\geq\; 0
= \underbrace{\int p(x)\log\frac{1}{q(x)}\,\mathrm{d}x}_{H^{\times}(P:Q)}
- \underbrace{\int p(x)\log\frac{1}{p(x)}\,\mathrm{d}x}_{H(p)=H^{\times}(P:P)}$$

$$= F(\theta_Q) - F(\theta_P) - \langle \theta_Q - \theta_P, \nabla F(\theta_P)\rangle = B_F(\theta_Q : \theta_P)$$

The Bregman divergence B_F is defined for a strictly convex and differentiable function, up to affine terms.

◮ The proof of KL(P : Q) = B_F(θ_Q : θ_P) follows from

$$X \sim EF(\theta) \implies E[t(X)] = \nabla F(\theta)$$
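A quick numerical check of this identity for univariate Gaussians (an illustrative sketch using the natural parameterization θ = (µ/σ², 1/(2σ²)) and the log-normalizer from the Gaussian slide):

```python
import numpy as np

def nat_params(mu, sigma2):
    """Natural parameters of a univariate Gaussian: theta = (mu/sigma^2, 1/(2 sigma^2))."""
    return np.array([mu / sigma2, 1.0 / (2.0 * sigma2)])

def F(theta):
    """Log-normalizer F(theta) = theta1^2/(4 theta2) - (1/2) log theta2 + (1/2) log pi."""
    t1, t2 = theta
    return t1**2 / (4 * t2) - 0.5 * np.log(t2) + 0.5 * np.log(np.pi)

def grad_F(theta):
    """Moment parameter eta = grad F(theta) = E[t(X)] = (mu, -(mu^2 + sigma^2))."""
    t1, t2 = theta
    return np.array([t1 / (2 * t2), -t1**2 / (4 * t2**2) - 1.0 / (2 * t2)])

def bregman(F, grad_F, tq, tp):
    """B_F(theta_Q : theta_P)."""
    return F(tq) - F(tp) - (tq - tp) @ grad_F(tp)

mu_p, s2_p, mu_q, s2_q = 0.0, 1.0, 1.0, 4.0
kl_closed = 0.5 * (np.log(s2_q / s2_p) + (s2_p + (mu_p - mu_q) ** 2) / s2_q - 1.0)
kl_bregman = bregman(F, grad_F, nat_params(mu_q, s2_q), nat_params(mu_p, s2_p))
print(kl_closed, kl_bregman)   # both ≈ 0.4431
```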

Page 18: Pattern learning and recognition on statistical manifolds: An information-geometric review

Convex duality: Legendre transformation

◮ For a strictly convex and differentiable function F: X → R:

$$F^*(y) = \sup_{x \in \mathcal{X}} \{\underbrace{\langle y, x\rangle - F(x)}_{l_F(y;\,x)}\}$$

◮ The maximum is attained for y = ∇F(x):

$$\nabla_x l_F(y; x) = y - \nabla F(x) = 0 \implies y = \nabla F(x)$$

◮ The maximum is unique by convexity of F (∇²F ≻ 0):

$$\nabla_x^2 l_F(y; x) = -\nabla^2 F(x) \prec 0$$

◮ Convex conjugates:

$$(F, \mathcal{X}) \Leftrightarrow (F^*, \mathcal{Y}), \qquad \mathcal{Y} = \{\nabla F(x) \mid x \in \mathcal{X}\}$$
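When ∇F has no closed-form inverse, F*(y) can be approximated by solving the scalar concave maximization above numerically. A minimal sketch (illustrative, assuming SciPy is available), for F(x) = x log x − x whose exact conjugate is F*(y) = e^y:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def legendre_conjugate(F, y, bounds=(1e-9, 1e6)):
    """Numerically evaluate F*(y) = sup_x { y*x - F(x) } for a scalar convex F."""
    res = minimize_scalar(lambda x: F(x) - y * x, bounds=bounds, method="bounded")
    return -res.fun

F = lambda x: x * np.log(x) - x   # extended negative entropy, dom F = (0, inf)
for y in [-1.0, 0.0, 0.5, 2.0]:
    print(y, legendre_conjugate(F, y), np.exp(y))   # F*(y) = exp(y)
```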

Page 19: Pattern learning and recognition on statistical manifolds: An information-geometric review

Legendre duality: Geometric interpretation
Consider the epigraph of F as a convex object:

◮ convex hull (V-representation), versus

◮ intersection of half-spaces (H-representation).

[Figure: the tangent hyperplane H_P : z = (x − x_P)F′(x_P) + F(x_P) supports the graph (x, F(x)) at P = (x_P, z_P = F(x_P)); it crosses the z-axis at (0, F(x_P) − x_P F′(x_P)) = (0, −F*(y_P)). Dual coordinate systems: P is described either by x_P or by the slope y_P = F′(x_P) of H_P.]

The Legendre transform is also called the “slope” transform.

Page 20: Pattern learning and recognition on statistical manifolds: An information-geometric review

Legendre duality & Canonical divergence

◮ Convex conjugates have functionally inverse gradients: ∇F⁻¹ = ∇F*.
∇F* may require numerical approximation (not always available in analytical closed form).

◮ Involution: (F*)* = F, with ∇F* = (∇F)⁻¹.

◮ The convex conjugate F* expressed using (∇F)⁻¹:

$$F^*(y) = \langle x, y\rangle - F(x), \quad x = \nabla_y F^*(y)
= \langle (\nabla F)^{-1}(y), y\rangle - F\big((\nabla F)^{-1}(y)\big)$$

◮ The Fenchel-Young inequality is at the heart of the canonical divergence:

$$F(x) + F^*(y) \geq \langle x, y\rangle$$

$$A_F(x : y) = A_{F^*}(y : x) = F(x) + F^*(y) - \langle x, y\rangle \geq 0$$

Page 21: Pattern learning and recognition on statistical manifolds: An information-geometric review

Dual Bregman divergences & canonical divergence [26]

$$\mathrm{KL}(P : Q) = E_P\!\left[\log\frac{p(x)}{q(x)}\right] \geq 0
= B_F(\theta_Q : \theta_P) = B_{F^*}(\eta_P : \eta_Q)
= F(\theta_Q) + F^*(\eta_P) - \langle \theta_Q, \eta_P\rangle
= A_F(\theta_Q : \eta_P) = A_{F^*}(\eta_P : \theta_Q)$$

with θ_Q the natural parameterization and η_P = E_P[t(X)] = ∇F(θ_P) the moment parameterization.

$$\mathrm{KL}(P : Q) = \underbrace{\int p(x)\log\frac{1}{q(x)}\,\mathrm{d}x}_{H^{\times}(P:Q)}
- \underbrace{\int p(x)\log\frac{1}{p(x)}\,\mathrm{d}x}_{H(p)=H^{\times}(P:P)}$$

Shannon cross-entropy and entropy of an EF [26]:

$$H^{\times}(P : Q) = F(\theta_Q) - \langle \theta_Q, \nabla F(\theta_P)\rangle - E_P[k(x)]$$

$$H(P) = F(\theta_P) - \langle \theta_P, \nabla F(\theta_P)\rangle - E_P[k(x)] = -F^*(\eta_P) - E_P[k(x)]$$

Page 22: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman divergence: Geometric interpretation (I)

Potential function F, graph plot F̂: (x, F(x)).

$$D_F(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q)\rangle$$

Page 23: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman divergence: Geometric interpretation (III)
Bregman divergences and path integrals:

$$B_F(\theta_1 : \theta_2) = F(\theta_1) - F(\theta_2) - \langle \theta_1 - \theta_2, \nabla F(\theta_2)\rangle \qquad (1)$$

$$= \int_{\theta_2}^{\theta_1} \langle \nabla F(t) - \nabla F(\theta_2), \mathrm{d}t\rangle \qquad (2)$$

$$= \int_{\eta_1}^{\eta_2} \langle \nabla F^*(t) - \nabla F^*(\eta_1), \mathrm{d}t\rangle \qquad (3)$$

$$= B_{F^*}(\eta_2 : \eta_1) \qquad (4)$$

[Figure: the same area measured between θ_2 and θ_1 under the curve η = ∇F(θ) in the θ-coordinates, and between η_1 and η_2 in the η-coordinates.]

Page 24: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman divergence: Geometric interpretation (II)

Potential function f, graph plot F̂: (x, f(x)).

$$B_f(p\|q) = f(p) - f(q) - (p - q) f'(q)$$

[Figure: graph of f over X with points p, q and the tangent line H_q at the lifted point q̂.]

B_f(·‖q): vertical distance between the hyperplane H_q tangent to F̂ at the lifted point q̂ and the translated hyperplane at p.

Page 25: Pattern learning and recognition on statistical manifolds: An information-geometric review

total Bregman divergence (tBD)

By analogy with least squares vs. total least squares:
total Bregman divergence (tBD) [11, 34, 12]

$$\delta_f(x, y) = \frac{b_f(x, y)}{\sqrt{1 + \|\nabla f(y)\|^2}}$$

Statistical robustness of tBD has been proved.

Page 26: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman sided centroids [25, 21]
Bregman centroids = unique minimizers of average Bregman divergences (B_F is convex in its right argument):

$$\bar\theta = \arg\min_\theta \frac{1}{n}\sum_{i=1}^n B_F(\theta_i : \theta), \qquad
\bar\theta' = \arg\min_\theta \frac{1}{n}\sum_{i=1}^n B_F(\theta : \theta_i)$$

$$\bar\theta = \frac{1}{n}\sum_{i=1}^n \theta_i \quad \text{(center of mass, independent of } F\text{)}$$

$$\bar\theta' = (\nabla F)^{-1}\!\left(\frac{1}{n}\sum_{i=1}^n \nabla F(\theta_i)\right)$$

→ Generalized Kolmogorov-Nagumo f-means.

Page 27: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman divergences BF and ∇F -means

Bijection: quasi-arithmetic means (∇F) ⇔ Bregman divergences B_F. For each generator F (entropy/loss function), f = F′ and f⁻¹ = (F′)⁻¹ define the corresponding f-mean (generalized mean):

◮ Squared Euclidean distance (half squared loss): F = ½x², f = x, f⁻¹ = x → arithmetic mean (1/n) Σⱼ xⱼ

◮ Kullback-Leibler divergence (extended negative Shannon entropy): F = x log x − x, f = log x, f⁻¹ = exp x → geometric mean (Πⱼ xⱼ)^(1/n)

◮ Itakura-Saito divergence (Burg entropy): F = −log x, f = −1/x, f⁻¹ = −1/x → harmonic mean n / Σⱼ (1/xⱼ)

∇F is strictly increasing (like a cumulative distribution function).
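A small illustration of this bijection (a sketch, not the authors’ code): the right-sided Bregman centroid is the ∇F-mean (quasi-arithmetic mean) of the points, which reproduces the arithmetic, geometric, and harmonic means of the three classical generators above.

```python
import numpy as np

def f_mean(points, f, f_inv, weights=None):
    """Quasi-arithmetic (Kolmogorov-Nagumo) mean: f_inv( sum_i w_i f(x_i) )."""
    x = np.asarray(points, dtype=float)
    w = np.full(len(x), 1.0 / len(x)) if weights is None else np.asarray(weights)
    return f_inv(np.sum(w * f(x)))

x = np.array([1.0, 2.0, 4.0])

arithmetic = f_mean(x, lambda t: t, lambda t: t)                 # F = x^2/2
geometric  = f_mean(x, np.log, np.exp)                           # F = x log x - x
harmonic   = f_mean(x, lambda t: -1.0 / t, lambda t: -1.0 / t)   # F = -log x

print(arithmetic, geometric, harmonic)   # 2.333..., 2.0, 1.714...
```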

Page 28: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman sided centroids [25]

The two sided centroids C and C′ expressed in the two θ/η coordinate systems (= 4 equations):

C: (θ̄, η̄′) with

$$\bar\theta = \frac{1}{n}\sum_{i=1}^n \theta_i, \qquad \bar\eta' = \nabla F(\bar\theta)$$

C′: (θ̄′, η̄) with

$$\bar\eta = \frac{1}{n}\sum_{i=1}^n \eta_i, \qquad \bar\theta' = \nabla F^*(\bar\eta)$$

Page 29: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman centroids and Jacobian fields

The centroid θ̄ zeroes the left-sided Jacobian vector field

$$\nabla_\theta \left(\sum_t w_t A(\theta : \eta_t)\right),$$

i.e., the sum of the tangent vectors of the geodesics from the centroid to the points vanishes:

$$\nabla_\theta A(\theta : \eta') = \eta - \eta', \qquad
\nabla_\theta\!\left(\sum_t w_t A(\theta : \eta_t)\right) = \eta - \bar\eta$$

since Σₜ wₜ = 1, with η̄ = Σₜ wₜ ηₜ and η = ∇F(θ); the field vanishes at ∇F(θ̄) = η̄.

Page 30: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman information [25]
Bregman information = minimum of the loss function:

$$I_F(\mathcal{P}) = \frac{1}{n}\sum_{i=1}^n B_F(\theta_i : \bar\theta)
= \frac{1}{n}\sum_{i=1}^n \Big[F(\theta_i) - F(\bar\theta) - \langle \theta_i - \bar\theta, \nabla F(\bar\theta)\rangle\Big]$$

$$= \frac{1}{n}\sum_{i=1}^n F(\theta_i) - F(\bar\theta)
- \Big\langle \underbrace{\frac{1}{n}\sum_{i=1}^n \theta_i - \bar\theta}_{=0}, \nabla F(\bar\theta)\Big\rangle
= J_F(\theta_1, ..., \theta_n)$$

the Jensen diversity index (e.g., Jensen-Shannon for F(x) = x log x).

◮ For the squared Euclidean distance, the Bregman information is the cluster variance.

◮ For the Kullback-Leibler divergence, the Bregman information is related to mutual information.

Page 31: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman k-means clustering [4]

Bregman k-means: find the k centers C = {C_1, ..., C_k} that minimize the loss function

$$L_F(\mathcal{P} : \mathcal{C}) = \sum_{P\in\mathcal{P}} B_F(P : \mathcal{C}), \qquad
B_F(P : \mathcal{C}) = \min_{i\in\{1,...,k\}} B_F(P : C_i)$$

→ generalizes Lloyd’s quadratic error in Vector Quantization (VQ).

$$L_F(\mathcal{P} : \mathcal{C}) = I_F(\mathcal{P}) - I_F(\mathcal{C})$$

I_F(P) → total Bregman information,
I_F(C) → between-cluster Bregman information,
L_F(P : C) → within-cluster Bregman information:

total Bregman information = within-cluster Bregman information + between-cluster Bregman information.

Page 32: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman k-means clustering [4]

$$I_F(\mathcal{P}) = L_F(\mathcal{P} : \mathcal{C}) + I_F(\mathcal{C})$$

Bregman clustering amounts to finding the partition C* that minimizes the information loss:

$$L_F^* = L_F(\mathcal{P} : \mathcal{C}^*) = \min_{\mathcal{C}} \big(I_F(\mathcal{P}) - I_F(\mathcal{C})\big)$$

Bregman k-means:

◮ Initialize distinct seeds: C_1 = P_1, ..., C_k = P_k.

◮ Repeat until convergence:
  ◮ Assign each point P to its closest centroid: C_i = {P ∈ P | B_F(P : C_i) ≤ B_F(P : C_j) ∀ j ≠ i}.
  ◮ Update the cluster centroids by taking their centers of mass: C_i = (1/|C_i|) Σ_{P∈C_i} P.

The loss function decreases monotonically and converges to a local optimum. (Extends to weighted point sets using barycenters.)
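A compact sketch of this Lloyd-type iteration for a generic Bregman generator (illustrative only): right-sided assignments B_F(P : C_i) with center-of-mass updates, shown with the Kullback-Leibler generator F(x) = Σ x log x − x on positive vectors.

```python
import numpy as np

def bregman_div(F, gradF, x, y):
    """B_F(x : y) = F(x) - F(y) - <x - y, grad F(y)>."""
    return F(x) - F(y) - (x - y) @ gradF(y)

def bregman_kmeans(X, k, F, gradF, iters=50, rng=np.random.default_rng(0)):
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest center w.r.t. B_F(x : c)
        D = np.array([[bregman_div(F, gradF, x, c) for c in centers] for x in X])
        labels = D.argmin(axis=1)
        # Update step: center of mass of each cluster (right-sided centroid)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

# KL generator (Shannon negentropy) on positive vectors, e.g. unnormalized histograms
F = lambda x: np.sum(x * np.log(x) - x)
gradF = lambda x: np.log(x)

X = np.abs(np.random.default_rng(1).normal(1.0, 0.3, size=(200, 5))) + 1e-3
centers, labels = bregman_kmeans(X, k=3, F=F, gradF=gradF)
```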

Page 33: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman k-means++ [1]: Careful seeding (only?!)

(Also called Bregman k-medians since it minimizes Σᵢ B_F(pᵢ : x).)

Extends the D²-initialization of k-means++.

The seeding stage alone yields a probabilistically guaranteed global approximation factor.

Bregman k-means++:

◮ Choose C = {C_l} for l uniformly at random in {1, ..., n}.

◮ While |C| < k:
  ◮ choose P ∈ P with probability B_F(P : C) / Σᵢ B_F(Pᵢ : C) = B_F(P : C) / L_F(P : C).

→ Yields an O(log k) approximation factor (with high probability). The constant in O(·) depends on the ratio of min/max of ∇²F.
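A sketch of the seeding stage alone (illustrative; the helper names are assumptions): each new seed is sampled with probability proportional to its Bregman divergence to the closest already-chosen seed.

```python
import numpy as np

def bregman_div(F, gradF, x, y):
    """B_F(x : y) = F(x) - F(y) - <x - y, grad F(y)>."""
    return F(x) - F(y) - (x - y) @ gradF(y)

def bregman_kmeanspp_seeds(X, k, F, gradF, rng=np.random.default_rng(0)):
    """k-means++-style seeding with D^2 sampling replaced by B_F(x : closest seed)."""
    seeds = [X[rng.integers(len(X))]]                       # first seed: uniform
    while len(seeds) < k:
        d = np.array([min(bregman_div(F, gradF, x, s) for s in seeds) for x in X])
        seeds.append(X[rng.choice(len(X), p=d / d.sum())])  # biased by divergence
    return np.array(seeds)
```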

Page 34: Pattern learning and recognition on statistical manifolds: An information-geometric review

Anisotropic Voronoi diagram (for MVN MMs) [10, 14]
Learning mixtures: talk of O. Schwander on EM, k-MLE.
From the source color image (a), we build a 5D GMM with k = 32 components, and color each pixel with the mean color of the anisotropic Voronoi cell it belongs to (∼ a weighted squared Mahalanobis distance per center).

$$\log p_F(x; \theta_i) \propto -B_{F^*}(t(x) : \eta_i) + k(x)$$

Most-likely-component segmentation ≡ Bregman Voronoi diagram (squared Mahalanobis distance for Gaussians).

Page 35: Pattern learning and recognition on statistical manifolds: An information-geometric review

Voronoi diagrams in dually flat spaces...

Voronoi diagram, dual ⊥ Delaunay triangulation (general position)


Page 36: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman dual bisectors: Hyperplanes & hypersurfaces [5, 24, 27]

Right-sided bisector → hyperplane (θ-hyperplane):

$$H_F(p, q) = \{x \in \mathcal{X} \mid B_F(x : p) = B_F(x : q)\}$$

$$H_F:\ \langle \nabla F(p) - \nabla F(q), x\rangle + \big(F(p) - F(q) + \langle q, \nabla F(q)\rangle - \langle p, \nabla F(p)\rangle\big) = 0$$

Left-sided bisector → hypersurface (η-hyperplane):

$$H'_F(p, q) = \{x \in \mathcal{X} \mid B_F(p : x) = B_F(q : x)\}$$

$$H'_F:\ \langle \nabla F(x), q - p\rangle + F(p) - F(q) = 0$$

Page 37: Pattern learning and recognition on statistical manifolds: An information-geometric review

Visualizing Bregman bisectors

Primal coordinates θ (natural parameters) vs. dual coordinates η (expectation parameters).

[Figure: sided bisectors of two points p, q drawn in the source space (logistic loss) and in the gradient space (Bernoulli), and in the Itakura-Saito source and dual spaces; the sided divergence values swap under duality, D*(p′, q′) = D(q, p) and D*(q′, p′) = D(p, q).]

Page 38: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman Voronoi diagrams as minimization diagrams [5]
A subclass of affine diagrams which have all non-empty cells.
Minimization diagram of the n functions D_i(x) = B_F(x : p_i) = F(x) − F(p_i) − ⟨x − p_i, ∇F(p_i)⟩
≡ minimization of the n linear functions H_i(x) = ⟨p_i − x, ∇F(p_i)⟩ − F(p_i).

Page 39: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bregman dual Delaunay triangulations

[Figure: ordinary Delaunay, exponential, and Hellinger-like dual Delaunay triangulations of the same point set.]

◮ empty Bregman sphere property,

◮ geodesic triangles.

BVDs extend Euclidean Voronoi diagrams with similar complexity/algorithms.

Page 40: Pattern learning and recognition on statistical manifolds: An information-geometric review

Non-commutative Bregman orthogonality
3-point property (generalized law of cosines):

$$B_F(p : r) = B_F(p : q) + B_F(q : r) - \langle p - q, \nabla F(r) - \nabla F(q)\rangle$$

[Figure: points P, Q, R with the divergences D_F(P‖R), D_F(P‖Q), D_F(Q‖R), the dual geodesic Γ*_{PQ} and the geodesic Γ_{QR}.]

(pq)_θ is Bregman-orthogonal to (qr)_η iff

$$B_F(p : r) = B_F(p : q) + B_F(q : r)$$

(equivalent to ⟨θ_p − θ_q, η_r − η_q⟩ = 0). This extends Pythagoras’ theorem:

$$(pq)_\theta \perp_F (qr)_\eta$$

→ ⊥_F is not commutative... except in the squared Euclidean/Mahalanobis case, F(x) = ½⟨x, x⟩.

Page 41: Pattern learning and recognition on statistical manifolds: An information-geometric review

Dually orthogonal Bregman Voronoi & triangulations
The ordinary Voronoi diagram is perpendicular to the Delaunay triangulation.
Dual line-segment geodesics:

$$(pq)_\theta = \{\theta = \lambda\theta_p + (1-\lambda)\theta_q \mid \lambda \in [0, 1]\}$$

$$(pq)_\eta = \{\eta = \lambda\eta_p + (1-\lambda)\eta_q \mid \lambda \in [0, 1]\}$$

Bisectors:

$$B_\theta(p, q):\ \langle x, \theta_q - \theta_p\rangle + F(\theta_p) - F(\theta_q) = 0$$

$$B_\eta(p, q):\ \langle x, \eta_q - \eta_p\rangle + F^*(\eta_p) - F^*(\eta_q) = 0$$

Dual orthogonality:

$$B_\eta(p, q) \perp (pq)_\eta, \qquad (pq)_\theta \perp B_\theta(p, q)$$

Page 42: Pattern learning and recognition on statistical manifolds: An information-geometric review

Dually orthogonal Bregman Voronoi & Triangulations

Bη(p, q) ⊥ (pq)η

(pq)θ ⊥ Bθ(p, q)


Page 43: Pattern learning and recognition on statistical manifolds: An information-geometric review

Simplifying a mixture: Kullback-Leibler projection theorem
Consider an exponential family mixture model p̃ = Σᵢ wᵢ p_F(x; θᵢ).

The right-sided KL barycenter p* of the components is interpreted as the projection of the mixture model p̃ ∈ P onto the exponential family manifold E_F [32]:

$$p^* = \arg\min_{p \in E_F} \mathrm{KL}(\tilde p : p)$$

[Figure: in the manifold P of probability distributions, the m-geodesic from the mixture (sub-manifold) point p̃ meets the exponential family manifold E_F at the projection p*; e-geodesics stay inside E_F.]

Right-sided KL centroid = left-sided Bregman centroid.

Page 44: Pattern learning and recognition on statistical manifolds: An information-geometric review

A simple proof

$$\arg\min_\theta \mathrm{KL}(m(x) : p_\theta(x)):\quad
\mathrm{KL}(m : p_\theta) = E_m[\log m] - E_m[\log p_\theta]
= E_m[\log m] - E_m[k(x)] - E_m[\langle t(x), \theta\rangle - F(\theta)]$$

$$\equiv \arg\max_\theta E_m[\langle t(x), \theta\rangle] - F(\theta)
= \arg\max_\theta \langle E_m[t(x)], \theta\rangle - F(\theta)$$

$$E_m[t(x)] = \sum_t w_t E_{p_{\theta_t}}[t(x)] = \sum_t w_t \eta_t = \bar\eta$$

$$\nabla F(\theta) = \bar\eta \implies \theta = \nabla F^*(\bar\eta)$$
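For the Gaussian family this projection is plain moment matching: averaging the components’ expectation parameters η = (µ, −(Σ + µµᵀ)) and mapping back gives the single Gaussian minimizing KL(m : p_θ). A minimal sketch (illustrative only):

```python
import numpy as np

def kl_projection_gaussian(weights, mus, sigmas):
    """Project a Gaussian mixture onto one Gaussian by matching the
    expectation parameters eta = (mu, -(Sigma + mu mu^T))."""
    w = np.asarray(weights)[:, None]
    mus = np.asarray(mus)
    mu_star = (w * mus).sum(axis=0)                      # E_m[x]
    second_moment = sum(wi * (Si + np.outer(mi, mi))     # E_m[x x^T]
                        for wi, mi, Si in zip(weights, mus, sigmas))
    sigma_star = second_moment - np.outer(mu_star, mu_star)
    return mu_star, sigma_star

w = [0.5, 0.5]
mus = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]
sigmas = [np.eye(2), np.eye(2)]
print(kl_projection_gaussian(w, mus, sigmas))
# mean [0, 0], covariance [[5, 0], [0, 1]]
```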

Page 45: Pattern learning and recognition on statistical manifolds: An information-geometric review

Left-sided or right-sided Kullback-Leibler centroids?
Left/right Bregman centroids = right/left entropic centroids (KL of exponential families).

The left-sided and right-sided centroids have different (statistical) properties:

◮ Right-sided entropic centroid: zero-avoiding (covers the support of the pdfs).

◮ Left-sided entropic centroid: zero-forcing (captures the highest mode).

[Figure: for N_1 = N(−4, 4) and N_2 = N(5, 0.64), the left-sided KL centroid (zero-forcing), the right-sided KL centroid (zero-avoiding), and the symmetrized centroid.]

Page 46: Pattern learning and recognition on statistical manifolds: An information-geometric review

Hierarchical clustering of GMMs (Burbea-Rao)

Hierarchical clustering of GMMs w.r.t. the Bhattacharyya distance.
Simplify the number of components of an initial GMM.

[Figure: (a) source image, (b) k = 48 components, (c) k = 16 components.]

Page 47: Pattern learning and recognition on statistical manifolds: An information-geometric review

Two symmetrizations of Bregman divergences

◮ Jeffreys-Bregman divergences:

$$S_F(p; q) = \frac{B_F(p, q) + B_F(q, p)}{2} = \frac{1}{2}\langle p - q, \nabla F(p) - \nabla F(q)\rangle$$

◮ Jensen-Bregman divergences (diversity index):

$$J_F(p; q) = \frac{B_F\!\left(p, \frac{p+q}{2}\right) + B_F\!\left(q, \frac{p+q}{2}\right)}{2}
= \frac{F(p) + F(q)}{2} - F\!\left(\frac{p+q}{2}\right) = \mathrm{BR}_F(p, q)$$

Skew Jensen divergence [21, 29]:

$$J_F^{(\alpha)}(p; q) = \alpha F(p) + (1-\alpha)F(q) - F(\alpha p + (1-\alpha)q) = \mathrm{BR}_F^{(\alpha)}(p; q)$$

(Jeffreys and Jensen-Shannon symmetrizations of the Kullback-Leibler divergence.)

Page 48: Pattern learning and recognition on statistical manifolds: An information-geometric review

Burbea-Rao centroids (α-skewed Jensen centroids)

Minimum average divergence:

$$\mathrm{OPT}:\quad c = \arg\min_x \sum_{i=1}^n w_i J_F^{(\alpha)}(x, p_i) = \arg\min_x L(x)$$

Equivalent to minimizing

$$E(c) = \Big(\sum_{i=1}^n w_i \alpha\Big) F(c) - \sum_{i=1}^n w_i F(\alpha c + (1-\alpha)p_i)$$

E = F + G is the sum of a convex function F and a concave function G ⇒ Convex-ConCave Procedure (CCCP). Start from an arbitrary c_0 and iteratively update as

$$\nabla F(c_{t+1}) = -\nabla G(c_t)$$

⇒ guaranteed convergence to a local minimum.

Page 49: Pattern learning and recognition on statistical manifolds: An information-geometric review

ConCave Convex Procedure (CCCP)

$$\min_x E(x) = F(x) + G(x), \qquad \nabla F(c_{t+1}) = -\nabla G(c_t)$$

Page 50: Pattern learning and recognition on statistical manifolds: An information-geometric review

Iterative algorithm for Burbea-Rao centroids

Apply the CCCP scheme:

$$\nabla F(c_{t+1}) = \sum_{i=1}^n w_i \nabla F(\alpha c_t + (1-\alpha)p_i)$$

$$c_{t+1} = (\nabla F)^{-1}\!\left(\sum_{i=1}^n w_i \nabla F(\alpha c_t + (1-\alpha)p_i)\right)$$

This yields arbitrarily fine approximations of the (skew) Burbea-Rao centroids and barycenters.

Unique GLOBAL minimum when the divergence is separable [21].

Unique GLOBAL minimum for the matrix mean [23] for the logDet divergence.
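A short sketch of this fixed-point iteration for the symmetric case α = ½ with the separable generator F(x) = Σ x log x (illustrative only; the function names are assumptions):

```python
import numpy as np

def burbea_rao_centroid(points, weights, gradF, gradF_inv, alpha=0.5, iters=100):
    """CCCP fixed point: c_{t+1} = (gradF)^{-1}( sum_i w_i gradF(alpha*c_t + (1-alpha)*p_i) )."""
    c = np.average(points, axis=0, weights=weights)   # start from the arithmetic mean
    for _ in range(iters):
        g = sum(w * gradF(alpha * c + (1 - alpha) * p) for w, p in zip(weights, points))
        c = gradF_inv(g)
    return c

# F(x) = sum x log x (Shannon negentropy): gradF = 1 + log x, (gradF)^{-1} = exp(y - 1)
gradF = lambda x: 1.0 + np.log(x)
gradF_inv = lambda y: np.exp(y - 1.0)

hists = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
c = burbea_rao_centroid(hists, [0.5, 0.5], gradF, gradF_inv)
print(c)   # lies "between" the two histograms
```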

Page 51: Pattern learning and recognition on statistical manifolds: An information-geometric review

Information-geometric computing on statistical manifolds

◮ Cramer-Rao lower bound and information geometry [13] (overview article)

◮ Chernoff information on the statistical exponential family manifold [19]

◮ Hypothesis testing and Bregman Voronoi diagrams [18]

◮ Jeffreys centroid [20]

◮ Learning mixtures with k-MLE [17]

◮ Closed-form divergences for statistical mixtures [16]

Page 52: Pattern learning and recognition on statistical manifolds: An information-geometric review

Summary

Information-geometric pattern recognition:

◮ Statistical manifold (M, g): Rao’s distance and Fisher-Rao curved Riemannian geometry.

◮ Statistical manifold (M, g, ∇, ∇*): dually flat spaces, Bregman divergences, geodesics are straight lines in either the θ- or the η-parameter space. ±1-geometry, a special case of α-geometry.

◮ Clustering in dually flat spaces.

◮ Software libraries: jMEF [8] (Java), pyMEF [31] (Python).

◮ ... but also many other geometries to explore: Hilbertian, Finsler [3], Kähler, Wasserstein, contact, symplectic, etc. (it is easy to require non-Euclidean geometry, but then the space is wide open!)

Page 53: Pattern learning and recognition on statistical manifolds: An information-geometric review

Edited book (MIG) and proceedings (GSI)


Page 54: Pattern learning and recognition on statistical manifolds: An information-geometric review

Thank you.


Page 55: Pattern learning and recognition on statistical manifolds: An information-geometric review

Exponential families & statistical distances
Universal density estimators [2] generalizing Gaussians/histograms (a single EF density approximates any smooth density).
Explicit formulas for:

◮ Shannon entropy, cross-entropy, and Kullback-Leibler divergence [26];

◮ Rényi/Tsallis entropy and divergence [28];

◮ Sharma-Mittal entropy and divergence [30], a 2-parameter family extending the extensive Rényi entropy (for β → 1) and the non-extensive Tsallis entropy (for β → α):

$$H_{\alpha,\beta}(p) = \frac{1}{1-\beta}\left(\left(\int p(x)^\alpha\, \mathrm{d}x\right)^{\frac{1-\beta}{1-\alpha}} - 1\right),
\qquad \alpha > 0,\ \alpha \neq 1,\ \beta \neq 1;$$

◮ skew Jensen and Burbea-Rao divergences [21];

◮ Chernoff information and divergence [15];

◮ mixtures: total least squares, Jensen-Rényi, Cauchy-Schwarz divergences [16].

Page 56: Pattern learning and recognition on statistical manifolds: An information-geometric review

Statistical invariance: Markov kernel

Probability family: p(x; θ).

(X, σ) and (X′, σ′): two measurable spaces.
σ: a σ-algebra on X (non-empty, closed under complementation and countable unions).

Markov kernel = transition probability kernel K: X × σ′ → [0, 1]:

◮ ∀E′ ∈ σ′, K(·, E′) is a measurable map,

◮ ∀x ∈ X, K(x, ·) is a probability measure on (X′, σ′).

A probability measure p on (X, σ) induces a probability measure Kp, with

$$Kp(E') = \int_{\mathcal{X}} K(x, E')\, p(\mathrm{d}x), \qquad \forall E' \in \sigma'$$

Page 57: Pattern learning and recognition on statistical manifolds: An information-geometric review

Space of Bregman spheres and Bregman balls [5]

Dual Bregman balls (bounding Bregman spheres):

$$\mathrm{Ball}^r_F(c, r) = \{x \in \mathcal{X} \mid B_F(x : c) \leq r\}, \qquad
\mathrm{Ball}^l_F(c, r) = \{x \in \mathcal{X} \mid B_F(c : x) \leq r\}$$

Legendre duality:

$$\mathrm{Ball}^l_F(c, r) = (\nabla F)^{-1}\big(\mathrm{Ball}^r_{F^*}(\nabla F(c), r)\big)$$

Illustration for the Itakura-Saito divergence, F(x) = −log x.

Page 58: Pattern learning and recognition on statistical manifolds: An information-geometric review

Space of Bregman spheres: Lifting map [5]
F: x ↦ x̂ = (x, F(x)), a hypersurface in R^{d+1}.
H_p: tangent hyperplane at p̂, z = H_p(x) = ⟨x − p, ∇F(p)⟩ + F(p).

◮ A Bregman sphere σ lifts to σ̂ with supporting hyperplane H_σ: z = ⟨x − c, ∇F(c)⟩ + F(c) + r (parallel to H_c and shifted vertically by r); σ̂ = F̂ ∩ H_σ.

◮ The intersection of any hyperplane H with F̂ projects onto X as a Bregman sphere:

$$H: z = \langle x, a\rangle + b \ \to\ \sigma: \mathrm{Ball}_F\big(c = (\nabla F)^{-1}(a),\ r = \langle a, c\rangle - F(c) + b\big)$$

Page 59: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bibliographic references I

Marcel R. Ackermann and Johannes Blomer.

Bregman clustering for separable instances.

In Scandinavian Workshop on Algorithm Theory (SWAT), pages 212–223, 2010.

Yasemin Altun, Alexander J. Smola, and Thomas Hofmann.

Exponential families for conditional random fields.

In Uncertainty in Artificial Intelligence (UAI), pages 2–9, 2004.

Marc Arnaudon and Frank Nielsen.

Medians and means in Finsler geometry.

LMS Journal of Computation and Mathematics, 15, 2012.

Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh.

Clustering with Bregman divergences.

Journal of Machine Learning Research, 6:1705–1749, 2005.

Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock.

Bregman Voronoi diagrams.

Discrete & Computational Geometry, 44(2):281–307, 2010.

Nikolai Nikolaevich Chentsov.

Statistical Decision Rules and Optimal Inferences.

Translations of Mathematical Monographs, volume 53, 1982.

Published in Russian in 1972.


Page 60: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bibliographic references II

Jose Manuel Corcuera and Federica Giummole.

A characterization of monotone and regular divergences.

Annals of the Institute of Statistical Mathematics, 50(3):433–450, 1998.

Vincent Garcia and Frank Nielsen.

Simplification and hierarchical representations of mixtures of exponential families.

Signal Processing (Elsevier), 90(12):3197–3212, 2010.

Vincent Garcia, Frank Nielsen, and Richard Nock.

Levels of details for Gaussian mixture models.

In Asian Conference on Computer Vision (ACCV), volume 2, pages 514–525, 2009.

Francois Labelle and Jonathan Richard Shewchuk.

Anisotropic Voronoi diagrams and guaranteed-quality anisotropic mesh generation.

In Proceedings of the nineteenth annual symposium on Computational geometry, SCG ’03, pages 191–200,New York, NY, USA, 2003. ACM.

Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.

Shape retrieval using hierarchical total Bregman soft clustering.

Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.

Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.

Total Bregman divergence and its applications to shape retrieval.

In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3463–3468, 2010.


Page 61: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bibliographic references III

Frank Nielsen.

Cramer-Rao lower bound and information geometry.

In Rajendra Bhatia, C. S. Rajan, and Ajit Iqbal Singh, editors, Connected at Infinity II: A selection ofmathematics by Indians. Hindustan Book Agency (Texts and Readings in Mathematics, TRIM).

arxiv 1301.3578.

Frank Nielsen.

Visual Computing: Geometry, Graphics, and Vision.

Charles River Media / Thomson Delmar Learning, 2005.

Frank Nielsen.

Chernoff information of exponential families.

arXiv, abs/1102.2684, 2011.

Frank Nielsen.

Closed-form information-theoretic divergences for statistical mixtures.

In International Conference on Pattern Recognition (ICPR), pages 1723–1726, 2012.

Frank Nielsen.

k-MLE: A fast algorithm for learning statistical mixture models.

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2012.

preliminary, technical report on arXiv.


Page 62: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bibliographic references IV

Frank Nielsen.

Hypothesis testing, information divergence and computational geometry.

In F. Nielsen and F. Barbaresco, editors, Geometric Science of Information (GSI), volume 8085 of LNCS,pages 241–248. Springer, 2013.

Frank Nielsen.

An information-geometric characterization of Chernoff information.

IEEE Signal Processing Letters, 20(3):269–272, 2013.

Frank Nielsen.

Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximationfor frequency histograms.

Signal Processing Letters, IEEE, PP(99):1–1, 2013.

Frank Nielsen and Sylvain Boltz.

The Burbea-Rao and Bhattacharyya centroids.

IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.

Frank Nielsen and Vincent Garcia.

Statistical exponential families: A digest with flash cards, 2009.

arXiv.org:0911.4863.


Page 63: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bibliographic references V

Frank Nielsen, Meizhu Liu, Xiaojing Ye, and Baba C. Vemuri.

Jensen divergence based SPD matrix means and applications.

In International Conference on Pattern Recognition (ICPR), 2012.

Frank Nielsen and Richard Nock.

The dual Voronoi diagrams with respect to representational Bregman divergences.

In International Symposium on Voronoi Diagrams (ISVD), pages 71–78, 2009.

Frank Nielsen and Richard Nock.

Sided and symmetrized Bregman centroids.

IEEE Transactions on Information Theory, 55(6):2048–2059, June 2009.

Frank Nielsen and Richard Nock.

Entropies and cross-entropies of exponential families.

In International Conference on Image Processing (ICIP), pages 3621–3624, 2010.

Frank Nielsen and Richard Nock.

Hyperbolic Voronoi diagrams made easy.

In International Conference on Computational Science and its Applications (ICCSA), volume 1, pages74–80, Los Alamitos, CA, USA, march 2010. IEEE Computer Society.

Frank Nielsen and Richard Nock.

On Renyi and Tsallis entropies and divergences for exponential families.

arXiv, abs/1105.3259, 2011.


Page 64: Pattern learning and recognition on statistical manifolds: An information-geometric review

Bibliographic references VI

Frank Nielsen and Richard Nock.

Skew Jensen-Bregman Voronoi diagrams.

Transactions on Computational Science, 14:102–128, 2011.

Frank Nielsen and Richard Nock.

A closed-form expression for the Sharma-Mittal entropy of exponential families.

Journal of Physics A: Mathematical and Theoretical, 45(3), 2012.

Olivier Schwander and Frank Nielsen.

PyMEF - A framework for exponential families in Python.

In IEEE/SP Workshop on Statistical Signal Processing (SSP), 2011.

Olivier Schwander and Frank Nielsen.

Learning mixtures by simplifying kernel density estimators.

In Frank Nielsen and Rajendra Bhatia, editors, Matrix Information Geometry, pages 403–426, 2012.

Jose Seabra, Francesco Ciompi, Oriol Pujol, Josepa Mauri, Petia Radeva, and Joao Sanchez.

Rayleigh mixture model for plaque characterization in intravascular ultrasound.

IEEE Transaction on Biomedical Engineering, 58(5):1314–1324, 2011.

Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen.

Total Bregman divergence and its applications to DTI analysis.

IEEE Transactions on Medical Imaging, 2011.
